CrowdStrike BSOD Fix with AWS SSM Automation At Scale
During yesterday’s events, I found several references for fixing EC2 Instances by attaching volumes to helper Instances and deleting the CrowdStrike file that was causing the issue. Unfortunately, this doesn’t scale well: you’re limited to processing one broken Instance at a time per helper Instance you have running, and you have to keep track of which resources are available.
I made some changes to what I found, and now have an SSM Automation Doc that spins up a helper EC2 Instance per execution, allowing the process to scale much more quickly.
Process
The SSM Automation Doc does the following:
- Gathers information from the broken Instance including AZ and Subnet
  - The Helper Instance has to be in the same AZ as the EBS Volume you’re trying to fix
- Spins up a new Windows Server EC2 Instance (defaults to 2022 base currently available, AMI may need to be changed as time progresses)
- Shuts down the broken Instance, if not already shut down
- Removes the Volume from the broken Instance and attaches it to the helper Instance
- Sleeps for a bit to ensure that the Helper Instance is up and talking to SSM
  - If you’re running into timeouts on the RunCommand, bump this up
- Deletes the CrowdStrike files
- Removes Volume from Helper and re-attaches to source EC2 Instance
- Starts the source EC2 Instance
- Terminates the helper Instance
Requirements
- A Security Group with outbound rules to 0.0.0.0/0 for
  - TCP/443 (HTTPS)
  - UDP/53 and TCP/53 (DNS)
- An IAM Role with an Instance Profile to allow the helper Instance to reach SSM
- The Instance ID
- The Volume ID of the root volume
  - This can probably be retrieved from the Instance lookup, but initial attempts were error-prone
  - Added benefit of this serving as a reference for which Volume ID belongs with which Instance in case you mess something up
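The outbound rules listed above can be written down as data. This is a minimal sketch, assuming the `IpPermissions` shape accepted by the EC2 `AuthorizeSecurityGroupEgress` API; the `helperEgressRules` name and the rule descriptions are hypothetical, and nothing here calls AWS.

```typescript
// Shape of one rule, mirroring the IpPermission structure used by the EC2 API.
interface EgressRule {
  IpProtocol: "tcp" | "udp";
  FromPort: number;
  ToPort: number;
  IpRanges: { CidrIp: string; Description?: string }[];
}

// Hypothetical helper: builds the three outbound rules from the Requirements list.
function helperEgressRules(cidr = "0.0.0.0/0"): EgressRule[] {
  const rule = (IpProtocol: "tcp" | "udp", port: number, Description: string): EgressRule => ({
    IpProtocol,
    FromPort: port,
    ToPort: port,
    IpRanges: [{ CidrIp: cidr, Description }],
  });
  return [
    rule("tcp", 443, "HTTPS"),
    rule("udp", 53, "DNS"),
    rule("tcp", 53, "DNS"),
  ];
}

console.log(JSON.stringify(helperEgressRules(), null, 2));
```

The resulting array could be passed as `IpPermissions` when creating the Security Group's egress rules, or simply used as a checklist when reviewing an existing group.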
How To Use
Create Automation Doc
Go to SSM -> Documents, select Create Document on the top right, and select Automation
Copy and paste the automation code below into the editor. Change any defaults you’d like for Roles/Security Groups
Execute Against Instance
Get the Instance ID and Root Volume ID from the Instance
Run the Automation Doc, providing both IDs as parameters
Wait for the automation doc to complete
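For starting an execution from code rather than the console, the request can be sketched as plain data. This assumes the input shape of the SSM `StartAutomationExecution` API, where each parameter value is a list of strings; `buildExecutionInput` is a hypothetical helper, and `CrowdStrikeFix` stands in for whatever you named the document.

```typescript
// Minimal subset of the StartAutomationExecution request used here.
interface AutomationInput {
  DocumentName: string;
  Parameters: Record<string, string[]>;
}

// Hypothetical helper: maps the two required IDs onto the document's parameters.
function buildExecutionInput(documentName: string, instanceId: string, volumeId: string): AutomationInput {
  return {
    DocumentName: documentName,
    Parameters: {
      InstanceId: [instanceId], // every automation parameter is passed as a string list
      VolumeId: [volumeId],
    },
  };
}

const executionInput = buildExecutionInput("CrowdStrikeFix", "i-0b56bf9865e747d30", "vol-021a9e2a04b9d2888");
console.log(JSON.stringify(executionInput));
```

An object like this could be handed to `StartAutomationExecutionCommand` from `@aws-sdk/client-ssm`; the sketch stops short of sending it so that it runs without credentials.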
IMPORTANT
The Automation will fail on RunCommand if the helper Instance is unable to communicate with SSM. This requires a valid IAM Role and network access to SSM.
The RunCommand step will also fail if the helper EC2 Instance has not finished starting up and connecting to SSM. If this happens, increase the duration of the Sleep stage to 2-3 minutes with PT2M or PT3M, respectively. Note that using smaller Instance Types as helper Instances (e.g. t2.micro) is not recommended, as they take longer to get going.
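To raise the sleep for a single run without editing the document defaults, the `HelperSleep` parameter can be passed alongside the two IDs. This is a small sketch that follows the same `aws ssm start-automation-execution` form used elsewhere in this post; `sleepDuration` and `buildCommand` are hypothetical helpers.

```typescript
// Converts whole minutes into an ISO 8601 duration string, e.g. 2 -> "PT2M".
function sleepDuration(minutes: number): string {
  if (!Number.isInteger(minutes) || minutes < 1) {
    throw new Error(`minutes must be a positive integer, got ${minutes}`);
  }
  return `PT${minutes}M`;
}

// Builds the CLI command, adding HelperSleep only when an override is requested.
function buildCommand(docName: string, instanceId: string, volumeId: string, region: string, sleepMinutes?: number): string {
  const params = [`InstanceId=${instanceId}`, `VolumeId=${volumeId}`];
  if (sleepMinutes !== undefined) {
    params.push(`HelperSleep=${sleepDuration(sleepMinutes)}`);
  }
  return `aws ssm start-automation-execution --document-name "${docName}" --parameters "${params.join(",")}" --region ${region}`;
}

console.log(buildCommand("CrowdStrikeFix", "i-0b56bf9865e747d30", "vol-021a9e2a04b9d2888", "us-east-1", 3));
```

Leaving `sleepMinutes` out produces the plain two-parameter command, so the document's own default (PT1M) applies.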
If the execution fails, you must reattach the original Volume to the original EC2 Instance if you want to execute the Automation Doc against that Instance again.
If the execution fails, terminate the helper Instance, or it will stick around. They are all conveniently named HelperInstance and can be easily identified.
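Cleanup after a failed execution can be scripted in the same generate-commands style. This is a hedged sketch that only builds AWS CLI strings and does not run them; `cleanupCommands` is a hypothetical helper, and the `/dev/sda1` device name and `HelperInstance` tag are taken from the Automation Doc in this post.

```typescript
// Hypothetical helper: returns the CLI commands to restore one Instance after a failed run.
function cleanupCommands(instanceId: string, volumeId: string, region: string): string[] {
  return [
    // Release the volume from whichever Instance currently holds it (the helper, after a mid-run failure).
    `aws ec2 detach-volume --volume-id ${volumeId} --region ${region}`,
    // Block until the volume is free to attach again.
    `aws ec2 wait volume-available --volume-ids ${volumeId} --region ${region}`,
    // Put the root volume back on the original Instance.
    `aws ec2 attach-volume --volume-id ${volumeId} --instance-id ${instanceId} --device /dev/sda1 --region ${region}`,
    // List leftover helpers by Name tag so they can be reviewed and terminated.
    `aws ec2 describe-instances --filters "Name=tag:Name,Values=HelperInstance" "Name=instance-state-name,Values=running" --query "Reservations[].Instances[].InstanceId" --output text --region ${region}`,
  ];
}

console.log(cleanupCommands("i-0b56bf9865e747d30", "vol-021a9e2a04b9d2888", "us-east-1").join("\n"));
```

If the volume is already detached, skip the first two commands. The last command lists every running helper, including ones that belong to executions still in progress, so check before terminating anything.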
Automation Doc
description: Automation to fix CrowdStrike BSOD with Ephemeral Helper Instances
schemaVersion: '0.3'
parameters:
  InstanceId:
    type: String
    description: The ID of the EC2 Instance to fix
  VolumeId:
    type: String
    description: The ID of the EBS volume to detach and reattach
  HelperInstanceType:
    type: String
    default: c7i.large
    description: The type of the helper EC2 instance
  HelperInstanceAMI:
    type: String
    default: ami-00d990e7e5ece7974
    description: The AMI ID of the helper EC2 Instance, defaults to Server 2022 base
  SSMRole:
    type: String
    default: AmazonSSMRoleForInstancesQuickSetup
    description: The IAM role to use for the SSM document
  # Recommend setting a default SG here
  SecurityGroupId:
    type: String
    # default: sg-123456789132
    description: The ID of the security group to associate with the helper instance
  HelperSleep:
    type: String
    default: PT1M
    description: How long to sleep to wait for the helper Instance to come up, default 1 minute
mainSteps:
  - name: GetInstanceInfo
    action: aws:executeAwsApi
    nextStep: LaunchHelperInstance
    isEnd: false
    inputs:
      Service: ec2
      Api: DescribeInstances
      InstanceIds:
        - '{{ InstanceId }}'
    outputs:
      - Name: AvailabilityZone
        Selector: $.Reservations[0].Instances[0].Placement.AvailabilityZone
        Type: String
      - Name: SubnetId
        Selector: $.Reservations[0].Instances[0].SubnetId
        Type: String
  - name: LaunchHelperInstance
    action: aws:executeAwsApi
    nextStep: WaitForHelperInstanceRunning
    isEnd: false
    inputs:
      Service: ec2
      Api: RunInstances
      ImageId: '{{ HelperInstanceAMI }}'
      InstanceType: '{{ HelperInstanceType }}'
      MinCount: 1
      MaxCount: 1
      IamInstanceProfile:
        Name: '{{ SSMRole }}'
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            VolumeType: gp3
            Encrypted: true
      MetadataOptions:
        HttpTokens: required
      Placement:
        AvailabilityZone: '{{ GetInstanceInfo.AvailabilityZone }}'
      SubnetId: '{{ GetInstanceInfo.SubnetId }}'
      SecurityGroupIds:
        - '{{ SecurityGroupId }}'
      TagSpecifications:
        - ResourceType: instance
          Tags:
            - Key: Name
              Value: HelperInstance
    outputs:
      - Name: InstanceId
        Selector: $.Instances[0].InstanceId
        Type: String
  - name: WaitForHelperInstanceRunning
    action: aws:waitForAwsResourceProperty
    nextStep: StopInstance
    isEnd: false
    inputs:
      Service: ec2
      Api: DescribeInstances
      InstanceIds:
        - '{{ LaunchHelperInstance.InstanceId }}'
      PropertySelector: $.Reservations[0].Instances[0].State.Name
      DesiredValues:
        - running
  - name: StopInstance
    action: aws:changeInstanceState
    nextStep: WaitForInstanceStopped
    isEnd: false
    inputs:
      InstanceIds:
        - '{{ InstanceId }}'
      DesiredState: stopped
  - name: WaitForInstanceStopped
    action: aws:waitForAwsResourceProperty
    nextStep: DetachVolume
    isEnd: false
    inputs:
      Service: ec2
      Api: DescribeInstances
      InstanceIds:
        - '{{ InstanceId }}'
      PropertySelector: $.Reservations[0].Instances[0].State.Name
      DesiredValues:
        - stopped
  - name: DetachVolume
    action: aws:executeAwsApi
    nextStep: WaitForVolumeAvailable
    isEnd: false
    inputs:
      Service: ec2
      Api: DetachVolume
      VolumeId: '{{ VolumeId }}'
  - name: WaitForVolumeAvailable
    action: aws:waitForAwsResourceProperty
    nextStep: AttachVolumeToHelper
    isEnd: false
    inputs:
      Service: ec2
      Api: DescribeVolumes
      VolumeIds:
        - '{{ VolumeId }}'
      PropertySelector: $.Volumes[0].State
      DesiredValues:
        - available
  - name: AttachVolumeToHelper
    action: aws:executeAwsApi
    nextStep: WaitForVolumeAttached
    isEnd: false
    inputs:
      Service: ec2
      Api: AttachVolume
      VolumeId: '{{ VolumeId }}'
      InstanceId: '{{ LaunchHelperInstance.InstanceId }}'
      Device: /dev/sdf
  - name: WaitForVolumeAttached
    action: aws:waitForAwsResourceProperty
    nextStep: Sleep
    isEnd: false
    inputs:
      Service: ec2
      Api: DescribeVolumes
      VolumeIds:
        - '{{ VolumeId }}'
      PropertySelector: $.Volumes[0].Attachments[0].State
      DesiredValues:
        - attached
  - name: Sleep
    action: aws:sleep
    nextStep: DeleteCrowdStrikeDriver
    isEnd: false
    inputs:
      Duration: '{{HelperSleep}}'
  - name: DeleteCrowdStrikeDriver
    action: aws:runCommand
    nextStep: DetachVolumeFromHelper
    isEnd: false
    inputs:
      DocumentName: AWS-RunPowerShellScript
      InstanceIds:
        - '{{ LaunchHelperInstance.InstanceId }}'
      Parameters:
        commands:
          - |
            Remove-Item -Path "d:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys" -Force
            Get-ChildItem
  - name: DetachVolumeFromHelper
    action: aws:executeAwsApi
    nextStep: WaitForVolumeAvailableAgain
    isEnd: false
    inputs:
      Service: ec2
      Api: DetachVolume
      VolumeId: '{{ VolumeId }}'
  - name: WaitForVolumeAvailableAgain
    action: aws:waitForAwsResourceProperty
    nextStep: AttachVolumeToOriginal
    isEnd: false
    inputs:
      Service: ec2
      Api: DescribeVolumes
      VolumeIds:
        - '{{ VolumeId }}'
      PropertySelector: $.Volumes[0].State
      DesiredValues:
        - available
  - name: AttachVolumeToOriginal
    action: aws:executeAwsApi
    nextStep: StartInstance
    isEnd: false
    inputs:
      Service: ec2
      Api: AttachVolume
      VolumeId: '{{ VolumeId }}'
      InstanceId: '{{ InstanceId }}'
      Device: /dev/sda1
  - name: StartInstance
    action: aws:changeInstanceState
    nextStep: WaitForInstanceRunning
    isEnd: false
    inputs:
      InstanceIds:
        - '{{ InstanceId }}'
      DesiredState: running
  - name: WaitForInstanceRunning
    action: aws:waitForAwsResourceProperty
    nextStep: TerminateHelperInstance
    isEnd: false
    inputs:
      Service: ec2
      Api: DescribeInstances
      InstanceIds:
        - '{{ InstanceId }}'
      PropertySelector: $.Reservations[0].Instances[0].State.Name
      DesiredValues:
        - running
  - name: TerminateHelperInstance
    action: aws:changeInstanceState
    isEnd: true
    inputs:
      InstanceIds:
        - '{{ LaunchHelperInstance.InstanceId }}'
      DesiredState: terminated
Generating Commands
WARNING - Only 100 concurrent automation executions are allowed by default in an AWS Account. If you have more Instances than that, you need to batch them and wait for each batch to complete before starting the next.
Instead of manually looking up Instance IDs and Volume IDs, I recommend generating the commands to run them. There are definitely more robust ways to script and automate all this, but time is short, and this is simple.
Here’s a basic TypeScript script that takes a list of Instance IDs you provide (at the top) and generates SSM Commands for all of them. Bash, Python, or anything else preferred can be used instead. The Instance IDs can be retrieved from reports in AWS or other systems.
import { EC2Client, DescribeInstancesCommand } from "@aws-sdk/client-ec2"
import { GetCallerIdentityCommand, STSClient } from '@aws-sdk/client-sts'
import * as fs from 'fs'

const REGION = 'us-east-1'
// Change to the document name in AWS
const CROWDSTRIKE_AUTO_DOC_NAME = 'CrowdStrikeFix'
// Instance IDs to generate commands for
const instanceIds = [
  'i-0b56bf9865e747d30'
]

// Setup AWS Clients
const ec2Client = new EC2Client( {
  region: REGION,
} )
const stsClient = new STSClient( {
  region: REGION,
} )

// Get a list of all AWS Instances
async function listInstances() {
  const instances = await ec2Client.send( new DescribeInstancesCommand( {} ) )
  return instances.Reservations!
}

async function main() {
  const instances = await listInstances()
  const commands: string[] = []
  for ( let reservation of instances ) {
    for ( let instance of reservation.Instances! ) {
      if ( instanceIds.includes( instance.InstanceId! ) ) {
        // The root volume is the block device mapped to the Instance's root device name
        const rootVolume = instance.BlockDeviceMappings!.find( mapping => mapping.DeviceName === instance.RootDeviceName )
        if ( rootVolume ) {
          console.log(`Generated Command for ${instance.InstanceId!} - ${rootVolume.Ebs!.VolumeId!}`)
          const command = `aws ssm start-automation-execution --document-name "${CROWDSTRIKE_AUTO_DOC_NAME}" --parameters "InstanceId=${instance.InstanceId!},VolumeId=${rootVolume.Ebs!.VolumeId!}" --region ${REGION}`
          commands.push( command )
        }
      }
    }
  }
  // Account ID used for file name
  const accountId = ( await stsClient.send( new GetCallerIdentityCommand( {} ) ) ).Account!
  const filePath = `${accountId}_commands.sh`
  fs.writeFileSync( filePath, commands.join( '\n' ) )
  console.log( `AWS CLI commands written to ${filePath}` )
}

main().then( () => {
  console.log( 'done' )
} )
Output:
A file called <accountId>_commands.sh will be created in the current directory, with contents like:
aws ssm start-automation-execution --document-name "CrowdStrikeFix" --parameters "InstanceId=i-0b56bf9865e747d30,VolumeId=vol-021a9e2a04b9d2888" --region us-east-1
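Given the default limit of 100 concurrent executions noted in the warning above, the generated commands can be split into batches before they are run. This is a minimal sketch, assuming the commands are already in memory as strings; `chunk` is a hypothetical helper and the placeholder commands exist only for the demonstration.

```typescript
// Splits a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error("size must be a positive integer");
  }
  const result: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    result.push(items.slice(i, i + size));
  }
  return result;
}

// Example: 250 placeholder commands split into batches of at most 100.
const demoCommands = Array.from({ length: 250 }, (_, i) => `command-${i}`);
const demoBatches = chunk(demoCommands, 100);
console.log(demoBatches.map(b => b.length)); // [ 100, 100, 50 ]
```

Each batch could then be written to its own file, with the next file run only after the previous batch's executions have completed.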