AWS EC2 Auto Scaling – Weird Customer Scenario

Scenario:
A few months ago, a customer asked me to review their EC2 Auto Scaling group configuration and make sure that when they shut down an instance behind the Elastic Load Balancer (ELB), Auto Scaling would neither terminate that instance nor launch a replacement.

I reviewed the Auto Scaling group configuration and found that “Suspended Processes” was set to “Replace Unhealthy”, so Auto Scaling would not replace any unhealthy EC2 instances in the group. This met their expectation, BUT it defeats the whole purpose of Auto Scaling and has a DOWNSIDE. The customer was still comfortable with it.
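For reference, the same setting can be applied and checked from the AWS CLI. This is only a minimal sketch: the group name my-asg is a placeholder I made up, and the function names are illustrative.

```shell
# Hypothetical group name (an assumption); replace with your own.
ASG_NAME="${ASG_NAME:-my-asg}"

# Suspend only the ReplaceUnhealthy process, so unhealthy instances
# are no longer terminated and replaced by the group.
suspend_replace_unhealthy() {
  aws autoscaling suspend-processes \
    --auto-scaling-group-name "$ASG_NAME" \
    --scaling-processes ReplaceUnhealthy
}

# Print the names of the processes currently suspended on the group.
show_suspended_processes() {
  aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$ASG_NAME" \
    --query 'AutoScalingGroups[0].SuspendedProcesses[].ProcessName' \
    --output text
}
```

Running suspend_replace_unhealthy followed by show_suspended_processes should list ReplaceUnhealthy if the suspension took effect.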

So what is the downside of this approach? See my explanation and lab configuration steps below:

Lab Configuration Steps:
1. I created an Auto Scaling group in my lab with “Replace Unhealthy” listed under “Suspended Processes”.

2. Two new instances were added to my Auto Scaling group (ASG), and both were healthy.

3. I then stopped one of the instances in the Auto Scaling group (instance ID ending in XX9fb).

4. The instance was reported as “Unhealthy” in the Auto Scaling group (ASG) console.

5. I waited almost 15 minutes. No instance termination or launch appeared in the ASG activity history, and the unhealthy instance remained in the group. In a normal configuration, Auto Scaling would terminate this “Unhealthy” instance, but it did not because “Suspended Processes” is set to “Replace Unhealthy” in my Auto Scaling group configuration. This is also what the customer expected.

6. I started the instance again (instance ID ending in XX9fb) and verified that it stayed in the “Unhealthy” state even after all health checks were passing. This is expected behavior in this scenario. Below I explain why the instance stays “Unhealthy” indefinitely even after you bring it back online.
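Steps 3 to 6 can be reproduced roughly from the AWS CLI as well. The sketch below is not the exact procedure from my lab: the group name and instance ID are placeholders, and the function names are made up for illustration.

```shell
# Hypothetical placeholders (assumptions); replace with your own values.
ASG_NAME="${ASG_NAME:-my-asg}"
INSTANCE_ID="${INSTANCE_ID:-i-xxxxxxxxxxxx}"

# Step 3: stop one instance that belongs to the group.
stop_instance() {
  aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
}

# Steps 4 and 6: print the health status Auto Scaling records for the instance.
asg_health_status() {
  aws autoscaling describe-auto-scaling-instances \
    --instance-ids "$INSTANCE_ID" \
    --query 'AutoScalingInstances[0].HealthStatus' \
    --output text
}

# Step 5: list recent scaling activities; no terminate or launch entry
# is expected while ReplaceUnhealthy is suspended.
recent_activities() {
  aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name "$ASG_NAME" \
    --query 'Activities[:5].[StartTime,StatusCode,Description]' \
    --output text
}

# Step 6: start the instance again.
start_instance() {
  aws ec2 start-instances --instance-ids "$INSTANCE_ID"
}
```

Calling asg_health_status after stop_instance, and again a few minutes after start_instance, is a quick way to confirm that the recorded status does not change back on its own.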

The downside of suspending “Replace Unhealthy” is that after you start the instance again, the Auto Scaling group never reports it as “Healthy” on its own. While the ReplaceUnhealthy process is suspended, an instance marked Unhealthy stays in that state indefinitely, until you manually set it back to “Healthy” with the AWS CLI command below.

aws autoscaling set-instance-health --instance-id i-xxxxxxxxxxxx --health-status Healthy

The reason is that an Auto Scaling group never automatically moves an instance from Unhealthy back to Healthy. In a normal situation (i.e. with all Auto Scaling processes enabled), as soon as an instance is marked Unhealthy, the group starts the replacement process: within a few moments it terminates the unhealthy instance and launches a new one from the associated launch configuration, so there is no need to check the health of the unhealthy instance any further.
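If you later decide to turn replacement back on, the order of operations matters. The sketch below first resets every instance the group still records as Unhealthy and only then resumes ReplaceUnhealthy. The group name is a placeholder and the helper names are hypothetical; the status comparison is done case-insensitively in awk as a precaution.

```shell
# Hypothetical group name (an assumption); replace with your own.
ASG_NAME="${ASG_NAME:-my-asg}"

# Print the IDs of instances the group currently records as Unhealthy.
list_unhealthy_instances() {
  aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$ASG_NAME" \
    --query 'AutoScalingGroups[0].Instances[].[InstanceId,HealthStatus]' \
    --output text |
    awk 'tolower($2) == "unhealthy" { print $1 }'
}

# Manually mark each of those instances Healthy again.
reset_unhealthy_instances() {
  for id in $(list_unhealthy_instances); do
    aws autoscaling set-instance-health \
      --instance-id "$id" \
      --health-status Healthy
  done
}

# Re-enable automatic replacement of unhealthy instances.
resume_replace_unhealthy() {
  aws autoscaling resume-processes \
    --auto-scaling-group-name "$ASG_NAME" \
    --scaling-processes ReplaceUnhealthy
}
```

Run reset_unhealthy_instances before resume_replace_unhealthy. As I understand the suspend and resume documentation, any instance still marked Unhealthy when the process is resumed will be replaced.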
 
Reference documentation:
What Is Amazon EC2 Auto Scaling?
https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html
Suspending and Resuming Scaling Processes
https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-suspend-resume-processes.html

I hope this helps. Please let me know if you have any feedback.