Azure Virtual Machine Scale Sets Autoscaling with Terraform

Pre-requisite: Learn VMSS Autoscaling Concept using Azure Portal

Create VMSS
Create Autoscaling Default Profile
Percentage CPU Rule
Available Memory Bytes Rule
LB SYN Count Rule
Create Autoscaling Recurrence Profile - Weekdays
Create Autoscaling Recurrence Profile - Weekends
Create Autoscaling Fixed Profile

Resource: azurerm_monitor_autoscale_setting
- Notification Block
- Profile Block-1: Default Profile
  1. Capacity Block
  2. Percentage CPU Metric Rules
    1. Scale-Up Rule: Increase VMs by 1 when CPU usage is greater than 75%
    2. Scale-In Rule: Decrease VMs by 1when CPU usage is lower than 25%
  3. Available Memory Bytes Metric Rules
    1. Scale-Up Rule: Increase VMs by 1 when Available Memory Bytes is less than 1GB in bytes
    2. Scale-In Rule: Decrease VMs by 1 when Available Memory Bytes is greater than 2GB in bytes
  4. LB SYN Count Metric Rules (JUST FOR firing Scale-Up and Scale-In Events for Testing and also knowing in addition to current VMSS Resource, we can also create Autoscaling rules for VMSS based on other Resource usage like Load Balancer)
    1. Scale-Up Rule: Increase VMs by 1 when LB SYN Count is greater than 10 Connections (Average)
    2. Scale-Up Rule: Decrease VMs by 1 when LB SYN Count is less than 10 Connections (Average)

Step-00: Introduction

VMSS Autoscaling
Default Profile
Recurrence Profile
Fixed Profile
Each Profile will have following Rules
Percentage CPU Increase and Decrease Rule
Available Memory Bytes Increase and Decrease Rule
LB SYN Count Increase and Decrease Rule

Update Files

c8-02-bastion-host-linuxvm.tf: Add Bastion Custom Data

New Files: Web Linux VMSS

c7-05-web-linux-vmss-autoscaling-default-profile.tf
c7-06-web-linux-vmss-autoscaling-default-and-recurrence-profiles.tf
c7-07-web-linux-vmss-autoscaling-default-recurrence-fixed-profiles.tf

Step-01: c8-02-bastion-host-linuxvm.tf

Add Custom Data for Bastion Host which will install HTTPD related binaries.
This will install the Apache Bench tool for load testing.
This Apache Bench helps us to generate huge load on our Application to trigger Scale-Out and Scale-In events for Autoscaling

# Locals Block for custom data
locals {
bastion_host_custom_data = <<CUSTOM_DATA
#!/bin/sh
#sudo yum update -y
sudo yum install -y httpd
sudo systemctl enable httpd
sudo systemctl start httpd  
sudo systemctl stop firewalld
sudo systemctl disable firewalld
sudo yum install -y telnet
sudo chmod -R 777 /var/www/html 
sudo echo "Welcome to stacksimplify - Bastion Host - VM Hostname: $(hostname)" > /var/www/html/index.html
CUSTOM_DATA  
}


# Resource-1: Create Public IP Address
resource "azurerm_public_ip" "bastion_host_publicip" {
  name                = "${local.resource_name_prefix}-bastion-host-publicip"
  resource_group_name = azurerm_resource_group.rg.name
  location            = azurerm_resource_group.rg.location
  allocation_method   = "Static"
  sku = "Standard"
}

# Resource-2: Create Network Interface
resource "azurerm_network_interface" "bastion_host_linuxvm_nic" {
  name                = "${local.resource_name_prefix}-bastion-host-linuxvm-nic"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name

  ip_configuration {
    name                          = "bastion-host-ip-1"
    subnet_id                     = azurerm_subnet.bastionsubnet.id
    private_ip_address_allocation = "Dynamic"
    public_ip_address_id = azurerm_public_ip.bastion_host_publicip.id 
  }
}

# Resource-3: Azure Linux Virtual Machine - Bastion Host
resource "azurerm_linux_virtual_machine" "bastion_host_linuxvm" {
  name = "${local.resource_name_prefix}-bastion-linuxvm"
  #computer_name = "bastionlinux-vm"  # Hostname of the VM (Optional)
  resource_group_name = azurerm_resource_group.rg.name
  location = azurerm_resource_group.rg.location
  size = "Standard_DS1_v2"
  admin_username = "azureuser"
  network_interface_ids = [ azurerm_network_interface.bastion_host_linuxvm_nic.id ]
  admin_ssh_key {
    username = "azureuser"
    public_key = file("${path.module}/ssh-keys/terraform-azure.pub")
  }
  os_disk {
    caching = "ReadWrite"
    storage_account_type = "Standard_LRS"
  }
  source_image_reference {
    publisher = "RedHat"
    offer = "RHEL"
    sku = "83-gen2"
    version = "latest"
  }
  custom_data = base64encode(local.bastion_host_custom_data)  
}

Step-02: c7-05-web-linux-vmss-autoscaling-default-profile.tf

Create Base Autoscaling Resource without any profiles

resource "azurerm_monitor_autoscale_setting" "web_vmss_autoscale" {
  name                = "${local.resource_name_prefix}-web-vmss-autoscale-profiles"
  resource_group_name = azurerm_resource_group.rg.name
  location            = azurerm_resource_group.rg.location
  target_resource_id  = azurerm_linux_virtual_machine_scale_set.web_vmss.id
  # Notification Block  
  notification {
      email {
        send_to_subscription_administrator    = true
        send_to_subscription_co_administrator = true
        custom_emails                         = ["myadminteam@ourorg.com"]
      }
    }    
}

Step-03: Profile-1: Default Profile - Percentage CPU Metric

File: c7-05-web-linux-vmss-autoscaling-default-profile.tf

################################################################################
################################################################################
#######################  Profile-1: Default Profile  ###########################
################################################################################
################################################################################    
# Profile-1: Default Profile 
  profile {
    name = "default"
  # Capacity Block     
    capacity {
      default = 2
      minimum = 2
      maximum = 6
    }
###########  START: Percentage CPU Metric Rules  ###########    
  ## Scale-Up 
    rule {
      scale_action {
        direction = "Increase"
        type      = "ChangeCount"
        value     = 1
        cooldown  = "PT5M"
      }            
      metric_trigger {
        metric_name        = "Percentage CPU"
        metric_resource_id = azurerm_linux_virtual_machine_scale_set.web_vmss.id
        metric_namespace   = "microsoft.compute/virtualmachinescalesets"        
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT5M"
        time_aggregation   = "Average"
        operator           = "GreaterThan"
        threshold          = 75
      }
    }

  ## Scale-In 
    rule {
      scale_action {
        direction = "Decrease"
        type      = "ChangeCount"
        value     = 1
        cooldown  = "PT5M"
      }        
      metric_trigger {
        metric_name        = "Percentage CPU"
        metric_resource_id = azurerm_linux_virtual_machine_scale_set.web_vmss.id
        metric_namespace   = "microsoft.compute/virtualmachinescalesets"                
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT5M"
        time_aggregation   = "Average"
        operator           = "LessThan"
        threshold          = 25
      }
    }
###########  END: Percentage CPU Metric Rules   ###########

Step-04: Profile-1: Default Profile - Available Memory Bytes Metric

###########  START: Available Memory Bytes Metric Rules  ###########    
  ## Scale-Up 
    rule {
      scale_action {
        direction = "Increase"
        type      = "ChangeCount"
        value     = 1
        cooldown  = "PT5M"
      }            
      metric_trigger {
        metric_name        = "Available Memory Bytes"
        metric_resource_id = azurerm_linux_virtual_machine_scale_set.web_vmss.id
        metric_namespace   = "microsoft.compute/virtualmachinescalesets"        
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT5M"
        time_aggregation   = "Average"
        operator           = "LessThan"
        threshold          = 1073741824 # Increase 1 VM when Memory In Bytes is less than 1GB
      }
    }

  ## Scale-In 
    rule {
      scale_action {
        direction = "Decrease"
        type      = "ChangeCount"
        value     = 1
        cooldown  = "PT5M"
      }        
      metric_trigger {
        metric_name        = "Available Memory Bytes"
        metric_resource_id = azurerm_linux_virtual_machine_scale_set.web_vmss.id
        metric_namespace   = "microsoft.compute/virtualmachinescalesets"                
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT5M"
        time_aggregation   = "Average"
        operator           = "GreaterThan"
        threshold          = 2147483648 # Decrease 1 VM when Memory In Bytes is Greater than 2GB
      }
    }
###########  END: Available Memory Bytes Metric Rules  ###########

Step-05: Profile-1: Default Profile - LB SYN Count Metric

###########  START: LB SYN Count Metric Rules - Just to Test scale-in, scale-out  ###########    
  ## Scale-Up 
    rule {
      scale_action {
        direction = "Increase"
        type      = "ChangeCount"
        value     = 1
        cooldown  = "PT5M"
      }      
      metric_trigger {
        metric_name        = "SYNCount"
        metric_resource_id = azurerm_lb.web_lb.id 
        metric_namespace   = "Microsoft.Network/loadBalancers"        
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT5M"
        time_aggregation   = "Average"
        operator           = "GreaterThan"
        threshold          = 10 # 10 requests to an LB
      }
    }
  ## Scale-In 
    rule {
      scale_action {
        direction = "Decrease"
        type      = "ChangeCount"
        value     = 1
        cooldown  = "PT5M"
      }      
      metric_trigger {
        metric_name        = "SYNCount"
        metric_resource_id = azurerm_lb.web_lb.id
        metric_namespace   = "Microsoft.Network/loadBalancers"                
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT5M"
        time_aggregation   = "Average"
        operator           = "LessThan"
        threshold          = 10
      }
    }
###########  END: LB SYN Count Metric Rules  ###########    
  } # End of Profile-1

} # End of Auto Scale Resource

Step-06: Execute Terraform Commands

# Terraform Initialize
terraform init

# Terraform Validate
terraform validate

# Terraform Plan
terraform plan

# Terraform Apply
terraform apply -auto-approve

Step-07: Verify Resources

# Other Resources (Untouched)
1. Resource Group 
2. VNETs and Subnets
3. Bastion Host Linux VM

# VMSS Resource
1. Verify the VM Instances in VMSS Resources
2. 2 VM Instances should be created as per Capacity Block from Profile-1: Default Profile
  # Capacity Block     
    capacity {
      default = 2
      minimum = 2
      maximum = 6
    }
3. Verify the Autoscaling Policy in Scaling Tab of VMSS Resource

Step-08: Test Scale-Out and Scale-In scenarios

# Connect to Bastion Host Linux VM
ssh -i ssh-keys/terraform-azure.pem azureuser@<Bastion-Host-LinuxVM-PublicIP>
sudo su - 

# Run the Load Test using Apache Bench
ab -k -t 1200 -n 9050000 -c 100 http://<Web-LB-Public-IP>/index.html
ab -k -t 1200 -n 9050000 -c 100 http://52.149.253.66/index.html

# Verify Scale-Out Event
1. Go to -> Virtual Machine Scale Sets -> hr-dev-web-vmss -> Settings -> Instances
2. Go to -> Virtual Machine Scale Sets -> hr-dev-web-vmss -> Settings -> Scaling -> Configure tab -> Open LB Connection Rule
3. Go to -> Virtual Machine Scale Sets -> hr-dev-web-vmss -> Settings -> Scaling -> Run History Tab
4. Scale-Out Observation: A new VM should be created in VM Instances Tab of VMSS 

# Wait for 10 to 15 Minutes
- Wait for 10 to 15 minutes for "Scale-In" Event to Trigger

# Verify Scale-In Event
1. Go to -> Virtual Machine Scale Sets -> hr-dev-web-vmss -> Settings -> Instances
2. Go to -> Virtual Machine Scale Sets -> hr-dev-web-vmss -> Settings -> Scaling -> Configure tab -> Open LB Connection Rule
3. Go to -> Virtual Machine Scale Sets -> hr-dev-web-vmss -> Settings -> Scaling -> Run History Tab
4. Scale-In Observation: 1 VM should be deleted in VM Instances Tab of VMSS and should come down to value present in capacity block (capacity.minimum = 2 VMs)

Step-09: c7-05-web-linux-vmss-autoscaling-default-profile.tf

Comment All code in c7-05.
In c7-06, we will add profile-2 and profile-3

Step-10: Autoscaling Profile-2: Recurrence Profiles: Weekday Profile

c7-06-web-linux-vmss-autoscaling-default-and-recurrence-profiles.tf

## Major Changes in this Block
# 1. Capacity Block Values Change - Week Days (Minimum = 4, default = 4, Maximum = 20)
# 2. Recurrence Block for Week Days
# Profile-2: Recurrence Profile - Week Days
  profile {
    name = "profile-2-weekdays"
  # Capacity Block     
    capacity {
      default = 4
      minimum = 4
      maximum = 20
    }
  # Recurrence Block for Week Days (5 days)
    recurrence {
      timezone = "India Standard Time"
      days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
      hours = [0]
      minutes = [0]      
    }  
##########  START: Percentage CPU Metric Rules  ###########    
  ## Scale-Out 
    rule {
      .....
      REST ALL SAME DEFAULT Profile
      .....
      .....
}

Step-11: Autoscaling Profile-3: Recurrence Profiles: Weekend Profile

## Major Changes in this Block
# 1. Capacity Block Values Change - Weekends (Minimum = 3, default = 3, Maximum = 20)
# 2. Recurrence Block for Weekends
# Profile-2: Recurrence Profile - Weekends
  profile {
    name = "profile-3-weekends"
  # Capacity Block     
    capacity {
      default = 3
      minimum = 3
      maximum = 6
    }
  # Recurrence Block for Weekends (2 days)
    recurrence {
      timezone = "India Standard Time"
      days = ["Saturday", "Sunday"]
      hours = [0]
      minutes = [0]      
    }    
###########  START: Percentage CPU Metric Rules  ###########    
  ## Scale-Out 
    rule {
      .....
      REST ALL SAME DEFAULT Profile
      .....
      .....
}

Step-12: Apply and Verify VMSS Resource - Autoscaling Profile-2 and 3

# Terraform Validate
terraform validate

# Terraform Plan
terraform plan

# Terraform Apply 
terraform apply -auto-approve

# Verify VMSS Resource
1. Verify the VM Instances in VMSS Resources
2. 3 or 4 VM Instances should be created as per Capacity Block from Profile-2 or 3 based on the day you are testing 
  # Capacity Block     
    capacity {
      default = 3 or 4 
      minimum = 3 or 4
      maximum = 6
    }
3. Verify the Autoscaling Policy in Scaling Tab of VMSS Resource 
4. Go to -> Virtual Machine Scale Sets -> hr-dev-web-vmss -> Settings -> Scaling -
  a. Profile-1: Default Profile
  b. Profile-2: Weekday Profile
  c. Profile-3: Weekend Profile

Step-13: Test Scale-Out and Scale-In scenarios

# Connect to Bastion Host Linux VM
ssh -i ssh-keys/terraform-azure.pem azureuser@<Bastion-Host-LinuxVM-PublicIP>
sudo su - 

# Run the Load Test using Apache Bench
ab -k -t 1200 -n 9050000 -c 100 http://<Web-LB-Public-IP>/index.html
ab -k -t 1200 -n 9050000 -c 100 http://52.149.253.66/index.html

# Verify Scale-Out Event
1. Go to -> Virtual Machine Scale Sets -> hr-dev-web-vmss -> Settings -> Instances
2. Go to -> Virtual Machine Scale Sets -> hr-dev-web-vmss -> Settings -> Scaling -> Configure tab -> Open LB Connection Rule
3. Go to -> Virtual Machine Scale Sets -> hr-dev-web-vmss -> Settings -> Scaling -> Run History Tab
4. Scale-Out Observation: A new VM should be created in VM Instances Tab of VMSS 

# Wait for 10 to 15 Minutes
- Wait for 10 to 15 minutes for "Scale-In" Event to Trigger

# Verify Scale-In Event
1. Go to -> Virtual Machine Scale Sets -> hr-dev-web-vmss -> Settings -> Instances
2. Go to -> Virtual Machine Scale Sets -> hr-dev-web-vmss -> Settings -> Scaling -> Configure tab -> Open LB Connection Rule
3. Go to -> Virtual Machine Scale Sets -> hr-dev-web-vmss -> Settings -> Scaling -> Run History Tab
4. Scale-In Observation: 1 VM should be deleted in VM Instances Tab of VMSS and should come down to value present in capacity block (capacity.minimum = 3 or 4 VMs)

Step-14: c7-06-web-linux-vmss-autoscaling-default-and-recurrence-profiles.tf

Comment All code in c7-06.
In c7-07 we will add profile-4 with fixed_date block as the day we are testing (Current Day)

Step-15: Autoscaling Profile-4: Fixed Date Profile

File: c7-07-web-linux-vmss-autoscaling-default-recurrence-fixed-profiles.tf

## Major Changes in this Block
# 1. Capacity Block Values Change  (Minimum = 5, default = 5, Maximum = 20)
# 2. Fixed  Block for a specific day
# Profile-4: Fixed Profile for a Specific Day
  profile {
    name = "profile-4-fixed-profile"
  # Capacity Block     
    capacity {
      default = 5
      minimum = 5
      maximum = 20
    }
  # Fixed Block for a specific day
    fixed_date {
      timezone = "India Standard Time"
      start    = "2090-08-15T00:00:00Z"  # CHANGE TO THE DATE YOU ARE TESTING
      end      = "2090-08-15T23:59:59Z"  # CHANGE TO THE DATE YOU ARE TESTING
    }  
###########  START: Percentage CPU Metric Rules  ###########    
  ## Scale-Out 
    rule {
      .....
      REST ALL SAME DEFAULT Profile
      .....
      .....
}

Step-16: Apply and Verify VMSS Resource - Autoscaling Profile-4

# Terraform Validate
terraform validate

# Terraform Plan
terraform plan

# Terraform Apply 
terraform apply -auto-approve

# Verify VMSS Resource
1. Verify the VM Instances in VMSS Resources
2. 5 VM Instances should be created as per Capacity Block from Profile-4  
  # Capacity Block     
    capacity {
      default = 5
      minimum = 5
      maximum = 20
    }
3. Verify the Autoscaling Policy in Scaling Tab of VMSS Resource 
4. Go to -> Virtual Machine Scale Sets -> hr-dev-web-vmss -> Settings -> Scaling -
  a. Profile-1: Default Profile
  b. Profile-2: Weekday Profile
  c. Profile-3: Weekend Profile
  d. Profile-4: Fixed Date Profile

Step-17: Test Scale-Out and Scale-In scenarios

# Connect to Bastion Host Linux VM
ssh -i ssh-keys/terraform-azure.pem azureuser@<Bastion-Host-LinuxVM-PublicIP>
sudo su - 

# Run the Load Test using Apache Bench
ab -k -t 1200 -n 9050000 -c 100 http://<Web-LB-Public-IP>/index.html
ab -k -t 1200 -n 9050000 -c 100 http://52.149.253.66/index.html

# Verify Scale-Out Event
1. Go to -> Virtual Machine Scale Sets -> hr-dev-web-vmss -> Settings -> Instances
2. Go to -> Virtual Machine Scale Sets -> hr-dev-web-vmss -> Settings -> Scaling -> Configure tab -> Open LB Connection Rule
3. Go to -> Virtual Machine Scale Sets -> hr-dev-web-vmss -> Settings -> Scaling -> Run History Tab
4. Scale-Out Observation: A new VM should be created in VM Instances Tab of VMSS 

# Wait for 10 to 15 Minutes
- Wait for 10 to 15 minutes for "Scale-In" Event to Trigger

# Verify Scale-In Event
1. Go to -> Virtual Machine Scale Sets -> hr-dev-web-vmss -> Settings -> Instances
2. Go to -> Virtual Machine Scale Sets -> hr-dev-web-vmss -> Settings -> Scaling -> Configure tab -> Open LB Connection Rule
3. Go to -> Virtual Machine Scale Sets -> hr-dev-web-vmss -> Settings -> Scaling -> Run History Tab
4. Scale-In Observation: 1 VM should be deleted in VM Instances Tab of VMSS and should come down to value present in capacity block (capacity.minimum = 5 VMs)

Step-18: Delete Resources

# Delete Resources
terraform destroy 
[or]
terraform apply -destroy -auto-approve

# Important Notes
1. If any error occures during Destroy, again run same destroy command
2. If error continues during destroy consistently and no resources getting deleted, delete the Resource Group using Azure Portal Management Console.

# Error-1: Sample Error during Destroy
azurerm_subnet.appsubnet: Destruction complete after 21s
╷
│ Error: Error waiting for removal of Backend Address Pool Association for NIC "hr-dev-web-linuxvm-nic" (Resource Group "hr-dev-rg"): Code="OperationNotAllowed" Message="Operation 'startTenantUpdate' is not allowed on VM 'hr-dev-web-linuxvm' since the VM is marked for deletion. You can only retry the Delete operation (or wait for an ongoing one to complete)." Details=[]

# Clean-Up Files
rm -rf .terraform* 
rm -rf terraform.tfstate*

Additional References

Azure Autoscaling Open issues

https://serverfault.com/questions/973421/terraform-autoscale-rule-to-scale-instance-to-specific-instance-count
https://github.com/hashicorp/terraform-provider-azurerm/issues/3870

Understanding Autoscaling in Azure

https://docs.microsoft.com/en-us/azure/azure-monitor/autoscale/autoscale-understanding-settings