Automating VM Provisioning for Costico: A Comprehensive Guide

Introduction

Managing infrastructure for a global company like Costco, with 20+ vCenters, over 5000 ESXi hosts, and 45,000 virtual machines, is no small feat. Before automation, the team faced extended provisioning times, manual errors, and difficulty scaling resources efficiently. For a Platform Engineer supporting a VMware estate of this size, automation is what streamlines repetitive tasks, improves efficiency, and reduces errors. In this blog post, we’ll walk through how we automated VM provisioning end-to-end using GitHub, Visual Studio Code, Ansible, Python, PowerShell, ServiceNow, and Grafana.

The Infrastructure Landscape

Costco operates a global infrastructure:

  • 20+ vCenters distributed across various regions.
  • 5000+ ESXi Hosts, each hosting multiple virtual machines.
  • 45,000 VMs providing services such as networking, storage, and compute for internal and external customers.
  • A tenant-based model, where each tenant corresponds to a specific business unit or customer.
  • Two major Business Units (BUs) with distinct VM templates for Linux, Windows, and VDI setups.
  • Access to the provisioning platform is managed through a web-based tool called Costico-VMs-Delivery, secured via LDAP and SSO, with MFA tools for enhanced security.
  • AZ-based infrastructure for each location, with required tags such as environment (prod, dev), cost center, and application type, helping to identify resources and streamline operational tasks efficiently.
  • Aria Suite tools (vROps and vRA) used to monitor capacity at each location and resize the AZ when insufficient capacity is detected.
  • An enabled / disabled / admin-enabled state model applied to Platform AZ providers, ensuring controlled access and seamless transitions between operational states.
  • Ticket mechanism integrated with ServiceNow to track and resolve VM provisioning failures. For example, if a VM creation fails due to insufficient AZ capacity, the ticket includes logs and suggested actions for resolution.
  • Dashboards using Grafana or similar tools to monitor overall capacity and provisioning statistics in real time. These dashboards display key metrics like CPU/RAM usage, disk IOPS, and provisioning trends.
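
The required-tag rule in the list above lends itself to a small pre-flight check. The following is a minimal sketch, not the platform's actual validator: the tag keys `cost_center` and `application_type` and the policy table are assumptions made for illustration, and only the `prod`/`dev` environment values come from this post.

```python
# Hypothetical tag policy: key names and the "any value" rule are assumptions.
REQUIRED_TAGS = {
    "environment": {"prod", "dev"},  # allowed values named in the post
    "cost_center": None,             # None means any non-empty value is accepted
    "application_type": None,
}


def missing_or_invalid_tags(tags):
    """Return a sorted list of problems found in a VM's tag dictionary."""
    problems = []
    for key, allowed in REQUIRED_TAGS.items():
        value = tags.get(key)
        if not value:
            problems.append(f"missing tag: {key}")
        elif allowed is not None and value not in allowed:
            problems.append(f"invalid value for {key}: {value}")
    return sorted(problems)


if __name__ == "__main__":
    print(missing_or_invalid_tags({"environment": "qa"}))
```

Running a check like this before provisioning starts means a request with incomplete tags can be rejected early instead of producing an unidentifiable VM.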

To manage such a massive infrastructure, automation is not just a luxury but a necessity.

Tools of the Trade

  1. GitHub: Source control for managing Ansible playbooks, PowerShell, and Python scripts.
  2. Visual Studio Code: IDE for writing and debugging automation scripts.
  3. Ansible: The backbone of our automation for provisioning and configuration management.
  4. ServiceNow: ITSM platform for handling VM provisioning requests and failure ticketing.
  5. Python: For scripting advanced automation and API integrations.
  6. PowerShell: For VMware-specific tasks requiring PowerCLI.
  7. GitHub Actions and Jenkins: CI/CD tools to automate testing and deployment pipelines.
  8. Aria Suite (vROps and vRA): For capacity management and automated scaling based on AZ utilization.
  9. Grafana: For real-time dashboards showing capacity, statistics, and provisioning metrics.

VM Provisioning Workflow

The VM provisioning process involves multiple steps:

  1. Request Intake: A user raises a request via ServiceNow, specifying VM requirements.
  2. Approval Workflow: Requests are routed for managerial or automated approval.
  3. Automation Trigger: Upon approval, a webhook triggers the automation pipeline.
  4. VM Provisioning: Ansible playbooks, Python scripts, and PowerCLI commands create the VM, configure it, and attach it to the tenant’s environment.
  5. Capacity Monitoring: vROps verifies the available resources in the AZ, and the environment is resized if necessary.
  6. Ticket Management: Provisioning failures automatically generate tickets in ServiceNow, allowing for root cause analysis and resolution.
  7. Monitoring Dashboards: Grafana dashboards provide real-time insights into capacity usage, provisioning success rates, and overall system health.
  8. Notification and Handoff: The requestor is notified, and the VM is handed off for use.
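
The steps above can be pictured as an ordered list of callables that stops at the first failure and reports which step failed, so that a ticket can be raised with the right context. This is a sketch under assumptions: `run_pipeline`, `check_capacity`, and the request keys `num_cpus` and `az_free_cpus` are hypothetical names, and the real pipeline is spread across ServiceNow, GitHub, and Ansible rather than one function.

```python
def check_capacity(request):
    """Capacity step: fail when the AZ cannot fit the requested CPUs."""
    if request["num_cpus"] > request["az_free_cpus"]:
        raise RuntimeError("insufficient AZ capacity")


def run_pipeline(request, steps):
    """Run (name, callable) steps in order and stop at the first failure."""
    completed = []
    for name, step in steps:
        try:
            step(request)
        except Exception as exc:
            # The failed step and error text are what a failure ticket would carry.
            return {
                "status": "failed",
                "failed_step": name,
                "error": str(exc),
                "completed": completed,
            }
        completed.append(name)
    return {"status": "succeeded", "failed_step": None, "error": None, "completed": completed}


if __name__ == "__main__":
    steps = [("capacity", check_capacity), ("provision", lambda request: None)]
    print(run_pipeline({"num_cpus": 8, "az_free_cpus": 4}, steps))
```

Keeping each step as a plain callable makes it easy to swap a real provisioning call for a stub in tests.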

End-to-End Implementation

1. ServiceNow Integration

ServiceNow acts as the central hub for handling VM provisioning requests:

  • Catalog Form: Users submit requests through a custom catalog item with fields for OS, CPU, RAM, storage, and networking preferences.
  • Webhook Trigger: Approved requests send a payload via a webhook to the automation pipeline hosted on GitHub.
  • Ticketing Mechanism: Provisioning failures generate ServiceNow tickets with detailed logs for troubleshooting.
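
To make the webhook and ticketing flow concrete, here is a sketch of the two translations involved: webhook body to Ansible variables, and failure to ticket body. The catalog field names (`cpu`, `ram_gb`, `storage_gb`, `network`) are assumptions, since the real form fields are not listed in this post. The ticket fields `short_description`, `description`, and `correlation_id` exist on the standard ServiceNow incident table, but whether the platform uses that table is also an assumption.

```python
import json

# Hypothetical catalog field names; the real form fields are not given in this post.
REQUIRED_FIELDS = ("vm_name", "os", "cpu", "ram_gb", "storage_gb", "network")


def to_extra_vars(raw_body):
    """Validate a webhook body and map it onto the playbook variable names."""
    payload = json.loads(raw_body)
    missing = [name for name in REQUIRED_FIELDS if payload.get(name) in (None, "")]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    # "os" is validated here; choosing vm_template from it is left to a separate step.
    return {
        "vm_name": payload["vm_name"],
        "num_cpus": int(payload["cpu"]),
        "memory_mb": int(payload["ram_gb"]) * 1024,
        "disk_size": int(payload["storage_gb"]),
        "network_name": payload["network"],
    }


def build_failure_ticket(request_id, failed_step, log_lines):
    """Build the body of an incident record describing a provisioning failure."""
    return {
        "short_description": f"VM provisioning failed at step '{failed_step}'",
        "description": "\n".join(log_lines[-20:]),  # keep only the last 20 log lines
        "correlation_id": request_id,
    }
```

A body like the one returned by `build_failure_ticket` could be posted to the ServiceNow Table API; the HTTP call and authentication are left out of this sketch.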

2. GitHub and CI/CD

  • Repository Structure:
    • playbooks/: Ansible playbooks for provisioning and configuration.
    • scripts/: Custom PowerCLI and Python scripts.
    • templates/: YAML templates for different VM configurations based on the two BUs.
    • .github/workflows/: GitHub Actions for CI/CD.
  • CI/CD Pipeline:
    • Linting: Validates Ansible playbooks and Python/PowerShell scripts.
    • Unit Tests: Simulate playbook execution in mock environments and run pytest against the Python scripts.
    • Integration Tests: Validate end-to-end workflows in a staging environment, ensuring that interactions between ServiceNow, Ansible, and vSphere function correctly.
    • Deployment: Runs approved playbooks against the target infrastructure.
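
As an illustration of the kind of unit test the pipeline can run with pytest, the sketch below resolves a template file under templates/ and tests that lookup. The BU labels (`bu1`, `bu2`), the file layout, and the function name `template_path` are hypothetical; the post only states that there are two BUs with Linux, Windows, and VDI templates.

```python
from pathlib import PurePosixPath

# Hypothetical identifiers: the post names two BUs and three template types, not their labels.
BUSINESS_UNITS = {"bu1", "bu2"}
OS_FAMILIES = {"linux", "windows", "vdi"}


def template_path(business_unit, os_family):
    """Resolve the YAML template for a BU and OS family, rejecting unknown values."""
    bu, family = business_unit.lower(), os_family.lower()
    if bu not in BUSINESS_UNITS:
        raise ValueError(f"unknown business unit: {business_unit}")
    if family not in OS_FAMILIES:
        raise ValueError(f"unknown OS family: {os_family}")
    return str(PurePosixPath("templates") / bu / f"{family}.yml")


# pytest collects functions named test_*; plain asserts are enough.
def test_known_combination():
    assert template_path("BU1", "Linux") == "templates/bu1/linux.yml"


def test_unknown_business_unit_is_rejected():
    try:
        template_path("bu9", "linux")
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

Tests of this size run in milliseconds, which keeps them practical to execute on every pull request.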

3. Ansible for Automation

Ansible playbooks handle provisioning tasks:

  • VM Creation:
    # Clones a VM from a template; requires the community.vmware collection
    - name: Create Virtual Machine
      hosts: localhost
      gather_facts: false
      tasks:
        - name: Deploy VM from template
          community.vmware.vmware_guest:
            hostname: "{{ vcenter_hostname }}"
            username: "{{ vcenter_username }}"
            password: "{{ vcenter_password }}"
            validate_certs: false  # skips TLS verification; enable it where vCenter certificates are trusted
            datacenter: "{{ datacenter_name }}"
            cluster: "{{ cluster_name }}"
            template: "{{ vm_template }}"
            name: "{{ vm_name }}"
            disk:
              - size_gb: "{{ disk_size }}"
                type: thin
            hardware:
              memory_mb: "{{ memory_mb }}"
              num_cpus: "{{ num_cpus }}"
  • Network Configuration:
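    - name: Configure Network
      hosts: localhost
      gather_facts: false
      tasks:
        - name: Attach VM to network
          community.vmware.vmware_guest_network:
            hostname: "{{ vcenter_hostname }}"
            username: "{{ vcenter_username }}"
            password: "{{ vcenter_password }}"
            validate_certs: false
            datacenter: "{{ datacenter_name }}"
            name: "{{ vm_name }}"
            # Recent collection releases take flat options rather than a networks list
            network_name: "{{ network_name }}"
            device_type: vmxnet3
            state: present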
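
Playbooks like these are typically started by the pipeline with request-specific variables. The sketch below builds the `ansible-playbook` command line with `--extra-vars` passed as JSON. The playbook path `playbooks/create_vm.yml` and the helper name are assumptions; how the actual pipeline invokes Ansible is not described in this post.

```python
import json


def build_playbook_command(playbook, extra_vars, check_mode=False):
    """Return the argument list for an ansible-playbook run."""
    command = [
        "ansible-playbook",
        playbook,
        "--extra-vars",
        json.dumps(extra_vars, sort_keys=True),  # JSON is accepted by --extra-vars
    ]
    if check_mode:
        command.append("--check")  # dry run: report changes without applying them
    return command


if __name__ == "__main__":
    # Hypothetical playbook path and values, shown only to illustrate the shape.
    print(build_playbook_command("playbooks/create_vm.yml", {"vm_name": "app01", "num_cpus": 2}))
```

The returned list can be handed to `subprocess.run(command, check=True)`; passing a list rather than a shell string avoids quoting problems with user-supplied VM names.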

4. Custom Scripts with PowerCLI and Python

For tasks requiring advanced scripting, we used PowerCLI and Python:

  • PowerCLI Script for Tagging VMs:
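    # Connect to vCenter (credentials are supplied by the pipeline)
    Connect-VIServer -Server $vcenterHostname -User $vcenterUsername -Password $vcenterPassword
    # Reuse the tag if it already exists; otherwise create it in the given category
    $tag = Get-Tag -Name $tagName -Category $tagCategory -ErrorAction SilentlyContinue
    if (-not $tag) { $tag = New-Tag -Name $tagName -Category $tagCategory }
    New-TagAssignment -Tag $tag -Entity (Get-VM -Name $vmName)
    Disconnect-VIServer -Server $vcenterHostname -Confirm:$false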
  • Python Script for Validation:
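    import requests
    
    def validate_vm_existence(vcenter_url, vm_name, token):
        # The vSphere Automation API authenticates with a session ID header
        headers = {"vmware-api-session-id": token}
        # /api/vcenter/vm/{vm} expects a VM ID (e.g. vm-123), so filter the VM list by name
        response = requests.get(f"{vcenter_url}/api/vcenter/vm", params={"names": vm_name},
                                headers=headers, timeout=30)
        if response.status_code == 200 and response.json():
            print("VM exists:", response.json())
            return True
        print("VM not found or error occurred:", response.text)
        return False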
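
When the existence check is made against the VM list endpoint, the response is a list of summaries rather than a single object, so it helps to insist on exactly one match. This is a sketch assuming summaries shaped like those of the vSphere Automation API (`vm`, `name`, `power_state`); the helper name and the sample values are made up for illustration.

```python
def summarize_vm_lookup(vms, vm_name):
    """Return (vm_id, power_state) when exactly one VM carries the given name."""
    matches = [vm for vm in vms if vm.get("name") == vm_name]
    if len(matches) != 1:
        # Zero matches means provisioning failed; several means the name is ambiguous.
        raise LookupError(f"expected one VM named {vm_name!r}, found {len(matches)}")
    return matches[0]["vm"], matches[0]["power_state"]


if __name__ == "__main__":
    sample = [  # invented sample data
        {"vm": "vm-101", "name": "app01", "power_state": "POWERED_ON"},
        {"vm": "vm-102", "name": "db01", "power_state": "POWERED_OFF"},
    ]
    print(summarize_vm_lookup(sample, "app01"))
```

Treating duplicates as an error is a design choice: VM names are not guaranteed to be unique across folders, and a validation step that silently picks one can hide a naming collision.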

5. Monitoring and Dashboards

Grafana dashboards aggregate data from vROps and ServiceNow to provide:

  • Real-time Capacity Metrics: Monitor AZ usage and identify bottlenecks with updates every five minutes, visualized as heatmaps and line graphs in Grafana.
  • Provisioning Statistics: Track success and failure rates.
  • Failure Analysis: Highlight common issues leading to ServiceNow tickets.
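
The provisioning statistics and failure analysis panels boil down to simple aggregations. The sketch below computes a success rate and the most common failure reasons from a list of provisioning events. The event shape (`status`, `reason`) is an assumption; in practice the figures would come from the vROps and ServiceNow data sources behind Grafana.

```python
from collections import Counter


def provisioning_stats(events):
    """Summarize provisioning events into a success rate and top failure reasons."""
    total = len(events)
    succeeded = sum(1 for event in events if event["status"] == "succeeded")
    failures = Counter(event["reason"] for event in events if event["status"] == "failed")
    rate = round(100.0 * succeeded / total, 1) if total else 0.0
    return {
        "total": total,
        "success_rate_pct": rate,
        "top_failures": failures.most_common(3),
    }


if __name__ == "__main__":
    events = [  # invented sample events
        {"status": "succeeded"},
        {"status": "succeeded"},
        {"status": "succeeded"},
        {"status": "failed", "reason": "insufficient AZ capacity"},
    ]
    print(provisioning_stats(events))
```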

6. Notification and Handoff

After successful provisioning:

  • Email Notification: The requester receives an email with VM details.
  • ServiceNow Update: The request is marked as completed, and logs are attached.
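
The email step can be sketched with Python's standard library. The sender address, the VM detail keys (`name`, `ip_address`, `vcenter`), and the wording of the message are assumptions for illustration; the post does not show the actual notification template.

```python
from email.message import EmailMessage


def build_handoff_email(requester, vm):
    """Build the notification sent to the requester once the VM is ready."""
    message = EmailMessage()
    message["To"] = requester
    message["From"] = "vm-provisioning@example.com"  # hypothetical sender address
    message["Subject"] = f"VM {vm['name']} is ready"
    message.set_content(
        f"Your virtual machine has been provisioned.\n"
        f"Name: {vm['name']}\n"
        f"IP address: {vm['ip_address']}\n"
        f"vCenter: {vm['vcenter']}\n"
    )
    return message


if __name__ == "__main__":
    vm = {"name": "app01", "ip_address": "10.0.0.5", "vcenter": "vc01.example.com"}
    print(build_handoff_email("user@example.com", vm))
```

A message built this way can be delivered with `smtplib.SMTP(...).send_message(message)`; the mail relay details are environment-specific and omitted here.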

Challenges and Lessons Learned

  1. Scalability: Automating at global scale required testing playbooks against diverse environments.
  2. Error Handling: Added robust logging and retries to handle transient issues in vSphere.
  3. Code Quality: Implemented CI/CD pipelines with linting, unit testing, and integration testing for reliable automation.
  4. Collaboration: Leveraging GitHub improved version control and collaboration across teams.
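
The retry handling mentioned under Error Handling can be illustrated with a small exponential backoff helper. This is a generic sketch, not the code used in the pipeline: the function name, the default attempt count, and the choice of `ConnectionError` and `TimeoutError` as the transient errors are all assumptions.

```python
import time


def with_retries(operation, attempts=3, base_delay=1.0, sleep=time.sleep,
                 retry_on=(ConnectionError, TimeoutError)):
    """Call operation(), retrying transient errors with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the error so a ticket can be raised
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print(f"attempt {attempt} failed ({exc}); retrying in {delay}s")
            sleep(delay)
```

The `sleep` parameter is injectable so that tests can record the delays instead of waiting, and only the listed exception types are retried so that genuine configuration errors fail fast.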

Conclusion

By integrating tools like GitHub, Ansible, ServiceNow, PowerShell, Python, the Aria Suite, and Grafana, we achieved a seamless VM provisioning pipeline for Costco’s massive VMware infrastructure. This automation reduced provisioning times from days to minutes, improved accuracy, and freed up engineers for higher-value tasks.

Whether you’re managing a small data center or a global infrastructure, the principles and tools outlined here can help streamline your operations. Happy automating!

 
