Virtual Machine Orchestration on GCE

Summary

In this article we tackle VM orchestration. As I touched on in other articles, the goal is to dynamically spin up VMs as necessary. The Google Cloud constructs we will use are instance templates, instance groups, load balancers, and health checks, plus Salt (both state and reactor).

First Things First

In order to dynamically spin up VMs we need an instance group, and for an instance group to work dynamically we need an instance template.

Instance Template

I will name this instance template web-test. The name is important, but we'll touch on why later on.

GCE – Instance Template – Name
GCE – Instance Template – CentOS

For this demonstration we use CentOS 8. It can be any OS, but our Salt state is tuned for CentOS.

GCE – Automation

As we touched on in the Cloud-init on Google Compute Engine article, we need to automate the provisioning and configuration of this VM. Since Google's CentOS image does not ship with cloud-init, we use a startup script to install it. Once loaded and booted, cloud-init configures the local machine as a salt-minion and points it at the master.

The startup script is below.

#!/bin/bash

if ! type cloud-init > /dev/null 2>&1 ; then
  # Log startup of script
  echo "Ran - $(date)" >> /root/startup
  sleep 30
  yum install -y cloud-init

  if [ $? -eq 0 ]; then
    echo "Ran - yum success - $(date)" >> /root/startup
    systemctl enable cloud-init
    # Sometimes the GCE metadata URI is inaccessible right after boot,
    # so start cloud-init now and give it a moment
    systemctl start cloud-init
    sleep 10
  else
    echo "Ran - yum fail - $(date)" >> /root/startup
  fi

  # Reboot either way
  reboot
fi

cloud-init.yaml is below.

#cloud-config

yum_repos:
    salt-py3-latest:
        baseurl: https://repo.saltstack.com/py3/redhat/$releasever/$basearch/latest
        name: SaltStack Latest Release Channel Python 3 for RHEL/Centos $releasever
        enabled: true
        gpgcheck: true
        gpgkey: https://repo.saltstack.com/py3/redhat/$releasever/$basearch/latest/SALTSTACK-GPG-KEY.pub

salt_minion:
    pkg_name: 'salt-minion'
    service_name: 'salt-minion'
    config_dir: '/etc/salt'
    conf:
        master: saltmaster263.us-central1-c.c.woohoo-blog-2414.internal
    grains:
        role:
            - web
GCE – Instance Template – Network tags – allow-health-checks

The network tag itself does not do anything on its own. Later on we will tie it to a firewall rule that allows the Google health checks to pass.
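
If you prefer the SDK over the console, a rough gcloud equivalent of the steps above might look like this (the machine type is an assumption; the file names match the ones used later in this post):

# Hypothetical sketch: create the web-test template with CentOS 8, the
# allow-health-checks network tag, and both automation metadata files.
gcloud compute instance-templates create web-test \
  --machine-type f1-micro \
  --image-project centos-cloud --image-family centos-8 \
  --tags allow-health-checks \
  --metadata-from-file user-data=cloud-init.yaml,startup-script=cloud-bootstrap.sh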

Now we have an instance template. From our Intro to Salt Stack article we should already have a Salt master.

SaltStack Server

We already have a state file to handle provisioning, but from our exposure so far we need Salt to do a few things automagically.

Salt is a fairly complex setup so I have provided some of the files at the very bottom. I did borrow many ideas from this page of SaltStack’s documentation – https://docs.saltstack.com/en/latest/topics/tutorials/states_pt4.html

The first thing is to automatically accept new minions, as key acceptance is usually a manual step. We then need Salt to apply a state. Please keep in mind there are security implications to auto-accepting keys. These scripts do not take that into consideration; they are just a baseline to get this working.
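
For reference, the manual flow we are replacing looks roughly like this on the master (the minion id is hypothetical):

# List pending keys, accept one by name, then apply the state by hand.
salt-key -L
salt-key -a web-test-abcd
salt 'web-test-abcd' state.apply wordpress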

In order to have this happen automatically, we use Salt reactor, which listens for events and acts on them. Our reactor file looks like this. We could add some validation, particularly on the accept, such as checking that the minion name contains "web" before pushing the wordpress state.

{# test server is sending new key -- accept this key #}
{% if 'act' in data and data['act'] == 'pend' %}
minion_add:
  wheel.key.accept:
  - match: {{ data['id'] }}
{% endif %}
{% if data['act'] == 'accept' %}
initial_load:
  local.state.sls:
    - tgt: {{ data['id'] }}
    - arg:
      - wordpress
{% endif %}

This is fairly simple: when a minion authenticates for the first time, acknowledge it and then apply the wordpress state we worked on in our article on Salt State.
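
For the reactor to fire, the master has to map events to this file. A minimal sketch, assuming the reactor SLS above is saved as /srv/reactor/auth-pending.sls (that path and the drop-in file name are my own choices):

# Hypothetical: bind the salt/auth event (which carries the 'act' field
# used above) to the reactor file, then restart the master to pick it up.
cat > /etc/salt/master.d/reactor.conf <<'EOF'
reactor:
  - 'salt/auth':
    - /srv/reactor/auth-pending.sls
EOF
systemctl restart salt-master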

Since we may have multiple, rotating servers that spin up and down, we will point Cloudflare at a Google load balancer. Cloudflare does offer load balancing, but for the integration we want it is easier to use Google's. The load balancer requires an instance group, so we need to set that up first.

Instance Groups

Instance groups are one of the constructs you can point a load balancer at. Google has two types of instance groups: managed, which will auto-scale and self-heal based on health checks, and unmanaged, to which you have to manually add VMs. We will choose managed.

GCE – New Managed Instance

This name is not too important, so it can be anything you like.

GCE – Instance Group

Here we set the port name and number and select our instance template. For this lab we disabled autoscaling, but in the real world autoscaling is the whole reason to set all of this up.
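
A rough gcloud sketch of the same group (the group name, size, and port name/number are assumptions; the zone matches the Salt master's, and port 443 matches the firewall rule later in this article):

# Hypothetical: create a managed group from the web-test template and
# name the service port that the load balancer backend will reference.
gcloud compute instance-groups managed create web-test-group \
  --zone us-central1-c --template web-test --size 2
gcloud compute instance-groups set-named-ports web-test-group \
  --zone us-central1-c --named-ports https:443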

Instance Group – Health Check

The health check expects to receive an HTTP 200 for all clear. It is much better than a TCP check, as it can validate that the web server is actually responding with expected content. Since WordPress sends a 301 redirect, we do have to set the Host HTTP header here, otherwise the check will fail. Other load balancers only fail on 400-599, but Google expects exactly an HTTP 200 per their documentation – https://cloud.google.com/load-balancing/docs/health-check-concepts
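
Sketched as a gcloud command, the check might look like this (the check name and Host value are assumptions):

# Hypothetical HTTPS health check; the --host header matches the site name
# so WordPress answers 200 instead of its usual 301 redirect.
gcloud compute health-checks create https wordpress-hc \
  --port 443 --request-path / --host blog.example.com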

Instance Group Provisioning

And here you can see it is provisioning! While it does that, let's set up the firewall rules and the load balancer.

Firewall Rules

The health checks for the load balancer come from a set range of Google IPs that we need to allow. We can scope the rule to just our web VMs via the network tag from the instance template. Per Google's health check documentation, the HTTP(S) checks come from two ranges.

VPC – Allow Health Checks!

Here we allow the health checks from the Google-identified IP ranges, and only to port 443 on machines tagged with "allow-health-checks".
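
The equivalent gcloud rule might look like this (the network name is an assumption; the two source ranges are the ones Google documents for health checks):

# Hypothetical: allow Google's documented health-check ranges to reach
# port 443, but only on instances carrying the allow-health-checks tag.
gcloud compute firewall-rules create allow-health-checks \
  --network default --direction INGRESS --action ALLOW --rules tcp:443 \
  --source-ranges 130.211.0.0/22,35.191.0.0/16 \
  --target-tags allow-health-checks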

Google Load Balancer

Initial

This is a crash course in load balancers if you have never set one up before. It is expected you have some understanding of front ends, back ends, and health checks. The firewall rules above already allow the health checks through.

Google Load Balancer – Start configuration
Google Load Balancer – Internet

Back End Configuration

Google’s load balancers can be internal-only or external-facing. We want to load balance external connections.

Google Load Balancer – Back End Create

We will need to create a back end endpoint.

Luckily this is simple. We point it at a few objects we already created and set session affinity so that a client's traffic is persistent to a single web server. We do not want requests hopping between servers, as that may confuse the web services.
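
A gcloud sketch of that backend service (the names carry over from the hypothetical commands above):

# Hypothetical: create the backend service with client-IP session affinity
# and the health check, then attach the managed instance group to it.
gcloud compute backend-services create web-test-backend \
  --global --protocol HTTPS --port-name https \
  --health-checks wordpress-hc --session-affinity CLIENT_IP
gcloud compute backend-services add-backend web-test-backend \
  --global --instance-group web-test-group \
  --instance-group-zone us-central1-c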

Front End Configuration

Health Check Validation

Give the load balancer a few minutes to provision. It should then show up healthy if all is well. This never comes up the first time. Not even in a lab!

Google Load Balancer – Healthy!

Troubleshooting

The important part is to walk through the process from beginning to end when something does not work. Here's a quick run-through; matching commands are sketched after the list.

  • On provisioning, is the instance group provisioning the VM?
  • What is the status of cloud-init?
  • Is salt-minion installing on the VM and starting?
  • Does the salt-master see the minion?
  • Reapply the state and check for errors
  • Does the load balancer see health?
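
One possible command per checkpoint (the names come from the hypothetical commands earlier; run the cloud-init and minion checks on the VM, and the salt commands on the master):

# Is the instance group provisioning the VM?
gcloud compute instance-groups managed list-instances web-test-group --zone us-central1-c
# On the VM: what is the status of cloud-init?
sudo cloud-init status --long
# On the VM: did salt-minion install and start?
sudo systemctl status salt-minion
# On the master: does it see the minion?
sudo salt-key -L
# Reapply the state and check for errors
sudo salt 'web-test-*' state.apply wordpress
# Does the load balancer see a healthy backend?
gcloud compute backend-services get-health web-test-backend --global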

Final Words

If it does come up healthy, the last step is to point your DNS at the load balancer public IP and be on your way!

Since Salt is such a complex beast, I have provided most of the framework and configs here. Some of the more sensitive files are truncated but left in place so that you know they exist. The standard disclaimer applies: I cannot guarantee the outcome of these files on your system or that they are best practices from a security standpoint.

Cloud-Init on Google Compute Engine

Summary

Yesterday I was playing a bit with Google load balancers, and they tend to work best when you connect them to an automated instance group. I may touch on that in another article, but in short it requires some level of automation. An instance group will attempt to spin up instances automatically and, based on health checks, introduce them to the load-balanced cluster.

The Problem?

How do we automate provisioning? I have been touching on SaltStack in a few articles. Salt is great for configuration management, but when everything is automated, how do you get Salt on there in the first place? That was my goal: to get Salt installed on a newly provisioned VM.

Method

Cloud-init is a very widely known method of provisioning a machine. From my brief understanding it started with Ubuntu and then took off. I was briefly exposed to it in Spinning Up Rancher With Kubernetes. It makes sense and is widely supported. The concept is simple: a one-time provisioning of the server.

Google Compute Engine

Google Compute Engine, or GCE, does support pushing cloud-init configuration (cloud-config) through metadata. You set the "user-data" metadata field, and if cloud-init is installed it will find and apply it.

The problem is that the only image that seems to support this out of the box is Ubuntu, and my current preferred platform is CentOS (although that is starting to change).

Startup Scripts

So if we don’t have cloud-init, what can we do? Google does have functionality for startup and shutdown scripts via the "startup-script" and "shutdown-script" metadata fields. I do not want a script that runs every time, and I also do not want to re-invent the wheel by writing a failsafe script that pushes salt-minion out and reconfigures it. For this reason I came up with a one-time startup script.

The Solution

Startup Script

This is the startup script I came up with.

#!/bin/bash

if ! type cloud-init > /dev/null 2>&1 ; then
  echo "Ran - $(date)" >> /root/startup
  sleep 30
  yum install -y cloud-init

  if [ $? -eq 0 ]; then
    echo "Ran - Success - $(date)" >> /root/startup
    systemctl enable cloud-init
    #systemctl start cloud-init
  else
    echo "Ran - Fail - $(date)" >> /root/startup
  fi

  # Reboot either way
  reboot
fi

This script checks to see if cloud-init exists. If it does, we move along and don't waste CPU. If it does not, we wait 30 seconds and install it. Upon success we enable the service, and either way we reboot.

Workaround

I played with this for a good part of a day, trying to get it working. Without the wait and other logging logic in the script, the following would happen.

2019-11-14T18:04:37Z DEBUG DNF version: 4.0.9
2019-11-14T18:04:37Z DDEBUG Command: dnf install -y cloud-init
2019-11-14T18:04:37Z DDEBUG Installroot: /
2019-11-14T18:04:37Z DDEBUG Releasever: 8
2019-11-14T18:04:37Z DEBUG cachedir: /var/cache/dnf
2019-11-14T18:04:37Z DDEBUG Base command: install
2019-11-14T18:04:37Z DDEBUG Extra commands: ['install', '-y', 'cloud-init']
2019-11-14T18:04:37Z DEBUG repo: downloading from remote: AppStream
2019-11-14T18:05:05Z DEBUG error: Curl error (7): Couldn't connect to server for http://mirrorlist.centos.org/?release=8&arch=x86_64&repo=AppStream&infra=stock [Failed to connect to mirrorlist.centos.org port 80: Connection timed out] (http://mirrorlist.centos.org/?release=8&arch=x86_64&repo=AppStream&infra=stock).
2019-11-14T18:05:05Z DEBUG Cannot download 'http://mirrorlist.centos.org/?release=8&arch=x86_64&repo=AppStream&infra=stock': Cannot prepare internal mirrorlist: Curl error (7): Couldn't connect to server for http://mirrorlist.centos.org/?release=8&arch=x86_64&repo=AppStream&infra=stock [Failed to connect to mirrorlist.centos.org port 80: Connection timed out].
2019-11-14T18:05:05Z DDEBUG Cleaning up.
2019-11-14T18:05:05Z SUBDEBUG
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/dnf/repo.py", line 566, in load
    ret = self._repo.load()
  File "/usr/lib64/python3.6/site-packages/libdnf/repo.py", line 503, in load
    return _repo.Repo_load(self)
RuntimeError: Failed to synchronize cache for repo 'AppStream'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/dnf/cli/main.py", line 64, in main
    return _main(base, args, cli_class, option_parser_class)
  File "/usr/lib/python3.6/site-packages/dnf/cli/main.py", line 99, in _main
    return cli_run(cli, base)
  File "/usr/lib/python3.6/site-packages/dnf/cli/main.py", line 115, in cli_run
    cli.run()
  File "/usr/lib/python3.6/site-packages/dnf/cli/cli.py", line 1124, in run
    self._process_demands()
  File "/usr/lib/python3.6/site-packages/dnf/cli/cli.py", line 828, in _process_demands
    load_available_repos=self.demands.available_repos)
  File "/usr/lib/python3.6/site-packages/dnf/base.py", line 400, in fill_sack
    self._add_repo_to_sack(r)
  File "/usr/lib/python3.6/site-packages/dnf/base.py", line 135, in _add_repo_to_sack
    repo.load()
  File "/usr/lib/python3.6/site-packages/dnf/repo.py", line 568, in load
    raise dnf.exceptions.RepoError(str(e))
dnf.exceptions.RepoError: Failed to synchronize cache for repo 'AppStream'
2019-11-14T18:05:05Z CRITICAL Error: Failed to synchronize cache for repo 'AppStream'

Interestingly, it would work on the second boot. I posted on ServerFault about this – https://serverfault.com/questions/991899/startup-script-centos-8-yum-install-no-network-on-first-boot. I will try to update this article if it goes anywhere, as the "sleep 30" is annoying. The first iteration had a sleep 10 and it did not work.
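
One possible alternative to the fixed sleep (an untested sketch): poll the mirrorlist URL from the error log above until it answers, then let yum proceed.

# Hypothetical replacement for 'sleep 30': wait up to roughly a minute for
# the CentOS mirrorlist (the URL from the failure above) to become reachable.
for i in $(seq 1 12); do
  curl -fsS --max-time 4 \
    'http://mirrorlist.centos.org/?release=8&arch=x86_64&repo=AppStream&infra=stock' \
    > /dev/null && break
  sleep 5
done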

It was strange because I could log in, manually run the debug on it, and it would succeed.

sudo google_metadata_script_runner --script-type startup --debug

Cloud-Init

Our goal was to use this, right? Cloud-init has a nice module for installing and configuring Salt – https://cloudinit.readthedocs.io/en/latest/topics/modules.html#salt-minion

#cloud-config

yum_repos:
    salt-py3-latest:
        baseurl: https://repo.saltstack.com/py3/redhat/$releasever/$basearch/latest
        name: SaltStack Latest Release Channel Python 3 for RHEL/Centos $releasever
        enabled: true
        gpgcheck: true
        gpgkey: https://repo.saltstack.com/py3/redhat/$releasever/$basearch/latest/SALTSTACK-GPG-KEY.pub

salt_minion:
    pkg_name: 'salt-minion'
    service_name: 'salt-minion'
    config_dir: '/etc/salt'
    conf:
        master: salt.example.com
    grains:
        role:
            - web

This sets up the repo for Salt. I prefer their repo over EPEL, as EPEL tends to be dated. It then sets some simple salt-minion configs to get it going!
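
After the first boot you can sanity-check that the module did its job (a hypothetical spot check on the new VM; the status subcommand requires a reasonably recent cloud-init):

# Did cloud-init finish, is the minion running, and did the master setting land?
sudo cloud-init status --long
sudo systemctl status salt-minion
sudo grep '^master:' /etc/salt/minion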

How do you set this?

You can set this two ways. One is from the command line if you have the SDK.

% gcloud compute instances create test123-17 --machine-type f1-micro --image-project centos-cloud --image-family centos-8 --metadata-from-file user-data=cloud-init.yaml,startup-script=cloud-bootstrap.sh

Or you can use the console and paste it in plain text.

GCE – Automation – Startup and user-data

Don’t feel bad if you can’t find these settings. They are buried here.

Finding Automation Settings

Final Words

In this article we walked through automating provisioning. You can use cloud-init for all sorts of things, such as ensuring the machine is completely up to date before handing it off, or adding users and keys. For our need, we just wanted to get Salt on there so the machine could plug into config management.