Investigate and solve Compute Engine cold starts like a detective🕵🏽‍♀️


Season of Scale

“Season of Scale” is a blog and video series to help enterprises and developers build scale and resilience into their design patterns. In this series, we walk you through patterns and practices for creating apps that are resilient and scalable, two essential goals of many modern architecture exercises.

In Season 2, we’re covering how to optimize your applications to improve instance startup time! If you haven’t seen Season 1, check it out here.

  1. How to improve Compute Engine startup times (this article)
  2. How to improve App Engine startup times
  3. How to improve Cloud Run startup times

Shaving seconds off compute startup times might take a bit of detective work. How do you know if the issue lies in the request, provision, or boot phase? In this article, we home in on profiling Compute Engine instances. I’ll explain how to pinpoint whether provisioning, scripts, or images contribute to slower instance startup times.

Check out the video

Review

So far we have looked at Critter Junction, a multiplayer online game where players live life as critters. They’ve successfully launched and globally scaled their gaming app on Compute Engine. With their growing daily active users, we helped them set up autoscaling, global load balancing, and autohealing to handle globally distributed and constantly rising traffic.

Cold start time woes

But Critter Junction has been seeing longer-than-desired startup times for their Compute Engine instances, even though they set everything up according to our autoscaling recommendations. They knew they were running some logic on their game servers on Compute Engine, like taking user inputs to spawn them onto a new critter’s island. After profiling their startup times, they were seeing cold starts of more than 380 seconds, while the response latency for a request was in the 300 millisecond range.

They also ran a performance test, right from Cloud Shell, to see how long Compute Engine was taking to create their instances versus how much time their code was taking to run. The timeline broke down into three phases:

Request, Provision, Boot

Request is the time between asking for a VM and getting a response back from the Create Instance API acknowledging that you’ve asked for it. You can profile this by timing how long it takes Google Cloud to respond to the Insert Instance REST command.
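A quick way to approximate this from Cloud Shell is to time an asynchronous instance creation with gcloud, which returns as soon as the insert operation is accepted. The instance name, zone, and machine type below are placeholders, not Critter Junction’s actual configuration:

# Time how long the API takes to acknowledge the create request.
# --async returns once the insert operation is accepted, without
# waiting for the instance to be provisioned or booted.
time gcloud compute instances create critter-test-vm \
    --zone=us-central1-a \
    --machine-type=e2-medium \
    --async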

Provision is the time Compute Engine takes to find space for the VM on its infrastructure. Poll the Get Instance API regularly and wait for the status field to change from PROVISIONING to RUNNING.
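As a rough sketch, you can do that polling with gcloud from Cloud Shell; the instance name and zone are placeholders:

# Wait for the instance status to change from PROVISIONING/STAGING to RUNNING,
# then report how long provisioning took.
start=$(date +%s)
until [[ "$(gcloud compute instances describe critter-test-vm \
    --zone=us-central1-a --format='value(status)')" == "RUNNING" ]]; do
  sleep 1
done
echo "Provisioning took roughly $(( $(date +%s) - start ))s"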

Boot is the time during which startup scripts and other custom code execute, up to the point when the instance is available. Repeatedly poll a health check endpoint served by the same runtime as your app, then time the transitions between receiving 500, 400, and 200 status codes.
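A minimal sketch of that polling loop, assuming your app exposes a hypothetical /healthz endpoint on port 8080 at the instance’s external IP:

# Poll the app's health check until it returns 200, then report the elapsed time.
# Before the runtime is up you will typically see connection failures or 5xx/4xx codes.
start=$(date +%s)
until [[ "$(curl -s -o /dev/null -w '%{http_code}' "http://EXTERNAL_IP:8080/healthz")" == "200" ]]; do
  sleep 1
done
echo "Boot took roughly $(( $(date +%s) - start ))s"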

After profiling these phases, Critter Junction noticed that the majority of instance startup time happened during the boot phase, when the instance executes startup scripts. This is not uncommon, so it’s worth profiling your boot scripts to see which stages are creating performance bottlenecks.

Introducing the SECONDS Variable

To get a sense of which stages of your script take the most boot time, one trick is to wrap each section of your startup script with the shell’s built-in SECONDS variable, append the elapsed time for each stage to a file, and set up a new endpoint to serve that file when requested.

# SECONDS is a bash builtin that counts the seconds since it was last assigned.
SECONDS=0
# do some work
duration=$SECONDS
echo "$(($duration / 60)) minutes and $(($duration % 60)) seconds elapsed."
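Put together, a startup script instrumented this way might look something like the sketch below. The stage contents, bucket name, and the choice of nginx to serve the timing file from its default web root are illustrative assumptions, not Critter Junction’s actual setup:

#!/bin/bash
# Illustrative startup script: each stage resets SECONDS, does its work,
# and appends its elapsed time to a file that nginx later serves.
TIMING_FILE=/var/www/html/startup-timing.txt

SECONDS=0
# Stage 1: install OS packages (nginx will also serve the timing file)
apt-get update && apt-get install -y nginx
echo "install_packages: ${SECONDS}s" >> "$TIMING_FILE"

SECONDS=0
# Stage 2: fetch and initialize the game server binary (hypothetical bucket)
gsutil cp gs://critter-junction-artifacts/game-server /opt/game-server
chmod +x /opt/game-server
echo "init_binaries: ${SECONDS}s" >> "$TIMING_FILE"

With something like this in place, requesting /startup-timing.txt from the instance returns the per-stage breakdown.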

This let Critter Junction dig even deeper to poll the endpoint and get data back without too much heavy lifting or modification to their service.

And there it was!

An example graph generated by timing the startup phases of the instance. Notice that the graph on the right is in sub-second scale.

The performance bottleneck seemed to be public images: preconfigured combinations of the OS and bootloader. These images are great when you want to get up and running quickly, but as you start building production-level systems, the largest portion of boot time is no longer booting the OS; it’s the user-defined startup sequence that fetches packages and binaries and initializes them.

Use custom images

Critter Junction was able to address this by creating custom images for their instances. You can create custom images from source disks, images, snapshots, or images stored in Cloud Storage, then use them to create VM instances.

Custom images list

When the target instance is booted, the image contents are copied directly to the boot disk. This is great when you’ve created and modified a root persistent disk to a certain state and want to save that state to reuse with new instances, or when your setup includes installing (and compiling) large libraries or pieces of software.
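For example, a rough sketch of the gcloud flow, with placeholder names for the source disk, image, instance, and zone:

# Create a custom image from a boot disk that already has everything installed.
gcloud compute images create critter-game-image \
    --source-disk=critter-template-disk \
    --source-disk-zone=us-central1-a

# New instances booted from the custom image skip the slow package/binary setup.
gcloud compute instances create critter-game-vm \
    --zone=us-central1-a \
    --image=critter-game-image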

Armed and ready

When you’re trying to scale to millions of requests per second, serviced by thousands of instances, a small change in boot time can make a big difference in costs, response time, and, most importantly, the perception of performance by your users. Stay tuned for what’s next for Critter Junction.

And remember, always be architecting.

Next steps and references:


Stephanie Wong
Google Cloud - Community

Google Cloud Developer Advocate and producer of awesome online content. Creator of the series, GCP Networking End-to-End; host of Google’s Next onAir. @swongful