Scale and Resilience Aren’t Just Buzzwords

Season of Scale

Stephanie Wong
Google Cloud - Community


Introduction

“Season of Scale” is a blog and video series that helps enterprises and developers build scale and resilience into their design patterns. In this series we’ll walk you through patterns and practices for creating apps that are resilient and scalable, two essential goals of many modern architecture exercises.

In Season 1, we’re covering Infrastructure Automation and High Availability:

  1. Patterns for scalable and resilient applications (this article)
  2. Infrastructure as code
  3. Immutable infrastructure
  4. Where to scale your workloads
  5. Globally autoscaling web services
  6. High Availability (Autohealing & Auto updates)

In this article I’ll walk you through the basics behind scale and resilience.


With everything going on in the world, there’s been undeniable disruption for businesses. Social isolation has settled in around the globe, and traditional brick-and-mortar companies have been forced to shift operations online, while already-online companies are seeing a massive spike in daily traffic.

This might have put a lot of pressure on you to adapt and evolve your online architectures, and to face the realization that scalable and resilient architectures aren’t just buzzwords, but absolutely imperative in today’s climate. The pressure only grows when you face both resource and business constraints.

This rang true for Critter Junction, a multiplayer gaming company that’s gained massive popularity in the last few months. Players interact with one another online in a virtual life-simulation world, each playing as a critter.

Building and operating a gaming stack at this scale requires a lot of careful planning. So my team and I are stepping in to help Critter Junction revisit their design practices for ultimate global success.

Business and operational constraints

So far Critter Junction has been great at running individual machines on premises, but hasn’t been able to automatically scale out to many machines to handle peaks and dips in traffic.

At times they end up running overutilized machines, and at other times underutilized ones. They also haven’t proactively built resilience into their architecture. A failure in networking, image updates, or peak load could disrupt their players’ experience.

On top of that, Critter Junction has its own business and operational pressures.

  1. Their CTO is all about becoming more agile to adapt to fluctuating user demands.
  2. Their developers are focusing on reducing the time it takes to investigate failures.
  3. And their operators care about finding ways to automatically recover from failures.

Scalability and resilience

This is a lot to think about! To start, let’s clarify what scalability and resilience are.

Scalability is the measure of a system’s ability to handle varying amounts of work by adding or removing resources from the system.

For example, a scalable web app works well not only with 1 user but also with 100 million users, and it gracefully handles peaks and dips in traffic.

The good news is that the cloud gives you the flexibility to adjust the resources consumed by an app. The bad news is that without proper design, you could end up using more resources than you need, as Critter Junction has done on premises. What you want instead is to reduce costs by removing underutilized resources without compromising performance or user experience.
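To make this concrete, here’s a minimal sketch of how an autoscaler might decide how many machines to run. It assumes a simple proportional rule: scale the machine count by the ratio of observed utilization to target utilization, then clamp the result to a floor and a ceiling. That rule is a common approach, but it isn’t necessarily what any particular Google Cloud product does, and the function name, parameters, and default limits below are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_utilization_pct,
                     target_utilization_pct, min_replicas=1, max_replicas=10):
    """Return how many replicas to run under a proportional scaling rule.

    All names and defaults are illustrative assumptions, not a real API.
    Utilization values are percentages (for example, 90 for 90% CPU).
    """
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    if target_utilization_pct <= 0:
        raise ValueError("target_utilization_pct must be positive")

    # Scale the fleet by how far observed load is from the target,
    # rounding up so we never provision less than the load requires.
    raw = math.ceil(
        current_replicas * current_utilization_pct / target_utilization_pct
    )

    # Clamp to the configured bounds: the floor protects availability,
    # the ceiling protects the budget.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Overutilized: 4 machines at 90% with a 60% target -> scale out.
    print(desired_replicas(4, 90, 60))
    # Underutilized: 4 machines at 30% with a 60% target -> scale in.
    print(desired_replicas(4, 30, 60))
```

The same rule handles both failure modes Critter Junction sees on premises: it adds machines when they’re overutilized and removes them when they’re underutilized.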

That being said, even scalable apps can face disruptions. Without resilience built in, system failures can throw a wrench in your operations.

Resilience means designing to withstand failures. A resilient app is one that continues to function despite failures of system components.

And this requires planning at all levels of your architecture. It influences how you lay out your infrastructure and network, and how you design your data storage and app. It even extends to people and culture.
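At the app level, one small piece of this is deciding what to do when a dependency fails. Here’s a minimal sketch, assuming a retry-then-fallback approach: try the component a few times, and if it keeps failing, serve a degraded result instead of an error. `ComponentError`, `call_with_fallback`, and `make_flaky` are hypothetical names invented for this illustration, and a real system would likely add backoff between attempts.

```python
class ComponentError(Exception):
    """Raised by a (hypothetical) dependency that cannot serve a request."""


def call_with_fallback(primary, fallback, attempts=3):
    """Call primary up to `attempts` times, then fall back.

    Only ComponentError is treated as a recoverable failure; any other
    exception is a bug and is allowed to propagate.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for _ in range(attempts):
        try:
            return primary()
        except ComponentError:
            continue  # Component failed; try again.
    # Every attempt failed: degrade gracefully instead of erroring out.
    return fallback()


def make_flaky(failures_before_success, value):
    """Build a stand-in dependency that fails a set number of times."""
    remaining = {"count": failures_before_success}

    def call():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise ComponentError("service unavailable")
        return value

    return call


if __name__ == "__main__":
    # Recovers after two failures and returns the live result.
    print(call_with_fallback(make_flaky(2, "live leaderboard"),
                             lambda: "cached leaderboard"))
    # Never recovers, so the player sees cached data instead of an error.
    print(call_with_fallback(make_flaky(10, "live leaderboard"),
                             lambda: "cached leaderboard"))
```

The player still gets a response in both cases, which is the point: the app keeps functioning even though one of its components didn’t.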

Let’s face it: building and operating resilient apps isn’t the easiest when you want to get up and running quickly. This is especially true for distributed gaming stacks, which involve multiple layers of infrastructure, networks, and services.

Looking forward

Over this series, we’re going to walk Critter Junction through Google Cloud design best practices to help them build both scalable and resilient apps.

All in all, these patterns and best practices fall into three themes:

  1. Automation — because automating your infrastructure provisioning, testing, and app deployments increases consistency and speed, and minimizes human error.
  2. Loose coupling — because treating your system as a collection of loosely coupled, independent components gives you flexibility and resilience.
  3. Data-driven design — because collecting metrics to understand the behavior of your app is critical. Decisions about when to scale your app, or whether a particular service is unhealthy, need to be based on data.
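As a small illustration of the third theme, here’s a sketch of deciding whether a service is unhealthy from collected metrics rather than a hunch. It assumes a sliding window of recent request outcomes and an error-rate threshold. The `HealthWindow` class and its defaults are hypothetical and chosen only for readability.

```python
from collections import deque


class HealthWindow:
    """Track recent request outcomes and flag a service as unhealthy.

    Hypothetical helper for illustration; thresholds are assumptions.
    """

    def __init__(self, size=100, max_error_rate=0.05, min_samples=20):
        # A bounded deque drops the oldest outcome once the window is full.
        self.results = deque(maxlen=size)
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples

    def record(self, ok):
        """Record one request outcome: True for success, False for failure."""
        self.results.append(bool(ok))

    def error_rate(self):
        """Return the fraction of failed requests in the window."""
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def is_unhealthy(self):
        """Unhealthy only when there is enough data and too many errors."""
        enough_data = len(self.results) >= self.min_samples
        return enough_data and self.error_rate() > self.max_error_rate


if __name__ == "__main__":
    window = HealthWindow(size=10, max_error_rate=0.2, min_samples=5)
    for ok in [True, True, False, False, False, True, True]:
        window.record(ok)
    print(window.error_rate(), window.is_unhealthy())
```

Requiring a minimum number of samples keeps a single failed request from marking a freshly started service as unhealthy.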

These themes are going to be crucial to laying a stronger foundation to scale and withstand failures. Mistakes and outages happen, but Critter Junction is on a mission to improve the design of their app architecture and dev processes. Stay tuned to find out how.

And remember, always be architecting.

Stephanie Wong
Google Cloud - Community

Google Cloud Developer Advocate and producer of awesome online content. Creator of the series, GCP Networking End-to-End; host of Google’s Next onAir. @swongful