Achieving Resiliency With Queues: Building A System That Never Skips A Beat In A Billion

Zach McCormick

Braze processes billions and billions of events per day on behalf of its customers, resulting in billions of hyper-focused, personalized messages—but failing to send one of those messages has consequences. To make sure those key messages are always correct and always on time, Braze takes a strategic approach to how we leverage job queues.

What’s a Job Queue?

A typical job queue is an architectural pattern where processes submit computation jobs to a queue and other processes actually execute the jobs. This is usually a good thing—when used properly, it gives you degrees of concurrency, scalability, and redundancy that you can’t get with a traditional request–response paradigm. Many workers can be executing different jobs simultaneously in multiple processes, multiple machines, or even multiple data centers for peak concurrency. You can assign certain worker nodes to work on certain queues and send particular jobs to specific queues, allowing you to scale resources as needed. If a worker process crashes or a data center goes offline, other workers can execute the remaining jobs.
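The pattern described above can be sketched in a few lines of Python. This is a minimal, single-machine illustration, not Braze's implementation: threads stand in for the separate worker processes or machines, and the names `run_jobs` and `num_workers` are hypothetical, chosen only for this example.

```python
import queue
import threading


def run_jobs(jobs, num_workers=3):
    """Submit zero-argument callables to a shared queue and let workers run them.

    Returns a list of ("ok", value) or ("failed", exception) tuples,
    in completion order (which is not guaranteed to match submission order).
    """
    job_queue = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            job = job_queue.get()
            if job is None:  # sentinel: no more work for this worker
                job_queue.task_done()
                return
            try:
                outcome = ("ok", job())
            except Exception as exc:
                # A failing job must not take the worker down with it.
                outcome = ("failed", exc)
            with results_lock:
                results.append(outcome)
            job_queue.task_done()

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()

    # Producer side: submit the jobs, then one sentinel per worker.
    for job in jobs:
        job_queue.put(job)
    for _ in workers:
        job_queue.put(None)

    job_queue.join()  # block until every queued item has been processed
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    squares = [lambda n=n: n * n for n in range(10)]
    finished = run_jobs(squares, num_workers=4)
    print(sorted(value for status, value in finished if status == "ok"))
```

The submitting code never needs to know which worker picks up a job, which is what lets a real system add workers, dedicate them to particular queues, or lose one without losing the remaining work. A production system would typically replace the in-memory queue with a durable broker so that queued jobs survive a crash.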

While you can certainly apply these principles and run a job-queueing system easily at a small scale, the seams start to show (and even burst) when you’re processing billions and billions of jobs. Let’s take a look at a few problems Braze has faced as we’ve grown from processing thousands, to millions, and now billions of jobs per day.

Check out the rest of this blog post at Building Braze!

Zach McCormick

Zach is a software engineer and manager passionate about building and maintaining global-scale distributed systems. He has experience with distributed systems, web applications, and mobile applications across a variety of industries, including marketing automation, fintech, IoT, healthcare, and mobile cybersecurity, as well as across a variety of languages and technologies, including Python, Java, JavaScript, Ruby, PostgreSQL, MySQL, MongoDB, Redis, and others. He currently works for Braze.

Interested in chatting?

I'm always happy to chat about software engineering challenges of all sorts: architectural, organizational, or otherwise. Just drop me an email at zachary.tyler.mccormick@gmail.com.