
Faults in our stars

We built a game engine using Elixir/Erlang, which can add real-time engagements like Polls, Quizzes, Fill in the Blanks, etc. on top of a live video stream or a pre-recorded video being streamed, much like (if you remember) HQ Trivia - except on steroids. It has comments, reactions (like Instagram), and a real-time leaderboard as well. This can be streamed via our apps, website, or mobile website. Sounds great, right?

We used to run these games/digital events on a weekly basis a year and a half ago, but then we pivoted, the underlying architecture and decisions changed multiple times, and now we run them for big events intermittently, weeks (sometimes months) apart.

Except the other day, when a few thousand-odd people were watching a live 360-degree music video by well-known artists on the player page, the comments suddenly stopped coming in mid-way.

Uh-Oh!

War Room Story Time

So what happened? Our clients talk to the engine and the engine talks back to them with data over MQTT, a highly optimized pub-sub protocol, through a broker. Each message type has its own topic. We could see the player count working and increasing, so clearly the other supervision trees were fine, but for some reason the comments system was down.
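To make the topic-per-message-type layout concrete, here is a minimal sketch (module and topic names are illustrative, not our actual ones):

    defmodule Engine.Topics do
      # Each message type gets its own MQTT topic, scoped to the event.
      def player_count(event_id), do: "event/#{event_id}/player_count"
      def comments(event_id), do: "event/#{event_id}/comments"
      def reactions(event_id), do: "event/#{event_id}/reactions"
      def leaderboard(event_id), do: "event/#{event_id}/leaderboard"
    end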

Was the MQTT broker bonking for clients because of load?

We went to the MQTT dashboard at port 18083 and checked the load; it was under-utilized. The subscription topics showed up in the listeners tab as well. I manually subscribed to the player-count topic from the console and received timely count messages on it. So MQTT was sending messages fine - did an MQTT disconnect happen for all the Comments workers?

We checked Observer and found no Comments workers.

What happened?

We visited Sentry and found multiple occurrences of the error below in the Comments worker that aggregates, filters, and publishes comments:

protocol-sentry-error.png

Huh? We checked the DynamicSupervisor spec; it uses the default configuration.

From HexDocs: :max_restarts - the maximum number of restarts allowed in a time frame. Defaults to 3.

So that checks out - the supervision tree will (and should) go down after trying to bring the workers back up 3 times.
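For reference, a DynamicSupervisor started with the defaults looks roughly like this sketch (module name is illustrative); :max_restarts defaults to 3 and :max_seconds to 5, so three worker crashes within five seconds take the supervisor itself down:

    defmodule Engine.Comments.WorkerSupervisor do
      use DynamicSupervisor

      def start_link(init_arg) do
        DynamicSupervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
      end

      @impl true
      def init(_init_arg) do
        # Spelled out, these are the defaults the spec falls back to.
        DynamicSupervisor.init(strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
      end
    end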

Bug in the system? GIT BLAME!!!!

We couldn’t stop the event mid-way to investigate, because it was live-streaming and an event state change would have kicked users out. The rest of the features were working fine.

The usual suggestions were made by members of the war room, namely:

  • “Why is the system so fragile?”
  • “The point of building with Elixir was that it brings back workers so that we remain happy”
  • “Why is the restart strategy transient? permanent would’ve solved this issue”
  • “Why isn’t a test covering this function?”
  • “The basics of the system aren’t right”

We’ll address all of these later in the post - stick around?

We followed the line number from Sentry and saw this code -

protocol-code-bug.png

Do you see the bug in this (seemingly-harmless) function? Did you find it?

It’s the statement inside Logger.info.

umm, but what’s wrong with that?

Well, #{} expects a String (anything that implements the String.Chars protocol) inside the interpolation block. All it needed was #{inspect(request)}, because the Sentry error showed us that it got {#Reference<0.1573182675.3632529412.140924>, {:ok, "a77eb2e7744d4c29b12ec3a5fc3c731f"}} - a tuple instead, which is why it raised a Protocol.UndefinedError.
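In spirit, the offending clause and its fix look like this sketch (illustrative, not the exact code from the worker):

    # Buggy version: string interpolation calls String.Chars.to_string/1,
    # which is not implemented for tuples, so a stray tuple message
    # raises Protocol.UndefinedError and crashes the worker.
    def handle_info(request, state) do
      Logger.info("Unhandled message in Comments worker: #{request}")
      {:noreply, state}
    end

    # Fixed version: inspect/1 turns any term into a printable string.
    def handle_info(request, state) do
      Logger.info("Unhandled message in Comments worker: #{inspect(request)}")
      {:noreply, state}
    end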

Wait, but this is 14-month-old code and the test cases never hit this path; in fact, we had tested comments multiple times with real users and it worked fine. What happened?

As the comment inside the function says, it’s a catch-all for handling unhandled handle_info messages.

The intent was correct (and harmless - the workers would’ve worked fine without this function); the implementation was not. Either way, a random tuple whose origin we didn’t know should never have landed there.

So what kind of messages does handle_info get?

Besides the synchronous and asynchronous communication provided by call/3 and cast/2, "regular" messages sent by functions such as Kernel.send/2, Process.send_after/4 and similar, can be handled inside the handle_info/2 callback.

Our message publishing was indeed a recursive Process.send_after, but why would it randomly send an unhandled {#Reference<0.1573182675.3632529412.140924>, {:ok, "a77eb2e7744d4c29b12ec3a5fc3c731f"}} out of nowhere? That didn’t add up. The rest of the code relied on GenServer.cast, so there was nowhere this random tuple could’ve come from.
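The publishing loop itself was the usual self-scheduling GenServer pattern, roughly like this sketch (interval and helper names are illustrative):

    # Inside the Comments worker: schedule a tick, publish the aggregated
    # comments when it fires, then schedule the next one.
    def init(state) do
      schedule_publish()
      {:ok, state}
    end

    def handle_info(:publish_comments, state) do
      publish_to_broker(state) # hypothetical helper that pushes to MQTT
      schedule_publish()
      {:noreply, state}
    end

    defp schedule_publish do
      Process.send_after(self(), :publish_comments, 1_000)
    end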

We checked - it’s not a UUID or an API key in our system.

Strange?

Digging a little further, I found this - a closed GitHub issue in the Gun library:

gun-lib-ref.png

Ah-ha! Someone with the same random tuple pattern in their handle_info callbacks. And we were also on OTP 20.

The Gun library is from NineNines, the people who built the Cowboy server. Pretty solid source.

Apparently, it starts happening under high load -

gun-lib-issue-comment.png

They filed a bug in Erlang for this stray message on TCP close - https://bugs.erlang.org/browse/ERL-1049

erlang-issue.png

Our MQTT connections from engine to broker were indeed over TCP.

And it’s fixed in the OTP 22.1.1 patch.

Yaay?

No, not yet. We’ll now address the suggestions. How could we have avoided this?

Tests? There were plenty of tests for this module/worker covering its behavior and logic. How would you write a test for this function?

protocol-code-bug.png

What would you be testing?

That a process message is delivered to the handle_info callback? Well, that behavior has iron-clad tests in Erlang itself.

That the Logger prints a String? Imagine the development cycle if developers had to assert Logger output at every log point.

Was the developer at fault (pun intended)? I don’t think so. Yes, he made a human mistake in String interpolation, but the intent was right - to write a catch-all function for unhandled GenServer messages. Should he have written a test for logging Strings? Nope; this should’ve been caught in the PR review process, not by tests.

But it wasn’t. Does that mean the System/Elixir/Process is bad? Certainly not.

Could making the restart strategy permanent for these workers have solved this? Nope. The bad code was failing regularly under load because of an unexpected message; permanent would’ve just restarted the process repeatedly until the same restart threshold was hit, so the outcome would’ve been the same.
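For context, the restart strategy is a single field in the child spec; flipping it would not have changed the crash loop. A sketch, with illustrative names:

    # :transient restarts the worker only after an abnormal exit;
    # :permanent restarts it after any exit. Either way, a worker that
    # keeps crashing on the same stray message still trips :max_restarts.
    def child_spec(arg) do
      %{
        id: __MODULE__,
        start: {__MODULE__, :start_link, [arg]},
        restart: :transient
      }
    end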

Some of the suggestions were, in my opinion, part of the classic X-Y problem.

Let me introduce you to Exception-Driven Development - a software development process coined by the guy who made Stack Overflow.

I resonate with the author on feeling embarrassed when a system crashes:

When a user informs me about a bona fide error they've experienced with my software, I am deeply embarrassed. And more than a little ashamed. I have failed to see and address the issue before they got around to telling me. I have neglected to crash responsibly.

Was there a bug in the system? Yes. Did the comments system crash responsibly? Yes. Did we have a central repository that pointed us to the exact location of the crash? Yes. Did Elixir’s supervision tree construct restart the Comments workers until the threshold? Yes. Did it crash other supervision trees? No.

Error logs are the de facto to-do list for any good software engineer. Which brings us to the point of reducing bugs in the system - hard-to-reproduce ones like the one we faced, or bugs in general.

Broad-based trend analysis of error reporting data shows that 80% of customer issues can be solved by fixing 20% of the top-reported bugs. Even addressing 1% of the top bugs would address 50% of the customer issues. The same analysis results are generally true on a company-by-company basis too.

TL;DR: If you fix a bug that no actual user will ever encounter, what have you actually fixed?

Given a limited pool of developer time, I’d prefer to allocate it toward fixing problems real users are having with my software based on cold, hard data. That’s what I call Exception-Driven Development.

People have been building APIs and 24/7 websites for decades now, and they still crash because of internal server errors or complicated queries hitting databases with no indexes, no matter how many smart, experienced engineers you put in the room or how much TDD you do.

Just to put things in perspective: the game engine is a complex project, as it’s not a typical CRUD application serving data via HTTP APIs. It’s entirely stateful, providing real-time analytics plus all engagement data for thousands of users, and it also pushes data like the live player count, comments, and reactions to users every second. So it’s a big computational workload, used intermittently whenever there’s a need for such an event.

These closing statements from Coding Horror’s blog post are words of wisdom for anyone trying to build a product in tech -

Ship your software, get as many users in front of it as possible, and intently study the error logs they generate. Use those exception logs to hone in on and focus on the problem areas of your code. Rearchitect and refactor your code so the top 3 errors can't happen any more. Iterate rapidly, deploy, and repeat the process. This data-driven feedback loop is so powerful you'll have (at least from the users' perspective) a rock stable app in a handful of iterations.
Am I advocating shipping buggy code? Incomplete code? Bad code? Of course not. I'm saying that the sooner you can get your code out of your editor and in front of real users, the more data you'll have to improve your software. Exception logs are a big part of that; so is usage data.
Your software will ship with bugs anyway. Everyone's software does. Real software crashes. Real software loses data. Real software is hard to learn, and hard to use. The question isn't how many bugs you will ship with, but how fast can you fix those bugs? If your team has been practicing exception-driven development all along, the answer is -- why, we can improve our software in no time at all! Just watch us make it better! And that is sweet, sweet music to every user's ears.

Sometimes, the philosophy of Let It Crash seems confusing to people who haven’t invested time in it. Yes, it’s supposed to be “fault” tolerant, which means failures can take a feature down while minimizing the impact on everything else in place. The problem wasn’t Elixir/Erlang, the tests, or the developer who wrote the bad code, but the process. We needed to fail small, fast, and frequently, so that we don’t fail big regularly.

P.S. - Was there a way to recover? Yes, there was, but we chose to skip it because we didn’t want to break things unintentionally in production during a live event. We use Distillery for our releases. We could’ve exec’d into the Docker container running the release and opened a remote_console on the binary. That would’ve allowed a manual function call to the DynamicSupervisor responsible for bringing the Comments workers back up (on the off-chance that the random tuple didn’t come along again; if it did, they would’ve just crashed again).
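Had we attempted it, the recovery would have looked roughly like this (release, supervisor, and worker names are illustrative):

    # Inside the container running the Distillery release:
    #   bin/engine remote_console
    # Then, from the attached IEx shell, ask the DynamicSupervisor to
    # bring a Comments worker back up:
    DynamicSupervisor.start_child(
      Engine.Comments.WorkerSupervisor,
      {Engine.Comments.Worker, event_id: "live-event-id"}
    )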