Backdrop

Imagine you have hundreds of production servers being blasted with requests from millions of users on a per-second basis. A slight delay in serving a request may lead the customer to uninstall the application. A sluggish or unresponsive service may cause things to go haywire. And you are responsible for making a small code change in such a system. What do you do? How do you go about it?

This article is my retrospective on a potential screw-up on our production system due to bad code I wrote.

I work as a Backend Developer at a startup where we deal with features and bug fixes affecting millions of people daily. A small fracture in the code and it doesn’t take time before the shit hits the fan.

We are a messaging-based platform, and stickers make up a large part of our messaging volume.

We have a dedicated team for curating content for stickers and they recently came up with the idea of animated stickers. Customers got bored of the static stickers and putting some animation in them could prove useful for us.

The new requirement

From a technical standpoint on the server end, a small requirement came up. We needed to put some hacks in place to maintain backward compatibility with older clients.

Whenever a new client (a build that supports animated stickers) sends a packet to an older client (a build that does not), we should send a customized message to the older client telling them to upgrade to the newer build.

We cannot directly send an animated sticker as the older client won’t understand it and the application might crash.

Well, this looks pretty simple, right? Just a couple of version checks on the server-side and you’re good to go.

However, the client one-upped this by adding that the intercepted message should be customized and should be different for every animated sticker.

This was also not much of an effort. We just had to save the intercept messages in our existing sticker database (which is Mongo right now) and put a cache on top of it to prevent redundant calls to Mongo and we were good to go.

The problem with this requirement

The problem was with identifying an animated sticker whenever one entered our server flow. The simplest way would have been for the client to tell us whether it is a static sticker or an animated one. But this was not always possible, we were told, because of a bug in an older version of the application.

Every sticker packet has a stickerId associated with it. It is a unique ID generated by the sticker team while storing sticker content in the database. An animated sticker packet (JSON packet) would look something like this:

{  
   "t": "st",
   "md": { 
      "is_anim": 1 
   },
   "stkId": "loveydovey123"  
}

The is_anim field is not always present, so we needed another way to tell an animated sticker from a static one.

Without much deliberation, we came up with a solution. We store the sticker data in MongoDB and we also keep a field in the document for each sticker indicating if it is an enhanced sticker.

I wrote the code so that for every sticker packet with is_anim missing, the code would hit an in-memory cache. The cache would hold the intercept message (if any) described earlier, along with a flag telling us whether the sticker is animated, and each entry would live for 2 hours.

In case of a cache miss, the code hits the persistent storage (Mongo in this case) and populates the in-memory cache as well. Let us look at the pseudo-code:

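public boolean isAnimatedSticker(String stickerId)
{
   // Boxed Boolean, so that a cache miss can be represented as null.
   Boolean isAnimated = CacheManager.get(stickerId);

   if (isAnimated == null)
   {
      // Cache miss: read from Mongo, populate the cache, return the result.
      return populateCacheFromDbAndReturnData(stickerId);
   }

   // Cache hit: true for an animated sticker, false for a static one.
   return isAnimated;
}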

A simple piece of code, right? It checks the cache to see if the sticker is animated. If nothing is found in the cache, it hits the database, populates the cache, and returns the desired result.
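For readers who want something they can run, here is a minimal sketch of that cache-then-database lookup with a 2-hour expiry. The class name StickerAnimationCache and the dbLookup predicate are made up for illustration; they stand in for our real CacheManager and the Mongo query, which are not shown in this article.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Hypothetical stand-in for the real cache layer described above.
final class StickerAnimationCache {
    // One cached answer plus the moment it stops being valid.
    private static final class Entry {
        final boolean animated;
        final long expiresAtNanos;

        Entry(boolean animated, long expiresAtNanos) {
            this.animated = animated;
            this.expiresAtNanos = expiresAtNanos;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final Predicate<String> dbLookup; // assumed: wraps the database read
    private final long ttlNanos;

    StickerAnimationCache(Predicate<String> dbLookup, Duration ttl) {
        this.dbLookup = dbLookup;
        this.ttlNanos = ttl.toNanos();
    }

    boolean isAnimatedSticker(String stickerId) {
        long now = System.nanoTime();
        Entry cached = entries.get(stickerId);
        if (cached != null && now - cached.expiresAtNanos < 0) {
            return cached.animated; // cache hit that has not expired yet
        }
        // Cache miss or expired entry: one database read, then remember it.
        boolean animated = dbLookup.test(stickerId);
        entries.put(stickerId, new Entry(animated, now + ttlNanos));
        return animated;
    }
}
```

Something like `new StickerAnimationCache(id -> queryDatabase(id), Duration.ofHours(2))` would wire it up, where queryDatabase is whatever performs the real read. Note that every instance starts empty, so the first request for each sticker always reaches the database.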

All this worked right off the bat and was tested thoroughly on our staging and dev environments.

There was a lot of pressure from the product manager for animated stickers as their release was due in a couple of days (release of a new build in the market). So we were in a hurry to get the changeset deployed on production.

My mentor and some other seniors in the team were very careful in taking any changes to production at that time owing to a major service outage already affecting us.

The crucial question!

While reviewing the code, my mentor raised a crucial question I had failed to address. He asked me why I was querying Mongo and how many times I would do it. I told him when and why I would be doing it. Then he asked me:

Will I hit Mongo once for every unique sticker packet on every production machine to check if it is animated?

I mentioned that we would do this only for the ones for which the is_anim field is missing. Basically, for every static sticker and some animated stickers. He said this would break production.

WTF

He didn’t say it that calmly though. He was pissed.

Why was this a problem?

We have about 50 production servers, and every server has its own in-memory cache. Out of the 24k messages per second at peak time, about 8k are sticker messages.

Out of these 8,000 messages, about 3,000 carry distinct stickers. We have about 12,000 unique stickers on the app today, and stickers are one of our most used and loved features.

That means 3k Mongo calls per production server, which multiplied across 50 servers would mean around 150k calls on Mongo in a very short period. This would bring down the stickers Mongo server, as it is not very well scaled right now. It is tuned to serve around 1–2k calls at most.

This scenario would have taken place at the time of deploying my code on production. Although we have a 4-minute gap between server restarts, the design still sucked and my manager wasn’t going to take any chances with it. This was alarming also because the changeset was written in a manner where it would affect the main messaging flow.

It should have been written asynchronously in the first place, so that any outage in the animated stickers code would not affect other services in the infrastructure.

This was supposed to be a simple change done under the radar, but I ended up being ridiculed for such a poor effort.

What I learned from this small (major?) incident is that before diving into the assigned task and playing coding ninja right away, you should adhere to certain guidelines.

Think before you Code

Never go off writing code based on the first solution that comes to your mind. Always write down basic pseudo-code first and then consider all the ways things could go wrong and break the system. This should be done irrespective of how large or small the changeset is. Even for a hotfix!


Some Key Parameters to Ponder

Code is more than a few lines of language syntax. Before finalizing any solution for the problem at hand, consider the following parameters:

  • Code should be clean and easy to understand.
  • Code should be unit-testable.
  • If your code is interacting with external services or databases, make sure not to overwhelm them. You should have some sort of caching or rate-limiting in place.
  • Before writing any code from scratch, always check whether the current system already implements something similar.
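To make the point about not overwhelming a database a bit more concrete, here is a rough sketch of one way to cap the load. It limits how many calls may be in flight at once, which is a simpler cousin of true rate limiting rather than rate limiting itself. The class name BoundedBackend and its tryCall method are invented for this example and are not part of our codebase.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical guard that caps concurrent calls to a fragile backend.
final class BoundedBackend<T> {
    private final Semaphore permits;

    BoundedBackend(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs the call only if a permit is free. Otherwise returns empty right
    // away, so the caller can fall back instead of adding load to the backend.
    Optional<T> tryCall(Supplier<T> call) {
        if (!permits.tryAcquire()) {
            return Optional.empty();
        }
        try {
            return Optional.ofNullable(call.get());
        } finally {
            permits.release(); // always hand the permit back
        }
    }
}
```

The trade-off is that a rejected call needs a sensible fallback, and choosing that fallback is a product decision as much as a technical one.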

Load Testing is a Must

Most of the time, functional testing is not enough. In this case, if some sort of load testing had been done, the problem would have surfaced well before deployment instead of right before it. So there should always be automated test cases (what most people call a regression suite) ready to run on your changeset, along with some fabricated load, to verify that nothing breaks.
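A fabricated load check does not have to be elaborate. The sketch below is an assumption-heavy toy, not our real test suite: it models each server as an empty local cache, replays sticker requests against it, and counts how many of them would fall through to the database. The class name ColdStartLoadSimulation and its parameters are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of a fleet of freshly restarted servers with empty caches.
final class ColdStartLoadSimulation {
    // Returns the number of database reads the whole fleet would issue.
    static long simulate(int servers, int uniqueStickers, int requestsPerServer) {
        long dbCalls = 0;
        for (int s = 0; s < servers; s++) {
            Set<Integer> localCache = new HashSet<>(); // one cache per server
            for (int r = 0; r < requestsPerServer; r++) {
                int stickerId = r % uniqueStickers; // cycle through the stickers
                if (localCache.add(stickerId)) {
                    dbCalls++; // first sighting on this server: a cache miss
                }
            }
        }
        return dbCalls;
    }
}
```

Plugging in the numbers from this incident, `simulate(50, 3000, 8000)` returns 150000, which can then be compared against whatever the database is tuned to serve.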

Say No to Hacks

The server guys get a lot of requests from the client to put in hacks just because of some buggy builds that went out. Sometimes it is easy to put in a hack, but over time things get messy. The code quickly changes from an elegantly structured piece of software to garbled shit no one can understand.

Start saying NO to the client guys. Only accommodate changes if they can be done in time, with proper functional and load testing and thorough code review.

Go the Async Way

I’m sure most of the audience here is familiar with the concept of microservices: small components of code that can be built, tested, and deployed independently of the main code. That is how stuff should be written, in my opinion.

You can’t go on modifying the main code flow because a small mistake on your part could cause a major outage. So try and write code that is asynchronous and thus works independently of the main flow.
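As a rough illustration of keeping a risky lookup off the main flow, here is a sketch using Java's CompletableFuture (it needs Java 9 or newer for completeOnTimeout). The class name AsyncAnimationCheck, the lookup supplier, and the fallback value are all assumptions made for this example. In particular, which fallback is safe depends on the product, so the caller has to supply it.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical wrapper that isolates a slow or failing lookup.
final class AsyncAnimationCheck {
    // Runs the lookup on another thread. If it throws, or takes longer than
    // timeoutMillis, the returned future completes with the fallback instead.
    static CompletableFuture<Boolean> isAnimatedAsync(
            Supplier<Boolean> lookup, long timeoutMillis, boolean fallback) {
        return CompletableFuture.supplyAsync(lookup)
                .completeOnTimeout(fallback, timeoutMillis, TimeUnit.MILLISECONDS)
                .exceptionally(error -> fallback);
    }
}
```

With this shape, a struggling database costs the messaging flow at most the timeout per lookup rather than an open-ended wait.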

Of course, don’t go on making every feature a microservice in itself.

The final solution

I have not yet mentioned the final solution, which we arrived at after discussions, because that in itself was another hack. There is another component of every sticker called the categoryId.

We have different packs of stickers. Each pack has a unique categoryId, and every sticker within a pack has its own stickerId. We were told that only 3 packs of animated stickers would be released on production, so we decided to support just those three (in cases where the client does not provide any info in the JSON about the sticker being animated). So we hardcoded those 3 category IDs, something like this:

// a, b and c stand for the three hardcoded animated pack category IDs
if (packet.getMetadata().contains(anim_field) || List.of(a, b, c).contains(categoryId))
{
   isAnimated = true;
}

This again is not an elegant solution but there isn’t one for such problems I guess. There has to be some sort of common ground between the client and the server.
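For completeness, here is a self-contained version of that check. The three category IDs are placeholders, since I am not listing the real ones here, and the class name AnimatedStickerCheck plus the use of a plain Map for the packet's "md" object are assumptions made to keep the sketch runnable.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified form of the hardcoded check.
final class AnimatedStickerCheck {
    // Placeholder IDs standing in for the three real animated packs.
    private static final Set<String> ANIMATED_CATEGORY_IDS =
            Set.of("packA", "packB", "packC");

    // metadata is the packet's "md" object; categoryId identifies the pack.
    static boolean isAnimated(Map<String, Object> metadata, String categoryId) {
        boolean flaggedByClient = metadata != null && metadata.containsKey("is_anim");
        // Set.of(...) rejects null lookups, so guard against a missing ID.
        boolean inAnimatedPack =
                categoryId != null && ANIMATED_CATEGORY_IDS.contains(categoryId);
        return flaggedByClient || inAnimatedPack;
    }
}
```

No database and no cache are involved, which is exactly why this version was safe to deploy, and also why it will need a code change the day a fourth animated pack ships.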
