The topic of this post may sound confusing, but it is a simple trick that can save you from massive amounts of data preprocessing. It does not solve every preprocessing problem, but in certain situations it does wonders for scalability and resource savings. Let me explain the problem and then the solution.
The Preprocessing Problem
Imagine a situation where you need to show a daily status update to your customers when they log in to your site or app, or perhaps you need to display the result of a task that they performed earlier. Another situation is where you need to run cleanup on data that has an expiration time (TTL) associated with it. In my particular case, it was the result of a submission that the user had made earlier; they would return to the app at a later time to receive the result.
In all of these cases, a simple solution is to use a preprocessing job that runs in the background and performs the update function at the appropriate times. My first attempt was to run a cron job that processed all the submissions and stored the result for the user to collect when they log back in.
This worked well for a while, until the number of users and submissions started to grow. The background processing job that used to take a few minutes was now running for hours, and even then results were still not ready for users when they returned to the app. Figure 1 below shows this solution.
To solve this problem we started by running multiple threads of the same cron job, splitting the processing queue evenly across the jobs. Unfortunately this didn’t scale once the number of user submissions started to reach hundreds of thousands, and when jobs exited prematurely due to database exceptions or network issues, it became tricky to restart from the point of failure. In addition, millions of results were computed for churned users who never returned to the app, wasting all those resources.
Deliver on Check Solution
The solution, as it turns out, is pretty simple. What if we check for and deliver the result when the user logs in, and not a moment before? This solution splits the problem horizontally and processes individual users as they log into the system. Instead of running a handful of parallel preprocessing cron jobs, we effectively split the work into hundreds of thousands of small units by processing each user’s jobs at the time of their login. Split this way, hours or even days of batch preprocessing become milliseconds of work per login. It also has the added benefit of saving resources when the user never returns and we never deliver the data.
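To make the idea concrete, here is a minimal sketch of deliver on check in Python. It is not the code from my app: the SubmissionStore class, the on_login function and the doubling "work" inside process are all hypothetical stand-ins, and the store is an in-memory dictionary where a real system would use a database.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Submission:
    user_id: str
    payload: int
    result: Optional[int] = None  # None means "not processed yet"


def process(submission: Submission) -> int:
    # Hypothetical stand-in for the expensive per-submission work.
    return submission.payload * 2


class SubmissionStore:
    """In-memory stand-in for a database table of submissions."""

    def __init__(self) -> None:
        self._by_user: Dict[str, List[Submission]] = {}

    def submit(self, user_id: str, payload: int) -> None:
        # Submitting only records the work; nothing is processed here.
        self._by_user.setdefault(user_id, []).append(Submission(user_id, payload))

    def pending_for(self, user_id: str) -> List[Submission]:
        return [s for s in self._by_user.get(user_id, []) if s.result is None]


def on_login(store: SubmissionStore, user_id: str) -> List[int]:
    """Process and deliver only this user's pending submissions."""
    delivered = []
    for submission in store.pending_for(user_id):
        submission.result = process(submission)
        delivered.append(submission.result)
    return delivered
```

Note that nothing runs for a user until on_login is called for them, so a user who never comes back never costs any processing time.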
This approach works especially well with timer-based systems, for example a loot timer, where the client shows a “fake” progress to the user based on a timestamp returned by the server. Collection really happens on the next login, or the client can force a login to get the reward when the timer runs out.
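A loot timer along these lines might look like the sketch below. The LootTimer class and both function names are made up for illustration, and timestamps are plain seconds passed in by the caller so that the example stays deterministic.

```python
from dataclasses import dataclass


@dataclass
class LootTimer:
    started_at: float  # server timestamp (seconds) when the timer began
    duration: float    # seconds until the loot is ready; assumed > 0
    collected: bool = False

    @property
    def ready_at(self) -> float:
        return self.started_at + self.duration


def client_progress(timer: LootTimer, now: float) -> float:
    """Client side: the "fake" progress, a fraction between 0.0 and 1.0.

    No server work happens while this is displayed.
    """
    elapsed = now - timer.started_at
    return max(0.0, min(1.0, elapsed / timer.duration))


def collect_on_login(timer: LootTimer, now: float) -> bool:
    """Server side: grant the reward only when the user actually checks in."""
    if timer.collected or now < timer.ready_at:
        return False
    timer.collected = True
    return True
```

The server never has to wake up when the timer expires. It only compares timestamps when the client asks.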
The downside of not using a preprocessing system is that you incur the cost at the time the user logs into the system. This means that the login response is going to be slower than it would be if the data had already been processed by a background thread. This is especially true for collections that require a lot of processing time.
Also remember that you are splitting the background process N ways, where N equals the number of users logging into your system at a given time. This will create a CPU spike if a lot of users return at the exact same time.
On the plus side, if processing fails for a given user it only affects the user in question, and the collection will automatically happen on the next login, so you are saving yourself from widespread outages and delays.
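The failure isolation can be sketched as follows. This is only an illustration under my own assumptions: deliver_on_login is a hypothetical name, pending work is a plain dictionary keyed by user, and any exception is treated as a reason to leave the item for the next login.

```python
from typing import Callable, Dict, List


def deliver_on_login(
    pending: Dict[str, List[int]],
    user_id: str,
    process: Callable[[int], int],
) -> List[int]:
    """Process one user's pending items; failures stay queued for next time."""
    results = []
    still_pending = []
    for item in pending.get(user_id, []):
        try:
            results.append(process(item))
        except Exception:
            # A failure here touches only this user's item. It is kept
            # so that the next login retries it automatically.
            still_pending.append(item)
    pending[user_id] = still_pending
    return results
```

There is no restart logic to write: a failed item simply remains pending, and other users are not affected.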
If you are looking to improve the processing in your application, I have a few more tips in my earlier Scaling to Millions post that you may find useful.
Feel free to reach out if you have any questions or comments on this topic.