What is thread starvation?
If you have never come accross the concept of thread starvation, then I would say count yourself lucky. Thread starvation can happen in any multithreaded application. When it happens it can either cause the system to crash immediately or cause the system to come to a crawl and then crash. That being said it usually ends with the system crashing and burning.
In this blog post we will focus on thread starvation in the JVM. Before we explain thread starvation, let's just first explain something, threads take up memory, no object comes for free, each object has an address in memory, and threads are no exception. Depending on the implementation of the JVM the memory allocated for the JVM thread may come off the heap or its own location outside of heap. Now the objects that the thread spawns may come off of the heap (most implementations are like that) but the stack size that the thread has comes from the Xss setting which defines the stack size. Meaning if a thread has a lot of function calls (recursive algorithms are notorious for that) then the result can cause a StackOverFlow exception, meaning the setting is not set large enough. There is a limit for the number of threads a JVM can have but its relatively high. That being said each thread takes memory and the amount of memory we have is finite, meaning theres a defined limit, its just dependent on the application setup (i.e memory settings).
Now that we've had a quick overview of threads and how they exist in the JVM, let's imagine a scenario. Let's say we have an endpoint that is exposed and does not have any caching layers. This is because it returns customized data and the data is different for each user. When we hit the endpoint it runs a SQL query, meaning it makes a network call, waits for the data, parses it and returns it back to the user. Overtime the dataset that the SQL query executes on grows and becomes more expensive. Similarly overtime the customer base grows and acquires more customers meaning more requests to the endpoint. The endpoint is not optimized for the amount of requests that are coming in and the SQL query is not optimized for the amount of data that is being queried. This means that the endpoint is taking longer to respond to the requests and the threads are waiting for the data to come back. This is where thread starvation comes in, the threads are waiting for the data to come back and the threads are not being released. Now we are starting to develop a serious problem. We are chewing up all of our memory and the system is starting to slow down. Now that we understand how it could happen, let's discuss how to identify the scenario.
As an engineer how would you identify that thread starvation is happening in your JVM? The easiest way I know of doing so is by capturing a set of thread dumps, I like to do 1 to 2 second intervals over 60 seconds. This gives you a minute window overview of what is happening inside of your JVM. Tools such as jstack is very useful for capturing such thread dumps. Then take all the thread dumps, unzip them and open all of them in your favorite text editor. Start by reviewing the first thread dump, the one at time 1 second. What you are looking for is a thread that typically has a deep stack trace and is WAITING or RUNNABLE but performing some low level operation. Meaning, its either performing a transaction such as running a query, making a network call, manipulating a dataset. Now if you find that capture the thread id and search your entire set of thread dumps for that thread id and if you see it accross all your thread dumps then you have a slow thread. Thats the first step in identifiying that you have thread starvation. The last thing you need to identify, go back to thread dump at time 1 second, ask yourself this, how many threads do you have? If its high hundreds you are definitely in the red zone, most likely the most promonent thread is your slow thread. Meaning you most likely have a large set of these threads lingering taking up memory. Eventually the JVM will run out of memory, crash and burn and the customers will start getting a 502 Gateway Timeout.
How do you fix it? The answer is its not straight forward, I mean you can go down the easy route and throw money at it and just add more memory but its not solving the problem. What you need to do is ask yourself, am I using the correct topology, can I alter my SQL query to be more efficient, can I cache the data, can I optimize the endpoint to be more efficient. These are the questions you need to ask yourself. The answer is not always easy but its the right way to go about it. Remember, thread starvation is a symptom of a larger problem, its not the problem itself.
There are various architectures that exist, most CMS systems have caching layers, which prevent many requests from hitting the endpoint. This prevents all of these threads from being spawned and waiting for the data to come back. Other systems use queuing and they only run X jobs at a time and the rest are queued up (spoiler this will be the topic for my next blog post). The point is, there are many ways to solve the problem, you just need to find the right one for your application.