Why do Java containers recommend using ExitOnOutOfMemoryError instead of HeapDumpOnOutOfMemoryError?
This article was last updated on July 24, 2024.
Preface
I haven't written an article in a long time. The reason I suddenly felt like writing one today is that yesterday we ran into the following situation:
A memory leak in the customer (user) microservice on the backend of one of our company's mobile apps caused an OutOfMemoryError, but thanks to our carefully tuned OpenJDK container parameters, the failure was completely imperceptible to users. 💪💪💪
So how did we do it?
HeapDumpOnOutOfMemoryError vs. ExitOnOutOfMemoryError
As we all know, for Java instances deployed on traditional virtual machines, it is common practice to add the parameter -XX:+HeapDumpOnOutOfMemoryError so that problems can be analyzed more easily. With this parameter, a HeapDump is generated automatically whenever an OutOfMemoryError occurs, and we can later retrieve the HeapDump to analyze the problem more precisely.
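For example, a traditional VM deployment might be started roughly like this (a minimal sketch; the heap sizes, dump directory and jar name are illustrative, not taken from this article):

```bash
# Traditional VM deployment: dump the heap on OutOfMemoryError so it can be analyzed later.
# Heap sizes, the dump directory and the jar name are illustrative.
java -Xms4g -Xmx4g \
     -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/var/log/app/ \
     -jar app.jar
```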
But, “My lord, times have changed!”
The development of container technology has posed a great, even revolutionary, challenge to the traditional operations model:
- Traditional applications are "long-lived" vs. container pods are "ephemeral"
- Traditional applications are relatively hard to scale out and in vs. container scaling is silky smooth
- Traditional operations focus on "locating the problem" vs. container operations focus on "rapid recovery"
- With traditional applications, an instance that goes down after an OutOfMemoryError and heap dump is simply one instance fewer vs. a container is automatically restarted after it exits, keeping the specified number of replicas
- …
To summarize briefly, after moving to a container platform, our work tends toward:
- Failing fast when a failure occurs
- Recovering quickly from failures
- Keeping failures as "imperceptible" to users as possible
Therefore, for Java application containers, we also have to optimize how OutOfMemoryError failures are handled to meet these demands. For example:
- When a failure occurs, fail fast, i.e. "exit as quickly as possible, terminate quickly";
- After the problematic Java application container instance exits, a new instance starts quickly to fill its place;
- With "fast exit, fast termination", combined with the load balancer (LB), user requests are not routed to the instance while it is exiting or cold-starting.
-XX:+ExitOnOutOfMemoryError
This is exactly what meets these needs:
When this parameter is passed, the JVM exits immediately when an OutOfMemoryError is thrown. You can pass this parameter if you want the application to be terminated.
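In a container, the startup command might therefore look like this (a minimal sketch; MaxRAMPercentage and the jar name are illustrative additions, not the exact configuration used here):

```bash
# Container deployment: exit immediately on OutOfMemoryError and let the platform replace the instance.
# -XX:MaxRAMPercentage sizes the heap relative to the container memory limit (value is illustrative).
java -XX:+ExitOnOutOfMemoryError \
     -XX:MaxRAMPercentage=75.0 \
     -jar app.jar
```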
Details
Let's revisit the failure: "A memory leak occurred in the customer microservice on the backend of one of our company's mobile apps, resulting in an OutOfMemoryError."
The customer application is outlined below:
- Stateless
- Deployed as a Deployment with 6 replicas
- Exposed through a Service (SVC)
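A rough imperative equivalent of this setup, just to make it concrete (the deployment name, image and port are hypothetical):

```bash
# Stateless workload: 6 replicas behind a Service (name, image and port are hypothetical).
kubectl create deployment customer-svc \
    --image=registry.example.com/customer-svc:1.0 --replicas=6
kubectl expose deployment customer-svc --port=8080 --target-port=8080
```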
The complete process is as follows:
- Of the 6 replicas, 1 throws OutOfMemoryError.
- Because that replica's JVM is configured with -XX:+ExitOnOutOfMemoryError, the JVM (running as PID 1 in the container) exits immediately.
- Because the PID 1 process exits, the pod immediately enters the Terminating state and then becomes Terminated.
- At the same time, the customer's SVC load balancer removes the replica from its endpoints, so user requests are no longer routed to it.
- K8S detects that the number of running replicas no longer matches the Deployment's replica count and starts 1 new replica.
- Once the new replica's Readiness Probe passes, the customer's SVC load balancer adds it back and it begins receiving user requests.
During this process, the user is basically “unaware” of the background failure.
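If you want to watch this hand-off happen, commands like the following can be used (the app label and Service name are hypothetical):

```bash
# Watch the failed replica go away and its replacement come up (label and name are hypothetical).
kubectl get pods -l app=customer-svc -w
# Confirm that the Service only routes to replicas that are currently Ready.
kubectl get endpoints customer-svc
```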
Of course, to make this work, there are many details and tricks hidden in the JVM parameters and the startup script. For example, the startup script should launch the JVM with exec java ... $* (see the sketch below).
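A minimal sketch of such a startup script (the jar path is illustrative, and "$@" is used instead of $* for safer quoting; the key point is exec, which makes the JVM replace the shell as PID 1):

```bash
#!/bin/sh
# "exec" replaces the shell with the JVM, so java runs as PID 1,
# receives signals directly, and its exit immediately ends the container.
# The jar path is illustrative; "$@" passes through any extra arguments.
exec java -XX:+ExitOnOutOfMemoryError \
          -XX:MaxRAMPercentage=75.0 \
          -jar /app/app.jar "$@"
```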
I'll write another article to share these details when I get the chance.
New questions
In the previous section, we explained why Java containers should use ExitOnOutOfMemoryError rather than HeapDumpOnOutOfMemoryError. Careful readers will notice, however, that the new configuration also brings new problems, such as:
- During the period from (repeated) full GC until the JVM finally throws OutOfMemoryError, the user experience still degrades. How can the failure remain "imperceptible"?
- If HeapDumpOnOutOfMemoryError is replaced with ExitOnOutOfMemoryError, how do we locate the root cause of the problem and fix it? Wouldn't using the 2 parameters together be even better?
These can actually be addressed by other means:
- During the period from (repeated) full GC until the JVM finally throws OutOfMemoryError, the user experience still degrades. How can the failure remain "imperceptible"?
  - A: Configure a reasonable Readiness Probe. As soon as the Readiness Probe fails, K8S automatically removes the instance from the SVC. A "reasonable" Readiness Probe here means one that is guaranteed to fail when the application is not actually available, so it should generally not just check whether a port is listening, but whether the corresponding API responds normally (see the sketch below).
  - A: Use Prometheus JVM Exporter + Prometheus + AlertManager with properly configured AlertRules, such as an alert on "total GC time over the past X time window > 5 s", and intervene manually after the alert fires, before the OutOfMemoryError happens.
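The idea can be sketched as the kind of command such a readiness check would run: probe a real application endpoint rather than just an open port (the port and the /actuator/health path are assumptions, e.g. a Spring Boot-style health endpoint):

```bash
# A "reasonable" readiness check: fail unless the API itself responds as healthy.
# Port and path are assumptions (e.g. a Spring Boot actuator-style endpoint).
curl -fsS http://localhost:8080/actuator/health || exit 1
```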
- If HeapDumpOnOutOfMemoryError is replaced with ExitOnOutOfMemoryError, how do we locate the root cause of the problem and fix it? Wouldn't using the 2 parameters together be even better?
  - A: The goal is to "exit quickly, terminate quickly". After all, taking a HeapDump takes time, and the user experience may degrade during that time. So we use only ExitOnOutOfMemoryError: the faster the exit, the better.
  - A: As for analyzing the problem, it can be done by other means, such as embedding a tracing agent and locating the root cause by analyzing the traces from around the time of the failure.
  - A: After a Prometheus AlertRule on GC time fires, take a heap dump manually with jcmd or a similar command, as sketched below.
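For example, a heap dump can still be taken on demand after the alert fires, while the degrading replica is still running (a sketch; the pod name and paths are hypothetical, and PID 1 assumes the JVM is the container's main process):

```bash
# Take an on-demand heap dump from a still-running replica after a GC-time alert.
# Pod name and paths are hypothetical; PID 1 assumes the JVM is the container's main process.
kubectl exec customer-svc-xxxxx -- jcmd 1 GC.heap_dump /tmp/heapdump.hprof
kubectl cp customer-svc-xxxxx:/tmp/heapdump.hprof ./heapdump.hprof
```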
Summary
New technology brings new changes, and we need to look at "best practices" and "optimal configurations" with an evolving eye.
The optimal Java parameters for VM deployments back in 2016 are not necessarily the optimal solution today.