Why do Java containers recommend ExitOnOutOfMemoryError instead of HeapDumpOnOutOfMemoryError?

This article was last updated on February 7, 2024.

Preface

I haven’t written an article in a long time; what prompted me to write today was an incident that happened yesterday:

A memory leak in the customer-facing microservice behind one of our company’s mobile apps caused an OutOfMemoryError, but thanks to our carefully tuned OpenJDK container parameters, the failure was completely imperceptible to users. 💪💪💪

So how did we do it?

HeapDumpOnOutOfMemoryError VS ExitOnOutOfMemoryError

As we all know, for Java instances deployed on traditional virtual machines, it is common to add the parameter -XX:+HeapDumpOnOutOfMemoryError to aid problem analysis. With this parameter, a heap dump is generated automatically whenever an OutOfMemoryError occurs, and we can later retrieve that heap dump to analyze the problem more precisely.

But, “My lord, times have changed!”

The development of container technology has brought revolutionary challenges to the traditional operations model:

  1. Traditional applications are “long-lived” vs. container pods, which are “ephemeral”
  2. Traditional applications are relatively hard to scale out and in, while container scaling is silky smooth
  3. The traditional operations model focuses on “locating the problem” vs. the container operations model, which focuses on “rapid recovery”
  4. With traditional applications, an instance that dies with a HeapDumpError is one instance fewer; a container can be restarted automatically after it exits, keeping the specified number of replicas

To summarize briefly: after adopting a container platform, our work tends toward:

  1. Fail fast when a failure occurs
  2. Recover quickly from failures
  3. Keep the failure as “imperceptible” to users as possible

Therefore, for Java application containers, we also have to optimize the handling of OutOfMemoryError failures to meet these demands:

  1. Fail fast when the failure occurs, i.e. “exit as quickly as possible, terminate quickly”
  2. After the problematic Java application container instance exits, a new instance starts quickly to take its place;
  3. “Exit quickly, terminate fast”, so that, combined with the load balancer, no user requests are routed to the instance while it exits and the replacement cold-starts.

-XX:+ExitOnOutOfMemoryError is exactly what these needs call for:

With this parameter, the JVM exits immediately when an OutOfMemoryError is thrown. Pass it if you want the application to terminate rather than linger in a broken state.
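For illustration, enabling the flag in a container launch command might look like this (the jar path and the heap-sizing flag are assumptions, not the author’s actual configuration):

```shell
# Illustrative launch command; jar path and sizing flag are assumptions.
# -XX:+ExitOnOutOfMemoryError makes the JVM exit on the first OOM,
# letting the orchestrator replace the instance.
java -XX:+ExitOnOutOfMemoryError \
     -XX:MaxRAMPercentage=75.0 \
     -jar /app/app.jar
```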

Details

Let’s revisit the failure: “A memory leak in a customer-facing microservice behind one of our company’s mobile apps caused an OutOfMemoryError.”

The customer application is outlined below:

  1. Stateless
  2. Deployed as a Deployment with 6 replicas
  3. Exposed through a Kubernetes Service (SVC)
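A minimal sketch of such a setup (all names, the image, and the port are illustrative assumptions, not the customer’s actual manifests):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: customer-service        # assumed name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: customer-service
  template:
    metadata:
      labels:
        app: customer-service
    spec:
      containers:
        - name: customer-service
          image: example/customer-service:1.0   # assumed image
          ports:
            - containerPort: 8088
---
apiVersion: v1
kind: Service
metadata:
  name: customer-service
spec:
  selector:
    app: customer-service
  ports:
    - port: 80
      targetPort: 8088
```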

The complete process is as follows:

  1. Of the 6 replicas, 1 hits an OutOfMemoryError
  2. Because the replica’s JVM is configured with -XX:+ExitOnOutOfMemoryError, the instance’s JVM (PID 1) exits immediately.
  3. Because the PID 1 process exits, the pod immediately enters the Terminating status and then becomes Terminated
  4. At the same time, the customer Service’s load balancer removes the replica from rotation, so user requests are no longer routed to it.
  5. K8s detects that the number of running replicas no longer matches the Deployment spec and starts 1 new replica.
  6. Once the new replica’s Readiness Probe passes, the customer Service’s load balancer adds it back into rotation, and it begins receiving user requests.

During this process, the user is basically “unaware” of the background failure.

Of course, getting this right involves many details and tricks in the JVM parameters and the startup script. For example, the startup script should be: exec java ....$*
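For illustration, such an entrypoint script might look like the sketch below (the jar path and JAVA_OPTS contents are assumptions): exec replaces the shell with the JVM, so java runs as PID 1, receives the pod’s signals directly, and its exit immediately terminates the container.

```shell
#!/bin/sh
# `exec` replaces this shell with the JVM, so java becomes PID 1:
# - SIGTERM from Kubernetes reaches the JVM directly
# - when the JVM exits (e.g. via ExitOnOutOfMemoryError), the container exits too
# The jar path and JAVA_OPTS contents are illustrative assumptions.
exec java ${JAVA_OPTS} -XX:+ExitOnOutOfMemoryError -jar /app/app.jar "$@"
```

Using "$@" rather than $* also preserves argument quoting when forwarding arguments.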

I’ll share those in another article when I get the chance.

New Questions

In the previous section, we explained why Java containers should use ExitOnOutOfMemoryError instead of HeapDumpOnOutOfMemoryError. But careful readers will notice that the new configuration also brings new problems, such as:

  1. During the window between the JVM’s repeated full GCs and the eventual OutOfMemoryError, the user experience still degrades. How can the failure remain “imperceptible”?
  2. With “HeapDumpOnOutOfMemoryError” replaced by “ExitOnOutOfMemoryError”, how do we locate and fix the root cause? Wouldn’t using both parameters together be better?

These can actually be solved by other means:

  1. During the window between the JVM’s repeated full GCs and the eventual OutOfMemoryError, the user experience still degrades. How can the failure remain “imperceptible”?
    1. A: Configure a reasonable Readiness Probe. As long as the Readiness Probe fails, K8s automatically removes the pod from the Service. A “reasonable” Readiness Probe here means one that is bound to fail whenever the application is unavailable; so rather than probing whether a port is listening, probe whether the corresponding API responds normally. An example is shown below.
    2. A: With the Prometheus JVM Exporter + Prometheus + Alertmanager, configure a reasonable AlertRule, e.g. “total GC time over the past X minutes > 5 s”, and intervene manually before the OOM actually happens.
  2. With “HeapDumpOnOutOfMemoryError” replaced by “ExitOnOutOfMemoryError”, how do we locate and fix the root cause? Wouldn’t using both parameters together be better?
    1. A: The goal is to “exit quickly, terminate fast”. Taking a HeapDump takes time, and the user experience may degrade during that window, so we use only “ExitOnOutOfMemoryError”: the faster the exit, the better.
    2. A: The root cause can be analyzed by other means, for example by embedding a tracing agent and analyzing the traces from around the time of the failure.
    3. A: After a Prometheus AlertRule on GC time fires, take a HeapDump manually with jcmd or similar commands.
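The manual heap dump via jcmd mentioned above could be sketched as follows (the pod name and file paths are illustrative assumptions, not real names):

```shell
# Pod name and paths are illustrative assumptions.
# PID 1 is the JVM because the entrypoint uses `exec java ...`.
kubectl exec customer-service-6d5f9c-abcde -- jcmd 1 GC.heap_dump /tmp/heap.hprof
kubectl cp customer-service-6d5f9c-abcde:/tmp/heap.hprof ./heap.hprof
```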
An example Readiness Probe that checks an HTTP endpoint rather than a bare port:

```yaml
readinessProbe:
  httpGet:
    path: /actuator/info
    port: 8088
    scheme: HTTP
  initialDelaySeconds: 60
  timeoutSeconds: 3
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 3
```
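The GC-time alert mentioned above could be sketched as a Prometheus rule like the following (the metric name assumes the Prometheus JMX exporter’s default JVM metrics; the thresholds and labels are assumptions):

```yaml
groups:
  - name: jvm-gc
    rules:
      - alert: JvmGcTimeHigh
        # Fires when more than 5s was spent in GC over the last 5 minutes.
        # Metric name assumes the Prometheus JMX exporter's default JVM metrics.
        expr: increase(jvm_gc_collection_seconds_sum[5m]) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "JVM GC time is high; possible impending OutOfMemoryError"
```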

Summary

New technology brings new changes, and we need to view “best practices” and “optimal configurations” with an evolving eye.

The optimal Java parameters for VM deployment in 2016 are not necessarily the optimal solution today.


Source: https://e-whisper.com/posts/5006/
Author: east4ming
Posted on: August 27, 2020