How to debug/analyze extremely long GC pauses in Node.js/V8

I am trying to analyze a memory/GC problem in a relatively complex Node.js server app. Even under very moderate load, it becomes unresponsive for noticeable periods, and the pauses grow longer over time. Running with the --trace-gc flag shows that extremely long garbage-collection times are the likely cause:

[4805]      537 ms: Mark-sweep 17.6 (46.4) -> 10.3 (47.4) MB, 20 ms [allocation failure] [GC in old space requested].
[4805]     1338 ms: Mark-sweep 31.3 (58.4) -> 19.2 (57.2) MB, 40 ms [allocation failure] [promotion limit reached].
[4805]     2662 ms: Mark-sweep 58.0 (79.2) -> 43.9 (85.2) MB, 109 ms [Runtime::PerformGC] [promotion limit reached].
[4805]     4014 ms: Mark-sweep 90.1 (111.5) -> 70.6 (113.9) MB, 114 ms [allocation failure] [promotion limit reached].
[4805]     7283 ms: Mark-sweep 129.7 (153.9) -> 112.0 (158.9) MB, 511 ms [allocation failure] [promotion limit reached].
[4805]    10979 ms: Mark-sweep 184.6 (210.9) -> 160.3 (212.9) MB, 422 ms [Runtime::PerformGC] [promotion limit reached].
[4805]  1146869 ms: Mark-sweep 243.8 (271.4) -> 191.6 (267.9) MB, 1856 ms [allocation failure] [promotion limit reached].
[4805]  1731440 ms: Mark-sweep 282.1 (307.4) -> 197.5 (298.9) MB, 1 / 11230 ms [allocation failure] [promotion limit reached].
[4805]  2024385 ms: Mark-sweep 291.0 (320.8) -> 197.3 (306.9) MB, 9076 ms [Runtime::PerformGC] [promotion limit reached].
[4805]  2623396 ms: Mark-sweep 290.9 (317.1) -> 196.9 (311.9) MB, 1 / 15401 ms [allocation failure] [promotion limit reached].
[4805]  3223769 ms: Mark-sweep 291.4 (323.6) -> 187.8 (318.9) MB, 1 / 13385 ms [allocation failure] [promotion limit reached].
[4805]  4225777 ms: Mark-sweep 280.1 (324.2) -> 190.6 (315.9) MB, 1 / 13266 ms [allocation failure] [promotion limit reached].
[4805]  4705442 ms: Mark-sweep 286.2 (321.4) -> 195.2 (314.9) MB, 1 / 17256 ms [Runtime::PerformGC] [promotion limit reached].
[4805]  5225595 ms: Mark-sweep 288.3 (324.0) -> 201.7 (316.9) MB, 1 / 22266 ms [Runtime::PerformGC] [promotion limit reached].
[4805]  6127372 ms: Mark-sweep 296.5 (324.6) -> 200.5 (316.9) MB, 1 / 28325 ms [allocation failure] [promotion limit reached].
[4805]  6523938 ms: Mark-sweep 297.8 (328.9) -> 198.8 (323.9) MB, 1 / 27213 ms [allocation failure] [promotion limit reached].
[4805]  7355394 ms: Mark-sweep 292.1 (330.7) -> 223.9 (322.9) MB, 60202 ms [allocation failure] [promotion limit reached].
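As I read the format (a best-effort interpretation, not official documentation), each line above is: [pid] uptime ms: Phase used-before (committed-before) -> used-after (committed-after) MB, pause ms [trigger] [reason]; where a line shows "X / Y ms", the second number is the total pause, which is the one that matters for responsiveness. A small parser sketch makes the fields easier to track programmatically:

```javascript
// Parse one --trace-gc line into its fields. Field names reflect my reading of
// the format; "X / Y ms" lines are handled by taking Y as the total pause.
function parseTraceGcLine(line) {
  var re = /^\[(\d+)\]\s+(\d+) ms: (\S+) ([\d.]+) \(([\d.]+)\) -> ([\d.]+) \(([\d.]+)\) MB, (?:\d+ \/ )?(\d+) ms \[([^\]]+)\] \[([^\]]+)\]\.?$/;
  var m = re.exec(line.trim());
  if (!m) return null;
  return {
    pid: +m[1],
    uptimeMs: +m[2],
    phase: m[3],               // e.g. "Mark-sweep"
    usedBeforeMb: +m[4],
    committedBeforeMb: +m[5],
    usedAfterMb: +m[6],
    committedAfterMb: +m[7],
    pauseMs: +m[8],            // total pause for this collection
    trigger: m[9],             // e.g. "allocation failure"
    reason: m[10]              // e.g. "promotion limit reached"
  };
}

var sample = '[4805]  1731440 ms: Mark-sweep 282.1 (307.4) -> 197.5 (298.9) MB, 1 / 11230 ms [allocation failure] [promotion limit reached].';
console.log(parseTraceGcLine(sample).pauseMs); // 11230
```

Feeding the whole log through this and plotting pauseMs against uptimeMs makes the growth trend obvious at a glance.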

The full (--trace-gc-verbose) output can be found here.


These logs are the result of running the server with the following parameters:

--expose-gc --trace-gc --trace-gc-verbose --trace-gc-ignore-scavenger --max-old-space-size=1000

The longer it runs, the longer the pauses get (often several minutes), until it eventually locks up completely after a couple of hours. Available memory never runs out, and the RSS does not even come close to the 1000 MB old-space limit, so it does not seem to be a leak. It looks to me like there might be something rather unusual within the code, making it very “difficult” for the GC to do its job in acceptable time frames.

My question is: How do I go about analyzing this problem further and narrowing down the possible causes? Are there any recommendable tools for issues like this one? I’m essentially looking for a more efficient approach than naively switching parts of the code off and on, which is extremely cumbersome and time-consuming.

As an aside, I would greatly appreciate links to any documentation that explains the terms/messages used in the GC debug output (such as “promotion limit reached”), and the numbers listed there. I have a very basic understanding of how the V8 GC works (this helped a lot), but most of that output is still beyond me.

In case it matters: This is running on Node.js v0.10.33 on Ubuntu 14.04 Server.

A while ago we moved on to io.js, where this problem simply does not occur anymore (presumably due to the much more recent V8 version). I never did find the cause for this issue with Node v0.10 though, let alone a fix.


Are you able to reproduce the issue on a single node? If I were in that situation, I would probably do some mix of the following:

  • Write a load generator that lets me reproduce the issue on a local instance.
  • If that isn’t possible, put an instance in production that receives a subset of traffic, and modify it to do the steps below.
  • Add node-heapdump to your source, call it at an interval, and export the resulting snapshots to files at N-minute intervals.
  • Potentially, if you’re running locally, you might also be able to take advantage of memwatch.
  • Wait for the slow GC to start.
  • Grab a number of heap dumps around a time when you know the slow GC is occurring.
  • Load them into Chrome DevTools and analyze them using the three-snapshot technique (in this case you could call it the N-snapshot technique).

Basically, you will load the heap snapshots and start looking through them to understand what kind of object is accumulating, what is holding onto it, and, as a result, why the GC is taking so long.

Available memory never runs out, and the RSS does not even come close to the 1000 MB old-space limit, so it does not seem to be a leak. It looks to me like there might be something rather unusual within the code, making it very “difficult” for the GC to do its job in acceptable time frames.

Here you may be looking for long or circular retainment trees. But even if that is the case, you should be able to identify the root of that tree, see what is in it, and experiment with ways to reduce or remove it.


I also agree with @dandavis and suspect closures.
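To illustrate the closure suspicion: in V8, all closures created in a scope share one context object, so a short-lived handler can retain a large value it never uses, simply because a sibling closure in the same scope mentions it. A contrived sketch (all names are made up for illustration):

```javascript
function makeHandler() {
  var big = new Array(1e6).fill(0);   // ~8 MB that should be garbage after return
  function helper() {                  // sibling closure that mentions `big`...
    return big.length;
  }
  // ...forces `big` into the shared scope context. `handler` holds that same
  // context, so `big` stays reachable for as long as `handler` does, even
  // though `handler` never touches it.
  return function handler() { return 'ok'; };
}

var handlers = [];
for (var i = 0; i < 50; i++) handlers.push(makeHandler());
// In a heap snapshot, the arrays would show up retained via "context" edges.
console.log(handlers[0]()); // "ok"
```

In a heap dump, this pattern shows up as objects retained through context edges on otherwise innocent-looking functions, which is exactly the kind of thing the N-snapshot comparison surfaces.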

This answer may not be as specific as you would like, but I’d suggest looking at good, the process-monitor package from Walmart’s hapi.js ecosystem. It does a great job of extending logging beyond --trace-gc. It listens for one or more of the following events:

  • ops – System and process performance – CPU, memory, disk, and other metrics.
  • response – Information about incoming requests and the response. This maps to either the “response” or “tail” event emitted from hapi servers.
  • log – logging information not bound to a specific request such as system errors, background processing, configuration errors, etc. Maps to the “log” event emitted from hapi servers.
  • error – request responses that have a status code of 500. This maps to the “request-error” hapi event.
  • request – Request logging information. This maps to the hapi ‘request’ event that is emitted via request.log().

You would have to pull in the hapi library for this to work, but it may be worth it temporarily for debugging purposes. In general I highly recommend hapi for scaling Node.js apps; the team at Walmart has been doing amazing things with it over the past year.
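Wiring this up might look roughly like the configuration sketch below. It follows the hapi 8 / good v5-era registration shape; the exact option names and reporter packages changed between versions, so treat the specifics (opsInterval, good-console, event subscriptions) as assumptions to check against the version you install:

```javascript
// Configuration sketch (not tested here): register `good` on a hapi server so
// ops metrics (CPU, memory, etc.) are sampled alongside --trace-gc output.
var Hapi = require('hapi');

var server = new Hapi.Server();
server.connection({ port: 8000 });

server.register({
  register: require('good'),
  options: {
    opsInterval: 15000,                // sample system/process metrics every 15 s
    reporters: [{
      reporter: require('good-console'),
      events: { ops: '*', log: '*', error: '*' } // subscribe to ops, log, and error events
    }]
  }
}, function (err) {
  if (err) throw err;
  server.start();
});
```

Watching the ops memory numbers climb in lockstep with the --trace-gc pauses would help confirm whether heap growth and pause length are actually correlated.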

The answers/resolutions above are collected from Stack Overflow and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
