Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for ZGC on jdk11 #211

Open
jborgers opened this issue Nov 10, 2018 · 15 comments
Open

Support for ZGC on jdk11 #211

jborgers opened this issue Nov 10, 2018 · 15 comments

Comments

@jborgers
Copy link

Hi,

I am using ZGC on jdk-11 and the gc-log files don't parse correctly. I get:

INFO [DataReaderFacade]: GCViewer version 1.36-SNAPSHOT (2018-10-23T20:08:13+0000)
INFO [DataReaderFactory]: File format: Oracle / OpenJDK unified jvm logging
INFO [DataReaderUnifiedJvmLogging]: Reading Oracle / OpenJDK unified jvm logging format...
INFO [DataReaderUnifiedJvmLogging]: Using The Z Garbage Collector
WARNING [DataReaderUnifiedJvmLogging]: Failed to parse gc event (com.tagtraum.perf.gcviewer.imp.UnknownGcTypeException: Unknown gc type: 'Garbage Collection (Warmup)') on line number 12 (line="[0,595s][info][gc ] GC(0) Garbage Collection (Warmup) 116M(11%)->102M(10%)")
WARNING [DataReaderUnifiedJvmLogging]: Failed to parse gc event (com.tagtraum.perf.gcviewer.imp.UnknownGcTypeException: Unknown gc type: 'Garbage Collection (Warmup)') on line number 23 (line="[1,048s][info][gc ] GC(1) Garbage Collection (Warmup) 208M(20%)->164M(16%)")
WARNING [DataReaderUnifiedJvmLogging]: Failed to parse gc event (com.tagtraum.perf.gcviewer.imp.UnknownGcTypeException: Unknown gc type: 'Garbage Collection (Warmup)') on line number 34 (line="[2,070s][info][gc ] GC(2) Garbage Collection (Warmup) 402M(39%)->494M(48%)")
WARNING [DataReaderUnifiedJvmLogging]: Failed to parse gc event (com.tagtraum.perf.gcviewer.imp.UnknownGcTypeException: Unknown gc type: 'Garbage Collection (Allocation Rate)') on line number 45 (line="[3,167s][info][gc ] GC(3) Garbage Collection (Allocation Rate) 502M(49%)->716M(70%)")
WARNING [DataReaderUnifiedJvmLogging]: Failed to parse gc event (com.tagtraum.perf.gcviewer.imp.UnknownGcTypeException: Unknown gc type: 'Garbage Collection (Allocation Rate)') on line number 56 (line="[4,573s][info][gc ] GC(4) Garbage Collection (Allocation Rate) 718M(70%)->580M(57%)")
[..]
etc.
I attached the gc log file I used.
I used -Xlog:gc,gc+phases:gc.log
Is the gc-phases info also used? Other tags/levels to include?

Would be great if ZGC is also supported.
Thanks!
Jeroen

@jborgers
Copy link
Author

[0.065s][info][gc] Using The Z Garbage Collector
[0,403s][info][gc,phases] GC(0) Pause Mark Start 0,240ms
[0,492s][info][gc,phases] GC(0) Concurrent Mark 88,955ms
[0,493s][info][gc,phases] GC(0) Pause Mark End 0,074ms
[0,493s][info][gc,phases] GC(0) Concurrent Process Non-Strong References 0,244ms
[0,493s][info][gc,phases] GC(0) Concurrent Reset Relocation Set 0,000ms
[0,493s][info][gc,phases] GC(0) Concurrent Destroy Detached Pages 0,001ms
[0,495s][info][gc,phases] GC(0) Concurrent Select Relocation Set 1,426ms
[0,508s][info][gc,phases] GC(0) Concurrent Prepare Relocation Set 13,118ms
[0,509s][info][gc,phases] GC(0) Pause Relocate Start 0,325ms
[0,595s][info][gc,phases] GC(0) Concurrent Relocate 85,943ms
[0,595s][info][gc ] GC(0) Garbage Collection (Warmup) 116M(11%)->102M(10%)
[0,806s][info][gc,phases] GC(1) Pause Mark Start 0,103ms
[0,994s][info][gc,phases] GC(1) Concurrent Mark 187,553ms
[0,994s][info][gc,phases] GC(1) Pause Mark End 0,096ms
[0,995s][info][gc,phases] GC(1) Concurrent Process Non-Strong References 0,267ms
[0,997s][info][gc,phases] GC(1) Concurrent Reset Relocation Set 2,512ms
[0,997s][info][gc,phases] GC(1) Concurrent Destroy Detached Pages 0,001ms
[0,999s][info][gc,phases] GC(1) Concurrent Select Relocation Set 1,515ms
[1,001s][info][gc,phases] GC(1) Concurrent Prepare Relocation Set 2,673ms
[1,003s][info][gc,phases] GC(1) Pause Relocate Start 0,144ms
[1,047s][info][gc,phases] GC(1) Concurrent Relocate 44,026ms
[1,048s][info][gc ] GC(1) Garbage Collection (Warmup) 208M(20%)->164M(16%)
[1,501s][info][gc,phases] GC(2) Pause Mark Start 0,175ms
[1,949s][info][gc,phases] GC(2) Concurrent Mark 447,220ms
[1,953s][info][gc,phases] GC(2) Pause Mark End 0,176ms
[1,953s][info][gc,phases] GC(2) Concurrent Process Non-Strong References 0,281ms
[1,953s][info][gc,phases] GC(2) Concurrent Reset Relocation Set 0,476ms
[1,953s][info][gc,phases] GC(2) Concurrent Destroy Detached Pages 0,001ms
[1,955s][info][gc,phases] GC(2) Concurrent Select Relocation Set 1,677ms
[1,962s][info][gc,phases] GC(2) Concurrent Prepare Relocation Set 7,211ms
[1,963s][info][gc,phases] GC(2) Pause Relocate Start 0,199ms
[2,069s][info][gc,phases] GC(2) Concurrent Relocate 105,749ms
[2,070s][info][gc ] GC(2) Garbage Collection (Warmup) 402M(39%)->494M(48%)
[2,102s][info][gc,phases] GC(3) Pause Mark Start 0,148ms
[3,012s][info][gc,phases] GC(3) Concurrent Mark 910,322ms
[3,013s][info][gc,phases] GC(3) Pause Mark End 0,224ms
[3,013s][info][gc,phases] GC(3) Concurrent Process Non-Strong References 0,087ms
[3,014s][info][gc,phases] GC(3) Concurrent Reset Relocation Set 0,969ms
[3,014s][info][gc,phases] GC(3) Concurrent Destroy Detached Pages 0,008ms
[3,016s][info][gc,phases] GC(3) Concurrent Select Relocation Set 1,502ms
[3,025s][info][gc,phases] GC(3) Concurrent Prepare Relocation Set 9,092ms
[3,026s][info][gc,phases] GC(3) Pause Relocate Start 0,215ms
[3,167s][info][gc,phases] GC(3) Concurrent Relocate 140,633ms
[3,167s][info][gc ] GC(3) Garbage Collection (Allocation Rate) 502M(49%)->716M(70%)
[3,204s][info][gc,phases] GC(4) Pause Mark Start 0,084ms
[4,408s][info][gc,phases] GC(4) Concurrent Mark 1204,101ms
[4,408s][info][gc,phases] GC(4) Pause Mark End 0,168ms
[4,409s][info][gc,phases] GC(4) Concurrent Process Non-Strong References 0,184ms
[4,410s][info][gc,phases] GC(4) Concurrent Reset Relocation Set 1,274ms
[4,410s][info][gc,phases] GC(4) Concurrent Destroy Detached Pages 0,004ms
[4,412s][info][gc,phases] GC(4) Concurrent Select Relocation Set 1,793ms
[4,422s][info][gc,phases] GC(4) Concurrent Prepare Relocation Set 9,828ms
[4,422s][info][gc,phases] GC(4) Pause Relocate Start 0,325ms
[4,573s][info][gc,phases] GC(4) Concurrent Relocate 150,815ms
[4,573s][info][gc ] GC(4) Garbage Collection (Allocation Rate) 718M(70%)->580M(57%)
[4,606s][info][gc,phases] GC(5) Pause Mark Start 0,142ms
[6,059s][info][gc,phases] GC(5) Concurrent Mark 1453,747ms
[6,060s][info][gc,phases] GC(5) Pause Mark End 0,257ms
[6,060s][info][gc,phases] GC(5) Concurrent Process Non-Strong References 0,073ms
[6,062s][info][gc,phases] GC(5) Concurrent Reset Relocation Set 1,446ms
[6,062s][info][gc,phases] GC(5) Concurrent Destroy Detached Pages 0,004ms
[6,063s][info][gc,phases] GC(5) Concurrent Select Relocation Set 1,776ms
[6,070s][info][gc,phases] GC(5) Concurrent Prepare Relocation Set 6,290ms
[6,074s][info][gc,phases] GC(5) Pause Relocate Start 0,236ms
[6,168s][info][gc,phases] GC(5) Concurrent Relocate 94,363ms
[6,168s][info][gc ] GC(5) Garbage Collection (Allocation Rate) 582M(57%)->640M(62%)
[6,206s][info][gc,phases] GC(6) Pause Mark Start 0,162ms
[7,598s][info][gc,phases] GC(6) Concurrent Mark 1392,599ms
[7,598s][info][gc,phases] GC(6) Pause Mark End 0,056ms
[7,599s][info][gc,phases] GC(6) Concurrent Process Non-Strong References 0,106ms
[7,600s][info][gc,phases] GC(6) Concurrent Reset Relocation Set 0,850ms
[7,600s][info][gc,phases] GC(6) Concurrent Destroy Detached Pages 0,004ms
[7,601s][info][gc,phases] GC(6) Concurrent Select Relocation Set 1,634ms
[7,612s][info][gc,phases] GC(6) Concurrent Prepare Relocation Set 10,437ms
[7,612s][info][gc,phases] GC(6) Pause Relocate Start 0,259ms
[7,819s][info][gc,phases] GC(6) Concurrent Relocate 206,858ms
[7,819s][info][gc ] GC(6) Garbage Collection (Allocation Rate) 644M(63%)->670M(65%)
[7,908s][info][gc,phases] GC(7) Pause Mark Start 0,508ms
[9,622s][info][gc,phases] GC(7) Concurrent Mark 1713,567ms
[9,622s][info][gc,phases] GC(7) Pause Mark End 0,048ms
[9,622s][info][gc,phases] GC(7) Concurrent Process Non-Strong References 0,141ms
[9,623s][info][gc,phases] GC(7) Concurrent Reset Relocation Set 1,305ms
[9,624s][info][gc,phases] GC(7) Concurrent Destroy Detached Pages 0,115ms
[9,625s][info][gc,phases] GC(7) Concurrent Select Relocation Set 1,905ms
[9,637s][info][gc,phases] GC(7) Concurrent Prepare Relocation Set 11,572ms
[9,638s][info][gc,phases] GC(7) Pause Relocate Start 0,256ms
[9,822s][info][gc,phases] GC(7) Concurrent Relocate 184,197ms
[9,822s][info][gc ] GC(7) Garbage Collection (Allocation Rate) 672M(66%)->712M(70%)
[9,901s][info][gc,phases] GC(8) Pause Mark Start 0,111ms
[11,706s][info][gc,phases] GC(8) Concurrent Mark 1805,117ms
[11,707s][info][gc,phases] GC(8) Pause Mark End 0,039ms
[11,707s][info][gc,phases] GC(8) Concurrent Process Non-Strong References 0,055ms
[11,708s][info][gc,phases] GC(8) Concurrent Reset Relocation Set 1,464ms
[11,708s][info][gc,phases] GC(8) Concurrent Destroy Detached Pages 0,003ms
[11,710s][info][gc,phases] GC(8) Concurrent Select Relocation Set 1,528ms
[11,736s][info][gc,phases] GC(8) Concurrent Prepare Relocation Set 26,036ms
[11,736s][info][gc,phases] GC(8) Pause Relocate Start 0,384ms
[12,115s][info][gc,phases] GC(8) Concurrent Relocate 379,047ms
[12,116s][info][gc ] GC(8) Garbage Collection (Allocation Rate) 720M(70%)->716M(70%)
[12,227s][info][gc,phases] GC(9) Pause Mark Start 0,236ms
[14,629s][info][gc,phases] GC(9) Concurrent Mark 2401,959ms
[14,629s][info][gc,phases] GC(9) Pause Mark End 0,039ms
[14,629s][info][gc,phases] GC(9) Concurrent Process Non-Strong References 0,065ms
[14,633s][info][gc,phases] GC(9) Concurrent Reset Relocation Set 3,317ms
[14,633s][info][gc,phases] GC(9) Concurrent Destroy Detached Pages 0,002ms
[14,634s][info][gc,phases] GC(9) Concurrent Select Relocation Set 1,529ms
[14,653s][info][gc,phases] GC(9) Concurrent Prepare Relocation Set 18,641ms
[14,653s][info][gc,phases] GC(9) Pause Relocate Start 0,350ms
[14,963s][info][gc,phases] GC(9) Concurrent Relocate 309,640ms
[14,963s][info][gc ] GC(9) Garbage Collection (Allocation Rate) 720M(70%)->744M(73%)
[15,002s][info][gc,phases] GC(10) Pause Mark Start 0,143ms
[17,163s][info][gc,phases] GC(10) Concurrent Mark 2160,944ms
[17,163s][info][gc,phases] GC(10) Pause Mark End 0,050ms
[17,163s][info][gc,phases] GC(10) Concurrent Process Non-Strong References 0,167ms
[17,165s][info][gc,phases] GC(10) Concurrent Reset Relocation Set 2,551ms
[17,165s][info][gc,phases] GC(10) Concurrent Destroy Detached Pages 0,001ms
[17,167s][info][gc,phases] GC(10) Concurrent Select Relocation Set 1,519ms
[17,183s][info][gc,phases] GC(10) Concurrent Prepare Relocation Set 16,020ms
[17,184s][info][gc,phases] GC(10) Pause Relocate Start 0,511ms
[17,193s][info][gc ] Allocation Stall (main) 269,000ms
[17,444s][info][gc,phases] GC(10) Concurrent Relocate 260,281ms
[17,444s][info][gc ] GC(10) Garbage Collection (Allocation Rate) 746M(73%)->720M(70%)
[17,501s][info][gc,phases] GC(11) Pause Mark Start 0,149ms
[19,609s][info][gc,phases] GC(11) Concurrent Mark 2107,962ms
[19,609s][info][gc,phases] GC(11) Pause Mark End 0,166ms
[19,609s][info][gc,phases] GC(11) Concurrent Process Non-Strong References 0,074ms
[19,612s][info][gc,phases] GC(11) Concurrent Reset Relocation Set 2,248ms
[19,612s][info][gc,phases] GC(11) Concurrent Destroy Detached Pages 0,001ms
[19,613s][info][gc,phases] GC(11) Concurrent Select Relocation Set 1,440ms
[19,626s][info][gc,phases] GC(11) Concurrent Prepare Relocation Set 13,066ms
[19,627s][info][gc,phases] GC(11) Pause Relocate Start 0,325ms
[19,628s][info][gc ] Allocation Stall (main) 96,544ms
[19,851s][info][gc,phases] GC(11) Concurrent Relocate 224,531ms
[19,852s][info][gc ] GC(11) Garbage Collection (Allocation Rate) 726M(71%)->742M(72%)
[19,901s][info][gc,phases] GC(12) Pause Mark Start 0,152ms
[21,897s][info][gc,phases] GC(12) Concurrent Mark 1995,935ms
[21,897s][info][gc,phases] GC(12) Pause Mark End 0,042ms
[21,897s][info][gc,phases] GC(12) Concurrent Process Non-Strong References 0,176ms
[21,899s][info][gc,phases] GC(12) Concurrent Reset Relocation Set 1,862ms
[21,899s][info][gc,phases] GC(12) Concurrent Destroy Detached Pages 0,000ms
[21,901s][info][gc,phases] GC(12) Concurrent Select Relocation Set 1,744ms
[21,916s][info][gc,phases] GC(12) Concurrent Prepare Relocation Set 14,910ms
[21,917s][info][gc,phases] GC(12) Pause Relocate Start 0,246ms
[21,926s][info][gc ] Allocation Stall (main) 87,571ms
[22,162s][info][gc,phases] GC(12) Concurrent Relocate 245,384ms
[22,162s][info][gc ] GC(12) Garbage Collection (Allocation Rate) 748M(73%)->718M(70%)
[22,203s][info][gc,phases] GC(13) Pause Mark Start 0,099ms
[24,339s][info][gc,phases] GC(13) Concurrent Mark 2136,728ms
[24,340s][info][gc,phases] GC(13) Pause Mark End 0,047ms
[24,340s][info][gc,phases] GC(13) Concurrent Process Non-Strong References 0,065ms
[24,342s][info][gc,phases] GC(13) Concurrent Reset Relocation Set 1,931ms
[24,342s][info][gc,phases] GC(13) Concurrent Destroy Detached Pages 0,001ms
[24,343s][info][gc,phases] GC(13) Concurrent Select Relocation Set 1,828ms
[24,359s][info][gc,phases] GC(13) Concurrent Prepare Relocation Set 15,294ms
[24,359s][info][gc,phases] GC(13) Pause Relocate Start 0,372ms
[24,363s][info][gc ] Allocation Stall (main) 55,435ms
[24,607s][info][gc,phases] GC(13) Concurrent Relocate 247,894ms
[24,607s][info][gc ] GC(13) Garbage Collection (Allocation Rate) 724M(71%)->742M(72%)
[24,709s][info][gc,phases] GC(14) Pause Mark Start 0,395ms
[26,991s][info][gc,phases] GC(14) Concurrent Mark 2281,351ms
[26,991s][info][gc,phases] GC(14) Pause Mark End 0,039ms
[26,991s][info][gc,phases] GC(14) Concurrent Process Non-Strong References 0,091ms
[26,993s][info][gc,phases] GC(14) Concurrent Reset Relocation Set 2,315ms
[26,993s][info][gc,phases] GC(14) Concurrent Destroy Detached Pages 0,001ms
[26,995s][info][gc,phases] GC(14) Concurrent Select Relocation Set 1,446ms
[27,010s][info][gc,phases] GC(14) Concurrent Prepare Relocation Set 14,856ms
[27,010s][info][gc,phases] GC(14) Pause Relocate Start 0,351ms
[27,020s][info][gc ] Allocation Stall (main) 279,595ms
[27,282s][info][gc,phases] GC(14) Concurrent Relocate 272,125ms
[27,282s][info][gc ] GC(14) Garbage Collection (Allocation Rate) 744M(73%)->728M(71%)
[27,301s][info][gc,phases] GC(15) Pause Mark Start 0,109ms
[29,769s][info][gc,phases] GC(15) Concurrent Mark 2467,893ms
[29,769s][info][gc,phases] GC(15) Pause Mark End 0,081ms
[29,769s][info][gc,phases] GC(15) Concurrent Process Non-Strong References 0,092ms
[29,771s][info][gc,phases] GC(15) Concurrent Reset Relocation Set 1,924ms
[29,771s][info][gc,phases] GC(15) Concurrent Destroy Detached Pages 0,000ms
[29,773s][info][gc,phases] GC(15) Concurrent Select Relocation Set 1,517ms
[29,788s][info][gc,phases] GC(15) Concurrent Prepare Relocation Set 14,953ms
[29,788s][info][gc,phases] GC(15) Pause Relocate Start 0,308ms
[30,014s][info][gc,phases] GC(15) Concurrent Relocate 225,541ms
[30,014s][info][gc ] GC(15) Garbage Collection (Allocation Rate) 730M(71%)->522M(51%)

@yanqili-fiix
Copy link

We are also facing the same issue, my team and I are currently taking a stab at adding ZGC support by adding and categorizing ZGC log events. We are looking to open a PR on this.

Cheers,
Yanqi

@chewiebug
Copy link
Owner

chewiebug commented Jan 6, 2019 via email

@yanqilee
Copy link
Contributor

yanqilee commented Jan 15, 2019

Hi @chewiebug,

@Marysunithajoseph and I have been working on this for the past few days, our progress so far is on https://github.com/macmms/GCViewer/tree/feature/protoZGC

As of now, we are parsing the [gc,phases] for the pause/concurrent events, [gc,start] for starting a GC cycle, "[gc,heap] Capacity" for total heap size, as well as "[gc] Garbage Collection" for preUsed and postUsed heap size.
A couple things we've faced so far:

  1. The [gc,phases] are the only lines where we can get pause times, so we have to include it for ZGC logging, however, it is not being parsed for other GCs. Including [gc,phases] for parsing in DataReaderUnifiedJvmLogging breaks existing test cases for other GCs, we added a pattern so that it only parses [gc,phases] for ZGC events.
  2. Heap use information is collected by marking the GC cycle start with [gc,start], append to the event on "[gc,heap] Capacity" for total heap, then append preUsed and postUsed heap size from "[gc] Garbage Collection (cause)" line and return the event
  3. For default GC logging, preUsed and postUsed heap info is extracted, however, there is no pause time or total heap size info.
  4. The heap and pause information looks good on the UI, the only caveat is that things like "Garbage Collection (Allocation Rate)" is the stat for the entire GC cycle and not a pause event (currently affect pause average). We are thinking of adding a section in the "Event details" with "Gc cycles" under "Concurrent GCs", add the GC cycles events there, and possibly include lines like "[1460.994s][info][gc] Allocation Stall (//LogOn//) 289.397ms" in that section if we find that information useful.

Wanted to have a checkpoint here and see what you think.
Would appreciate if you have any feedbacks.

Thanks,
Yanqi

@yanqilee
Copy link
Contributor

screen shot 2019-01-15 at 3 26 03 pm
Current event details page with summary

@yanqilee
Copy link
Contributor

screen shot 2019-01-16 at 10 32 39 am
New look after some more tinkering.

@yanqilee
Copy link
Contributor

PR is up at #216

@chewiebug
Copy link
Owner

Hi @yanqilee

Sorry for not answering earlier! I have very little time for GCViewer lately. Still I'd like to help your pull request along so that it can be merged - it looks good and close to finished!

I'll remark on your screenshots here and add some comments to the pull request directly.

Looking at the screenshots you posted, I see, that parsing if the ZGC logs already looks really good! The events are there, pauses are measured and collected in the event details panel.

What strikes me as odd compared to the currently used concept in GCViewer is the introduction of the "GC causes" section. I see in your "all" sample, that ZGC thinks of all of these events as one collection. Based on your knowledge about ZGC (I have none so far), is this, what is interesting to know about ZGC collections to be able to tune the garbage collector?

Comparing to the current implementation, I'd have expected one collection cycle to represent one "gc pause" (like in G1 collector - there are also gc,phases present, but they look rather like parts of one collection, than like separate pauses).

To be able to treat G1 and ZGC in a more similar way, I could also imagine, to sum up the "Pause" parts of the ZGC collector and treat "Gargabe Collection (Allocation Rate)" as one GC pause and introduce a new section "gc phases", which shows the separate "Pause" parts. Would that make sense for you?

The second thing: The "gc causes" in the second screenshot have huge numbers. What's the intention of those? Or what should be there?

@yanqilee
Copy link
Contributor

Hi @chewiebug,

No problem! Thanks for responding!

For the Gc causes section, it might have been over-engineered in hindsight. To address the second thing, the numbers are huge because they represent the amount the heap size changed after the garbage collection cycle and not the pause time (I can see this being super confusing). I’m going to remove the “Gc Causes” section.

The pause and concurrent event times should give a good amount of information to tune the garbage collector. The memory usage information is also present via the “Garbage Collection” events.
The caveat with ZGC logs that is that the phases provides only pause duration information, while GC cycles (i.e. “Garbage Collection (Allocation Rate)”) only provide memory information, so I had to piece those together.

To compare with G1/Shenandoah, the [gc,phases] in ZGC correspond to their pause/concurrent events, for example “Pause Initial Mark”, “concurrent marking”, etc. While ZGC Garbage Collection cycles like “Garbage Collection (Allocation Rate)” correspond to events like “Pause Full”, “Pause Young”, etc.

Seems to me like the [gc,phases] in G1/Shenandoah are more like sub-phases, and the actual phases like “Pause Initial Mark” are on their own [gc] line. While in ZGC logging there was a design change, in which the pause/concurrent events are moved into [gc,phases] section, with [gc] only having the big events like “Pause Full” and “Pause Young”, does that make sense?

So instead of introducing a new “Gc phases” section on the viewer, I’m thinking of put GC cycles in the “Full GC Pauses” section, it won’t have any pause duration information but it will serve as a reference point to garbage collection reasons and count, as well as storing memory info.

It would look something like this:
new_gcviewer_snapshot

I think that would make the most sense, what do you think?

I have commented on the PR to questions you asked; please let me know if you need any more clarification.

I’m a little occupied at the moment as well, I should have time to work on this in about two weeks, but I will be available for any questions you have, and discuss the design in the meantime :)

@chewiebug
Copy link
Owner

Hi @yanqilee,

Thank you very much for your explanations and thoughts! I have now also read some articles about ZGC myself to get a better understanding of this new garbage collection algorithm.

If I try to summarise your explanations and what I got from this article and this youtube presentation I see the following concepts / differences introduces by ZGC compared to previous algorithms:

  • to me, the two biggest conceptual differences seem to be those
    • load barriers
    • colored pointers
    • all the heavy lifting is done in the concurrent phases
    • pause time does not relate to the size of the heap, but rather the number of life objects present (scales better, especially for really large heaps, than the former)
  • just looking at the logs, the differences manifest in the following points:
    • [gc] lines (mostly) only contain memory information, no pause times
    • [gc,phases] lines only contain pause times and no memory information
    • stop the world and concurrent parts of a collection cycle are all logged as part of the same gc event (I usually saw either stop-the-world or concurrent events in the logs)
    • => there is no single line to parse and get all memory and pause information for on gc event
[15,002s][info][gc,phases] GC(10) Pause Mark Start 0,143ms
[17,163s][info][gc,phases] GC(10) Concurrent Mark 2160,944ms
[17,163s][info][gc,phases] GC(10) Pause Mark End 0,050ms
[17,163s][info][gc,phases] GC(10) Concurrent Process Non-Strong References 0,167ms
[17,165s][info][gc,phases] GC(10) Concurrent Reset Relocation Set 2,551ms
[17,165s][info][gc,phases] GC(10) Concurrent Destroy Detached Pages 0,001ms
[17,167s][info][gc,phases] GC(10) Concurrent Select Relocation Set 1,519ms
[17,183s][info][gc,phases] GC(10) Concurrent Prepare Relocation Set 16,020ms
[17,184s][info][gc,phases] GC(10) Pause Relocate Start 0,511ms
[17,193s][info][gc ] Allocation Stall (main) 269,000ms
[17,444s][info][gc,phases] GC(10) Concurrent Relocate 260,281ms
[17,444s][info][gc ] GC(10) Garbage Collection (Allocation Rate) 746M(73%)->720M(70%)

Checking the current code of GCViewer, the concepts used are the following:

  • stop-the-world events (GCEvent and subclasses) contain pause and memory information
  • concurrent events (ConcurrentGCEvent) always contain pause information and may contain memory information
  • drawing of charts relies on this kind of events
  • information in the datapanel (memory) relies on this kind of events

Conclusions

I believe, your implementation is on a very good way. There is just one issue, which currently breaks the chart (no memory information shown) and the data panel (memory): The GCEvents don't contain both pause times and memory information.

Suggestion for implementation

My suggestion to merge your current implementation concepts with the current concepts present in GCViewer: Add a the list of "phases" events and a new method "addPhase()" to the AbstractGCEvent class. Override the "addPhases()" method in GCEventUJL to accumulate the pauses present in the "phases" list. Then we will have the "standard" (GCViewer) GCEvents containing memory and pause information.

To accomodate the new kind of "phases" of ZGC, you could add the separate "phases" list in GCModel and add a separate section "gc phases" (similar to the "gc causes" you added in one of your earlier attempts).

The "Event details" tab would then show the following:

  • "gc phases" showing the [gc,phases] lines of the gc log
  • "gc pauses" showing the [gc] lines of the log including pause times as the summery of the stop-the world events in its internal "phases" list
  • "full gc pauses" would probably never show anything in the case of ZGC, because that is a concept not necessary any more (so just leave that section for compatibility with the other algorithms)

The (memory) data panel would again show meaningful data.

What do you think of my suggestion?

@yanqilee
Copy link
Contributor

yanqilee commented Feb 25, 2019

Hi @chewiebug,

Thanks for your thought and detailed suggestion, that helps a lot.

I think it makes sense to add a phases section and have it contribute to the gc event that is the collection cycle.

To summarize your suggestion the way I understood it:

  • Introduce the notion of phases by declaring a List of phases and have the ability to add them with the method addPhase(T) in AbstractGCEvent, override the addPhase(T) method in GCEventUJL to accumulate the pause time for the parent event
  • While parsing phases in DataReaderUJL, add these events to the phases List of the parent event (start of the cycle denoted by [gc,start])
  • The GC event pause time will be the sum of the 3 stop-the-world events (Pause Mark Start, Pause Mark End, Pause Relocate Start) in that cycle
  • Add the separate phases list in GCModel and ModelDetailsPanel to have new phases section show up in "Event details" tab

Please let me know if I misunderstood anything.

I also have a question regarding the complications involving the inclusion of phases.

Prior to ZGC parsing, phases were ignored. Unit tests were built around that, once I included [gc,phases] in the INCLUDE_STRINGS, unit test cases for UJL were breaking because there are [gc,phases] lines in the test files (CMS, G1, etc) that is now giving a warning "Failed to parse gc event" which fails the assertion of 0 warnings.

I see you have commented on the below code in the PR, we have added those as a way to only include ZGC [gc,phases]. I don't quite like the solution and I've been trying to find better ways to go about this, I haven't come up with a better solution yet.

private static final Pattern PATTERN_INCLUDE_STRINGS_PHASE = Pattern.compile("\\[(gc,phases[ ]*)][ ]GC\\(([0-9]+)\\)[ ](?<type>[-a-zA-Z1 ()]+)(([ ]([0-9]{1}.*))|$)");

private boolean isParseablePhaseEvent(String line) {

Do you have any suggestions regarding this? Please let me know if you have any questions.

I will be making these changes soon, cheers.

@chewiebug
Copy link
Owner

Hi @yanqilee,

If think, you summarised my suggestion very nicely! If you implement it that way, I believe, we will have a solid baseline to add gc,phases parsing for other gc algorithms. Adding gc,phases support for all other gc algorithms should be fairly easy then.

Which leads to your question regarding the complications (breaks some unittests, because new parser warnings are introduced). I suggest the following:

  • add "[gc,phases" to the INCLUDE_STRINGS array
  • add "[gc,phases,start" to the EXCLUDE_STRINGS array
  • remove your isParseablePhaseEvent() method and PATTERN_INCLUDE_STRINGS_PHASE
  • comment assertThat("number of errors", handler.getCount(), is(0)); in UnittestHelper#getGCModelFromLogFile() => test failures should all disappear
    • please add a comment like // TODO add support for [gc,phases in all gc algorithms, so that I won't forget to add gc,phases support for all other algorithms

Then finish your implementation.

Is this enough information for you to continue, or would you like me to also update the comments in the pull request?

I am looking forward to merging your pull request!

Best regards,
Jörg

@yanqilee
Copy link
Contributor

Hi @chewiebug,

That's perfect, that cleared all my doubts and I'll be able to continue and make all the changes.
Thanks for the fast reply, looking forward to working on this.

Cheers,
Yanqi

@chewiebug
Copy link
Owner

#216 is now merged and support for parsing ZGC logs is present in the latest SNAPSHOT version.

@jborgers: The log you posted in your second comment of this issue will be parsed, but won't yield much in terms of display in the chart. If you add the gc,start information to your logs, you will see much more.

@jborgers
Copy link
Author

jborgers commented Jun 10, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants