Bug 1952888
| Summary: | ps command only outputting ~130 procs/min | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | John Siddle <jsiddle> |
| Component: | crash | Assignee: | ltao |
| Status: | CLOSED ERRATA | QA Contact: | xiaoying yan <yiyan> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.3 | CC: | aubaker, cye, lijiang, ltao, ruyang, stalexan, yiyan |
| Target Milestone: | beta | Keywords: | Triaged |
| Target Release: | --- | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | crash-7.3.1-5.el8 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-05-10 13:55:49 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
(In reply to John Siddle from comment #0)

> Description of problem:
>
> For galvatron task ID 886910425 the 'ps' command was running very slowly.
> With this core's ~65k processes it would take ~7 hours to finish the command.

Hello John, how can I access the vmcore that has the ~65k processes and causes the performance degradation? I'd like to debug it.

> Version-Release number of selected component (if applicable):
>
> Issue exists both on the version of crash installed on galvatron,
> 7.2.7-3.el8, and upstream version 7.2.9++ (commit dccdf95c)
>
> How reproducible:
>
> Always
>
> Steps to Reproduce:
> 1. retrace-server-interact 886910425 crash

Which server should I log in to, to invoke the command? Thanks! Tao Liu

> 2. ps
>
> Actual results:
>
> Running 'ps > ps.out' in crash:
>
> crash 7.2.7-3.el8:
>
> [jsiddle@galvatron crash]$ wc -l ps.out ; sleep 60 ; wc -l ps.out
> 273 ps.out
> 411 ps.out
>
> 138/min
>
> crash 7.2.9++:
>
> [jsiddle@galvatron crash]$ wc -l ps.out ; sleep 60 ; wc -l ps.out
> 209 ps.out
> 341 ps.out
>
> 132/min

Hi Tao, thanks for looking into this. You can access the core with:

# ssh <kerberos>@galvatron.cee.redhat.com
# crash /cores/exceptions/siddle/vmcore_bz1952888 /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-1160.el7.x86_64/vmlinux

Hello John, the vmcore in /cores/exceptions/siddle/vmcore_bz1952888 is no longer accessible. I could access it previously, but now I cannot; perhaps there is a time limit for storing vmcores on the galvatron.cee.redhat.com server? Anyway, is it possible for me to access the vmcore again? Thanks! Tao Liu

Hi Tao,
Sorry about that. I've re-queued the core so you should be able to access it on galvatron with:
# retrace-server-interact 339468762 crash
Hi John, thanks! Now I can access the vmcore. Thanks, Tao Liu

Hello John,
The reason ps outputs slowly is that the processes have a very large number of threads (nearly 20000 each). When the "ps" command is invoked, crash iterates over every thread of each process (from 'first' to 'last') and accumulates each thread's rss to obtain the final rss of the process:
```c
while (first <= last)
{
        /* read one per-thread rss counter (e.g. file pages) */
        if (!readmem(first->task +
            ....
        rss += sync_rss;
        /* read the other per-thread rss counter (e.g. anon pages) */
        if (!readmem(first->task +
            ....
        rss += sync_rss;
        if (first == last)
                break;
        first++;
}
```
I modified the code a little to print the number of threads for each process. As the listing below shows, these processes each hold nearly 20000 threads, which means the above while loop runs ~20000 times and readmem ~40000 times per process. Merging the two readmem calls into one (see the sketch after the listings below) does not improve performance much, and readmem results cannot be cached because the address being read differs on every iteration.
----21870 threads----
300 10765 4 ffff8816603d2100 IN 4.6 50726924 12236960 java
----21941 threads----
301 11080 10 ffff882936c4c200 IN 5.0 51334600 13329952 java
----21870 threads----
302 10765 4 ffff881615fce300 IN 4.6 50726924 12236960 java
----19918 threads----
303 10688 13 ffff8826c1dd6300 IN 7.7 64202900 20482612 rocksdb:high3
----19918 threads----
304 10688 29 ffff8826c1dd4200 IN 7.7 64202900 20482612 rocksdb:high4
----19918 threads----
305 10688 12 ffff8816867f2100 IN 7.7 64202900 20482612 rocksdb:high5
----19918 threads----
306 10688 13 ffff8816ef3a9080 IN 7.7 64202900 20482612 rocksdb:high6
----19918 threads----
307 10688 18 ffff8817413e0000 IN 7.7 64202900 20482612 rocksdb:high7
----21941 threads----
308 11080 3 ffff8828a2e72100 IN 5.0 51334600 13329952 java
----19918 threads----
----21941 threads----
309 10688 26 ffff8817413e4200 IN 7.7 64202900 20482612 java
----21870 threads----
310 11080 29 ffff882818394200 IN 5.0 51334600 13329952 java
....
For comparison, here is another vmcore where the ps command completes quickly:
----1 threads----
832 1 0 ffff91f8b713dc40 IN 0.1 81592 5536 dbus-daemon
----0 threads----
834 1 0 ffff91f8b9ee8000 IN 0.1 29432 3108 chronyd
----2 threads----
836 1 0 ffff91f8b6adbd80 IN 0.3 390660 17212 NetworkManager
----8 threads----
837 1 0 ffff91f8b4f03d80 IN 0.5 1770768 25032 polkitd
----1 threads----
838 1 0 ffff91f8b6addc40 IN 0.1 160152 6968 rngd
----5 threads----
839 1 0 ffff91f8b68c5c40 IN 0.1 101724 3996 gssproxy
----5 threads----
840 1 0 ffff91f8bad48000 IN 0.1 101724 3996 gssproxy
----5 threads----
841 1 0 ffff91f8bad49ec0 IN 0.1 101724 3996 gssproxy
----5 threads----
842 1 0 ffff91f8bad4bd80 IN 0.1 101724 3996 gssproxy
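For reference, the "merge two readmem calls into one" experiment mentioned above looked roughly like this. It is a minimal sketch, assuming the two per-thread rss counters (file pages and anon pages) are adjacent int entries of task_rss_stat.count[]; the exact offsets and error handling in the real code differ:

```c
while (first <= last)
{
        int counts[2];  /* MM_FILEPAGES, MM_ANONPAGES */

        /* one readmem per thread instead of two */
        if (readmem(first->task +
                    OFFSET(task_struct_rss_stat) +   /* task->rss_stat       */
                    OFFSET(task_rss_stat_count),     /* ...rss_stat.count[0] */
                    KVADDR, counts, sizeof(counts),
                    "task_rss_stat.count", RETURN_ON_ERROR))
                rss += counts[0] + counts[1];

        if (first == last)
                break;
        first++;
}
```

Even with the merge, the loop remains O(number of threads) per process, which is why it barely moves the needle here.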
From my point of view, it is difficult to optimize the performance when a vmcore has so many threads to iterate over, and performance degradation is expected in this case.
Any thoughts?
Thanks,
Tao Liu
I did some analysis; the performance bottleneck is task.c:show_ps_data:get_task_mem_usage(tc->task, tm). I modified the code a little to show where the time goes (a sketch of this instrumentation appears after the timing explanation below); the results are as follows:
300 10765 4 ffff8816603d2100 IN 4.6 50726924 12236960 --IO:71.488000ms UC:665.982000ms TO:820.022000ms--java
301 11080 10 ffff882936c4c200 IN 5.0 51334600 13329952 --IO:72.305000ms UC:666.907000ms TO:818.050000ms--java
302 10765 4 ffff881615fce300 IN 4.6 50726924 12236960 --IO:71.338000ms UC:666.138000ms TO:819.704000ms--java
303 10688 13 ffff8826c1dd6300 IN 7.7 64202900 20482612 --IO:65.100000ms UC:605.082000ms TO:741.355000ms--rocksdb:high3
304 10688 29 ffff8826c1dd4200 IN 7.7 64202900 20482612 --IO:65.154000ms UC:605.144000ms TO:741.154000ms--rocksdb:high4
305 10688 12 ffff8816867f2100 IN 7.7 64202900 20482612 --IO:65.081000ms UC:604.962000ms TO:740.791000ms--rocksdb:high5
306 10688 13 ffff8816ef3a9080 IN 7.7 64202900 20482612 --IO:64.864000ms UC:604.935000ms TO:740.851000ms--rocksdb:high6
307 10688 18 ffff8817413e0000 IN 7.7 64202900 20482612 --IO:64.972000ms UC:604.859000ms TO:740.734000ms--rocksdb:high7
308 11080 3 ffff8828a2e72100 IN 5.0 51334600 13329952 --IO:71.905000ms UC:666.892000ms TO:817.826000ms--java
309 10688 26 ffff8817413e4200 IN 7.7 64202900 20482612 --IO:65.026000ms UC:604.982000ms TO:741.286000ms--java
Here, TO is the total time spent in get_task_mem_usage, IO is the time spent on I/O operations within it, and UC is the time spent decompressing dump pages within it. The provided vmcore is compressed with zlib. As the chart shows, zlib decompression takes the largest share of the time (~80%), so the real performance bottleneck is the zlib decompression.
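For completeness, the instrumentation was along these lines; a minimal sketch that assumes crash's usual globals (the output FILE *fp) and measures only the total (TO) time, with the IO/UC breakdown needing similar timers inside the read and decompress paths:

```c
#include <time.h>

/* milliseconds between two CLOCK_MONOTONIC samples */
static double elapsed_ms(struct timespec a, struct timespec b)
{
        return (b.tv_sec - a.tv_sec) * 1000.0 +
               (b.tv_nsec - a.tv_nsec) / 1000000.0;
}

/* wrap one get_task_mem_usage() call and print its total time */
static void timed_get_task_mem_usage(ulong task, struct task_mem_usage *tm)
{
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        get_task_mem_usage(task, tm);
        clock_gettime(CLOCK_MONOTONIC, &end);

        fprintf(fp, "--TO:%fms--", elapsed_ms(start, end));
}
```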
There is currently an effort to introduce zstd into makedumpfile and crash-utility; zstd has better decompression performance. So let's hold this bz for a while: once bz1988894 and bz1988895 introduce zstd, we can see how the performance goes.
Although the zstd patch is not ready yet, I made a preview zstd build of crash and makedumpfile; the results are as follows:
zstd:
300 10765 4 ffff8816603d2100 IN 4.6 50726924 12236960 --IO:65.617000ms UC:134.430000ms TO:335.789000ms--java
301 11080 10 ffff882936c4c200 IN 5.0 51334600 13329952 --IO:65.305000ms UC:135.441000ms TO:329.671000ms--java
302 10765 4 ffff881615fce300 IN 4.6 50726924 12236960 --IO:65.902000ms UC:136.298000ms TO:338.498000ms--java
303 10688 13 ffff8826c1dd6300 IN 7.7 64202900 20482612 --IO:58.565000ms UC:119.940000ms TO:293.834000ms--rocksdb:high3
304 10688 29 ffff8826c1dd4200 IN 7.7 64202900 20482612 --IO:60.134000ms UC:123.286000ms TO:301.443000ms--rocksdb:high4
305 10688 12 ffff8816867f2100 IN 7.7 64202900 20482612 --IO:57.937000ms UC:120.782000ms TO:293.018000ms--rocksdb:high5
306 10688 13 ffff8816ef3a9080 IN 7.7 64202900 20482612 --IO:59.598000ms UC:122.807000ms TO:298.603000ms--rocksdb:high6
307 10688 18 ffff8817413e0000 IN 7.7 64202900 20482612 --IO:60.480000ms UC:125.481000ms TO:303.518000ms--rocksdb:high7
308 11080 3 ffff8828a2e72100 IN 5.0 51334600 13329952 --IO:73.477000ms UC:156.957000ms TO:379.909000ms--java
309 10688 26 ffff8817413e4200 IN 7.7 64202900 20482612 --IO:59.791000ms UC:121.758000ms TO:298.450000ms--java
310 11080 29 ffff882818394200 IN 5.0 51334600 13329952 --IO:66.661000ms UC:137.595000ms TO:337.225000ms--java
zlib:
300 10765 4 ffff8816603d2100 IN 4.6 50726924 12236960 --IO:69.370000ms UC:368.684000ms TO:582.293000ms--java
301 11080 10 ffff882936c4c200 IN 5.0 51334600 13329952 --IO:66.690000ms UC:354.471000ms TO:551.126000ms--java
302 10765 4 ffff881615fce300 IN 4.6 50726924 12236960 --IO:68.033000ms UC:358.157000ms TO:565.365000ms--java
303 10688 13 ffff8826c1dd6300 IN 7.7 64202900 20482612 --IO:60.599000ms UC:322.041000ms TO:500.626000ms--rocksdb:high3
304 10688 29 ffff8826c1dd4200 IN 7.7 64202900 20482612 --IO:61.274000ms UC:323.296000ms TO:501.506000ms--rocksdb:high4
305 10688 12 ffff8816867f2100 IN 7.7 64202900 20482612 --IO:61.183000ms UC:320.451000ms TO:497.747000ms--rocksdb:high5
306 10688 13 ffff8816ef3a9080 IN 7.7 64202900 20482612 --IO:60.073000ms UC:321.479000ms TO:496.462000ms--rocksdb:high6
307 10688 18 ffff8817413e0000 IN 7.7 64202900 20482612 --IO:60.322000ms UC:322.834000ms TO:498.485000ms--rocksdb:high7
308 11080 3 ffff8828a2e72100 IN 5.0 51334600 13329952 --IO:69.348000ms UC:357.250000ms TO:557.994000ms--java
309 10688 26 ffff8817413e4200 IN 7.7 64202900 20482612 --IO:63.237000ms UC:326.888000ms TO:510.211000ms--java
310 11080 29 ffff882818394200 IN 5.0 51334600 13329952 --IO:68.812000ms UC:358.102000ms TO:558.209000ms--java
zstd does have better decompression performance.
As for the zlib UC time difference between comment 7 and comment 8, I guess it is related to differing system load. I will do more testing on the time measurements, but the above charts give a glimpse of zstd's better decompression performance (see the API-level sketch below).
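To make the comparison concrete, this is roughly how the two codecs are driven at the library-API level when a compressed dump page is decoded; a self-contained sketch with illustrative buffer names, not the actual makedumpfile/crash patch:

```c
#include <stddef.h>
#include <zlib.h>   /* link with -lz    */
#include <zstd.h>   /* link with -lzstd */

/* Decompress one dump page. 'src'/'clen' come from the dumpfile;
 * 'page_size' is the uncompressed page size from the kdump header.
 * Returns 0 on success, -1 on failure. */

static int decompress_page_zlib(void *dst, size_t page_size,
                                const void *src, size_t clen)
{
        uLongf destlen = page_size;
        return uncompress((Bytef *)dst, &destlen,
                          (const Bytef *)src, clen) == Z_OK ? 0 : -1;
}

static int decompress_page_zstd(void *dst, size_t page_size,
                                const void *src, size_t clen)
{
        size_t n = ZSTD_decompress(dst, page_size, src, clen);
        return ZSTD_isError(n) ? -1 : 0;
}
```

Both paths take the same inputs; the difference in the UC column above comes purely from decoder speed.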
Upstream patch: d52cccfaa96e ("Improve the ps performance for vmcores with large number of threads")
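For readers curious about the approach: since every thread of a process shares the same mm_struct (note that the per-process rss is identical for all ~20000 siblings in the listings above), the expensive accumulation only needs to run once per process and the result can be reused for the remaining threads. A minimal sketch of that caching idea, with illustrative names (accumulate_thread_rss is hypothetical; this is not the literal patch):

```c
/* Illustrative cache: the rss accumulated for one mm_struct is valid
 * for every thread sharing that mm, so remember the last result and
 * reuse it for the (possibly ~20000) sibling threads. */
static ulong cached_mm;   /* mm_struct address of the last process seen */
static ulong cached_rss;  /* rss computed for that mm                   */

static ulong task_rss(struct task_context *tc, ulong mm_struct_addr)
{
        if (mm_struct_addr && mm_struct_addr == cached_mm)
                return cached_rss;                   /* sibling: reuse */

        cached_rss = accumulate_thread_rss(tc);      /* hypothetical:  */
        cached_mm = mm_struct_addr;                  /* the slow loop  */
        return cached_rss;
}
```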
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (crash bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:1864
Description of problem:

For galvatron task ID 886910425 the 'ps' command was running very slowly. With this core's ~65k processes it would take ~7 hours to finish the command.

Version-Release number of selected component (if applicable):

Issue exists both on the version of crash installed on galvatron, 7.2.7-3.el8, and upstream version 7.2.9++ (commit dccdf95c)

How reproducible:

Always

Steps to Reproduce:
1. retrace-server-interact 886910425 crash
2. ps

Actual results:

Running 'ps > ps.out' in crash:

crash 7.2.7-3.el8:

[jsiddle@galvatron crash]$ wc -l ps.out ; sleep 60 ; wc -l ps.out
273 ps.out
411 ps.out

138/min

crash 7.2.9++:

[jsiddle@galvatron crash]$ wc -l ps.out ; sleep 60 ; wc -l ps.out
209 ps.out
341 ps.out

132/min

Expected results:

Additional info: