Bug 2049284
| Summary: | [RHEL 8.7] makedumpfile -D --dump-dmesg runs in a loop | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Audra Mitchell <aubaker> | |
| Component: | kexec-tools | Assignee: | Philipp Rudo <prudo> | |
| Status: | CLOSED ERRATA | QA Contact: | xiaoying yan <yiyan> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | high | |||
| Version: | 8.4 | CC: | dwysocha, jieli, lichliu, prudo, ruyang, xiawu | |
| Target Milestone: | rc | Keywords: | Triaged | |
| Target Release: | --- | Flags: | pm-rhel: mirror+ | |
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | kexec-tools-2.0.20-69.el8 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 2069200 (view as bug list) | Environment: | ||
| Last Closed: | 2022-11-08 10:46:41 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2069200 | |||
FWIW, here is the crash utility code that detects this problem. Note that 'hq_enter()' is a function that stores a pointer in a hash table and detects duplicate entries, hence the way crash deals with this problem:

https://github.com/crash-utility/crash/blob/master/kernel.c#L5395

    hq_open();

    idx = log_first_idx;
    while (idx != log_next_idx) {
            logptr = log_from_idx(idx, logbuf);

            dump_log_entry(logptr, msg_flags);

            if (!hq_enter((ulong)logptr)) {
                    error(INFO, "\nduplicate log_buf message pointer\n");
                    break;
            }

            idx = log_next(idx, logbuf);

            if (idx >= log_buf_len) {
                    if (log_first_idx > log_next_idx)
                            idx = 0;
                    else {
                            error(INFO, "\ninvalid log_buf entry encountered\n");
                            break;
                    }
            }

            if (CRASHDEBUG(1) && (idx == log_next_idx))
                    fprintf(fp, "\nfound log_next_idx OK\n");
    }

    hq_close();

Hi,

I posted the fix for this bug upstream. I decided to go with the generic Brent algorithm.

@Dave: I added a Suggested-by for you (hope you don't mind) and added you on Cc (in case you want to take a look ;))

(In reply to Philipp Rudo from comment #10)
> Hi,
>
> I posted the fix for this bug upstream. I decided to go with the generic
> Brent algorithm.
>
> @Dave: I added a Suggested-by for you (hope you don't mind) and added you
> on Cc (in case you want to take a look ;))

Great job Philipp - yes it's fine to add me as Suggested-by, I think that was the right thing to do. I'll definitely put it on my list to review, maybe later today or tomorrow. Thanks for posting the patches.

(In reply to Philipp Rudo from comment #10)
> Hi,
>
> I posted the fix for this bug upstream. I decided to go with the generic
> Brent algorithm.
>
> @Dave: I added a Suggested-by for you (hope you don't mind) and added you
> on Cc (in case you want to take a look ;))

Hey Philipp - do you have a brew / test build I can use?
I was thinking of running through a series of vmcores and comparing the existing makedumpfile output with these new patches, just as a good test, and also so I can step through the code a bit.

Actually nevermind - I figured out what I needed to rebuild so I can use the upstream code plus your patches to test it out.

(In reply to Dave Wysochanski from comment #13)
> Actually nevermind - I figured out what I needed to rebuild so I can use the
> upstream code plus your patches to test it out.

Too late, I finally managed to get the brew build running
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=43642469
I had to fudge the patches a little though and haven't properly tested it yet. A quick test worked however.

(In reply to Philipp Rudo from comment #14)
> (In reply to Dave Wysochanski from comment #13)
> > Actually nevermind - I figured out what I needed to rebuild so I can use the
> > upstream code plus your patches to test it out.
>
> Too late, I finally managed to get the brew build running
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=43642469
> I had to fudge the patches a little though and haven't properly tested it
> yet. A quick test worked however.

Thanks! I'll definitely install and do another set of tests with your build.

FWIW, I rebuilt upstream plus your patches and am seeing a lot of differences when I run "makedumpfile -D --dump-dmesg" on the same vmcores, comparing makedumpfile in kexec-tools-2.0.20-46.el8_4.2.x86_64 and the upstream + your 3 patches. I backed out your 3 patches and it still failed so it looks like a separate upstream bug (my upstream is at "59b1726 [PATCH] sadump, kaslr: fix failure of calculating kaslr_offset"). Any idea what's going on here (see below)?

Example (same vmcore)

1. Upstream + your 3 patches: FAIL

        $ ./makedumpfile --dump-dmesg /cores/retrace/tasks/164553756/crash/vmcore /tmp/164553756-dmesg.txt
        __vtop4_x86_64: Can't get a valid pgd.
        readmem: Can't convert a virtual address(ffffffff99c18604) to physical address.
        readmem: type_addr: 0, addr:ffffffff99c18604, size:390
        check_release: Can't get the address of system_utsname.

        makedumpfile Failed.

2. Makedumpfile in kexec-tools-2.0.20-46.el8_4.2.x86_64: SUCCESS

        $ makedumpfile --dump-dmesg /cores/retrace/tasks/164553756/crash/vmcore /tmp/164553756-dmesg.txt

        The dmesg log is saved to /tmp/164553756-dmesg.txt.

        makedumpfile Completed.

3. Makedumpfile upstream at latest (minus your patches)

        $ git log --oneline | head -1
        59b1726 [PATCH] sadump, kaslr: fix failure of calculating kaslr_offset
        $ make LINKTYPE=dynamic
        make: Nothing to be done for 'all'.
        $ ./makedumpfile --dump-dmesg /cores/retrace/tasks/164553756/crash/vmcore /tmp/164553756-dmesg.txt
        __vtop4_x86_64: Can't get a valid pgd.
        readmem: Can't convert a virtual address(ffffffff99c18604) to physical address.
        readmem: type_addr: 0, addr:ffffffff99c18604, size:390
        check_release: Can't get the address of system_utsname.

        makedumpfile Failed.

(In reply to Philipp Rudo from comment #14)
> (In reply to Dave Wysochanski from comment #13)
> > Actually nevermind - I figured out what I needed to rebuild so I can use the
> > upstream code plus your patches to test it out.
>
> Too late, I finally managed to get the brew build running
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=43642469
> I had to fudge the patches a little though and haven't properly tested it
> yet. A quick test worked however.

This build does not have the problem in comment #15 and #16 so this is an upstream bug unrelated to these patches.

FWIW, here is the test script I'm running against all the vmcores on our internal system. I extracted the brew build with:

    $ rpm2cpio kexec-tools-2.0.20-46.el8_4.3.x86_64.rpm | cpio -idv

    $ cat test-bz2049284.sh
    #!/bin/bash
    echo "START TEST: $(date)"
    for t in $(ls -1d /cores/retrace/tasks/[1-9]*); do
        echo "processing $t"
        if [ ! -e $t/downloaded ]; then
            echo "SKIPPING $t because $t/downloaded does not exist"
            continue
        fi
        grep -q 03070978 $t/downloaded
        if [ $? -eq 0 ]; then
            echo SKIPPING $t because of potential infinite loop makedumpfile bug
            continue
        fi
        if [ ! -e $t/crash/vmcore ]; then
            echo "SKIPPING $t because $t/crash/vmcore does not exist"
            continue
        fi
        T=$(basename $t)
        mkdir -p ./$T
        makedumpfile --dump-dmesg $t/crash/vmcore ./$T/dmesg-$(rpm -qf `which makedumpfile`).txt
        ./usr/sbin/makedumpfile --dump-dmesg $t/crash/vmcore ./$T/dmesg-makedumpfile-brewbuild.txt
        diff -q ./$T/dmesg-$(rpm -qf `which makedumpfile`).txt ./$T/dmesg-makedumpfile-brewbuild.txt
        if [ $? -ne 0 ]; then
            echo FOUND DIFFERENCE between old and new makedumpfile for $T
        fi
    done
    echo "END TEST: $(date)"

The patches (via the brew build) look very good to me as far as testing goes. For regression, I ran through over 1,000 vmcores on our production system comparing the original makedumpfile with the brew build and there was no difference in output for "makedumpfile --dump-dmesg".

And if I run the brew build on the original vmcore I get this handled appropriately:

    $ ./usr/sbin/makedumpfile --dump-dmesg /cores/retrace/tasks/783792817/crash/vmcore --message-level 31 /tmp/dmesg-test2.txt
    ...
    log_buf : ffffffff9320487c
    log_end : 0
    log_buf_len : 1048576
    log_first_idx : 0
    log_next_idx : 241384
    dump_dmesg: Cycle when parsing dmesg detected.
    dump_dmesg: The printk log_buf is most likely corrupted.
    dump_dmesg: log_buf = 0xffffffff9320487c, idx = 0x39644

    makedumpfile Failed.
    $ tail /tmp/dmesg-test2.txt
    [338781.860977] RPC: fragment too large: 1212501072
    [338786.843612] RPC: fragment too large: 1224736768
    [338787.881610] RPC: fragment too large: 50399744
    [371866.141781] VXI[28382]: segfault at 2e2e2e000018 ip 0000000000777ef6 sp 00007f886ea556d0 error 4 in VXI[400000+524000]
    [425150.902862] RPC: fragment too large: 612067950
    [425156.535970] RPC: fragment too large: 352518400
    [425163.679392] RPC: fragment too large: 1212501072
    [425168.603647] RPC: fragment too large: 1224736768
    [425169.804132] RPC: fragment too large: 50399744
    [472823.108369] traps: VXI[15158] general protection ip:7fb15facaf6f sp:7fb154c38640 error\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00

Hi Dave,

(In reply to Dave Wysochanski from comment #15)
> (In reply to Philipp Rudo from comment #14)
> > (In reply to Dave Wysochanski from comment #13)
> > > Actually nevermind - I figured out what I needed to rebuild so I can use the
> > > upstream code plus your patches to test it out.
> >
> > Too late, I finally managed to get the brew build running
> > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=43642469
> > I had to fudge the patches a little though and haven't properly tested it
> > yet. A quick test worked however.
>
> Thanks! I'll definitely install and do another set of tests with your build.
>
> FWIW, I rebuilt upstream plus your patches and am seeing a lot of
> differences when I run "makedumpfile -D --dump-dmesg" on the same vmcores,
> comparing makedumpfile in kexec-tools-2.0.20-46.el8_4.2.x86_64 and the
> upstream + your 3 patches. I backed out your 3 patches and it still failed
> so it looks like a separate upstream bug (my upstream is at "59b1726 [PATCH]
> sadump, kaslr: fix failure of calculating kaslr_offset"). Any idea what's
> going on here (see below)?
>
> Example (same vmcore)
>
> 1. Upstream + your 3 patches: FAIL
> $ ./makedumpfile --dump-dmesg /cores/retrace/tasks/164553756/crash/vmcore
> /tmp/164553756-dmesg.txt
> __vtop4_x86_64: Can't get a valid pgd.
> readmem: Can't convert a virtual address(ffffffff99c18604) to physical
> address.
> readmem: type_addr: 0, addr:ffffffff99c18604, size:390
> check_release: Can't get the address of system_utsname.
>
> makedumpfile Failed.
>
> 2. Makedumpfile in kexec-tools-2.0.20-46.el8_4.2.x86_64: SUCCESS
> $ makedumpfile --dump-dmesg /cores/retrace/tasks/164553756/crash/vmcore
> /tmp/164553756-dmesg.txt
>
> The dmesg log is saved to /tmp/164553756-dmesg.txt.
>
> makedumpfile Completed.

For makedumpfile you have to explicitly specify which compression algorithms shall be supported when building the binary. For Fedora we use

    $ make LINKTYPE=dynamic USELZO=on USESNAPPY=on USEZSTD=on

(the USEZSTD is rather new and not supported by rhel yet but it shouldn't make a difference when you use the line for older versions as well. AFAIK unknown options are simply ignored).

The problem you are hitting here is that makedumpfile only checks whether the required compression algorithm is compiled in when you compress a dump but not when you read from one... So you get this meaningless error message when makedumpfile tries to read from the dump for the first time...

Hi Dave,

(In reply to Dave Wysochanski from comment #19)
> The patches (via the brew build) look very good to me as far as testing goes.
> For regression, I ran through over 1,000 vmcores on our production system
> comparing the original makedumpfile with the brew build and there was no
> difference in output for "makedumpfile --dump-dmesg".
>
> And if I run the brew build on the original vmcore I get this handled
> appropriately:
>
> $ ./usr/sbin/makedumpfile --dump-dmesg
> /cores/retrace/tasks/783792817/crash/vmcore --message-level 31
> /tmp/dmesg-test2.txt
> ...
> log_buf : ffffffff9320487c
> log_end : 0
> log_buf_len : 1048576
> log_first_idx : 0
> log_next_idx : 241384
> dump_dmesg: Cycle when parsing dmesg detected.
> dump_dmesg: The printk log_buf is most likely corrupted.
> dump_dmesg: log_buf = 0xffffffff9320487c, idx = 0x39644
>
> makedumpfile Failed.
> $ tail /tmp/dmesg-test2.txt
> [338781.860977] RPC: fragment too large: 1212501072
> [338786.843612] RPC: fragment too large: 1224736768
> [338787.881610] RPC: fragment too large: 50399744
> [371866.141781] VXI[28382]: segfault at 2e2e2e000018 ip 0000000000777ef6 sp
> 00007f886ea556d0 error 4 in VXI[400000+524000]
> [425150.902862] RPC: fragment too large: 612067950
> [425156.535970] RPC: fragment too large: 352518400
> [425163.679392] RPC: fragment too large: 1212501072
> [425168.603647] RPC: fragment too large: 1224736768
> [425169.804132] RPC: fragment too large: 50399744
> [472823.108369] traps: VXI[15158] general protection ip:7fb15facaf6f
> sp:7fb154c38640
> error\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
> \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00

Thanks for testing! Could you do me a favor and write a short email to the upstream mailing list with your test results? I think that's something the upstream folks are interested in, too.

(In reply to Philipp Rudo from comment #22)
> Hi Dave,
>
> (In reply to Dave Wysochanski from comment #15)
> > (In reply to Philipp Rudo from comment #14)
> > > (In reply to Dave Wysochanski from comment #13)
> > > > Actually nevermind - I figured out what I needed to rebuild so I can use the
> > > > upstream code plus your patches to test it out.
> > >
> > > Too late, I finally managed to get the brew build running
> > > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=43642469
> > > I had to fudge the patches a little though and haven't properly tested it
> > > yet. A quick test worked however.
> >
> > Thanks! I'll definitely install and do another set of tests with your build.
> >
> > FWIW, I rebuilt upstream plus your patches and am seeing a lot of
> > differences when I run "makedumpfile -D --dump-dmesg" on the same vmcores,
> > comparing makedumpfile in kexec-tools-2.0.20-46.el8_4.2.x86_64 and the
> > upstream + your 3 patches. I backed out your 3 patches and it still failed
> > so it looks like a separate upstream bug (my upstream is at "59b1726 [PATCH]
> > sadump, kaslr: fix failure of calculating kaslr_offset"). Any idea what's
> > going on here (see below)?
> >
> > Example (same vmcore)
> >
> > 1. Upstream + your 3 patches: FAIL
> > $ ./makedumpfile --dump-dmesg /cores/retrace/tasks/164553756/crash/vmcore
> > /tmp/164553756-dmesg.txt
> > __vtop4_x86_64: Can't get a valid pgd.
> > readmem: Can't convert a virtual address(ffffffff99c18604) to physical
> > address.
> > readmem: type_addr: 0, addr:ffffffff99c18604, size:390
> > check_release: Can't get the address of system_utsname.
> >
> > makedumpfile Failed.
> >
> > 2. Makedumpfile in kexec-tools-2.0.20-46.el8_4.2.x86_64: SUCCESS
> > $ makedumpfile --dump-dmesg /cores/retrace/tasks/164553756/crash/vmcore
> > /tmp/164553756-dmesg.txt
> >
> > The dmesg log is saved to /tmp/164553756-dmesg.txt.
> >
> > makedumpfile Completed.
>
> For makedumpfile you have to explicitly specify which compression algorithms
> shall be supported when building the binary. For Fedora we use
>
> $ make LINKTYPE=dynamic USELZO=on USESNAPPY=on USEZSTD=on
>
> (the USEZSTD is rather new and not supported by rhel yet but it shouldn't
> make a difference when you use the line for older versions as well. AFAIK
> unknown options are simply ignored).
> The problem you are hitting here is that makedumpfile only checks whether
> the required compression algorithm is compiled in when you compress a dump
> but not when you read from one...
> So you get this meaningless error message when makedumpfile tries to read
> from the dump for the first time...

Thank you for pointing that out! Also for doing the patch to fix up the error message. I wondered if I had gotten the build step wrong and sure enough... But the new error message is better.

I gave your v2 patchset a good test and thumbs-up on the list. Thanks again.

@Dave W.: I've posted the fix for rhel 8.7. Please open a z-stream request when you also want to have it fixed in 8.4.

Thanks
Philipp

(In reply to Philipp Rudo from comment #27)
> @Dave W.: I've posted the fix for rhel 8.7. Please open a z-stream request
> when you also want to have it fixed in 8.4.

Thanks for getting this into 8.7 - this may be enough for now. I'm not sure about z-stream backports, maybe 8.6 would make sense? Is there some reason you're thinking about 8.4? If we do that I think we need to do it for 8.6 as well (8.5 seems out of the question due to no EUS and out of time for last z-stream). Our production vmcore machines will get upgraded and I can also upgrade individual packages like kexec-tools for this bug so we don't need 8.4 backport for those.

Hi Dave,

(In reply to Dave Wysochanski from comment #29)
> (In reply to Philipp Rudo from comment #27)
> > @Dave W.: I've posted the fix for rhel 8.7. Please open a z-stream request
> > when you also want to have it fixed in 8.4.
>
> Thanks for getting this into 8.7 - this may be enough for now. I'm not sure
> about z-stream backports, maybe 8.6 would make sense? Is there some reason
> you're thinking about 8.4? If we do that I think we need to do it for 8.6
> as well (8.5 seems out of the question due to no EUS and out of time for
> last z-stream). Our production vmcore machines will get upgraded and I can
> also upgrade individual packages like kexec-tools for this bug so we don't
> need 8.4 backport for those.

I wasn't sure if you need it for galvatron as it currently runs 8.4.
But when you have ways to upgrade individual packages on it that's a lot easier than a z-stream for us, too.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (kexec-tools bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7705
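For readers unfamiliar with the "generic Brent algorithm" mentioned in this thread: it is a cycle detector that needs only constant extra memory, unlike the hash table behind crash's hq_enter(). The following is a minimal sketch of the idea under stated assumptions, not the makedumpfile patch. The log is modeled as a hypothetical `next[]` array of successor indices, and `walk_log()` and its return codes are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: next[i] is the index of the record that follows
 * record i. A healthy log leads from `first` to `end`; a corrupted one
 * may loop back onto itself and never reach `end`.
 *
 * Returns the number of records visited, -1 if a cycle is detected,
 * or -2 if an index falls outside the buffer.
 */
long walk_log(const size_t *next, size_t n, size_t first, size_t end)
{
    size_t idx = first;
    size_t saved = first;   /* checkpoint that idx is compared against */
    size_t steps = 0;       /* steps taken since the last checkpoint */
    size_t limit = 1;       /* checkpoint interval, doubled each time */
    long visited = 0;

    assert(next != NULL && n > 0);

    while (idx != end) {
        if (idx >= n)
            return -2;      /* invalid entry, comparable to crash's bounds check */

        visited++;          /* a real tool would print the record here */
        idx = next[idx];

        /* Coming back to the checkpoint means the walk is in a loop. */
        if (idx == saved)
            return -1;

        /* Brent's trick: move the checkpoint forward at 1, 2, 4, ... steps. */
        if (++steps == limit) {
            saved = idx;
            steps = 0;
            limit *= 2;
        }
    }
    return visited;
}
```

With `next[] = {1, 2, 3, 4, 4}`, `first = 0` and `end = 4`, the walk visits four records and returns 4. If a corrupted entry makes record 3 point back at record 1 (`{1, 2, 3, 1, 4}`), the walker returns -1 instead of looping forever, which is the behavior the "Cycle when parsing dmesg detected" message above reports.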
Description of problem:
Running this command: "makedumpfile -D --dump-dmesg vmcore vmcore-dmesg.txt" will run indefinitely, causing vmcore-dmesg.txt to grow continuously.

Version-Release number of selected component (if applicable):

    $ rpm -q kexec-tools
    kexec-tools-2.0.20-46.el8_4.2.x86_64
    $ uname -r
    4.18.0-305.25.1.el8_4.x86_64

How reproducible:
With this core, every time.

Steps to Reproduce:
Added in private comment to the BZ

Actual results:
Command never completes.

Expected results:
Expect command to complete or error out.

Additional info: