Bug 2049284

Summary:	[RHEL 8.7] makedumpfile -D --dump-dmesg runs in a loop
Product:	Red Hat Enterprise Linux 8	Reporter:	Audra Mitchell <aubaker>
Component:	kexec-tools	Assignee:	Philipp Rudo <prudo>
Status:	CLOSED ERRATA	QA Contact:	xiaoying yan <yiyan>
Severity:	medium	Docs Contact:
Priority:	high
Version:	8.4	CC:	dwysocha, jieli, lichliu, prudo, ruyang, xiawu
Target Milestone:	rc	Keywords:	Triaged
Target Release:	---	Flags:	pm-rhel: mirror+
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	kexec-tools-2.0.20-69.el8	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	2069200 (view as bug list)		Environment:
Last Closed:	2022-11-08 10:46:41 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2069200

Description Audra Mitchell 2022-02-01 20:43:23 UTC

Description of problem:

  Running this command:
    "makedumpfile -D --dump-dmesg vmcore vmcore-dmesg.txt" 

  will run indefinitely, causing vmcore-dmesg.txt to grow continuously.

Version-Release number of selected component (if applicable):

  $ rpm -q kexec-tools
  kexec-tools-2.0.20-46.el8_4.2.x86_64

  $ uname -r
  4.18.0-305.25.1.el8_4.x86_64

How reproducible:

  With this core, every time.


Steps to Reproduce:

  Added in private comment to the BZ

Actual results:

  Command never completes.

Expected results:

  Expect command to complete or error out.


Additional info:

Comment 7 Dave Wysochanski 2022-03-03 13:19:07 UTC

FWIW, here is the crash utility code that detects this problem.  Note that 'hq_enter()' stores a pointer in a hash table and detects duplicate entries, which is how crash deals with this problem:

https://github.com/crash-utility/crash/blob/master/kernel.c#L5395

        hq_open();

        idx = log_first_idx;
        while (idx != log_next_idx) {
                logptr = log_from_idx(idx, logbuf);

                dump_log_entry(logptr, msg_flags);

                if (!hq_enter((ulong)logptr)) {
                        error(INFO, "\nduplicate log_buf message pointer\n");
                        break;
                }

                idx = log_next(idx, logbuf);

                if (idx >= log_buf_len) {
                        if (log_first_idx > log_next_idx)
                                idx = 0;
                        else {
                                error(INFO, "\ninvalid log_buf entry encountered\n");
                                break;
                        }
                }

                if (CRASHDEBUG(1) && (idx == log_next_idx))
                        fprintf(fp, "\nfound log_next_idx OK\n");
        }

        hq_close();
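
For illustration, the duplicate-detection idea behind the loop above can be sketched in isolation. This is a minimal sketch and not crash's implementation: `seen_set`, `seen_enter()` and `walk_chain()` are hypothetical names, and the log buffer is modelled as a plain `next[]` array in which `next[i]` is the index that follows entry `i`.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a "have I seen this before?" hash table:
 * a fixed-capacity open-addressing set of indices. */
struct seen_set {
	size_t *slots;	/* 0 marks an empty slot, so index i is stored as i + 1 */
	size_t cap;	/* number of slots, a power of two */
};

static int seen_init(struct seen_set *s, size_t cap)
{
	s->slots = calloc(cap, sizeof(*s->slots));
	s->cap = cap;
	return s->slots ? 0 : -1;
}

/* Returns 1 if idx was newly recorded, 0 if it was already present,
 * -1 if the table is full. */
static int seen_enter(struct seen_set *s, size_t idx)
{
	size_t key = idx + 1;
	size_t i = (key * 2654435761UL) & (s->cap - 1);

	for (size_t probes = 0; probes < s->cap; probes++) {
		if (s->slots[i] == key)
			return 0;
		if (s->slots[i] == 0) {
			s->slots[i] = key;
			return 1;
		}
		i = (i + 1) & (s->cap - 1);	/* linear probing */
	}
	return -1;
}

/* Walk from first until end is reached.  Returns the number of entries
 * visited, -1 if an entry is visited twice (a cycle), -2 if an index
 * falls outside the table, -3 on allocation failure. */
static long walk_chain(const size_t *next, size_t n, size_t first, size_t end)
{
	struct seen_set seen;
	size_t idx = first, cap = 1;
	long count = 0;

	while (cap < 2 * n + 1)		/* keep the table at most half full */
		cap <<= 1;
	if (seen_init(&seen, cap) < 0)
		return -3;

	while (idx != end) {
		if (idx >= n) {
			count = -2;
			break;
		}
		if (seen_enter(&seen, idx) != 1) {
			count = -1;
			break;
		}
		count++;
		idx = next[idx];
	}
	free(seen.slots);
	return count;
}
```

With `next[] = {1, 2, 3, 1}`, a walk from index 0 towards end index 4 revisits entry 1 and is reported as a cycle instead of looping forever. The cost of this approach is memory proportional to the number of entries visited.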

Comment 10 Philipp Rudo 2022-03-07 17:26:53 UTC

Hi,

I posted the fix for this bug upstream. I decided to go with the generic Brent algorithm.

@Dave: I added a Suggested-by for you (hope you don't mind) and added you on Cc (in case you want to take a look ;))
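
For readers unfamiliar with it, Brent's algorithm detects a cycle in a sequence using constant memory: it keeps one checkpoint and moves that checkpoint forward at power-of-two step counts. The following is a minimal generic sketch, not the code posted upstream. `brent_has_cycle()` is a hypothetical name, and the log buffer is modelled as a `next[]` array in which `next[i]` is the index that follows entry `i`.

```c
#include <stddef.h>

/* Brent's cycle detection over the sequence first, next[first], ...
 * Returns 0 if end is reached, 1 if the walk enters a cycle first,
 * -1 if an index falls outside the table of n entries. */
static int brent_has_cycle(const size_t *next, size_t n,
			   size_t first, size_t end)
{
	size_t tortoise = first;	/* checkpoint the hare is compared against */
	size_t hare = first;		/* position that advances on every step */
	size_t power = 1;		/* steps allowed before moving the checkpoint */
	size_t steps = 0;		/* steps taken since the last checkpoint */

	while (hare != end) {
		if (hare >= n)
			return -1;
		hare = next[hare];
		if (hare == tortoise)
			return 1;	/* back at the checkpoint: cycle */
		if (++steps == power) {
			/* Move the checkpoint and double the window so that
			 * it eventually lands inside any cycle and the window
			 * grows longer than the cycle. */
			tortoise = hare;
			power *= 2;
			steps = 0;
		}
	}
	return 0;
}
```

Compared with a hash table of visited entries, this needs no extra memory. The trade-off is that a cycle is only reported after the walk has gone around it, so some entries may be processed more than once before detection.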

Comment 11 Dave Wysochanski 2022-03-07 17:47:47 UTC

(In reply to Philipp Rudo from comment #10)
> Hi,
> 
> I posted the fix for this bug upstream. I decided to go with the generic
> Brent algorithm.
> 
> @Dave: I added a Suggested-by for you (hope you don't mind) and added you
> on Cc (in case you want to take a look ;))

Great job Philipp - yes it's fine to add me as Suggested-by, I think that was the right thing to do.
I'll definitely put it on my list to review, maybe later today or tomorrow.  Thanks for posting the patches.

Comment 12 Dave Wysochanski 2022-03-07 22:56:09 UTC

(In reply to Philipp Rudo from comment #10)
> Hi,
> 
> I posted the fix for this bug upstream. I decided to go with the generic
> Brent algorithm.
> 
> @Dave: I added a Suggested-by for you (hope you don't mind) and added you
> on Cc (in case you want to take a look ;))

Hey Philipp - do you have a brew / test build I can use?  I was thinking of running through a series of vmcores and comparing the existing makedumpfile output with these new patches, just as a good test, and also so I can step through the code a bit.

Comment 13 Dave Wysochanski 2022-03-08 13:37:31 UTC

Actually, never mind - I figured out what I needed to rebuild so I can use the upstream code plus your patches to test it out.

Comment 14 Philipp Rudo 2022-03-08 14:30:39 UTC

(In reply to Dave Wysochanski from comment #13)
> Actually nevermind - I figured out what I needed to rebuild so I can use the
> upstream code plus your patches to test it out.

Too late, I finally managed to get the brew build running: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=43642469
I had to fudge the patches a little, though, and haven't properly tested it yet. A quick test worked, however.

Comment 15 Dave Wysochanski 2022-03-08 16:12:36 UTC

(In reply to Philipp Rudo from comment #14)
> (In reply to Dave Wysochanski from comment #13)
> > Actually nevermind - I figured out what I needed to rebuild so I can use the
> > upstream code plus your patches to test it out.
> 
> Too late, I finally managed to get the brew build running
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=43642469
> I had to fudge the patches a little though and haven't properly tested it
> yet. A quick test worked however.

Thanks!  I'll definitely install and do another set of tests with your build.

FWIW, I rebuilt upstream plus your patches and am seeing a lot of differences when I run "makedumpfile -D --dump-dmesg" on the same vmcores, comparing makedumpfile in kexec-tools-2.0.20-46.el8_4.2.x86_64 and the upstream + your 3 patches.  I backed out your 3 patches and it still failed so it looks like a separate upstream bug (my upstream is at "59b1726 [PATCH] sadump, kaslr: fix failure of calculating kaslr_offset").  Any idea what's going on here (see below)?


Example (same vmcore)

1. Upstream + your 3 patches: FAIL
$ ./makedumpfile --dump-dmesg /cores/retrace/tasks/164553756/crash/vmcore /tmp/164553756-dmesg.txt
__vtop4_x86_64: Can't get a valid pgd.
readmem: Can't convert a virtual address(ffffffff99c18604) to physical address.
readmem: type_addr: 0, addr:ffffffff99c18604, size:390
check_release: Can't get the address of system_utsname.

makedumpfile Failed.


2. Makedumpfile in kexec-tools-2.0.20-46.el8_4.2.x86_64: SUCCESS
$ makedumpfile --dump-dmesg /cores/retrace/tasks/164553756/crash/vmcore /tmp/164553756-dmesg.txt

The dmesg log is saved to /tmp/164553756-dmesg.txt.

makedumpfile Completed.

Comment 16 Dave Wysochanski 2022-03-08 16:14:05 UTC

3. Makedumpfile upstream at latest (minus your patches)

$ git log --oneline | head -1
59b1726 [PATCH] sadump, kaslr: fix failure of calculating kaslr_offset
$ make LINKTYPE=dynamic
make: Nothing to be done for 'all'.
$ ./makedumpfile --dump-dmesg /cores/retrace/tasks/164553756/crash/vmcore /tmp/164553756-dmesg.txt
__vtop4_x86_64: Can't get a valid pgd.
readmem: Can't convert a virtual address(ffffffff99c18604) to physical address.
readmem: type_addr: 0, addr:ffffffff99c18604, size:390
check_release: Can't get the address of system_utsname.

makedumpfile Failed.

Comment 17 Dave Wysochanski 2022-03-08 16:22:03 UTC

(In reply to Philipp Rudo from comment #14)
> (In reply to Dave Wysochanski from comment #13)
> > Actually nevermind - I figured out what I needed to rebuild so I can use the
> > upstream code plus your patches to test it out.
> 
> Too late, I finally managed to get the brew build running
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=43642469
> I had to fudge the patches a little though and haven't properly tested it
> yet. A quick test worked however.

This build does not have the problem shown in comments #15 and #16, so that is an upstream bug unrelated to these patches.

Comment 18 Dave Wysochanski 2022-03-08 16:26:14 UTC

FWIW, here is the test script I'm running against all the vmcores on our internal system. I extracted the brew build with:
$ rpm2cpio kexec-tools-2.0.20-46.el8_4.3.x86_64.rpm | cpio -idv

$ cat test-bz2049284.sh
#!/bin/bash
# Compare --dump-dmesg output of the installed makedumpfile with the brew build.
echo "START TEST: $(date)"
# Package name of the installed makedumpfile, used to label its output files.
installed=$(rpm -qf "$(command -v makedumpfile)")
for t in /cores/retrace/tasks/[1-9]*; do
	echo "processing $t"
	if [ ! -e "$t/downloaded" ]; then
		echo "SKIPPING $t because $t/downloaded does not exist"
		continue
	fi
	# The installed makedumpfile may never finish on these tasks.
	if grep -q 03070978 "$t/downloaded"; then
		echo "SKIPPING $t because of potential infinite loop makedumpfile bug"
		continue
	fi
	if [ ! -e "$t/crash/vmcore" ]; then
		echo "SKIPPING $t because $t/crash/vmcore does not exist"
		continue
	fi
	T=$(basename "$t")
	mkdir -p "./$T"
	old="./$T/dmesg-$installed.txt"
	new="./$T/dmesg-makedumpfile-brewbuild.txt"
	makedumpfile --dump-dmesg "$t/crash/vmcore" "$old"
	./usr/sbin/makedumpfile --dump-dmesg "$t/crash/vmcore" "$new"
	if ! diff -q "$old" "$new"; then
		echo "FOUND DIFFERENCE between old and new makedumpfile for $T"
	fi
done
echo "END TEST: $(date)"

Comment 19 Dave Wysochanski 2022-03-08 17:06:43 UTC

The patches (via the brew build) look very good to me as far as testing goes.
For regression, I ran through over 1,000 vmcores on our production system comparing the original makedumpfile with the brew build and there was no difference in output for "makedumpfile --dump-dmesg".

And if I run the brew build on the original vmcore, the cycle is detected and handled appropriately:

$ ./usr/sbin/makedumpfile --dump-dmesg /cores/retrace/tasks/783792817/crash/vmcore --message-level 31 /tmp/dmesg-test2.txt
...
log_buf       : ffffffff9320487c
log_end       : 0
log_buf_len   : 1048576
log_first_idx : 0
log_next_idx  : 241384
dump_dmesg: Cycle when parsing dmesg detected.
dump_dmesg: The printk log_buf is most likely corrupted.
dump_dmesg: log_buf = 0xffffffff9320487c, idx = 0x39644

makedumpfile Failed.
$ tail /tmp/dmesg-test2.txt
[338781.860977] RPC: fragment too large: 1212501072
[338786.843612] RPC: fragment too large: 1224736768
[338787.881610] RPC: fragment too large: 50399744
[371866.141781] VXI[28382]: segfault at 2e2e2e000018 ip 0000000000777ef6 sp 00007f886ea556d0 error 4 in VXI[400000+524000]
[425150.902862] RPC: fragment too large: 612067950
[425156.535970] RPC: fragment too large: 352518400
[425163.679392] RPC: fragment too large: 1212501072
[425168.603647] RPC: fragment too large: 1224736768
[425169.804132] RPC: fragment too large: 50399744
[472823.108369] traps: VXI[15158] general protection ip:7fb15facaf6f sp:7fb154c38640 error\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00

Comment 22 Philipp Rudo 2022-03-09 12:18:11 UTC

Hi Dave,

(In reply to Dave Wysochanski from comment #15)
> (In reply to Philipp Rudo from comment #14)
> > (In reply to Dave Wysochanski from comment #13)
> > > Actually nevermind - I figured out what I needed to rebuild so I can use the
> > > upstream code plus your patches to test it out.
> > 
> > Too late, I finally managed to get the brew build running
> > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=43642469
> > I had to fudge the patches a little though and haven't properly tested it
> > yet. A quick test worked however.
> 
> Thanks!  I'll definitely install and do another set of tests with your build.
> 
> FWIW, I rebuilt upstream plus your patches and am seeing a lot of
> differences when I run "makedumpfile -D --dump-dmesg" on the same vmcores,
> comparing makedumpfile in kexec-tools-2.0.20-46.el8_4.2.x86_64 and the
> upstream + your 3 patches.  I backed out your 3 patches and it still failed
> so it looks like a separate upstream bug (my upstream is at "59b1726 [PATCH]
> sadump, kaslr: fix failure of calculating kaslr_offset").  Any idea what's
> going on here (see below)?
> 
> 
> Example (same vmcore)
> 
> 1. Upstream + your 3 patches: FAIL
> $ ./makedumpfile --dump-dmesg /cores/retrace/tasks/164553756/crash/vmcore
> /tmp/164553756-dmesg.txt
> __vtop4_x86_64: Can't get a valid pgd.
> readmem: Can't convert a virtual address(ffffffff99c18604) to physical
> address.
> readmem: type_addr: 0, addr:ffffffff99c18604, size:390
> check_release: Can't get the address of system_utsname.
> 
> makedumpfile Failed.
> 
> 
> 2. Makedumpfile in kexec-tools-2.0.20-46.el8_4.2.x86_64: SUCCESS
> $ makedumpfile --dump-dmesg /cores/retrace/tasks/164553756/crash/vmcore
> /tmp/164553756-dmesg.txt
> 
> The dmesg log is saved to /tmp/164553756-dmesg.txt.
> 
> makedumpfile Completed.

For makedumpfile you have to explicitly specify which compression algorithms should be supported when building the binary. For Fedora we use:

$ make LINKTYPE=dynamic USELZO=on USESNAPPY=on USEZSTD=on

(USEZSTD is rather new and not supported by rhel yet, but it shouldn't make a difference if you use this line for older versions as well. AFAIK unknown options are simply ignored.)
The problem you are hitting here is that makedumpfile only checks whether the required compression algorithm is compiled in when it compresses a dump, not when it reads from one...
So you get this meaningless error message when makedumpfile tries to read from the dump for the first time...
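
The kind of read-side check that is missing can be sketched as follows. This is an illustration under stated assumptions, not makedumpfile's code. The `COMP_*` flag values and the function names are hypothetical. It is also an assumption here that the `USELZO=on` style make options end up as preprocessor defines of the same name and that zlib support is always built in.

```c
#include <stdio.h>

/* Hypothetical compression flags; the real on-disk values are defined in
 * makedumpfile's headers and are not reproduced here. */
enum {
	COMP_ZLIB   = 1u << 0,
	COMP_LZO    = 1u << 1,
	COMP_SNAPPY = 1u << 2,
	COMP_ZSTD   = 1u << 3,
};

/* Bitmask of algorithms this binary was built with (assumption: the make
 * options become -DUSELZO etc., and zlib is always available). */
static unsigned compiled_in_support(void)
{
	unsigned mask = COMP_ZLIB;
#ifdef USELZO
	mask |= COMP_LZO;
#endif
#ifdef USESNAPPY
	mask |= COMP_SNAPPY;
#endif
#ifdef USEZSTD
	mask |= COMP_ZSTD;
#endif
	return mask;
}

/* Flags required by the dump that the binary cannot handle. */
static unsigned missing_support(unsigned dump_flags, unsigned supported)
{
	return dump_flags & ~supported;
}

/* Returns 0 if the dump can be read, -1 with a clear message otherwise. */
static int check_readable(unsigned dump_flags)
{
	unsigned missing = missing_support(dump_flags, compiled_in_support());

	if (missing) {
		fprintf(stderr,
			"dump uses compression not built into this binary (flags 0x%x)\n",
			missing);
		return -1;
	}
	return 0;
}
```

Failing early with a message like this points at the build options, whereas the address translation errors quoted above give no hint that compression support is the cause.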

Comment 23 Philipp Rudo 2022-03-09 12:22:19 UTC

Hi Dave,

(In reply to Dave Wysochanski from comment #19)
> The patches (via the brew build) look verygood to me as far as testing goes. 
> For regression, I ran through over 1,000 vmcores on our production system
> comparing the original makedumpfile with the brew build and there was no
> difference in output for "makedumpfile --dump-dmesg".
> 
> And if I run the brew build on the original vmcore I get this handled
> appropriately:
> 
> $ ./usr/sbin/makedumpfile --dump-dmesg
> /cores/retrace/tasks/783792817/crash/vmcore --message-level 31
> /tmp/dmesg-test2.txt
> ...
> log_buf       : ffffffff9320487c
> log_end       : 0
> log_buf_len   : 1048576
> log_first_idx : 0
> log_next_idx  : 241384
> dump_dmesg: Cycle when parsing dmesg detected.
> dump_dmesg: The printk log_buf is most likely corrupted.
> dump_dmesg: log_buf = 0xffffffff9320487c, idx = 0x39644
> 
> makedumpfile Failed.
> $ tail /tmp/dmesg-test2.txt
> [338781.860977] RPC: fragment too large: 1212501072
> [338786.843612] RPC: fragment too large: 1224736768
> [338787.881610] RPC: fragment too large: 50399744
> [371866.141781] VXI[28382]: segfault at 2e2e2e000018 ip 0000000000777ef6 sp
> 00007f886ea556d0 error 4 in VXI[400000+524000]
> [425150.902862] RPC: fragment too large: 612067950
> [425156.535970] RPC: fragment too large: 352518400
> [425163.679392] RPC: fragment too large: 1212501072
> [425168.603647] RPC: fragment too large: 1224736768
> [425169.804132] RPC: fragment too large: 50399744
> [472823.108369] traps: VXI[15158] general protection ip:7fb15facaf6f
> sp:7fb154c38640
> error\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
> \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
> x00\x00

Thanks for testing!

Could you do me a favor and write a short email to the upstream mailing list with your test results? I think that's something the upstream folks are interested in, too.

Comment 26 Dave Wysochanski 2022-03-16 14:10:09 UTC

(In reply to Philipp Rudo from comment #22)
> Hi Dave,
> 
> (In reply to Dave Wysochanski from comment #15)
> > (In reply to Philipp Rudo from comment #14)
> > > (In reply to Dave Wysochanski from comment #13)
> > > > Actually nevermind - I figured out what I needed to rebuild so I can use the
> > > > upstream code plus your patches to test it out.
> > > 
> > > Too late, I finally managed to get the brew build running
> > > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=43642469
> > > I had to fudge the patches a little though and haven't properly tested it
> > > yet. A quick test worked however.
> > 
> > Thanks!  I'll definitely install and do another set of tests with your build.
> > 
> > FWIW, I rebuilt upstream plus your patches and am seeing a lot of
> > differences when I run "makedumpfile -D --dump-dmesg" on the same vmcores,
> > comparing makedumpfile in kexec-tools-2.0.20-46.el8_4.2.x86_64 and the
> > upstream + your 3 patches.  I backed out your 3 patches and it still failed
> > so it looks like a separate upstream bug (my upstream is at "59b1726 [PATCH]
> > sadump, kaslr: fix failure of calculating kaslr_offset").  Any idea what's
> > going on here (see below)?
> > 
> > 
> > Example (same vmcore)
> > 
> > 1. Upstream + your 3 patches: FAIL
> > $ ./makedumpfile --dump-dmesg /cores/retrace/tasks/164553756/crash/vmcore
> > /tmp/164553756-dmesg.txt
> > __vtop4_x86_64: Can't get a valid pgd.
> > readmem: Can't convert a virtual address(ffffffff99c18604) to physical
> > address.
> > readmem: type_addr: 0, addr:ffffffff99c18604, size:390
> > check_release: Can't get the address of system_utsname.
> > 
> > makedumpfile Failed.
> > 
> > 
> > 2. Makedumpfile in kexec-tools-2.0.20-46.el8_4.2.x86_64: SUCCESS
> > $ makedumpfile --dump-dmesg /cores/retrace/tasks/164553756/crash/vmcore
> > /tmp/164553756-dmesg.txt
> > 
> > The dmesg log is saved to /tmp/164553756-dmesg.txt.
> > 
> > makedumpfile Completed.
> 
> For makedumpfile you have to explicitly specify which compression algorithms
> shall be supported when building the binary. For Fedora we use 
> 
> $ make LINKTYPE=dynamic USELZO=on USESNAPPY=on USEZSTD=on
> 
> (the USEZSTD is rather new and not supported by rhel yet but it shouldn't
> make a difference when you use the line for older versions as well. AFAIK
> unknown options are simply ignored).
> The problem you are hitting here is, that makedumpfile only checks whether
> the required compression algorithm is compiled in when you compress a dump
> but not when you read from one...
> So you get this meaningless error message when makedumpfile tries to read
> from the dump for the first time...


Thank you for pointing that out!  Thanks also for the patch to fix up the error message.  I wondered if I had gotten the build step wrong and sure enough...  But the new error message is better.

I gave your v2 patchset a good test and thumbs-up on the list.  Thanks again.

Comment 27 Philipp Rudo 2022-03-30 13:42:40 UTC

@Dave W.: I've posted the fix for rhel 8.7. Please open a z-stream request if you also want to have it fixed in 8.4.

Thanks
Philipp

Comment 29 Dave Wysochanski 2022-04-05 15:00:19 UTC

(In reply to Philipp Rudo from comment #27)
> @Dave W.: I've posted the fix for rhel 8.7. Please open a z-stream request
> when you also want to have it fixed in 8.4.
> 
Thanks for getting this into 8.7 - this may be enough for now.  I'm not sure about z-stream backports, maybe 8.6 would make sense?  Is there some reason you're thinking about 8.4?  If we do that I think we need to do it for 8.6 as well (8.5 seems out of the question due to no EUS and out of time for last z-stream).  Our production vmcore machines will get upgraded and I can also upgrade individual packages like kexec-tools for this bug so we don't need 8.4 backport for those.

Comment 30 Philipp Rudo 2022-04-06 11:02:07 UTC

Hi Dave,

(In reply to Dave Wysochanski from comment #29)
> (In reply to Philipp Rudo from comment #27)
> > @Dave W.: I've posted the fix for rhel 8.7. Please open a z-stream request
> > when you also want to have it fixed in 8.4.
> > 
> Thanks for getting this into 8.7 - this may be enough for now.  I'm not sure
> about z-stream backports, maybe 8.6 would make sense?  Is there some reason
> you're thinking about 8.4?  If we do that I think we need to do it for 8.6
> as well (8.5 seems out of the question due to no EUS and out of time for
> last z-stream).  Our production vmcore machines will get upgraded and I can
> also upgrade individual packages like kexec-tools for this bug so we don't
> need 8.4 backport for those.

I wasn't sure if you need it for galvatron, as it currently runs 8.4. But if you have a way to upgrade individual packages on it, that's a lot easier than a z-stream for us, too.

Comment 36 errata-xmlrpc 2022-11-08 10:46:41 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (kexec-tools bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7705