2137631 – kdump fails to generate a crash dump for systemd-252~rc1-608.fc38

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	systemd
Sub Component:
Version:	rawhide
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	systemd-maint
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-10-25 17:04 UTC by Gursewak Mangat
Modified:	2022-11-24 17:13 UTC
CC List:	15 users
Fixed In Version:	systemd-252.2-591.fc38
Clone Of:
Environment:
Last Closed:	2022-11-24 17:13:26 UTC
Type:	Bug
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	FC-641	0	None	None	None	2022-10-29 14:58:45 UTC

Description Gursewak Mangat 2022-10-25 17:04:30 UTC

Description of problem:
kdump fails to generate a crash dump file after a kernel crash

Version-Release number of selected component (if applicable):
kdump fails to capture a dump with systemd-252~rc1-608.fc38.x86_64 and later (Fedora Rawhide).
kdump captures a dump, and the dump folder can be viewed, with systemd-251.5-607.fc38.

How reproducible:
Every time for rawhide

Steps to Reproduce:
1. Launch a Fedora VM. The image at https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20221023.n.0/compose/Cloud/x86_64/images/Fedora-Cloud-Base-Rawhide-20221023.n.0.x86_64.qcow2 can be used to run a Rawhide version.
2. Install the required packages for kdump
   dnf install --enablerepo=updates-debuginfo kexec-tools
3. Update the kernel parameters to add `crashkernel=256M`
4. Enable the kdump service at startup and start it now:
   systemctl enable kdump.service --now
5. Reboot the system and confirm that kdump.service is running:
   systemctl status kdump.service
6. Trigger a kernel crash using
   echo 1 > /proc/sys/kernel/sysrq
   echo c > /proc/sysrq-trigger


Actual results:
After the system boots up, there is no kernel dump folder under /var/crash/

Expected results:
After the system reboots after a kernel panic, there should be a kernel dump in `/var/crash/<dumpdir>`
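
The checks around the steps above (is crash memory reserved before the crash, is there a dump directory after the reboot) can be sketched as two small shell helpers. This is only a sketch: the function names are hypothetical and are not part of kexec-tools, and the default path assumes dumps land under /var/crash as described in this report.

```shell
#!/bin/sh
# Hypothetical helpers for the reproduction steps; the names are illustrative.

# Succeeds if the given kernel command line contains a crashkernel= parameter.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Prints the number of directories directly under the crash root
# (default: /var/crash). Prints 0 if the crash root does not exist.
count_dumps() {
    root=${1:-/var/crash}
    [ -d "$root" ] || { echo 0; return 0; }
    find "$root" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}
```

Before triggering the crash, `has_crashkernel "$(cat /proc/cmdline)"` should succeed. After the reboot, `count_dumps` printing 0 corresponds to the failure reported here.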

Comment 1 Dusty Mabe 2022-10-28 16:09:13 UTC

Adding in Coiby to cc since he might know more from the kdump side (i.e. this could be something where kdump needs to be updated).

Comment 2 Zbigniew Jędrzejewski-Szmek 2022-10-28 17:11:56 UTC

Hmm, but is there any specific reason why systemd would be responsible for the bug? Isn't this between kdump and the kernel?

Comment 3 Dusty Mabe 2022-10-31 02:21:06 UTC

Hey zbyszek. In Fedora CoreOS we track package diffs meticulously. When a test starts failing we can see the exact package set change that introduced the failure. We determined that the systemd update caused the test to start failing. Reverting the systemd update (pinning to the older version) allowed the test to pass again.

More details in https://github.com/coreos/fedora-coreos-tracker/issues/1320#issuecomment-1281048996

This of course doesn't mean that the ultimate fix for this problem has to be in systemd, but it does strongly indicate that there is some sort of behavior change in systemd 252.

Comment 4 Coiby 2022-10-31 08:44:20 UTC

Hi,

I noticed the following logs in the kdump kernel:
```
[    2.820364] systemd[1]: Starting initrd-parse-etc.service - Mountpoints Configured in the Real Root...
[    2.865125] systemd-sysroot-fstab-check[494]: This program is only useful in the initrd.
[    2.868212] systemd[1]: initrd-parse-etc.service: Main process exited, code=exited, status=1/FAILURE
[    2.868329] systemd[1]: initrd-parse-etc.service: Failed with result 'exit-code'.
[    2.868588] systemd[1]: Failed to start initrd-parse-etc.service - Mountpoints Configured in the Real Root.
```

One difference between systemd-251.5-607.fc38 and systemd-252~rc1-608.fc38.x86_64 is the change [1] to initrd-parse-etc.service. After I reverted that change,
[root@s390x-kvm-063 ~]# diff -u /usr/lib/systemd/system/initrd-parse-etc.service{.bak,}
--- /usr/lib/systemd/system/initrd-parse-etc.service.bak        2022-10-31 04:19:04.170099763 -0400
+++ /usr/lib/systemd/system/initrd-parse-etc.service    2022-10-31 04:31:48.467748136 -0400
@@ -23,7 +23,7 @@
 
 # FIXME: once dracut is patched to install the symlink, change to:
 # ExecStart=/usr/lib/systemd/systemd-sysroot-fstab-check
-ExecStart=@/usr/lib/systemd/system-generators/systemd-fstab-generator systemd-sysroot-fstab-check
+ExecStart=-systemctl --no-block start initrd-fs.target
 
 # We want to enqueue initrd-cleanup.service/start after we finished the part
 # above. It can't be part of the initial transaction, because non-oneshot units

kdump worked again. So we are now sure that the recent systemd change [1] somehow made initrd-parse-etc.service fail to parse /etc/fstab.


[1] https://github.com/systemd/systemd/commit/45bcfcb36cec9bf810686ed956ff215ac1db07d5

Comment 5 Coiby 2022-11-01 10:48:46 UTC

By luck, I found that dracut's squashfs module doesn't play well with the newer initrd-parse-etc.service. If we remove kernel-modules, which provides the squashfs driver, so that dracut's squashfs module cannot be used, this issue also goes away.

It's also worth mentioning that if we disable kexec-tools' emergency shell and use dracut's emergency shell instead, we are automatically dropped into the emergency shell after initrd-parse-etc.service fails. After several rounds of initrd-parse-etc.service failing, entering the emergency shell, and quitting the emergency shell, vmcore dumping somehow starts and then finishes successfully.

Comment 6 Dusty Mabe 2022-11-03 19:29:32 UTC

I'm trying to figure out what the path forward is on this. Does the systemd change that caused the regression need to be fixed, or does dracut need to be fixed?

Comment 7 Zbigniew Jędrzejewski-Szmek 2022-11-03 22:31:04 UTC

So… the dracut squashfs module sets up an overlay: squashfs + tmpfs.
systemd-sysroot-fstab-check calls in_initrd(), which checks for /etc/initrd and whether
/ is a tmpfs. This check is fairly primitive: it checks if the fstype is ramfs or tmpfs.
We can say that the overlay with tmpfs *is* a temporary file system, but this check as it
is written was never smart enough to detect this. The fact the check in initrd-parse-etc.service
didn't call this function previously is just an oversight. Code in other places in systemd
does use this function quite a bit, so there would be inconsistent behaviour,
some parts assuming that we're in an initrd, and others not.

I think we should make in_initrd() smarter and also check if an overlay with a tmpfs is
used. But that's not really a bug: the dracut module is reaching deep into systemd internals,
monkey patching the mount structure. The fact that this ever worked was just an accident
of implementation.

I'll reassign this to systemd for now, for the improvements in in_initrd(). I hope that this
will be enough to make kdump work again.
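
As a rough illustration of the file system part of the check described above, here is a shell model of the strict test (ramfs or tmpfs only) next to the proposed smarter one (an overlay backed by tmpfs also counts). This is not systemd's code: the function names are hypothetical, the marker-file half of in_initrd() is left out, and the second variant models only the proposal in this comment, not necessarily what the eventual fix does.

```shell
#!/bin/sh
# Hypothetical model of the root file system check; not systemd's actual code.

# Strict variant: only ramfs or tmpfs count as a temporary root.
# $1 = file system type of /
root_is_memory_fs_strict() {
    case "$1" in
        ramfs|tmpfs) return 0 ;;
        *) return 1 ;;
    esac
}

# Proposed variant: an overlay whose writable layer sits on ramfs/tmpfs
# is also treated as a temporary root.
# $1 = file system type of /
# $2 = file system type backing the overlay's upper layer (may be empty)
root_is_memory_fs_lenient() {
    root_is_memory_fs_strict "$1" && return 0
    [ "$1" = overlay ] && root_is_memory_fs_strict "$2"
}
```

On a running system the first argument could come from `findmnt -n -o FSTYPE /`. With the dracut squashfs setup described above that value is `overlay`, so the strict variant fails while the lenient one succeeds when the upper layer is on tmpfs.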

Comment 8 Zbigniew Jędrzejewski-Szmek 2022-11-03 22:31:39 UTC

(https://github.com/dracutdevs/dracut/blob/master/modules.d/99squash/init-squash.sh
is the relevant code in dracut.)

Comment 9 Zbigniew Jędrzejewski-Szmek 2022-11-07 12:17:11 UTC

https://github.com/systemd/systemd/pull/25280

Comment 10 Coiby 2022-11-10 10:33:11 UTC

(In reply to Zbigniew Jędrzejewski-Szmek from comment #7)
> So… the dracut squashfs module sets up an overlay: squashfs + tmpfs.
> systemd-sysroot-fstab-check calls in_initrd(), which checks for /etc/initrd
> and whether
> / is a tmpfs. This check is fairly primitive: it checks if the fstype is
> ramfs or tmpfs.
> We can say that the overlay with tmpfs *is* a temporary file system, but
> this check as it
> is written was never smart enough to detect this. The fact the check in
> initrd-parse-etc.service
> didn't call this function previously is just an oversight. Code in other
> places in systemd
> does use this function quite a bit, so there would be inconsistent behaviour,
> some parts assuming that we're in an initrd, and others not.
> 
> I think we should make in_initrd() smarter and also check if an overlay with
> a tmpfs is
> used. But that's not really a bug: the dracut module is reaching deep into
> systemd internals,
> monkey patching the mount structure. The fact that this ever worked was just
> an accident
> of implementation.

Thanks for the detailed explanation! The root cause is clear now. I can confirm adding Environment=SYSTEMD_IN_INITRD=1/lenient to initrd-parse-etc.service could make kdump work.
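
For reference, one way such an Environment= setting could be expressed is as a unit drop-in. The helper below is only a sketch: the function name and the drop-in file name are hypothetical, the value is passed in as a parameter (`1` or `lenient`, as named above), and getting the drop-in into the kdump initramfs, where initrd-parse-etc.service actually runs, is not covered.

```shell
#!/bin/sh
# Hypothetical helper; the function name and the drop-in file name are illustrative.
# $1 = unit directory (for example /etc/systemd/system)
# $2 = value to set for SYSTEMD_IN_INITRD
write_in_initrd_override() {
    dropin_dir="$1/initrd-parse-etc.service.d"
    mkdir -p "$dropin_dir" || return 1
    # A drop-in only needs the section and the setting it adds.
    printf '[Service]\nEnvironment=SYSTEMD_IN_INITRD=%s\n' "$2" \
        > "$dropin_dir/50-in-initrd.conf"
}
```

For example, `write_in_initrd_override /tmp/units lenient` writes the file under /tmp/units/initrd-parse-etc.service.d/ so it can be inspected before being placed anywhere systemd reads.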

> 
> I'll reassign this to systemd for now, for the improvements in in_initrd().
> I hope that this
> will be enough to make kdump work again.
>
> https://github.com/systemd/systemd/pull/25280

I can confirm kdump works again with latest systemd. Thanks!

Comment 11 Dusty Mabe 2022-11-10 13:13:48 UTC

(In reply to Coiby from comment #10)
> 
> I can confirm kdump works again with latest systemd. Thanks!

Hey Coiby, did you test a Fedora package build or did you build from source upstream? I'm just trying to find out if the fix has made it into rawhide yet.

Comment 12 Coiby 2022-11-11 01:11:21 UTC

(In reply to Dusty Mabe from comment #11)
> (In reply to Coiby from comment #10)
> > 
> > I can confirm kdump works again with latest systemd. Thanks!
> 
> Hey Coiby, did you test a Fedora package build or did you build from source
> upstream? I'm just trying to find out if the fix has made it into rawhide
> yet.

Hi, I built it from upstream source.

Comment 13 Fedora Update System 2022-11-24 17:09:17 UTC

FEDORA-2022-3339528ed3 has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2022-3339528ed3

Comment 14 Fedora Update System 2022-11-24 17:13:26 UTC

FEDORA-2022-3339528ed3 has been pushed to the Fedora 38 stable repository.
If problem still persists, please make note of it in this bug report.
