Description of problem:
kdump fails to generate a crash dump file after a kernel crash

Version-Release number of selected component (if applicable):
Fails to capture kdump using systemd-252~rc1-608.fc38.x86_64 and later (fedora-rawhide)
Successfully able to capture and view the kdump folder using systemd-251.5-607.fc38

How reproducible:
Every time for rawhide

Steps to Reproduce:
1. Start and launch a Fedora VM. We can use https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20221023.n.0/compose/Cloud/x86_64/images/Fedora-Cloud-Base-Rawhide-20221023.n.0.x86_64.qcow2 to run a rawhide version.
2. Install the required packages for kdump
   dnf install --enablerepo=updates-debuginfo kexec-tools
3. Update the kernel parameters to add `crashkernel=256M`
4. Activate the kdump system service at startup using the following command.
   "systemctl enable kdump.service --now"
5. Reboot the system and make sure the kdump.service is running by running "systemctl status kdump.service"
6. Trigger a kernel crash using
   echo 1 > /proc/sys/kernel/sysrq
   echo c > /proc/sysrq-trigger

Actual results:
After the system boots up, there is no kernel dump folder under /var/crash/

Expected results:
After the system reboots after a kernel panic, there should be a kernel dump in `/var/crash/<dumpdir>`
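The preconditions from steps 3 to 5 and the expected result can be checked with a few shell helpers. This is only a sketch: `cmdline_has_crashkernel` and `has_dump_dir` are hypothetical helper names, not part of kexec-tools, and the check assumes that any subdirectory under /var/crash counts as a dump directory.

```shell
#!/bin/sh
# Return 0 if the given kernel command line contains a crashkernel= parameter.
cmdline_has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
    esac
    return 1
}

# Return 0 if the given crash directory holds at least one subdirectory.
# Assumption: kdump stores each dump in its own subdirectory.
has_dump_dir() {
    for entry in "$1"/*/; do
        [ -d "$entry" ] && return 0
    done
    return 1
}

# Possible use on the VM, before and after triggering the crash:
#   cmdline_has_crashkernel "$(cat /proc/cmdline)" || echo "crashkernel= missing"
#   systemctl is-active --quiet kdump.service || echo "kdump.service not active"
#   has_dump_dir /var/crash || echo "no dump under /var/crash"
```

With the affected systemd builds, the first two checks pass before the crash and the last one reports a missing dump after the reboot.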
Adding in Coiby to cc since he might know more from the kdump side (i.e. this could be something where kdump needs to be updated).
Hmm, but is there any specific reason why systemd would be responsible for the bug? Isn't this between kdump and the kernel?
Hey zbyszek. In Fedora CoreOS we track package diffs meticulously. When a test starts failing we can see the exact package set change where the test started failing. We determined that the systemd update was the one that caused the test to start failing. Reverting the systemd update (pinning on the older version) allowed the test to start passing again. More details in https://github.com/coreos/fedora-coreos-tracker/issues/1320#issuecomment-1281048996

This of course doesn't mean that the ultimate fix for this problem has to be in systemd, but it does strongly indicate that there is some sort of behavior change in systemd 252.
Hi,

I notice the following logs in the kdump kernel,

```
[ 2.820364] systemd[1]: Starting initrd-parse-etc.service - Mountpoints Configured in the Real Root...
[ 2.865125] systemd-sysroot-fstab-check[494]: This program is only useful in the initrd.
[ 2.868212] systemd[1]: initrd-parse-etc.service: Main process exited, code=exited, status=1/FAILURE
[ 2.868329] systemd[1]: initrd-parse-etc.service: Failed with result 'exit-code'.
[ 2.868588] systemd[1]: Failed to start initrd-parse-etc.service - Mountpoints Configured in the Real Root.
```

One difference between systemd-251.5-607.fc38 and systemd-252~rc1-608.fc38.x86_64 is the change [1] to initrd-parse-etc.service. If I reverted the change,

[root@s390x-kvm-063 ~]# diff -u /usr/lib/systemd/system/initrd-parse-etc.service{.bak,}
--- /usr/lib/systemd/system/initrd-parse-etc.service.bak 2022-10-31 04:19:04.170099763 -0400
+++ /usr/lib/systemd/system/initrd-parse-etc.service 2022-10-31 04:31:48.467748136 -0400
@@ -23,7 +23,7 @@
 # FIXME: once dracut is patched to install the symlink, change to:
 # ExecStart=/usr/lib/systemd/systemd-sysroot-fstab-check
-ExecStart=@/usr/lib/systemd/system-generators/systemd-fstab-generator systemd-sysroot-fstab-check
+ExecStart=-systemctl --no-block start initrd-fs.target
 # We want to enqueue initrd-cleanup.service/start after we finished the part
 # above. It can't be part of the initial transaction, because non-oneshot units

kdump worked again. So we are now sure that the recent systemd change [1] somehow made initrd-parse-etc.service fail to parse /etc/fstab.

[1] https://github.com/systemd/systemd/commit/45bcfcb36cec9bf810686ed956ff215ac1db07d5
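For anyone checking their own kdump console log for the same symptom, here is a minimal sketch that pulls out the units systemd failed to start. `failed_units` is a hypothetical helper name, and the pattern assumes the "Failed to start" message format shown in the log above.

```shell
#!/bin/sh
# Read a console log on stdin and print the name of every unit for which
# systemd (PID 1) logged a "Failed to start" message.
failed_units() {
    sed -n 's/.*systemd\[1\]: Failed to start \([^ ]*\).*/\1/p'
}

# Possible use, assuming the kdump kernel console output was saved to a file:
#   failed_units < console.log
```

Fed the log above, this prints initrd-parse-etc.service.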
By luck, I found that dracut's squashfs module doesn't play well with the newer initrd-parse-etc.service. If we remove kernel-modules, which provides the squashfs driver, so that dracut's squashfs module can't be used, this issue also goes away.

It's also worth mentioning that if we disable kexec-tools' emergency shell and use dracut's emergency shell instead, we are automatically dropped into the emergency shell after initrd-parse-etc.service fails. After repeating the cycle of initrd-parse-etc.service failing -> entering the emergency shell -> quitting the emergency shell for several rounds, vmcore dumping somehow starts and then finishes successfully.
I'm trying to figure out what the path forward is on this? Does the systemd change that caused the regression need to be fixed or does dracut need to be fixed?
So… the dracut squashfs module sets up an overlay: squashfs + tmpfs. systemd-sysroot-fstab-check calls in_initrd(), which checks for /etc/initrd and whether / is a tmpfs. This check is fairly primitive: it checks if the fstype is ramfs or tmpfs. We can say that the overlay with tmpfs *is* a temporary file system, but this check as it is written was never smart enough to detect this. The fact the check in initrd-parse-etc.service didn't call this function previously is just an oversight. Code in other places in systemd does use this function quite a bit, so there would be inconsistent behaviour, some parts assuming that we're in an initrd, and others not.

I think we should make in_initrd() smarter and also check if an overlay with a tmpfs is used. But that's not really a bug: the dracut module is reaching deep into systemd internals, monkey patching the mount structure. The fact that this ever worked was just an accident of implementation.

I'll reassign this to systemd for now, for the improvements in in_initrd(). I hope that this will be enough to make kdump work again.
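To illustrate the check described above, here is a rough shell sketch of its logic. It is not the systemd implementation (which is C code); `fstype_is_temporary` and `in_initrd_sketch` are hypothetical names, and the marker file is passed in as an argument rather than hard-coded.

```shell
#!/bin/sh
# The "primitive" part of the check: only tmpfs and ramfs count as temporary.
fstype_is_temporary() {
    case "$1" in
        tmpfs|ramfs) return 0 ;;
        *) return 1 ;;
    esac
}

# in_initrd_sketch MARKER_FILE ROOT_FSTYPE
# Return 0 only if the marker file exists and the root fstype is temporary.
in_initrd_sketch() {
    [ -e "$1" ] && fstype_is_temporary "$2"
}

# Possible use. Assumptions: the marker's full name is /etc/initrd-release,
# and findmnt is available to report the fstype of /:
#   in_initrd_sketch /etc/initrd-release "$(findmnt -n -o FSTYPE /)"
```

With the squash module the root reports an overlay fstype, so a check of this shape returns false even though the marker file is present, which matches the "This program is only useful in the initrd." message in the log.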
(https://github.com/dracutdevs/dracut/blob/master/modules.d/99squash/init-squash.sh is the relevant code in dracut.)
https://github.com/systemd/systemd/pull/25280
(In reply to Zbigniew Jędrzejewski-Szmek from comment #7)
> So… the dracut squashfs module sets up an overlay: squashfs + tmpfs.
> systemd-sysroot-fstab-check calls in_initrd(), which checks for /etc/initrd and whether / is a tmpfs.
> This check is fairly primitive: it checks if the fstype is ramfs or tmpfs.
> We can say that the overlay with tmpfs *is* a temporary file system, but this check as it is written was never smart enough to detect this.
> The fact the check in initrd-parse-etc.service didn't call this function previously is just an oversight.
> Code in other places in systemd does use this function quite a bit, so there would be inconsistent behaviour, some parts assuming that we're in an initrd, and others not.
>
> I think we should make in_initrd() smarter and also check if an overlay with a tmpfs is used.
> But that's not really a bug: the dracut module is reaching deep into systemd internals, monkey patching the mount structure.
> The fact that this ever worked was just an accident of implementation.

Thanks for the detailed explanation! The root cause is clear now. I can confirm adding Environment=SYSTEMD_IN_INITRD=1/lenient to initrd-parse-etc.service could make kdump work.

>
> I'll reassign this to systemd for now, for the improvements in in_initrd().
> I hope that this will be enough to make kdump work again.
>
> https://github.com/systemd/systemd/pull/25280

I can confirm kdump works again with latest systemd. Thanks!
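For reference, the Environment= workaround mentioned above could be written as a drop-in instead of editing the unit file. This is a sketch under assumptions: `write_initrd_override` and the drop-in file name `in-initrd.conf` are hypothetical, it uses `lenient` (one of the two values reported to work), and getting the file into the kdump initramfs is not covered here.

```shell
#!/bin/sh
# write_initrd_override UNIT_DIR
# Create a drop-in for initrd-parse-etc.service under UNIT_DIR that sets
# SYSTEMD_IN_INITRD=lenient for that service only.
write_initrd_override() {
    dropin_dir="$1/initrd-parse-etc.service.d"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/in-initrd.conf" <<'EOF'
[Service]
Environment=SYSTEMD_IN_INITRD=lenient
EOF
}

# Possible use (the target directory is an assumption; the drop-in only
# matters if it ends up inside the initramfs that the kdump kernel boots):
#   write_initrd_override /etc/systemd/system
```

This is a stopgap only; with a systemd build that contains the in_initrd() fix, no override is needed.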
(In reply to Coiby from comment #10)
>
> I can confirm kdump works again with latest systemd. Thanks!

Hey Coiby, did you test a Fedora package build or did you build from source upstream? I'm just trying to find out if the fix has made it into rawhide yet.
(In reply to Dusty Mabe from comment #11)
> (In reply to Coiby from comment #10)
> >
> > I can confirm kdump works again with latest systemd. Thanks!
>
> Hey Coiby, did you test a Fedora package build or did you build from source upstream?
> I'm just trying to find out if the fix has made it into rawhide yet.

Hi, I built it from upstream source.
FEDORA-2022-3339528ed3 has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2022-3339528ed3
FEDORA-2022-3339528ed3 has been pushed to the Fedora 38 stable repository. If problem still persists, please make note of it in this bug report.