Bug 1809179
| Summary: | [RHEL-7.9] Dump doesn't automatically start; gives error "kdump: error: Dump target is not mounted." | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Steve Bonds <ij2fdc402> |
| Component: | kexec-tools | Assignee: | Pingfan Liu <piliu> |
| Status: | CLOSED WONTFIX | QA Contact: | Emma Wu <xiawu> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.7 | CC: | kdump-bugs, piliu, ruyang, xiawu |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-09-02 07:27:04 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1653509 | | |
Description
Steve Bonds
2020-03-02 14:38:56 UTC
Hi Steve,

(In reply to Steve Bonds from comment #0)
> Description of problem:
>
> When a crash is triggered, the dump fails to start and gives the error:
>
> Starting Kdump Vmcore Save Service...
> kdump: dump target is
> kdump: error: Dump target is not mounted.
> kdump: saving vmcore failed
> FAILED Failed to start Kdump Vmcore Save Service.
> See 'systemctl status kdump-capture.service' for details.
>
> The XFS mount process can be seen to start:
>
> Found device /dev/mapper/vgroot-crash_lv.
> Starting File System Check on /dev/mapper/vgroot-crash_lv...
> systemd-fsck[488]: Started File System Check on /dev/mapper/vgroot-crash_lv./sbin/fsck.xfs: XFS file system.
>
> Started dracut initqueue hook.
> Reached target Remote File Systems (Pre).
> Reached target Initrd Root File System.
> Starting Reload Configuration from the Real Root...
> Mounting /kdumproot/var/crash...
> Reached target Remote File Systems.
> Started Reload Configuration from the Real Root.
> Reached target Initrd File Systems.
> Reached target Initrd Default Target.
> SGI XFS with ACLs, security attributes, realtime, no debug enabled
> Starting dracut pre-pivot and cleanup hook...
> XFS (dm-0): Mounting V4 Filesystem
> Started dracut pre-pivot and cleanup hook.
> Starting Kdump Vmcore Save Service...
> kdump: dump target is
> kdump: error: Dump target is not mounted.
> kdump: saving vmcore failed
> FAILED Failed to start Kdump Vmcore Save Service.
>
> At this point it seems unclear why the dump fails to start. Setting `default=shell` and running the same service start manually works fine.
>
> Version-Release number of selected component (if applicable):
>
> Name : kexec-tools
> Version : 2.0.15
> Release : 33.el7
>
> How reproducible:
>
> Not especially. :-)
>
> On certain affected servers it seems to happen regularly, on other supposedly identical servers it never seems to happen. This is probably due to an upstream issue mounting the /var/crash area and returning a false success for some short period of time before the mount area is actually ready. While the cause for this may remain unknown, there are ways to make the dump process more resilient when this or similar issues happen.
>
> So far this has only been observed on XFS filesystems.
>
> Steps to Reproduce:
>
> 1. (on an affected server)
> 2. initiate the crash process with SysRq or NMI
> 3. watch crash kernel start
> 4. watch crash kernel fail to capture dump
>
> Actual results:
>
> dump: dump target is
> kdump: error: Dump target is not mounted.
> kdump: saving vmcore failed
>
> Expected results:
>
> kdump: dump target is /dev/mapper/vgroot-crash_lv
> kdump: saving to /kdumproot/var/crash///127.0.0.1-2020-02-29-19:17:56/
> kdump: saving vmcore-dmesg.txt
> kdump: saving vmcore-dmesg.txt complete
> kdump: saving vmcore
> The kernel version is not supported.
> The makedumpfile operation may be incomplete.
>
> Copying data
> ...
>
> Additional info:
>
> While there may be multiple possible causes for the delay in mounting the crash destination, fixing this doesn't require actually finding the cause.
>
> One fix is the classic fix for all race conditions: add a delay. For example in the SRPM the "dracut-kdump.sh" file could be modified like
>
> BEFORE:
>
> #!/bin/sh
>
> # continue here only if we have to save dump.
> if [ -f /etc/fadump.initramfs ] && [ ! -f /proc/device-tree/rtas/ibm,kernel-dump ]; then
>     exit 0
> fi
>
> exec &> /dev/console
> . /lib/dracut-lib.sh
> . /lib/kdump-lib-initramfs.sh
>
> set -o pipefail
> DUMP_RETVAL=0
>
> export PATH=$PATH:$KDUMP_SCRIPT_DIR
>
> AFTER (add a sleep):
>
> #!/bin/sh
>
> # continue here only if we have to save dump.
> if [ -f /etc/fadump.initramfs ] && [ ! -f /proc/device-tree/rtas/ibm,kernel-dump ]; then
>     exit 0
> fi
>
> # Avoid upstream race condition
> echo "kdump wait to avoid race conditions" > /dev/console
> sleep 60
>
> exec &> /dev/console
> . /lib/dracut-lib.sh
> . /lib/kdump-lib-initramfs.sh
>
> set -o pipefail
> DUMP_RETVAL=0
>
> export PATH=$PATH:$KDUMP_SCRIPT_DIR
>
> Another possible fix would be to add some retries to the service start so if it fails for any reason, there's a delay and it's retried a few times with an additional delay between retries.
>
> A more targeted retry would be to adjust `kdump-lib-initramfs.sh` to allow for return code checks and limited retries for the following:
>
> local _dev=$(findmnt -k -f -n -r -o SOURCE $1)
> local _mp=$(findmnt -k -f -n -r -o TARGET $1)
>
> echo "kdump: dump target is $_dev"
>
> if [ -z "$_mp" ]; then
>     echo "kdump: error: Dump target $_dev is not mounted."
>     return 1
> fi
>
> One possible example to add a local retry:
>
> for try in 1 2 3; do local _dev=$(findmnt -k -f -n -r -o SOURCE $1) && break || sleep 10; done
> for try in 1 2 3; do local _mp=$(findmnt -k -f -n -r -o TARGET $1) && break || sleep 10; done

I remember discussing a similar problem with RHEL-7 kexec-tools upstream (see <http://lists.infradead.org/pipermail/kexec/2020-March/024603.html>) which was reported for a use-case targeting saving vmcore on iSCSI server.
But, I think the root-cause seems to be the same:

- Basically whenever a File System Check (fsck) starts on the underlying dump target (targeted for saving vmcore) in the kdump kernel, we have issues reported with kdump failure (either 'dracut-initqueue timeout' starts or we have a kdump failure - as you shared),
- In all such cases, if we drop to kdump shell (by specifying 'default=shell' in the kdump.conf configuration file) and manually mount the intended dump target, the mount is successful and thereafter we can save vmcore manually on the intended dump target (as fsck was able to run in the meanwhile and the intended dump target can now be found).

So, I think adding some kind of a retry/timeout attempt while trying to find the intended dump target might help, and adding it to `kdump-lib-initramfs.sh` probably makes more sense.

I will do some debugging and get back with a suggestion/possible fix.

Thanks,
Bhupesh

It seems unlikely that the delays in my specific case are fsck-related because XFS doesn't actually do any fsck. Instead it prints a misleading error message. (See https://bugzilla.redhat.com/show_bug.cgi?id=1546294) :-)

My suggestion would be to allow systemd to retry the failed crash steps several times. This looked like a good source for how to do that:

https://stackoverflow.com/questions/39284563/how-to-set-up-a-systemd-service-to-retry-5-times-on-a-cycle-of-30-seconds

Systemd retries would cover all possible failures to make the process more resilient.

To cover my specific issue, a targeted retry of the `findmnt` commands would be a nice addition to the general retry above.

I see you marked `needinfo`. What information can I provide?

(In reply to Steve Bonds from comment #5)
> It seems unlikely that the delays in my specific case are fsck-related because XFS doesn't actually do any fsck. Instead it prints a misleading error message. (See https://bugzilla.redhat.com/show_bug.cgi?id=1546294) :-)

Right, in this case, it might be a misleading fsck-related debug print, but like I shared, we have had similar issues reported with other setups - such as iSCSI setups as well.

> My suggestion would be to allow systemd to retry the failed crash steps several times. This looked like a good source for how to do that:
>
> https://stackoverflow.com/questions/39284563/how-to-set-up-a-systemd-service-to-retry-5-times-on-a-cycle-of-30-seconds
>
> Systemd retries would cover all possible failures to make the process more resilient.
>
> To cover my specific issue, a targeted retry of the `findmnt` commands would be a nice addition to the general retry above.

I think this would be a much better approach.

> I see you marked `needinfo`. What information can I provide?

I think the XFS related misleading error message was what I was wondering about. Thanks for clarifying the same.

I would work on a possible solution and share it shortly. Would need your help in verifying the same as I don't have a setup available right now, where this can be reproduced reliably.

Thanks.

We're seeing this more and more on our Oracle Linux servers. You may hear from them since we're pursuing a solution as part of a service request there.

Our current workaround is the following added to /usr/lib/dracut/modules.d/99kdumpbase/kdump.sh before regenerating the kdump initrd files:

    # Avoid upstream race condition. Provide useful output to the impatient sysadmin
    for timeleft in {15..1}; do
        echo "kdump $timeleft second wait to avoid race conditions" > /dev/console
        sleep 1
    done

Based on testing from our affected servers, the mount typically completes within one or two seconds and usb settles within 7 seconds, so 15 seconds is plenty of overkill.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed.
If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.
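
For anyone carrying a local patch, the targeted `findmnt` retry suggested in the description could look roughly like the sketch below. It is not taken from kexec-tools: the names `retry_output` and `lookup_dump_target` and the 3 x 10 second budget are hypothetical. One detail matters for any version of this: `local var=$(cmd)` returns the exit status of `local`, not of `cmd`, so the sketch assigns separately from the declaration, and it also treats empty output as a failure instead of relying only on the exit code of `findmnt`.

```shell
# Hypothetical helper, not part of kexec-tools: run a command until it
# exits 0 AND prints something, for at most $1 attempts, sleeping $2
# seconds between attempts. On success the command's output is printed.
retry_output() {
    local tries="$1" delay="$2" attempt=1 out
    shift 2
    while [ "$attempt" -le "$tries" ]; do
        # Plain assignment (no 'local' here) so the exit status tested is
        # the command's own, not the status of the 'local' builtin.
        if out=$("$@") && [ -n "$out" ]; then
            printf '%s\n' "$out"
            return 0
        fi
        # No point sleeping after the final failed attempt.
        if [ "$attempt" -lt "$tries" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Assumed shape of the lookup in kdump-lib-initramfs.sh, with the retry
# applied to the mount point query. The findmnt options are the ones
# quoted in the description; the function name is made up for this sketch.
lookup_dump_target() {
    local _dev _mp
    if ! _mp=$(retry_output 3 10 findmnt -k -f -n -r -o TARGET "$1"); then
        echo "kdump: error: Dump target is not mounted."
        return 1
    fi
    _dev=$(findmnt -k -f -n -r -o SOURCE "$1")
    echo "kdump: dump target is $_dev"
}
```

With these numbers the worst case adds 20 seconds before the failure is reported, which is in the same range as the 15 second fixed delay used in the workaround, but a healthy system pays nothing because the first attempt succeeds.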