Bug 1474200

Summary: kdump error handler was triggered twice
Product: [Fedora] Fedora Reporter: Xunlei Pang <xlpang>
Component: dracutAssignee: dracut-maint-list
Status: CLOSED RAWHIDE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: unspecified    
Version: rawhideCC: bhe, dracut-maint-list, harald, jonathan, kdump-team-bugs, ruyang, xlpang, zbyszek
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-08-14 07:51:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Xunlei Pang 2017-07-24 06:41:41 UTC
Description of problem:
kdump error handler was triggered again after entering kdump shell(using "default shell"), and it emits endless messages like below:
[***   ] (1 of 2) A start job is running for...p Error Handler (31s / no limit)


It disturbs kdump shell input although the kdump shell can work.

Version-Release number of selected component (if applicable):


How reproducible:
Configure /etc/kdump.conf as follows:
grep -v ^# /etc/kdump.conf

path /var/crash
core_collector makedumpfile -l --message-level 1 -d 31
default shell
dracut_args -o "lvm"

Make sure that "/var/crash/" is on a lvm volume, like the root lvm; then
omit "lvm" dracut module to let the kdump target unavailable during kdump boot so that kdump error handler service is triggered.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:
There is only one kdump error handler service triggered, or the second one terminate quickly not affecting the first one(i.e. kdump shell).


Additional info:
         Starting Create list of required st... nodes for the current kernel...
         Starting Journal Service...
         Starting Apply Kernel Variables...
[  OK  ] Listening on udev Kernel Socket.
[  OK  ] Reached target Sockets.
[  OK  ] Started Create list of required sta...ce nodes for the current kernel.
[  OK  ] Started Apply Kernel Variables.
         Starting Create Static Device Nodes in /dev...
[  OK  ] Started Create Static Device Nodes in /dev.
[    2.024346] audit: type=1130 audit(1500877715.696:2): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-tmpfiles-setup-dev comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
         Starting udev Kernel Device Manager...
[  OK  ] Started udev Kernel Device Manager.
[    2.038381] audit: type=1130 audit(1500877715.710:3): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-udevd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
         Starting udev Coldplug all Devices...
[  OK  ] Started Journal Service.
[    2.061908] audit: type=1130 audit(1500877715.734:4): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-journald comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  OK  ] Started udev Coldplug all Devices.
[    2.091858] audit: type=1130 audit(1500877715.764:5): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-udev-trigger comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  OK  ] Reached target Remote File Systems (Pre).
[  OK  ] Reached target Remote File Systems.
         Starting File System Check on /dev/mapper/fedora-root...
[  OK  ] Reached target Initrd Root File System.
         Starting Reload Configuration from the Real Root...
[FAILED] Failed to start File System Check on /dev/mapper/fedora-root.
See 'systemctl status systemd-fsck-root.service' for details.
[DEPEND] Dependency failed for /sysroot.
[DEPEND] Dependency failed for Local File Systems.
[ SKIP ] Ordering cycle found, skipping dracut pre-pivot and cleanup hook
[ SKIP ] Ordering cycle found, skipping Initrd Default Target
[    2.138626] audit: type=1130 audit(1500877715.811:6): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-fsck-root comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
         Mounting Configuration File System...
[  OK  ] Started Reload Configuration from the Real Root.
[    2.220619] audit: type=1130 audit(1500877715.893:7): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=initrd-parse-etc comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[    2.225578] audit: type=1131 audit(1500877715.898:8): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=initrd-parse-etc comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  OK  ] Mounted Configuration File System.
[  OK  ] Reached target System Initialization.
[  OK  ] Reached target Basic System.
[  OK  ] Reached target Initrd File Systems.
[  OK  ] Reached target Initrd Default Target.
         Starting dracut pre-pivot and cleanup hook...
[  OK  ] Started dracut pre-pivot and cleanup hook.
[    2.292104] audit: type=1130 audit(1500877715.964:9): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=dracut-pre-pivot comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
         Starting Kdump Vmcore Save Service...
[    2.316807] 8139cp: 8139cp: 10/100 PCI Ethernet driver v1.3 (Mar 22, 2004)
[    2.330579] 8139cp 0000:00:03.0 eth0: RTL-8139C+ at 0xffffba7f000b6000, 52:54:00:46:18:74, IRQ 11
[    2.362860] virtio-pci 0000:00:05.0: virtio_pci: leaving for legacy driver
kdump: dump target is 
kdump: error: Dump target  is not mounted.
kdump: saving vmcore failed
[FAILED] Failed to start Kdump Vmcore Save Service.
See 'systemctl status kdump-capture.service' for details.
[ SKIP ] Ordering cycle found, skipping dracut pre-pivot and cleanup hook
[    2.465390] audit: type=1130 audit(1500877716.137:10): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=kdump-capture comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
[  OK  ] Stopped target Timers.
[  OK  ] Stopped target Remote File Systems.
[  OK  ] Stopped target Remote File Systems (Pre).
[  OK  ] Stopped target Initrd Default Target.
[  OK  ] Stopped target Initrd Root File System.
[  OK  ] Stopped target Initrd File Systems.
[  OK  ] Stopped target Basic System.
[  OK  ] Stopped target Paths.
[  OK  ] Stopped Dispatch Password Requests to Console Directory Watch.
[  OK  ] Stopped target Sockets.
[  OK  ] Stopped target Slices.
[  OK  ] Stopped target Initrd Root Device.
[  OK  ] Stopped target System Initialization.
         Stopping udev Kernel Device Manager...
[  OK  ] Stopped target Swap.
[  OK  ] Stopped Apply Kernel Variables.
[    2.496648] systemd-journald[144]: Received SIGTERM from PID 1 (systemd).
         Stopping Journal Service...
         Starting Kdump Emergency...
         Starting Setup Virtual Console...
[  OK  ] Stopped udev Coldplug all Devices.
[  OK  ] Stopped Journal Service.
[  OK  ] Listening on Journal Audit Socket.
         Starting Journal Service...
[  OK  ] Started Kdump Emergency.
[  OK  ] Stopped dracut pre-pivot and cleanup hook.
[  OK  ] Reached target Emergency Mode.
         Stopping Journal Service...[    2.578621] systemd-journald[276]: Received SIGTERM from PID 1 (systemd).

[  OK  ] Stopped Journal Service.
[  OK  ] Closed Journal Audit Socket.
[    2.609871] [drm] Initialized drm 1.1.0 20060810
[  OK  ] Started Setup Virtual Console.[    2.634071] random: crng init done

         Starting Kdump Error Handler...
[    2.656521] [drm] Device Version 0.0
[    2.656801] [drm] Compression level 0 log level 0
[    2.657126] [drm] Currently using mode #0, list at 0x488
[    2.657546] [drm] 12286 io pages at offset 0x1000000
[    2.657955] [drm] 16777216 byte draw area at offset 0x0
[    2.658348] [drm] RAM header offset: 0x3ffe000
[    2.658670] [drm] rom modes offset 0x488 for 128 modes
[    2.663478] [TTM] Zone  kernel: Available graphics memory: 71754 kiB
[    2.663948] [TTM] Initializing pool allocator
[    2.664265] [TTM] Initializing DMA pool allocator
[    2.664636] [drm] qxl: 16M of VRAM memory size
[    2.664952] [drm] qxl: 63M of IO pages memory ready (VRAM domain)
[    2.665535] [drm] qxl: 64M of Surface memory size
[    2.667903] tsc: Refined TSC clocksource calibration: 2593.993 MHz
[    2.668354] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x256411d258c, max_idle_ns: 440795337342 ns
Kdump: Executing default action kdump_emergency_shell
[    2.673350] [drm] main mem slot 1 [f4000000,3ffe000]
[    2.673728] [drm] surface mem slot 2 [f8000000,4000000]
[    2.674100] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    2.674592] [drm] No driver support for vblank timestamp query.
[    2.675268] [drm] fb mappable at 0xF4000000, size 3145728
[    2.675770] [drm] fb: depth 24, pitch 4096, width 1024, height 768

Generating "/run/initramfs/rdsosreport.txt"
[    2.677599] fbcon: qxldrmfb (fb0) is primary device
[    2.707393] Console: switching to colour frame buffer device 128x48
[    2.711683] qxl 0000:00:02.0: fb0: qxldrmfb frame buffer device
[    2.712420] [drm] Initialized qxl 0.1.0 20120117 for 0000:00:02.0 on minor 0
[    2.715352] virtio-pci 0000:00:07.0: virtio_pci: leaving for legacy driver
[  OK  ] Stopped udev Kernel Device Manager.
[  OK  ] Stopped Create Static Device Nodes in /dev.
[  OK  ] Stopped Create list of required sta...ce nodes for the current kernel.
[  OK  ] Closed udev Control Socket.
[  OK  ] Closed udev Kernel Socket.


Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attach it to a bug report.


[***   ] (1 of 2) A start job is running for...p Error Handler (31s / no limit)

Comment 1 Ziyue Yang 2017-07-26 06:34:16 UTC
The cause might be the ordering cycle in dracut systemd services.
Following is part of the journalctl output from kdump emergency shell:

[  233.200240] systemd[1]: basic.target: Found ordering cycle on basic.target/start
[  233.201450] systemd[1]: basic.target: Found dependency on sysinit.target/start
[  233.203159] systemd[1]: basic.target: Found dependency on emergency.target/stop
[  233.203920] systemd[1]: basic.target: Found dependency on dracut-pre-pivot.service/start
[  233.205337] systemd[1]: basic.target: Found dependency on initrd.target/start
[  233.206198] systemd[1]: basic.target: Found dependency on basic.target/start
[  233.206919] systemd[1]: basic.target: Breaking ordering cycle by deleting job dracut-pre-pivot.service/start

We have a 
basic => sysinit => emergency => dracut-pre-pivot => initrd => basic
ordering cycle, causing emergency depending on emergency itself.

Changing the Before directive in dracut-pre-pivot.service from

Before=shutdown.target emergency.target

to

Before=shutdown.target

would eliminate the systemd warning message.
I'm considering starting kdump-error-handler service through systemctl in emergency.service twice as the cause of the problem.

Comment 2 Ziyue Yang 2017-08-04 02:39:34 UTC
Just found that if kdump error handler was triggered by a makedumpfile failure like in

https://bugzilla.redhat.com/show_bug.cgi?id=1474706

The problem described here would not happen. Seems that the problem would be triggered only during certain phases. Will investigate more.

Comment 3 Xunlei Pang 2017-08-04 09:44:26 UTC
(In reply to Ziyue Yang from comment #2)
> Just found that if kdump error handler was triggered by a makedumpfile
> failure like in
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1474706
> 
> The problem described here would not happen. Seems that the problem would be
> triggered only during certain phases. Will investigate more.

I guess it's due to a different systemd stage, kdump capture service starts late after sysinit.target, so there should be no such ordering cycle issue.

However with your fix in Comment 1, the problem still exists with the following reproducing steps
====================
[root@ ~]$ grep -v ^# /etc/kdump.conf

path /root/xxx
core_collector makedumpfile -l --message-level 1 -d 31
default shell
====================
(dm-1 is the swap partition)
swapoff /dev/dm-1 &>/dev/null
mkfs.ext4 /dev/dm-1 -F
mount /dev/dm-1 /root/xxx
mkdir -p /root/xxx/var/crash
touch /etc/kdump.conf
kdumpctl restart
umount /root/xxx
mkswap /dev/dm-1
sync
sync
sleep 1
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger

So, there is still some other issue we're not clear, or maybe we can solve it from a different view, for example, we allow for multiple failures(i.e. multiple error handler triggered) and try to avoid starting it multiple times if there was one running there.

Comment 4 Dave Young 2017-08-08 02:31:05 UTC
Is this a kdump only issue, or a general issue in dracut?

Comment 5 Ziyue Yang 2017-08-10 04:23:08 UTC
It seems to be a dracut's bug:

https://github.com/dracutdevs/dracut/commit/f24d205537b094939379440ee013cca88c7582ac

I've compiled and installed the newest version of dracut from github, and the error handler bug seems no longer. exist.

Comment 6 Xunlei Pang 2017-08-10 06:07:25 UTC
(In reply to Ziyue Yang from comment #5)
> It seems to be a dracut's bug:
> 
> https://github.com/dracutdevs/dracut/commit/
> f24d205537b094939379440ee013cca88c7582ac
> 
> I've compiled and installed the newest version of dracut from github, and
> the error handler bug seems no longer. exist.

That's great, it past all my tests.

Comment 7 Xunlei Pang 2017-08-10 06:34:03 UTC
(In reply to Xunlei Pang from comment #6)
> (In reply to Ziyue Yang from comment #5)
> > It seems to be a dracut's bug:
> > 
> > https://github.com/dracutdevs/dracut/commit/
> > f24d205537b094939379440ee013cca88c7582ac
> > 
> > I've compiled and installed the newest version of dracut from github, and
> > the error handler bug seems no longer. exist.
> 
> That's great, it past all my tests.

BTW, for this dracut commit, seems there is something wrong with modules.d/98dracut-systemd/dracut-pre-mount.service, it changes to:
"After=basic.target cryptsetup.target" instead of original
"After=dracut-initqueue.service cryptsetup.target" which leads to the following Ordering cycle:
systemd[1]: local-fs.target: Found ordering cycle on local-fs.target/start
systemd[1]: local-fs.target: Found dependency on sysroot.mount/start
systemd[1]: local-fs.target: Found dependency on dracut-pre-mount.service/start
systemd[1]: local-fs.target: Found dependency on basic.target/start
systemd[1]: local-fs.target: Found dependency on sysinit.target/start
[ SKIP ] Ordering cycle found, skipping Local File Systems

Kdump error handler will be triggered again in case of any Ordering cycle, so I think we should use the original dependency instead.

Comment 8 Xunlei Pang 2017-08-10 06:35:48 UTC
(In reply to Xunlei Pang from comment #7)
> (In reply to Xunlei Pang from comment #6)
> > (In reply to Ziyue Yang from comment #5)
> > > It seems to be a dracut's bug:
> > > 
> > > https://github.com/dracutdevs/dracut/commit/
> > > f24d205537b094939379440ee013cca88c7582ac
> > > 
> > > I've compiled and installed the newest version of dracut from github, and
> > > the error handler bug seems no longer. exist.
> > 
> > That's great, it past all my tests.
> 
> BTW, for this dracut commit, seems there is something wrong with
> modules.d/98dracut-systemd/dracut-pre-mount.service, it changes to:
> "After=basic.target cryptsetup.target" instead of original
> "After=dracut-initqueue.service cryptsetup.target" which leads to the
> following Ordering cycle:

Also, this service already contains "DefaultDependencies=no" which means has no dependency on "basic.target".

> systemd[1]: local-fs.target: Found ordering cycle on local-fs.target/start
> systemd[1]: local-fs.target: Found dependency on sysroot.mount/start
> systemd[1]: local-fs.target: Found dependency on
> dracut-pre-mount.service/start
> systemd[1]: local-fs.target: Found dependency on basic.target/start
> systemd[1]: local-fs.target: Found dependency on sysinit.target/start
> [ SKIP ] Ordering cycle found, skipping Local File Systems
> 
> Kdump error handler will be triggered again in case of any Ordering cycle,
> so I think we should use the original dependency instead.

Comment 9 Dave Young 2017-08-11 02:37:08 UTC
Hi, xunlei, can you make a pull request in github so that Harald can take a look?

Comment 10 Xunlei Pang 2017-08-11 03:15:46 UTC
(In reply to Dave Young from comment #9)
> Hi, xunlei, can you make a pull request in github so that Harald can take a
> look?

Sure, will do

Comment 11 Xunlei Pang 2017-08-11 06:12:23 UTC
(In reply to Dave Young from comment #9)
> Hi, xunlei, can you make a pull request in github so that Harald can take a
> look?

Synced to the latest dracut git repository, it's already been fixed.

So all the issues will be fixed if we backport the following two dracut upstream commits:
1)
commit c000a21c25bd436f2b3cc2076cb7025cc82d2807
Author: Harald Hoyer <harald>
Date:   Wed Jun 22 18:12:19 2016 +0200

    dracut-systemd/*.service: conflict with shutdown target
    
    make reboot/poweroff/halt work
    
    also conflict with emergency.target

2) 
commit b1ae591945acc4e9a962bd817e6e6e40e8587d2c
Author: Harald Hoyer <harald>
Date:   Fri Jul 28 11:57:07 2017 +0200

    dracut-systemd: add back missing dependencies
    
    otherwise TEST-20-NFS fails

Comment 12 Xunlei Pang 2017-08-11 06:16:49 UTC
(In reply to Xunlei Pang from comment #11)
> (In reply to Dave Young from comment #9)
> > Hi, xunlei, can you make a pull request in github so that Harald can take a
> > look?
> 
> Synced to the latest dracut git repository, it's already been fixed.
> 
> So all the issues will be fixed if we backport the following two dracut
> upstream commits:
> 1)
> commit c000a21c25bd436f2b3cc2076cb7025cc82d2807
> Author: Harald Hoyer <harald>
> Date:   Wed Jun 22 18:12:19 2016 +0200
> 
>     dracut-systemd/*.service: conflict with shutdown target
>     
>     make reboot/poweroff/halt work
>     
>     also conflict with emergency.target

Sorry, this one is the problematic commit, the fix should be
commit f24d205537b094939379440ee013cca88c7582ac
Author: Harald Hoyer <harald>
Date:   Fri Jul 28 09:05:34 2017 +0200

    dracut-systemd: fixed dependencies
    
    try to break an ordering cycle.
    
    https://github.com/dracutdevs/dracut/issues/259

> 
> 2) 
> commit b1ae591945acc4e9a962bd817e6e6e40e8587d2c
> Author: Harald Hoyer <harald>
> Date:   Fri Jul 28 11:57:07 2017 +0200
> 
>     dracut-systemd: add back missing dependencies
>     
>     otherwise TEST-20-NFS fails

Comment 13 Harald Hoyer 2017-08-11 12:36:32 UTC
is this still broken?

Comment 14 Xunlei Pang 2017-08-14 00:36:25 UTC
(In reply to Harald Hoyer from comment #13)
> is this still broken?

Hi Harald,

We tested on latest Fedora rawhide, it will break.

It can work properly after we applied the two fixes below:
1)
commit f24d205537b094939379440ee013cca88c7582ac
Author: Harald Hoyer <harald>
Date:   Fri Jul 28 09:05:34 2017 +0200

    dracut-systemd: fixed dependencies
    
    try to break an ordering cycle.
    
    https://github.com/dracutdevs/dracut/issues/259

2)
commit b1ae591945acc4e9a962bd817e6e6e40e8587d2c
Author: Harald Hoyer <harald>
Date:   Fri Jul 28 11:57:07 2017 +0200

    dracut-systemd: add back missing dependencies
    
    otherwise TEST-20-NFS fails

Comment 15 Harald Hoyer 2017-08-14 07:51:12 UTC
dracut-046-2.git20170811.fc27.x86_64