Bug 1717323
Summary: | Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00 | |
---|---|---|---
Product: | Red Hat Enterprise Linux 8 | Reporter: | Yanan Fu <yfu>
Component: | dracut | Assignee: | Pavel Valena <pvalena>
Status: | CLOSED ERRATA | QA Contact: | Frantisek Sumsal <fsumsal>
Severity: | urgent | Docs Contact: |
Priority: | high | |
Version: | 8.5 | CC: | bdas, chayang, coli, coughlan, dracut-maint-list, dtardon, esandeen, fkrska, fsumsal, jinzhao, juzhang, leiyang, lmiksik, lnykryn, pvalena, rbalakri, ribarry, rstrode, virt-maint, xfu, xiaohli, xuwei, yanghliu, yfu, ymankad, yuhuang, zjedrzej
Target Milestone: | beta | Keywords: | Reopened, TestBlocker, Triaged
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | dracut-049-209.git20220815.el8 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | 1685358 | Environment: |
Last Closed: | 2022-11-08 10:48:46 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1685358 | |
Bug Blocks: | | |
Attachments: | | |
Comment 1
juzhang
2019-06-06 01:43:51 UTC
Based on BZ1685358#c31, remove 'TestBlocker' keyword. Thanks.

Hit the same issue with qemu-kvm-4.0.0-5.module+el8.1.0+3622+5812d9bf. Comparing the reboot log from a run where the panic occurs against one where it does not, my guess is that the panic is triggered while 'dracut' takes over the 'mdmon' process. The attachments contain the detailed log files. I will downgrade mdadm and rebuild the initramfs image with `dracut -f` to verify my guess. Any suggestions are welcome.

Thanks,
Xu
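For context, the `dracut -f` rebuild mentioned above is standard dracut usage; a minimal sketch for the currently running kernel (the image path follows the usual RHEL naming convention):

```sh
# Regenerate the initramfs for the running kernel,
# overwriting the existing image (-f).
dracut -f /boot/initramfs-$(uname -r).img $(uname -r)
```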
Created attachment 1600508 [details]
log for guest reboot successful

Created attachment 1600521 [details]
log for panic when reboot
log for guest panic when reboot
-----------------------------------------------------------------------------------------
2019-08-04 02:49:46: [ 33.052522] systemd-shutdow: 37 output lines suppressed due to ratelimiting
2019-08-04 02:49:46: [ 33.211601] systemd-shutdown[1]: Syncing filesystems and block devices.
2019-08-04 02:49:46: [ 33.217907] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
2019-08-04 02:49:46: [ 33.231619] systemd-journald[1376]: Received SIGTERM from PID 1 (systemd-shutdow).
2019-08-04 02:49:46: [ 33.242744] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
2019-08-04 02:49:46: [ 33.252781] systemd-shutdown[1]: Hardware watchdog 'i6300ESB timer', version 0
2019-08-04 02:49:46: [ 33.255000] systemd-shutdown[1]: Unmounting file systems.
2019-08-04 02:49:46: [ 33.257242] [4095]: Remounting '/boot' read-only in with options 'seclabel,attr2,inode64,noquota'.
2019-08-04 02:49:46: [ 33.270609] [4096]: Unmounting '/boot'.
2019-08-04 02:49:46: [ 33.282700] XFS (sda1): Unmounting Filesystem
2019-08-04 02:49:46: [ 33.287632] [4097]: Remounting '/' read-only in with options 'seclabel,attr2,inode64,noquota'.
2019-08-04 02:49:46: [ 33.295475] [4097]: Failed to remount '/' read-only: Device or resource busy
2019-08-04 02:49:46: [ 33.298320] [4098]: Remounting '/' read-only in with options 'seclabel,attr2,inode64,noquota'.
2019-08-04 02:49:46: [ 33.300070] [4098]: Failed to remount '/' read-only: Device or resource busy
2019-08-04 02:49:46: [ 33.301562] systemd-shutdown[1]: Not all file systems unmounted, 1 left.
2019-08-04 02:49:46: [ 33.302894] systemd-shutdown[1]: Deactivating swaps.
2019-08-04 02:49:46: [ 33.303924] systemd-shutdown[1]: All swaps deactivated.
2019-08-04 02:49:46: [ 33.304950] systemd-shutdown[1]: Detaching loop devices.
2019-08-04 02:49:46: [ 33.306252] systemd-shutdown[1]: All loop devices detached.
2019-08-04 02:49:46: [ 33.307342] systemd-shutdown[1]: Detaching DM devices.
2019-08-04 02:49:46: [ 33.308975] systemd-shutdown[1]: Detaching DM 253:1.
2019-08-04 02:49:46: [ 33.316142] systemd-shutdown[1]: Not all DM devices detached, 1 left.
2019-08-04 02:49:46: [ 33.317351] systemd-shutdown[1]: Unmounting file systems.
2019-08-04 02:49:46: [ 33.319248] [4099]: Remounting '/' read-only in with options 'seclabel,attr2,inode64,noquota'.
2019-08-04 02:49:46: [ 33.321145] [4099]: Failed to remount '/' read-only: Device or resource busy
2019-08-04 02:49:46: [ 33.322738] systemd-shutdown[1]: Not all file systems unmounted, 1 left.
2019-08-04 02:49:46: [ 33.323865] systemd-shutdown[1]: Detaching DM devices.
2019-08-04 02:49:46: [ 33.325086] systemd-shutdown[1]: Not all DM devices detached, 1 left.
2019-08-04 02:49:46: [ 33.326109] systemd-shutdown[1]: Cannot finalize remaining file systems, DM devices, continuing.
2019-08-04 02:49:46: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
2019-08-04 02:49:46: root 1 15.8 0.0 83028 17540 ? SLs 14:49 0:05 /usr/lib/system
[... several hundred kernel-thread rows ([kthreadd], [rcu_*], [kworker/*], [ksoftirqd/*], [migration/*], [watchdog/*], xfs-*, scsi_eh_*, bnx2i_thread/*, etc.) elided for readability; the remaining userspace processes were: ...]
2019-08-04 02:49:46: root 594 46.5 0.0 38852 2764 ? S 14:49 0:13 @usr/sbin/plymo
2019-08-04 02:49:46: root 4100 0.0 0.0 83032 1704 ? S 14:49 0:00 (sd-executor)
2019-08-04 02:49:46: root 4101 0.0 0.0 22184 2984 ? S 14:49 0:00 /usr/bin/sh /us
2019-08-04 02:49:46: root 4106 0.0 0.0 54248 3524 ? R 14:49 0:00 ps aux
--------------------------------------------------------------
2019-08-04 02:49:46: 200 tun
2019-08-04 02:49:46: 58 rdma_cm
2019-08-04 02:49:46: 130 watchdog
2019-08-04 02:49:46: 236 device-mapper
2019-08-04 02:49:46: 59 memory_bandwidth
2019-08-04 02:49:46: 60 network_throughput
2019-08-04 02:49:46: 61 network_latency
2019-08-04 02:49:46: 62 cpu_dma_latency
2019-08-04 02:49:46: 227 mcelog
2019-08-04 02:49:46: 183 hw_random
2019-08-04 02:49:46: 144 nvram
2019-08-04 02:49:46: 228 hpet
2019-08-04 02:49:46: 235 autofs
2019-08-04 02:49:46: 231 snapshot
2019-08-04 02:49:46: 63 vga_arbiter
--------------------------------------------------------------
2019-08-04 02:49:46: sysfs /sys sysfs rw,seclabel,nosuid,nodev,noexec,relatime 0 0
2019-08-04 02:49:46: proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
2019-08-04 02:49:46: devtmpfs /dev devtmpfs rw,seclabel,nosuid,size=14800024k,nr_inodes=3700006,mode=755 0 0
2019-08-04 02:49:46: securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
2019-08-04 02:49:46: tmpfs /dev/shm tmpfs rw,seclabel,nosuid,nodev 0 0
2019-08-04 02:49:46: devpts /dev/pts devpts rw,seclabel,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
2019-08-04 02:49:46: tmpfs /run tmpfs rw,seclabel,nosuid,nodev,mode=755 0 0
2019-08-04 02:49:46: tmpfs /sys/fs/cgroup tmpfs ro,seclabel,nosuid,nodev,noexec,mode=755 0 0
2019-08-04 02:49:46: cgroup /sys/fs/cgroup/systemd cgroup rw,seclabel,nosuid,nodev,noexec,relatime,xattr,name=systemd 0 0
2019-08-04 02:49:46: pstore /sys/fs/pstore pstore rw,seclabel,nosuid,nodev,noexec,relatime 0 0
2019-08-04 02:49:46: bpf /sys/fs/bpf bpf rw,nosuid,nodev,noexec,relatime,mode=700 0 0
[... the remaining per-controller cgroup mounts (perf_event, blkio, freezer, hugetlb, cpuset, devices, net_cls,net_prio, memory, cpu,cpuacct, rdma, pids) and configfs elided ...]
2019-08-04 02:49:46: /dev/mapper/rhel_bootp--73--227--63-root / xfs rw,seclabel,relatime,attr2,inode64,noquota 0 0
2019-08-04 02:49:46: selinuxfs /sys/fs/selinux selinuxfs rw,relatime 0 0
2019-08-04 02:49:46: hugetlbfs /dev/hugepages hugetlbfs rw,seclabel,relatime,pagesize=2M 0 0
2019-08-04 02:49:46: debugfs /sys/kernel/debug debugfs rw,seclabel,relatime 0 0
2019-08-04 02:49:46: mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0
2019-08-04 02:49:47: [ 33.805660] systemd-shutdown[1]: Successfully changed into root pivot.
2019-08-04 02:49:47: [ 33.876750] systemd-shutdown[1]: Returning to initrd...
2019-08-04 02:49:47: [ 33.952649] watchdog: watchdog0: watchdog did not stop!
2019-08-04 02:49:47: [ 34.105049] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00

reboot successful log
------------------------------------------------------------------------------------------
2019-08-04 02:31:06: [ 43.546859] dracut: Taking over mdmon processes.
2019-08-04 02:31:07: [ 43.662239] dracut Warning: Killing all remaining processes ...
2019-08-04 02:31:08: [ 44.610402] XFS (dm-0): Unmounting Filesystem ...
2019-08-04 02:31:27: [ 64.032496] sd 0:0:0:0: [sda] tag#9 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
2019-08-04 02:31:27: [ 64.135869] sd 0:0:0:0: [sda] tag#9 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
2019-08-04 02:31:27: [ 64.238950] sd 0:0:0:0: [sda] tag#9 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
2019-08-04 02:31:27: [ 64.346963] dracut Warning: Unmounted /oldroot.
2019-08-04 02:31:27: [ 64.514454] dracut: Disassembling device-mapper devices ...
2019-08-04 02:31:28: [ 64.714151] dracut: Waiting for mdraid devices to be clean.
2019-08-04 02:31:28: [ 64.818261] dracut: Disassembling mdraid devices.
2019-08-04 02:31:28: [ 64.962484] Unregister pv shared memory for cpu 2 ...
2019-08-04 02:31:28: [ 64.962717] Unregister pv shared memory for cpu 9
2019-08-04 02:31:29: [ 65.653416] Unregister pv shared memory for cpu 15
2019-08-04 02:31:29: [ 65.728362] sd 0:0:0:0: [sda] Synchronizing SCSI cache
2019-08-04 02:31:29: [ 65.799791] sd 0:0:0:0: tag#96 Send: scmd 0x000000000857a240
2019-08-04 02:31:29: [ 65.870638] sd 0:0:0:0: tag#96 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
2019-08-04 02:31:29: [ 65.942952] sd 0:0:0:0: tag#96 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
2019-08-04 02:31:29: [ 66.015502] sd 0:0:0:0: tag#96 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
2019-08-04 02:31:29: [ 66.111530] reboot: Restarting syste

Got the kernel core file; please download it from http://fileshare.englab.nay.redhat.com/pub/section2/coredump/bug1717323/.
crash> bt
PID: 1  TASK: ffff8ceb84a9af80  CPU: 8  COMMAND: "shutdown"
 #0 [ffffa8608312bce8] machine_kexec at ffffffffa5257a9e
 #1 [ffffa8608312bd40] __crash_kexec at ffffffffa535488d
 #2 [ffffa8608312be08] panic at ffffffffa52aef28
 #3 [ffffa8608312be88] do_exit.cold.22 at ffffffffa52b4478
 #4 [ffffa8608312bf08] do_group_exit at ffffffffa52b404a
 #5 [ffffa8608312bf30] __x64_sys_exit_group at ffffffffa52b40c4
 #6 [ffffa8608312bf38] do_syscall_64 at ffffffffa52041cb
 #7 [ffffa8608312bf50] entry_SYSCALL_64_after_hwframe at ffffffffa5c000ad
    RIP: 00007fb490077e4e  RSP: 00007ffc8accf028  RFLAGS: 00000206
    RAX: ffffffffffffffda  RBX: 00007fb490080530  RCX: 00007fb490077e4e
    RDX: 000000000000007f  RSI: 000000000000003c  RDI: 000000000000007f
    RBP: 00007fb490286d00  R8:  00000000000000e7  R9:  00007ffc8accef38
    R10: 0000000000000000  R11: 0000000000000206  R12: 0000000000000002
    R13: 0000000000000001  R14: 00007fb490286d30  R15: 0000000000000000
    ORIG_RAX: 00000000000000e7  CS: 0033  SS: 002b

crash> bt -a | grep PID -A 1
PID: 0  TASK: ffffffffa6412780  CPU: 0  COMMAND: "swapper/0"
 #0 [fffffe0000007e50] crash_nmi_callback at ffffffffa524b483
--
PID: 1  TASK: ffff8ceb84a9af80  CPU: 8  COMMAND: "shutdown"
 #0 [ffffa8608312bce8] machine_kexec at ffffffffa5257a9e
--
[... CPUs 1-7 and 9-15 all show the idle "swapper/N" task sitting in crash_nmi_callback; elided for readability ...]
crash>

You don't have permission to access /pub/section2/coredump/bug1717323/127.0.0.1-2019-08-05-15:33:26/vmcore on this server.

Could you upload the vmlinuz as well, please?

Harald Hoyer, permission changed; please download vmlinuz from the same folder.

Thanks,
Xu

Using the "crash" utility gives:
$ crash vmlinux vmcore
crash> ps | grep -v '\['
PID PPID CPU TASK ST %MEM VSZ RSS COMM
> 1 0 8 ffff8ceb84a9af80 ?? 0.0 0 0 shutdown
614 1 0 ffff8cf1d2b80000 IN 0.0 38852 2924 plymouthd
So I strongly suggest testing whether adding the kernel command line parameter `plymouth.enable=0` fixes the problem.
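For readers wanting to try this, one way to apply the parameter persistently on a RHEL 8 guest is via grubby (a sketch; rd.plymouth=0 is the companion initramfs-side switch that QE's later workaround also uses):

```sh
# Append the workaround to the command line of every installed kernel.
grubby --update-kernel=ALL --args="rd.plymouth=0 plymouth.enable=0"

# Verify the change on the default kernel entry:
grubby --info=DEFAULT | grep args
```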
*** Bug 1720181 has been marked as a duplicate of this bug. ***

Hi Bandan,

QE hit this issue again in our RHEL-8.2.0 gating test. Since QE applied the workaround in our gating test, 25 modules have been tested; this is the first time we have hit it again with the workaround in place.

Test version:
qemu: qemu-kvm-4.2.0-1.module+el8.2.0+4793+b09dd2fb.x86_64
kernel: kernel-4.18.0-151.el8.x86_64
guest iso: RHEL-8.2.0-20191120.n.0

The steps are:
1. Boot a RHEL 8.2.0 guest
2. Switch from runlevel 5 to 3
3. Execute the system_powerdown command via the qmp monitor
4. Guest kernel panics

Guest kernel command line:
# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-151.el8.x86_64 root=/dev/mapper/rhel_dell--md3460--c01m01-root ro console=ttyS0,115200 crashkernel=auto resume=/dev/mapper/rhel_dell--md3460--c01m01-swap rd.lvm.lv=rhel_dell-md3460-c01m01/root rd.lvm.lv=rhel_dell-md3460-c01m01/swap biosdevname=0 net.ifnames=0 console=tty0 biosdevname=0 net.ifnames=0 rd.plymouth=0 plymouth.enable=0 movable_node

For the core file and the guest image, please download from:
http://fileshare.englab.nay.redhat.com/pub/section2/coredump/bug1717323/127.0.0.1-2019-11-21-05:08:27/

Thanks!

Best regards
Yanan Fu

(In reply to Yanan Fu from comment #14)
> QE hit this issue again in our RHEL-8.2.0 gating test.
> [...]

Thanks for the report, can you confirm whether setting runlevel to 3 was the workaround suggested for bug 1626342 as well?

(In reply to Bandan Das from comment #15)
> Thanks for the report, can you confirm whether setting runlevel to 3 was the
> workaround suggested for bug 1626342 as well ?

Hi Bandan,

1. The workaround suggested in bug 1626342 is to add "rd.plymouth=0 plymouth.enable=0" to the kernel command line; this is what QE uses now.
2. Setting runlevel 3 cannot fix this problem: when we hit it this time, the runlevel was already 3; see the test steps I added in comment 15.

Thanks!

Best regards
Yanan Fu

Hi Eric,

Just to give a bit of context: this happens once in a while during a repeated reboot of the guest. I don't think there is any disk stress or any other type of I/O involved as part of the test, but Yanan can correct me.

There's a similar bug 1626342, where disabling plymouthd appeared to work. The trace there is the same as the one shown in this report. However, in the trace with the vmcore in comment 14, there's no plymouthd, since this run was in runlevel 3.
The only thing I saw as relevant in the trace is xfsaild running in an interruptible state. I am not sure whether this means the process is trying to sync items back to disk? I asked QE whether they can run the test with a non-XFS root to rule out filesystem issues. Do you have any other ideas why this could happen? Thanks!

I put the vmcore from comment #14 on optimus for easy access:

https://optimus.gsslab.rdu2.redhat.com/manager/694562946

$ retrace-server-interact 694562946 crash

Looking at that latest vmcore, I see nothing in the logs that implicates XFS. The system shuts down, filesystems unmount, then <something gets ratelimited> and it panics:

[ 52.382462] systemd-shutdown[1]: Unmounting file systems.
[ 52.386963] [2528]: Remounting '/boot' read-only in with options 'seclabel,attr2,inode64,noquota'.
[ 52.392461] [2529]: Unmounting '/boot'.
[ 52.676528] XFS (sda1): Unmounting Filesystem
[ 52.682343] [2530]: Remounting '/' read-only in with options 'seclabel,attr2,inode64,noquota'.
[ 52.697458] systemd-shutdown[1]: All filesystems unmounted.
[ 52.700165] systemd-shutdown[1]: Deactivating swaps.
[ 52.702585] systemd-shutdown[1]: All swaps deactivated.
[ 52.704921] systemd-shutdown[1]: Detaching loop devices.
[ 52.707520] systemd-shutdown[1]: All loop devices detached.
[ 52.709911] systemd-shutdown[1]: Detaching DM devices.
[ 53.026945] printk: shutdown: 7 output lines suppressed due to ratelimiting
[ 53.031485] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00

(earlier logs showed a system which was unable to complete the remount,ro of the root fs but that doesn't seem to be the case here)

xfs threads are still running on dm-0, but that's normal; we can only remount,ro the root fs during shutdown so daemons will hang around.

Looking at the xfs_mount structure for the root fs, dm-0 (found by walking the super_blocks list of active supers):

crash> xfs_mount.m_flags -x 0xffffa002df51d000
  m_flags = 0x500108

that's:

XFS_MOUNT_WAS_CLEAN
XFS_MOUNT_ATTR2
XFS_MOUNT_RDONLY
XFS_MOUNT_COMPAT_IOSIZE

so it looks like the XFS root fs has been successfully remounted readonly. I don't see any clues about why this panicked; the ratelimited printks certainly don't help.

The AIL appears to be empty, I don't think xfsaild is doing any work:

crash> xfs_mount.m_ail 0xffffa002df51d000
  m_ail = 0xffffa002d4f36000
crash> xfs_ail 0xffffa002d4f36000
struct xfs_ail {
  ail_mount = 0xffffa002df51d000,
  ail_task = 0xffffa002cd834740,
  ail_head = {
    next = 0xffffa002d4f36010,
    prev = 0xffffa002d4f36010
  },
  ail_target = 158913800877,
  ail_target_prev = 158913800877,
  ail_cursors = {
    next = 0xffffa002d4f36030,
    prev = 0xffffa002d4f36030
  },
...
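For anyone reproducing this analysis in crash, the super-block walk Eric mentions can be done roughly like this (a sketch using crash's generic list command; the struct and member names are from this kernel version, and the address in the second command is a stand-in, not taken from this vmcore):

```sh
# Walk the kernel's global super_blocks list, printing each
# filesystem's device id, to locate the entry for dm-0:
crash> list super_block.s_list -H super_blocks -s super_block.s_id

# For an XFS super_block, the xfs_mount hangs off s_fs_info
# (hypothetical address shown):
crash> super_block.s_fs_info ffffa002d4f3a800
```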
(In reply to Eric Sandeen from comment #26)
> Looking at that latest vmcore, I see nothing in the logs that implicates
> XFS. The system shuts down, filesystems unmount, then <something gets
> ratelimited> and it panics:
> [...]
> (earlier logs showed a system which was unable to complete the remount,ro of
> the root fs but that doesn't seem to be the case here)

Yep, that was exactly what I had in mind, but thanks for clarifying that here it's not the case.

> xfs threads are still running on dm-0, but that's normal; we can only
> remount,ro the root fs during shutdown so daemons will hang around.

Ok.

(In reply to Eric Sandeen from comment #27)
> The AIL appears to be empty, I don't think xfsaild is doing any work:

Thanks for confirming this as well. I guess I will go back to investigating the vmcore. Also, in view of the rate-limited messages and comment 26, the next guess is that the multipath device(s) failed to detach, at least in this case. Yanan, can ratelimiting be disabled when running this test? It's probably a systemd-shutdown setting.

Can you retest without printk ratelimiting so we can see what happened in here?

[ 52.709911] systemd-shutdown[1]: Detaching DM devices.
[ 53.026945] printk: shutdown: 7 output lines suppressed due to ratelimiting
[ 53.031485] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00

?

# sysctl -w kernel.printk_ratelimit=0

should turn it off.

Hm, I also found:

"When you have systemd running to the extent that it can provide you with a shell, please use it to extract useful information for debugging. Boot with these parameters on the kernel command line:

systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M printk.devkmsg=on

in order to increase the verbosity of systemd, to let systemd write its logs to the kernel log buffer, to increase the size of the kernel log buffer, and to prevent the kernel from discarding messages."

but TBH systemd is a black box to me.
> The trace is the same as shown in this report. > > However, in the trace with the vmcore in comment 14, there's no plymouthd > since plymouthd has been disabled in our test with adding "rd.plymouth=0 plymouth.enable=0" in kernel line. But, unfortunately, bz 1626342 happened again with this workaround last night. I will update the details in the bz later. > this is in runlevel 3. The only thing I saw as relevant from the trace is > xfsaild running in an interruptible state. I am not sure whether this means > that > the process is trying to sync items back to disk ? I asked QE whether they > can run the test with a non XFS root to rule out filesystem > issues. Do you have any other ideas why this could happen ? Thanks! From comment 26, Eric has confirmed it is not related with xfs, so let me stop my repeat test with this change now. (In reply to Eric Sandeen from comment #30) > Can you retest without printk ratelimiting so we can see what happened in > here? > > [ 52.709911] systemd-shutdown[1]: Detaching DM devices. > [ 53.026945] printk: shutdown: 7 output lines suppressed due to > ratelimiting > [ 53.031485] Kernel panic - not syncing: Attempted to kill init! > exitcode=0x00007f00 > > ? > > # sysctl -w kernel.printk_ratelimit=0 > > should turn it off. I will have a try with "kernel.printk_ratelimit=0" permanently with "sysctl -w kernel.printk_ratelimit=0", this change will be lost after reboot, so i add "kernel.printk_ratelimit = 0" in the conf file "/use/lib/sysctl.d/50-default.conf" For comment 31 and comment 32, it is possible to hold on the vm when case failed, i can do it. So, i need add "systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M printk.devkmsg=on" when repeat the test ? Hi Bandan and Eric, Since this issue is difficult to repeat, i want to double confirm all the must modification then start the test, to avoid invalid test: 1. rd.plymouth=0 plymouth.enable=0 2. kernel.printk_ratelimit=0 3. systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M printk.devkmsg=on Is it right? thanks all! Best regards Yanan Fu I'm sorry, I am not a systemd expert and cannot advise on all the various switches and settings. (In reply to Yanan Fu from comment #34) > (In reply to Eric Sandeen from comment #30) > > Can you retest without printk ratelimiting so we can see what happened in > > here? > > > > [ 52.709911] systemd-shutdown[1]: Detaching DM devices. > > [ 53.026945] printk: shutdown: 7 output lines suppressed due to > > ratelimiting > > [ 53.031485] Kernel panic - not syncing: Attempted to kill init! > > exitcode=0x00007f00 > > > > ? > > > > # sysctl -w kernel.printk_ratelimit=0 > > > > should turn it off. > > I will have a try with "kernel.printk_ratelimit=0" permanently > with "sysctl -w kernel.printk_ratelimit=0", this change will be lost after > reboot, > so i add "kernel.printk_ratelimit = 0" in the conf file > "/use/lib/sysctl.d/50-default.conf" > > > For comment 31 and comment 32, it is possible to hold on the vm when case > failed, i can do it. > So, i need add "systemd.log_level=debug systemd.log_target=kmsg > log_buf_len=1M printk.devkmsg=on" when repeat the test ? > > Hi Bandan and Eric, > Since this issue is difficult to repeat, i want to double confirm all the > must modification then start the test, to avoid invalid test: > 1. rd.plymouth=0 plymouth.enable=0 > 2. kernel.printk_ratelimit=0 > 3. systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M > printk.devkmsg=on > > Is it right? thanks all! > Hi Yanan, this looks good. 
printk_ratelimit is a sysctl setting, so that location for permanent changes appears to be correct but please double check after logging in to the guest with that setting in place. Also, drop me a line when you have the vm in panic state. > > Best regards > Yanan Fu Hi Yanan, Is there update ont his bug ? Were you able to run with debugging enabled and were you able to reproduce the issue and get the vm to a state where it can be investigated ? Hi Bandan, It is ok to run with debugging enabled and stuck the vm, but i can't reproduce it since then after many times. How about your side, have you got any new findings from the debug log and corefile ? QEMU has been recently split into sub-components and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks Hi all, Reproduce this bz under test[1] on hosts and guest -> (all kernel-4.18.0-208.el8.x86_64) via migration automation: After migration, shutdown or reboot guest will hit kernel panic. How reproducible: nearly 4/60 test[1]: 1. rd.plymouth=0 plymouth.enable=0 2. kernel.printk_ratelimit=0 3. systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M printk.devkmsg=on (In reply to Li Xiaohui from comment #40) > Hi all, > Reproduce this bz under test[1] on hosts and guest -> (all > kernel-4.18.0-208.el8.x86_64) via migration automation: > After migration, shutdown or reboot guest will hit kernel panic. > > How reproducible: > nearly 4/60 > > test[1]: > 1. rd.plymouth=0 plymouth.enable=0 > 2. kernel.printk_ratelimit=0 > 3. systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M > printk.devkmsg=on Considering that this pops up once in a while with automation tests, I think we should strongly consider reproducing this outisde of automation so that it's easier to debug. Would you be able to reproduce this outside of automation ? (In reply to Bandan Das from comment #41) > (In reply to Li Xiaohui from comment #40) > > Hi all, > > Reproduce this bz under test[1] on hosts and guest -> (all > > kernel-4.18.0-208.el8.x86_64) via migration automation: > > After migration, shutdown or reboot guest will hit kernel panic. > > > > How reproducible: > > nearly 4/60 > > > > test[1]: > > 1. rd.plymouth=0 plymouth.enable=0 > > 2. kernel.printk_ratelimit=0 > > 3. systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M > > printk.devkmsg=on > > Considering that this pops up once in a while with automation tests, I think > we should strongly > consider reproducing this outisde of automation so that it's easier to > debug. Would you be > able to reproduce this outside of automation ? Hi, It's difficult for me to reproduce this issue by manual since the reproducible rate by auto is low, too. Thanks (In reply to Li Xiaohui from comment #42) > (In reply to Bandan Das from comment #41) > > (In reply to Li Xiaohui from comment #40) > > > Hi all, > > > Reproduce this bz under test[1] on hosts and guest -> (all > > > kernel-4.18.0-208.el8.x86_64) via migration automation: > > > After migration, shutdown or reboot guest will hit kernel panic. > > > > > > How reproducible: > > > nearly 4/60 > > > > > > test[1]: > > > 1. rd.plymouth=0 plymouth.enable=0 > > > 2. kernel.printk_ratelimit=0 > > > 3. 
systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M > > > printk.devkmsg=on > > > > Considering that this pops up once in a while with automation tests, I think > > we should strongly > > consider reproducing this outisde of automation so that it's easier to > > debug. Would you be > > able to reproduce this outside of automation ? > > Hi, > It's difficult for me to reproduce this issue by manual since the > reproducible rate by auto is low, too. > > Thanks Will you be able to just setup a migration run followed by a poweroff using a simple script ? Basically - 1. migrate from system a to system b 2. poweroff 3. touch a file on system b indicating poweroff succeeded 4. if system a notices the touched file, repeat test If this works, we will have a much simpler environment to debug. (In reply to Bandan Das from comment #43) > (In reply to Li Xiaohui from comment #42) > > (In reply to Bandan Das from comment #41) > > > (In reply to Li Xiaohui from comment #40) > > > > Hi all, > > > > Reproduce this bz under test[1] on hosts and guest -> (all > > > > kernel-4.18.0-208.el8.x86_64) via migration automation: > > > > After migration, shutdown or reboot guest will hit kernel panic. > > > > > > > > How reproducible: > > > > nearly 4/60 > > > > > > > > test[1]: > > > > 1. rd.plymouth=0 plymouth.enable=0 > > > > 2. kernel.printk_ratelimit=0 > > > > 3. systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M > > > > printk.devkmsg=on > > > > > > Considering that this pops up once in a while with automation tests, I think > > > we should strongly > > > consider reproducing this outisde of automation so that it's easier to > > > debug. Would you be > > > able to reproduce this outside of automation ? > > > > Hi, > > It's difficult for me to reproduce this issue by manual since the > > reproducible rate by auto is low, too. > > > > Thanks > > Will you be able to just setup a migration run followed by a poweroff using > a simple script ? > Basically - > 1. migrate from system a to system b > 2. poweroff > 3. touch a file on system b indicating poweroff succeeded if poweroff failed in step 2, automation will find it, so no need this step 3. > 4. if system a notices the touched file, repeat test > > If this works, we will have a much simpler environment to debug. Ok, I will try later with step 1&2&4. I just do simple migration: migrate guest from A to B host, then shutdown guest, but not hit kernel panic, hit another issue when shutdown guest: How reproducible: 3/100 hit poweroff failed issue (the serial log of shutdown guest): [ 113.178451] systemd[1]: Stopped Import network configuration from initramfs. [ 113.182597] audit: type=1131 audit(1592551323.499:196): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=import-state comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 113.188006] systemd[1]: Reached target Shutdown. [ 113.191573] systemd[1]: Reached target Unmount All Filesystems. [ 113.195342] systemd[1]: Reached target Final Step. [ 113.200756] systemd[1]: Starting Power-Off... [ 113.216072] systemd[1]: Shutting down. [ 113.329656] systemd-shutdown[1]: Syncing filesystems and block devices. [ 113.429387] systemd-shutdown[1]: Sending SIGTERM to remaining processes... [ 113.439691] systemd-journald[911]: Received SIGTERM from PID 1 (systemd-shutdow). [ 113.570366] systemd-shutdown[1]: Sending SIGKILL to remaining processes... [ 113.576129] systemd-shutdown[1]: Unmounting file systems. 
(In reply to Bandan Das from comment #43)
> Will you be able to just setup a migration run followed by a poweroff using
> a simple script ?
> Basically -
> 1. migrate from system a to system b
> 2. poweroff
> 3. touch a file on system b indicating poweroff succeeded

If poweroff fails in step 2, the automation will notice it, so step 3 is not needed.

> 4. if system a notices the touched file, repeat test
>
> If this works, we will have a much simpler environment to debug.

Ok, I will try later with steps 1, 2 and 4. I just did a simple migration: migrate the guest from host A to host B, then shut down the guest. I did not hit the kernel panic, but hit another issue when shutting down the guest:

How reproducible:
3/100 hit the poweroff failed issue

(the serial log of the shutdown guest):
[ 113.178451] systemd[1]: Stopped Import network configuration from initramfs.
[ 113.182597] audit: type=1131 audit(1592551323.499:196): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=import-state comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 113.188006] systemd[1]: Reached target Shutdown.
[ 113.191573] systemd[1]: Reached target Unmount All Filesystems.
[ 113.195342] systemd[1]: Reached target Final Step.
[ 113.200756] systemd[1]: Starting Power-Off...
[ 113.216072] systemd[1]: Shutting down.
[ 113.329656] systemd-shutdown[1]: Syncing filesystems and block devices.
[ 113.429387] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[ 113.439691] systemd-journald[911]: Received SIGTERM from PID 1 (systemd-shutdow).
[ 113.570366] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[ 113.576129] systemd-shutdown[1]: Unmounting file systems.
[ 113.579284] [3037]: Remounting '/boot' read-only in with options 'seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota'.
[ 113.704423] [3038]: Unmounting '/boot'.
[ 113.827537] XFS (sda1): Unmounting Filesystem
[ 113.847095] [3039]: Remounting '/' read-only in with options 'seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota'.
[ 114.483254] systemd-shutdown[1]: All filesystems unmounted.
[ 114.485783] systemd-shutdown[1]: Deactivating swaps.
[ 114.487949] systemd-shutdown[1]: All swaps deactivated.
[ 114.489851] systemd-shutdown[1]: Detaching loop devices.
[ 114.491854] systemd-shutdown[1]: All loop devices detached.
[ 114.494205] systemd-shutdown[1]: Detaching DM devices.
[ 114.497777] systemd-shutdown[1]: Detaching DM 253:1.
[ 114.563258] systemd-shutdown[1]: Not all DM devices detached, 1 left.
[ 114.567250] systemd-shutdown[1]: Detaching DM devices.
[ 114.569860] systemd-shutdown[1]: Not all DM devices detached, 1 left.
[ 114.571968] systemd-shutdown[1]: Cannot finalize remaining DM devices, continuing.
[ 114.601160] systemd-shutdown[1]: Successfully changed into root pivot.
[ 114.603504] systemd-shutdown[1]: Returning to initrd...
[ 114.662209] dracut: Taking over mdmon processes.
[ 114.664468] dracut Warning: Killing all remaining processes
[ 114.843535] XFS (dm-0): Unmounting Filesystem
[ 114.896193] dracut Warning: Unmounted /oldroot.
[ 114.948172] dracut: Disassembling device-mapper devices
[ 115.026923] dracut: Waiting for mdraid devices to be clean.
[ 115.031883] dracut: Disassembling mdraid devices.
[ 115.041295] dracut Warning: poweroff failed!
[ 115.047690] dracut Warning:

QE can still hit this issue from time to time, with RHEL 8.3, RHEL 8.4, and also with RHEL 9; refer to https://bugzilla.redhat.com/show_bug.cgi?id=1922896

After reviewing the history of my test records, this issue seems to be AMD-only.

@Bandan,
Could you check this issue again with the core trace and core file that were added before? Feel free to let me know if you have any other request, thanks a lot!

(In reply to Yanan Fu from comment #48)
> QE still can hit this issue from time to time, with RHEL8.3, RHEL8.4,
> and also with RHEL9, refer
> https://bugzilla.redhat.com/show_bug.cgi?id=1922896
>
> After review the history of test my test record, seems this issue is amd
> only.

It's unfortunate that this is still not fixed, and I have absolutely no idea why this would be AMD-only.
From what I remember about this bug, there's nothing architecture-specific in the failure path.

> @Bandan,
> Could you check this issue again with the core trace and core file that
> were added before ?

I seem to have lost the email you had sent me earlier, but could you please summarize the steps to execute the reboot test locally? I remember that the last time I tried setting this up, the test tried to download an image from a server not local to Westford; that's the kind of thing I would like to avoid while setting this up.

Hi Bandan,
Please check your email; I shared one system with you, along with the steps to reproduce this issue. Feel free to use that system and contact me if you have any other question. The system is the one on which I just reproduced this issue. Thanks!

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

As in comment 48, this bz still exists, and RHEL 9 has the same issue. QE and Bandan are working on this bz and have been sharing info by email recently, so let me reopen it, thanks!

Not sure anyone put this in yet, but it seems like maybe it could be a task/group accounting issue? We panic this way in do_exit here:

    group_dead = atomic_dec_and_test(&tsk->signal->live);
    if (group_dead) {
            /*
             * If the last thread of global init has exited, panic
             * immediately to get a useable coredump.
             */
            if (unlikely(is_global_init(tsk)))
                    panic("Attempted to kill init! exitcode=0x%08x\n",
                          tsk->signal->group_exit_code ?: (int)code);

(In reply to Eric Sandeen from comment #54)
> Not sure anyone put this in yet, but it seems like maybe it could be a
> task/group accounting issue?

Thank you, I will take a look. Interestingly, I also noticed that the probability of hitting this greatly decreases once I turn off the crashkernel reservation. I have been trying to preserve an environment where the guest hits it, so that it can be inspected for more clues from inside the guest.
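For reference, the crashkernel reservation mentioned above comes from the crashkernel= kernel parameter; one way to drop it for a test run (a sketch; note that kdump stops working while it is removed):

```sh
# Remove the crashkernel= argument from every installed kernel entry,
# then reboot for the change to take effect.
grubby --update-kernel=ALL --remove-args="crashkernel"
```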
2021-05-24 04:04:00: [ 20.467680] ? __schedule+0x2cc/0x700
2021-05-24 04:04:00: [ 20.468673] do_group_exit+0x3a/0xa0
2021-05-24 04:04:00: [ 20.469644] __x64_sys_exit_group+0x14/0x20
2021-05-24 04:04:00: [ 20.470777] do_syscall_64+0x5b/0x1a0
2021-05-24 04:04:00: [ 20.471771] entry_SYSCALL_64_after_hwframe+0x65/0xca
2021-05-24 04:04:00: [ 20.473136] RIP: 0033:0x7f8744dd190e
2021-05-24 04:04:00: [ 20.474107] Code: 89 fa 41 b8 e7 00 00 00 be 3c 00 00 00 eb 14 0f 1f 44 00 00 89 d7 89 f0 0f 05 48 3d 00 f0 ff ff 77 1a f4 89 d7 44 89 c0 0f 05 <48> 3d 00 f0 ff ff 76 e2 f7 d8 89 05 6a f8 20 00 eb d8 f7 d8 89 05
2021-05-24 04:04:00: [ 20.479050] RSP: 002b:00007ffcfb0e67e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
2021-05-24 04:04:00: [ 20.481066] RAX: ffffffffffffffda RBX: 00007f8744dda210 RCX: 00007f8744dd190e
2021-05-24 04:04:00: [ 20.482965] RDX: 000000000000007f RSI: 000000000000003c RDI: 000000000000007f
2021-05-24 04:04:00: [ 20.484874] RBP: 00007f8744fe1e5f R08: 00000000000000e7 R09: 0000000000000000
2021-05-24 04:04:00: [ 20.486776] R10: 0000000000000020 R11: 0000000000000246 R12: 00007f8744fe1e70
2021-05-24 04:04:01: [ 20.488682] R13: 0000000000000001 R14: 00007f8744fe11d0 R15: 0000000000000000

I had been chasing the message from systemd that not all DM devices could be detached, but that doesn't seem related. Harald pointed me at https://mirrors.edge.kernel.org/pub/linux/utils/boot/dracut/dracut.html#debugging-dracut-on-shutdown to set a breakpoint in dracut, and I was optimistic that I could now find more clues before hitting the failure, because I would have the dracut shell to investigate the shutdown path. I set up the run and gingerly waited for the reboot failure to occur, but interestingly, the dracut breakpoint never hit! Of course, it does drop me to a shell when the reboot succeeds.

[ 49.108653] systemd-shutdown[1]: All filesystems unmounted.
[ 49.110500] systemd-shutdown[1]: Deactivating swaps.
[ 49.112603] systemd-shutdown[1]: All swaps deactivated.
[ 49.114307] systemd-shutdown[1]: Detaching loop devices.
[ 49.116081] systemd-shutdown[1]: All loop devices detached.
[ 49.117639] systemd-shutdown[1]: Detaching DM devices.
[ 49.119798] systemd-shutdown[1]: Detaching DM 253:1.
[ 49.215844] systemd-shutdown[1]: Not all DM devices detached, 1 left.
[ 49.218843] systemd-shutdown[1]: Detaching DM devices.
[ 49.221121] systemd-shutdown[1]: Not all DM devices detached, 1 left.
[ 49.222983] systemd-shutdown[1]: Cannot finalize remaining DM devices, continuing.
[ 54.232536] systemd-shutdown[1]: Successfully changed into root pivot.
[ 54.236768] systemd-shutdown[1]: Returning to initrd...
[ 54.242687] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00
...

So, it fails somewhere between systemd printing "Returning to initrd" and jumping back to dracut. Looking at systemd-shutdown (src/shutdown/shutdown.c):

        if (can_initrd) {
                r = switch_root_initramfs();
                if (r >= 0) {
                        argv[0] = (char*) "/shutdown";

                        (void) setsid();
                        (void) make_console_stdio();

                        log_info("Successfully changed into root pivot.\n"
                                 "Returning to initrd...");

                        execv("/shutdown", argv);

Contrary to my initial thought that it's systemd that's failing, the /shutdown binary executed after pivot_root is what causes the panic, but dracut doesn't get a chance to jump to a shell. Now, I am not sure what mechanism dracut uses to implement this breakpoint, but does this behavior shed some light on what might be wrong here?
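For reference, the breakpoint mechanism described in the dracut documentation linked above works by dropping extra kernel-command-line snippets into the unpacked initramfs before shutting down; roughly the following (quoted from the documentation's example, so treat it as a sketch rather than something verified on this reproducer):

mkdir -p /run/initramfs/etc/cmdline.d
echo "rd.break=pre-shutdown rd.break=shutdown rd.debug" > /run/initramfs/etc/cmdline.d/debug.conf
touch /run/initramfs/.need_shutdown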
Since systemd-shutdown has an if condition for the availability of the initrd, and the code has:

can_initrd = !in_container && !in_initrd() && access("/run/initramfs/shutdown", X_OK) == 0;

I took a hacky approach and renamed /run/initramfs (I was not able to disable dracut-shutdown.service), and then reran the test. I have so far run 200 iterations without hitting a failure; usually, I hit at least one failure every 50 runs. From the logs, I have verified that systemd isn't returning control to dracut anymore.

Ray, Harald, any idea why execv("/shutdown", argv) would fail after pivot_root? Is this somehow the same as https://bugzilla.redhat.com/show_bug.cgi?id=1879979#c9 ? (Although that probably wouldn't cause a panic.)

@rstrode @harald

Umm, I don't think I did the @ trick right. @harald and @rstrode, please take a look at comment 57.

I haven't read all the comments in the bug (currently away from keyboard, so I just read the last couple...), but if /shutdown is dying, you could try putting set -x at the top of it.

There was another bug not too long back where /shutdown was having issues. In that case, IIRC, the problem was that it was calling "reboot -f" and failing because systemd wasn't responding. I don't remember how that bug got resolved. Might be the same issue?

You could try changing the reboot -f to reboot -ff (so it reboots directly instead of using IPC) and see if that helps.

(In reply to Ray Strode [halfline] from comment #59)
> I haven't read all the comments in the bug (currently away from keyboard, so
> I just read the last couple...), but if /shutdown is dying, you could try
> putting set -x at the top of it.
>
Thanks for the tip. I couldn't figure out an easy way to store the script's output across successive reboots, so I echoed debug messages to /dev/kmsg. Here's what a successful poweroff looks like (the DEBUG statements were added to the script):

[ 55.451728] systemd-shutdown[1]: Successfully changed into root pivot.
[ 55.456491] systemd-shutdown[1]: Returning to initrd...
[ 55.465617] DEBUG: Sleeping for 10 seconds
[ 65.478270] DEBUG: ACTION is poweroff
[ 65.527526] DEBUG: oldroot
[ 65.549952] DEBUG: Handing exec
[ 65.551463] DEBUG: Checking for breakpoints
[ 65.599526] dracut: Taking over mdmon processes.
[ 65.601969] Killing all remaining processes
[ 65.604687] dracut Warning: Killing all remaining processes
[ 66.277627] XFS (dm-0): Unmounting Filesystem
[ 66.295378] dracut Warning: Unmounted /oldroot.
[ 66.437232] dracut: Disassembling device-mapper devices
[ 66.615583] dracut: Waiting for mdraid devices to be clean.
[ 66.625671] dracut: Disassembling mdraid devices.
[ 66.649428] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 66.707831] ACPI: Preparing to enter system sleep state S5
[ 66.712709] reboot: Power down

Compared to a run that fails:

[ 54.186189] systemd-shutdown[1]: Returning to initrd...
[ 54.192177] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00
[ 54.192177]
[ 54.195004] CPU: 1 PID: 1 Comm: shutdown Kdump: loaded Tainted: G --------- -t - 4.18.0-305.el8.x86_64 #1
[ 54.197252] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.13.0-2.module+el8.3.0+7353+9de0a3cc 04/01/2014
[ 54.199270] Call Trace:
[ 54.200438] dump_stack+0x5c/0x80
[ 54.201726] panic+0xe7/0x2a9
[ 54.202934] do_exit.cold.23+0x58/0xb6
[ 54.204174] ? __do_page_fault+0x20c/0x4c0
[ 54.205411] do_group_exit+0x3a/0xa0
[ 54.206561] __x64_sys_exit_group+0x14/0x20
[ 54.207790] do_syscall_64+0x5b/0x1a0
[ 54.208927] entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 54.210216] RIP: 0033:0x7fe82e93888e
[ 54.211324] Code: 89 fa 41 b8 e7 00 00 00 be 3c 00 00 00 eb 14 0f 1f 44 00 00 89 d7 89 f0 0f 05 48 3d 00 f0 ff ff 77 1a f4 89 d7 44 89 c0 0f 05 <48> 3d 00 f0 ff ff 76 e2 f7 d8 89 05 ea f8 20 00 eb d8 f7 d8 89 05

So, it looks like the combination of pivot_root and exec is what fails; the script probably never starts executing. Even if we consider that some messages were lost somehow, the timestamp difference in the failure case is ~5000 compared to ~20000 in the successful reboot, which leads me to believe that the sleep I stuck at the beginning of the shutdown script never got executed.

> There was another bug not too long back where /shutdown was having issues.
> In that case, IIRC, the problem was that it was calling "reboot -f" and
> failing because systemd wasn't responding. I don't remember how that bug got
> resolved. Might be the same issue?
I think you mean bug 1879979 here. It could be, but I am not sure how it would cause a panic.

> You could try changing the reboot -f to reboot -ff (so it reboots directly
> instead of using IPC) and see if that helps.
Yeah, I had tried this before by adding reboot -ff to a script in /usr/lib/systemd/system-shutdown. That works - I have seen failures at times, but the trace is completely different from what you see with this bug.

Ah right, it says "exitcode=0x00007f00"... which is exit status 127, I guess. Shells return 127 for "command not found". But that means the execv() is indeed running, and it at least started the shell at the top of /shutdown.

How high up is your sleep? Is it the very first line after the shebang?

If so, maybe the shebang is getting corrupted. You could imagine that if the top of /shutdown was changed from

#!/bin/sh

to

#!/bin/sh somegobblydegook

it would fail in the way you're seeing. A number of years ago I actually hit a bug like this in a shell script. It took me a long time to figure out because the "somegobblydegook" was a control character, so it was invisible. Of course, it's a somewhat more far-fetched explanation if it's happening intermittently; that would suggest a bug in cpio or something. Alternatively, if there is some code between the shebang and your sleep, maybe it's that code.

One idea to confirm you're really getting into /shutdown is to edit dracut-initramfs-restore to do something like this (note the quoted 'EOF', so that "$@" is written literally into the generated script rather than expanded here):

mv /run/initramfs/shutdown /run/initramfs/shutdown-real
cat << 'EOF' > /run/initramfs/shutdown
#!/bin/sh -x
echo "/shutdown called" > /dev/kmsg
md5sum /run/initramfs/shutdown-real > /dev/kmsg
head /run/initramfs/shutdown-real > /dev/kmsg
exec /run/initramfs/shutdown-real "$@"
EOF

That should show that /shutdown is at least getting called, give you a hash of the /shutdown that is getting run, and show the shebang line.

That would need a chmod +x /run/initramfs/shutdown in it too, of course.

(In reply to Ray Strode [halfline] from comment #61)
> Ah right, it says "exitcode=0x00007f00"... which is exit status 127, I
> guess.
>
> Shells return 127 for "command not found".
>
> But that means the execv() is indeed running, and it at least started the
> shell at the top of /shutdown.
>
> How high up is your sleep? Is it the very first line after the shebang?
>
> If so, maybe the shebang is getting corrupted.
> You could imagine that if the top of /shutdown was changed from
>
> #!/bin/sh
>
> to
>
> #!/bin/sh somegobblydegook
>
> it would fail in the way you're seeing. A number of years ago I actually hit
> a bug like this in a shell script. It took me a long time to figure out
> because the "somegobblydegook" was a control character, so it was invisible.
> Of course, it's a somewhat more far-fetched explanation if it's happening
> intermittently; that would suggest a bug in cpio or something.
>
> Alternatively, if there is some code between the shebang and your sleep,
> maybe it's that code.
>
> One idea to confirm you're really getting into /shutdown is to edit
> dracut-initramfs-restore to do something like:
>
I was about to reply this morning that my /shutdown script is consistent across reboots, because I keep a copy of the modified shutdown script in /boot and copy it over to /run/initramfs in dracut-initramfs-restore, but then I saw your comment mention md5sum and head, and I thought it would be good to print these out, especially before the system panics. Here's what I noticed:

1. Every once in a while, it appears dracut-initramfs-restore doesn't run (or at least doesn't return success). I know this because this is what the end of my modified dracut-initramfs-restore looks like:

--- dracut-initramfs-restore.orig	2021-05-27 04:48:44.140235342 +0800
+++ dracut-initramfs-restore	2021-05-27 03:19:24.845164873 +0800
@@ -40,4 +40,7 @@
     exit 1
 fi

+yes | cp -rf /boot/initramfs/shutdown /run/initramfs/shutdown
+echo "MD5SUM is " > /dev/kmsg
+md5sum /run/initramfs/shutdown > /dev/kmsg
 exit 0

Whenever a reboot panic occurs, I notice that the messages above do not get printed, and neither does this message (to reiterate, it does NOT get printed):

Starting Tell Plymouth To Jump To initramfs...

2. As another experiment, I printed out the top contents of /run/initramfs/shutdown by adding a script to systemd/system-shutdown. Whenever the panic occurs, /run/initramfs/shutdown doesn't contain the debugging flag I added ("set -x"). This would explain why the shutdown script's debug statements didn't appear in the case of a panic.

3. Next, I modified the script from 2 so that if I notice that /run/initramfs does not contain my modified shutdown script, I copy it over again. This made no difference, i.e. the kernel panic still didn't print my debug statements from the shutdown script.

4. I was not aware that /bin/sh in dracut points to /bin/bash, which is NOT statically linked. Could it be possible that there's a corrupt copy of the dracut file system and libraries, so the script fails at the shebang line like you said above? I modified my script from 2 even further so that if I detect that dracut-initramfs-restore didn't run, I forcibly copy over the shutdown script, usr, and sbin. This is what it looks like (sorry, it's ugly and hacky):

#!/bin/bash
if grep "ALTERED" /run/initramfs/shutdown
then
    echo "Modified shutdown script in place" > /dev/kmsg
else
    echo "initramfs service did not run!" > /dev/kmsg
    rm -rf /run/initramfs/shutdown
    rm -rf /run/initramfs/sbin
    rm -rf /run/initramfs/usr
    echo "Copying files" > /dev/kmsg
    cp -R /tmp/initramfs/sbin /run/initramfs/
    cp -R /tmp/initramfs/usr /run/initramfs/
    cp -R /tmp/initramfs/shutdown /run/initramfs/
    head -n 10 /run/initramfs/shutdown > /dev/kmsg
    sleep 5
fi

In short, my modified shutdown script has the "ALTERED" keyword in it, and /tmp has a backup copy of the initramfs.
This survived a full 100 runs of the reboot test, and I can verify from the logs that systemd does jump to the initrd at the end of shutdown. What could be causing dracut-initramfs-restore not to run? Could it be that there's a corrupt/stale extraction of /run/initramfs (from a previous run?), and when systemd tries to execve in this broken rootfs, it ends up in a panic? This doesn't look virtualization-related at all, but I am not sure why this happens only on AMD! Are you, or somebody else, or another team a better owner for this bug? I will post an update if the test fails at some point and all this theory goes down the drain :)

> mv /run/initramfs/shutdown /run/initramfs/shutdown-real
> cat << 'EOF' > /run/initramfs/shutdown
> #!/bin/sh -x
> echo "/shutdown called" > /dev/kmsg
> md5sum /run/initramfs/shutdown-real > /dev/kmsg
> head /run/initramfs/shutdown-real > /dev/kmsg
> exec /run/initramfs/shutdown-real "$@"
> EOF
>
> That should show that /shutdown is at least getting called, give you a hash
> of the /shutdown that is getting run, and show the shebang line.

Here's another experiment that is probably a little less convoluted than the previous one, and it partly confirms my thinking. The test is: if dracut-initramfs-restore does not run for whatever reason, manually copy a statically linked bash into /run/initramfs (along with the shutdown script):

cat /usr/lib/systemd/system-shutdown/debug.sh
#!/bin/bash
if grep "ALTERED" /run/initramfs/shutdown
then
    echo "Modified shutdown script in place" > /dev/kmsg
else
    echo "initramfs service did not run!" > /dev/kmsg
    rm -rf /run/initramfs/shutdown
    rm -rf /run/initramfs/bin/bash
    #rm -rf /run/initramfs/usr
    echo "Copying files" > /dev/kmsg
    cp -R /tmp/bash /run/initramfs/bin/
    cp -R /tmp/initramfs/shutdown /run/initramfs/
    head -n 10 /run/initramfs/shutdown > /dev/kmsg
    sleep 5
fi

Now, a failure gives you a little more information (the DEBUG-prefixed statements are mine; the dracut prefixes were as before):

[ 59.048236] systemd-shutdown[1]: Successfully changed into root pivot.
[ 59.049918] systemd-shutdown[1]: Returning to initrd...
[ 59.055876] DEBUG: Sleeping for 10 seconds
[ 59.059592] DEBUG: ACTION is poweroff
[ 59.101670] DEBUG: oldroot
[ 59.117388] DEBUG: Handing kexec
[ 59.119251] DEBUG: Checking for breakpoints
[ 59.133049] DEBUG: Killing all remaining processes
[ 59.168026] dracut Warning: Killing all remaining processes
[ 59.667445] dracut Warning: poweroff failed!
[ 59.683232] dracut Warning:
[ 59.719677] dracut Warning: Signal caught!
[ 59.738930] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000000
[ 59.738930]
[ 59.741332] CPU: 4 PID: 1 Comm: shutdown Kdump: loaded Tainted: G --------- -t - 4.18.0-305.el8.x86_64 #1
[ 59.743225] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.13.0-2.module+el8.3.0+7353+9de0a3cc 04/01/2014
[ 59.744917] Call Trace:
[ 59.745786] dump_stack+0x5c/0x80
[ 59.746745] panic+0xe7/0x2a9
[ 59.747630] do_exit.cold.23+0x58/0xb6
[ 59.748594] ? recalc_sigpending+0x17/0x50
[ 59.749614] ? __set_task_blocked+0x6d/0x90
[ 59.750650] do_group_exit+0x3a/0xa0
[ 59.751606] __x64_sys_exit_group+0x14/0x20
[ 59.752657] do_syscall_64+0x5b/0x1a0
[ 59.753659] entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 59.754813] RIP: 0033:0x49b90e
[ 59.755724] Code: ba 03 00 00 00 0f 05 48 89 c7 e8 0d fe ff ff 48 89 c7 e8 9c fd ff ff 85 c0 79 01 f4 58 5b 5d c3 48 63 ff b8 e7 00 00 00 0f 05 <ba> 3c 00 00 00 48 89 d0 0f 05 eb f9 66 0f 1f 44 00 00 41 55 41 54
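As a quick sanity check on the statically-linked-bash assumption above (illustrative commands; the output shown is what one would expect, not a capture from the test box):

file /tmp/bash
# expected: ELF 64-bit LSB executable, x86-64, ..., statically linked, ...
ldd /tmp/bash
# expected: not a dynamic executable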
So I'm just chiming in from the peanut gallery, really. I think the only reason I got cc'd on this bug is that plymouth has been implicated in breaking shutdown before :-)

You may be the best person to investigate, since you're already knee-deep in it and you've got a reproduction environment primed and active. Failing that, harald or someone else from the plumbers team could probably help, I guess.

I think, though, the next thing I would figure out is whether dracut-initramfs-restore is running at all.

Can you post a full boot dmesg log to the bug, btw?

dracut-initramfs-restore is called from dracut-shutdown.service, so is dracut-shutdown.service getting run? It does have this at the top of it:

ConditionPathExists=!/run/initramfs/bin/sh

So if, for some reason, /run/initramfs was already partially populated, that could explain why it's not running. dmesg should show if it's running, since you have systemd debugging turned on.

dracut-initramfs-restore also has these early exits at the top:

# do some sanity checks first
[ -e /run/initramfs/bin/sh ] && exit 0
[ -e /run/initramfs/.need_shutdown ] || exit 0
cd /run/initramfs

[ -f .need_shutdown -a -f "$IMG" ] || exit 1

So it could also be bailing early if .need_shutdown accidentally got deleted, or if the initramfs image was inaccessible. It looks like it's doing some heuristics to figure out what the initramfs is called; it could be that the heuristics are failing and it's picking the wrong name.

Maybe you should put a set -x in it to see how far it gets.

Of course, if dracut-initramfs-restore isn't running, then how is /run/initramfs/shutdown there at all? Did dracut-initramfs-restore get run earlier and get killed immediately, so it only had a chance to dump half the files?

It might be interesting to put

ExecStop=/bin/bash -c "find /run/initramfs/ -type f | xargs md5sum"

in dracut-shutdown.service before the ExecStop line that calls dracut-initramfs-restore, and also comment out the ConditionPathExists. That should give a clearer picture of the state of /run/initramfs before dracut-initramfs-restore is run. It should also let us see if it's running multiple times.

Another thing you could try is adding

ExecStop=/bin/bash -c '/bin/rm -rf /run/initramfs/*; touch /run/initramfs/.need_shutdown'

before the dracut-initramfs-restore one and see if that makes things start working reliably. That would also confirm that /run/initramfs is getting populated early and incompletely.

BTW, I don't think the exit status would be 127 if the shell wasn't working; if the shell was missing a library, execv() would fail, I think. Interesting that poweroff is failing after your workaround too. Maybe the problem I alluded to in comment 59 is happening *too*.
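To make the 127 <-> 0x7f00 relationship concrete: in the kernel's wait-status encoding, a normal exit code sits in bits 8-15, so exit status 127 is reported as 0x7f00, exactly what the panic line shows. A quick illustrative check (not run on the affected guest):

sh -c /does-not-exist; echo $?    # a shell reports "command not found" as 127
printf '0x%08x\n' $((127 << 8))   # prints 0x00007f00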
(In reply to Ray Strode [halfline] from comment #65)
> So I'm just chiming in from the peanut gallery, really. I think the only
> reason I got cc'd on this bug is that plymouth has been implicated in
> breaking shutdown before :-)
>
> You may be the best person to investigate, since you're already knee-deep in
> it and you've got a reproduction environment primed and active.
>
Your ideas in this comment are proof enough of how unfit a person I am for this bug. I am really a KVM guy with minimal knowledge of init. I kept triaging this issue so that we could reach a point where someone more knowledgeable could then take over. Barring that, I am happy to keep looking if no one is up for it. This has been a CI blocker for the QE guys for a while now.

> Failing that, harald or someone else from the plumbers team could probably
> help, I guess.
>
> I think, though, the next thing I would figure out is whether
> dracut-initramfs-restore is running at all.
>
> Can you post a full boot dmesg log to the bug, btw?
>
Let me start by attaching two logs, one that fails and one that passes. I will go through the rest of your comments later and try out your suggestions. These messages seem to be the difference:

[  OK  ] Stopped Restore /run/initramfs on shutdown.
         Starting Tell Plymouth To Jump To initramfs...

> dracut-initramfs-restore is called from dracut-shutdown.service, so is
> dracut-shutdown.service getting run? It does have this at the top of it:
>
> ConditionPathExists=!/run/initramfs/bin/sh
>
> So if, for some reason, /run/initramfs was already partially populated, that
> could explain why it's not running. dmesg should show if it's running, since
> you have systemd debugging turned on.
>
> dracut-initramfs-restore also has these early exits at the top:
>
> # do some sanity checks first
> [ -e /run/initramfs/bin/sh ] && exit 0
> [ -e /run/initramfs/.need_shutdown ] || exit 0
> cd /run/initramfs
>
> [ -f .need_shutdown -a -f "$IMG" ] || exit 1
>
> So it could also be bailing early if .need_shutdown accidentally got
> deleted, or if the initramfs image was inaccessible. It looks like it's
> doing some heuristics to figure out what the initramfs is called; it could
> be that the heuristics are failing and it's picking the wrong name.
>
> Maybe you should put a set -x in it to see how far it gets.
>
> Of course, if dracut-initramfs-restore isn't running, then how is
> /run/initramfs/shutdown there at all? Did dracut-initramfs-restore get run
> earlier and get killed immediately, so it only had a chance to dump half the
> files?
>
> It might be interesting to put
>
> ExecStop=/bin/bash -c "find /run/initramfs/ -type f | xargs md5sum"
>
> in dracut-shutdown.service before the ExecStop line that calls
> dracut-initramfs-restore, and also comment out the ConditionPathExists. That
> should give a clearer picture of the state of /run/initramfs before
> dracut-initramfs-restore is run.
>
> It should also let us see if it's running multiple times.
>
> Another thing you could try is adding
>
> ExecStop=/bin/bash -c '/bin/rm -rf /run/initramfs/*; touch
> /run/initramfs/.need_shutdown'
>
> before the dracut-initramfs-restore one and see if that makes things start
> working reliably. That would also confirm that /run/initramfs is getting
> populated early and incompletely.

Created attachment 1787677 [details]
Panic Log
Created attachment 1787678 [details]
Successful power down
(In reply to Ray Strode [halfline] from comment #66)
> BTW, I don't think the exit status would be 127 if the shell wasn't working;
> if the shell was missing a library, execv() would fail, I think.
>
Hmm... I am wondering this myself. If you look at comment 63, I do print out the contents of the shutdown script. When it fails, it still has debug enabled, but nothing gets printed out.

> Interesting that poweroff is failing after your workaround too. Maybe the
> problem I alluded to in comment 59 is happening *too*.

Ah, your logs don't seem to have systemd.log_level=debug systemd.log_target=kmsg in them.
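For completeness, this is the sort of change meant here: enable verbose systemd logging to kmsg for subsequent boots (a sketch; grubby is one way to edit the command line, and the printk knob is optional):

grubby --update-kernel=ALL --args="systemd.log_level=debug systemd.log_target=kmsg log_buf_len=64M"
# optionally relax printk rate limiting so shutdown-time messages aren't dropped:
echo 0 > /proc/sys/kernel/printk_ratelimit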
Created attachment 1787921 [details]
Successful power down
Created attachment 1787922 [details]
Panic Log
(In reply to Ray Strode [halfline] from comment #71)
> Ah, your logs don't seem to have systemd.log_level=debug
> systemd.log_target=kmsg in them.
I uploaded new ones - I added the systemd debug options on the kernel cmdline and also disabled as much ratelimiting as I could; too much debug seems to make the test time out. Please take a look when you get a chance.

Bummer about the rate limiting; it seems like it's filtering out almost all the systemd messages. Anyway, from the logs we do see that dracut-shutdown is getting run:

Stopped Restore /run/initramfs on shutdown.

[ 48.657795] audit: type=1131 audit(1622161915.419:289): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dracut-shutdown comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'

which was one open question. So it's likely that dracut-initramfs-restore is getting run; we don't know if it's exiting early. Some of the ideas in comment 65 might help get us further.

I did just have one other thought: we could be dealing with low disk space! /run is by default 50% of RAM, I think, so on this system that's 7G. If, for some reason, /run was filled with 7G of data before shutdown, that might lead to the cpio archive only getting partially unpacked, I guess.

Of course, we don't know for sure that it's only partially unpacked... doing the ExecStop=/bin/bash -c "find /run/initramfs/ -type f | xargs md5sum" bit from comment 65 would tell us that definitively.

You could throw in an

ExecStop=/bin/df /run

in there before dracut-initramfs-restore to get an idea of what disk usage is like right before the archive is unpacked.

So one thing I didn't realize yesterday is that the guest17 log shows multiple boots. Let's take a closer look...

╎❯ egrep 'dracut-shutdown|Kernel panic|Command line|Comm:|reboot:|Power key|login:|Stop.*Restore' guest17.log

[ 0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-305.el8.x86_64 root=/dev/mapper/rhel_dhcp158--71-root ro console=ttyS0,115200 crashkernel=auto resume=/dev/mapper/rhel_dhcp158--71-swap rd.lvm.lv=rhel_dhcp158-71/root rd.lvm.lv=rhel_dhcp158-71/swap biosdevname=0 net.ifnames=0 console=tty0 log_buf_len=64M systemd.log_level=debug systemd.log_target=kmsg
^ first boot

dhcp159-111 login:
^ booted

[ 47.089095] systemd-logind[1382]: Power key pressed.
^ power key pressed

[ 49.189981] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00
[ 49.198699] CPU: 7 PID: 1 Comm: shutdown Kdump: loaded Tainted: G --------- -t - 4.18.0-305.el8.x86_64 #1
^ /run/initramfs/shutdown exited with 127

[ 0.000000] Command line: elfcorehdr=0x73000000 BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-305.el8.x86_64 ro console=ttyS0,115200 resume=/dev/mapper/rhel_dhcp158--71-swap biosdevname=0 net.ifnames=0 console=tty0 systemd.log_level=debug systemd.log_target=kmsg irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd hest_disable disable_cpu_apicid=0 iTCO_wdt.pretimeout=0
^ second boot. This time without crashkernel=auto and no root specified. I guess kexec?
[ 6.739997] reboot: Restarting system
[ 6.741312] reboot: machine restart
^ restarted automatically

[ 0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-305.el8.x86_64 root=/dev/mapper/rhel_dhcp158--71-root ro console=ttyS0,115200 crashkernel=auto resume=/dev/mapper/rhel_dhcp158--71-swap rd.lvm.lv=rhel_dhcp158-71/root rd.lvm.lv=rhel_dhcp158-71/swap biosdevname=0 net.ifnames=0 console=tty0 log_buf_len=64M systemd.log_level=debug systemd.log_target=kmsg
^ third boot.

dhcp159-111 login: [ 128.196499] systemd[1]: libvirtd.service: Succeeded.
^ booted.

[ OK ] Stopped Restore /run/initramfs on shutdown.
[ 346.441972] audit: type=1131 audit(1622162347.078:275): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dracut-shutdown comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
^ only here is dracut-shutdown called!

[ 347.532202] reboot: Power down
^ stayed booted for like 5 minutes before it powered down

So dracut-shutdown is not called before the panic. We do see that systemd knows about the unit, though:

╎❯ egrep "Start.*Restore" guest17.log
Starting Restore /run/initramfs on shutdown...
[ OK ] Started Restore /run/initramfs on shutdown.
Starting Restore /run/initramfs on shutdown...
[ OK ] Started Restore /run/initramfs on shutdown.

So this means the ConditionPathExists=!/run/initramfs/bin/sh isn't triggering (i.e. /run/initramfs isn't getting mispopulated in early boot).

Hmm... too bad all the debug messages are getting rate limited.

(In reply to Ray Strode [halfline] from comment #76)
IIUC, that's the behavior of the test script in case of a panic.

> [ 47.089095] systemd-logind[1382]: Power key pressed.
> ^ power key pressed
>
The script invoked the reboot.

> [ 49.189981] Kernel panic - not syncing: Attempted to kill init!
> exitcode=0x00007f00
> [ 49.198699] CPU: 7 PID: 1 Comm: shutdown Kdump: loaded Tainted: G
> --------- -t - 4.18.0-305.el8.x86_64 #1
> ^ /run/initramfs/shutdown exited with 127
>
> [ 0.000000] Command line: elfcorehdr=0x73000000 BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-305.el8.x86_64 ro console=ttyS0,115200 resume=/dev/mapper/rhel_dhcp158--71-swap biosdevname=0 net.ifnames=0 console=tty0 systemd.log_level=debug systemd.log_target=kmsg irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd hest_disable disable_cpu_apicid=0 iTCO_wdt.pretimeout=0
> ^ second boot. This time without crashkernel=auto and no root specified. I
> guess kexec?
>
Yes, this is the kdump kernel.

> [ 6.739997] reboot: Restarting system
> [ 6.741312] reboot: machine restart
> ^ restarted automatically
>
Yes.

> [ 0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-305.el8.x86_64 root=/dev/mapper/rhel_dhcp158--71-root ro console=ttyS0,115200 crashkernel=auto resume=/dev/mapper/rhel_dhcp158--71-swap rd.lvm.lv=rhel_dhcp158-71/root rd.lvm.lv=rhel_dhcp158-71/swap biosdevname=0 net.ifnames=0 console=tty0 log_buf_len=64M systemd.log_level=debug systemd.log_target=kmsg
> ^ third boot.
>
> dhcp159-111 login: [ 128.196499] systemd[1]: libvirtd.service: Succeeded.
> ^ booted.
>
> [ OK ] Stopped Restore /run/initramfs on shutdown.
> [ 346.441972] audit: type=1131 audit(1622162347.078:275): pid=1 uid=0
> auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0
> msg='unit=dracut-shutdown comm="systemd" exe="/usr/lib/systemd/systemd"
> hostname=? addr=? terminal=?
> res=success'
> ^ only here is dracut-shutdown called!
>
Here the test script timed out, and so it again caused a power down.

> [ 347.532202] reboot: Power down
> ^ stayed booted for like 5 minutes before it powered down
>
> So dracut-shutdown is not called before the panic. We do see that systemd
> knows about the unit, though:
>
> ╎❯ egrep "Start.*Restore" guest17.log
> Starting Restore /run/initramfs on shutdown...
> [ OK ] Started Restore /run/initramfs on shutdown.
> Starting Restore /run/initramfs on shutdown...
> [ OK ] Started Restore /run/initramfs on shutdown.
>
> So this means the ConditionPathExists=!/run/initramfs/bin/sh isn't
> triggering (i.e. /run/initramfs isn't getting mispopulated in early boot).
>
> Hmm... too bad all the debug messages are getting rate limited.
I can try tweaking the ratelimiting numbers to see if I can get a little bit more messages. The script quickly starts complaining. @Yanan, any idea how we can tweak the timeout so we can turn off all ratelimiting as described here: https://access.redhat.com/solutions/1417483 ? The test fails otherwise.

(In reply to Ray Strode [halfline] from comment #75)
> Bummer about the rate limiting; it seems like it's filtering out almost all
> the systemd messages. Anyway, from the logs we do see that dracut-shutdown
> is getting run:
>
> Stopped Restore /run/initramfs on shutdown.
>
> [ 48.657795] audit: type=1131 audit(1622161915.419:289): pid=1 uid=0
> auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0
> msg='unit=dracut-shutdown comm="systemd" exe="/usr/lib/systemd/systemd"
> hostname=? addr=? terminal=? res=success'
>
> which was one open question. So it's likely that dracut-initramfs-restore is
> getting run. We don't know if it's exiting early. Some of the ideas in
> comment 65 might help get us further.
>
> I did just have one other thought: we could be dealing with low disk space!
>
> /run is by default 50% of RAM, I think, so on this system that's 7G. If, for
> some reason, /run was filled with 7G of data before shutdown, that might
> lead to the cpio archive only getting partially unpacked, I guess.
>
> Of course, we don't know for sure that it's only partially unpacked... doing
> the ExecStop=/bin/bash -c "find /run/initramfs/ -type f | xargs md5sum" bit
> from comment 65 would tell us that definitively.
>
> You could throw in an
>
> ExecStop=/bin/df /run
>
> in there before dracut-initramfs-restore to get an idea of what disk usage
> is like right before the archive is unpacked.

Thanks for all the helpful info! I will try out everything and post results soon. Back to this: the first thing I tried was setting up debugging in dracut-initramfs-restore, along the lines of the sketch below.
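(A sketch of the instrumentation, assuming bash and a writable /dev/kmsg at that point in shutdown; not the verbatim change:)

# near the top of /usr/lib/dracut/dracut-initramfs-restore:
exec 2> /dev/kmsg   # send the shell trace to the kernel log
set -x              # trace every command the script runs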
It gives some clues, but I am not sure what to make of it. With a successful run, the output is:

+ '[' -e /run/initramfs/bin/sh ']'
+ '[' -e /run/initramfs/.need_shutdown ']'
++ uname -r
+ KERNEL_VERSION=4.18.0-305.el8.x86_64
+ [[ -n '' ]]
+ dracutbasedir=/usr/lib/dracut
+ SKIP=/usr/lib/dracut/skipcpio
+ [[ -x /usr/lib/dracut/skipcpio ]]
+ [[ -f /etc/machine-id ]]
+ read MACHINE_ID
+ mount -o ro /boot
+ true
+ [[ -n 4e4d1268e3bc4436ad9fc61613d3c4cd ]]
+ [[ -d /boot/4e4d1268e3bc4436ad9fc61613d3c4cd ]]
+ [[ -L /boot/4e4d1268e3bc4436ad9fc61613d3c4cd ]]
+ [[ -f '' ]]
+ IMG=/boot/initramfs-4.18.0-305.el8.x86_64.img
+ cd /run/initramfs
+ '[' -f .need_shutdown -a -f /boot/initramfs-4.18.0-305.el8.x86_64.img ']'
+ /usr/lib/dracut/skipcpio /boot/initramfs-4.18.0-305.el8.x86_64.img
+ zcat
+ cpio -id --no-absolute-filenames
390118 blocks
+ echo 'cpio succeeded'
cpio succeeded
+ rm -f -- .need_shutdown
+ exit 0

With the one that fails:

+ '[' -e /run/initramfs/bin/sh ']'
+ '[' -e /run/initramfs/.need_shutdown ']'
++ uname -r
+ KERNEL_VERSION=4.18.0-305.el8.x86_64
+ [[ -n '' ]]
+ dracutbasedir=/usr/lib/dracut
+ SKIP=/usr/lib/dracut/skipcpio
+ [[ -x /usr/lib/dracut/skipcpio ]]
+ [[ -f /etc/machine-id ]]
+ read MACHINE_ID
+ mount -o ro /boot
+ true
+ [[ -n 4e4d1268e3bc4436ad9fc61613d3c4cd ]]
+ [[ -d /boot/4e4d1268e3bc4436ad9fc61613d3c4cd ]]
+ [[ -L /boot/4e4d1268e3bc4436ad9fc61613d3c4cd ]]
+ [[ -f '' ]]
+ IMG=/boot/initramfs-4.18.0-305.el8.x86_64.img
+ cd /run/initramfs
+ '[' -f .need_shutdown -a -f /boot/initramfs-4.18.0-305.el8.x86_64.img ']'
+ /usr/lib/dracut/skipcpio /boot/initramfs-4.18.0-305.el8.x86_64.img
+ zcat
+ cpio -id --no-absolute-filenames

The output just stops here, and it appears that at the cpio line the script just died mysteriously, or messages stopped getting logged. That falls in line with comment 63, where my custom message at the end of dracut-initramfs-restore:

+echo "MD5SUM is " > /dev/kmsg
+md5sum /run/initramfs/shutdown > /dev/kmsg
 exit 0

doesn't get printed when the reboot panic occurs.

Does setting DefaultTimeoutStopSec=0 in systemd.conf make the problem go away? I wonder if it's just taking too long and systemd is killing it, maybe. The other possibility is that cpio is killing the parent, though that would be weird. You could try putting strace before the cpio line to check.
You could also try adding a sleep 200 or something before the cpio line, to see if it makes it 100% reproducible.

(In reply to Bandan Das from comment #77)
> > Hmm... too bad all the debug messages are getting rate limited.
> I can try tweaking the ratelimiting numbers to see if I can get a little bit
> more messages. The script quickly starts complaining. @Yanan, any idea how
> we can tweak the timeout so we can turn off all ratelimiting as described
> here: https://access.redhat.com/solutions/1417483 ? The test fails otherwise.
Hi Bandan,
Do you mean the timeout in the auto script? To extend the timeout of rebooting the VM (auto case name: reboot), you can change the value of 'timeout':
https://github.com/autotest/tp-qemu/blob/cfb8496f68014572cf7bab6144b3d163ce6df26e/generic/tests/boot.py#L55
Hope it is helpful, thanks!

(In reply to Ray Strode [halfline] from comment #80)
> Does setting DefaultTimeoutStopSec=0 in systemd.conf make the problem go
> away? I wonder if it's just taking too long and systemd is killing it,
> maybe. The other possibility is that cpio is killing the parent, though that
> would be weird. You could try putting strace before the cpio line to check.
> You could also try adding a sleep 200 or something before the cpio line, to
> see if it makes it 100% reproducible.
DefaultTimeoutStopSec=0 didn't help; the crash triggers with the expected probability. I tried the sleep too - systemd correctly tells me that something is taking longer than expected and kills it, but that's probably not the right setup for this error to occur, since it doesn't actually cause *cpio* to fail, which is what is happening. Adding strace makes the error incredibly difficult to reproduce, and I am not sure what to make of that! My test has been going on for a few hours without a failure.

I also tried adding an ExecStopPost to dracut-shutdown.service, hoping to retry the initrd extraction if it fails for whatever reason. However, when the panic occurs, the script that ExecStopPost points to doesn't get executed. That's probably intended? I.e., if ExecStop doesn't exit gracefully, systemd ignores the post script.

Another thing I was wondering is whether we could replace cpio with something else (like tar?) and check if it still reproduces. I didn't get far; do you have a command line that would work and that replaces cpio?

I don't think tar supports the cpio format.

Does the script have set -e at the top of it? I guess that would make it fail right away if cpio was failing.

You could try replacing cpio with a wrapper to catch the problem, maybe (note the quoted 'EOF', so that "${@}" and $status are written literally into the wrapper rather than expanded while creating it):

mv /usr/bin/cpio /usr/bin/cpio.bin

cat << 'EOF' > /usr/bin/cpio
#!/bin/sh
/usr/bin/cpio.bin "${@}" < /dev/stdin
status=$?
echo cpio terminated with status $status | tee /dev/kmsg
exit $status
EOF

chmod +x /usr/bin/cpio
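If the suspicion is that cpio is being killed mid-extraction, the same wrapper could also report which signal did it - a hypothetical extension (shells encode death-by-signal as 128 + signal number):

status=$?
if [ "$status" -gt 128 ]; then
    # e.g. 137 -> 9 (SIGKILL), 143 -> 15 (SIGTERM)
    echo "cpio killed by signal $((status - 128))" | tee /dev/kmsg
else
    echo "cpio exited with status $status" | tee /dev/kmsg
fi
exit $status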
(In reply to Ray Strode [halfline] from comment #83)
> I don't think tar supports the cpio format.
>
> Does the script have set -e at the top of it? I guess that would make it
> fail right away if cpio was failing.
>
> You could try replacing cpio with a wrapper to catch the problem, maybe:
>
> mv /usr/bin/cpio /usr/bin/cpio.bin
>
> cat << 'EOF' > /usr/bin/cpio
> #!/bin/sh
> /usr/bin/cpio.bin "${@}" < /dev/stdin
> status=$?
> echo cpio terminated with status $status | tee /dev/kmsg
> exit $status
> EOF
>
> chmod +x /usr/bin/cpio

Some updates: Ray, with your modifications as above, there are no failures anymore. I have been running over the weekend and even did a fresh install; no failures so far. I am not sure how!

The other thing I tried is removing "set -e" from dracut-initramfs-restore. Even with it removed, my debug prints won't appear on a failure, i.e. nothing below the cpio line gets printed.

The other thing I added is:

trap 'echo *******dracut: exit normally************ > /dev/kmsg' EXIT
trap 'echo *******dracut: signal received********** > /dev/kmsg' 1 2 3 4 5 6 7 8 9 10

IIUC, the script getting killed would show up as an abnormal exit? But that's not what I see; I always get a "dracut: exit normally", even when a reboot panic occurs. These are the messages that get printed around the time the script is executed:

[ 45.138075] sda1: Can't mount, would change RO state
[ 46.560350] audit: type=1305 audit(1624204298.404:285): op=set audit_pid=0 old=1237 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
[ 46.600946] audit: type=1131 audit(1624204298.444:286): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=auditd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 46.601800] audit: type=1131 audit(1624204298.445:287): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-tmpfiles-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 46.603090] audit: type=1131 audit(1624204298.447:288): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=import-state comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 46.870157] systemd-shutdown[1]: Syncing filesystems and block devices.
[ 46.871679] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[ 46.878407] *******dracut: exit normally************
[ 46.894512] systemd-shutdown[1]: Sending SIGKILL to remaining processes...

Even if systemd somehow killed the script, SIGTERM or SIGKILL would result in the "dracut: signal received" line getting printed, correct? I thought we were closer, but I am even more confused now :)

So the EXIT trap will get run for abnormal termination too.

See for instance this example:

```
#!/bin/sh
set -e

trap 'echo EXITED' EXIT
trap 'echo ERROR' ERR
trap 'echo SIGNAL' TERM
kill -TERM $$
false
```

```
╎❯ ./foo
SIGNAL
ERROR
EXITED
```

Also note you only went up to signal 10, but the SIGTERM signal is 15.

The fact that your script dies after systemd signals SIGTERM to all processes seems to point to systemd killing it. The fact that your script is even running after systemd sends SIGTERM to all processes seems weird and wrong.

I mean, look at the order of events in the dracut-shutdown.service man page:

```
The following steps are executed during a shutdown:

•   systemd switches to the shutdown.target

•   systemd starts /lib/systemd/system/shutdown.target.wants/dracut-shutdown.service

•   dracut-shutdown.service executes /usr/lib/dracut/dracut-initramfs-restore, which unpacks the initramfs to /run/initramfs

•   systemd finishes shutdown.target

•   systemd kills all processes
```

You can see that before systemd kills all processes, dracut-initramfs-restore should have already run and finished. Your log suggests it's still running when systemd-shutdown kills all processes.

In the logs I talked about in comment 76, I noted that dracut-shutdown/dracut-initramfs-restore weren't getting called at all during the problem boot. So it seems like ordering constraints aren't working.

Normally, the way it works, I think, is that at boot up, dracut-shutdown is "started", where ExecStart is "/bin/true" and RemainAfterExit=yes. So at boot it's not doing anything but making systemd treat the service as if it were active for the duration of the machine's uptime. It's a bookkeeping maneuver so that it can get "stopped" at shutdown. Come shutdown time, systemd "stops" dracut-shutdown, which makes it run ExecStop=/usr/lib/dracut/dracut-initramfs-restore. The service file has Conflicts=shutdown.target, so shutdown shouldn't proceed until the ExecStop command finishes.

It seems in some cases we're not seeing it get run at all (cf. comment 76), and in other cases we're seeing it briefly get run too late, only for it to be killed shortly after (cf. comment 84).

dracut-shutdown won't run /bin/true at boot up and become marked active until local-fs.target is reached. local-fs.target is reached when all the file systems in /etc/fstab are finished mounting. I wonder if the reboot is getting triggered early in boot up, before all the mounts have finished?

Anything weird in fstab?

I'm having another look at the logs in comment 76 ...
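A side note on the trap lists above - an illustrative pattern that traps signals by name, so SIGTERM (15) cannot be left out of a numeric list by accident (assumes bash):

for sig in HUP INT QUIT ABRT PIPE TERM; do
    trap "echo '*******dracut: received SIG$sig*****' > /dev/kmsg" "$sig"
done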
(In reply to Ray Strode [halfline] from comment #85)
> In the logs I talked about in comment 76, I noted that
> dracut-shutdown/dracut-initramfs-restore weren't getting called at all
> during the problem boot. So it seems like ordering constraints aren't
> working.
This was wrong. dracut-shutdown is started in that log, just never stopped:

```
[  OK  ] Reached target Local File Systems.
         Starting Restore /run/initramfs on shutdown...
[  OK  ] Started Restore /run/initramfs on shutdown.
...
dhcp159-111 login: [   47.089095] systemd-logind[1382]: Power key pressed.
[  OK  ] Reached target Shutdown.
[  OK  ] Reached target Unmount All Filesystems.
[  OK  ] Reached target Final Step.
[   49.189981] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00
[   49.198699] CPU: 7 PID: 1 Comm: shutdown Kdump: loaded Tainted: G --------- -t - 4.18.0-305.el8.x86_64 #1
```

So local-fs.target was reached in this case, and dracut-shutdown was activated.

(In reply to Ray Strode [halfline] from comment #85)
> So the EXIT trap will get run for abnormal termination too.
>
> See for instance this example:
>
> ```
> #!/bin/sh
> set -e
>
> trap 'echo EXITED' EXIT
> trap 'echo ERROR' ERR
> trap 'echo SIGNAL' TERM
> kill -TERM $$
> false
> ```
>
> ```
> ╎❯ ./foo
> SIGNAL
> ERROR
> EXITED
> ```
>
> Also note you only went up to signal 10, but the SIGTERM signal is 15.
>
Heh, thanks for the catch! Not sure why I thought SIGTERM would be 10. Here's another log from when it fails:

[ 48.764334] systemd-shutdown[1]: Syncing filesystems and block devices.
[ 48.866465] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[ 48.884447] **********dracut: received signal 13*****   SIGPIPE
[ 48.886193] **********dracut: received signal 15*****   SIGTERM
[ 48.887708] **********dracut: received signal 18*****   SIGCONT
[ 49.187016] **********dracut: received signal 13*****   SIGPIPE
[ 49.192751] **********dracut: exit normally*********
[ 49.198645] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[ 49.205730] systemd-shutdown[1]: Unmounting file systems.
[ 49.208339] [3160]: Remounting '/boot' read-only in with options 'seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota'.
[ 49.303276] [3161]: Unmounting '/boot'.

> The fact that your script dies after systemd signals SIGTERM to all
> processes seems to point to systemd killing it. The fact that your script
> is even running after systemd sends SIGTERM to all processes seems weird and
> wrong.
>
Yes, this seems to be the problem. With a successful reboot, by the time SIGTERM is sent, dracut-initramfs-restore has already finished. However, during a problem reboot, the script gets SIGTERMed in the middle of its run. If we are unlucky enough, this happens in the middle of the cpio extraction. Actually, depending on when this happens, the reboot could still survive - I have seen logs of a successful reboot with an incomplete extraction - but if we are really unlucky, we finally get the panic.

Now, the question is why the script would run late or take longer than expected. Regarding comment 86: yes, it does get to it, but for whatever reason, it takes longer to get to the initramfs-restore script. One thing that comes out as obvious is the order of this message:

[ 47.077986] sda1: Can't mount, would change RO state

This comes up way late in the game in the case of a panic, compared to a normal run, where it most likely is one of the first messages to be printed.
I am not sure where this message comes from, but it looks like whatever service it is, it is probably getting called late in the case of a failed reboot.

> I mean, look at the order of events in the dracut-shutdown.service man page:
>
> ```
> The following steps are executed during a shutdown:
>
> •   systemd switches to the shutdown.target
>
> •   systemd starts
>     /lib/systemd/system/shutdown.target.wants/dracut-shutdown.service
>
> •   dracut-shutdown.service executes
>     /usr/lib/dracut/dracut-initramfs-restore, which unpacks the initramfs to
>     /run/initramfs
>
> •   systemd finishes shutdown.target
>
> •   systemd kills all processes
> ```
>
> You can see that before systemd kills all processes,
> dracut-initramfs-restore should have already run and finished. Your log
> suggests it's still running when systemd-shutdown kills all processes.
>
> In the logs I talked about in comment 76, I noted that
> dracut-shutdown/dracut-initramfs-restore weren't getting called at all
> during the problem boot. So it seems like ordering constraints aren't
> working.
>
> Normally, the way it works, I think, is that at boot up, dracut-shutdown is
> "started", where ExecStart is "/bin/true" and RemainAfterExit=yes. So at
> boot it's not doing anything but making systemd treat the service as if it
> were active for the duration of the machine's uptime. It's a bookkeeping
> maneuver so that it can get "stopped" at shutdown. Come shutdown time,
> systemd "stops" dracut-shutdown, which makes it run
> ExecStop=/usr/lib/dracut/dracut-initramfs-restore. The service file has
> Conflicts=shutdown.target, so shutdown shouldn't proceed until the ExecStop
> command finishes.
>
> It seems in some cases we're not seeing it get run at all (cf. comment 76),
> and in other cases we're seeing it briefly get run too late, only for it to
> be killed shortly after (cf. comment 84).
>
> dracut-shutdown won't run /bin/true at boot up and become marked active
> until local-fs.target is reached. local-fs.target is reached when all the
> file systems in /etc/fstab are finished mounting. I wonder if the reboot is
> getting triggered early in boot up, before all the mounts have finished?
>
> Anything weird in fstab?
Nothing unusual here, just the default entries:

[root@ibm-p8-kvm-03-guest-02 ~]# cat /etc/fstab
#
# /etc/fstab
# Created by anaconda on Mon Jun 21 23:01:04 2021
#
# Accessible filesystems, by reference, are maintained under '/dev/disk/'.
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info.
#
# After editing this file, run 'systemctl daemon-reload' to update systemd
# units generated from this file.
#
/dev/mapper/rhel_dhcp159--182-root        /      xfs   defaults  0 0
UUID=79f80f93-23a4-4bbf-bc44-15f75ac56c36 /boot  xfs   defaults  0 0
/dev/mapper/rhel_dhcp159--182-swap        none   swap  defaults  0 0

Hi,

> One thing that comes out as obvious is the order of this message-
> [ 47.077986] sda1: Can't mount, would change RO state
This might be related indeed. So, looking at your log, sda1 is /boot.

Obviously, /boot needs to be mounted for dracut to unpack the initramfs from /boot.

I wonder if it's getting unmounted after boot up and that's causing the issue?

I guess, for some reason, it can't mount /boot read-only because the mount needs to write to the file system to proceed. Just guessing (stabbing in the dark, really), but maybe /boot got unmounted, some parameter got tweaked on it (with xfs_admin?), and then some deferred work was scheduled to happen at the next mount time... or something like that (or maybe the journal needs to get replayed because of a hardware blip?).
Can you dump

systemctl status --all
ls /boot

to a file before triggering the power button, so we can see if there's a boot.mount active? (And maybe the /proc/mounts output.)

> This comes up way late in the game in the case of a panic, compared to a
> normal run, where it most likely is one of the first messages to be printed.
> I am not sure where this message comes from, but it looks like whatever
> service it is, it is probably getting called late in the case of a failed
> reboot.
Yeah, I'm not sure either. dracut-initramfs-restore does have a

mount -o ro /boot &>/dev/null || true

at the top of it. But I think we've established it's not getting run at this point, so I'm not sure what's trying to mount /boot so late in the process, or why /boot is unmounted (if it's unmounted).

(I will take a look at comment 88 tomorrow.)

> You can see that before systemd kills all processes,
> dracut-initramfs-restore should have already run and finished. Your log
> suggests it's still running when systemd-shutdown kills all processes.
>
So, I was looking at this comment and I am wondering - does shutdown.target run systemd-shutdown? dracut-shutdown.service has:

[Unit]
Description=Restore /run/initramfs on shutdown
Documentation=man:dracut-shutdown.service(8)
After=local-fs.target boot.mount boot.automount umount.target
Wants=local-fs.target
Conflicts=shutdown.target umount.target
DefaultDependencies=no
ConditionPathExists=!/run/initramfs/bin/sh

Since the service is started at boot, does that mean that shutdown.target can't start running as long as dracut-shutdown hasn't stopped - or, in other words, as long as the dracut-initramfs-restore script hasn't stopped executing?

That's exactly my understanding. It gets started at boot after the file systems are mounted (which sets up the reverse dependencies... file systems won't get unmounted until after it's stopped). It conflicts with shutdown.target, so that target isn't reachable until dracut-shutdown is stopped. dracut-shutdown does its interesting work when it gets stopped.

Actually, maybe it's just missing a Before=shutdown.target ?

From the systemd.target man page:

    Note that this setting does not imply an ordering dependency, similarly
    to the Wants= and Requires= dependencies described above. This means that
    to ensure that the conflicting unit is stopped before the other unit is
    started, an After= or Before= dependency must be declared. It doesn't
    matter which of the two ordering dependencies is used, because stop jobs
    are always ordered before start jobs, see the discussion in Before=/After=
    below.

So it could be that it just needs a Before= or After=...
(I think Before is slightly more logical; I'd be tempted to move umount.target to a Before as well.)

So, from https://www.freedesktop.org/software/systemd/man/systemd.unit.html, Before=shutdown.target means that dracut-shutdown.service is fully started before shutdown.target, and on shutdown, shutdown.target is fully stopped before dracut-shutdown.service? But we want the opposite on shutdown, right? Or is the above saying that in this case it really doesn't matter whether you use Before/After (as long as you use one), and the stop job will always be ordered before the start job (which is what we want)?

I don't think it matters which is picked. I think using Before or After has entirely equivalent and interchangeable semantics when they're applied to a transaction where one of the units is starting and the other is stopping. Namely, they make sure one unit isn't started until the other is stopped.

It would be good to get confirmation from one of the systemd guys, though...

I believe they get differentiated behavior (and earn their names) when both units are starting at the same time or stopping at the same time.

You could imagine systemd having a third thing (maybe called SerializedWith or DisjointFrom or BarrierTo) that just does the intersection of functionality between Before and After without the confusion implied by the names, but that doesn't exist, I think.

Anyway, can you try adding the Before or After and see if the problem disappears?

(In reply to Ray Strode [halfline] from comment #93)
> I don't think it matters which is picked. I think using Before or After has
> entirely equivalent and interchangeable semantics when they're applied to a
> transaction where one of the units is starting and the other is stopping.
> Namely, they make sure one unit isn't started until the other is stopped.
>
> It would be good to get confirmation from one of the systemd guys, though...
>
> I believe they get differentiated behavior (and earn their names) when both
> units are starting at the same time or stopping at the same time.
>
> You could imagine systemd having a third thing (maybe called SerializedWith
> or DisjointFrom or BarrierTo) that just does the intersection of
> functionality between Before and After without the confusion implied by the
> names, but that doesn't exist, I think.
>
> Anyway, can you try adding the Before or After and see if the problem
> disappears?

Yes, starting off a run with the following change:

--- dracut-shutdown.service.orig	2021-06-24 01:50:31.895406155 +0800
+++ dracut-shutdown.service	2021-06-24 01:51:40.068256139 +0800
@@ -5,6 +5,7 @@
 [Unit]
 Description=Restore /run/initramfs on shutdown
 Documentation=man:dracut-shutdown.service(8)
+Before=shutdown.target umount.target
 After=local-fs.target boot.mount boot.automount
 Wants=local-fs.target
 Conflicts=shutdown.target umount.target
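The same ordering change can also be tested without editing the packaged unit file, via a drop-in (a sketch using systemd's standard drop-in mechanism; this keeps the change from being overwritten by the next dracut package update):

mkdir -p /etc/systemd/system/dracut-shutdown.service.d
cat << 'EOF' > /etc/systemd/system/dracut-shutdown.service.d/ordering.conf
[Unit]
Before=shutdown.target umount.target
EOF
systemctl daemon-reload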
(In reply to Ray Strode [halfline] from comment #88)
> Hi,
>
> > One thing that comes out as obvious is the order of this message-
> > [   47.077986] sda1: Can't mount, would change RO state
>
> this might be related indeed. So looking at your log, sda1 is /boot.
>
> Obviously /boot needs to be mounted for dracut to unpack the initramfs
> from /boot.
>
> I wonder if it's getting unmounted after boot up and that's causing the
> issue?
>
> I guess, for some reason, it can't mount /boot read-only because the
> mount needs to write to the file system to proceed.

This message pops out for filesystems using the new mount API if a mount process (I think?) somehow(?) is passing a new mount context to an existing superblock, and one but not the other is read-only:

    if (s->s_root) {
            /* Don't summarily change the RO/RW state. */
            if ((fc->sb_flags ^ s->s_flags) & SB_RDONLY) {
                    warnf(fc, "%pg: Can't mount, would change RO state", bdev);
                    deactivate_locked_super(s);
                    blkdev_put(bdev, mode);
                    return -EBUSY;
            }

It's not immediately clear to me what might cause this, or whether there's a way to get a panic or backtrace at this point, in case it would lend any clues. (Sorry, I haven't really followed this bug as it meandered off into userspace, but thought I'd at least chime in with what I think that message means.)

Is anything making an "fsconfig" system call? I guess there are a few paths that can get to ->get_tree(), which ultimately leads to this message.
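For what it's worth, an fsconfig() call shouldn't be strictly required to reach that check: as I understand it, on kernels where the filesystem has been converted to the new mount API, a plain mount(2) call is routed through an fs_context internally as well, so the warning can probably be provoked from the shell. A sketch, assuming /dev/sda1 is an XFS filesystem currently mounted read-write (as /boot is in the logs here):

# Ask for a read-only superblock on a device whose existing superblock
# is read-write; the RO/RW-mismatch check should refuse with EBUSY.
mkdir -p /tmp/ro-test
mount -o ro /dev/sda1 /tmp/ro-test   # expected to fail: Device or resource busy
dmesg | tail -n 3                    # expect: "sda1: Can't mount, would change RO state"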
(In reply to Bandan Das from comment #94)
> Yes, starting off a run with the following change:
>
> --- dracut-shutdown.service.orig 2021-06-24 01:50:31.895406155 +0800
> +++ dracut-shutdown.service 2021-06-24 01:51:40.068256139 +0800
> @@ -5,6 +5,7 @@
>  [Unit]
>  Description=Restore /run/initramfs on shutdown
>  Documentation=man:dracut-shutdown.service(8)
> +Before=shutdown.target umount.target
>  After=local-fs.target boot.mount boot.automount
>  Wants=local-fs.target
>  Conflicts=shutdown.target umount.target

Since this is very difficult to reproduce and sometimes disappears for days at a stretch, I adopted the following procedure:

1. First, I ran 10 iterations of the avocado reboot test without the change to dracut-shutdown.service. One iteration executes reboot 50 times. I noted the number of failures. Number of failures = 7.
2. Second, I added the above change and repeated the process. Number of failures = 0.
3. Third, I reverted the change and repeated 10 iterations. Number of failures = 6.

So, I think I can say with reasonable confidence that adding this change takes care of the issue. I added the change back and will let it run over the weekend.

Do you think this change is an acceptable fix for this issue? If yes, I am not sure how to submit a fix; who should I reassign this bug to? dracut-shutdown.service seems to belong to dracut-modules.

Yeah, I think the best course of action is to reassign to dracut.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

(In reply to Ray Strode [halfline] from comment #91)
> Actually, maybe it's just missing a Before=shutdown.target?

Yes, it is. Though the right fix here is not to add more directives to the unit file but to remove some: dracut-shutdown.service is clearly not an early-boot service, therefore it shouldn't use DefaultDependencies=no. Then the proper dependencies on shutdown.target would be added automatically.

*** Bug 2038001 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (dracut bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7725