Bug 2126076
| Field | Value |
|---|---|
| Summary | After a cluster node restores paths from a previous disconnection, restarting pacemaker causes the host to kernel panic |
| Product | Red Hat Enterprise Linux 9 |
| Reporter | wilson.hua |
| Component | kernel |
| Kernel sub component | GFS-GFS2 |
| Assignee | Robert Peterson <rpeterso> |
| QA Contact | cluster-qe <cluster-qe> |
| Status | CLOSED DUPLICATE |
| Severity | unspecified |
| Priority | unspecified |
| CC | cluster-maint, david.barnhill, gfs2-maint, kazen, kgaillot, linux-bugs, nwahl, rpeterso, wayne.berthiaume |
| Version | 9.1 |
| Flags | pm-rhel: mirror+ |
| Target Milestone | rc |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| Story Points | --- |
| Last Closed | 2022-10-10 15:20:18 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
Description

wilson.hua 2022-09-12 11:05:27 UTC

---

Hi, can you attach the result of `pcs cluster report` covering the time before the crash? Most likely this will need to be reassigned to the kernel component, but I can check whether there were any hints of trouble beforehand.

---

Created attachment 1913251 [details]
pcs report containing logs from when the crash happened
Have attached the pcs report:

```
[root@e2e-l4-236128 ~]# pcs cluster report --from "2022-09-06 04:00:00" --to "2022-09-06 05:00:00" /crash_report.log
e2e-l4-236128: Calculated node list: e2e-l4-236128 e2e-l4-236129
e2e-l4-236128: Collecting data from e2e-l4-236128 e2e-l4-236129 (09/06/22 04:00:00 to 09/06/22 05:00:00)
e2e-l4-236128: Searching for where Pacemaker daemons live... this may take a while
e2e-l4-236128: Pacemaker daemons found in: /usr/libexec/pacemaker
e2e-l4-236128: Pacemaker config files found in: /var/lib/pacemaker/cib
e2e-l4-236128: Pacemaker scheduler inputs found in: /var/lib/pacemaker/pengine
e2e-l4-236128: WARN: Skipping bad logfile '/var/log/cluster/corosync.log': Could not determine log dates
e2e-l4-236128: Including segment [144-232] from /var/log/cluster/corosync.log-20220907.gz
e2e-l4-236128: Including segment [116807-129428] from /var/log/messages-20220911
e2e-l4-236128: Including segment [8841-9921] from /var/log/pacemaker/pacemaker.log-20220911.gz
e2e-l4-236129: Searching for where Pacemaker daemons live... this may take a while
e2e-l4-236129: Pacemaker daemons found in: /usr/libexec/pacemaker
e2e-l4-236129: Pacemaker config files found in: /var/lib/pacemaker/cib
e2e-l4-236129: Pacemaker scheduler inputs found in: /var/lib/pacemaker/pengine
e2e-l4-236129: WARN: Skipping bad logfile '/var/log/cluster/corosync.log': Could not determine log dates
e2e-l4-236129: Including segment [131-166] from /var/log/cluster/corosync.log-20220907.gz
e2e-l4-236129: Including segment [76161-77673] from /var/log/messages-20220911
e2e-l4-236129: Including segment [2597-3795] from /var/log/pacemaker/pacemaker.log-20220911.gz
e2e-l4-236128:
e2e-l4-236128: Collected results are available in /crash_report.log.tar.bz2
e2e-l4-236128:
e2e-l4-236128: Please create a bug entry at
e2e-l4-236128: https://bugzilla.redhat.com/
e2e-l4-236128: Include a description of your problem and attach this tarball
e2e-l4-236128:
e2e-l4-236128: Thank you for taking time to create this report.
```

---

Can you confirm that the crash occurred on node e2e-l4-236128 around 04:31 on Sep 6 in the cluster report? Can you upload the vmcore-dmesg.txt from that crash?

You mentioned that the issue occurs when restarting pacemaker. Did node e2e-l4-236128 crash while stopping pacemaker or while starting it? Node e2e-l4-236128's logs don't contain anything from a pacemaker restart; those logs probably didn't get flushed to disk before the reboot.
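The vmcore-dmesg.txt asked for above is normally written by kdump next to the vmcore. A minimal sketch for locating or extracting it; the `/var/crash` layout and the `makedumpfile` fallback are assumptions based on a default RHEL kdump configuration:

```shell
# Hedged sketch: walk the kdump crash directories and make sure a readable
# kernel log (vmcore-dmesg.txt) exists for each one. Paths assume the
# default RHEL kdump layout; adjust CRASH_BASE for a custom configuration.
CRASH_BASE=/var/crash
for d in "$CRASH_BASE"/*/; do
    [ -d "$d" ] || continue                  # glob did not match anything
    if [ -f "${d}vmcore-dmesg.txt" ]; then
        echo "dmesg already extracted: ${d}vmcore-dmesg.txt"
    elif [ -f "${d}vmcore" ]; then
        # Extract the kernel ring buffer from the raw dump.
        makedumpfile --dump-dmesg "${d}vmcore" "${d}vmcore-dmesg.txt"
    fi
done
echo "scan complete"
```

The vmcore-dmesg.txt for the 04:31 crash would be the one to attach here.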
Node e2e-l4-236128 shows a gfs2 withdrawal at 04:28:30, after the storage issues began:

```
Sep 6 04:28:30 e2e-l4-236128 kernel: I/O error, dev dm-16, sector 269168 op 0x0:(READ) flags 0x3000 phys_seg 2 prio class 0
Sep 6 04:28:30 e2e-l4-236128 kernel: I/O error, dev dm-16, sector 49024 op 0x1:(WRITE) flags 0x1800 phys_seg 1 prio class 0
Sep 6 04:28:30 e2e-l4-236128 kernel: gfs2: fsid=RHCS:gfs2.0: Error 10 writing to journal, jid=0
Sep 6 04:28:30 e2e-l4-236128 kernel: I/O error, dev dm-16, sector 269120 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
Sep 6 04:28:30 e2e-l4-236128 kernel: gfs2: fsid=RHCS:gfs2.0: about to withdraw this file system
Sep 6 04:28:30 e2e-l4-236128 kernel: gfs2: fsid=RHCS:gfs2.0: dirty_inode: glock -5
Sep 6 04:28:30 e2e-l4-236128 kernel: gfs2: fsid=RHCS:gfs2.0: G: s:EX n:2/826e f:lDpIon t:UN d:UN/36450000 a:0 v:0 r:6 m:60 p:0
Sep 6 04:28:30 e2e-l4-236128 kernel: gfs2: fsid=RHCS:gfs2.0: I: n:1/33390 t:8 f:0x00 d:0x00000000 s:3391094784 p:0
Sep 6 04:28:30 e2e-l4-236128 kernel: SELinux: inode_doinit_use_xattr: getxattr returned 5 for dev=dm-71 ino=33390
Sep 6 04:28:30 e2e-l4-236128 kernel: SELinux: inode_doinit_use_xattr: getxattr returned 5 for dev=dm-71 ino=33390
Sep 6 04:28:30 e2e-l4-236128 kernel: SELinux: inode_doinit_use_xattr: getxattr returned 5 for dev=dm-71 ino=33390
Sep 6 04:28:30 e2e-l4-236128 kernel: gfs2: fsid=RHCS:gfs2.0: The logd daemon is withdrawing.
...
Sep 6 04:28:35 e2e-l4-236128 kernel: gfs2: fsid=RHCS:gfs2.0: Requesting recovery of jid 0.
Sep 6 04:28:35 e2e-l4-236128 kernel: I/O error, dev dm-16, sector 2200 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
Sep 6 04:28:35 e2e-l4-236128 kernel: gfs2: fsid=RHCS:gfs2.0: Reprocessing of jid 0 failed with -5.
Sep 6 04:28:35 e2e-l4-236128 kernel: gfs2: fsid=RHCS:gfs2.0: Journal recovery skipped for jid 0 until next mount.
Sep 6 04:28:35 e2e-l4-236128 kernel: gfs2: fsid=RHCS:gfs2.0: Glock dequeues delayed: 0
Sep 6 04:28:35 e2e-l4-236128 kernel: gfs2: fsid=RHCS:gfs2.0: telling LM to unmount
Sep 6 04:28:35 e2e-l4-236128 kernel: dlm: gfs2: leaving the lockspace group...
Sep 6 04:28:35 e2e-l4-236128 kernel: gfs2: fsid=RHCS:gfs2.0: recover_prep ignored due to withdraw.
Sep 6 04:28:35 e2e-l4-236128 kernel: dlm: gfs2: group event done 0
Sep 6 04:28:35 e2e-l4-236128 kernel: dlm: gfs2: release_lockspace final free
Sep 6 04:28:35 e2e-l4-236128 kernel: gfs2: fsid=RHCS:gfs2.0: File system withdrawn
Sep 6 04:28:35 e2e-l4-236128 kernel: CPU: 7 PID: 6662 Comm: gfs2_logd Kdump: loaded Tainted: G I --------- ---  5.14.0-108.el9.x86_64 #1
Sep 6 04:28:35 e2e-l4-236128 kernel: Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.12.2 07/09/2021
Sep 6 04:28:35 e2e-l4-236128 kernel: Call Trace:
Sep 6 04:28:35 e2e-l4-236128 kernel: dump_stack_lvl+0x34/0x44
Sep 6 04:28:35 e2e-l4-236128 kernel: gfs2_withdraw.cold+0xb5/0xde [gfs2]
Sep 6 04:28:35 e2e-l4-236128 kernel: gfs2_log_flush+0x156/0xb20 [gfs2]
Sep 6 04:28:35 e2e-l4-236128 kernel: ? gfs2_ail1_empty+0x253/0x310 [gfs2]
Sep 6 04:28:35 e2e-l4-236128 kernel: ? schedule_timeout+0x90/0x150
Sep 6 04:28:35 e2e-l4-236128 kernel: gfs2_logd+0xec/0x2c0 [gfs2]
Sep 6 04:28:35 e2e-l4-236128 kernel: ? dequeue_task_stop+0x70/0x70
Sep 6 04:28:35 e2e-l4-236128 kernel: ? gfs2_log_commit+0x1a0/0x1a0 [gfs2]
Sep 6 04:28:35 e2e-l4-236128 kernel: kthread+0x149/0x170
Sep 6 04:28:35 e2e-l4-236128 kernel: ? set_kthread_struct+0x40/0x40
Sep 6 04:28:35 e2e-l4-236128 kernel: ret_from_fork+0x22/0x30
```

There are some SELinux denials for dlm_controld shortly thereafter:

```
Sep 6 04:28:38 e2e-l4-236128 setroubleshoot[23135]: SELinux is preventing /usr/sbin/dlm_controld from using the signull access on a process.
For complete SELinux messages run: sealert -l 13962ff8-d126-4e5e-bff2-623a111ca44e
```

Node e2e-l4-236129's logs suggest that someone tried to stop pacemaker on node e2e-l4-236128 around 04:31:17, but node e2e-l4-236128 failed (possibly while stopping the cluster_fs resource):

```
Sep 06 04:31:17.126 e2e-l4-236129 pacemaker-controld [3741] (handle_shutdown_request) info: Creating shutdown request for e2e-l4-236128 (state=S_IDLE)
...
Sep 06 04:31:17.149 e2e-l4-236129 pacemaker-controld [3741] (te_rsc_command) notice: Initiating stop operation cluster_fs_stop_0 on e2e-l4-236128 | action 29
...
Sep 06 04:31:23.761 e2e-l4-236129 pacemaker-controld [3741] (node_left) info: Group crmd event 6: e2e-l4-236128 (node 1 pid 4288) left via cluster exit
Sep 06 04:31:23.762 e2e-l4-236129 pacemaker-controld [3741] (crm_update_peer_proc) info: node_left: Node e2e-l4-236128[1] - corosync-cpg is now offline
Sep 06 04:31:23.762 e2e-l4-236129 pacemaker-controld [3741] (peer_update_callback) info: Node e2e-l4-236128 is no longer a peer | DC=true old=0x4000000 new=0000000
```

At this point, node e2e-l4-236129 fenced node e2e-l4-236128, but node e2e-l4-236128 was likely already panicking.

-----

In addition to the questions at the beginning of this comment, can you see if the issue is reproducible with SELinux in permissive mode? Please run `setenforce 0` before reproducing.

I'm going to assign this BZ to the gfs2 team. We can continue to look at the Pacemaker side as needed.

---

Created attachment 1913272 [details]
Crash dump from Sep 6 04:08

---

Created attachment 1913274 [details]
Crash dump from Sep 6 04:31
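The permissive-mode check requested earlier (`setenforce 0` before reproducing) can be wrapped so the prior mode is restored afterwards. A minimal sketch, assuming root and the standard SELinux utilities; the reproducer step itself is left as a comment:

```shell
# Hedged sketch: put SELinux into permissive mode for one reproduction run,
# then restore the previous mode. Requires root; getenforce/setenforce come
# from the standard SELinux userspace tools.
if command -v getenforce >/dev/null 2>&1; then
    prev=$(getenforce)
    echo "previous SELinux mode: $prev"
    if [ "$prev" = "Enforcing" ]; then
        setenforce 0    # permissive: denials are logged but not enforced
    fi
    # Reproduce the issue here, e.g.:
    #   systemctl restart pacemaker
    if [ "$prev" = "Enforcing" ]; then
        setenforce 1    # restore enforcing mode
    fi
else
    echo "SELinux tools not available"
fi
```

If the panic still occurs in permissive mode, the SELinux denials seen around 04:28 can be ruled out as a cause.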
The system panicked right after I ran `systemctl restart pacemaker`, so I'm not sure whether the panic happened during the pacemaker stop or the start step. I have also uploaded vmcore-dmesg.txt; please check.

---

I already found and fixed this bug upstream. The upstream patch is 04133b607a78f2fd3daadbe5519513942b0f3a05; for now:

https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?h=for-next&id=04133b607a78f2fd3daadbe5519513942b0f3a05

Andreas submitted a pull request for it yesterday, so hopefully Linus will put it into mainline soon. I've already submitted a merge request for CentOS Stream that includes this patch:

https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1325

This was done for bug #2102770, so I'm closing this bug as a duplicate of that one.

*** This bug has been marked as a duplicate of bug 2102770 ***
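To check whether an installed RHEL kernel already carries the backported fix, the package changelog can be searched for the bug number. A hedged sketch; whether the changelog entry actually cites bug 2102770 is an assumption:

```shell
# Hedged sketch: look for the fix tracked in bug 2102770 in the installed
# kernel's RPM changelog. Assumes the backport's changelog entry mentions
# the bug number, which is the usual RHEL convention but not guaranteed.
rpm -q --changelog kernel 2>/dev/null | grep -i "2102770" \
    || echo "no matching changelog entry found"
```

An empty match does not prove the fix is absent; comparing the running kernel version against the fixed-in version on bug 2102770 is more reliable.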