Bug 1452894
| Summary: | Clustered LVM hangs hard after node is fenced | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Madison Kelly <mkelly> |
| Component: | lvm2 | Assignee: | Alasdair Kergon <agk> |
| Status: | CLOSED UPSTREAM | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.9 | CC: | agk, ccaulfie, cluster-maint, heinzm, jbrassow, mcsontos, msnitzer, prajnoha, rpeterso, teigland, thornber, zkabelac |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-09-29 21:03:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Madison Kelly
2017-05-19 22:31:26 UTC
Forgot to mention, rgmanager is still responsive;

====
[root@an-a01n01 ~]# clustat
Cluster Status for an-anvil-01 @ Fri May 19 18:31:48 2017
Member Status: Quorate

 Member Name                  ID   Status
 ------ ----                  ---- ------
 an-a01n01.alteeve.com        1    Online, Local, rgmanager
 an-a01n02.alteeve.com        2    Offline

 Service Name                 Owner (Last)                  State
 ------- ----                 ----- ------                  -----
 service:libvirtd_n01         an-a01n01.alteeve.com         started
 service:libvirtd_n02         (an-a01n02.alteeve.com)       stopped
 service:storage_n01          an-a01n01.alteeve.com         started
 service:storage_n02          (an-a01n02.alteeve.com)       stopped
 vm:srv01-rhel7               an-a01n01.alteeve.com         started
 vm:srv02-win2012             an-a01n01.alteeve.com         started
 vm:srv03-sles12              an-a01n01.alteeve.com         started
 vm:srv04-freebsd11           an-a01n01.alteeve.com         started
 vm:srv05-win2016             an-a01n01.alteeve.com         started
 vm:srv06-centos4             an-a01n01.alteeve.com         started
====

It's just clvmd that is hung. I ran 'pvscan' again and then rebooted node 2. It stayed hung until node #2 started cman and clvmd; at that point clvmd recovered and pvscan completed.

====
[root@an-a01n01 ~]# pvscan
  PV /dev/drbd0   VG an-a01n01_vg0   lvm2 [636.57 GiB / 236.57 GiB free]
  Total: 1 [636.57 GiB] / in use: 1 [636.57 GiB] / in no VG: 0 [0   ]
====

I just had this happen again on a different cluster. This time, however, the node wasn't fenced; it was simply withdrawn from the cluster and then restarted. Node 1 was withdrawn from the cluster with '/etc/init.d/rgmanager stop && /etc/init.d/cman stop', then a few minutes later rejoined with '/etc/init.d/cman start' followed by '/etc/init.d/clvmd start'. At this point, 'pvscan' locked up again.

====
May 21 21:26:40 an-a03n01 rgmanager[15368]: Member 2 shutting down
May 21 21:26:40 an-a03n01 rgmanager[15368]: Marking service:storage_n02 as stopped: Restricted domain unavailable
May 21 21:26:41 an-a03n01 rgmanager[15368]: Marking service:libvirtd_n02 as stopped: Restricted domain unavailable
May 21 21:26:58 an-a03n01 corosync[11331]: [QUORUM] Members[1]: 1
May 21 21:26:58 an-a03n01 corosync[11331]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 21 21:26:58 an-a03n01 corosync[11331]: [CPG ] chosen downlist: sender r(0) ip(10.20.30.1) ; members(old:2 left:1)
May 21 21:26:58 an-a03n01 corosync[11331]: [MAIN ] Completed service synchronization, ready to provide service.
May 21 21:26:58 an-a03n01 kernel: dlm: closing connection to node 2
May 21 21:28:18 an-a03n01 corosync[11331]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 21 21:28:18 an-a03n01 corosync[11331]: [QUORUM] Members[2]: 1 2
May 21 21:28:18 an-a03n01 corosync[11331]: [QUORUM] Members[2]: 1 2
May 21 21:28:18 an-a03n01 corosync[11331]: [CPG ] chosen downlist: sender r(0) ip(10.20.30.1) ; members(old:1 left:0)
May 21 21:28:18 an-a03n01 corosync[11331]: [MAIN ] Completed service synchronization, ready to provide service.
May 21 21:28:26 an-a03n01 kernel: dlm: got connection from 2
May 21 21:28:38 an-a03n01 kernel: drbd r0: Handshake successful: Agreed network protocol version 101
May 21 21:28:38 an-a03n01 kernel: drbd r0: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
May 21 21:28:38 an-a03n01 kernel: drbd r0: conn( WFConnection -> WFReportParams ) May 21 21:28:38 an-a03n01 kernel: drbd r0: Starting ack_recv thread (from drbd_r_r0 [16726]) May 21 21:28:38 an-a03n01 kernel: block drbd0: drbd_sync_handshake: May 21 21:28:38 an-a03n01 kernel: block drbd0: self 421EA68818FE0737:B470D4A67EBFB993:4FE1E2CEE9AE2C3C:4FE0E2CEE9AE2C3C bits:0 flags:0 May 21 21:28:38 an-a03n01 kernel: block drbd0: peer B470D4A67EBFB992:0000000000000000:4FE1E2CEE9AE2C3D:4FE0E2CEE9AE2C3C bits:0 flags:0 May 21 21:28:38 an-a03n01 kernel: block drbd0: uuid_compare()=1 by rule 70 May 21 21:28:38 an-a03n01 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent ) May 21 21:28:38 an-a03n01 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0% May 21 21:28:38 an-a03n01 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0% May 21 21:28:38 an-a03n01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 May 21 21:28:38 an-a03n01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0) May 21 21:28:38 an-a03n01 kernel: block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent ) May 21 21:28:38 an-a03n01 kernel: block drbd0: Began resync as SyncSource (will sync 0 KB [0 bits set]). May 21 21:28:38 an-a03n01 kernel: block drbd0: updated sync UUID 421EA68818FE0737:B471D4A67EBFB993:B470D4A67EBFB993:4FE1E2CEE9AE2C3C May 21 21:28:38 an-a03n01 kernel: block drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec) May 21 21:28:38 an-a03n01 kernel: block drbd0: updated UUIDs 421EA68818FE0737:0000000000000000:B471D4A67EBFB993:B470D4A67EBFB993 May 21 21:28:38 an-a03n01 kernel: block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) May 21 21:28:38 an-a03n01 kernel: drbd r0: helper command: /sbin/drbdadm unfence-peer r0 May 21 21:28:38 an-a03n01 kernel: drbd r0: helper command: /sbin/drbdadm unfence-peer r0 exit code 0 (0x0) May 21 21:28:38 an-a03n01 kernel: block drbd0: peer( Secondary -> Primary ) May 21 21:28:52 an-a03n01 rgmanager[15368]: Shutting down May 21 21:28:53 an-a03n01 rgmanager[15368]: Stopping service service:storage_n01 May 21 21:28:53 an-a03n01 rgmanager[15368]: Stopping service service:libvirtd_n01 May 21 21:28:53 an-a03n01 rgmanager[15368]: Service service:libvirtd_n01 is stopped May 21 21:28:54 an-a03n01 kernel: block drbd0: role( Primary -> Secondary ) May 21 21:28:54 an-a03n01 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map. May 21 21:28:54 an-a03n01 kernel: drbd r0: Requested state change failed by peer: Refusing to be Primary while peer is not outdated (-7) May 21 21:28:54 an-a03n01 kernel: drbd r0: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) disk( UpToDate -> Outdated ) pdsk( UpToDate -> DUnknown ) May 21 21:28:54 an-a03n01 kernel: drbd r0: ack_receiver terminated May 21 21:28:54 an-a03n01 kernel: drbd r0: Terminating drbd_a_r0 May 21 21:28:54 an-a03n01 kernel: drbd r0: Connection closed May 21 21:28:54 an-a03n01 kernel: drbd r0: conn( Disconnecting -> StandAlone ) May 21 21:28:54 an-a03n01 kernel: drbd r0: receiver terminated May 21 21:28:54 an-a03n01 kernel: drbd r0: Terminating drbd_r_r0 May 21 21:28:54 an-a03n01 kernel: block drbd0: disk( Outdated -> Failed ) May 21 21:28:54 an-a03n01 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map. 
May 21 21:28:54 an-a03n01 kernel: block drbd0: disk( Failed -> Diskless ) May 21 21:28:54 an-a03n01 kernel: drbd r0: Terminating drbd_w_r0 May 21 21:28:54 an-a03n01 kernel: drbd: module cleanup done. May 21 21:28:54 an-a03n01 rgmanager[15368]: Service service:storage_n01 is stopped May 21 21:28:54 an-a03n01 rgmanager[15368]: Shutting down May 21 21:28:54 an-a03n01 rgmanager[15368]: Disconnecting from CMAN May 21 21:28:54 an-a03n01 rgmanager[15368]: Exiting May 21 21:28:56 an-a03n01 kernel: dlm: closing connection to node 2 May 21 21:28:56 an-a03n01 kernel: dlm: closing connection to node 1 May 21 21:28:57 an-a03n01 corosync[11331]: [SERV ] Unloading all Corosync service engines. May 21 21:28:57 an-a03n01 corosync[11331]: [SERV ] Service engine unloaded: corosync extended virtual synchrony service May 21 21:28:57 an-a03n01 corosync[11331]: [SERV ] Service engine unloaded: corosync configuration service May 21 21:28:57 an-a03n01 corosync[11331]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01 May 21 21:28:57 an-a03n01 corosync[11331]: [SERV ] Service engine unloaded: corosync cluster config database access v1.01 May 21 21:28:57 an-a03n01 corosync[11331]: [SERV ] Service engine unloaded: corosync profile loading service May 21 21:28:57 an-a03n01 corosync[11331]: [SERV ] Service engine unloaded: openais checkpoint service B.01.01 May 21 21:28:57 an-a03n01 corosync[11331]: [SERV ] Service engine unloaded: corosync CMAN membership service 2.90 May 21 21:28:57 an-a03n01 corosync[11331]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1 May 21 21:28:57 an-a03n01 corosync[11331]: [MAIN ] Corosync Cluster Engine exiting with status 0 at main.c:2055. May 21 21:29:47 an-a03n01 corosync[22810]: [MAIN ] Corosync Cluster Engine ('1.4.7'): started and ready to provide service. May 21 21:29:47 an-a03n01 corosync[22810]: [MAIN ] Corosync built-in features: nss dbus rdma snmp May 21 21:29:47 an-a03n01 corosync[22810]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf May 21 21:29:47 an-a03n01 corosync[22810]: [MAIN ] Successfully parsed cman config May 21 21:29:47 an-a03n01 corosync[22810]: [TOTEM ] Initializing transport (UDP/IP Unicast). May 21 21:29:47 an-a03n01 corosync[22810]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0). May 21 21:29:47 an-a03n01 corosync[22810]: [TOTEM ] The network interface [10.20.30.1] is now up. 
May 21 21:29:47 an-a03n01 corosync[22810]: [QUORUM] Using quorum provider quorum_cman May 21 21:29:47 an-a03n01 corosync[22810]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 May 21 21:29:47 an-a03n01 corosync[22810]: [CMAN ] CMAN 3.0.12.1 (built Mar 24 2017 12:40:51) started May 21 21:29:47 an-a03n01 corosync[22810]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90 May 21 21:29:47 an-a03n01 corosync[22810]: [SERV ] Service engine loaded: openais checkpoint service B.01.01 May 21 21:29:47 an-a03n01 corosync[22810]: [SERV ] Service engine loaded: corosync extended virtual synchrony service May 21 21:29:47 an-a03n01 corosync[22810]: [SERV ] Service engine loaded: corosync configuration service May 21 21:29:47 an-a03n01 corosync[22810]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 May 21 21:29:47 an-a03n01 corosync[22810]: [SERV ] Service engine loaded: corosync cluster config database access v1.01 May 21 21:29:47 an-a03n01 corosync[22810]: [SERV ] Service engine loaded: corosync profile loading service May 21 21:29:47 an-a03n01 corosync[22810]: [QUORUM] Using quorum provider quorum_cman May 21 21:29:47 an-a03n01 corosync[22810]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 May 21 21:29:47 an-a03n01 corosync[22810]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine. May 21 21:29:47 an-a03n01 corosync[22810]: [TOTEM ] adding new UDPU member {10.20.30.1} May 21 21:29:47 an-a03n01 corosync[22810]: [TOTEM ] adding new UDPU member {10.20.30.2} May 21 21:29:47 an-a03n01 corosync[22810]: [TOTEM ] A processor joined or left the membership and a new membership was formed. May 21 21:29:47 an-a03n01 corosync[22810]: [CMAN ] quorum regained, resuming activity May 21 21:29:47 an-a03n01 corosync[22810]: [QUORUM] This node is within the primary component and will provide service. May 21 21:29:47 an-a03n01 corosync[22810]: [QUORUM] Members[1]: 1 May 21 21:29:47 an-a03n01 corosync[22810]: [QUORUM] Members[1]: 1 May 21 21:29:47 an-a03n01 corosync[22810]: [CPG ] chosen downlist: sender r(0) ip(10.20.30.1) ; members(old:0 left:0) May 21 21:29:47 an-a03n01 corosync[22810]: [MAIN ] Completed service synchronization, ready to provide service. May 21 21:29:47 an-a03n01 corosync[22810]: [TOTEM ] A processor joined or left the membership and a new membership was formed. May 21 21:29:47 an-a03n01 corosync[22810]: [QUORUM] Members[2]: 1 2 May 21 21:29:47 an-a03n01 corosync[22810]: [QUORUM] Members[2]: 1 2 May 21 21:29:47 an-a03n01 corosync[22810]: [CPG ] chosen downlist: sender r(0) ip(10.20.30.1) ; members(old:1 left:0) May 21 21:29:47 an-a03n01 corosync[22810]: [MAIN ] Completed service synchronization, ready to provide service. 
May 21 21:29:51 an-a03n01 fenced[22923]: fenced 3.0.12.1 started May 21 21:29:51 an-a03n01 dlm_controld[22946]: dlm_controld 3.0.12.1 started May 21 21:29:52 an-a03n01 gfs_controld[23021]: gfs_controld 3.0.12.1 started May 21 21:29:55 an-a03n01 kernel: dlm: Using TCP for communications May 21 21:29:55 an-a03n01 kernel: dlm: connecting to 2 May 21 21:29:56 an-a03n01 clvmd: Cluster LVM daemon started - connected to CMAN ==== From node 2's perspective; ==== May 21 21:28:46 an-a03n02 rgmanager[17840]: I am node #2 May 21 21:28:46 an-a03n02 rgmanager[17840]: Resource Group Manager Starting May 21 21:28:48 an-a03n02 rgmanager[17840]: Starting stopped service service:storage_n02 May 21 21:28:49 an-a03n02 rgmanager[17840]: Starting stopped service service:libvirtd_n02 May 21 21:28:49 an-a03n02 rgmanager[17840]: Service service:libvirtd_n02 started May 21 21:28:49 an-a03n02 rgmanager[17840]: Service service:storage_n02 started May 21 21:28:49 an-a03n02 kernel: lo: Disabled Privacy Extensions May 21 21:28:54 an-a03n02 kernel: block drbd0: peer( Primary -> Secondary ) May 21 21:28:54 an-a03n02 kernel: block drbd0: State change failed: Refusing to be Primary while peer is not outdated May 21 21:28:54 an-a03n02 kernel: block drbd0: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- } May 21 21:28:54 an-a03n02 kernel: block drbd0: wanted = { cs:TearDown ro:Primary/Unknown ds:UpToDate/DUnknown s---F- } May 21 21:28:54 an-a03n02 kernel: drbd r0: State change failed: Refusing to be Primary while peer is not outdated May 21 21:28:54 an-a03n02 kernel: drbd r0: mask = 0x1f0 val = 0x70 May 21 21:28:54 an-a03n02 kernel: drbd r0: old_conn:WFReportParams wanted_conn:TearDown May 21 21:28:54 an-a03n02 kernel: drbd r0: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> Outdated ) May 21 21:28:54 an-a03n02 kernel: drbd r0: ack_receiver terminated May 21 21:28:54 an-a03n02 kernel: drbd r0: Terminating drbd_a_r0 May 21 21:28:54 an-a03n02 kernel: block drbd0: new current UUID AB97F17A92B4AA55:421EA68818FE0737:B471D4A67EBFB992:B470D4A67EBFB993 May 21 21:28:54 an-a03n02 kernel: drbd r0: Connection closed May 21 21:28:54 an-a03n02 kernel: drbd r0: conn( TearDown -> Unconnected ) May 21 21:28:54 an-a03n02 kernel: drbd r0: receiver terminated May 21 21:28:54 an-a03n02 kernel: drbd r0: Restarting receiver thread May 21 21:28:54 an-a03n02 kernel: drbd r0: receiver (re)started May 21 21:28:54 an-a03n02 kernel: drbd r0: conn( Unconnected -> WFConnection ) May 21 21:28:54 an-a03n02 rgmanager[17840]: Member 1 shutting down May 21 21:28:54 an-a03n02 rgmanager[17840]: Marking service:storage_n01 as stopped: Restricted domain unavailable May 21 21:28:57 an-a03n02 corosync[16913]: [QUORUM] Members[1]: 2 May 21 21:28:57 an-a03n02 corosync[16913]: [TOTEM ] A processor joined or left the membership and a new membership was formed. May 21 21:28:57 an-a03n02 corosync[16913]: [CPG ] chosen downlist: sender r(0) ip(10.20.30.2) ; members(old:2 left:1) May 21 21:28:57 an-a03n02 corosync[16913]: [MAIN ] Completed service synchronization, ready to provide service. May 21 21:28:57 an-a03n02 kernel: dlm: closing connection to node 1 May 21 21:29:47 an-a03n02 corosync[16913]: [TOTEM ] A processor joined or left the membership and a new membership was formed. 
May 21 21:29:47 an-a03n02 corosync[16913]: [QUORUM] Members[2]: 1 2
May 21 21:29:47 an-a03n02 corosync[16913]: [QUORUM] Members[2]: 1 2
May 21 21:29:47 an-a03n02 corosync[16913]: [CPG ] chosen downlist: sender r(0) ip(10.20.30.1) ; members(old:1 left:0)
May 21 21:29:47 an-a03n02 corosync[16913]: [MAIN ] Completed service synchronization, ready to provide service.
May 21 21:29:55 an-a03n02 kernel: dlm: got connection from 1
====

At this point 'pvscan' locks up on both nodes, but 'clustat' still works.

====
[root@an-a03n01 ~]# clustat
Cluster Status for an-anvil-03 @ Sun May 21 21:50:11 2017
Member Status: Quorate

 Member Name                  ID   Status
 ------ ----                  ---- ------
 an-a03n01.alteeve.com        1    Online, Local
 an-a03n02.alteeve.com        2    Online
====

And:

====
[root@an-a03n02 ~]# clustat
Cluster Status for an-anvil-03 @ Sun May 21 21:50:14 2017
Member Status: Quorate

 Member Name                  ID   Status
 ------ ----                  ---- ------
 an-a03n01.alteeve.com        1    Online
 an-a03n02.alteeve.com        2    Online, Local, rgmanager

 Service Name                 Owner (Last)                  State
 ------- ----                 ----- ------                  -----
 service:libvirtd_n01         (an-a03n01.alteeve.com)       stopped
 service:libvirtd_n02         an-a03n02.alteeve.com         started
 service:storage_n01          (an-a03n01.alteeve.com)       stopped
 service:storage_n02          an-a03n02.alteeve.com         started
====

Both nodes were freshly booted, and neither node had been fenced or removed/rejoined since the cluster was first started on both nodes. Note that the nodes had not yet been updated from base 6.9.

====
[root@an-a03n01 ~]# rpm -q lvm2-cluster cman rgmanager gfs2-utils
lvm2-cluster-2.02.143-12.el6.x86_64
cman-3.0.12.1-84.el6.x86_64
rgmanager-3.0.12.1-33.el6.x86_64
gfs2-utils-3.0.12.1-84.el6.x86_64
====

I just noticed that the rgmanager version wasn't reported in the initial comment. It is the same though, so I am guessing it hasn't been updated since 6.9 was released:

====
[root@an-a01n01 ~]# rpm -q rgmanager
rgmanager-3.0.12.1-33.el6.x86_64
====

Note also that this cluster is CentOS, whereas the previous comments were from RHEL 6.9 proper.

This is a pretty serious issue, so far as I can tell. Risking a cluster-wide LVM lockup from simply withdrawing and rejoining the cluster...

To try to recover, I fenced node 1 and tried to directly start cman followed by rgmanager, without manually starting clvmd first. It took a while, but it did eventually start and, once started, pvscan on node 2 finally returned. Could this have something to do with cman and clvmd running without rgmanager running? I suppose it's just coincidence, and fencing node 1 and rejoining it to the cluster was enough. I'll try to gather more data when I hit this bug again.

syslog from node 1 after it was fenced;

====
May 21 21:59:39 an-a03n01 corosync[2360]: [MAIN ] Corosync Cluster Engine ('1.4.7'): started and ready to provide service.
May 21 21:59:39 an-a03n01 corosync[2360]: [MAIN ] Corosync built-in features: nss dbus rdma snmp
May 21 21:59:39 an-a03n01 corosync[2360]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf
May 21 21:59:39 an-a03n01 corosync[2360]: [MAIN ] Successfully parsed cman config
May 21 21:59:39 an-a03n01 corosync[2360]: [TOTEM ] Initializing transport (UDP/IP Unicast).
May 21 21:59:39 an-a03n01 corosync[2360]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
May 21 21:59:39 an-a03n01 corosync[2360]: [TOTEM ] The network interface [10.20.30.1] is now up.
May 21 21:59:40 an-a03n01 corosync[2360]: [QUORUM] Using quorum provider quorum_cman May 21 21:59:40 an-a03n01 corosync[2360]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 May 21 21:59:40 an-a03n01 corosync[2360]: [CMAN ] CMAN 3.0.12.1 (built Mar 24 2017 12:40:51) started May 21 21:59:40 an-a03n01 corosync[2360]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90 May 21 21:59:40 an-a03n01 corosync[2360]: [SERV ] Service engine loaded: openais checkpoint service B.01.01 May 21 21:59:40 an-a03n01 corosync[2360]: [SERV ] Service engine loaded: corosync extended virtual synchrony service May 21 21:59:40 an-a03n01 corosync[2360]: [SERV ] Service engine loaded: corosync configuration service May 21 21:59:40 an-a03n01 corosync[2360]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 May 21 21:59:40 an-a03n01 corosync[2360]: [SERV ] Service engine loaded: corosync cluster config database access v1.01 May 21 21:59:40 an-a03n01 corosync[2360]: [SERV ] Service engine loaded: corosync profile loading service May 21 21:59:40 an-a03n01 corosync[2360]: [QUORUM] Using quorum provider quorum_cman May 21 21:59:40 an-a03n01 corosync[2360]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 May 21 21:59:40 an-a03n01 corosync[2360]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine. May 21 21:59:40 an-a03n01 corosync[2360]: [TOTEM ] adding new UDPU member {10.20.30.1} May 21 21:59:40 an-a03n01 corosync[2360]: [TOTEM ] adding new UDPU member {10.20.30.2} May 21 21:59:40 an-a03n01 corosync[2360]: [TOTEM ] A processor joined or left the membership and a new membership was formed. May 21 21:59:40 an-a03n01 corosync[2360]: [CMAN ] quorum regained, resuming activity May 21 21:59:40 an-a03n01 corosync[2360]: [QUORUM] This node is within the primary component and will provide service. May 21 21:59:40 an-a03n01 corosync[2360]: [QUORUM] Members[1]: 1 May 21 21:59:40 an-a03n01 corosync[2360]: [QUORUM] Members[1]: 1 May 21 21:59:40 an-a03n01 corosync[2360]: [TOTEM ] A processor joined or left the membership and a new membership was formed. May 21 21:59:40 an-a03n01 corosync[2360]: [QUORUM] Members[2]: 1 2 May 21 21:59:40 an-a03n01 corosync[2360]: [QUORUM] Members[2]: 1 2 May 21 21:59:40 an-a03n01 corosync[2360]: [CPG ] chosen downlist: sender r(0) ip(10.20.30.2) ; members(old:1 left:0) May 21 21:59:40 an-a03n01 corosync[2360]: [MAIN ] Completed service synchronization, ready to provide service. May 21 21:59:44 an-a03n01 fenced[2427]: fenced 3.0.12.1 started May 21 21:59:44 an-a03n01 dlm_controld[2442]: dlm_controld 3.0.12.1 started May 21 21:59:45 an-a03n01 gfs_controld[2504]: gfs_controld 3.0.12.1 started May 21 22:00:01 an-a03n01 kernel: dlm: Using TCP for communications May 21 22:00:01 an-a03n01 kernel: dlm: connecting to 2 May 21 22:00:02 an-a03n01 rgmanager[2605]: I am node #1 May 21 22:00:02 an-a03n01 rgmanager[2605]: Resource Group Manager Starting May 21 22:00:04 an-a03n01 rgmanager[2605]: Starting stopped service service:storage_n01 May 21 22:00:04 an-a03n01 kernel: drbd: events: mcg drbd: 2 May 21 22:00:04 an-a03n01 kernel: drbd: initialized. 
Version: 8.4.9-1 (api:1/proto:86-101) May 21 22:00:04 an-a03n01 kernel: drbd: GIT-hash: 9976da086367a2476503ef7f6b13d4567327a280 build by root.ca, 2017-02-05 15:36:19 May 21 22:00:04 an-a03n01 kernel: drbd: registered as block device major 147 May 21 22:00:04 an-a03n01 kernel: drbd r0: Starting worker thread (from drbdsetup-84 [3718]) May 21 22:00:04 an-a03n01 kernel: block drbd0: disk( Diskless -> Attaching ) May 21 22:00:04 an-a03n01 kernel: drbd r0: Method to ensure write ordering: drain May 21 22:00:04 an-a03n01 kernel: block drbd0: max BIO size = 1048576 May 21 22:00:04 an-a03n01 kernel: block drbd0: drbd_bm_resize called with capacity == 326093016 May 21 22:00:04 an-a03n01 kernel: block drbd0: resync bitmap: bits=40761627 words=636901 pages=1244 May 21 22:00:04 an-a03n01 kernel: block drbd0: size = 155 GB (163046508 KB) May 21 22:00:04 an-a03n01 kernel: block drbd0: recounting of set bits took additional 3 jiffies May 21 22:00:04 an-a03n01 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map. May 21 22:00:04 an-a03n01 kernel: block drbd0: disk( Attaching -> Outdated ) May 21 22:00:04 an-a03n01 kernel: block drbd0: attached to UUIDs 421EA68818FE0736:0000000000000000:B471D4A67EBFB993:B470D4A67EBFB993 May 21 22:00:04 an-a03n01 kernel: drbd r0: conn( StandAlone -> Unconnected ) May 21 22:00:04 an-a03n01 kernel: drbd r0: Starting receiver thread (from drbd_w_r0 [3720]) May 21 22:00:04 an-a03n01 kernel: drbd r0: receiver (re)started May 21 22:00:04 an-a03n01 kernel: drbd r0: conn( Unconnected -> WFConnection ) May 21 22:00:04 an-a03n01 rgmanager[2605]: Starting stopped service service:libvirtd_n01 May 21 22:00:05 an-a03n01 rgmanager[2605]: Service service:libvirtd_n01 started May 21 22:00:05 an-a03n01 kernel: Ebtables v2.0 registered May 21 22:00:05 an-a03n01 kernel: ip6_tables: (C) 2000-2006 Netfilter Core Team May 21 22:00:05 an-a03n01 kernel: lo: Disabled Privacy Extensions May 21 22:00:14 an-a03n01 kernel: drbd r0: Handshake successful: Agreed network protocol version 101 May 21 22:00:14 an-a03n01 kernel: drbd r0: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME. 
May 21 22:00:14 an-a03n01 kernel: drbd r0: conn( WFConnection -> WFReportParams ) May 21 22:00:14 an-a03n01 kernel: drbd r0: Starting ack_recv thread (from drbd_r_r0 [3726]) May 21 22:00:14 an-a03n01 kernel: block drbd0: drbd_sync_handshake: May 21 22:00:14 an-a03n01 kernel: block drbd0: self 421EA68818FE0736:0000000000000000:B471D4A67EBFB993:B470D4A67EBFB993 bits:0 flags:0 May 21 22:00:14 an-a03n01 kernel: block drbd0: peer AB97F17A92B4AA55:421EA68818FE0737:B471D4A67EBFB992:B470D4A67EBFB993 bits:15 flags:0 May 21 22:00:14 an-a03n01 kernel: block drbd0: uuid_compare()=-1 by rule 50 May 21 22:00:14 an-a03n01 kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) May 21 22:00:14 an-a03n01 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 27(1), total 27; compression: 100.0% May 21 22:00:14 an-a03n01 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 27(1), total 27; compression: 100.0% May 21 22:00:14 an-a03n01 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID ) May 21 22:00:14 an-a03n01 kernel: block drbd0: updated sync uuid 421FA68818FE0736:0000000000000000:B471D4A67EBFB993:B470D4A67EBFB993 May 21 22:00:14 an-a03n01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 May 21 22:00:14 an-a03n01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0) May 21 22:00:14 an-a03n01 kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent ) May 21 22:00:14 an-a03n01 kernel: block drbd0: Began resync as SyncTarget (will sync 60 KB [15 bits set]). May 21 22:00:14 an-a03n01 kernel: block drbd0: role( Secondary -> Primary ) May 21 22:00:14 an-a03n01 kernel: block drbd0: Resync done (total 1 sec; paused 0 sec; 60 K/sec) May 21 22:00:14 an-a03n01 kernel: block drbd0: updated UUIDs AB97F17A92B4AA55:0000000000000000:421FA68818FE0737:421EA68818FE0737 May 21 22:00:14 an-a03n01 kernel: block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate ) May 21 22:00:14 an-a03n01 kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 May 21 22:00:14 an-a03n01 kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0) May 21 22:00:16 an-a03n01 clvmd: Cluster LVM daemon started - connected to CMAN May 21 22:00:45 an-a03n01 kernel: bcn_bond1: link status definitely up for interface bcn_link2, 1000 Mbps full duplex May 21 22:00:50 an-a03n01 kernel: ifn_bond1: link status definitely up for interface ifn_link2, 1000 Mbps full duplex May 21 22:00:50 an-a03n01 kernel: sn_bond1: link status definitely up for interface sn_link2, 1000 Mbps full duplex May 21 22:01:16 an-a03n01 kernel: Slow work thread pool: Starting up May 21 22:01:16 an-a03n01 kernel: Slow work thread pool: Ready May 21 22:01:16 an-a03n01 kernel: GFS2 (built Mar 21 2017 19:30:12) installed May 21 22:01:16 an-a03n01 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "an-anvil-03:shared" May 21 22:01:16 an-a03n01 kernel: GFS2: fsid=an-anvil-03:shared.0: Joined cluster. Now mounting FS... May 21 22:01:17 an-a03n01 kernel: GFS2: fsid=an-anvil-03:shared.0: jid=0, already locked for use May 21 22:01:17 an-a03n01 kernel: GFS2: fsid=an-anvil-03:shared.0: jid=0: Looking at journal... 
May 21 22:01:17 an-a03n01 kernel: GFS2: fsid=an-anvil-03:shared.0: jid=0: Done May 21 22:01:17 an-a03n01 rgmanager[2605]: Service service:storage_n01 started May 21 22:01:31 an-a03n01 modclusterd: startup succeeded May 21 22:01:31 an-a03n01 oddjobd: oddjobd startup succeeded May 21 22:01:32 an-a03n01 saslauthd[4937]: detach_tty : master pid is: 4937 May 21 22:01:32 an-a03n01 saslauthd[4937]: ipc_init : listening on socket: /var/run/saslauthd/mux May 21 22:01:32 an-a03n01 ricci: startup succeeded ==== The fence of node 1 and it coming back up, from node 2's perspective. ==== May 21 21:29:55 an-a03n02 kernel: dlm: got connection from 1 May 21 21:58:04 an-a03n02 fence_node[3571]: fence an-a03n01 success May 21 21:58:10 an-a03n02 corosync[16913]: [TOTEM ] A processor failed, forming new configuration. May 21 21:58:12 an-a03n02 corosync[16913]: [QUORUM] Members[1]: 2 May 21 21:58:12 an-a03n02 corosync[16913]: [TOTEM ] A processor joined or left the membership and a new membership was formed. May 21 21:58:12 an-a03n02 corosync[16913]: [CPG ] chosen downlist: sender r(0) ip(10.20.30.2) ; members(old:2 left:1) May 21 21:58:12 an-a03n02 corosync[16913]: [MAIN ] Completed service synchronization, ready to provide service. May 21 21:58:12 an-a03n02 kernel: dlm: closing connection to node 1 May 21 21:58:12 an-a03n02 fenced[17038]: node_history_fence_external no nodeid -1 May 21 21:58:12 an-a03n02 fenced[17038]: fencing node an-a03n01.alteeve.com May 21 21:58:30 an-a03n02 fenced[17038]: fence an-a03n01.alteeve.com success May 21 21:59:40 an-a03n02 corosync[16913]: [TOTEM ] A processor joined or left the membership and a new membership was formed. May 21 21:59:40 an-a03n02 corosync[16913]: [QUORUM] Members[2]: 1 2 May 21 21:59:40 an-a03n02 corosync[16913]: [QUORUM] Members[2]: 1 2 May 21 21:59:40 an-a03n02 corosync[16913]: [CPG ] chosen downlist: sender r(0) ip(10.20.30.2) ; members(old:1 left:0) May 21 21:59:40 an-a03n02 corosync[16913]: [MAIN ] Completed service synchronization, ready to provide service. May 21 22:00:01 an-a03n02 kernel: dlm: got connection from 1 May 21 22:00:15 an-a03n02 kernel: drbd r0: Handshake successful: Agreed network protocol version 101 May 21 22:00:15 an-a03n02 kernel: drbd r0: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME. 
May 21 22:00:15 an-a03n02 kernel: drbd r0: conn( WFConnection -> WFReportParams )
May 21 22:00:15 an-a03n02 kernel: drbd r0: Starting ack_recv thread (from drbd_r_r0 [17369])
May 21 22:00:15 an-a03n02 kernel: block drbd0: drbd_sync_handshake:
May 21 22:00:15 an-a03n02 kernel: block drbd0: self AB97F17A92B4AA55:421EA68818FE0737:B471D4A67EBFB992:B470D4A67EBFB993 bits:15 flags:0
May 21 22:00:15 an-a03n02 kernel: block drbd0: peer 421EA68818FE0736:0000000000000000:B471D4A67EBFB993:B470D4A67EBFB993 bits:0 flags:0
May 21 22:00:15 an-a03n02 kernel: block drbd0: uuid_compare()=1 by rule 70
May 21 22:00:15 an-a03n02 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent )
May 21 22:00:15 an-a03n02 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 27(1), total 27; compression: 100.0%
May 21 22:00:15 an-a03n02 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 27(1), total 27; compression: 100.0%
May 21 22:00:15 an-a03n02 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0
May 21 22:00:15 an-a03n02 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
May 21 22:00:15 an-a03n02 kernel: block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
May 21 22:00:15 an-a03n02 kernel: block drbd0: Began resync as SyncSource (will sync 60 KB [15 bits set]).
May 21 22:00:15 an-a03n02 kernel: block drbd0: updated sync UUID AB97F17A92B4AA55:421FA68818FE0737:421EA68818FE0737:B471D4A67EBFB992
May 21 22:00:15 an-a03n02 kernel: block drbd0: peer( Secondary -> Primary )
May 21 22:00:15 an-a03n02 kernel: block drbd0: Resync done (total 1 sec; paused 0 sec; 60 K/sec)
May 21 22:00:15 an-a03n02 kernel: block drbd0: updated UUIDs AB97F17A92B4AA55:0000000000000000:421FA68818FE0737:421EA68818FE0737
May 21 22:00:15 an-a03n02 kernel: block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
May 21 22:00:15 an-a03n02 kernel: drbd r0: helper command: /sbin/drbdadm unfence-peer r0
May 21 22:00:15 an-a03n02 kernel: drbd r0: helper command: /sbin/drbdadm unfence-peer r0 exit code 0 (0x0)
====

One more note; cluster 1 (an-anvil-01) is on physical hardware and cluster 2 (an-anvil-03) is a KVM-based cluster. So cluster 1's fencing is IPMI -> PDUs and cluster 2 is just fence_virsh. Their configs are otherwise identical (same installer).

Created attachment 1280889 [details]
sos report from node 1
md5: 675c774212efa45d1e6b600f43d0b40c
Created attachment 1280890 [details]
sos report from node 2
md5: b6ebbf0390a346585b2181357496026b
Created attachment 1280953 [details]
sos report from an-anvil-01 - node 1
md5: 6c17bb2b210278bced00c081b6f51b2f
Created attachment 1280954 [details]
sos report from an-anvil-01 - node 2
md5: 689ac077104c7032e8d1146440d602a8
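For reference, the withdraw/rejoin sequence described in the comments above can be condensed into a small test script. This is only a sketch of the reported steps; the sleep and timeout values are arbitrary choices, not part of the original report:

====
#!/bin/bash
# Withdraw this node from the cluster, rejoin it, then test whether clvmd
# still answers. Mirrors the sequence reported above.
/etc/init.d/rgmanager stop && /etc/init.d/cman stop
sleep 120                      # "a few minutes later" in the report
/etc/init.d/cman start
/etc/init.d/clvmd start
# pvscan blocks forever when the bug triggers, so bound it rather than
# letting the test join the hang.
if ! timeout 30 pvscan; then
    echo "pvscan hung or failed; clvmd is likely stuck" >&2
fi
====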
For posterity (from freenode -> #clusterlabs) (unrelated chatter removed):

====
03:04 <@chrissie> digimer: can you enable clvmd debug logging please? (clvmd -d2)
03:39 < digimer> chrissie: it happens sporadically. is there a way to set -d2 permanently?
03:39 < digimer> I'll run a sosreport next time it happens
03:42 <@chrissie> You can set it in the sysconfig file I think. (this is a memory test, right?)
03:44 < digimer> which sysconfig file?
03:47 <@chrissie> add it to CLVMDOPTS in /etc/init.d/clvmd
03:48 <@chrissie> There's /etc/sysconfig/clvmd too, but it's not shipped by default and you'd need to remember to add -T30 (which is the default in the init script but which would be overridden)
03:49 < digimer> done
03:49 < digimer> ya, I looked in /etc/sysconfig/ but didn't see clvmd
03:50 < digimer> ok, restarted clvmd on both nodes
03:51 < digimer> storage back online
03:51 < digimer> I'll do more testing tomorrow of start/stop storage and fencing nodes
03:51 <@chrissie> It's odd that dlm is still OK, it's the most usual cause of clvmd hangs. The second is hung storage
03:51 < digimer> anything else I can do in the meantime?
03:52 <@chrissie> so I'm at a bit of a loss to explain it at the moment
03:52 < digimer> in this case, neither is the problem
03:52 <@chrissie> has it started happening since an upgrade?
03:52 < digimer> gfs2 still works fine, as do VMs using clvmd LVs as raw storage
03:52 < digimer> just can't touch pvscan or the like or else it locks everything up
03:52 < digimer> dunno
03:52 < digimer> I've recently started much more serious / heavy testing getting ready for release
03:53 < digimer> I didn't see it before
03:53 < digimer> or rather, I saw dlm hangs cause problems like this, as you said
03:53 < digimer> but I can't recall ever seeing just clvmd lock up when gfs2/rgmanager were still fine
03:54 < digimer> though it is scaring me pretty badly that deployed systems might face this
03:54 < digimer> given that recovery seems to require reconnecting the peer, and that won't be possible if a node is destroyed in the field
03:54 <@chrissie> yes, it scares me too. I
03:54 <@chrissie> I've not seen anything like this happen on RHEL6
03:56 < digimer> let me see if I can trigger it
03:56 <@chrissie> in the logs I've seen, there's no reboot between cluster shutdown and restart, is that common?
03:57 < digimer> that was the case in the most recent event
03:57 < digimer> on an-anvil-03
03:57 < digimer> it happened previously on a node that was fenced after I 'echo c >/proc/sysrq-trigger'ed it with IPMI unplugged (to force backup fencing to be used)
03:58 < digimer> so it doesn't seem to matter if it reboots between restarts
03:58 <@chrissie> ok
03:59 < digimer> stopped -> started (no reboot), OK
03:59 < digimer> I don't have a reproducer, so I have to just repeat until it happens...
04:02 < digimer> ok that time, too
04:02 * digimer tries fencing the node
04:03 < digimer> this is odd. could this be related?
04:03 < digimer> May 22 04:02:04 an-a01n01 rsyslogd-2177: imuxsock begins to drop messages from pid 1270 due to rate-limiting
04:03 < digimer> May 22 04:02:09 an-a01n01 rsyslogd-2177: imuxsock lost 1924 messages from pid 1270 due to rate-limiting
04:03 < digimer> repeats over and over
04:03 < digimer> happening on node 2, as well
04:04 < digimer> [root@an-a01n01 ~]# ps aux | grep 1270
04:04 < digimer> root      1270  0.8  0.0 257052 40176 ?   SLsl 03:58   0:02 clvmd -T30 -d2
04:04 < digimer> it's clvmd
04:04 <@chrissie> ahh, those will be the debug log messages
04:04 < digimer> ok, so don't worry about that?
04:04 < digimer> if not, I'll go back to fence testing
04:04 <@chrissie> so it's doing *something* in a loop!
04:05 < digimer> oh, so that isn't normal?
04:05 < digimer> should I create a sosreport now?
04:05 <@chrissie> it shouldn't be logging 2000 messages a second
04:05 <@chrissie> yes please
04:05 < digimer> ok. so I should stop the cluster on both nodes and generate the report
04:05 < digimer> give me a few
04:06 <@chrissie> clvmd should only log when it's been asked to do something (which mostly it isn't - it's a management tool)
04:06 < digimer> these reports will be from actual RHEL 6.9 on hardware, fully updated
04:07 < digimer> the logging started when I withdrew and then rejoined the nodes
04:08 <@chrissie> hmm interesting. I hope we have at least some of the clvmd firehose
04:08 < digimer> append them to the rhbz or just upload them where you can get at them?
04:08 < digimer> firehose?
04:09 <@chrissie> so it's doing *something* in a loop!mad amount of logging it seems to be doing
04:09 <@chrissie> oops
04:09 * chrissie needs more tea
04:09 * digimer nods
04:09 <@chrissie> best put it in the bz, then all the info is in one place
04:09 <@chrissie> what's the BZ#?
04:11 < digimer> 1452894
04:13 < digimer> uploaded; c#10 and c#11
04:18 <@chrissie> hmm nothing from clvmd
04:18 <@chrissie> That's cos *.debug logging is disabled #facepalm
04:18 < digimer> ?
04:18 < digimer> in the sosreport?
04:19 <@chrissie> in syslog.conf
04:19 < digimer> oh
04:19 < digimer> so I need to start over?
04:19 <@chrissie> sorry, yes :(
===
12:32 < digimer> chrissie_away: I enabled debug logging on rsyslog.conf
12:32 < digimer> and I noticed that there would be a flood of logs when I ran my program that scans the cluster (including calling pvs/vgs/lvs), formats and displays the state info
12:32 < digimer> there were bursts of logs when it ran
12:33 < digimer> so
12:33 < digimer> a) calling pvs/pvscan et. al. too often breaks things (which I imagine shouldn't happen), or these logs are not themselves related to the crash
12:33 < digimer> note that when LVM hung, I stopped all our code to try and rule it out
12:34 < digimer> I'll do more experiments though and try to see
12:36 < digimer> here is what our program calls; https://github.com/ClusterLabs/striker/blob/master/tools/anvil-report-state#L668
12:37 < digimer> basically, pvs + vgs + lvs
12:37 < digimer> there is a function above it that used to call pvscan + vgscan + lvscan, but that was disabled ages ago
12:38 < digimer> so if pvs + vgs + lvs causes a flood of debug logs, then that would explain it
12:38 < digimer> it == log rate limiting
12:38 < digimer> but it shouldn't cause a total lockup of cLVM
12:39 < digimer> I'll add this to the rhbz
====

Created attachment 1281166 [details]
sos report from an-anvil-01 - node 1 - Post-fence hang
This sosreport is from after fencing node #2 and the issue occurred. As before, I could write to the gfs2 partition and run clustat once the node was successfully fenced, but I couldn't call pvscan.
md5: c887244d74e6a40654e10f16c645f9c2
Created attachment 1281167 [details]
sos report from an-anvil-01 - node 2 - Post-fence hang
This is the node that was fenced. I crashed it with 'echo c > /proc/sysrq-trigger'.
md5: 6e50599634182408877e64f623da87b9
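The clvmd debug logging discussed in the transcript above can be made persistent instead of editing the init script. A sketch of the setup chrissie describes, assuming the stock RHEL 6 init script; the log file path is an arbitrary choice, and -T30 must be restated because the sysconfig file overrides the init script default:

====
# /etc/sysconfig/clvmd is not shipped by default; the init script sources
# it if present:
echo 'CLVMDOPTS="-T30 -d2"' > /etc/sysconfig/clvmd

# clvmd -d2 logs through syslog at debug priority, which the default
# RHEL 6 rsyslog.conf discards; store it somewhere:
echo '*.debug    /var/log/clvmd-debug.log' >> /etc/rsyslog.conf
# If the imuxsock rate-limiting seen in this bug drops messages,
# '$SystemLogRateLimitInterval 0' in rsyslog.conf disables it.
service rsyslog restart
/etc/init.d/clvmd restart
====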
Related to c#13 and c#14, before the hang, 'dmsetup info' returned;

====
[root@an-a01n01 ~]# dmsetup info
Name:              an--a01n01_vg0-srv01--centos7_0
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        0
Event number:      0
Major, minor:      253, 1
Number of targets: 1
UUID: LVM-8ZdU3fzKgMtr7523qxLmyBdAdiIuqd2h9YN9gLkz63EUc9WYdPqmI5Ue122Mm3Ek

Name:              an--a01n01_vg0-shared
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 0
Number of targets: 1
UUID: LVM-8ZdU3fzKgMtr7523qxLmyBdAdiIuqd2hrVbI7JgURjpE0zKZ4kw0eq5in5tcLUwr
====

After node 2 was fenced, while clvmd was hung and gfs2/rgmanager worked, 'dmsetup info' returned;

====
[root@an-a01n01 ~]# dmsetup info
Name:              an--a01n01_vg0-srv01--centos7_0
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        0
Event number:      0
Major, minor:      253, 1
Number of targets: 1
UUID: LVM-8ZdU3fzKgMtr7523qxLmyBdAdiIuqd2h9YN9gLkz63EUc9WYdPqmI5Ue122Mm3Ek

Name:              an--a01n01_vg0-shared
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 0
Number of targets: 1
UUID: LVM-8ZdU3fzKgMtr7523qxLmyBdAdiIuqd2hrVbI7JgURjpE0zKZ4kw0eq5in5tcLUwr
====

08:52 <@chrissie> So the sosreport managed to run pvs et al successfully (which it does by disabling locking) so I suspect a hung lock in the DLM somewhere. Annoyingly sosreport doesn't dump the dlm lockspace
09:00 <@chrissie> there's an un-unlocked lock on the VG showing in the log, I can't be sure it's not the rate-limiting that's removed the log line though!
09:06 <@chrissie> so it looks like clvmd hasn't got the next (unlock) command from the lvm command-line AND the command-line hadn't quit
09:11 <@chrissie> I'd be interested to hear an LVM developer's take on that
09:23 -!- mbecroft [~user.co.nz] has quit [Ping timeout: 240 seconds]
09:26 < digimer> do you know who to ping on that?
09:27 < digimer> I'm not sure how to monitor LVM without pvs/vgs/lvs...
09:27 < digimer> I loop about every 45 seconds, give or take
09:27 < digimer> and I agree, a total lock up shouldn't occur
09:28 < digimer> do you want me to do something different to make sosreport more useful?
09:28 * digimer won't be doing it tonight though, it's 4:30am... need some sleep
09:28 < digimer> but I can make any changes and try to reproduce when I get back to it in a few hours
09:29 <@chrissie> without more knowledge I think the thing to do is to strace the pvs command that hangs
09:30 <@chrissie> or (maybe better) add verbose lvm logging to the command
09:30 <@chrissie> or (maybe better) add verbose lvm logging to the command
09:31 <@chrissie> -vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
09:31 <@chrissie> (not sure how many v's you need, I just repeat it :)
09:31 < digimer> ok
09:31 < digimer> strace -v^v pvscan
09:31 < digimer> yes?
09:32 <@chrissie> no, the -vvvvvv needs to go after the pvs command
09:32 <@chrissie> it's an argument to lvm
09:32 < digimer> ah
09:32 < digimer> strace pvscan -vvvv
09:32 < digimer> got it
09:32 <@chrissie> zkabelac is who I usually speak to on LVM matters. I'm not sure if he hangs out here or not. I'll ping him when I next see him on the internal channels
09:32 < digimer> want another sosreport, or just the strace?
09:32 < digimer> thanks!
09:33 <@chrissie> just the command output & strace should be fine

The part of the DLM log I'm referring to is this (from the sosreport in comment #13):

May 22 13:24:14 an-a01n01 lvm[5131]: Got new connection on fd 5 (0x7f0bb9a9ed00)
May 22 13:24:14 an-a01n01 lvm[5131]: Read on local socket 5, len = 35
May 22 13:24:14 an-a01n01 lvm[5131]: Creating pipe, [11, 12]
May 22 13:24:14 an-a01n01 lvm[5131]: Creating pre&post thread for pipe fd 11 (0x7f0bb9a9edf0)
May 22 13:24:14 an-a01n01 lvm[5131]: Created pre&post thread, state = 0
May 22 13:24:14 an-a01n01 lvm[5131]: Pre&post thread (0x7f0bb9a9ed00), pipe fd 12
May 22 13:24:14 an-a01n01 lvm[5131]: doing PRE command LOCK_VG 'V_an-a01n01_vg0' at 1 (client=0x7f0bb9a9ed00)
May 22 13:24:14 an-a01n01 lvm[5131]: sync_lock: 'V_an-a01n01_vg0' mode:3 flags=0
May 22 13:24:14 an-a01n01 lvm[5131]: sync_lock: returning lkid 9d0001
May 22 13:24:14 an-a01n01 lvm[5131]: Pre&post thread (0x7f0bb9a9ed00) writes status 0 down to pipe fd 12
May 22 13:24:14 an-a01n01 lvm[5131]: Pre&post thread (0x7f0bb9a9ed00) waiting to do post command, state = 0
May 22 13:24:14 an-a01n01 lvm[5131]: Read on pipe 11, 4 bytes, status 0
May 22 13:24:14 an-a01n01 lvm[5131]: Background routine status was 0, sock_client (0x7f0bb9a9ed00)
May 22 13:24:14 an-a01n01 lvm[5131]: distribute command: XID = 1373, flags=0x1 (LOCAL)
May 22 13:24:14 an-a01n01 lvm[5131]: Doing command explicitly on local node only
May 22 13:24:14 an-a01n01 lvm[5131]: add_to_lvmqueue: cmd=0x7f0bb9a9e480. client=0x7f0bb9a9ed00, msg=0x7f0bb9a960d0, len=35, csid=(nil), xid=1373
May 22 13:24:14 an-a01n01 lvm[5131]: process_work_item: local
May 22 13:24:14 an-a01n01 lvm[5131]: process_local_command: LOCK_VG (0x33) msg=0x7f0bb9a9e4c0, msglen =35, client=0x7f0bb9a9ed00
May 22 13:24:14 an-a01n01 lvm[5131]: do_lock_vg: resource 'V_an-a01n01_vg0', cmd = 0x1 LCK_VG (READ|VG), flags = 0x4 ( DMEVENTD_MONITOR ), critical_section = 0
May 22 13:24:14 an-a01n01 lvm[5131]: Invalidating cached metadata for VG an-a01n01_vg0
May 22 13:24:14 an-a01n01 lvm[5131]: Reply from node an-a01n01.alteeve.com: 0 bytes
May 22 13:24:14 an-a01n01 lvm[5131]: Got 1 replies, expecting: 1
May 22 13:24:14 an-a01n01 lvm[5131]: LVM thread waiting for work
May 22 13:24:14 an-a01n01 lvm[5131]: Pre&post thread (0x7f0bb9a9ed00) got post command condition...
May 22 13:24:14 an-a01n01 lvm[5131]: Pre&post thread (0x7f0bb9a9ed00) waiting for next pre command
May 22 13:24:14 an-a01n01 lvm[5131]: Read on pipe 11, 4 bytes, status 0
May 22 13:24:14 an-a01n01 lvm[5131]: Background routine status was 0, sock_client (0x7f0bb9a9ed00)
May 22 13:24:14 an-a01n01 lvm[5131]: Send local reply
May 22 13:24:14 an-a01n01 lvm[5131]: Read on local socket 5, len = 84
May 22 13:24:14 an-a01n01 lvm[5131]: check_all_clvmds_running
May 22 13:24:14 an-a01n01 lvm[5131]: Pre&post thread (0x7f0bb9a9ed00) got pre command condition...
May 22 13:24:14 an-a01n01 lvm[5131]: Pre&post thread (0x7f0bb9a9ed00) writes status 0 down to pipe fd 12
May 22 13:24:14 an-a01n01 lvm[5131]: Pre&post thread (0x7f0bb9a9ed00) waiting to do post command, state = 0
May 22 13:24:14 an-a01n01 lvm[5131]: Read on pipe 11, 4 bytes, status 0
...

There's more here, but the PRE command gets a lock on the VG, runs the command it wants to do, and then I don't see another PRE command to unlock the VG. The log ends here:

May 22 13:24:14 an-a01n01 lvm[5131]: add_to_lvmqueue: cmd=0x7f0bb9a9e480. client=0x7f0bb9a9ed00, msg=0x7f0bb9a960d0, len=31, csid=(nil), xid=1376
May 22 13:24:14 an-a01n01 lvm[5131]: Sending message to all cluster nodes
May 22 13:24:14 an-a01n01 lvm[5131]: process_work_item: local
May 22 13:24:14 an-a01n01 lvm[5131]: process_local_command: SYNC_NAMES (0x2d) msg=0x7f0bb9a9e4c0, msglen =31, client=0x7f0bb9a9ed00
May 22 13:24:14 an-a01n01 lvm[5131]: Syncing device names
May 22 13:24:14 an-a01n01 lvm[5131]: Reply from node an-a01n01.alteeve.com: 0 bytes
May 22 13:24:14 an-a01n01 lvm[5131]: Got 1 replies, expecting: 2
May 22 13:24:14 an-a01n01 lvm[5131]: LVM thread waiting for work
May 22 13:24:14 an-a01n01 lvm[5131]: Reply from node an-a01n02.alteeve.com: 0 bytes
May 22 13:24:14 an-a01n01 lvm[5131]: Got 2 replies, expecting: 2
May 22 13:24:14 an-a01n01 lvm[5131]: Pre&post thread (0x7f0bb9a9ed00) got post command condition...
May 22 13:24:14 an-a01n01 lvm[5131]: Pre&post thread (0x7f0bb9a9ed00) waiting for next pre command

It happened again. Here is the strace:

====
...
stat("/dev/drbd0", {st_mode=S_IFBLK|0660, st_rdev=makedev(147, 0), ...}) = 0
stat("/dev/block/253:0", 0x7ffc37313170) = -1 ENOENT (No such file or directory)
stat("/dev/block/253:0", 0x7ffc373130a0) = -1 ENOENT (No such file or directory)
stat("/dev/disk/by-label/ccrs:shared", 0x7ffc37313170) = -1 ENOENT (No such file or directory)
stat("/dev/disk/by-label/ccrs:shared", 0x7ffc373130a0) = -1 ENOENT (No such file or directory)
stat("/dev/disk/by-id/dm-name-node1_vg0-shared", 0x7ffc37313170) = -1 ENOENT (No such file or directory)
stat("/dev/disk/by-id/dm-name-node1_vg0-shared", 0x7ffc373130a0) = -1 ENOENT (No such file or directory)
stat("/dev/drbd/by-disk/vda5", {st_mode=S_IFBLK|0660, st_rdev=makedev(147, 0), ...}) = 0
stat("/dev/mapper/node1_vg0-shared", 0x7ffc37313170) = -1 ENOENT (No such file or directory)
stat("/dev/mapper/node1_vg0-shared", 0x7ffc373130a0) = -1 ENOENT (No such file or directory)
stat("/dev/node1_vg0/shared", 0x7ffc37313170) = -1 ENOENT (No such file or directory)
stat("/dev/node1_vg0/shared", 0x7ffc373130a0) = -1 ENOENT (No such file or directory)
stat("/dev/drbd/by-res/r0/0", {st_mode=S_IFBLK|0660, st_rdev=makedev(147, 0), ...}) = 0
stat("/dev/dm-0", 0x7ffc37313170) = -1 ENOENT (No such file or directory)
stat("/dev/dm-0", 0x7ffc373130a0) = -1 ENOENT (No such file or directory)
stat("/dev/disk/by-id/dm-uuid-LVM-u2b7folff7WHVHm9LKbB4lmPN5S6KfGYbMRcecVTfHUpeIQxsuKD5FqCAthuL8Ym", 0x7ffc37313170) = -1 ENOENT (No such file or directory)
stat("/dev/disk/by-id/dm-uuid-LVM-u2b7folff7WHVHm9LKbB4lmPN5S6KfGYbMRcecVTfHUpeIQxsuKD5FqCAthuL8Ym", 0x7ffc373130a0) = -1 ENOENT (No such file or directory)
open("/sys/dev/block", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
getdents(4, /* 34 entries */, 32768) = 872
open("/sys/dev/block/1:0/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:1/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:2/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:3/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:4/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:5/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:6/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:7/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:8/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:9/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:10/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:11/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:12/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:13/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:14/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:15/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/7:0/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/7:1/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/7:2/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/7:3/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/7:4/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/7:5/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/7:6/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/7:7/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/252:0/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/252:1/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/252:2/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/252:3/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/252:4/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/252:5/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 3 entries */, 32768) = 80  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/11:0/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/147:0/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
getdents(4, /* 0 entries */, 32768) = 0
close(4) = 0
close(3) = 0
socket(PF_LOCAL, SOCK_STREAM, 0) = 3
connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/lvm/clvmd.sock"}, 110) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
write(3, "3\0\0\0\0\0\0\0\0\0\0\0\v\0\0\0\0\4\4P_#global\0", 29) = 29
read(3,
====

This is where it sits. The full strace will be attached momentarily.

Created attachment 1288909 [details]
Full strace of pvscan
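The strace ends with pvscan blocked in read() on /var/run/lvm/clvmd.sock right after writing a request for the 'P_#global' lock, i.e. it is waiting for a reply that clvmd never sends. When the hang is live, a few checks can narrow down where each side is stuck before the node is fenced. A sketch, assuming debugfs is mounted for dlm_tool and a kernel that exposes /proc/<pid>/stack:

====
# Where is the hung pvscan sleeping in the kernel?
PID=$(pgrep -x pvscan | head -n1)
cat /proc/$PID/wchan; echo
cat /proc/$PID/stack

# What is clvmd itself doing?
cat /proc/$(pidof clvmd)/stack

# Dump the clvmd DLM lockspace - the piece chrissie noted the
# sosreports are missing:
dlm_tool ls
dlm_tool lockdebug clvmd
====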
I just hit a bug that seems to be related to this, but it didn't manifest in the normal way. I tried to update the cluster and, as part of that, called for rgmanager to stop; it hung there. When I investigated, I found that a 'vgs' process was stuck and I couldn't kill it.

====
[root@node01 ~]# pvscan
  PV /dev/drbd0   VG node01_vg0   lvm2 [1.18 TiB / 1.15 TiB free]
  Total: 1 [1.18 TiB] / in use: 1 [1.18 TiB] / in no VG: 0 [0   ]
[root@node01 ~]# vgscan
  Reading all physical volumes.  This may take a while...
  Found volume group "node01_vg0" using metadata type lvm2
[root@node01 ~]# lvscan
  ACTIVE            '/dev/node01_vg0/shared' [20.00 GiB] inherit
  ACTIVE            '/dev/node01_vg0/vm03-server_0' [20.00 GiB] inherit
[root@node01 ~]# lvchange -an /dev/node01_vg0/vm03-server_0
  Error locking on node node02.example.com: Logical volume node01_vg0/vm03-server_0 in use.
[root@node01 ~]# /etc/init.d/libvirtd start
Starting libvirtd daemon:                                  [  OK  ]
[root@node01 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------

[root@node01 ~]# ps aux | grep server
root     15614  0.0  0.0 105368   912 pts/4    S+   00:46   0:00 grep server
[root@node01 ~]# ps aux | grep vgs
root      1079  0.0  0.0 129420  2864 ?        S    Jun23   0:00 /sbin/vgs --noheadings -o vg_name -S vg_clustered=1
root      1360  0.0  0.0  32536  4780 ?        S<   Jun23   0:00 /sbin/vgs --noheadings -o vg_name -S vg_clustered=1
root     15649  0.0  0.0 105372   912 pts/4    S+   00:46   0:00 grep vgs
[root@node01 ~]# kill 1079
[root@node01 ~]# kill 1360
[root@node01 ~]# ps aux | grep vgs
root      1079  0.0  0.0 129420  2864 ?        S    Jun23   0:00 /sbin/vgs --noheadings -o vg_name -S vg_clustered=1
root      1360  0.0  0.0  32536  4780 ?        S<   Jun23   0:00 /sbin/vgs --noheadings -o vg_name -S vg_clustered=1
root     15711  0.0  0.0 105372   916 pts/4    S+   00:47   0:00 grep vgs
[root@node01 ~]# ps auxf | grep vgs -B 5 -A 5
root      9738  0.0  0.0 102072  4076 ?        Ss   00:25   0:00  \_ sshd: root@pts/3
root      9748  0.0  0.0 112068  3632 pts/3    Ss+  00:25   0:00  |   \_ -bash
root     13447  0.0  0.0 102712  4616 ?        Ss   00:38   0:00  \_ sshd: root@pts/4
root     13460  0.0  0.0 112188  3804 pts/4    Ss   00:38   0:00  |   \_ -bash
root     15777  0.0  0.0 112288  1600 pts/4    R+   00:47   0:00  |       \_ ps auxf
root     15778  0.0  0.0 105368   880 pts/4    S+   00:47   0:00  |       \_ grep vgs -B 5 -A 5
root     13455  0.0  0.0 102072  4100 ?        Ss   00:38   0:00  \_ sshd: root@pts/5
root     13490  0.0  0.0 112188  3780 pts/5    Ss   00:38   0:00      \_ -bash
root     14429  0.0  0.0 103004   672 pts/5    S+   00:42   0:00          \_ tail -f -n 100 /var/log/messages
ntp       3413  0.0  0.0  32788  2148 ?        Ss   Jun21   0:07 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
root      3492  0.0  0.0  83000  3464 ?        Ss   Jun21   0:16 /usr/libexec/postfix/master
--
root     24294  0.0  0.0  38992  7276 ?        S<Ls Jun21   0:00 rgmanager
root     24296  0.0  0.0 789756  5548 ?        S<l  Jun21   1:29  \_ rgmanager
root      1303  0.0  0.0  13352  1488 ?        S<   Jun23   0:00      \_ /bin/bash /usr/share/cluster/script.sh status
root      1353  0.0  0.0  13616  1792 ?        S<   Jun23   0:00          \_ /bin/bash /etc/init.d/clvmd status
root      1359  0.0  0.0  13616  1000 ?        S<   Jun23   0:00              \_ /bin/bash /etc/init.d/clvmd status
root      1360  0.0  0.0  32536  4780 ?        S<   Jun23   0:00                  \_ /sbin/vgs --noheadings -o vg_name -S vg_clustered=1
root      1079  0.0  0.0 129420  2864 ?        S    Jun23   0:00 /sbin/vgs --noheadings -o vg_name -S vg_clustered=1
root     11311  0.0  0.0 108124  1504 ?        Ss   00:29   0:00 bash -c /etc/init.d/rgmanager stop && /etc/init.d/cman stop
root     11318  0.1  0.0 108524  1896 ?        S    00:29   0:01  \_ /bin/bash /etc/init.d/rgmanager stop
root     15776  0.0  0.0 102968   620 ?        S    00:47   0:00      \_ sleep 1
root     15486  0.2  0.0 698516  8092 ?        Sl   00:46   0:00 libvirtd --daemon
[root@node01 ~]# exit
====

(slightly edited to anonymize data, but just domain/server names).
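The ps tree above also shows where the stuck vgs processes came from: rgmanager's periodic status check (rgmanager -> script.sh status -> /etc/init.d/clvmd status) runs the command below. With clvmd wedged, each poll presumably blocks on the clvmd socket just like the traced pvscan, which would explain why 'rgmanager stop' then hangs behind them:

====
# The status-check command visible in the ps tree. With clvmd healthy it
# lists clustered VGs and exits; with clvmd hung it never returns.
/sbin/vgs --noheadings -o vg_name -S vg_clustered=1
====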
Whatever I tried, I couldn't recover the node, so I manually fenced it off (just a 'fence_node ...'). Alasdair mentioned on IRC that this might be related to a bug where things can lock up if an LVM operation happens just as the cluster membership is changing. Here are the syslogs from the same node:

====
Jun 24 00:29:51 node01 rgmanager[24296]: Shutting down
Jun 24 00:29:51 node01 rgmanager[24296]: Stopping service service:libvirtd_n01
Jun 24 00:29:52 node01 rgmanager[24296]: Service service:libvirtd_n01 is stopped
Jun 24 00:30:01 node01 rgmanager[24296]: Shutting down
Jun 24 00:42:41 node01 rgmanager[24296]: Shutting down
Jun 24 00:42:47 node01 rgmanager[24296]: Shutting down
Jun 24 00:43:45 node01 kernel: Slow work thread pool: Shutting down
Jun 24 00:43:45 node01 kernel: Slow work thread pool: Shut down complete
Jun 24 00:43:52 node01 kernel: block drbd0: State change failed: Device is held open by someone
Jun 24 00:43:52 node01 kernel: block drbd0:   state = { cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate r----- }
Jun 24 00:43:52 node01 kernel: block drbd0:  wanted = { cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate r----- }
Jun 24 00:43:52 node01 kernel: block drbd0: State change failed: Device is held open by someone
Jun 24 00:43:52 node01 kernel: block drbd0:   state = { cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate r----- }
Jun 24 00:43:52 node01 kernel: block drbd0:  wanted = { cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate r----- }
Jun 24 00:44:29 node01 kernel: block drbd0: State change failed: Device is held open by someone
Jun 24 00:44:29 node01 kernel: block drbd0:   state = { cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate r----- }
Jun 24 00:44:29 node01 kernel: block drbd0:  wanted = { cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate r----- }
Jun 24 00:46:28 node01 kernel: lo: Disabled Privacy Extensions
====

As this is a user's system, I couldn't install debuginfo, I'm afraid, so this is mainly a comment in case it helps. I should mention that after node 1 was fenced, rebooted and rejoined the cluster, things started working normally again.

There are 4 upstream patches now that might be relevant:

commit f2eda36cfa9baf3f7e132dbc3a3f34c908cb36d4
Author: Alasdair G Kergon <agk>
Date:   Sat Jul 1 01:34:38 2017 +0100

    clvmd: Fix client list corruption

    Centralise editing of the client list into _add_client() and _del_client().
    Introduce _local_client_count to track the size of the list for debugging
    purposes.  Simplify and standardise the various ways the list gets walked.

    While processing one element of the list in main_loop(), cleanup_zombie()
    may be called and remove a different element, so make sure main_loop()
    refreshes its list state on return.

    Prior to this patch, the list edits for clients disappearing could race
    against the list edits for new clients connecting and corrupt the list
    and cause a variety of segfaults.  An easy way to trigger such failures
    was by repeatedly running shell commands such as:
      lvs &; lvs &; lvs &;...;killall -9 lvs; lvs &; lvs &;...

    Situations that occasionally lead to the failures can be spotted by
    looking for 'EOF' with 'inprogress=1' in the clvmd debug logs.

commit af789fd6d0cba45cc4897772bae3ca73a3f15b38
Author: Alasdair G Kergon <agk>
Date:   Sat Jul 1 01:17:40 2017 +0100

    clvmd: add client id to debug log messages

    Use standard format to make it easier to find the client to which each
    debug log message refers.
commit 17ed25409136162d091809d412596a45e0a30fd4
Author: Alasdair G Kergon <agk>
Date:   Sat Jul 1 00:58:39 2017 +0100

    clvmd: add debuglog mutex

    Log messages issued by different threads occasionally got intertwined.

commit dae4f53acb269219e876c229c8f034fcdaf3ff5a
Author: Zdenek Kabelac <zkabelac>
Date:   Sat Feb 4 14:47:27 2017 +0100

    clvmd: add mutex protection for cpg_ call

    The library for corosync multicasting is not supporting multithread
    usage - add local mutex to avoid parallel call of cpg_mcast_joined().

(Minimal sketches of the list-editing and locking patterns these patches describe are included after the last comment below.)

There will be no further work on this for RHEL 6. If the fixes from comment 21 do not address the issue, you can open a bug against RHEL 7 (or reopen this bug and move it to RHEL 7).

RHEL 6 is still supported for bug and security fixes, isn't it? This bug causes production RHEL 6 clusters to hang badly. Does that not qualify for a bug fix under RHEL 6 support? With the change from cman/rgmanager to pacemaker, along with the change from SysV init to systemd, simply upgrading to RHEL 7 isn't viable at this time.
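To make the client-list fix concrete, here is a minimal sketch of the centralised list-editing pattern that commit f2eda36 describes. Only the names _add_client(), _del_client() and _local_client_count come from the commit message; the struct layout and function bodies below are simplified stand-ins, not clvmd's actual code:

====
#include <stddef.h>

/* Simplified stand-in for clvmd's per-client state. */
struct local_client {
	int fd;                     /* client socket */
	struct local_client *next;  /* next entry in the singly-linked list */
};

static struct local_client *_client_list = NULL;
static int _local_client_count = 0;   /* list size, for debugging */

/* All additions to the client list go through here. */
static void _add_client(struct local_client *client)
{
	client->next = _client_list;
	_client_list = client;
	_local_client_count++;
}

/* All removals go through here.  Returns 0 on success, -1 if the
 * client was not on the list. */
static int _del_client(struct local_client *client)
{
	struct local_client **link = &_client_list;

	while (*link) {
		if (*link == client) {
			*link = client->next;   /* unlink */
			_local_client_count--;
			return 0;
		}
		link = &(*link)->next;
	}
	return -1;
}
====

Funnelling every edit through two helpers means the list can only change at known call sites. Per the commit message, cleanup_zombie() may still remove a different element while main_loop() is mid-walk, which is why main_loop() additionally has to refresh its position in the list after any call that might have edited it.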
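And a minimal sketch of the serialisation added by the two mutex patches (17ed254 and dae4f53a). The wrapper name _cpg_mcast_locked() is hypothetical; only cpg_mcast_joined() and the pthread calls are real APIs:

====
#include <pthread.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

/* libcpg does not support multithreaded use, so every call to
 * cpg_mcast_joined() is serialised behind one process-wide mutex. */
static pthread_mutex_t _cpg_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical wrapper: callers use this instead of calling
 * cpg_mcast_joined() directly. */
static cs_error_t _cpg_mcast_locked(cpg_handle_t handle,
                                    const struct iovec *iov,
                                    unsigned int iov_len)
{
	cs_error_t err;

	pthread_mutex_lock(&_cpg_mutex);
	err = cpg_mcast_joined(handle, CPG_TYPE_AGREED, iov, iov_len);
	pthread_mutex_unlock(&_cpg_mutex);

	return err;
}
====

The debuglog mutex follows the same shape: one mutex held around the format-and-write step, so messages from different threads can no longer interleave.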