Bug 1349755
| Summary: | Crashed a node and a service in a restricted failoverdomain on the lost node remained showing 'started' | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Madison Kelly <mkelly> |
| Component: | rgmanager | Assignee: | Ryan McCabe <rmccabe> |
| Status: | CLOSED WONTFIX | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.7 | CC: | cluster-maint, fdinitto, jkortus, jruemker, mkelly, rmccabe |
| Target Milestone: | rc | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-12-06 10:28:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Madison Kelly
2016-06-24 07:37:57 UTC
Stopping the service manually marked it as stopped.

====
[root@node2 ~]# clusvcadm -d libvirtd_n01
Local machine disabling service:libvirtd_n01...Success
====
[root@node2 ~]# clustat
Cluster Status for ccrs @ Fri Jun 24 07:39:12 2016
Member Status: Quorate

 Member Name                ID   Status
 ------ ----                ---- ------
 node1.ccrs.bcn             1    Offline
 node2.ccrs.bcn             2    Online, Local, rgmanager

 Service Name               Owner (Last)               State
 ------- ----               ----- ------               -----
 service:libvirtd_n01       (node1.ccrs.bcn)           stopped
 service:libvirtd_n02       node2.ccrs.bcn             started
 service:storage_n01        (node1.ccrs.bcn)           stopped
 service:storage_n02        node2.ccrs.bcn             started
 vm:server                  node2.ccrs.bcn             started
====

I checked the logs and noticed that there was no mention of the 'libvirtd_n01' service, where there was for 'storage_n02'. It would appear that rgmanager just missed that service entirely.

Hello,

Thank you for the problem report. We would need to investigate the details of this behavior further with knowledge of your configuration and the conditions present at the time of the incident, and those are best explored through our support process. Do you have a customer account with Red Hat Support through which you could open a case and engage us for further discussion on this matter? If so, please go ahead and create that case, reference this bug, and we can help drive the investigation forward to resolution.

Thanks,
John

I'm an ISV with NFR self-support subscriptions, so this channel is the only one available to me. Note that this happened in a test environment that has since been torn down.

Possibly related: I've been noticing additional problems with rgmanager misbehaving. I've not been able to capture details, but a common issue is that I will start rgmanager but clustat does not show any services. I can verify in these cases that '/etc/init.d/rgmanager status' shows that it is running. Restarting rgmanager recovers this. All of these tests are on fresh RHEL 6.7 (updated within the .z stream) and 6.8 installs.

Unfortunately there are a variety of conditions that can cause some of the symptoms you've described, so it's difficult to conclude this is a defect of any kind, or to give specific guidance as to how you should proceed, without being able to carry out a deeper investigation. The first problem you described, with the libvirtd_n01 service being left in a started state, is certainly not expected behavior and is not something that shows up in our testing, but it's also not obvious what could have led to it that we could immediately resolve. We'd need to work through reproductions of the issue while capturing additional diagnostic information in order to narrow the focus, and all of that would need to be done with consideration for your specific configuration and environment, which is where the support process would come in.

The code in question that evaluates resource groups to decide what to do with them after a node is fenced does iterate the entire config tree, and it should produce a log message for each one that fails or gets bypassed for some reason, so there's no obvious reason why one would get missed. We've seen similar issues when there have been past problems attempting to apply a configuration update that introduced the "missing" resources or services, so we would have to understand the history of events since the cluster started to rule certain possibilities in or out. If you wanted to enable debug logging, you might see more clearly whether it's actually attempting to process that resource group or not.
So again, without the ability to reproduce this in your original environment and work through it together through the support process, we're left without an obvious target for this bug report to focus on.

The other issue you described, with services not appearing in clustat, is more likely to be a result of some other problematic condition in the environment, and so is not immediately obvious as a bug. This can result from an initialization (stop) operation blocking when rgmanager starts, from membership changes requiring fencing that has not completed yet, from DLM problems, or from a variety of other conditions. It's not an uncommon symptom to see, so without the ability to investigate further with you through our support offering, I'm afraid we couldn't give you more specific suggestions other than to look for other problematic conditions in the cluster's state and try to resolve them. With Self-Support entitlements, you should have access to the Red Hat Customer Portal, so you may want to search thoroughly through the knowledgebase and/or initiate a discussion with other users to see if they have had similar experiences.

If you have any questions, let us know, or tell us whether you're OK with us closing this with INSUFFICIENT_DATA if you are unable to offer more direct evidence of a defect.

Thanks,
John Ruemker
Principal Software Maintenance Engineer
Red Hat Support

Can you give some advice on how to best enable debug logging? We have a fully automated cluster build system and a very static configuration, so I am sure that I can hammer on the system until the issue returns. At that time, I can append to this ticket. If you want to mark it as INSUFFICIENT_DATA for now, provided I can append to/reopen this later, that would be fine by me. I do understand the difficulty of debugging without a reliable reproducer. :)

(In reply to digimer from comment #6)
> Can you give some advice on how to best enable debug logging?

How do I configure logging for the various components of a RHEL 6 High Availability or Resilient Storage cluster?
https://access.redhat.com/solutions/35528

Red Hat Enterprise Linux 6 Cluster Administration - 4.5.6. Logging Configuration
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html-single/Cluster_Administration/index.html#s1-config-logging-conga-CA

Red Hat Enterprise Linux 6 Cluster Administration - 6.14.4. Logging
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html-single/Cluster_Administration/index.html#s2-logconfig-ccs-CA

man(5) cluster.conf

The following will achieve debug logging for rgmanager in a way that would show it evaluating each resource group after a node is fenced:

<logging_daemon name="rgmanager" debug="on"/>

I'll close this for now.

Thanks,
John

Follow-up: I'm not having much luck getting useful information out of rgmanager, but whatever is wrong with it seems to be getting worse. Today, I found myself in a situation where a storage service (drbd -> clvmd -> gfs2) failed on a node at start. After that, calling 'clusvcadm -d X' updated clustat to show that the storage service was stopped, but the shell call would not exit.
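As a side note for anyone following along, here is a minimal sketch of one way the <logging_daemon> directive John gives above could be applied on a RHEL 6 cman/rgmanager node. The commands are the standard config-validation and propagation tools for that stack, but treat the exact steps as an illustration rather than a procedure verified on this cluster:

====
# Sketch only: enable rgmanager debug logging in /etc/cluster/cluster.conf.
# Inside <cluster>, add (and bump config_version by one):
#   <logging>
#     <logging_daemon name="rgmanager" debug="on"/>
#   </logging>
vi /etc/cluster/cluster.conf            # edit the config, raise config_version
ccs_config_validate                     # sanity-check the edited cluster.conf
cman_tool version -r                    # propagate/activate the new config_version
tail -f /var/log/cluster/rgmanager.log  # default rgmanager log location on RHEL 6
====

If ricci-based propagation is not in use, the edited file would need to be copied to the peer node by hand before activating the new version.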
==== Aug 1 17:44:52 an-a05n02 rgmanager[30999]: [script] script:clvmd: status of /etc/init.d/clvmd failed (returned 3) Aug 1 17:44:52 an-a05n02 rgmanager[17323]: status on script "clvmd" returned 1 (generic error) Aug 1 17:44:52 an-a05n02 rgmanager[17323]: Stopping service service:storage_n02 Aug 1 17:44:52 an-a05n02 rgmanager[31036]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 17:44:52 an-a05n02 rgmanager[17323]: stop on clusterfs "sharedfs" returned 5 (program not installed) Aug 1 17:44:52 an-a05n02 rgmanager[17323]: #12: RG service:storage_n02 failed to stop; intervention required Aug 1 17:44:52 an-a05n02 rgmanager[17323]: Service service:storage_n02 is failed ==== There are no further messages. ==== root@an-a05n01 ~]# clustat Cluster Status for an-anvil-05 @ Mon Aug 1 17:45:45 2016 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ an-a05n01.alteeve.ca 1 Online, Local, rgmanager an-a05n02.alteeve.ca 2 Online, rgmanager Service Name Owner (Last) State ------- ---- ----- ------ ----- service:libvirtd_n01 an-a05n01.alteeve.ca started service:libvirtd_n02 an-a05n02.alteeve.ca started service:storage_n01 (an-a05n01.alteeve.ca) disabled service:storage_n02 an-a05n02.alteeve.ca starting ==== And the attempt to disable: ==== [root@an-a05n01 ~]# timeout 30 /usr/sbin/clusvcadm -d storage_n02 || echo Timed out Local machine disabling service:storage_n02...Timed out ==== My gut tells me this is somehow related to DLM and/or clvmd. digimer Marking it as 'NEW' because the stability of rgmanager (directly or indirectly) is getting to be a big problem. These problems happen on multiple different systems of different eras over many different fresh installs. We've got an automated cluster build system that we're heavily testing at the moment, which is why we're doing so many reinstalls and various rebuild and failure tests. Ha! This might be useful, I am seeing new messages; ==== [root@an-a05n01 ~]# grep rgmanager /var/log/messages Aug 1 15:20:25 new-node01 yum[16134]: Installed: rgmanager-3.0.12.1-26.el6_8.3.x86_64 Aug 1 16:39:03 an-a05n01 rgmanager[20072]: I am node #1 Aug 1 16:39:03 an-a05n01 rgmanager[20072]: Resource Group Manager Starting Aug 1 16:39:05 an-a05n01 rgmanager[21212]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 16:39:05 an-a05n01 rgmanager[21266]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 16:39:06 an-a05n01 rgmanager[20072]: Starting stopped service service:storage_n01 Aug 1 16:39:06 an-a05n01 rgmanager[20072]: Starting stopped service service:libvirtd_n01 Aug 1 16:39:06 an-a05n01 rgmanager[20072]: Service service:libvirtd_n01 started Aug 1 16:39:07 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 16:39:08 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 16:39:08 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 16:39:08 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 16:39:08 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 16:39:08 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 16:39:09 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! 
Aug 1 16:39:09 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 16:39:09 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 16:39:09 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 16:39:09 an-a05n01 rgmanager[20072]: Service service:storage_n01 started Aug 1 16:39:10 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 16:39:10 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 17:03:16 an-a05n01 rgmanager[7788]: [script] script:drbd: status of /etc/init.d/drbd failed (returned 3) Aug 1 17:03:16 an-a05n01 rgmanager[20072]: status on script "drbd" returned 1 (generic error) Aug 1 17:03:16 an-a05n01 rgmanager[20072]: Stopping service service:storage_n01 Aug 1 17:03:16 an-a05n01 rgmanager[7826]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 17:03:16 an-a05n01 rgmanager[20072]: stop on clusterfs "sharedfs" returned 5 (program not installed) Aug 1 17:03:16 an-a05n01 rgmanager[20072]: #12: RG service:storage_n01 failed to stop; intervention required Aug 1 17:03:16 an-a05n01 rgmanager[20072]: Service service:storage_n01 is failed Aug 1 17:13:58 an-a05n01 rgmanager[20072]: Stopping service service:storage_n01 Aug 1 17:13:58 an-a05n01 rgmanager[13159]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 17:13:58 an-a05n01 rgmanager[20072]: stop on clusterfs "sharedfs" returned 5 (program not installed) Aug 1 17:13:59 an-a05n01 rgmanager[20072]: Marking service:storage_n01 as 'disabled', but some resources may still be allocated! Aug 1 17:13:59 an-a05n01 rgmanager[20072]: Service service:storage_n01 is disabled Aug 1 17:13:59 an-a05n01 rgmanager[20072]: Stopping service service:storage_n02 Aug 1 17:13:59 an-a05n01 rgmanager[13316]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 17:13:59 an-a05n01 rgmanager[20072]: Service service:storage_n02 is disabled Aug 1 17:27:20 an-a05n01 rgmanager[20072]: Starting disabled service service:storage_n01 Aug 1 17:27:23 an-a05n01 rgmanager[20072]: Service service:storage_n01 started Aug 1 17:27:23 an-a05n01 rgmanager[20072]: Shutting down Aug 1 17:27:23 an-a05n01 rgmanager[20072]: Stopping service service:storage_n01 Aug 1 17:27:23 an-a05n01 rgmanager[20072]: Stopping service service:libvirtd_n01 Aug 1 17:27:24 an-a05n01 rgmanager[17753]: [script] script:clvmd: stop of /etc/init.d/clvmd failed (returned 5) Aug 1 17:27:24 an-a05n01 rgmanager[20072]: stop on script "clvmd" returned 1 (generic error) Aug 1 17:27:24 an-a05n01 rgmanager[20072]: Service service:libvirtd_n01 is stopped Aug 1 17:27:24 an-a05n01 rgmanager[20072]: #12: RG service:storage_n01 failed to stop; intervention required Aug 1 17:27:24 an-a05n01 rgmanager[20072]: Service service:storage_n01 is failed Aug 1 17:27:24 an-a05n01 rgmanager[20072]: Disconnecting from CMAN Aug 1 17:27:39 an-a05n01 rgmanager[20072]: Exiting Aug 1 17:28:07 an-a05n01 rgmanager[18632]: I am node #1 Aug 1 17:28:07 an-a05n01 rgmanager[18632]: Resource Group Manager Starting Aug 1 17:28:11 an-a05n01 rgmanager[18632]: Marking service:storage_n02 as stopped: Restricted domain unavailable Aug 1 17:28:11 an-a05n01 rgmanager[18632]: Starting stopped service service:storage_n01 Aug 1 17:28:11 an-a05n01 rgmanager[18632]: Starting stopped service service:libvirtd_n01 Aug 1 17:28:11 
an-a05n01 rgmanager[18632]: Service service:libvirtd_n01 started Aug 1 17:28:13 an-a05n01 rgmanager[18632]: Service service:storage_n01 started Aug 1 17:28:13 an-a05n01 rgmanager[18632]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 17:28:14 an-a05n01 rgmanager[18632]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 17:28:14 an-a05n01 rgmanager[18632]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 17:28:14 an-a05n01 rgmanager[18632]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 17:28:14 an-a05n01 rgmanager[18632]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 17:28:14 an-a05n01 rgmanager[18632]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 17:28:15 an-a05n01 rgmanager[18632]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 17:28:15 an-a05n01 rgmanager[18632]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 17:28:15 an-a05n01 rgmanager[18632]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 17:44:51 an-a05n01 rgmanager[32419]: [script] script:clvmd: status of /etc/init.d/clvmd failed (returned 3) Aug 1 17:44:51 an-a05n01 rgmanager[18632]: status on script "clvmd" returned 1 (generic error) Aug 1 17:44:51 an-a05n01 rgmanager[18632]: Stopping service service:storage_n01 Aug 1 17:44:51 an-a05n01 rgmanager[32456]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 17:44:51 an-a05n01 rgmanager[18632]: stop on clusterfs "sharedfs" returned 5 (program not installed) Aug 1 17:44:52 an-a05n01 rgmanager[18632]: #12: RG service:storage_n01 failed to stop; intervention required Aug 1 17:44:52 an-a05n01 rgmanager[18632]: Service service:storage_n01 is failed Aug 1 17:44:52 an-a05n01 rgmanager[18632]: Stopping service service:storage_n01 Aug 1 17:44:52 an-a05n01 rgmanager[32726]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 17:44:52 an-a05n01 rgmanager[18632]: stop on clusterfs "sharedfs" returned 5 (program not installed) Aug 1 17:44:52 an-a05n01 rgmanager[18632]: Marking service:storage_n01 as 'disabled', but some resources may still be allocated! Aug 1 17:44:53 an-a05n01 rgmanager[18632]: Service service:storage_n01 is disabled Aug 1 17:44:55 an-a05n01 rgmanager[18632]: Stopping service service:storage_n02 Aug 1 17:44:55 an-a05n01 rgmanager[545]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 17:44:55 an-a05n01 rgmanager[18632]: Service service:storage_n02 is disabled Aug 1 17:59:46 an-a05n01 rgmanager[18632]: Shutting down Aug 1 17:59:46 an-a05n01 rgmanager[18632]: Stopping service service:libvirtd_n01 Aug 1 17:59:46 an-a05n01 rgmanager[18632]: Service service:libvirtd_n01 is stopped Aug 1 17:59:55 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 17:59:55 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:00:05 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:00:05 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:00:16 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:00:16 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! 
Aug 1 18:00:26 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:00:26 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:00:36 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:00:36 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! ==== This is on the node that I pasted output for in #c8. After posting, I tried to restart the rgmanager service, which also hung like the attempt to restart the storage service did. In case it helps, here is what 'ps' shows on the effected node (slightly trimmed to remove unrelated ps processes): ==== [root@an-a05n01 ~]# ps auxf |grep rgmanager -B 10 -A 10 root 2812 0.0 0.0 66240 1248 ? Ss 15:34 0:00 /usr/sbin/sshd root 8419 0.0 0.0 100016 4088 ? Ss 15:59 0:00 \_ sshd: root@pts/1 root 8424 0.0 0.0 110160 3848 pts/1 Ss 15:59 0:00 | \_ -bash root 5616 0.0 0.0 106472 1832 pts/1 S+ 17:59 0:00 | \_ /bin/bash /etc/init.d/rgmanager restart root 5621 0.1 0.0 106476 1880 pts/1 S+ 17:59 0:00 | \_ /bin/bash /etc/init.d/rgmanager stop root 6771 0.0 0.0 100920 624 pts/1 S+ 18:03 0:00 | \_ sleep 1 root 28730 0.0 0.0 100012 4044 ? Ss 17:42 0:00 \_ sshd: root@notty root 519 0.0 0.0 106076 1496 ? Ss 17:44 0:00 | \_ bash -c /usr/sbin/clusvcadm -d storage_n02 && /bin/sleep 10 && /usr/sbin/clusvcadm -F -e storage_n02 root 814 0.0 0.0 10404 688 ? S 17:45 0:00 | \_ /usr/sbin/clusvcadm -F -e storage_n02 root 5882 0.0 0.0 100016 4104 ? Ss 18:00 0:00 \_ sshd: root@pts/0 root 5885 0.0 0.0 110160 3812 pts/0 Ss 18:00 0:00 \_ -bash root 6772 0.0 0.0 110680 1596 pts/0 R+ 18:03 0:00 \_ ps auxf root 6773 0.0 0.0 103316 880 pts/0 S+ 18:03 0:00 \_ grep rgmanager -B 10 -A 10 -- root 18631 0.0 0.0 36952 6256 ? S<Ls 17:28 0:00 rgmanager root 18632 0.0 0.0 570640 3052 ? S<l 17:28 0:01 \_ rgmanager ==== Sorry for the comment spam... I'm adding notes as I try to fix things. I'm hoping "too much data" is the right side to err on. This comment alone is pretty long, but I wanted to document the entire recovery procedure. So I tried killing the pids related to rgmanager: ==== [root@an-a05n01 ~]# ps aux | grep rgmanager root 5616 0.0 0.0 106472 1832 pts/1 S+ 17:59 0:00 /bin/bash /etc/init.d/rgmanager restart root 5621 0.1 0.0 106476 1896 pts/1 S+ 17:59 0:00 /bin/bash /etc/init.d/rgmanager stop root 7516 0.0 0.0 103316 904 pts/0 S+ 18:06 0:00 grep rgmanager root 18631 0.0 0.0 36952 6256 ? S<Ls 17:28 0:00 rgmanager root 18632 0.0 0.0 580884 3060 ? S<l 17:28 0:01 rgmanager [root@an-a05n01 ~]# kill 5616 5621 18631 18632 [root@an-a05n01 ~]# grep rgmanager /var/log/messages ==== trimmed output: ==== Aug 1 18:00:36 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:00:46 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:00:46 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:00:56 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:00:56 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:01:07 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:01:07 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:01:17 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:01:17 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! 
Aug 1 18:01:27 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:01:27 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:01:37 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:01:37 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:01:47 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:01:47 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:01:58 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:01:58 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:02:08 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:02:08 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:02:18 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:02:18 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:02:28 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:02:28 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:02:38 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:02:38 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:02:49 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:02:49 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:02:59 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:02:59 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:03:09 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:03:09 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:03:19 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:03:19 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:03:30 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:03:30 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:03:40 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:03:40 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:03:50 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:03:50 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:04:00 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:04:00 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:04:10 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:04:10 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:04:21 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:04:21 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! 
Aug 1 18:04:31 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:04:31 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:04:41 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:04:41 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:04:51 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:04:51 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:05:01 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:05:01 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:05:12 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:05:12 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:05:22 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:05:22 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:05:32 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:05:32 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:05:42 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:05:42 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:05:51 an-a05n01 rgmanager[18632]: Shutting down Aug 1 18:05:52 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:05:52 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:06:01 an-a05n01 rgmanager[18632]: Shutting down Aug 1 18:06:03 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:06:03 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:06:13 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:06:13 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:06:23 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:06:23 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:06:23 an-a05n01 rgmanager[18632]: Shutting down ==== Now ps shows two instances of rgmanager still, despite the kill. ==== [root@an-a05n01 ~]# ps auxf | grep rgmanager root 8075 0.0 0.0 103316 868 pts/0 S+ 18:09 0:00 \_ grep rgmanager root 18631 0.0 0.0 36952 6256 ? S<Ls 17:28 0:00 rgmanager root 18632 0.0 0.0 797980 3104 ? S<l 17:28 0:01 \_ rgmanager ==== Shutting down still hangs though. So I checked, and there are hung clusvcadm calls; ==== [root@an-a05n01 ~]# ps aux | grep -e rgmanager -e clusvcadm root 519 0.0 0.0 106076 1496 ? Ss 17:44 0:00 bash -c /usr/sbin/clusvcadm -d storage_n02 && /bin/sleep 10 && /usr/sbin/clusvcadm -F -e storage_n02 root 814 0.0 0.0 10404 688 ? S 17:45 0:00 /usr/sbin/clusvcadm -F -e storage_n02 root 8341 0.0 0.0 103320 956 pts/0 S+ 18:10 0:00 grep -e rgmanager -e clusvcadm root 18631 0.0 0.0 36952 6256 ? S<Ls 17:28 0:00 rgmanager root 18632 0.0 0.0 949540 5168 ? S<l 17:28 0:01 rgmanager ==== So I tried to kill them as well, but no luck: ==== [root@an-a05n01 ~]# kill 519 814 18631 18632 [root@an-a05n01 ~]# ps aux | grep -e rgmanager -e clusvcadm root 8502 0.5 0.0 106076 1500 ? 
Ss 18:11 0:00 bash -c /etc/init.d/rgmanager stop; /bin/echo rc:$? root 8509 0.5 0.0 106476 1868 ? S 18:11 0:00 /bin/bash /etc/init.d/rgmanager stop root 8523 0.0 0.0 103316 936 pts/0 S+ 18:11 0:00 grep -e rgmanager -e clusvcadm root 18631 0.0 0.0 36952 6256 ? S<Ls 17:28 0:00 rgmanager root 18632 0.0 0.0 1035564 5188 ? S<l 17:28 0:01 rgmanager ==== So I try the dreaded kill -9; ==== [root@an-a05n01 ~]# kill -9 519 814 18631 18632 -bash: kill: (519) - No such process -bash: kill: (814) - No such process [root@an-a05n01 ~]# ps aux | grep -e rgmanager -e clusvcadm root 8771 0.0 0.0 103316 936 pts/0 S+ 18:12 0:00 grep -e rgmanager -e clusvcadm ==== And now it is dead. ==== Aug 1 18:06:23 an-a05n01 rgmanager[18632]: Shutting down Aug 1 18:06:33 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:06:33 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:06:33 an-a05n01 rgmanager[18632]: Shutting down Aug 1 18:06:43 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:06:43 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:06:54 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:06:54 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:07:04 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:07:04 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:07:14 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:07:14 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:07:24 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:07:24 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:07:35 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:07:35 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:07:45 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:07:45 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:07:55 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:07:55 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:08:05 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:08:05 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:08:15 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:08:15 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:08:26 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:08:26 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:08:36 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:08:36 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:08:46 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:08:46 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! 
Aug 1 18:08:56 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:08:56 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:09:06 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:09:06 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:09:17 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:09:17 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:09:27 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:09:27 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:09:37 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:09:37 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:09:44 an-a05n01 rgmanager[18632]: Shutting down Aug 1 18:09:44 an-a05n01 rgmanager[18632]: Shutting down Aug 1 18:09:47 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:09:47 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:09:57 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:09:57 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:10:07 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:10:07 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:10:18 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:10:18 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:10:28 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:10:28 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:10:38 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:10:38 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:10:48 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:10:48 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:10:58 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:10:58 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:11:09 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:11:09 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:11:19 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:11:19 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:11:29 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:11:29 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:11:29 an-a05n01 rgmanager[18632]: Shutting down Aug 1 18:11:30 an-a05n01 rgmanager[18632]: Shutting down Aug 1 18:11:39 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:11:39 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! 
Aug 1 18:11:40 an-a05n01 rgmanager[18632]: Shutting down Aug 1 18:11:49 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:11:49 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:12:00 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:12:00 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:12:10 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:12:10 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! ==== Now, 'clustat' shows nothing on node 2, which might be because it's storage had failed earlier so a hung stop operation probably finally exited. ==== [root@an-a05n01 ~]# clustat Cluster Status for an-anvil-05 @ Mon Aug 1 18:15:31 2016 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ an-a05n01.alteeve.ca 1 Online, Local an-a05n02.alteeve.ca 2 Online [root@an-a05n01 ~]# /etc/init.d/drbd status drbd not loaded [root@an-a05n01 ~]# /etc/init.d/clvmd status clvmd is stopped [root@an-a05n01 ~]# /etc/init.d/gfs2 status GFS2: service is not running ---- [root@an-a05n02 ~]# clustat Cluster Status for an-anvil-05 @ Mon Aug 1 18:15:33 2016 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ an-a05n01.alteeve.ca 1 Online an-a05n02.alteeve.ca 2 Online, Local [root@an-a05n02 ~]# /etc/init.d/drbd status drbd driver loaded OK; device status: version: 8.4.8-1 (api:1/proto:86-101) GIT-hash: 22b4c802192646e433d3f7399d578ec7fecc6272 build by root.ca, 2016-07-18 15:13:48 m:res cs ro ds p mounted fstype 0:r0 WFConnection Secondary/Unknown Inconsistent/DUnknown C [root@an-a05n02 ~]# /etc/init.d/clvmd status clvmd is stopped [root@an-a05n02 ~]# /etc/init.d/gfs2 status GFS2: service is not running ==== Here are the rgmanager logs from node 2: ==== [root@an-a05n02 ~]# grep rgmanager /var/log/messages Aug 1 15:24:23 new-node02 yum[15984]: Installed: rgmanager-3.0.12.1-26.el6_8.3.x86_64 Aug 1 16:39:03 an-a05n02 rgmanager[19470]: I am node #2 Aug 1 16:39:03 an-a05n02 rgmanager[19470]: Resource Group Manager Starting Aug 1 16:39:05 an-a05n02 rgmanager[20577]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 16:39:06 an-a05n02 rgmanager[20630]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 16:39:06 an-a05n02 rgmanager[19470]: Starting stopped service service:storage_n02 Aug 1 16:39:07 an-a05n02 rgmanager[19470]: Starting stopped service service:libvirtd_n02 Aug 1 16:39:07 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 9 Aug 1 16:39:07 an-a05n02 rgmanager[19470]: Service service:libvirtd_n02 started Aug 1 16:39:08 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 13 Aug 1 16:39:08 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 14 Aug 1 16:39:08 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 15 Aug 1 16:39:08 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 16 Aug 1 16:39:08 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 17 Aug 1 16:39:09 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 18 Aug 1 16:39:09 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 19 Aug 1 16:39:09 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 20 Aug 1 16:39:09 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 21 Aug 1 16:39:10 
an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 22 Aug 1 16:39:10 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 24 Aug 1 16:39:10 an-a05n02 rgmanager[19470]: Service service:storage_n02 started Aug 1 17:03:17 an-a05n02 rgmanager[6759]: [script] script:drbd: status of /etc/init.d/drbd failed (returned 3) Aug 1 17:03:17 an-a05n02 rgmanager[19470]: status on script "drbd" returned 1 (generic error) Aug 1 17:03:17 an-a05n02 rgmanager[19470]: Stopping service service:storage_n02 Aug 1 17:03:17 an-a05n02 rgmanager[6796]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 17:03:17 an-a05n02 rgmanager[19470]: stop on clusterfs "sharedfs" returned 5 (program not installed) Aug 1 17:03:17 an-a05n02 rgmanager[19470]: #12: RG service:storage_n02 failed to stop; intervention required Aug 1 17:03:17 an-a05n02 rgmanager[19470]: Service service:storage_n02 is failed Aug 1 17:27:23 an-a05n02 rgmanager[19470]: Service service:storage_n02 started Aug 1 17:27:24 an-a05n02 rgmanager[19470]: Member 1 shutting down Aug 1 17:27:24 an-a05n02 rgmanager[19470]: Marking service:libvirtd_n01 as stopped: Restricted domain unavailable Aug 1 17:27:39 an-a05n02 rgmanager[19470]: Shutting down Aug 1 17:27:39 an-a05n02 rgmanager[19470]: Stopping service service:storage_n02 Aug 1 17:27:40 an-a05n02 rgmanager[19470]: Stopping service service:libvirtd_n02 Aug 1 17:27:40 an-a05n02 rgmanager[16465]: [script] script:clvmd: stop of /etc/init.d/clvmd failed (returned 5) Aug 1 17:27:40 an-a05n02 rgmanager[19470]: stop on script "clvmd" returned 1 (generic error) Aug 1 17:27:40 an-a05n02 rgmanager[19470]: Service service:libvirtd_n02 is stopped Aug 1 17:27:40 an-a05n02 rgmanager[19470]: #12: RG service:storage_n02 failed to stop; intervention required Aug 1 17:27:40 an-a05n02 rgmanager[19470]: Service service:storage_n02 is failed Aug 1 17:27:40 an-a05n02 rgmanager[19470]: Disconnecting from CMAN Aug 1 17:27:55 an-a05n02 rgmanager[19470]: Exiting Aug 1 17:28:07 an-a05n02 rgmanager[17323]: I am node #2 Aug 1 17:28:07 an-a05n02 rgmanager[17323]: Resource Group Manager Starting Aug 1 17:28:12 an-a05n02 rgmanager[17323]: Starting stopped service service:libvirtd_n02 Aug 1 17:28:12 an-a05n02 rgmanager[17323]: Starting stopped service service:storage_n02 Aug 1 17:28:12 an-a05n02 rgmanager[17323]: Service service:libvirtd_n02 started Aug 1 17:28:13 an-a05n02 rgmanager[17323]: Ignoring M_CLOSE for destroyed context 12 Aug 1 17:28:14 an-a05n02 rgmanager[17323]: Ignoring M_CLOSE for destroyed context 13 Aug 1 17:28:14 an-a05n02 rgmanager[17323]: Ignoring M_CLOSE for destroyed context 14 Aug 1 17:28:14 an-a05n02 rgmanager[17323]: Ignoring M_CLOSE for destroyed context 15 Aug 1 17:28:14 an-a05n02 rgmanager[17323]: Ignoring M_CLOSE for destroyed context 16 Aug 1 17:28:14 an-a05n02 rgmanager[17323]: Ignoring M_CLOSE for destroyed context 17 Aug 1 17:28:14 an-a05n02 rgmanager[17323]: Ignoring M_CLOSE for destroyed context 18 Aug 1 17:28:15 an-a05n02 rgmanager[17323]: Ignoring M_CLOSE for destroyed context 19 Aug 1 17:28:15 an-a05n02 rgmanager[17323]: Ignoring M_CLOSE for destroyed context 20 Aug 1 17:28:15 an-a05n02 rgmanager[17323]: Service service:storage_n02 started Aug 1 17:44:52 an-a05n02 rgmanager[30999]: [script] script:clvmd: status of /etc/init.d/clvmd failed (returned 3) Aug 1 17:44:52 an-a05n02 rgmanager[17323]: status on script "clvmd" returned 1 (generic error) Aug 1 17:44:52 an-a05n02 rgmanager[17323]: Stopping service service:storage_n02 Aug 1 17:44:52 
an-a05n02 rgmanager[31036]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 17:44:52 an-a05n02 rgmanager[17323]: stop on clusterfs "sharedfs" returned 5 (program not installed) Aug 1 17:44:52 an-a05n02 rgmanager[17323]: #12: RG service:storage_n02 failed to stop; intervention required Aug 1 17:44:52 an-a05n02 rgmanager[17323]: Service service:storage_n02 is failed Aug 1 18:12:17 an-a05n02 rgmanager[17323]: Marking service:libvirtd_n01 as stopped: Restricted domain unavailable Aug 1 18:12:17 an-a05n02 rgmanager[17323]: Shutting down Aug 1 18:12:17 an-a05n02 rgmanager[17323]: Stopping service service:libvirtd_n02 Aug 1 18:12:17 an-a05n02 rgmanager[17323]: Service service:libvirtd_n02 is stopped ==== So I tried to start rgmanager on both nodes. It started on node 1, but hung on node 2... ==== [root@an-a05n01 ~]# /etc/init.d/rgmanager start Starting Cluster Service Manager: [ OK ] [root@an-a05n01 ~]# clustat Cluster Status for an-anvil-05 @ Mon Aug 1 18:17:31 2016 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ an-a05n01.alteeve.ca 1 Online, Local, rgmanager an-a05n02.alteeve.ca 2 Online, rgmanager Service Name Owner (Last) State ------- ---- ----- ------ ----- service:libvirtd_n01 an-a05n01.alteeve.ca started service:libvirtd_n02 (an-a05n02.alteeve.ca) stopped service:storage_n01 (an-a05n01.alteeve.ca) disabled service:storage_n02 an-a05n02.alteeve.ca starting ==== ==== [root@an-a05n02 ~]# /etc/init.d/rgmanager start Starting Cluster Service Manager: [ OK ] [root@an-a05n02 ~]# clustat Cluster Status for an-anvil-05 @ Mon Aug 1 18:17:34 2016 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ an-a05n01.alteeve.ca 1 Online an-a05n02.alteeve.ca 2 Online, Local ==== And syslog from both node 1: ==== Aug 1 18:17:26 an-a05n01 rgmanager[9585]: I am node #1 Aug 1 18:17:26 an-a05n01 rgmanager[9585]: Resource Group Manager Starting Aug 1 18:17:28 an-a05n01 rgmanager[10652]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 18:17:28 an-a05n01 rgmanager[10705]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 18:17:29 an-a05n01 rgmanager[9585]: Starting stopped service service:libvirtd_n01 Aug 1 18:17:29 an-a05n01 rgmanager[9585]: Service service:libvirtd_n01 started ==== Node 2 had nothing new in its syslog, making me thing rgmanager is hung. So I checked it's ps auxf and sure enough, earlier clusvcadm and stop operations are there... ==== [root@an-a05n02 ~]# ps aux | grep -e rgmanager -e clusvcadm root 14436 0.0 0.0 106076 1492 ? Ss 18:12 0:00 bash -c /etc/init.d/rgmanager stop; /bin/echo rc:$? root 14443 0.0 0.0 106476 1896 ? S 18:12 0:00 /bin/bash /etc/init.d/rgmanager stop root 17321 0.0 0.0 36952 6256 ? S<Ls 17:28 0:00 rgmanager root 17323 0.0 0.0 570620 3044 ? S<l 17:28 0:02 rgmanager ==== So I kill those: ==== [root@an-a05n02 ~]# kill 14436 14443 17321 17323 [root@an-a05n02 ~]# ps aux | grep -e rgmanager -e clusvcadm root 17321 0.0 0.0 36952 6256 ? S<Ls 17:28 0:00 rgmanager root 17323 0.0 0.0 570620 3044 ? S<l 17:28 0:02 rgmanager ==== rgmanager eventually exits. 
Here's what appears in the logs:

====
Aug 1 18:21:15 an-a05n02 rgmanager[17323]: Shutting down
Aug 1 18:21:25 an-a05n02 rgmanager[17323]: Shutting down
Aug 1 18:21:30 an-a05n02 rgmanager[17323]: Service service:storage_n02 started
Aug 1 18:21:30 an-a05n02 rgmanager[17323]: Stopping service service:storage_n02
Aug 1 18:21:30 an-a05n02 rgmanager[18460]: [script] script:clvmd: stop of /etc/init.d/clvmd failed (returned 5)
Aug 1 18:21:30 an-a05n02 rgmanager[17323]: stop on script "clvmd" returned 1 (generic error)
Aug 1 18:21:30 an-a05n02 rgmanager[17323]: #12: RG service:storage_n02 failed to stop; intervention required
Aug 1 18:21:30 an-a05n02 rgmanager[17323]: Service service:storage_n02 is failed
Aug 1 18:21:31 an-a05n02 rgmanager[17323]: Disconnecting from CMAN
Aug 1 18:21:46 an-a05n02 rgmanager[17323]: Exiting
====

====
Aug 1 18:21:29 an-a05n01 rgmanager[9585]: Service service:storage_n01 started
Aug 1 18:21:30 an-a05n01 rgmanager[9585]: Member 2 shutting down
Aug 1 18:21:31 an-a05n01 rgmanager[9585]: Marking service:libvirtd_n02 as stopped: Restricted domain unavailable
Aug 1 18:21:31 an-a05n01 rgmanager[9585]: Stopping service service:storage_n02
Aug 1 18:21:32 an-a05n01 rgmanager[12670]: [script] script:clvmd: stop of /etc/init.d/clvmd failed (returned 5)
Aug 1 18:21:32 an-a05n01 rgmanager[9585]: stop on script "clvmd" returned 1 (generic error)
Aug 1 18:21:32 an-a05n01 rgmanager[9585]: Marking service:storage_n02 as 'disabled', but some resources may still be allocated!
Aug 1 18:21:32 an-a05n01 rgmanager[9585]: Service service:storage_n02 is disabled
====

Note that at this point, DRBD started on node 2, allowing node 1's storage service to finish coming up. So I try to start rgmanager again on node 2:

====
Aug 1 18:25:00 an-a05n02 rgmanager[19084]: I am node #2
Aug 1 18:25:00 an-a05n02 rgmanager[19084]: Resource Group Manager Starting
Aug 1 18:25:04 an-a05n02 rgmanager[19084]: Starting stopped service service:libvirtd_n02
Aug 1 18:25:05 an-a05n02 rgmanager[19084]: Service service:libvirtd_n02 started
====

This time it started, but it left the storage service off, despite it being set to 'autostart="1"':

====
Aug 1 18:25:00 an-a05n02 rgmanager[19084]: I am node #2
Aug 1 18:25:00 an-a05n02 rgmanager[19084]: Resource Group Manager Starting
Aug 1 18:25:04 an-a05n02 rgmanager[19084]: Starting stopped service service:libvirtd_n02
Aug 1 18:25:05 an-a05n02 rgmanager[19084]: Service service:libvirtd_n02 started
====

I'm guessing this is because node 1 marked it as stopped because it was in a restricted domain, but it should have started when a member of that restricted domain rejoined, shouldn't it? Anyway, manually starting it:

====
[root@an-a05n01 ~]# clusvcadm -F -e storage_n02
Local machine trying to enable service:storage_n02...Success
service:storage_n02 is now running on an-a05n02.alteeve.ca
====

That finally works and the cluster is operational again. What an ordeal! >_<

OK, I think this most recent issue's root cause is DRBD's '/etc/init.d/drbd status' returning non-zero. When this happens, rgmanager flags the service as 'failed' and tries to recover it, as it should. However, when it fails at that, things go sideways. The node doesn't get fenced (is it possible to fence a node if a service enters a failed state?) and DLM hangs.
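To illustrate digimer's point above about rgmanager keying off the init script's status return code, a quick spot check along these lines can be run outside the cluster stack (a sketch only; the service names are the ones used in this configuration):

====
# Show the raw LSB status codes that rgmanager's <script> resource agent sees.
# 0 means running; 3 means stopped. The agent treats any non-zero status as a
# monitor failure, which is what kicks off the stop/recover path seen in the
# surrounding logs.
for svc in drbd clvmd gfs2; do
    /etc/init.d/$svc status >/dev/null 2>&1
    echo "$svc status rc=$?"
done
====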
====
Aug 1 20:28:25 an-a05n01 rgmanager[9328]: [script] script:drbd: status of /etc/init.d/drbd failed (returned 3)
Aug 1 20:28:25 an-a05n01 rgmanager[22400]: status on script "drbd" returned 1 (generic error)
Aug 1 20:28:25 an-a05n01 rgmanager[22400]: Stopping service service:storage_n01
Aug 1 20:28:25 an-a05n01 rgmanager[9365]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device
Aug 1 20:28:25 an-a05n01 rgmanager[22400]: stop on clusterfs "sharedfs" returned 5 (program not installed)
Aug 1 20:28:25 an-a05n01 rgmanager[22400]: #12: RG service:storage_n01 failed to stop; intervention required
Aug 1 20:28:25 an-a05n01 rgmanager[22400]: Service service:storage_n01 is failed
Aug 1 22:24:03 an-a05n01 rgmanager[22400]: Stopping service service:storage_n01
Aug 1 22:24:03 an-a05n01 rgmanager[9383]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device
Aug 1 22:24:03 an-a05n01 rgmanager[22400]: stop on clusterfs "sharedfs" returned 5 (program not installed)
Aug 1 22:24:04 an-a05n01 rgmanager[22400]: Marking service:storage_n01 as 'disabled', but some resources may still be allocated!
Aug 1 22:24:04 an-a05n01 rgmanager[22400]: Service service:storage_n01 is disabled
Aug 1 22:24:09 an-a05n01 rgmanager[22400]: Starting disabled service service:storage_n01
====

Peer:

====
Aug 1 20:28:05 an-a05n02 rgmanager[12487]: [script] script:drbd: status of /etc/init.d/drbd failed (returned 3)
Aug 1 20:28:05 an-a05n02 rgmanager[26115]: status on script "drbd" returned 1 (generic error)
Aug 1 20:28:05 an-a05n02 rgmanager[26115]: Stopping service service:storage_n02
Aug 1 20:28:05 an-a05n02 rgmanager[12525]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device
Aug 1 20:28:05 an-a05n02 rgmanager[26115]: stop on clusterfs "sharedfs" returned 5 (program not installed)
Aug 1 20:28:05 an-a05n02 rgmanager[26115]: #12: RG service:storage_n02 failed to stop; intervention required
Aug 1 20:28:05 an-a05n02 rgmanager[26115]: Service service:storage_n02 is failed
====

The bit that stands out the most to me is:

====
[clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device
stop on clusterfs "sharedfs" returned 5 (program not installed)
====

It can't match the LV because the backing device is gone. That's a problem, of course, but the cluster should be able to handle it, shouldn't it?

This is a test system (with no plans to put it into production). If you want access to poke away at it, I am happy to give it to you. It's already available over ssh; I can email the credentials.

Also note: the cluster.conf at the start of this ticket is not the one for the recent posts (insofar as node and cluster names go), but the structure is identical to the current one. It was the same installer; just different identification details were chosen at the time. The RPM versions of the nodes always matched each other exactly, but the two clusters differed (ccrs was 6.7.z updated and an-anvil-05 is RHEL 6.8.z updated as of today's available updates).

Well, there's quite a lot happening throughout your various comments that is difficult to fully explain with the information you've given, and things get complicated after you start doing things like attempting to kill daemons manually and then stopping and restarting them while they're not fully cleaned up, so it's difficult to respond to all of this with clear suggestions to solve all of your problems.
Again, I'll note that the nature of what you're looking for assistance with is something we need to be handling in a support engagement, not here in Bugzilla. Bugzilla is not a support tool; it is intended for the reporting of defects or feature enhancements to Red Hat products. There is no clear single defect in what you've described, and those unexplained areas that could have resulted from some unexpected flaw in rgmanager all require much deeper investigation to nail down the cause of, and could feasibly be a result of some other condition in your environment rather than a defect. It's clear that there are problems with rgmanager's behavior in your environment, but it is our support offering that is designed to guide users through assessing the nature of those problems and addressing them. Without a clear problem report here that highlights something that can be addressed in the product, I'm afraid there's not much we can do unless you're able to pursue this through our support team.

I will comment on a few items of note in your posts though, just to try to highlight areas where you need to focus.

- Indeed your problems start with drbd returning a failure code on a status op. Address that and perhaps the chain of events that follows will be prevented.

- rgmanager's recovery of the service after drbd failed results in stopping the other resources in the service, and clvmd fails to stop. A failure to stop puts the service in a failed state, and there is no further recovery after that. This requires manual intervention, as it indicates there was a condition that rgmanager could not automatically resolve to ensure the resources are no longer in use on that node.

- clvmd's stop failure indicates something was likely still using an open volume at that point, which is likely to be the GFS2 file system and libvirtd.

- DRBD having failed while clvmd had a mapped, active volume on it creates problems. If DRBD can fail in such a manner, then it leaves little recourse for rgmanager to recover the rest of the resources that depend on it in that chain. That is: if DRBD goes away, so too may the block device that backs clustered volumes in use by clvmd. If a GFS2 file system was using that volume, then the DRBD device going away means GFS2 can block or withdraw, leaving that filesystem unusable and stuck in a mounted state. If that happens, clvmd will still see its volumes as in use, preventing it from stopping, but the resource it needs in order to free itself up can't be made available until the service can stop, which is prevented by the volume being in use. Basically, I see no way to recover from this scenario, which calls into question the purpose of your cluster configuration.

- There are ways to cause the node to reboot itself and be fenced in the face of certain failures, like a GFS2 file system hitting a fatal error (use mount err=panic), a clusterfs resource failing to stop (force_unmount=1 self_fence=1 on the resource), or an lvm resource failing to stop (self_fence=1 on the resource). Our support team could help you better understand the appropriate course of action here. That's more a usage question or discussion about your preferences.

- After DRBD fails and clvmd fails to stop and is in a state where it's not expected to be able to get cleaned up, you begin attempting to kill daemons.
This is a bad idea for various reasons, not least of which is that it's not going to entirely clean up those blocked resources, but then when you start rgmanager again it's going to have no memory of the fact that resources were blocked, and can now once again try to interact with those resources, which may block or fail. This could block further interaction with rgmanager through clustat or clusvcadm, which could explain more of your problems from here on.

- After resources fail and you start killing daemons in this manner, we can't really predict what the behavior should be from that point on, so anything else unexpected is difficult to describe as a defect.

- You have 'invalid reply' messages through all of these events, which is troubling. These could certainly be contributing to the overall problematic behavior, as these are not expected and are generally not handled in any comprehensive way. If messages are coming across with invalid headers or data, then it indicates either a communication problem with packets being corrupted somehow, or rgmanager being in a problematic state on one node. But because it's not clear what state rgmanager was in on both nodes from the comments you've provided, and we don't have any additional diagnostic information about its internal state (like an application core of rgmanager, an internal debug dump, tcpdumps, a clear history showing what was done leading up to all of this, debug logs, etc.), there's not really much we can say about this condition other than that it's unexpected and is likely going to cause further problems after it occurs.

My suggestion is to speak with your partner account team regarding support subscriptions so that you can pursue a support engagement allowing us to look at these events in the necessary level of detail. Alternatively, you can reach out to Linbit to have them help you better understand why it failed and set off this chain of events. Aside from that, I don't see anything here that we can immediately address in the product.

-John

> Again, I'll note that the nature of what you're looking for assistance with is something we need to be handling in a support engagement, not here in bugzilla.

Which is why I offered up SSH access to the affected machines. Being an ISV, as I mentioned, I am on a self-support tier so I can't officially engage support.

> - Indeed your problems start with drbd returning a failure code on a status op. Address that and perhaps the chain of events that follows will be prevented.

No doubt, but the nature of HA is that you have to handle unexpected failures in a way that you can hopefully recover from. Handling a bad return status is exactly what rgmanager is designed to do. I am working in parallel to solve that, but when I do, it doesn't change the fact that rgmanager reacts badly to a bad status. The issue could arise again with a different component that throws an unexpected RC.

> - clvmd's stop failure indicates something was likely still using an open volume at that point, which is likely to be the GFS2 file system and libvirtd.

Reasonable assumption. That rgmanager itself stopped responding is why I think this is fundamentally a DLM issue. GFS2, clvmd and rgmanager are the three users of DLM. However, it's not hard-locked like in a failed fence, because it does respond to the kill commands, which is why I think rgmanager should have been able to handle it better. Leaving the system hung, when fencing is still an option, seems like a bad idea in an HA environment.
Fencing the failed node to unblock the cluster strikes me as a much saner option.

> - There are ways to cause the node to reboot itself and be fenced in the face of certain failures, like a GFS2 file system hitting a fatal error (mount with errors=panic), a clusterfs resource failing to stop (force_unmount=1 self_fence=1 on the resource), or an lvm resource failing to stop (self_fence=1 on the resource). Our support team could help you better understand the appropriate course of action here. That's more a usage question or discussion about your preferences.

I will try this; it might be just the ticket.

> - After DRBD fails and clvmd fails to stop and is in a state where it's not expected to be able to get cleaned up, you begin attempting to kill daemons. This is a bad idea for various reasons, not least of which is that it's not going to entirely clean up those blocked resources, and then when you start rgmanager again it's going to have no memory of the fact that resources were blocked, and can once again try to interact with those resources, which may block or fail. This could block further interaction with rgmanager through clustat or clusvcadm, which could explain more of your problems from here on.

I understand that 'kill' is bad, but faced with a non-responsive system, I was looking for the "least bad" solution.

So in general, if rgmanager fails, the node should be rebooted? Or would withdrawing the node from the cluster entirely (stopping cman and all resources) be enough to restore a clean state?

> - After resources fail and you start killing daemons in this manner, we can't really predict what the behavior should be from that point on, so anything else unexpected is difficult to describe as a defect.

Understandable.

> - You have 'invalid reply' messages through all of these events, which is troubling.

In all my years of using rgmanager, I've never seen those messages before. I understand the difficulty of diagnosing this without a reliable reproducer. When I tried to enable debug logging according to man cluster.conf, all I got was rate-limit messages to the tune of ~1000 messages being suppressed every few seconds. I'm much more of a user than a dev, so I am probably not doing a good job helping you here.

> My suggestion is to speak with your partner account team regarding support subscriptions so that you can pursue a support engagement allowing us to look at these events in the necessary level of detail.

I'll try. As for LINBIT, I've already engaged them as well. However, knowing that rgmanager reacts poorly to certain script RCs worries me, independent of the current root cause.

Very much appreciate your lengthy reply and effort.

(In reply to digimer from comment #17)
> Which is why I offered up SSH access to the affected machines. Being an ISV, as I mentioned, I am on a self-support tier so I can't officially engage support.

Sorry, I won't be able to connect directly in that manner.

> > - Indeed your problems start with drbd returning a failure code on a status op. Address that and perhaps the chain of events that follows will be prevented.
>
> No doubt, but the nature of HA is that you have to handle unexpected failures in a way that you can hopefully recover from. Handling a bad return status is exactly what rgmanager is designed to do. I am working in parallel to solve that, but when I do, it doesn't change the fact that rgmanager reacts badly to a bad status. The issue could arise again with a different component that throws an unexpected RC.
rgmanager did not, from what I see, respond badly to the DRBD status failure. It executed its stop procedures as it should during an attempt to recover, and then clvmd returned a failure during stop, which is expected to put the service in a failed state. As I said, if we can't stop, then we can't be sure the resource is free to be used throughout the cluster, so we require manual intervention in response. This is expected and is part of the design of rgmanager.

> > - clvmd's stop failure indicates something was likely still using an open volume at that point, which is likely to be the GFS2 file system and libvirtd.
>
> Reasonable assumption. That rgmanager itself stopped responding is why I think this is fundamentally a DLM issue. GFS2, clvmd and rgmanager are the three users of DLM. However, it's not hard locked like in a failed fence, because it does respond to the kill commands, which is why I think rgmanager should have been able to handle it better. Leaving the system hung, when fencing is still an option, seems like a bad idea in an HA environment. Fencing the failed node to unblock the cluster strikes me as a much saner option.

Based on what I see, I wouldn't agree that this is what it was blocked on. But without data about what rgmanager's state was internally, there's no point in me arguing it. We can't prove it one way or the other from the captured information. An application core capturing rgmanager's internal state on each node would be a good start. DLM's lock states, netstat connection information, full logs, and a complete description of every action taken would also be helpful. I can't guarantee that we'll be able to comprehensively investigate this and reach a conclusion without a support engagement, but having that data would at least give us the ability to see more of what is happening.

> > - There are ways to cause the node to reboot itself and be fenced in the face of certain failures, like a GFS2 file system hitting a fatal error (mount with errors=panic), a clusterfs resource failing to stop (force_unmount=1 self_fence=1 on the resource), or an lvm resource failing to stop (self_fence=1 on the resource). Our support team could help you better understand the appropriate course of action here. That's more a usage question or discussion about your preferences.
>
> I will try this; it might be just the ticket.

> > - After DRBD fails and clvmd fails to stop and is in a state where it's not expected to be able to get cleaned up, you begin attempting to kill daemons. This is a bad idea for various reasons, not least of which is that it's not going to entirely clean up those blocked resources, and then when you start rgmanager again it's going to have no memory of the fact that resources were blocked, and can once again try to interact with those resources, which may block or fail. This could block further interaction with rgmanager through clustat or clusvcadm, which could explain more of your problems from here on.
>
> I understand that 'kill' is bad, but faced with a non-responsive system, I was looking for the "least bad" solution.

Understood. My point was more that once you go down this path, everything that follows is off the rails, and we can't give much consideration to how everything else behaves following these actions.

> So in general, if rgmanager fails, the node should be rebooted? Or would withdrawing the node from the cluster entirely (stopping cman and all resources) be enough to restore a clean state?
Difficult to say without being able to pinpoint the exact reason for things being stuck. But with clvmd and GFS2 in the state they were in, you're unlikely to be able to stop those cleanly, so you'd likely need to reboot instead.

> > - After resources fail and you start killing daemons in this manner, we can't really predict what the behavior should be from that point on, so anything else unexpected is difficult to describe as a defect.
>
> Understandable.

> > - You have 'invalid reply' messages through all of these events, which is troubling.
>
> In all my years of using rgmanager, I've never seen those messages before. I understand the difficulty of diagnosing this without a reliable reproducer. When I tried to enable debug logging according to man cluster.conf, all I got was rate-limit messages to the tune of ~1000 messages being suppressed every few seconds. I'm much more of a user than a dev, so I am probably not doing a good job helping you here.

We've seen them before, and they've resulted from a variety of conditions, most often the two scenarios I described: something interfering with rgmanager communications but not with general token passing, or rgmanager put into some unexpected state through a problematic sequence of events. There's no single condition I can point to that I would expect to explain your situation.

> > My suggestion is to speak with your partner account team regarding support subscriptions so that you can pursue a support engagement allowing us to look at these events in the necessary level of detail.
>
> I'll try. As for LINBIT, I've already engaged them as well. However, knowing that rgmanager reacts poorly to certain script RCs worries me, independent of the current root cause.

Again, I don't see rgmanager reacting poorly to any script returns. The sequence that follows the drbd failure is expected. As I explained, the way you've got your resources laid out means that a DRBD failure pretty much spells the end of what rgmanager can do to recover. A DRBD failure means GFS2 withdraws or blocks, and clvmd can't find the device that backs its mapped volumes. I can say, without a doubt, that we should expect rgmanager to be unable to recover those resources on that node if this happens. So addressing why DRBD failed seems to me to be the most critical part of all this.

If there's something specific you think rgmanager did incorrectly in how it responded to that script failure, please restate it for me. Perhaps I'm missing your point.

-John

Thank you for your reply, and I understand your position.
> If there's something specific you think rgmanager did incorrectly in how it responded to that script failure, please restate it for me. Perhaps I'm missing your point.
I suppose the only real point I can make is that 'clusvcadm -e <foo>', when <foo> had failed and then been disabled, shouldn't hang indefinitely. A timeout of some sort that exits (reporting a failure) would be much better.
cheers
Sorry, I guess I did not address that point directly, but only hinted at it by mentioning the other conditions I felt were driving it.
clusvcadm -d and clusvcadm -e could have very easily (and somewhat expectedly) blocked trying to stop or start one of the resources that was left in a bad state following DRBD's unexpected exit. This is what I was trying to get at when I said that if DRBD goes away, the other resources are mostly hosed at that point. LVM operations could reasonably block depending on what state the volumes are left in. If GFS2 withdraws, it suspends the underlying volume, which can then cause further LVM operations to block when they attempt to scan that device (depending on settings, but this is not uncommon to see). So attempting to start clvmd after it had been left running when the service stop failed could encounter this blocking. Similarly, attempting to start the filesystem resource could block for a variety of reasons if it had withdrawn.
The point is: if DRBD goes away while these other resources are active, their design does not leave any simple way to then un-wedge them to start over fresh. clvmd and GFS2 both have components in the kernel, and so starting or stopping these types of resources requires a somewhat complex sequence of operations involving userspace and kernelspace. If one or the other gets stuck, such as if there is a withdrawal or a suspended device, then those components can be left in an uninterruptible state that can't just be cleared out and started fresh. So again, this is why I say your service design is problematic. If DRBD fails in this manner, there's not much else that can happen with the other components to clean them up.
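To make that concrete, here is a purely illustrative cluster.conf fragment showing where the self-fencing knobs mentioned earlier would sit. Every name, device and path below is a placeholder rather than anything taken from this cluster, and the service tree is only a rough approximation of the kind of layout being discussed:

====
<rm>
  <resources>
    <!-- Placeholder resources; adjust names, devices and init scripts to the real layout. -->
    <script name="drbd_example" file="/etc/init.d/drbd"/>
    <script name="clvmd_example" file="/etc/init.d/clvmd"/>
    <!-- errors=panic asks GFS2 to panic the node on a fatal error instead of withdrawing.
         force_unmount kills users of the mount during stop; self_fence reboots the node
         if the unmount still fails, so the node gets fenced and recovered cleanly. -->
    <clusterfs name="shared_fs_example" device="/dev/example_vg/example_lv"
               mountpoint="/shared" fstype="gfs2" options="errors=panic"
               force_unmount="1" self_fence="1"/>
  </resources>
  <service name="storage_example" autostart="1" recovery="restart">
    <!-- Children start after (and stop before) their parent, so the chain here is
         drbd, then clvmd, then the GFS2 mount. -->
    <script ref="drbd_example">
      <script ref="clvmd_example">
        <clusterfs ref="shared_fs_example"/>
      </script>
    </script>
  </service>
</rm>
====

The trade-off, of course, is that a stop failure then costs you the node rather than leaving the service wedged, which is generally the lesser evil in an HA setup.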
> A timeout of some sort that exits (reporting a failure) would be much better.
This can be configured. rgmanager doesn't enforce timeouts on resource operations by default, but it can if you enable __enforce_timeouts. That said, it doesn't resolve the problems I noted previously with those components being stuck in a state that can't be recovered. You'd simply have a more responsive clusvcadm command, but likely would still have no recourse to get those resources functional again.
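For reference, a minimal sketch of what enabling that might look like, again with placeholder names; the timeouts being enforced are the ones declared in the resource agent metadata, not values set here:

====
<service name="storage_example" recovery="restart">
  <!-- __enforce_timeouts="1" on a resource reference tells rgmanager to fail a
       start/stop/status operation that exceeds the agent's declared timeout
       instead of letting it block indefinitely. -->
  <script ref="drbd_example" __enforce_timeouts="1">
    <clusterfs ref="shared_fs_example" __enforce_timeouts="1"/>
  </script>
</service>
====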
Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available. The official life cycle policy can be reviewed here: http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL: https://access.redhat.com/

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.