Bug 1157951 - rgmanager (clustat) sees a deleted VM, couldn't get rid of it, hung when stopping rgmanager
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: rgmanager
Version: 6.5
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: rc
: ---
Assignee: Ryan McCabe
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1166589
 
Reported: 2014-10-28 05:59 UTC by Madison Kelly
Modified: 2016-11-08 15:08 UTC
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1166589 (view as bug list)
Environment:
Last Closed: 2016-11-08 14:25:06 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Madison Kelly 2014-10-28 05:59:43 UTC
Description of problem:

This is the second time I've hit this bug in about 1.5 years. I do not have a reproducer. =/

I deleted a VM from rgmanager using ccs. It deleted the VM from cluster.conf and the other node's clustat removed it, but the first node still saw it. I tried incrementing the cluster.conf version number and pushing it out; no luck. I added, updated, deleted, updated; still no luck.

I decided to restart rgmanager, but the '/etc/init.d/rgmanager stop' never exited. When I finally ctrl+c'ed it, I saw two rgmanager processes. I could not kill either of them, so I eventually fenced the node.

The first time I hit it, I posted here:

http://www.redhat.com/archives/linux-cluster/2013-July/msg00027.html

This time I got a bit more info:

Node 1:

logs:
===
Oct 27 15:13:30 nr-c03n01 ricci[24497]: Executing '/usr/bin/virsh nodeinfo'
Oct 27 15:13:30 nr-c03n01 ricci[24502]: Executing '/usr/bin/virsh nodeinfo'
Oct 27 15:13:30 nr-c03n01 ricci[24505]: Executing '/usr/bin/virsh nodeinfo'
Oct 27 15:13:30 nr-c03n01 ricci[24507]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/735065620'
Oct 27 15:13:30 nr-c03n01 ricci[24511]: Executing '/usr/bin/virsh nodeinfo'
Oct 27 15:13:30 nr-c03n01 ricci[24516]: Executing '/usr/bin/virsh nodeinfo'
Oct 27 15:13:30 nr-c03n01 ricci[24519]: Executing '/usr/bin/virsh nodeinfo'
Oct 27 15:13:30 nr-c03n01 ricci[24521]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/247634552'
Oct 27 15:13:32 nr-c03n01 ricci[24632]: Executing '/usr/bin/virsh nodeinfo'
Oct 27 15:13:32 nr-c03n01 ricci[24637]: Executing '/usr/bin/virsh nodeinfo'
Oct 27 15:13:32 nr-c03n01 ricci[24640]: Executing '/usr/bin/virsh nodeinfo'
Oct 27 15:13:32 nr-c03n01 ricci[24642]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/2072681699'
Oct 27 15:13:32 nr-c03n01 modcluster: Updating cluster.conf
Oct 27 15:13:32 nr-c03n01 corosync[4674]:   [QUORUM] Members[2]: 1 2
Oct 27 15:13:32 nr-c03n01 ricci[24674]: Executing '/usr/bin/virsh nodeinfo'
Oct 27 15:13:32 nr-c03n01 ricci[24720]: Executing '/usr/bin/virsh nodeinfo'
Oct 27 15:13:32 nr-c03n01 ricci[24736]: Executing '/usr/bin/virsh nodeinfo'
Oct 27 15:13:32 nr-c03n01 rgmanager[5071]: Reconfiguring
Oct 27 15:13:32 nr-c03n01 ricci[24755]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/442170264'
Oct 27 15:13:33 nr-c03n01 ricci[24899]: Executing '/usr/bin/virsh nodeinfo'
Oct 27 15:13:33 nr-c03n01 ricci[24944]: Executing '/usr/bin/virsh nodeinfo'
Oct 27 15:13:33 nr-c03n01 ricci[24964]: Executing '/usr/bin/virsh nodeinfo'
Oct 27 15:13:33 nr-c03n01 ricci[24993]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/722138948'
Oct 27 15:13:33 nr-c03n01 modcluster: Updating cluster.conf
Oct 27 15:13:33 nr-c03n01 corosync[4674]:   [QUORUM] Members[2]: 1 2
===

clustat:
===
Cluster Status for nr-cluster-03 @ Mon Oct 27 15:43:28 2014
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 nr-c03n01.example.com                           1 Online, Local, rgmanager
 nr-c03n02.example.com                           2 Online, rgmanager

 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 service:libvirtd_n01           nr-c03n01.example.com              started
 service:libvirtd_n02           nr-c03n02.example.com              started
 service:storage_n01            nr-c03n01.example.com              started
 service:storage_n02            nr-c03n02.example.com              started
 vm:iRed2                       (nr-c03n01.example.com)            disabled
 vm:vm05-http03                 nr-c03n01.example.com              started
 vm:vm06-iRedMail               nr-c03n02.example.com              started
===

Node 2:

logs:
===
Oct 27 15:13:32 nr-c03n02 corosync[2714]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Oct 27 15:13:32 nr-c03n02 corosync[2714]:   [CMAN  ] Can't get updated config version 16: New configuration version has to be newer than current running configuration#012.
Oct 27 15:13:32 nr-c03n02 corosync[2714]:   [CMAN  ] Activity suspended on this node
Oct 27 15:13:32 nr-c03n02 corosync[2714]:   [CMAN  ] Error reloading the configuration, will retry every second
Oct 27 15:13:33 nr-c03n02 corosync[2714]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Oct 27 15:13:33 nr-c03n02 corosync[2714]:   [CMAN  ] Can't get updated config version 17: New configuration version has to be newer than current running configuration#012.
Oct 27 15:13:33 nr-c03n02 corosync[2714]:   [CMAN  ] Activity suspended on this node
Oct 27 15:13:33 nr-c03n02 corosync[2714]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Oct 27 15:13:33 nr-c03n02 corosync[2714]:   [CMAN  ] Can't get updated config version 17: New configuration version has to be newer than current running configuration#012.
Oct 27 15:13:33 nr-c03n02 corosync[2714]:   [CMAN  ] Activity suspended on this node
Oct 27 15:13:33 nr-c03n02 corosync[2714]:   [CMAN  ] Error reloading the configuration, will retry every second
Oct 27 15:13:34 nr-c03n02 ricci[26096]: Executing '/usr/bin/virsh nodeinfo'
Oct 27 15:13:34 nr-c03n02 corosync[2714]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Oct 27 15:13:34 nr-c03n02 corosync[2714]:   [CMAN  ] Can't get updated config version 17: New configuration version has to be newer than current running configuration#012.
Oct 27 15:13:34 nr-c03n02 corosync[2714]:   [CMAN  ] Activity suspended on this node
Oct 27 15:13:34 nr-c03n02 corosync[2714]:   [CMAN  ] Error reloading the configuration, will retry every second
Oct 27 15:13:34 nr-c03n02 ricci[26152]: Executing '/usr/bin/virsh nodeinfo'
Oct 27 15:13:34 nr-c03n02 ricci[26155]: Executing '/usr/bin/virsh nodeinfo'
Oct 27 15:13:35 nr-c03n02 ricci[26157]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1154873520'
Oct 27 15:13:35 nr-c03n02 modcluster: Updating cluster.conf
Oct 27 15:13:35 nr-c03n02 corosync[2714]:   [QUORUM] Members[2]: 1 2
Oct 27 15:13:35 nr-c03n02 rgmanager[2944]: Reconfiguring
===

clustat:
===
Cluster Status for nr-cluster-03 @ Mon Oct 27 15:34:38 2014
Member Status: Quorate

 Member Name                                            ID   Status
 ------ ----                                            ---- ------
 nr-c03n01.example.com                                          1 Online, rgmanager
 nr-c03n02.example.com                                          2 Online, Local, rgmanager

 Service Name                                  Owner (Last)                                  State         
 ------- ----                                  ----- ------                                  -----         
 service:libvirtd_n01                          nr-c03n01.example.com                             started       
 service:libvirtd_n02                          nr-c03n02.example.com                             started       
 service:storage_n01                           nr-c03n01.example.com                             started       
 service:storage_n02                           nr-c03n02.example.com                             started       
 vm:vm05-http03                                nr-c03n01.example.com                             started       
 vm:vm06-iRedMail                              nr-c03n02.example.com                             started       
===

This was after the attempts to manually remove it failed:

===
[root@nr-c03n01 ~]# ccs -h localhost --activate --sync --password "secret" --rmvm iRed2
Unable to find virtual machine: iRed2

[root@nr-c03n01 ~]# clustat
Cluster Status for nr-cluster-03 @ Mon Oct 27 15:50:54 2014
Member Status: Quorate

 Member Name                                            ID   Status
 ------ ----                                            ---- ------
 nr-c03n01.example.com                                          1 Online, Local, rgmanager
 nr-c03n02.example.com                                          2 Online, rgmanager

 Service Name                                  Owner (Last)                                  State         
 ------- ----                                  ----- ------                                  -----         
 service:libvirtd_n01                          nr-c03n01.example.com                             started       
 service:libvirtd_n02                          nr-c03n02.example.com                             started       
 service:storage_n01                           nr-c03n01.example.com                             started       
 service:storage_n02                           nr-c03n02.example.com                             started       
 vm:iRed2                                      (nr-c03n01.example.com)                           disabled      
 vm:vm05-http03                                nr-c03n02.example.com                             started       
 vm:vm06-iRedMail                              nr-c03n02.example.com                             started
===

Version-Release number of selected component (if applicable):

cman-3.0.12.1-59.el6_5.3.x86_64
rgmanager-3.0.12.1-19.el6.x86_64
corosync-1.4.1-17.el6_5.1.x86_64
ccs-0.16.2-69.el6_5.1.x86_64

How reproducible:

Very rare.


Steps to Reproduce:
1. Unknown :(
2.
3.

Actual results:

rgmanager gets stuck


Expected results:

rgmanager does not get stuck


Additional info:

Comment 1 Madison Kelly 2014-10-28 06:05:21 UTC
rgmanager.log, node1:

Oct 27 15:13:32 rgmanager Reconfiguring

rgmanager.log, node2:

Oct 27 15:13:35 rgmanager Reconfiguring

Comment 2 Jan Pokorný [poki] 2014-11-20 22:52:52 UTC
Digimer,
I scratched my head on this for a bit; I think there is some progress:

0. assumption:
   you originally used the same (or equivalent) command as later on, i.e.:

>  ccs -h localhost --activate --sync --password "secret" --rmvm iRed2

--


1. "Updating cluster.conf" followed by symptoms of cluster.conf indeed
   being propagated, twice in a row within a short time on nr-c03n01,
   seemed unnatural and suspicious

->

2. indeed there is a bug in ccs causing the following sequence:

   - if (removevm): remove_vm(name)
     -> set_cluster_conf (while "activate" holds ~ --activate,
                          only against localhost)

     <spoiler-alert>
         "activate" should be temporarily masked if "sync" is set
         to prevent "double activate", just as the method below does
     </spoiler-alert>

   - if (sync): sync_cluster_conf()
     -> set_cluster_conf (with "activate" masked,
                          against all nodes via cluster.conf hostnames)
     -> set_cluster_conf (with "activate" unmasked, hence true as above,
                         only against the last enumerated node)

--

Bottom-line: there is still a bug in rgmanager in that it is unable, in
some circumstances, to deal with 2+ subsequent configuration updates in
a very short time frame (likely a race condition)

Good news: buggy ccs (in a sense, working, but less efficiently than
appropriate) helped to discover this bug :)
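
For illustration, the double-activation flow described above can be sketched in Python (all names are hypothetical stand-ins, not the actual ccs internals):

```python
# Sketch of the ccs sequence: with --rmvm, --activate and --sync combined,
# set_cluster_conf() activates once from remove_vm() and once more at the
# end of sync_cluster_conf(), producing two config reloads back to back.

calls = []  # records (target, activate) pairs so the flow can be inspected

def set_cluster_conf(target, activate):
    calls.append((target, activate))

def remove_vm_flow(nodes, activate=True, sync=True):
    # remove_vm(): "activate" is NOT masked here, so this is reload #1
    set_cluster_conf("localhost", activate)
    if sync:
        # sync_cluster_conf(): distribute with "activate" masked ...
        for node in nodes:
            set_cluster_conf(node, False)
        # ... then unmask on the last enumerated node -- reload #2
        set_cluster_conf(nodes[-1], activate)

remove_vm_flow(["nr-c03n01", "nr-c03n02"])
print(sum(1 for _, activate in calls if activate))  # → 2
```

Masking "activate" during the initial remove_vm() push whenever "sync" is set would leave exactly one activation, as the spoiler above suggests.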

Comment 3 Madison Kelly 2014-11-21 01:10:15 UTC
In my cases, the networks are generally really low latency on dedicated VLANed segments, so a race isn't entirely out of the question I suppose. Anything I can do on my side of things to help/test?

In case it's related: I periodically run into issues where rgmanager won't stop. In about 75% of those cases, 'kill `pidof rgmanager`' causes it to actually shut down, but in about 25% of cases I have to fence the node. I'd say this happens about 2~3% of the time, and I never see anything funny in the logs.

Comment 4 Jan Pokorný [poki] 2014-11-21 16:13:32 UTC
> In my cases, the networks are generally really low latency on dedicated
> VLANed segments, so a race isn't entirely out of the question I suppose.

I suspect the fact you've hit this only with VMs so far may have more
to do with virtual machine shutdown (?) being quite CPU-intensive and
hence taking more time, increasing the probability it won't fit into the
window between forced config reloads, without any direct influence of
network latency. But that's just a guessing game.

> Anything I can do on my side of things to help/test?

I don't think it's documented, but some more diagnostics can be obtained
like this (as I've found out):

# kill -USR1 $(ps -d --sort start | grep rgmanager | head -n1 \
  | sed 's/\s*\([0-9]*\).*/\1/')
# ls -al /var/lib/cluster/rgmanager-dump
> -rw-r--r--. 1 root root 22151 Nov 21 16:45 /var/lib/cluster/rgmanager-dump
# cat /var/lib/cluster/rgmanager-dump
> Cluster configuration version 28
> 
> Resource Group Threads
> [...]

The output might help, but more likely it will be akin to clustat output.
Then as a last resort measure -- attaching the debugger :)

> In case it's related: I periodically run into issues where rgmanager
> won't stop. In about 75% of those cases, 'kill `pidof rgmanager`' causes
> it to actually shut down, but in about 25% of cases I have to fence
> the node. I'd say this happens about 2~3% of the time, and I never see
> anything funny in the logs.

IIRC that inability to terminate rgmanager is caused by the locking part
being wired down to the kernel and its uninterruptible threads.  The
usual diagnostic here is something like "echo l > /proc/sysrq-trigger",
after which you can find potentially useful info in dmesg/messages/console
(no guarantees).


BTW, the ccs part of this bug should be straightforward to fix as a
least-resistance prevention: [bug 1166589].

Comment 5 Jan Pokorný [poki] 2015-07-14 17:38:02 UTC
Might/not be related:
http://clusterlabs.org/pipermail/users/2015-July/000819.html

Comment 6 Madison Kelly 2015-07-14 17:45:30 UTC
Certainly related.

Had it happen again over the weekend. It seems to trigger when I delete a VM. This time, though, I didn't use ccs; I manually deleted the <vm /> element, ran ccs_config_validate and then 'cman_tool version -r'. The peer didn't see the delete. Thankfully, the VM was running on the OK node, so I migrated the VMs off and tried to stop rgmanager, but it hung again.

Oddly, this time I decided to try to bring rgmanager down. It would not respond to anything (I killed the two 'rgmanager stop' PIDs and the main rgmanager PID to no effect). So I started stopping resources manually. When I stopped gfs2, it unmounted and was fine. Then I stopped clvmd, and rgmanager finally reacted, marked the storage service as failed, and then shut down. Very odd.

Sorry, I don't have the logs from this most recent event.

Comment 7 Madison Kelly 2015-07-14 17:49:14 UTC
It may not matter, but I had ~12 VMs on the 2-node cluster with a fully up-to-date 6.6. Three DRBD resources backing three clustered LVM VGs. Each VM had its own LV.

Comment 8 Madison Kelly 2015-07-20 05:53:12 UTC
I'm starting to be more and more convinced that this is somehow related to clustered LVM...

I just had an odd case where the storage service (drbd -> clvmd -> gfs2) timed out starting on a node. When I then tried to stop rgmanager, it got stuck and would not exit. Note that I had no VMs on the cluster yet.

I ctrl+c'ed the '/etc/init.d/rgmanager stop' and then manually stopped gfs2, no reaction. I then stopped clvmd and suddenly rgmanager unblocked and everything shut down.

This seems to be cropping up more and more of late, but I still don't have a reliable reproducer.

Comment 9 Madison Kelly 2015-07-21 03:09:08 UTC
I think I have a fairly reliable reproducer now.

If you build a 2-node cluster with RHEL 6.6 on a pair of VMs (I am using my laptop with an SSD to back 2x 100GB qcow2 VMs with 1GB of RAM and fence_virsh), start cman, manually start drbd -> clvmd -> gfs2, then start rgmanager, one of the nodes will usually fail to actually start rgmanager.

Rather, rgmanager starts, but clustat never shows 'rgmanager' under one of the nodes. It seems to be random as to which one fails to start. It may or may not have anything to do with drbd still sync'ing.

I *think* this is a timing issue, though.

It looks like the backing storage takes time to start, so clvmd can't see the PV on one of the nodes fast enough (it tears down the storage and then starts it back up when '/etc/init.d/rgmanager start' is called, rather than seeing that the services are already running).

Once this happens, rgmanager is stuck. You can't kill it, stop it, start it or anything. However, if you then run '/etc/init.d/clvmd restart' on the stuck node, BAM, rgmanager unblocks and everything returns to normal.

Here are the relevant config files:

=====================
[root@an-a02n02 ~]# cat /etc/cluster/cluster.conf 
<?xml version="1.0"?>
<cluster name="an-anvil-02" config_version="1">
	<cman expected_votes="1" two_node="1" />
	<clusternodes>
		<clusternode name="an-a02n01.alteeve.ca" nodeid="1">
			<fence>
				<method name="kvm">
					<device name="kvm_host" port="node1" delay="15" action="reboot" />
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-a02n02.alteeve.ca" nodeid="2">
			<fence>
				<method name="kvm">
					<device name="kvm_host" port="node2" action="reboot" />
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="kvm_host" agent="fence_virsh" ipaddr="192.168.122.1" login="root" passwd_script="/root/lemass" />
	</fencedevices>
	<fence_daemon post_join_delay="30" />
	<totem rrp_mode="none" secauth="off"/>
	<rm log_level="5">
		<resources>
			<script file="/etc/init.d/drbd" name="drbd"/>
			<script file="/etc/init.d/clvmd" name="clvmd"/>
			<clusterfs device="/dev/an-a02n01_vg0/shared" force_unmount="1" fstype="gfs2" mountpoint="/shared" name="sharedfs" />
			<script file="/etc/init.d/libvirtd" name="libvirtd"/>
		</resources>
		<failoverdomains>
			<failoverdomain name="only_n01" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-a02n01.alteeve.ca"/>
			</failoverdomain>
			<failoverdomain name="only_n02" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-a02n02.alteeve.ca"/>
			</failoverdomain>
			<failoverdomain name="primary_n01" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-a02n01.alteeve.ca" priority="1"/>
				<failoverdomainnode name="an-a02n02.alteeve.ca" priority="2"/>
			</failoverdomain>
			<failoverdomain name="primary_n02" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-a02n01.alteeve.ca" priority="2"/>
				<failoverdomainnode name="an-a02n02.alteeve.ca" priority="1"/>
			</failoverdomain>
		</failoverdomains>
		<service name="storage_n01" autostart="1" domain="only_n01" exclusive="0" recovery="restart">
			<script ref="drbd">
				<script ref="clvmd">
					<clusterfs ref="sharedfs"/>
				</script>
			</script>
		</service>
		<service name="storage_n02" autostart="1" domain="only_n02" exclusive="0" recovery="restart">
			<script ref="drbd">
				<script ref="clvmd">
					<clusterfs ref="sharedfs"/>
				</script>
			</script>
		</service>
		<service name="libvirtd_n01" autostart="1" domain="only_n01" exclusive="0" recovery="restart">
			<script ref="libvirtd"/>
		</service>
		<service name="libvirtd_n02" autostart="1" domain="only_n02" exclusive="0" recovery="restart">
			<script ref="libvirtd"/>
		</service>
	</rm>
</cluster>
=====================

=====================
[root@an-a02n02 ~]# cat /etc/lvm/lvm.conf 
config {
    checks = 1
    abort_on_errors = 0
    profile_dir = "/etc/lvm/profile"
}

devices {
    dir = "/dev"
    scan = [ "/dev" ]
    obtain_device_list_from_udev = 0
    preferred_names = [ "^/dev/mpath/", "^/dev/mapper/mpath", "^/dev/[hs]d" ]
    filter = [ "a|/dev/drbd*|", "r/.*/" ]
    cache_dir = "/etc/lvm/cache"
    cache_file_prefix = ""
    write_cache_state = 1
    sysfs_scan = 1
    multipath_component_detection = 1
    md_component_detection = 1
    md_chunk_alignment = 1
    data_alignment_detection = 1
    data_alignment = 0
    data_alignment_offset_detection = 1
    ignore_suspended_devices = 0
    ignore_lvm_mirrors = 1
    disable_after_error_count = 0
    require_restorefile_with_uuid = 1
    pv_min_size = 2048
    issue_discards = 0
}

allocation {
    maximise_cling = 1
    use_blkid_wiping = 1
    wipe_signatures_when_zeroing_new_lvs = 1
    mirror_logs_require_separate_pvs = 0
    cache_pool_metadata_require_separate_pvs = 0
    thin_pool_metadata_require_separate_pvs = 0
}

log {
    verbose = 0
    silent = 0
    syslog = 1
    overwrite = 0
    level = 0
    indent = 1
    command_names = 0
    prefix = "  "
    debug_classes = [ "memory", "devices", "activation", "allocation",
		      "lvmetad", "metadata", "cache", "locking" ]
}

backup {
    backup = 1
    backup_dir = "/etc/lvm/backup"
    archive = 1
    archive_dir = "/etc/lvm/archive"
    retain_min = 10
    retain_days = 30
}

shell {
    history_size = 100
}

global {
    umask = 077
    test = 0
    units = "h"
    si_unit_consistency = 1
    suffix = 1
    activation = 1
    proc = "/proc"
    locking_type = 3
    wait_for_locks = 1
    fallback_to_clustered_locking = 1
    fallback_to_local_locking = 0
    locking_dir = "/var/lock/lvm"
    prioritise_write_locks = 1
    abort_on_internal_errors = 0
    detect_internal_vg_cache_corruption = 0
    metadata_read_only = 0
    mirror_segtype_default = "mirror"
    raid10_segtype_default = "mirror"
    use_lvmetad = 0
}

activation {
    checks = 0
    udev_sync = 1
    udev_rules = 1
    verify_udev_operations = 0
    retry_deactivation = 1
    missing_stripe_filler = "error"
    use_linear_target = 1
    reserved_stack = 64
    reserved_memory = 8192
    process_priority = -18
    raid_region_size = 512
    readahead = "auto"
    raid_fault_policy = "warn"
    mirror_log_fault_policy = "allocate"
    mirror_image_fault_policy = "remove"
    snapshot_autoextend_threshold = 100
    snapshot_autoextend_percent = 20
    thin_pool_autoextend_threshold = 100
    thin_pool_autoextend_percent = 20
    use_mlockall = 0
    monitoring = 1
    polling_interval = 15
    activation_mode = "degraded"
}

dmeventd {
    mirror_library = "libdevmapper-event-lvm2mirror.so"
    snapshot_library = "libdevmapper-event-lvm2snapshot.so"
    thin_library = "libdevmapper-event-lvm2thin.so"
}
=====================

I know RH doesn't do DRBD, but here's the config anyway:

=====================
[root@an-a02n02 ~]# drbdadm dump
# /etc/drbd.conf
global {
    usage-count yes;
    cmd-timeout-medium 600;
    cmd-timeout-long 0;
}

common {
    net {
        protocol           C;
        allow-two-primaries yes;
        after-sb-0pri    discard-zero-changes;
        after-sb-1pri    discard-secondary;
        after-sb-2pri    disconnect;
    }
    disk {
        fencing          resource-and-stonith;
    }
    startup {
        become-primary-on both;
    }
    handlers {
        fence-peer       /sbin/rhcs_fence;
    }
}

# resource r0 on an-a02n02.alteeve.ca: not ignored, not stacked
# defined at /etc/drbd.d/r0.res:4
resource r0 {
    on an-a02n01.alteeve.ca {
        volume 0 {
            device       /dev/drbd0 minor 0;
            disk         /dev/vda5;
            meta-disk    internal;
        }
        address          ipv4 10.10.20.1:7788;
    }
    on an-a02n02.alteeve.ca {
        volume 0 {
            device       /dev/drbd0 minor 0;
            disk         /dev/vda5;
            meta-disk    internal;
        }
        address          ipv4 10.10.20.2:7788;
    }
    disk {
        disk-flushes      no;
        md-flushes        no;
    }
}

# resource r1 on an-a02n02.alteeve.ca: not ignored, not stacked
# defined at /etc/drbd.d/r1.res:3
resource r1 {
    on an-a02n01.alteeve.ca {
        volume 0 {
            device       /dev/drbd1 minor 1;
            disk         /dev/vda6;
            meta-disk    internal;
        }
        address          ipv4 10.10.20.1:7789;
    }
    on an-a02n02.alteeve.ca {
        volume 0 {
            device       /dev/drbd1 minor 1;
            disk         /dev/vda6;
            meta-disk    internal;
        }
        address          ipv4 10.10.20.2:7789;
    }
    disk {
        disk-flushes      no;
        md-flushes        no;
    }
}


[root@an-a02n02 ~]# cat /proc/drbd 
version: 8.4.6 (api:1/proto:86-101)
GIT-hash: 833d830e0152d1e457fa7856e71e11248ccf3f70 build by root.ca, 2015-07-18 16:09:46
 0: cs:SyncTarget ro:Primary/Primary ds:Inconsistent/UpToDate C r-----
    ns:0 nr:772352 dw:772236 dr:4168 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:30501080
	[>....................] sync'ed:  2.5% (29784/30540)M
	finish: 5:58:09 speed: 1,408 (2,968) want: 440 K/sec
 1: cs:SyncTarget ro:Primary/Primary ds:Inconsistent/UpToDate C r-----
    ns:0 nr:618348 dw:618260 dr:1556 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:18026788
	[>....................] sync'ed:  3.4% (17604/18208)M
	finish: 2:38:07 speed: 1,896 (2,376) want: 6,040 K/sec
=====================

RPMs:

=====================
[root@an-a02n02 ~]# rpm -q cman rgmanager lvm2-cluster gfs2-utils
cman-3.0.12.1-68.el6_6.1.x86_64
rgmanager-3.0.12.1-21.el6.x86_64
lvm2-cluster-2.02.111-2.el6_6.3.x86_64
gfs2-utils-3.0.12.1-68.el6_6.1.x86_64
=====================

Here are the logs on the node that froze (node2):

=====================
Jul 20 22:56:12 an-a02n02 rgmanager[3208]: I am node #2
Jul 20 22:56:12 an-a02n02 rgmanager[3208]: Resource Group Manager Starting
Jul 20 22:56:16 an-a02n02 kernel: block drbd0: peer( Primary -> Secondary ) 
Jul 20 22:56:16 an-a02n02 kernel: block drbd0: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: block drbd0:   state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
Jul 20 22:56:16 an-a02n02 kernel: block drbd0:  wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/DUnknown s---F- }
Jul 20 22:56:16 an-a02n02 kernel: drbd r0: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: drbd r0:  mask = 0x1f0 val = 0x70
Jul 20 22:56:16 an-a02n02 kernel: drbd r0:  old_conn:WFReportParams wanted_conn:TearDown
Jul 20 22:56:16 an-a02n02 kernel: block drbd0: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: block drbd0:   state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
Jul 20 22:56:16 an-a02n02 kernel: block drbd0:  wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/Outdated r----- }
Jul 20 22:56:16 an-a02n02 kernel: drbd r0: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: drbd r0:  mask = 0x1e1f0 val = 0xa070
Jul 20 22:56:16 an-a02n02 kernel: drbd r0:  old_conn:WFReportParams wanted_conn:TearDown
Jul 20 22:56:16 an-a02n02 kernel: block drbd1: peer( Primary -> Secondary ) 
Jul 20 22:56:16 an-a02n02 kernel: block drbd1: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: block drbd1:   state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
Jul 20 22:56:16 an-a02n02 kernel: block drbd1:  wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/DUnknown s---F- }
Jul 20 22:56:16 an-a02n02 kernel: drbd r1: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: drbd r1:  mask = 0x1f0 val = 0x70
Jul 20 22:56:16 an-a02n02 kernel: drbd r1:  old_conn:WFReportParams wanted_conn:TearDown
Jul 20 22:56:16 an-a02n02 kernel: block drbd1: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: block drbd1:   state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
Jul 20 22:56:16 an-a02n02 kernel: block drbd1:  wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/Outdated r----- }
Jul 20 22:56:16 an-a02n02 kernel: drbd r1: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: drbd r1:  mask = 0x1e1f0 val = 0xa070
Jul 20 22:56:16 an-a02n02 kernel: drbd r1:  old_conn:WFReportParams wanted_conn:TearDown
Jul 20 22:56:16 an-a02n02 kernel: block drbd0: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: block drbd0:   state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
Jul 20 22:56:16 an-a02n02 kernel: block drbd0:  wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/DUnknown s---F- }
Jul 20 22:56:16 an-a02n02 kernel: drbd r0: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: drbd r0:  mask = 0x1f0 val = 0x70
Jul 20 22:56:16 an-a02n02 kernel: drbd r0:  old_conn:WFReportParams wanted_conn:TearDown
Jul 20 22:56:16 an-a02n02 kernel: block drbd0: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: block drbd0:   state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
Jul 20 22:56:16 an-a02n02 kernel: block drbd0:  wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/Outdated r----- }
Jul 20 22:56:16 an-a02n02 kernel: drbd r0: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: drbd r0:  mask = 0x1e1f0 val = 0xa070
Jul 20 22:56:16 an-a02n02 kernel: drbd r0:  old_conn:WFReportParams wanted_conn:TearDown
Jul 20 22:56:16 an-a02n02 kernel: block drbd1: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: block drbd1:   state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
Jul 20 22:56:16 an-a02n02 kernel: block drbd1:  wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/DUnknown s---F- }
Jul 20 22:56:16 an-a02n02 kernel: drbd r1: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: drbd r1:  mask = 0x1f0 val = 0x70
Jul 20 22:56:16 an-a02n02 kernel: drbd r1:  old_conn:WFReportParams wanted_conn:TearDown
Jul 20 22:56:16 an-a02n02 kernel: block drbd1: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: block drbd1:   state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
Jul 20 22:56:16 an-a02n02 kernel: block drbd1:  wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/Outdated r----- }
Jul 20 22:56:16 an-a02n02 kernel: drbd r1: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: drbd r1:  mask = 0x1e1f0 val = 0xa070
Jul 20 22:56:16 an-a02n02 kernel: drbd r1:  old_conn:WFReportParams wanted_conn:TearDown
Jul 20 22:56:16 an-a02n02 kernel: block drbd0: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: block drbd0:   state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
Jul 20 22:56:16 an-a02n02 kernel: block drbd0:  wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/DUnknown s---F- }
Jul 20 22:56:16 an-a02n02 kernel: drbd r0: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: drbd r0:  mask = 0x1f0 val = 0x70
Jul 20 22:56:16 an-a02n02 kernel: drbd r0:  old_conn:WFReportParams wanted_conn:TearDown
Jul 20 22:56:16 an-a02n02 kernel: block drbd0: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: block drbd0:   state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
Jul 20 22:56:16 an-a02n02 kernel: block drbd0:  wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/Outdated r----- }
Jul 20 22:56:16 an-a02n02 kernel: drbd r0: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: drbd r0:  mask = 0x1e1f0 val = 0xa070
Jul 20 22:56:16 an-a02n02 kernel: drbd r0:  old_conn:WFReportParams wanted_conn:TearDown
Jul 20 22:56:16 an-a02n02 kernel: block drbd1: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: block drbd1:   state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
Jul 20 22:56:16 an-a02n02 kernel: block drbd1:  wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/DUnknown s---F- }
Jul 20 22:56:16 an-a02n02 kernel: drbd r1: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: drbd r1:  mask = 0x1f0 val = 0x70
Jul 20 22:56:16 an-a02n02 kernel: drbd r1:  old_conn:WFReportParams wanted_conn:TearDown
Jul 20 22:56:16 an-a02n02 kernel: block drbd1: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: block drbd1:   state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
Jul 20 22:56:16 an-a02n02 kernel: block drbd1:  wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/Outdated r----- }
Jul 20 22:56:16 an-a02n02 kernel: drbd r1: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: drbd r1:  mask = 0x1e1f0 val = 0xa070
Jul 20 22:56:16 an-a02n02 kernel: drbd r1:  old_conn:WFReportParams wanted_conn:TearDown
Jul 20 22:56:16 an-a02n02 kernel: block drbd0: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: block drbd0:   state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
Jul 20 22:56:16 an-a02n02 kernel: block drbd0:  wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/DUnknown s---F- }
Jul 20 22:56:16 an-a02n02 kernel: drbd r0: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: drbd r0:  mask = 0x1f0 val = 0x70
Jul 20 22:56:16 an-a02n02 kernel: drbd r0:  old_conn:WFReportParams wanted_conn:TearDown
Jul 20 22:56:16 an-a02n02 kernel: block drbd0: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: block drbd0:   state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
Jul 20 22:56:16 an-a02n02 kernel: block drbd0:  wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/Outdated r----- }
Jul 20 22:56:16 an-a02n02 kernel: drbd r0: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: drbd r0:  mask = 0x1e1f0 val = 0xa070
Jul 20 22:56:16 an-a02n02 kernel: drbd r0:  old_conn:WFReportParams wanted_conn:TearDown
Jul 20 22:56:16 an-a02n02 kernel: block drbd1: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: block drbd1:   state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
Jul 20 22:56:16 an-a02n02 kernel: block drbd1:  wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/DUnknown s---F- }
Jul 20 22:56:16 an-a02n02 kernel: drbd r1: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: drbd r1:  mask = 0x1f0 val = 0x70
Jul 20 22:56:16 an-a02n02 kernel: drbd r1:  old_conn:WFReportParams wanted_conn:TearDown
Jul 20 22:56:16 an-a02n02 kernel: block drbd1: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: block drbd1:   state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- }
Jul 20 22:56:16 an-a02n02 kernel: block drbd1:  wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/Outdated r----- }
Jul 20 22:56:16 an-a02n02 kernel: drbd r1: State change failed: Need access to UpToDate data
Jul 20 22:56:16 an-a02n02 kernel: drbd r1:  mask = 0x1e1f0 val = 0xa070
Jul 20 22:56:16 an-a02n02 kernel: drbd r1:  old_conn:WFReportParams wanted_conn:TearDown
Jul 20 22:56:17 an-a02n02 kernel: block drbd0: peer( Secondary -> Primary ) 
Jul 20 22:56:17 an-a02n02 kernel: block drbd1: peer( Secondary -> Primary ) 
=====================

Here is where it stopped. Note the extra time needed to start rgmanager; until this point, clvmd would not have been able to see its PV.

This is what was logged when I ran '/etc/init.d/clvmd restart' (note that 'start' alone wouldn't unjam it).

=====================
Jul 20 23:01:30 an-a02n02 rgmanager[6098]: [script] script:clvmd: stop of /etc/init.d/clvmd failed (returned 5)
Jul 20 23:01:30 an-a02n02 rgmanager[3208]: stop on script "clvmd" returned 1 (generic error)
Jul 20 23:01:30 an-a02n02 clvmd: Cluster LVM daemon started - connected to CMAN
Jul 20 23:01:30 an-a02n02 kernel: block drbd0: State change failed: Device is held open by someone
Jul 20 23:01:30 an-a02n02 kernel: block drbd0:   state = { cs:SyncTarget ro:Primary/Primary ds:Inconsistent/UpToDate r----- }
Jul 20 23:01:30 an-a02n02 kernel: block drbd0:  wanted = { cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate r----- }
Jul 20 23:01:30 an-a02n02 kernel: block drbd1: role( Primary -> Secondary ) 
Jul 20 23:01:30 an-a02n02 kernel: drbd r1: peer( Primary -> Unknown ) conn( SyncTarget -> Disconnecting ) pdsk( UpToDate -> DUnknown ) 
Jul 20 23:01:30 an-a02n02 kernel: drbd r1: ack_receiver terminated
Jul 20 23:01:30 an-a02n02 kernel: drbd r1: Terminating drbd_a_r1
Jul 20 23:01:30 an-a02n02 kernel: drbd r1: Connection closed
Jul 20 23:01:30 an-a02n02 kernel: drbd r1: conn( Disconnecting -> StandAlone ) 
Jul 20 23:01:30 an-a02n02 kernel: drbd r1: receiver terminated
Jul 20 23:01:30 an-a02n02 kernel: drbd r1: Terminating drbd_r_r1
Jul 20 23:01:30 an-a02n02 kernel: block drbd1: disk( Inconsistent -> Failed ) 
Jul 20 23:01:30 an-a02n02 kernel: block drbd1: bitmap WRITE of 0 pages took 0 jiffies
Jul 20 23:01:30 an-a02n02 kernel: block drbd1: 18 GB (4661262 bits) marked out-of-sync by on disk bit-map.
Jul 20 23:01:30 an-a02n02 kernel: block drbd1: disk( Failed -> Diskless ) 
Jul 20 23:01:30 an-a02n02 kernel: drbd r1: Terminating drbd_w_r1
Jul 20 23:01:30 an-a02n02 kernel: block drbd0: role( Primary -> Secondary ) 
Jul 20 23:01:30 an-a02n02 kernel: drbd r0: peer( Primary -> Unknown ) conn( SyncTarget -> Disconnecting ) pdsk( UpToDate -> DUnknown ) 
Jul 20 23:01:30 an-a02n02 kernel: drbd r0: ack_receiver terminated
Jul 20 23:01:30 an-a02n02 kernel: drbd r0: Terminating drbd_a_r0
Jul 20 23:01:30 an-a02n02 kernel: drbd r0: Connection closed
Jul 20 23:01:30 an-a02n02 kernel: drbd r0: conn( Disconnecting -> StandAlone ) 
Jul 20 23:01:30 an-a02n02 kernel: drbd r0: receiver terminated
Jul 20 23:01:30 an-a02n02 kernel: drbd r0: Terminating drbd_r_r0
Jul 20 23:01:30 an-a02n02 kernel: block drbd0: disk( Inconsistent -> Failed ) 
Jul 20 23:01:30 an-a02n02 kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies
Jul 20 23:01:30 an-a02n02 kernel: block drbd0: 30 GB (7818329 bits) marked out-of-sync by on disk bit-map.
Jul 20 23:01:30 an-a02n02 kernel: block drbd0: disk( Failed -> Diskless ) 
Jul 20 23:01:30 an-a02n02 kernel: drbd r0: Terminating drbd_w_r0
Jul 20 23:01:30 an-a02n02 kernel: drbd: module cleanup done.
Jul 20 23:01:31 an-a02n02 rgmanager[3208]: Starting stopped service service:storage_n02
Jul 20 23:01:32 an-a02n02 kernel: drbd: events: mcg drbd: 2
Jul 20 23:01:32 an-a02n02 kernel: drbd: initialized. Version: 8.4.6 (api:1/proto:86-101)
Jul 20 23:01:32 an-a02n02 kernel: drbd: GIT-hash: 833d830e0152d1e457fa7856e71e11248ccf3f70 build by root.ca, 2015-07-18 16:09:46
Jul 20 23:01:32 an-a02n02 kernel: drbd: registered as block device major 147
Jul 20 23:01:32 an-a02n02 kernel: drbd r0: Starting worker thread (from drbdsetup-84 [6418])
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: disk( Diskless -> Attaching ) 
Jul 20 23:01:32 an-a02n02 kernel: drbd r0: Method to ensure write ordering: drain
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: max BIO size = 1048576
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: drbd_bm_resize called with capacity == 69199736
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: resync bitmap: bits=8649967 words=135156 pages=264
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: size = 33 GB (34599868 KB)
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: recounting of set bits took additional 1 jiffies
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: 30 GB (7818329 bits) marked out-of-sync by on disk bit-map.
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: disk( Attaching -> Inconsistent ) 
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: attached to UUIDs 994BBE5EC23EFD18:0000000000000000:0000000000000000:0000000000000000
Jul 20 23:01:32 an-a02n02 kernel: drbd r1: Starting worker thread (from drbdsetup-84 [6425])
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: disk( Diskless -> Attaching ) 
Jul 20 23:01:32 an-a02n02 kernel: drbd r1: Method to ensure write ordering: drain
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: max BIO size = 1048576
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: drbd_bm_resize called with capacity == 47182360
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: resync bitmap: bits=5897795 words=92154 pages=180
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: size = 22 GB (23591180 KB)
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: recounting of set bits took additional 0 jiffies
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: 18 GB (4661262 bits) marked out-of-sync by on disk bit-map.
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: disk( Attaching -> Inconsistent ) 
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: attached to UUIDs 9F5708CFE6D0CBD8:0000000000000000:0000000000000000:0000000000000000
Jul 20 23:01:32 an-a02n02 kernel: drbd r0: conn( StandAlone -> Unconnected ) 
Jul 20 23:01:32 an-a02n02 kernel: drbd r0: Starting receiver thread (from drbd_w_r0 [6419])
Jul 20 23:01:32 an-a02n02 kernel: drbd r0: receiver (re)started
Jul 20 23:01:32 an-a02n02 kernel: drbd r0: conn( Unconnected -> WFConnection ) 
Jul 20 23:01:32 an-a02n02 kernel: drbd r1: conn( StandAlone -> Unconnected ) 
Jul 20 23:01:32 an-a02n02 kernel: drbd r1: Starting receiver thread (from drbd_w_r1 [6426])
Jul 20 23:01:32 an-a02n02 kernel: drbd r1: receiver (re)started
Jul 20 23:01:32 an-a02n02 kernel: drbd r1: conn( Unconnected -> WFConnection ) 
Jul 20 23:01:32 an-a02n02 rgmanager[3208]: Starting stopped service service:libvirtd_n02
Jul 20 23:01:32 an-a02n02 rgmanager[3208]: Service service:libvirtd_n02 started
Jul 20 23:01:32 an-a02n02 kernel: lo: Disabled Privacy Extensions
Jul 20 23:01:32 an-a02n02 kernel: drbd r0: Handshake successful: Agreed network protocol version 101
Jul 20 23:01:32 an-a02n02 kernel: drbd r0: Agreed to support TRIM on protocol level
Jul 20 23:01:32 an-a02n02 kernel: drbd r0: conn( WFConnection -> WFReportParams ) 
Jul 20 23:01:32 an-a02n02 kernel: drbd r0: Starting ack_recv thread (from drbd_r_r0 [6431])
Jul 20 23:01:32 an-a02n02 kernel: drbd r1: Handshake successful: Agreed network protocol version 101
Jul 20 23:01:32 an-a02n02 kernel: drbd r1: Agreed to support TRIM on protocol level
Jul 20 23:01:32 an-a02n02 kernel: drbd r1: conn( WFConnection -> WFReportParams ) 
Jul 20 23:01:32 an-a02n02 kernel: drbd r1: Starting ack_recv thread (from drbd_r_r1 [6433])
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: drbd_sync_handshake:
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: self 994BBE5EC23EFD18:0000000000000000:0000000000000000:0000000000000000 bits:7818329 flags:0
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: peer 032F06EC14D6AB2B:994BBE5EC23EFD19:994ABE5EC23EFD19:9949BE5EC23EFD19 bits:7818329 flags:0
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: uuid_compare()=-1 by rule 50
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: Becoming sync target due to disk states.
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: drbd_sync_handshake:
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: self 9F5708CFE6D0CBD8:0000000000000000:0000000000000000:0000000000000000 bits:4661262 flags:0
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: peer ED55D611AB6E9CF7:9F5708CFE6D0CBD8:9F5608CFE6D0CBD8:9F5508CFE6D0CBD8 bits:4661262 flags:0
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: uuid_compare()=-1 by rule 50
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: Becoming sync target due to disk states.
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) 
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) 
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 26(1), total 26; compression: 100.0%
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 26(1), total 26; compression: 100.0%
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: conn( WFBitMapT -> WFSyncUUID ) 
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: role( Secondary -> Primary ) 
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 141(1), total 141; compression: 100.0%
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 141(1), total 141; compression: 100.0%
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID ) 
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: role( Secondary -> Primary ) 
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: updated sync uuid 9F5808CFE6D0CBD9:0000000000000000:0000000000000000:0000000000000000
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: updated sync uuid 994CBE5EC23EFD19:0000000000000000:0000000000000000:0000000000000000
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0)
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: conn( WFSyncUUID -> SyncTarget ) 
Jul 20 23:01:32 an-a02n02 kernel: block drbd1: Began resync as SyncTarget (will sync 18645048 KB [4661262 bits set]).
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) 
Jul 20 23:01:32 an-a02n02 kernel: block drbd0: Began resync as SyncTarget (will sync 31273316 KB [7818329 bits set]).
Jul 20 23:01:34 an-a02n02 clvmd: Cluster LVM daemon started - connected to CMAN
Jul 20 23:01:34 an-a02n02 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "an-anvil-02:shared"
Jul 20 23:01:34 an-a02n02 kernel: GFS2: fsid=an-anvil-02:shared.1: Joined cluster. Now mounting FS...
Jul 20 23:01:34 an-a02n02 kernel: GFS2: fsid=an-anvil-02:shared.1: jid=1, already locked for use
Jul 20 23:01:34 an-a02n02 kernel: GFS2: fsid=an-anvil-02:shared.1: jid=1: Looking at journal...
Jul 20 23:01:34 an-a02n02 kernel: GFS2: fsid=an-anvil-02:shared.1: jid=1: Done
Jul 20 23:01:34 an-a02n02 rgmanager[3208]: Service service:storage_n02 started
=====================
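The log blocks above are mostly DRBD connection-state transitions (`conn( X -> Y )`). When digging through a long syslog capture like this, it can help to pull out just those transitions per resource. Below is a minimal, hypothetical Python sketch of that idea; the regex and the `conn_transitions` helper are mine, not part of any Red Hat or DRBD tooling:

```python
import re

# Matches kernel lines like:
#   "... kernel: drbd r0: conn( StandAlone -> Unconnected ) "
#   "... kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) "
CONN_RE = re.compile(r"(drbd \w+|block drbd\d+): .*conn\( (\w+) -> (\w+) \)")

def conn_transitions(lines):
    """Return (resource, old_state, new_state) tuples for each conn() change."""
    out = []
    for line in lines:
        m = CONN_RE.search(line)
        if m:
            out.append((m.group(1), m.group(2), m.group(3)))
    return out

# Example using lines copied from the log above:
log = [
    "Jul 20 23:01:32 an-a02n02 kernel: drbd r0: conn( StandAlone -> Unconnected ) ",
    "Jul 20 23:01:32 an-a02n02 kernel: drbd r0: conn( Unconnected -> WFConnection ) ",
    "Jul 20 23:01:32 an-a02n02 kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) ",
]
for resource, old, new in conn_transitions(log):
    print(resource, old, "->", new)
```

Running this over the full capture makes the recovery sequence (StandAlone -> Unconnected -> WFConnection -> WFReportParams -> WFBitMapT -> WFSyncUUID -> SyncTarget) easy to follow for each resource.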

Here are the logs from the node (node 1) that started properly:

=====================
Jul 20 22:56:12 an-a02n01 rgmanager[1142]: I am node #1
Jul 20 22:56:12 an-a02n01 rgmanager[1142]: Resource Group Manager Starting
Jul 20 22:56:16 an-a02n01 kernel: block drbd0: role( Primary -> Secondary ) 
Jul 20 22:56:16 an-a02n01 kernel: drbd r0: Requested state change failed by peer: Need access to UpToDate data (-2)
Jul 20 22:56:16 an-a02n01 kernel: drbd r0: Requested state change failed by peer: Need access to UpToDate data (-2)
Jul 20 22:56:16 an-a02n01 kernel: block drbd1: role( Primary -> Secondary ) 
Jul 20 22:56:16 an-a02n01 kernel: drbd r1: Requested state change failed by peer: Need access to UpToDate data (-2)
Jul 20 22:56:16 an-a02n01 kernel: drbd r1: Requested state change failed by peer: Need access to UpToDate data (-2)
Jul 20 22:56:16 an-a02n01 kernel: drbd r0: Requested state change failed by peer: Need access to UpToDate data (-2)
Jul 20 22:56:16 an-a02n01 kernel: drbd r0: Requested state change failed by peer: Need access to UpToDate data (-2)
Jul 20 22:56:16 an-a02n01 kernel: drbd r1: Requested state change failed by peer: Need access to UpToDate data (-2)
Jul 20 22:56:16 an-a02n01 kernel: drbd r1: Requested state change failed by peer: Need access to UpToDate data (-2)
Jul 20 22:56:16 an-a02n01 kernel: drbd r0: Requested state change failed by peer: Need access to UpToDate data (-2)
Jul 20 22:56:16 an-a02n01 kernel: drbd r0: Requested state change failed by peer: Need access to UpToDate data (-2)
Jul 20 22:56:16 an-a02n01 kernel: drbd r1: Requested state change failed by peer: Need access to UpToDate data (-2)
Jul 20 22:56:16 an-a02n01 kernel: drbd r1: Requested state change failed by peer: Need access to UpToDate data (-2)
Jul 20 22:56:16 an-a02n01 kernel: drbd r0: Requested state change failed by peer: Need access to UpToDate data (-2)
Jul 20 22:56:16 an-a02n01 kernel: drbd r0: Requested state change failed by peer: Need access to UpToDate data (-2)
Jul 20 22:56:16 an-a02n01 kernel: drbd r1: Requested state change failed by peer: Need access to UpToDate data (-2)
Jul 20 22:56:16 an-a02n01 kernel: drbd r1: Requested state change failed by peer: Need access to UpToDate data (-2)
Jul 20 22:56:17 an-a02n01 rgmanager[1142]: Marking service:storage_n02 as stopped: Restricted domain unavailable
Jul 20 22:56:17 an-a02n01 rgmanager[1142]: Starting stopped service service:storage_n01
Jul 20 22:56:17 an-a02n01 kernel: block drbd0: role( Secondary -> Primary ) 
Jul 20 22:56:17 an-a02n01 kernel: block drbd1: role( Secondary -> Primary ) 
Jul 20 22:56:17 an-a02n01 rgmanager[1142]: Starting stopped service service:libvirtd_n01
Jul 20 22:56:17 an-a02n01 rgmanager[1142]: Service service:libvirtd_n01 started
Jul 20 22:56:17 an-a02n01 kernel: lo: Disabled Privacy Extensions
Jul 20 22:56:18 an-a02n01 clvmd: Cluster LVM daemon started - connected to CMAN
Jul 20 22:56:19 an-a02n01 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "an-anvil-02:shared"
Jul 20 22:56:19 an-a02n01 kernel: GFS2: fsid=an-anvil-02:shared.0: Joined cluster. Now mounting FS...
Jul 20 22:56:19 an-a02n01 kernel: GFS2: fsid=an-anvil-02:shared.0: jid=0, already locked for use
Jul 20 22:56:19 an-a02n01 kernel: GFS2: fsid=an-anvil-02:shared.0: jid=0: Looking at journal...
Jul 20 22:56:19 an-a02n01 kernel: GFS2: fsid=an-anvil-02:shared.0: jid=0: Done
Jul 20 22:56:19 an-a02n01 kernel: GFS2: fsid=an-anvil-02:shared.0: jid=1: Trying to acquire journal lock...
Jul 20 22:56:19 an-a02n01 kernel: GFS2: fsid=an-anvil-02:shared.0: jid=1: Looking at journal...
Jul 20 22:56:19 an-a02n01 kernel: GFS2: fsid=an-anvil-02:shared.0: jid=1: Done
Jul 20 22:56:19 an-a02n01 rgmanager[1142]: Service service:storage_n01 started
=====================

Here you see that the peer restarted its storage service when clvmd was restarted.

=====================
Jul 20 23:01:30 an-a02n01 kernel: block drbd1: peer( Primary -> Secondary ) 
Jul 20 23:01:30 an-a02n01 kernel: drbd r1: peer( Secondary -> Unknown ) conn( SyncSource -> TearDown ) 
Jul 20 23:01:30 an-a02n01 kernel: drbd r1: ack_receiver terminated
Jul 20 23:01:30 an-a02n01 kernel: drbd r1: Terminating drbd_a_r1
Jul 20 23:01:30 an-a02n01 kernel: drbd r1: Connection closed
Jul 20 23:01:30 an-a02n01 kernel: drbd r1: conn( TearDown -> Unconnected ) 
Jul 20 23:01:30 an-a02n01 kernel: drbd r1: receiver terminated
Jul 20 23:01:30 an-a02n01 kernel: drbd r1: Restarting receiver thread
Jul 20 23:01:30 an-a02n01 kernel: drbd r1: receiver (re)started
Jul 20 23:01:30 an-a02n01 kernel: drbd r1: conn( Unconnected -> WFConnection ) 
Jul 20 23:01:30 an-a02n01 kernel: block drbd0: peer( Primary -> Secondary ) 
Jul 20 23:01:30 an-a02n01 kernel: drbd r0: peer( Secondary -> Unknown ) conn( SyncSource -> TearDown ) 
Jul 20 23:01:30 an-a02n01 kernel: drbd r0: ack_receiver terminated
Jul 20 23:01:30 an-a02n01 kernel: drbd r0: Terminating drbd_a_r0
Jul 20 23:01:30 an-a02n01 kernel: drbd r0: Connection closed
Jul 20 23:01:30 an-a02n01 kernel: drbd r0: conn( TearDown -> Unconnected ) 
Jul 20 23:01:30 an-a02n01 kernel: drbd r0: receiver terminated
Jul 20 23:01:30 an-a02n01 kernel: drbd r0: Restarting receiver thread
Jul 20 23:01:30 an-a02n01 kernel: drbd r0: receiver (re)started
Jul 20 23:01:30 an-a02n01 kernel: drbd r0: conn( Unconnected -> WFConnection ) 
Jul 20 23:01:32 an-a02n01 kernel: drbd r0: Handshake successful: Agreed network protocol version 101
Jul 20 23:01:32 an-a02n01 kernel: drbd r0: Agreed to support TRIM on protocol level
Jul 20 23:01:32 an-a02n01 kernel: drbd r0: conn( WFConnection -> WFReportParams ) 
Jul 20 23:01:32 an-a02n01 kernel: drbd r0: Starting ack_recv thread (from drbd_r_r0 [27478])
Jul 20 23:01:32 an-a02n01 kernel: drbd r1: Handshake successful: Agreed network protocol version 101
Jul 20 23:01:32 an-a02n01 kernel: drbd r1: Agreed to support TRIM on protocol level
Jul 20 23:01:32 an-a02n01 kernel: drbd r1: conn( WFConnection -> WFReportParams ) 
Jul 20 23:01:32 an-a02n01 kernel: drbd r1: Starting ack_recv thread (from drbd_r_r1 [27510])
Jul 20 23:01:32 an-a02n01 kernel: block drbd0: drbd_sync_handshake:
Jul 20 23:01:32 an-a02n01 kernel: block drbd0: self 032F06EC14D6AB2B:994BBE5EC23EFD19:994ABE5EC23EFD19:9949BE5EC23EFD19 bits:7818329 flags:0
Jul 20 23:01:32 an-a02n01 kernel: block drbd1: drbd_sync_handshake:
Jul 20 23:01:32 an-a02n01 kernel: block drbd1: self ED55D611AB6E9CF7:9F5708CFE6D0CBD8:9F5608CFE6D0CBD8:9F5508CFE6D0CBD8 bits:4661262 flags:0
Jul 20 23:01:32 an-a02n01 kernel: block drbd1: peer 9F5708CFE6D0CBD8:0000000000000000:0000000000000000:0000000000000000 bits:4661262 flags:0
Jul 20 23:01:32 an-a02n01 kernel: block drbd1: uuid_compare()=1 by rule 70
Jul 20 23:01:32 an-a02n01 kernel: block drbd1: Becoming sync source due to disk states.
Jul 20 23:01:32 an-a02n01 kernel: block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) 
Jul 20 23:01:32 an-a02n01 kernel: block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 26(1), total 26; compression: 100.0%
Jul 20 23:01:32 an-a02n01 kernel: block drbd0: peer 994BBE5EC23EFD18:0000000000000000:0000000000000000:0000000000000000 bits:7818329 flags:0
Jul 20 23:01:32 an-a02n01 kernel: block drbd0: uuid_compare()=1 by rule 70
Jul 20 23:01:32 an-a02n01 kernel: block drbd0: Becoming sync source due to disk states.
Jul 20 23:01:32 an-a02n01 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) 
Jul 20 23:01:32 an-a02n01 kernel: block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 26(1), total 26; compression: 100.0%
Jul 20 23:01:32 an-a02n01 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 141(1), total 141; compression: 100.0%
Jul 20 23:01:32 an-a02n01 kernel: block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1
Jul 20 23:01:32 an-a02n01 kernel: block drbd0: peer( Secondary -> Primary ) 
Jul 20 23:01:32 an-a02n01 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 141(1), total 141; compression: 100.0%
Jul 20 23:01:32 an-a02n01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0
Jul 20 23:01:32 an-a02n01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
Jul 20 23:01:32 an-a02n01 kernel: block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0)
Jul 20 23:01:32 an-a02n01 kernel: block drbd1: conn( WFBitMapS -> SyncSource ) 
Jul 20 23:01:32 an-a02n01 kernel: block drbd1: Began resync as SyncSource (will sync 18645048 KB [4661262 bits set]).
Jul 20 23:01:32 an-a02n01 kernel: block drbd1: peer( Secondary -> Primary ) 
Jul 20 23:01:32 an-a02n01 kernel: block drbd1: updated sync UUID ED55D611AB6E9CF7:9F5808CFE6D0CBD8:9F5708CFE6D0CBD8:9F5608CFE6D0CBD8
Jul 20 23:01:32 an-a02n01 kernel: block drbd0: conn( WFBitMapS -> SyncSource ) 
Jul 20 23:01:32 an-a02n01 kernel: block drbd0: Began resync as SyncSource (will sync 31273316 KB [7818329 bits set]).
Jul 20 23:01:32 an-a02n01 kernel: block drbd0: updated sync UUID 032F06EC14D6AB2B:994CBE5EC23EFD19:994BBE5EC23EFD19:994ABE5EC23EFD19
=====================

At this point, service was restored, albeit after a rather dramatic repair.

Anything else?

Comment 14 Madison Kelly 2016-11-08 15:08:54 UTC
What info was missing?

