Bug 1157951
Summary: | rgmanager (clustat) sees a deleted VM, couldn't get rid of it, hung when stopping rgmanager | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Madison Kelly <mkelly> | |
Component: | rgmanager | Assignee: | Ryan McCabe <rmccabe> | |
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | cluster-qe <cluster-qe> | |
Severity: | medium | Docs Contact: | ||
Priority: | low | |||
Version: | 6.5 | CC: | cfeist, cluster-maint, fdinitto, jkortus, jpokorny | |
Target Milestone: | rc | |||
Target Release: | --- | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1166589 (view as bug list) | Environment: | ||
Last Closed: | 2016-11-08 14:25:06 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1166589 |
Description
Madison Kelly
2014-10-28 05:59:43 UTC
rgmanager.log, node1: Oct 27 15:13:32 rgmanager Reconfiguring rgmanager.log, node2: Oct 27 15:13:35 rgmanager Reconfiguring Digimer,
scratched head on this shortly, I think there is some progress:
0. assumption:
you originally used the same (or equivalent) command as later on, i.e.:
> ccs -h localhost --activate --sync --password "secret" --rmvm iRed2
--
1. "Updating cluster.conf" followed by symptoms of cluster.conf being
indeed propagated, shortly twice in row on nr-c03n01, seemed unnatural
and suspicious
->
2. indeed there is a bug in ccs causing following sequence:
- if (removevm): remove_vm(name)
-> set_cluster_conf (while "activate" holds ~ --activate,
only against localhost)
<spoiler-alert>
"activate" should be temporarily masked if "sync" is set
to prevent "double activate", just as the method below does
</spoiler-alert>
- if (sync): sync_cluster_conf()
-> set_cluster_conf (with "activate" masked,
against all nodes via cluster.conf hostnames)
-> set_cluster_conf (with "activate" unmasked, hence true as above,
only against the last enumerated node)
--
Bottom-line: there is still a bug in rgmanager in not being able, in some
circumstances, to deal with 2+ subsequent configuration updates in a very
very very short time frame (likely a race condition)
Good news: buggy ccs (in a sense, working, but less efficiently than
appropriate) helped to discover this bug :)
In my cases, the networks are generally really low latency on dedicated VLANed segments, so a race isn't entirely out of the question I suppose. Anything I can do on my side of things to help/test? In case it's related; I periodically run into issues where rgmanager won't stop. In about 75% of those cases, 'kill `pidof rgmanager`' causes it to actually shut down, but it about 25% of cases I have to fence the node. I'd saw this happens about 2~3% of the time, and I never see anything funning in the logs. > In my cases, the networks are generally really low latency on dedicated > VLANed segments, so a race isn't entirely out of the question I suppose. I suspect the fact you hit this only with VMs only so far may have more to do with virtual machine shutdown (?) being quite CPU intensive hence taking more time, increasing probability it won't fit into window between forced config reloads, without direct influence of network latency. But that's just a guessing game. > Anything I can do on my side of things to help/test? I don't think it's documented, but some more diagnostics can be obtained like this (as I've found out): # kill -USR1 $(ps -d --sort start | grep rgmanager | head -n1 \ | sed 's/\s*\([0-9]*\).*/\1/) # ls -al /var/lib/cluster/rgmanager-dump > -rw-r--r--. 1 root root 22151 Nov 21 16:45 /var/lib/cluster/rgmanager-dump # cat /var/lib/cluster/rgmanager-dump > Cluster configuration version 28 > > Resource Group Threads > [...] The output might help, but more likely it will be akin to clustat output. Then as a last resort measure -- attaching the debugger :) > In case it's related; I periodically run into issues where rgmanager > won't stop. In about 75% of those cases, 'kill `pidof rgmanager`' causes > it to actually shut down, but it about 25% of cases I have to fence > the node. I'd saw this happens about 2~3% of the time, and I never see > anything funning in the logs. IIRC that unability to terminate RGManager is caused by the locking part being wired down to kernel and its uninterruptible threads. The diagnostics here is usually something like "echo l > /proc/sysrq-trigger" and then you can find a potentially useful info in dmesg/messages/console (without guarantees). BTW. ccs part of this bug should be straightforward as a least resistance prevention: [bug 1166589]. Might/not be related: http://clusterlabs.org/pipermail/users/2015-July/000819.html Certainly related. Had it happen again over the weekend. It seems to trigger when I delete a VM. This time though, I didn't use ccs, I manually deleted the <vm /> element, ran ccs_config_validate and then cman_tool version -r. The peer didn't see the delete. Thankfully, the VM was running on the OK node, so I migrated the VMs off and tried to stop rgmanager, but it hung again. Oddly this time, I decided to try and bring rgmanager down. It would not respond to anything (kill`ed the two 'rgmanager stop' PIDs and the main rgmanager PID to no effect). So I started stopping resources manually. When I stopped gfs2, it unmounted and it was fine. Then I stopped clvmd and rgmanager finally reacted, marked the storage service as failed and then shut down. Very odd. Sorry, I don't have the logs from this most recent event. It may not matter, but I had ~12 VMs on the 2-node cluster with fully up to date 6.6. Three drbd resources backing three clustered LVM VGs. Each VM had it's own LV. I'm starting to be more and more convinced that this is somehow related to clustered LVM... I just had an odd case where the storage service (drbd -> clvmd -> gfs2) timed out starting on a node. When I tried to then stop rgmanager, it stuck and would not exit. Note that I had no VMs on the cluster yet. I ctrl+c'ed the '/etc/init.d/rgmanager stop' and then manually stopped gfs2, no reaction. I then stopped clvmd and suddenly rgmanager unblocked and everything shut down. This seems to be cropping up more and more of late, but I still don't have a reliable reproducer. I think I have a fairly reliable reproducer now. If you build a 2-node cluster with RHEL 6.6 on a pair of VMs (I am using my laptop with an SSD to back 2x 100GB qcow2 VMs with 1GB of RAM and fence_virsh), start cman, manually start drbd -> clvmd -> gfs2, then start rgmanager, one of the nodes will usually fail to actually start rgmanager. Rather, rgmanager starts, but clustat never shows 'rgmanager' under one of the nodes. It seems to be random as to which one fails to start. It may or may not have anything to do with drbd still sync'ing. I *think* this is a timing issue, though. It looks like the backing storage takes time to start, so clvmd can't see the PV on one of the nodes fast enough (it tears down the storage and then starts it back up when '/etc/init.d/rgmanager start' is called, rather than see that the services are already running). Once this happens, rgmanager is stuck. You can't kill it, stop it, start it or anything. However, if you then run '/etc/init.d/clvmd restart' on the stuck node, BAM, rgmanager unblocks and everything returns to normal. Here are the relevant config files: ===================== [root@an-a02n02 ~]# cat /etc/cluster/cluster.conf <?xml version="1.0"?> <cluster name="an-anvil-02" config_version="1"> <cman expected_votes="1" two_node="1" /> <clusternodes> <clusternode name="an-a02n01.alteeve.ca" nodeid="1"> <fence> <method name="kvm"> <device name="kvm_host" port="node1" delay="15" action="reboot" /> </method> </fence> </clusternode> <clusternode name="an-a02n02.alteeve.ca" nodeid="2"> <fence> <method name="kvm"> <device name="kvm_host" port="node2" action="reboot" /> </method> </fence> </clusternode> </clusternodes> <fencedevices> <fencedevice name="kvm_host" agent="fence_virsh" ipaddr="192.168.122.1" login="root" passwd_script="/root/lemass" /> </fencedevices> <fence_daemon post_join_delay="30" /> <totem rrp_mode="none" secauth="off"/> <rm log_level="5"> <resources> <script file="/etc/init.d/drbd" name="drbd"/> <script file="/etc/init.d/clvmd" name="clvmd"/> <clusterfs device="/dev/an-a02n01_vg0/shared" force_unmount="1" fstype="gfs2" mountpoint="/shared" name="sharedfs" /> <script file="/etc/init.d/libvirtd" name="libvirtd"/> </resources> <failoverdomains> <failoverdomain name="only_n01" nofailback="1" ordered="0" restricted="1"> <failoverdomainnode name="an-a02n01.alteeve.ca"/> </failoverdomain> <failoverdomain name="only_n02" nofailback="1" ordered="0" restricted="1"> <failoverdomainnode name="an-a02n02.alteeve.ca"/> </failoverdomain> <failoverdomain name="primary_n01" nofailback="1" ordered="1" restricted="1"> <failoverdomainnode name="an-a02n01.alteeve.ca" priority="1"/> <failoverdomainnode name="an-a02n02.alteeve.ca" priority="2"/> </failoverdomain> <failoverdomain name="primary_n02" nofailback="1" ordered="1" restricted="1"> <failoverdomainnode name="an-a02n01.alteeve.ca" priority="2"/> <failoverdomainnode name="an-a02n02.alteeve.ca" priority="1"/> </failoverdomain> </failoverdomains> <service name="storage_n01" autostart="1" domain="only_n01" exclusive="0" recovery="restart"> <script ref="drbd"> <script ref="clvmd"> <clusterfs ref="sharedfs"/> </script> </script> </service> <service name="storage_n02" autostart="1" domain="only_n02" exclusive="0" recovery="restart"> <script ref="drbd"> <script ref="clvmd"> <clusterfs ref="sharedfs"/> </script> </script> </service> <service name="libvirtd_n01" autostart="1" domain="only_n01" exclusive="0" recovery="restart"> <script ref="libvirtd"/> </service> <service name="libvirtd_n02" autostart="1" domain="only_n02" exclusive="0" recovery="restart"> <script ref="libvirtd"/> </service> </rm> </cluster> ===================== ===================== [root@an-a02n02 ~]# cat /etc/lvm/lvm.conf config { checks = 1 abort_on_errors = 0 profile_dir = "/etc/lvm/profile" } devices { dir = "/dev" scan = [ "/dev" ] obtain_device_list_from_udev = 0 preferred_names = [ "^/dev/mpath/", "^/dev/mapper/mpath", "^/dev/[hs]d" ] filter = [ "a|/dev/drbd*|", "r/.*/" ] cache_dir = "/etc/lvm/cache" cache_file_prefix = "" write_cache_state = 1 sysfs_scan = 1 multipath_component_detection = 1 md_component_detection = 1 md_chunk_alignment = 1 data_alignment_detection = 1 data_alignment = 0 data_alignment_offset_detection = 1 ignore_suspended_devices = 0 ignore_lvm_mirrors = 1 disable_after_error_count = 0 require_restorefile_with_uuid = 1 pv_min_size = 2048 issue_discards = 0 } allocation { maximise_cling = 1 use_blkid_wiping = 1 wipe_signatures_when_zeroing_new_lvs = 1 mirror_logs_require_separate_pvs = 0 cache_pool_metadata_require_separate_pvs = 0 thin_pool_metadata_require_separate_pvs = 0 } log { verbose = 0 silent = 0 syslog = 1 overwrite = 0 level = 0 indent = 1 command_names = 0 prefix = " " debug_classes = [ "memory", "devices", "activation", "allocation", "lvmetad", "metadata", "cache", "locking" ] } backup { backup = 1 backup_dir = "/etc/lvm/backup" archive = 1 archive_dir = "/etc/lvm/archive" retain_min = 10 retain_days = 30 } shell { history_size = 100 } global { umask = 077 test = 0 units = "h" si_unit_consistency = 1 suffix = 1 activation = 1 proc = "/proc" locking_type = 3 wait_for_locks = 1 fallback_to_clustered_locking = 1 fallback_to_local_locking = 0 locking_dir = "/var/lock/lvm" prioritise_write_locks = 1 abort_on_internal_errors = 0 detect_internal_vg_cache_corruption = 0 metadata_read_only = 0 mirror_segtype_default = "mirror" raid10_segtype_default = "mirror" use_lvmetad = 0 } activation { checks = 0 udev_sync = 1 udev_rules = 1 verify_udev_operations = 0 retry_deactivation = 1 missing_stripe_filler = "error" use_linear_target = 1 reserved_stack = 64 reserved_memory = 8192 process_priority = -18 raid_region_size = 512 readahead = "auto" raid_fault_policy = "warn" mirror_log_fault_policy = "allocate" mirror_image_fault_policy = "remove" snapshot_autoextend_threshold = 100 snapshot_autoextend_percent = 20 thin_pool_autoextend_threshold = 100 thin_pool_autoextend_percent = 20 use_mlockall = 0 monitoring = 1 polling_interval = 15 activation_mode = "degraded" } dmeventd { mirror_library = "libdevmapper-event-lvm2mirror.so" snapshot_library = "libdevmapper-event-lvm2snapshot.so" thin_library = "libdevmapper-event-lvm2thin.so" } ===================== I know RH doesn't do DRBD, but here's the config anyway: ===================== [root@an-a02n02 ~]# drbdadm dump # /etc/drbd.conf global { usage-count yes; cmd-timeout-medium 600; cmd-timeout-long 0; } common { net { protocol C; allow-two-primaries yes; after-sb-0pri discard-zero-changes; after-sb-1pri discard-secondary; after-sb-2pri disconnect; } disk { fencing resource-and-stonith; } startup { become-primary-on both; } handlers { fence-peer /sbin/rhcs_fence; } } # resource r0 on an-a02n02.alteeve.ca: not ignored, not stacked # defined at /etc/drbd.d/r0.res:4 resource r0 { on an-a02n01.alteeve.ca { volume 0 { device /dev/drbd0 minor 0; disk /dev/vda5; meta-disk internal; } address ipv4 10.10.20.1:7788; } on an-a02n02.alteeve.ca { volume 0 { device /dev/drbd0 minor 0; disk /dev/vda5; meta-disk internal; } address ipv4 10.10.20.2:7788; } disk { disk-flushes no; md-flushes no; } } # resource r1 on an-a02n02.alteeve.ca: not ignored, not stacked # defined at /etc/drbd.d/r1.res:3 resource r1 { on an-a02n01.alteeve.ca { volume 0 { device /dev/drbd1 minor 1; disk /dev/vda6; meta-disk internal; } address ipv4 10.10.20.1:7789; } on an-a02n02.alteeve.ca { volume 0 { device /dev/drbd1 minor 1; disk /dev/vda6; meta-disk internal; } address ipv4 10.10.20.2:7789; } disk { disk-flushes no; md-flushes no; } } [root@an-a02n02 ~]# cat /proc/drbd version: 8.4.6 (api:1/proto:86-101) GIT-hash: 833d830e0152d1e457fa7856e71e11248ccf3f70 build by root.ca, 2015-07-18 16:09:46 0: cs:SyncTarget ro:Primary/Primary ds:Inconsistent/UpToDate C r----- ns:0 nr:772352 dw:772236 dr:4168 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:30501080 [>....................] sync'ed: 2.5% (29784/30540)M finish: 5:58:09 speed: 1,408 (2,968) want: 440 K/sec 1: cs:SyncTarget ro:Primary/Primary ds:Inconsistent/UpToDate C r----- ns:0 nr:618348 dw:618260 dr:1556 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:18026788 [>....................] sync'ed: 3.4% (17604/18208)M finish: 2:38:07 speed: 1,896 (2,376) want: 6,040 K/sec ===================== RPMs: ===================== [root@an-a02n02 ~]# rpm -q cman rgmanager lvm2-cluster gfs2-utils cman-3.0.12.1-68.el6_6.1.x86_64 rgmanager-3.0.12.1-21.el6.x86_64 lvm2-cluster-2.02.111-2.el6_6.3.x86_64 gfs2-utils-3.0.12.1-68.el6_6.1.x86_64 ===================== Here are the logs on the node the froze (node2) ===================== Jul 20 22:56:12 an-a02n02 rgmanager[3208]: I am node #2 Jul 20 22:56:12 an-a02n02 rgmanager[3208]: Resource Group Manager Starting Jul 20 22:56:16 an-a02n02 kernel: block drbd0: peer( Primary -> Secondary ) Jul 20 22:56:16 an-a02n02 kernel: block drbd0: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: block drbd0: state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- } Jul 20 22:56:16 an-a02n02 kernel: block drbd0: wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/DUnknown s---F- } Jul 20 22:56:16 an-a02n02 kernel: drbd r0: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: drbd r0: mask = 0x1f0 val = 0x70 Jul 20 22:56:16 an-a02n02 kernel: drbd r0: old_conn:WFReportParams wanted_conn:TearDown Jul 20 22:56:16 an-a02n02 kernel: block drbd0: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: block drbd0: state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- } Jul 20 22:56:16 an-a02n02 kernel: block drbd0: wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/Outdated r----- } Jul 20 22:56:16 an-a02n02 kernel: drbd r0: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: drbd r0: mask = 0x1e1f0 val = 0xa070 Jul 20 22:56:16 an-a02n02 kernel: drbd r0: old_conn:WFReportParams wanted_conn:TearDown Jul 20 22:56:16 an-a02n02 kernel: block drbd1: peer( Primary -> Secondary ) Jul 20 22:56:16 an-a02n02 kernel: block drbd1: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: block drbd1: state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- } Jul 20 22:56:16 an-a02n02 kernel: block drbd1: wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/DUnknown s---F- } Jul 20 22:56:16 an-a02n02 kernel: drbd r1: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: drbd r1: mask = 0x1f0 val = 0x70 Jul 20 22:56:16 an-a02n02 kernel: drbd r1: old_conn:WFReportParams wanted_conn:TearDown Jul 20 22:56:16 an-a02n02 kernel: block drbd1: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: block drbd1: state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- } Jul 20 22:56:16 an-a02n02 kernel: block drbd1: wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/Outdated r----- } Jul 20 22:56:16 an-a02n02 kernel: drbd r1: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: drbd r1: mask = 0x1e1f0 val = 0xa070 Jul 20 22:56:16 an-a02n02 kernel: drbd r1: old_conn:WFReportParams wanted_conn:TearDown Jul 20 22:56:16 an-a02n02 kernel: block drbd0: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: block drbd0: state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- } Jul 20 22:56:16 an-a02n02 kernel: block drbd0: wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/DUnknown s---F- } Jul 20 22:56:16 an-a02n02 kernel: drbd r0: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: drbd r0: mask = 0x1f0 val = 0x70 Jul 20 22:56:16 an-a02n02 kernel: drbd r0: old_conn:WFReportParams wanted_conn:TearDown Jul 20 22:56:16 an-a02n02 kernel: block drbd0: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: block drbd0: state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- } Jul 20 22:56:16 an-a02n02 kernel: block drbd0: wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/Outdated r----- } Jul 20 22:56:16 an-a02n02 kernel: drbd r0: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: drbd r0: mask = 0x1e1f0 val = 0xa070 Jul 20 22:56:16 an-a02n02 kernel: drbd r0: old_conn:WFReportParams wanted_conn:TearDown Jul 20 22:56:16 an-a02n02 kernel: block drbd1: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: block drbd1: state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- } Jul 20 22:56:16 an-a02n02 kernel: block drbd1: wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/DUnknown s---F- } Jul 20 22:56:16 an-a02n02 kernel: drbd r1: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: drbd r1: mask = 0x1f0 val = 0x70 Jul 20 22:56:16 an-a02n02 kernel: drbd r1: old_conn:WFReportParams wanted_conn:TearDown Jul 20 22:56:16 an-a02n02 kernel: block drbd1: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: block drbd1: state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- } Jul 20 22:56:16 an-a02n02 kernel: block drbd1: wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/Outdated r----- } Jul 20 22:56:16 an-a02n02 kernel: drbd r1: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: drbd r1: mask = 0x1e1f0 val = 0xa070 Jul 20 22:56:16 an-a02n02 kernel: drbd r1: old_conn:WFReportParams wanted_conn:TearDown Jul 20 22:56:16 an-a02n02 kernel: block drbd0: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: block drbd0: state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- } Jul 20 22:56:16 an-a02n02 kernel: block drbd0: wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/DUnknown s---F- } Jul 20 22:56:16 an-a02n02 kernel: drbd r0: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: drbd r0: mask = 0x1f0 val = 0x70 Jul 20 22:56:16 an-a02n02 kernel: drbd r0: old_conn:WFReportParams wanted_conn:TearDown Jul 20 22:56:16 an-a02n02 kernel: block drbd0: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: block drbd0: state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- } Jul 20 22:56:16 an-a02n02 kernel: block drbd0: wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/Outdated r----- } Jul 20 22:56:16 an-a02n02 kernel: drbd r0: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: drbd r0: mask = 0x1e1f0 val = 0xa070 Jul 20 22:56:16 an-a02n02 kernel: drbd r0: old_conn:WFReportParams wanted_conn:TearDown Jul 20 22:56:16 an-a02n02 kernel: block drbd1: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: block drbd1: state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- } Jul 20 22:56:16 an-a02n02 kernel: block drbd1: wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/DUnknown s---F- } Jul 20 22:56:16 an-a02n02 kernel: drbd r1: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: drbd r1: mask = 0x1f0 val = 0x70 Jul 20 22:56:16 an-a02n02 kernel: drbd r1: old_conn:WFReportParams wanted_conn:TearDown Jul 20 22:56:16 an-a02n02 kernel: block drbd1: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: block drbd1: state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- } Jul 20 22:56:16 an-a02n02 kernel: block drbd1: wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/Outdated r----- } Jul 20 22:56:16 an-a02n02 kernel: drbd r1: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: drbd r1: mask = 0x1e1f0 val = 0xa070 Jul 20 22:56:16 an-a02n02 kernel: drbd r1: old_conn:WFReportParams wanted_conn:TearDown Jul 20 22:56:16 an-a02n02 kernel: block drbd0: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: block drbd0: state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- } Jul 20 22:56:16 an-a02n02 kernel: block drbd0: wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/DUnknown s---F- } Jul 20 22:56:16 an-a02n02 kernel: drbd r0: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: drbd r0: mask = 0x1f0 val = 0x70 Jul 20 22:56:16 an-a02n02 kernel: drbd r0: old_conn:WFReportParams wanted_conn:TearDown Jul 20 22:56:16 an-a02n02 kernel: block drbd0: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: block drbd0: state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- } Jul 20 22:56:16 an-a02n02 kernel: block drbd0: wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/Outdated r----- } Jul 20 22:56:16 an-a02n02 kernel: drbd r0: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: drbd r0: mask = 0x1e1f0 val = 0xa070 Jul 20 22:56:16 an-a02n02 kernel: drbd r0: old_conn:WFReportParams wanted_conn:TearDown Jul 20 22:56:16 an-a02n02 kernel: block drbd1: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: block drbd1: state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- } Jul 20 22:56:16 an-a02n02 kernel: block drbd1: wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/DUnknown s---F- } Jul 20 22:56:16 an-a02n02 kernel: drbd r1: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: drbd r1: mask = 0x1f0 val = 0x70 Jul 20 22:56:16 an-a02n02 kernel: drbd r1: old_conn:WFReportParams wanted_conn:TearDown Jul 20 22:56:16 an-a02n02 kernel: block drbd1: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: block drbd1: state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r----- } Jul 20 22:56:16 an-a02n02 kernel: block drbd1: wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/Outdated r----- } Jul 20 22:56:16 an-a02n02 kernel: drbd r1: State change failed: Need access to UpToDate data Jul 20 22:56:16 an-a02n02 kernel: drbd r1: mask = 0x1e1f0 val = 0xa070 Jul 20 22:56:16 an-a02n02 kernel: drbd r1: old_conn:WFReportParams wanted_conn:TearDown Jul 20 22:56:17 an-a02n02 kernel: block drbd0: peer( Secondary -> Primary ) Jul 20 22:56:17 an-a02n02 kernel: block drbd1: peer( Secondary -> Primary ) ===================== Here is where it stopped. Note the extra time needed to start rgmanager? Until this point, clvmd would not have been able to see its PV. This is what was logged when I ran '/etc/init.d/clvmd restart' (note that 'start' alone wouldn't unjam it). ===================== Jul 20 23:01:30 an-a02n02 rgmanager[6098]: [script] script:clvmd: stop of /etc/init.d/clvmd failed (returned 5) Jul 20 23:01:30 an-a02n02 rgmanager[3208]: stop on script "clvmd" returned 1 (generic error) Jul 20 23:01:30 an-a02n02 clvmd: Cluster LVM daemon started - connected to CMAN Jul 20 23:01:30 an-a02n02 kernel: block drbd0: State change failed: Device is held open by someone Jul 20 23:01:30 an-a02n02 kernel: block drbd0: state = { cs:SyncTarget ro:Primary/Primary ds:Inconsistent/UpToDate r----- } Jul 20 23:01:30 an-a02n02 kernel: block drbd0: wanted = { cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate r----- } Jul 20 23:01:30 an-a02n02 kernel: block drbd1: role( Primary -> Secondary ) Jul 20 23:01:30 an-a02n02 kernel: drbd r1: peer( Primary -> Unknown ) conn( SyncTarget -> Disconnecting ) pdsk( UpToDate -> DUnknown ) Jul 20 23:01:30 an-a02n02 kernel: drbd r1: ack_receiver terminated Jul 20 23:01:30 an-a02n02 kernel: drbd r1: Terminating drbd_a_r1 Jul 20 23:01:30 an-a02n02 kernel: drbd r1: Connection closed Jul 20 23:01:30 an-a02n02 kernel: drbd r1: conn( Disconnecting -> StandAlone ) Jul 20 23:01:30 an-a02n02 kernel: drbd r1: receiver terminated Jul 20 23:01:30 an-a02n02 kernel: drbd r1: Terminating drbd_r_r1 Jul 20 23:01:30 an-a02n02 kernel: block drbd1: disk( Inconsistent -> Failed ) Jul 20 23:01:30 an-a02n02 kernel: block drbd1: bitmap WRITE of 0 pages took 0 jiffies Jul 20 23:01:30 an-a02n02 kernel: block drbd1: 18 GB (4661262 bits) marked out-of-sync by on disk bit-map. Jul 20 23:01:30 an-a02n02 kernel: block drbd1: disk( Failed -> Diskless ) Jul 20 23:01:30 an-a02n02 kernel: drbd r1: Terminating drbd_w_r1 Jul 20 23:01:30 an-a02n02 kernel: block drbd0: role( Primary -> Secondary ) Jul 20 23:01:30 an-a02n02 kernel: drbd r0: peer( Primary -> Unknown ) conn( SyncTarget -> Disconnecting ) pdsk( UpToDate -> DUnknown ) Jul 20 23:01:30 an-a02n02 kernel: drbd r0: ack_receiver terminated Jul 20 23:01:30 an-a02n02 kernel: drbd r0: Terminating drbd_a_r0 Jul 20 23:01:30 an-a02n02 kernel: drbd r0: Connection closed Jul 20 23:01:30 an-a02n02 kernel: drbd r0: conn( Disconnecting -> StandAlone ) Jul 20 23:01:30 an-a02n02 kernel: drbd r0: receiver terminated Jul 20 23:01:30 an-a02n02 kernel: drbd r0: Terminating drbd_r_r0 Jul 20 23:01:30 an-a02n02 kernel: block drbd0: disk( Inconsistent -> Failed ) Jul 20 23:01:30 an-a02n02 kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies Jul 20 23:01:30 an-a02n02 kernel: block drbd0: 30 GB (7818329 bits) marked out-of-sync by on disk bit-map. Jul 20 23:01:30 an-a02n02 kernel: block drbd0: disk( Failed -> Diskless ) Jul 20 23:01:30 an-a02n02 kernel: drbd r0: Terminating drbd_w_r0 Jul 20 23:01:30 an-a02n02 kernel: drbd: module cleanup done. Jul 20 23:01:31 an-a02n02 rgmanager[3208]: Starting stopped service service:storage_n02 Jul 20 23:01:32 an-a02n02 kernel: drbd: events: mcg drbd: 2 Jul 20 23:01:32 an-a02n02 kernel: drbd: initialized. Version: 8.4.6 (api:1/proto:86-101) Jul 20 23:01:32 an-a02n02 kernel: drbd: GIT-hash: 833d830e0152d1e457fa7856e71e11248ccf3f70 build by root.ca, 2015-07-18 16:09:46 Jul 20 23:01:32 an-a02n02 kernel: drbd: registered as block device major 147 Jul 20 23:01:32 an-a02n02 kernel: drbd r0: Starting worker thread (from drbdsetup-84 [6418]) Jul 20 23:01:32 an-a02n02 kernel: block drbd0: disk( Diskless -> Attaching ) Jul 20 23:01:32 an-a02n02 kernel: drbd r0: Method to ensure write ordering: drain Jul 20 23:01:32 an-a02n02 kernel: block drbd0: max BIO size = 1048576 Jul 20 23:01:32 an-a02n02 kernel: block drbd0: drbd_bm_resize called with capacity == 69199736 Jul 20 23:01:32 an-a02n02 kernel: block drbd0: resync bitmap: bits=8649967 words=135156 pages=264 Jul 20 23:01:32 an-a02n02 kernel: block drbd0: size = 33 GB (34599868 KB) Jul 20 23:01:32 an-a02n02 kernel: block drbd0: recounting of set bits took additional 1 jiffies Jul 20 23:01:32 an-a02n02 kernel: block drbd0: 30 GB (7818329 bits) marked out-of-sync by on disk bit-map. Jul 20 23:01:32 an-a02n02 kernel: block drbd0: disk( Attaching -> Inconsistent ) Jul 20 23:01:32 an-a02n02 kernel: block drbd0: attached to UUIDs 994BBE5EC23EFD18:0000000000000000:0000000000000000:0000000000000000 Jul 20 23:01:32 an-a02n02 kernel: drbd r1: Starting worker thread (from drbdsetup-84 [6425]) Jul 20 23:01:32 an-a02n02 kernel: block drbd1: disk( Diskless -> Attaching ) Jul 20 23:01:32 an-a02n02 kernel: drbd r1: Method to ensure write ordering: drain Jul 20 23:01:32 an-a02n02 kernel: block drbd1: max BIO size = 1048576 Jul 20 23:01:32 an-a02n02 kernel: block drbd1: drbd_bm_resize called with capacity == 47182360 Jul 20 23:01:32 an-a02n02 kernel: block drbd1: resync bitmap: bits=5897795 words=92154 pages=180 Jul 20 23:01:32 an-a02n02 kernel: block drbd1: size = 22 GB (23591180 KB) Jul 20 23:01:32 an-a02n02 kernel: block drbd1: recounting of set bits took additional 0 jiffies Jul 20 23:01:32 an-a02n02 kernel: block drbd1: 18 GB (4661262 bits) marked out-of-sync by on disk bit-map. Jul 20 23:01:32 an-a02n02 kernel: block drbd1: disk( Attaching -> Inconsistent ) Jul 20 23:01:32 an-a02n02 kernel: block drbd1: attached to UUIDs 9F5708CFE6D0CBD8:0000000000000000:0000000000000000:0000000000000000 Jul 20 23:01:32 an-a02n02 kernel: drbd r0: conn( StandAlone -> Unconnected ) Jul 20 23:01:32 an-a02n02 kernel: drbd r0: Starting receiver thread (from drbd_w_r0 [6419]) Jul 20 23:01:32 an-a02n02 kernel: drbd r0: receiver (re)started Jul 20 23:01:32 an-a02n02 kernel: drbd r0: conn( Unconnected -> WFConnection ) Jul 20 23:01:32 an-a02n02 kernel: drbd r1: conn( StandAlone -> Unconnected ) Jul 20 23:01:32 an-a02n02 kernel: drbd r1: Starting receiver thread (from drbd_w_r1 [6426]) Jul 20 23:01:32 an-a02n02 kernel: drbd r1: receiver (re)started Jul 20 23:01:32 an-a02n02 kernel: drbd r1: conn( Unconnected -> WFConnection ) Jul 20 23:01:32 an-a02n02 rgmanager[3208]: Starting stopped service service:libvirtd_n02 Jul 20 23:01:32 an-a02n02 rgmanager[3208]: Service service:libvirtd_n02 started Jul 20 23:01:32 an-a02n02 kernel: lo: Disabled Privacy Extensions Jul 20 23:01:32 an-a02n02 kernel: drbd r0: Handshake successful: Agreed network protocol version 101 Jul 20 23:01:32 an-a02n02 kernel: drbd r0: Agreed to support TRIM on protocol level Jul 20 23:01:32 an-a02n02 kernel: drbd r0: conn( WFConnection -> WFReportParams ) Jul 20 23:01:32 an-a02n02 kernel: drbd r0: Starting ack_recv thread (from drbd_r_r0 [6431]) Jul 20 23:01:32 an-a02n02 kernel: drbd r1: Handshake successful: Agreed network protocol version 101 Jul 20 23:01:32 an-a02n02 kernel: drbd r1: Agreed to support TRIM on protocol level Jul 20 23:01:32 an-a02n02 kernel: drbd r1: conn( WFConnection -> WFReportParams ) Jul 20 23:01:32 an-a02n02 kernel: drbd r1: Starting ack_recv thread (from drbd_r_r1 [6433]) Jul 20 23:01:32 an-a02n02 kernel: block drbd0: drbd_sync_handshake: Jul 20 23:01:32 an-a02n02 kernel: block drbd0: self 994BBE5EC23EFD18:0000000000000000:0000000000000000:0000000000000000 bits:7818329 flags:0 Jul 20 23:01:32 an-a02n02 kernel: block drbd0: peer 032F06EC14D6AB2B:994BBE5EC23EFD19:994ABE5EC23EFD19:9949BE5EC23EFD19 bits:7818329 flags:0 Jul 20 23:01:32 an-a02n02 kernel: block drbd0: uuid_compare()=-1 by rule 50 Jul 20 23:01:32 an-a02n02 kernel: block drbd0: Becoming sync target due to disk states. Jul 20 23:01:32 an-a02n02 kernel: block drbd1: drbd_sync_handshake: Jul 20 23:01:32 an-a02n02 kernel: block drbd1: self 9F5708CFE6D0CBD8:0000000000000000:0000000000000000:0000000000000000 bits:4661262 flags:0 Jul 20 23:01:32 an-a02n02 kernel: block drbd1: peer ED55D611AB6E9CF7:9F5708CFE6D0CBD8:9F5608CFE6D0CBD8:9F5508CFE6D0CBD8 bits:4661262 flags:0 Jul 20 23:01:32 an-a02n02 kernel: block drbd1: uuid_compare()=-1 by rule 50 Jul 20 23:01:32 an-a02n02 kernel: block drbd1: Becoming sync target due to disk states. Jul 20 23:01:32 an-a02n02 kernel: block drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) Jul 20 23:01:32 an-a02n02 kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) Jul 20 23:01:32 an-a02n02 kernel: block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 26(1), total 26; compression: 100.0% Jul 20 23:01:32 an-a02n02 kernel: block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 26(1), total 26; compression: 100.0% Jul 20 23:01:32 an-a02n02 kernel: block drbd1: conn( WFBitMapT -> WFSyncUUID ) Jul 20 23:01:32 an-a02n02 kernel: block drbd0: role( Secondary -> Primary ) Jul 20 23:01:32 an-a02n02 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 141(1), total 141; compression: 100.0% Jul 20 23:01:32 an-a02n02 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 141(1), total 141; compression: 100.0% Jul 20 23:01:32 an-a02n02 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID ) Jul 20 23:01:32 an-a02n02 kernel: block drbd1: role( Secondary -> Primary ) Jul 20 23:01:32 an-a02n02 kernel: block drbd1: updated sync uuid 9F5808CFE6D0CBD9:0000000000000000:0000000000000000:0000000000000000 Jul 20 23:01:32 an-a02n02 kernel: block drbd0: updated sync uuid 994CBE5EC23EFD19:0000000000000000:0000000000000000:0000000000000000 Jul 20 23:01:32 an-a02n02 kernel: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 Jul 20 23:01:32 an-a02n02 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 Jul 20 23:01:32 an-a02n02 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0) Jul 20 23:01:32 an-a02n02 kernel: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0) Jul 20 23:01:32 an-a02n02 kernel: block drbd1: conn( WFSyncUUID -> SyncTarget ) Jul 20 23:01:32 an-a02n02 kernel: block drbd1: Began resync as SyncTarget (will sync 18645048 KB [4661262 bits set]). Jul 20 23:01:32 an-a02n02 kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) Jul 20 23:01:32 an-a02n02 kernel: block drbd0: Began resync as SyncTarget (will sync 31273316 KB [7818329 bits set]). Jul 20 23:01:34 an-a02n02 clvmd: Cluster LVM daemon started - connected to CMAN Jul 20 23:01:34 an-a02n02 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "an-anvil-02:shared" Jul 20 23:01:34 an-a02n02 kernel: GFS2: fsid=an-anvil-02:shared.1: Joined cluster. Now mounting FS... Jul 20 23:01:34 an-a02n02 kernel: GFS2: fsid=an-anvil-02:shared.1: jid=1, already locked for use Jul 20 23:01:34 an-a02n02 kernel: GFS2: fsid=an-anvil-02:shared.1: jid=1: Looking at journal... Jul 20 23:01:34 an-a02n02 kernel: GFS2: fsid=an-anvil-02:shared.1: jid=1: Done Jul 20 23:01:34 an-a02n02 rgmanager[3208]: Service service:storage_n02 started ===================== Here are the logs from the node (node 1) that started properly: ===================== Jul 20 22:56:12 an-a02n01 rgmanager[1142]: I am node #1 Jul 20 22:56:12 an-a02n01 rgmanager[1142]: Resource Group Manager Starting Jul 20 22:56:16 an-a02n01 kernel: block drbd0: role( Primary -> Secondary ) Jul 20 22:56:16 an-a02n01 kernel: drbd r0: Requested state change failed by peer: Need access to UpToDate data (-2) Jul 20 22:56:16 an-a02n01 kernel: drbd r0: Requested state change failed by peer: Need access to UpToDate data (-2) Jul 20 22:56:16 an-a02n01 kernel: block drbd1: role( Primary -> Secondary ) Jul 20 22:56:16 an-a02n01 kernel: drbd r1: Requested state change failed by peer: Need access to UpToDate data (-2) Jul 20 22:56:16 an-a02n01 kernel: drbd r1: Requested state change failed by peer: Need access to UpToDate data (-2) Jul 20 22:56:16 an-a02n01 kernel: drbd r0: Requested state change failed by peer: Need access to UpToDate data (-2) Jul 20 22:56:16 an-a02n01 kernel: drbd r0: Requested state change failed by peer: Need access to UpToDate data (-2) Jul 20 22:56:16 an-a02n01 kernel: drbd r1: Requested state change failed by peer: Need access to UpToDate data (-2) Jul 20 22:56:16 an-a02n01 kernel: drbd r1: Requested state change failed by peer: Need access to UpToDate data (-2) Jul 20 22:56:16 an-a02n01 kernel: drbd r0: Requested state change failed by peer: Need access to UpToDate data (-2) Jul 20 22:56:16 an-a02n01 kernel: drbd r0: Requested state change failed by peer: Need access to UpToDate data (-2) Jul 20 22:56:16 an-a02n01 kernel: drbd r1: Requested state change failed by peer: Need access to UpToDate data (-2) Jul 20 22:56:16 an-a02n01 kernel: drbd r1: Requested state change failed by peer: Need access to UpToDate data (-2) Jul 20 22:56:16 an-a02n01 kernel: drbd r0: Requested state change failed by peer: Need access to UpToDate data (-2) Jul 20 22:56:16 an-a02n01 kernel: drbd r0: Requested state change failed by peer: Need access to UpToDate data (-2) Jul 20 22:56:16 an-a02n01 kernel: drbd r1: Requested state change failed by peer: Need access to UpToDate data (-2) Jul 20 22:56:16 an-a02n01 kernel: drbd r1: Requested state change failed by peer: Need access to UpToDate data (-2) Jul 20 22:56:17 an-a02n01 rgmanager[1142]: Marking service:storage_n02 as stopped: Restricted domain unavailable Jul 20 22:56:17 an-a02n01 rgmanager[1142]: Starting stopped service service:storage_n01 Jul 20 22:56:17 an-a02n01 kernel: block drbd0: role( Secondary -> Primary ) Jul 20 22:56:17 an-a02n01 kernel: block drbd1: role( Secondary -> Primary ) Jul 20 22:56:17 an-a02n01 rgmanager[1142]: Starting stopped service service:libvirtd_n01 Jul 20 22:56:17 an-a02n01 rgmanager[1142]: Service service:libvirtd_n01 started Jul 20 22:56:17 an-a02n01 kernel: lo: Disabled Privacy Extensions Jul 20 22:56:18 an-a02n01 clvmd: Cluster LVM daemon started - connected to CMAN Jul 20 22:56:19 an-a02n01 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "an-anvil-02:shared" Jul 20 22:56:19 an-a02n01 kernel: GFS2: fsid=an-anvil-02:shared.0: Joined cluster. Now mounting FS... Jul 20 22:56:19 an-a02n01 kernel: GFS2: fsid=an-anvil-02:shared.0: jid=0, already locked for use Jul 20 22:56:19 an-a02n01 kernel: GFS2: fsid=an-anvil-02:shared.0: jid=0: Looking at journal... Jul 20 22:56:19 an-a02n01 kernel: GFS2: fsid=an-anvil-02:shared.0: jid=0: Done Jul 20 22:56:19 an-a02n01 kernel: GFS2: fsid=an-anvil-02:shared.0: jid=1: Trying to acquire journal lock... Jul 20 22:56:19 an-a02n01 kernel: GFS2: fsid=an-anvil-02:shared.0: jid=1: Looking at journal... Jul 20 22:56:19 an-a02n01 kernel: GFS2: fsid=an-anvil-02:shared.0: jid=1: Done Jul 20 22:56:19 an-a02n01 rgmanager[1142]: Service service:storage_n01 started ===================== Here you see that the peer restarted it's storage service when clvmd was restarted. ===================== Jul 20 23:01:30 an-a02n01 kernel: block drbd1: peer( Primary -> Secondary ) Jul 20 23:01:30 an-a02n01 kernel: drbd r1: peer( Secondary -> Unknown ) conn( SyncSource -> TearDown ) Jul 20 23:01:30 an-a02n01 kernel: drbd r1: ack_receiver terminated Jul 20 23:01:30 an-a02n01 kernel: drbd r1: Terminating drbd_a_r1 Jul 20 23:01:30 an-a02n01 kernel: drbd r1: Connection closed Jul 20 23:01:30 an-a02n01 kernel: drbd r1: conn( TearDown -> Unconnected ) Jul 20 23:01:30 an-a02n01 kernel: drbd r1: receiver terminated Jul 20 23:01:30 an-a02n01 kernel: drbd r1: Restarting receiver thread Jul 20 23:01:30 an-a02n01 kernel: drbd r1: receiver (re)started Jul 20 23:01:30 an-a02n01 kernel: drbd r1: conn( Unconnected -> WFConnection ) Jul 20 23:01:30 an-a02n01 kernel: block drbd0: peer( Primary -> Secondary ) Jul 20 23:01:30 an-a02n01 kernel: drbd r0: peer( Secondary -> Unknown ) conn( SyncSource -> TearDown ) Jul 20 23:01:30 an-a02n01 kernel: drbd r0: ack_receiver terminated Jul 20 23:01:30 an-a02n01 kernel: drbd r0: Terminating drbd_a_r0 Jul 20 23:01:30 an-a02n01 kernel: drbd r0: Connection closed Jul 20 23:01:30 an-a02n01 kernel: drbd r0: conn( TearDown -> Unconnected ) Jul 20 23:01:30 an-a02n01 kernel: drbd r0: receiver terminated Jul 20 23:01:30 an-a02n01 kernel: drbd r0: Restarting receiver thread Jul 20 23:01:30 an-a02n01 kernel: drbd r0: receiver (re)started Jul 20 23:01:30 an-a02n01 kernel: drbd r0: conn( Unconnected -> WFConnection ) Jul 20 23:01:32 an-a02n01 kernel: drbd r0: Handshake successful: Agreed network protocol version 101 Jul 20 23:01:32 an-a02n01 kernel: drbd r0: Agreed to support TRIM on protocol level Jul 20 23:01:32 an-a02n01 kernel: drbd r0: conn( WFConnection -> WFReportParams ) Jul 20 23:01:32 an-a02n01 kernel: drbd r0: Starting ack_recv thread (from drbd_r_r0 [27478]) Jul 20 23:01:32 an-a02n01 kernel: drbd r1: Handshake successful: Agreed network protocol version 101 Jul 20 23:01:32 an-a02n01 kernel: drbd r1: Agreed to support TRIM on protocol level Jul 20 23:01:32 an-a02n01 kernel: drbd r1: conn( WFConnection -> WFReportParams ) Jul 20 23:01:32 an-a02n01 kernel: drbd r1: Starting ack_recv thread (from drbd_r_r1 [27510]) Jul 20 23:01:32 an-a02n01 kernel: block drbd0: drbd_sync_handshake: Jul 20 23:01:32 an-a02n01 kernel: block drbd0: self 032F06EC14D6AB2B:994BBE5EC23EFD19:994ABE5EC23EFD19:9949BE5EC23EFD19 bits:7818329 flags:0 Jul 20 23:01:32 an-a02n01 kernel: block drbd1: drbd_sync_handshake: Jul 20 23:01:32 an-a02n01 kernel: block drbd1: self ED55D611AB6E9CF7:9F5708CFE6D0CBD8:9F5608CFE6D0CBD8:9F5508CFE6D0CBD8 bits:4661262 flags:0 Jul 20 23:01:32 an-a02n01 kernel: block drbd1: peer 9F5708CFE6D0CBD8:0000000000000000:0000000000000000:0000000000000000 bits:4661262 flags:0 Jul 20 23:01:32 an-a02n01 kernel: block drbd1: uuid_compare()=1 by rule 70 Jul 20 23:01:32 an-a02n01 kernel: block drbd1: Becoming sync source due to disk states. Jul 20 23:01:32 an-a02n01 kernel: block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) Jul 20 23:01:32 an-a02n01 kernel: block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 26(1), total 26; compression: 100.0% Jul 20 23:01:32 an-a02n01 kernel: block drbd0: peer 994BBE5EC23EFD18:0000000000000000:0000000000000000:0000000000000000 bits:7818329 flags:0 Jul 20 23:01:32 an-a02n01 kernel: block drbd0: uuid_compare()=1 by rule 70 Jul 20 23:01:32 an-a02n01 kernel: block drbd0: Becoming sync source due to disk states. Jul 20 23:01:32 an-a02n01 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) Jul 20 23:01:32 an-a02n01 kernel: block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 26(1), total 26; compression: 100.0% Jul 20 23:01:32 an-a02n01 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 141(1), total 141; compression: 100.0% Jul 20 23:01:32 an-a02n01 kernel: block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 Jul 20 23:01:32 an-a02n01 kernel: block drbd0: peer( Secondary -> Primary ) Jul 20 23:01:32 an-a02n01 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 141(1), total 141; compression: 100.0% Jul 20 23:01:32 an-a02n01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 Jul 20 23:01:32 an-a02n01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0) Jul 20 23:01:32 an-a02n01 kernel: block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0) Jul 20 23:01:32 an-a02n01 kernel: block drbd1: conn( WFBitMapS -> SyncSource ) Jul 20 23:01:32 an-a02n01 kernel: block drbd1: Began resync as SyncSource (will sync 18645048 KB [4661262 bits set]). Jul 20 23:01:32 an-a02n01 kernel: block drbd1: peer( Secondary -> Primary ) Jul 20 23:01:32 an-a02n01 kernel: block drbd1: updated sync UUID ED55D611AB6E9CF7:9F5808CFE6D0CBD8:9F5708CFE6D0CBD8:9F5608CFE6D0CBD8 Jul 20 23:01:32 an-a02n01 kernel: block drbd0: conn( WFBitMapS -> SyncSource ) Jul 20 23:01:32 an-a02n01 kernel: block drbd0: Began resync as SyncSource (will sync 31273316 KB [7818329 bits set]). Jul 20 23:01:32 an-a02n01 kernel: block drbd0: updated sync UUID 032F06EC14D6AB2B:994CBE5EC23EFD19:994BBE5EC23EFD19:994ABE5EC23EFD19 ===================== At this point, service was restored, if after a rather dramatic repair. Anything else? What info was missing? |