Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1053617

Summary: clvmd dies and causes all the nodes in the cluster to fence themselves (and reboot at once) when PVs fail in a mirror
Product: Red Hat Enterprise Linux 7
Reporter: Nenad Peric <nperic>
Component: lvm2
Assignee: LVM and device-mapper development team <lvm-team>
lvm2 sub component: Default / Unclassified
QA Contact: cluster-qe <cluster-qe>
Status: CLOSED CURRENTRELEASE
Docs Contact:
Severity: medium
Priority: medium
CC: agk, cmarthal, djansa, heinzm, jbrassow, msnitzer, prajnoha, prockai, thornber, zkabelac
Version: 7.0
Keywords: TestBlocker, Triaged
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-06-13 09:56:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
lvm-messages a tail of LVM messages from /var/log/messages none

Description Nenad Peric 2014-01-15 14:06:27 UTC
Created attachment 850531 [details]
lvm-messages a tail of LVM messages from /var/log/messages

Description of problem:

When running revolution_9 test scenario kill_secondary_and_log_synced_4_legs, clvmd crashes on all cluster nodes causing the cluster to reboot. 

********* Mirror hash info for this scenario *********
* names:              syncd_secondary_log_4legs_1 syncd_secondary_log_4legs_2
* sync:               1
* striped:            0
* leg devices:        /dev/sdd1 /dev/sdf1 /dev/sdi1 /dev/sda1
* log devices:        /dev/sde1
* no MDA devices:
* failpv(s):          /dev/sdf1 /dev/sde1
* failnode(s):          virt-122.cluster-qe.lab.eng.brq.redhat.com virt-123.cluster-qe.lab.eng.brq.redhat.com virt-124.cluster-qe.lab.eng.brq.redhat.com
* lvmetad:            0
* leg fault policy:   allocate
* log fault policy:   allocate
******************************************************
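The "allocate" leg and log fault policies above correspond to two settings in the activation section of /etc/lvm/lvm.conf. A minimal sketch of the configuration the scenario assumes (the values are from the hash info above; the surrounding file contents are omitted):

```
# Hypothetical excerpt from /etc/lvm/lvm.conf matching this scenario.
# "allocate" tells dmeventd to replace a failed device from spare PVs
# in the VG instead of simply dropping it from the mirror ("remove").
activation {
    mirror_image_fault_policy = "allocate"   # policy for failed mirror legs
    mirror_log_fault_policy   = "allocate"   # policy for a failed mirror log
}
```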

The test gets stuck after disabling the PVs, since the nodes get rebooted. With fencing turned off, it becomes clear that clvmd is the culprit behind these reboots.

On the nodes themselves some errors can be seen though:

Failed actions:
    clvmd_monitor_60000 on virt-124.cluster-qe.lab.eng.brq.redhat.com 'unknown error' (1): call=19, status=Timed Out, last-rc-change='Wed Jan 15 14:40:51 2014', queued=0ms, exec=0ms
    clvmd_monitor_60000 on virt-122.cluster-qe.lab.eng.brq.redhat.com 'unknown error' (1): call=21, status=Timed Out, last-rc-change='Wed Jan 15 14:40:53 2014', queued=0ms, exec=0ms
    clvmd_monitor_60000 on virt-123.cluster-qe.lab.eng.brq.redhat.com 'unknown error' (1): call=19, status=Timed Out, last-rc-change='Wed Jan 15 14:40:49 2014', queued=0ms, exec=0ms

clvmd.service - LSB: This service is Clusterd LVM Daemon.
   Loaded: loaded (/etc/rc.d/init.d/clvmd)
   Active: inactive (dead) since Wed 2014-01-15 14:41:14 CET; 10min ago
  Process: 6948 ExecStop=/etc/rc.d/init.d/clvmd stop (code=exited, status=5)
  Process: 1256 ExecStart=/etc/rc.d/init.d/clvmd start (code=exited, status=5)
 Main PID: 1261 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/clvmd.service

Jan 15 14:25:17 virt-124.cluster-qe.lab.eng.brq.redhat.com clvmd[1256]: 0 logical volume(s) in volume group "helter_skelter" now active
Jan 15 14:25:17 virt-124.cluster-qe.lab.eng.brq.redhat.com clvmd[1256]: [FAILED]
Jan 15 14:25:17 virt-124.cluster-qe.lab.eng.brq.redhat.com systemd[1]: Started LSB: This service is Clusterd LVM Daemon..
Jan 15 14:41:07 virt-124.cluster-qe.lab.eng.brq.redhat.com systemd[1]: Stopping LSB: This service is Clusterd LVM Daemon....
Jan 15 14:41:12 virt-124.cluster-qe.lab.eng.brq.redhat.com clvmd[6948]: Deactivating clustered VG(s):   Couldn't find device with uuid fhqnPF-ghGO-p5jJ-4dwN-Xqmp-swfl-8rlLXs.
Jan 15 14:41:12 virt-124.cluster-qe.lab.eng.brq.redhat.com clvmd[6948]: Couldn't find device with uuid pbDED5-dg4U-1u0r-ZqDa-jAEc-B9jP-wOI0VP.
Jan 15 14:41:14 virt-124.cluster-qe.lab.eng.brq.redhat.com clvmd[6948]: Logical volume helter_skelter/syncd_secondary_log_4legs_1 contains a filesystem in use.
Jan 15 14:41:14 virt-124.cluster-qe.lab.eng.brq.redhat.com clvmd[6948]: Can't deactivate volume group "helter_skelter" with 2 open logical volume(s)
Jan 15 14:41:14 virt-124.cluster-qe.lab.eng.brq.redhat.com clvmd[6948]: [FAILED]
Jan 15 14:41:14 virt-124.cluster-qe.lab.eng.brq.redhat.com systemd[1]: Stopped LSB: This service is Clusterd LVM Daemon..

I will attach a file with a snippet from /var/log/messages as well, so as not to spam the comments here.


Version-Release number of selected component (if applicable):
lvm2-2.02.103-10.el7.x86_64
lvm2-cluster-2.02.103-10.el7.x86_64
kernel-3.10.0-64.el7.x86_64
kernel-3.10.0-67.el7.x86_64
device-mapper-1.02.82-10.el7.x86_64
cmirror-2.02.103-10.el7.x86_64


How reproducible:
95% of the time, most reliably by running the revolution_9 test.

Steps to Reproduce:

Run revolution_9 test, scenario kill_secondary_and_log_synced_4_legs

The scenario actually creates two LVs using the same order of PVs, creates a GFS2 filesystem on top, and fails one leg PV and the log PV after the mirrors are synced.
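For readers without access to the test harness, the scenario is roughly equivalent to the following sequence. This is a hedged sketch, not the harness's actual code: the VG/LV names and PV paths are taken from the hash info above, but the LV size, GFS2 options, and the device-failure mechanism are assumptions.

```shell
# Assumed reconstruction of kill_secondary_and_log_synced_4_legs.
vgcreate helter_skelter /dev/sdd1 /dev/sdf1 /dev/sdi1 /dev/sda1 /dev/sde1

# Two 4-legged mirrors with a disk log, legs in the same PV order
# (-m 3 means three mirror copies, i.e. four legs; last PV holds the log).
lvcreate --type mirror -m 3 --mirrorlog disk -L 500M \
    -n syncd_secondary_log_4legs_1 helter_skelter \
    /dev/sdd1 /dev/sdf1 /dev/sdi1 /dev/sda1 /dev/sde1
lvcreate --type mirror -m 3 --mirrorlog disk -L 500M \
    -n syncd_secondary_log_4legs_2 helter_skelter \
    /dev/sdd1 /dev/sdf1 /dev/sdi1 /dev/sda1 /dev/sde1

# GFS2 on top (cluster name and journal count are assumptions), then
# wait until both mirrors report Cpy%Sync of 100.00.
mkfs.gfs2 -p lock_dlm -t CLUSTER:gfs1 -j 3 \
    /dev/helter_skelter/syncd_secondary_log_4legs_1
while lvs --noheadings -o copy_percent helter_skelter | grep -qv '100.00'; do
    sleep 5
done

# Fail a secondary leg and the log PV; offlining via sysfs is one common
# way test harnesses simulate a device failure.
echo offline > /sys/block/sdf/device/state
echo offline > /sys/block/sde/device/state
```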

Actual results:

The whole cluster reboots due to self-fencing. 

Expected results:

The failure should be handled without rebooting all of the cluster nodes.

Comment 3 Peter Rajnoha 2014-02-21 11:29:41 UTC
Please try whether this is still reproducible with the latest lvm2-2.02.105-5.el7 and resource-agents-3.9.5-24.el7 (clvmd now needs to be set up as a cluster resource under Pacemaker; the old initscripts are gone).
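For reference, setting clvmd up as a Pacemaker resource looks roughly like the following. Treat this as a sketch under assumptions: the agent names and option values mirror the usual RHEL 7 HA configuration, but may differ across resource-agents builds.

```shell
# Assumed pcs commands for RHEL 7 (agent names may vary by build).
# dlm must be running before clvmd; both are cloned across all nodes.
pcs resource create dlm ocf:pacemaker:controld \
    op monitor interval=30s on-fail=fence \
    clone interleave=true ordered=true
pcs resource create clvmd ocf:heartbeat:clvm \
    op monitor interval=30s on-fail=fence \
    clone interleave=true ordered=true

# Start dlm first and keep clvmd on the same nodes as dlm.
pcs constraint order start dlm-clone then clvmd-clone
pcs constraint colocation add clvmd-clone with dlm-clone
```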

Comment 6 Corey Marthaler 2014-03-24 23:21:32 UTC
Marking verified since the clustered mirror device failure tests passed.

lvm_cluster_mirror_failure               PASS      


3.10.0-113.el7.x86_64
lvm2-2.02.105-13.el7    BUILT: Wed Mar 19 05:38:19 CDT 2014
lvm2-libs-2.02.105-13.el7    BUILT: Wed Mar 19 05:38:19 CDT 2014
lvm2-cluster-2.02.105-13.el7    BUILT: Wed Mar 19 05:38:19 CDT 2014
device-mapper-1.02.84-13.el7    BUILT: Wed Mar 19 05:38:19 CDT 2014
device-mapper-libs-1.02.84-13.el7    BUILT: Wed Mar 19 05:38:19 CDT 2014
device-mapper-event-1.02.84-13.el7    BUILT: Wed Mar 19 05:38:19 CDT 2014
device-mapper-event-libs-1.02.84-13.el7    BUILT: Wed Mar 19 05:38:19 CDT 2014
device-mapper-persistent-data-0.2.8-5.el7    BUILT: Fri Feb 28 19:15:56 CST 2014
cmirror-2.02.105-13.el7    BUILT: Wed Mar 19 05:38:19 CDT 2014

Comment 7 Ludek Smid 2014-06-13 09:56:51 UTC
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative if you have further questions about the request.