Bug 1374786 - 'R VGLK lock_san acquire error' potential errors when attempting shared volume configurations
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lvm2
Version: 7.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: rc
Target Release: ---
Assignee: David Teigland
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-09 15:42 UTC by Corey Marthaler
Modified: 2021-09-03 12:35 UTC
CC: 8 users

Fixed In Version: lvm2-2.02.176-5.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-10 15:18:32 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHEA-2018:0853 (last updated 2018-04-10 15:19:55 UTC)

Description Corey Marthaler 2016-09-09 15:42:04 UTC
Description of problem:
I've only seen this once, so until I reproduce it, it's probably not a big deal.

SCENARIO (raid1) - [open_EXT4_on_raid_fsadm_resize_attempt]
Create raid, add fs, and then attempt to resize it while it's mounted
harding-03: lvcreate --activate ey --nosync --type raid1 -m 1 -n open_fsadm_resize -L 4G raid_sanity
  WARNING: New raid1 won't be synchronised. Don't read what you didn't write!
Placing an EXT4 on open_fsadm_resize volume
mke2fs 1.42.9 (28-Dec-2013)
Starting dd io to raid fs to be resized
Attempt to resize the open raid filesystem multiple times with lvextend/fsadm on harding-03
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
  VG raid_sanity lock failed: error -202
couldn't resize raid and filesystem on harding-03


Sep  7 14:19:47 harding-03 qarshd[10371]: Running cmdline: lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize
Sep  7 14:19:47 harding-03 kernel: md/raid1:mdX: active with 2 out of 2 mirrors
Sep  7 14:19:47 harding-03 kernel: created bitmap (14 pages) for device mdX
Sep  7 14:19:48 harding-03 kernel: mdX: bitmap initialized from disk: read 1 pages, set 2048 of 28672 bits
Sep  7 14:19:49 harding-03 kernel: EXT4-fs (dm-33): resizing filesystem from 3145728 to 3670016 blocks
Sep  7 14:19:49 harding-03 kernel: EXT4-fs (dm-33): resized filesystem to 3670016
Sep  7 14:19:54 harding-03 systemd: Started qarsh Per-Connection Server (10.15.80.224:40624).
Sep  7 14:19:54 harding-03 systemd: Starting qarsh Per-Connection Server (10.15.80.224:40624)...
Sep  7 14:19:54 harding-03 qarshd[10465]: Talking to peer ::ffff:10.15.80.224:40624 (IPv6)
Sep  7 14:19:54 harding-03 qarshd[10465]: Running cmdline: df -k /mnt/open_fsadm_resize
Sep  7 14:19:54 harding-03 systemd: Started qarsh Per-Connection Server (10.15.80.224:40626).
Sep  7 14:19:54 harding-03 systemd: Starting qarsh Per-Connection Server (10.15.80.224:40626)...
Sep  7 14:19:54 harding-03 qarshd[10469]: Talking to peer ::ffff:10.15.80.224:40626 (IPv6)
Sep  7 14:19:54 harding-03 qarshd[10469]: Running cmdline: df -k /mnt/open_fsadm_resize
Sep  7 14:19:54 harding-03 systemd: Started qarsh Per-Connection Server (10.15.80.224:40628).
Sep  7 14:19:54 harding-03 systemd: Starting qarsh Per-Connection Server (10.15.80.224:40628)...
Sep  7 14:19:54 harding-03 qarshd[10473]: Talking to peer ::ffff:10.15.80.224:40628 (IPv6)
Sep  7 14:19:54 harding-03 qarshd[10473]: Running cmdline: lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize
Sep  7 14:20:05 harding-03 sanlock[3343]: 2016-09-07 14:20:05-0500 1670 [5030]: s1 delta_renew read timeout 10 sec offset 0 /dev/mapper/global-lvmlock
Sep  7 14:20:05 harding-03 sanlock[3343]: 2016-09-07 14:20:05-0500 1670 [5030]: s1 renewal error -202 delta_length 10 last_success 1640
Sep  7 14:20:17 harding-03 sanlock[3343]: 2016-09-07 14:20:17-0500 1682 [3355]: write_sectors leader offset 69206016 rv -202 /dev/mapper/raid_sanity-lvmlock
Sep  7 14:20:17 harding-03 sanlock[3343]: 2016-09-07 14:20:17-0500 1682 [3355]: r178 paxos_acquire write_new_leader error -202 timeout 1 owner 1391 1 1671
Sep  7 14:20:17 harding-03 sanlock[3343]: 2016-09-07 14:20:17-0500 1682 [3355]: r178 acquire_token disk error -202 RETRACT_PAXOS
Sep  7 14:20:22 harding-03 sanlock[3343]: 2016-09-07 14:20:22-0500 1687 [3355]: r178 cmd_acquire 3,10,3359 acquire_token -202
Sep  7 14:20:22 harding-03 lvmlockd[3359]: 1473276022 S lvm_raid_sanity R VGLK lock_san acquire error -202


Version-Release number of selected component (if applicable):
3.10.0-498.el7.x86_64

lvm2-2.02.165-1.el7    BUILT: Wed Sep  7 11:04:22 CDT 2016
lvm2-libs-2.02.165-1.el7    BUILT: Wed Sep  7 11:04:22 CDT 2016
lvm2-cluster-2.02.165-1.el7    BUILT: Wed Sep  7 11:04:22 CDT 2016
device-mapper-1.02.134-1.el7    BUILT: Wed Sep  7 11:04:22 CDT 2016
device-mapper-libs-1.02.134-1.el7    BUILT: Wed Sep  7 11:04:22 CDT 2016
device-mapper-event-1.02.134-1.el7    BUILT: Wed Sep  7 11:04:22 CDT 2016
device-mapper-event-libs-1.02.134-1.el7    BUILT: Wed Sep  7 11:04:22 CDT 2016
device-mapper-persistent-data-0.6.3-1.el7    BUILT: Fri Jul 22 05:29:13 CDT 2016
cmirror-2.02.165-1.el7    BUILT: Wed Sep  7 11:04:22 CDT 2016
sanlock-3.4.0-1.el7    BUILT: Fri Jun 10 11:41:03 CDT 2016
sanlock-lib-3.4.0-1.el7    BUILT: Fri Jun 10 11:41:03 CDT 2016
lvm2-lockd-2.02.165-1.el7    BUILT: Wed Sep  7 11:04:22 CDT 2016


How reproducible:
Only once so far

Comment 1 David Teigland 2016-09-09 16:02:23 UTC
I think dd was hitting the disk with enough io to delay the reads/writes acquiring the lock beyond the 10 sec io timeout.  The io timeout caused the lock request to fail with a timeout error.  lvmlockd should probably recognize when a lock request fails due to an io timeout and just retry it.

Comment 2 Corey Marthaler 2016-10-20 17:03:24 UTC
I've seen -212 a few times as well. I'm assuming this is the exact same issue.


Oct 20 11:03:40 host-083 qarshd[15762]: Running cmdline: pvs --noheadings -o pv_name,pv_uuid,vg_name
Oct 20 11:03:40 host-083 sanlock[14324]: 2016-10-20 11:03:40-0500 85461 [14330]: r4 ballot 6 larger2 lver[1943] 7 dblock 6
Oct 20 11:03:40 host-083 sanlock[14324]: 2016-10-20 11:03:40-0500 85461 [14330]: r4 ballot 6 larger2 mbal[1943] 1944 dblock 3536
Oct 20 11:03:40 host-083 sanlock[14324]: 2016-10-20 11:03:40-0500 85461 [14330]: r4 ballot 6 larger2 inp[1943] 0 0 0 dblock 1536 1 85461
Oct 20 11:03:40 host-083 sanlock[14324]: 2016-10-20 11:03:40-0500 85461 [14330]: r4 paxos_acquire 6 ballot error -212
Oct 20 11:03:40 host-083 sanlock[14324]: 2016-10-20 11:03:40-0500 85461 [14330]: r4 acquire_token disk error -212
Oct 20 11:03:40 host-083 sanlock[14324]: 2016-10-20 11:03:40-0500 85461 [14330]: r4 cmd_acquire 2,9,14335 acquire_token -212
Oct 20 11:03:40 host-083 lvmlockd[14335]: 1476979420 S lvm_global R VGLK lock_san acquire error -212



3.10.0-513.el7.x86_64

lvm2-2.02.166-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
lvm2-libs-2.02.166-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
lvm2-cluster-2.02.166-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
device-mapper-1.02.135-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
device-mapper-libs-1.02.135-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
device-mapper-event-1.02.135-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
device-mapper-event-libs-1.02.135-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
device-mapper-persistent-data-0.6.3-1.el7    BUILT: Fri Jul 22 05:29:13 CDT 2016
cmirror-2.02.166-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
sanlock-3.4.0-1.el7    BUILT: Fri Jun 10 11:41:03 CDT 2016
sanlock-lib-3.4.0-1.el7    BUILT: Fri Jun 10 11:41:03 CDT 2016
lvm2-lockd-2.02.166-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016

Comment 3 David Teigland 2017-04-12 14:30:51 UTC
pushing this out to the next release

Comment 5 Red Hat Bugzilla Rules Engine 2017-04-12 14:31:01 UTC
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.

Comment 6 David Teigland 2017-04-12 14:32:58 UTC
bugzilla is absolutely idiotic

Comment 11 David Teigland 2017-11-20 19:50:47 UTC
https://sourceware.org/git/?p=lvm2.git;a=commitdiff;h=e52d2e3bd86a006bcc322649ac9cb7c52117b787

There are a number of sanlock errors which mean the lock couldn't be acquired but the caller can retry.  The two errors above (-202 and -212) have been added to this set.

-202: there were i/o timeouts while trying to acquire the lock.
-212: there was an unusual sort of lock contention and another host holds the lock.

Retryable failures are usually masked by a successful retry, but if the condition persists long enough, the (configurable) retry limit is reached and the command reports the error, as seen above.

Testing that produces any sort of lock contention and retries is sufficient here.  Reproducing those two specific errors would not be possible and isn't necessary.

Also, those error numbers won't be reported to the command any longer (they were sanlock-specific); instead the command should report the lvm-specific error number -221, which represents any lock manager error that can't be handled.
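[Editor's note: the retry behavior described in this comment can be sketched roughly as below. This is a minimal illustration, not the actual lvm2 code; the function and variable names are invented, and only the error values (-202, -212, -221) come from the comments in this bug.]

```python
# Illustrative sketch only -- names are invented; the real logic lives in
# lvm2's lvmlockd/sanlock integration (see the commit linked above).

# sanlock-specific codes seen in this bug's logs
RETRYABLE_SANLOCK_ERRORS = {
    -202,  # i/o timeouts while trying to acquire the lock
    -212,  # unusual lock contention: another host holds the lock
}

# lvm-specific code for "a lock manager error the command can't handle"
ELOCKMGR = -221

def acquire_with_retry(try_acquire, retry_limit=3):
    """Retry transient sanlock failures; never leak sanlock codes upward."""
    for attempt in range(retry_limit + 1):
        rv = try_acquire()
        if rv == 0:
            return 0                      # lock acquired
        if rv not in RETRYABLE_SANLOCK_ERRORS:
            break                         # not transient, give up now
    return ELOCKMGR                       # report the generic lvm error

# toy lock that fails twice with an i/o timeout, then succeeds,
# as a heavy-dd-then-recovery scenario like this bug's might
_attempts = iter([-202, -202, 0])
assert acquire_with_retry(lambda: next(_attempts)) == 0
```

If the transient condition persists past the retry limit, the command reports -221 rather than the raw sanlock code, matching the behavior described above.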

Comment 14 Corey Marthaler 2018-01-05 23:44:26 UTC
Marking verified with the latest rpms. Saw no errors after 100 iterations. 

3.10.0-826.el7.x86_64
lvm2-2.02.176-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
lvm2-libs-2.02.176-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
lvm2-cluster-2.02.176-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
lvm2-lockd-2.02.176-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
lvm2-python-boom-0.8.1-5.el7    BUILT: Wed Dec  6 04:15:40 CST 2017
cmirror-2.02.176-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
device-mapper-1.02.145-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
device-mapper-libs-1.02.145-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
device-mapper-event-1.02.145-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
device-mapper-event-libs-1.02.145-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
device-mapper-persistent-data-0.7.3-3.el7    BUILT: Tue Nov 14 05:07:18 CST 2017
sanlock-3.6.0-1.el7    BUILT: Tue Dec  5 11:47:21 CST 2017
sanlock-lib-3.6.0-1.el7    BUILT: Tue Dec  5 11:47:21 CST 2017



============================================================
Iteration 100 of 100 started at Fri Jan  5 17:31:36 CST 2018
============================================================
SCENARIO (raid1) - [open_EXT4_on_raid_fsadm_resize_attempt]
Create raid, add fs, and then attempt to resize it while it's mounted
mckinley-03: lvcreate --activate ey --nosync --type raid1 -m 1 -n open_fsadm_resize -L 4G raid_sanity
  WARNING: New raid1 won't be synchronised. Don't read what you didn't write!
Placing an EXT4 on open_fsadm_resize volume
mke2fs 1.42.9 (28-Dec-2013)
Starting dd io to raid fs to be resized

Attempt to resize the open raid filesystem multiple times with lvextend/fsadm on mckinley-03
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)

perform raid scrubbing (lvchange --syncaction repair) on raid raid_sanity/open_fsadm_resize
Deactivating raid open_fsadm_resize... and removing

Comment 17 errata-xmlrpc 2018-04-10 15:18:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0853

