Bug 1374786

Summary: 'R VGLK lock_san acquire error' potential errors when attempting shared volume configurations
Product: Red Hat Enterprise Linux 7
Reporter: Corey Marthaler <cmarthal>
Component: lvm2
Assignee: David Teigland <teigland>
lvm2 sub component: LVM lock daemon / lvmlockd
QA Contact: cluster-qe <cluster-qe>
Status: CLOSED ERRATA
Severity: low
Priority: unspecified
CC: agk, heinzm, jbrassow, mcsontos, prajnoha, rbednar, teigland, zkabelac
Version: 7.3
Keywords: Reopened
Target Milestone: rc
Hardware: x86_64
OS: Linux
Fixed In Version: lvm2-2.02.176-5.el7
Last Closed: 2018-04-10 15:18:32 UTC
Type: Bug

Description Corey Marthaler 2016-09-09 15:42:04 UTC
Description of problem:
I've only seen this once, so until I can reproduce it, it's probably not a big deal.

SCENARIO (raid1) - [open_EXT4_on_raid_fsadm_resize_attempt]
Create raid, add fs, and then attempt to resize it while it's mounted
harding-03: lvcreate --activate ey --nosync --type raid1 -m 1 -n open_fsadm_resize -L 4G raid_sanity
  WARNING: New raid1 won't be synchronised. Don't read what you didn't write!
Placing an EXT4 on open_fsadm_resize volume
mke2fs 1.42.9 (28-Dec-2013)
Starting dd io to raid fs to be resized
Attempt to resize the open raid filesystem multiple times with lvextend/fsadm on harding-03
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
  VG raid_sanity lock failed: error -202
couldn't resize raid and filesystem on harding-03


Sep  7 14:19:47 harding-03 qarshd[10371]: Running cmdline: lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize
Sep  7 14:19:47 harding-03 kernel: md/raid1:mdX: active with 2 out of 2 mirrors
Sep  7 14:19:47 harding-03 kernel: created bitmap (14 pages) for device mdX
Sep  7 14:19:48 harding-03 kernel: mdX: bitmap initialized from disk: read 1 pages, set 2048 of 28672 bits
Sep  7 14:19:49 harding-03 kernel: EXT4-fs (dm-33): resizing filesystem from 3145728 to 3670016 blocks
Sep  7 14:19:49 harding-03 kernel: EXT4-fs (dm-33): resized filesystem to 3670016
Sep  7 14:19:54 harding-03 systemd: Started qarsh Per-Connection Server (10.15.80.224:40624).
Sep  7 14:19:54 harding-03 systemd: Starting qarsh Per-Connection Server (10.15.80.224:40624)...
Sep  7 14:19:54 harding-03 qarshd[10465]: Talking to peer ::ffff:10.15.80.224:40624 (IPv6)
Sep  7 14:19:54 harding-03 qarshd[10465]: Running cmdline: df -k /mnt/open_fsadm_resize
Sep  7 14:19:54 harding-03 systemd: Started qarsh Per-Connection Server (10.15.80.224:40626).
Sep  7 14:19:54 harding-03 systemd: Starting qarsh Per-Connection Server (10.15.80.224:40626)...
Sep  7 14:19:54 harding-03 qarshd[10469]: Talking to peer ::ffff:10.15.80.224:40626 (IPv6)
Sep  7 14:19:54 harding-03 qarshd[10469]: Running cmdline: df -k /mnt/open_fsadm_resize
Sep  7 14:19:54 harding-03 systemd: Started qarsh Per-Connection Server (10.15.80.224:40628).
Sep  7 14:19:54 harding-03 systemd: Starting qarsh Per-Connection Server (10.15.80.224:40628)...
Sep  7 14:19:54 harding-03 qarshd[10473]: Talking to peer ::ffff:10.15.80.224:40628 (IPv6)
Sep  7 14:19:54 harding-03 qarshd[10473]: Running cmdline: lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize
Sep  7 14:20:05 harding-03 sanlock[3343]: 2016-09-07 14:20:05-0500 1670 [5030]: s1 delta_renew read timeout 10 sec offset 0 /dev/mapper/global-lvmlock
Sep  7 14:20:05 harding-03 sanlock[3343]: 2016-09-07 14:20:05-0500 1670 [5030]: s1 renewal error -202 delta_length 10 last_success 1640
Sep  7 14:20:17 harding-03 sanlock[3343]: 2016-09-07 14:20:17-0500 1682 [3355]: write_sectors leader offset 69206016 rv -202 /dev/mapper/raid_sanity-lvmlock
Sep  7 14:20:17 harding-03 sanlock[3343]: 2016-09-07 14:20:17-0500 1682 [3355]: r178 paxos_acquire write_new_leader error -202 timeout 1 owner 1391 1 1671
Sep  7 14:20:17 harding-03 sanlock[3343]: 2016-09-07 14:20:17-0500 1682 [3355]: r178 acquire_token disk error -202 RETRACT_PAXOS
Sep  7 14:20:22 harding-03 sanlock[3343]: 2016-09-07 14:20:22-0500 1687 [3355]: r178 cmd_acquire 3,10,3359 acquire_token -202
Sep  7 14:20:22 harding-03 lvmlockd[3359]: 1473276022 S lvm_raid_sanity R VGLK lock_san acquire error -202


Version-Release number of selected component (if applicable):
3.10.0-498.el7.x86_64

lvm2-2.02.165-1.el7    BUILT: Wed Sep  7 11:04:22 CDT 2016
lvm2-libs-2.02.165-1.el7    BUILT: Wed Sep  7 11:04:22 CDT 2016
lvm2-cluster-2.02.165-1.el7    BUILT: Wed Sep  7 11:04:22 CDT 2016
device-mapper-1.02.134-1.el7    BUILT: Wed Sep  7 11:04:22 CDT 2016
device-mapper-libs-1.02.134-1.el7    BUILT: Wed Sep  7 11:04:22 CDT 2016
device-mapper-event-1.02.134-1.el7    BUILT: Wed Sep  7 11:04:22 CDT 2016
device-mapper-event-libs-1.02.134-1.el7    BUILT: Wed Sep  7 11:04:22 CDT 2016
device-mapper-persistent-data-0.6.3-1.el7    BUILT: Fri Jul 22 05:29:13 CDT 2016
cmirror-2.02.165-1.el7    BUILT: Wed Sep  7 11:04:22 CDT 2016
sanlock-3.4.0-1.el7    BUILT: Fri Jun 10 11:41:03 CDT 2016
sanlock-lib-3.4.0-1.el7    BUILT: Fri Jun 10 11:41:03 CDT 2016
lvm2-lockd-2.02.165-1.el7    BUILT: Wed Sep  7 11:04:22 CDT 2016


How reproducible:
Only once so far

Comment 1 David Teigland 2016-09-09 16:02:23 UTC
I think dd was hitting the disk with enough io to delay the reads/writes acquiring the lock beyond the 10 sec io timeout.  The io timeout caused the lock request to fail with a timeout error.  lvmlockd should probably recognize when a lock request fails due to an io timeout and just retry it.
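
To illustrate the idea, here is a minimal C sketch of a retry loop that treats an i/o-timeout failure from the lock manager as retryable instead of failing the command right away. This is not actual lvmlockd code; only the -202 error number comes from the log above, and the function and constant names are hypothetical.

/* Minimal sketch of the retry behavior suggested above, not actual lvmlockd
 * code.  Only the -202 error number comes from the log; the function and
 * constant names here are hypothetical. */

#include <stdio.h>
#include <unistd.h>

#define SANLK_ACQUIRE_IO_TIMEOUT (-202)  /* sanlock i/o timed out during acquire */
#define MAX_LOCK_RETRIES 5               /* would be configurable in practice */

/* stub standing in for the real sanlock acquire path */
static int sanlock_acquire_vg_lock(const char *vg_name)
{
	(void)vg_name;
	return SANLK_ACQUIRE_IO_TIMEOUT;     /* pretend the storage keeps timing out */
}

static int acquire_vg_lock_with_retry(const char *vg_name)
{
	int rv = 0;
	int i;

	for (i = 0; i < MAX_LOCK_RETRIES; i++) {
		rv = sanlock_acquire_vg_lock(vg_name);
		if (rv != SANLK_ACQUIRE_IO_TIMEOUT)
			return rv;               /* success or a non-retryable error */
		sleep(1);                        /* back off briefly, then retry */
	}
	return rv;                               /* retry limit reached; report the error */
}

int main(void)
{
	printf("lock result: %d\n", acquire_vg_lock_with_retry("raid_sanity"));
	return 0;
}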

Comment 2 Corey Marthaler 2016-10-20 17:03:24 UTC
I've seen -212 a few times as well. I'm assuming this is the exact same issue.


Oct 20 11:03:40 host-083 qarshd[15762]: Running cmdline: pvs --noheadings -o pv_name,pv_uuid,vg_name
Oct 20 11:03:40 host-083 sanlock[14324]: 2016-10-20 11:03:40-0500 85461 [14330]: r4 ballot 6 larger2 lver[1943] 7 dblock 6
Oct 20 11:03:40 host-083 sanlock[14324]: 2016-10-20 11:03:40-0500 85461 [14330]: r4 ballot 6 larger2 mbal[1943] 1944 dblock 3536
Oct 20 11:03:40 host-083 sanlock[14324]: 2016-10-20 11:03:40-0500 85461 [14330]: r4 ballot 6 larger2 inp[1943] 0 0 0 dblock 1536 1 85461
Oct 20 11:03:40 host-083 sanlock[14324]: 2016-10-20 11:03:40-0500 85461 [14330]: r4 paxos_acquire 6 ballot error -212
Oct 20 11:03:40 host-083 sanlock[14324]: 2016-10-20 11:03:40-0500 85461 [14330]: r4 acquire_token disk error -212
Oct 20 11:03:40 host-083 sanlock[14324]: 2016-10-20 11:03:40-0500 85461 [14330]: r4 cmd_acquire 2,9,14335 acquire_token -212
Oct 20 11:03:40 host-083 lvmlockd[14335]: 1476979420 S lvm_global R VGLK lock_san acquire error -212



3.10.0-513.el7.x86_64

lvm2-2.02.166-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
lvm2-libs-2.02.166-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
lvm2-cluster-2.02.166-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
device-mapper-1.02.135-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
device-mapper-libs-1.02.135-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
device-mapper-event-1.02.135-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
device-mapper-event-libs-1.02.135-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
device-mapper-persistent-data-0.6.3-1.el7    BUILT: Fri Jul 22 05:29:13 CDT 2016
cmirror-2.02.166-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
sanlock-3.4.0-1.el7    BUILT: Fri Jun 10 11:41:03 CDT 2016
sanlock-lib-3.4.0-1.el7    BUILT: Fri Jun 10 11:41:03 CDT 2016
lvm2-lockd-2.02.166-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016

Comment 3 David Teigland 2017-04-12 14:30:51 UTC
pushing this out to the next release

Comment 5 Red Hat Bugzilla Rules Engine 2017-04-12 14:31:01 UTC
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.

Comment 6 David Teigland 2017-04-12 14:32:58 UTC
bugzilla is absolutely idiotic

Comment 11 David Teigland 2017-11-20 19:50:47 UTC
https://sourceware.org/git/?p=lvm2.git;a=commitdiff;h=e52d2e3bd86a006bcc322649ac9cb7c52117b787

There are a number of sanlock errors that mean the lock couldn't be acquired but the caller can retry.  The two errors above (-202 and -212) have been added to this set.

-202: there were i/o timeouts while trying to acquire the lock.
-212: there was an unusual sort of lock contention where another host holds the lock.

Retryable failures like these are usually masked by a successful retry, but if the condition persists long enough that the (configurable) retry limit is reached, the error is reported by the command, as seen above.

Testing that produces any sort of lock contention and retries is sufficient here.  Reproducing those two specific errors would not be possible and isn't necessary.

Also, those error numbers won't be reported to the command any longer (they were sanlock-specific); instead the command should report the lvm-specific error number -221, which represents any lock manager error that can't be handled.
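
As a rough sketch of that behavior (not the commit linked above), the classification could look like the following C code: retryable sanlock errors, now including -202 and -212, are retried up to a configurable limit, and whatever still fails is reported to the command as -221 rather than as the raw sanlock number. The names LVM_LOCK_MANAGER_ERROR, lock_error_is_retryable, and try_acquire are illustrative assumptions; only the error numbers come from the comments above.

/* Rough sketch of the behavior described in this comment, not the commit
 * itself.  Error numbers -202, -212 and -221 come from the comments above;
 * the function and constant names are illustrative assumptions. */

#include <stdio.h>

#define LVM_LOCK_MANAGER_ERROR (-221)  /* lvm-specific: unhandled lock manager error */

static int lock_error_is_retryable(int rv)
{
	switch (rv) {
	case -202:   /* i/o timeouts while trying to acquire the lock */
	case -212:   /* unusual contention: another host holds the lock */
		return 1;
	default:
		return 0;
	}
}

/* stub for the real acquire call; returns 0 on success or a sanlock error */
static int try_acquire(void)
{
	return -212;                       /* pretend another host keeps winning */
}

static int acquire_lock(int lock_retries)
{
	int rv = 0;
	int i;

	for (i = 0; i <= lock_retries; i++) {
		rv = try_acquire();
		if (!rv)
			return 0;
		if (!lock_error_is_retryable(rv))
			break;
	}
	/* don't pass the sanlock-specific number back to the command */
	return LVM_LOCK_MANAGER_ERROR;
}

int main(void)
{
	printf("lock result: %d\n", acquire_lock(3));
	return 0;
}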

Comment 14 Corey Marthaler 2018-01-05 23:44:26 UTC
Marking verified with the latest rpms. Saw no errors after 100 iterations. 

3.10.0-826.el7.x86_64
lvm2-2.02.176-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
lvm2-libs-2.02.176-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
lvm2-cluster-2.02.176-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
lvm2-lockd-2.02.176-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
lvm2-python-boom-0.8.1-5.el7    BUILT: Wed Dec  6 04:15:40 CST 2017
cmirror-2.02.176-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
device-mapper-1.02.145-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
device-mapper-libs-1.02.145-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
device-mapper-event-1.02.145-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
device-mapper-event-libs-1.02.145-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
device-mapper-persistent-data-0.7.3-3.el7    BUILT: Tue Nov 14 05:07:18 CST 2017
sanlock-3.6.0-1.el7    BUILT: Tue Dec  5 11:47:21 CST 2017
sanlock-lib-3.6.0-1.el7    BUILT: Tue Dec  5 11:47:21 CST 2017



============================================================
Iteration 100 of 100 started at Fri Jan  5 17:31:36 CST 2018
============================================================
SCENARIO (raid1) - [open_EXT4_on_raid_fsadm_resize_attempt]
Create raid, add fs, and then attempt to resize it while it's mounted
mckinley-03: lvcreate --activate ey --nosync --type raid1 -m 1 -n open_fsadm_resize -L 4G raid_sanity
  WARNING: New raid1 won't be synchronised. Don't read what you didn't write!
Placing an EXT4 on open_fsadm_resize volume
mke2fs 1.42.9 (28-Dec-2013)
Starting dd io to raid fs to be resized

Attempt to resize the open raid filesystem multiple times with lvextend/fsadm on mckinley-03
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)

perform raid scrubbing (lvchange --syncaction repair) on raid raid_sanity/open_fsadm_resize
Deactivating raid open_fsadm_resize... and removing

Comment 17 errata-xmlrpc 2018-04-10 15:18:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0853