Red Hat Bugzilla – Bug 1374786
'R VGLK lock_san acquire error' potential errors when attempting shared volume configurations
Last modified: 2018-04-10 11:19:55 EDT
Description of problem:
I've only seen this once, so until I reproduce it, it's probably not a big deal.

SCENARIO (raid1) - [open_EXT4_on_raid_fsadm_resize_attempt]
Create raid, add fs, and then attempt to resize it while it's mounted
harding-03: lvcreate --activate ey --nosync --type raid1 -m 1 -n open_fsadm_resize -L 4G raid_sanity
  WARNING: New raid1 won't be synchronised. Don't read what you didn't write!
Placing an EXT4 on open_fsadm_resize volume
mke2fs 1.42.9 (28-Dec-2013)
Starting dd io to raid fs to be resized
Attempt to resize the open raid filesystem multiple times with lvextend/fsadm on harding-03
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
  VG raid_sanity lock failed: error -202
couldn't resize raid and filesystem on harding-03

Sep 7 14:19:47 harding-03 qarshd[10371]: Running cmdline: lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize
Sep 7 14:19:47 harding-03 kernel: md/raid1:mdX: active with 2 out of 2 mirrors
Sep 7 14:19:47 harding-03 kernel: created bitmap (14 pages) for device mdX
Sep 7 14:19:48 harding-03 kernel: mdX: bitmap initialized from disk: read 1 pages, set 2048 of 28672 bits
Sep 7 14:19:49 harding-03 kernel: EXT4-fs (dm-33): resizing filesystem from 3145728 to 3670016 blocks
Sep 7 14:19:49 harding-03 kernel: EXT4-fs (dm-33): resized filesystem to 3670016
Sep 7 14:19:54 harding-03 systemd: Started qarsh Per-Connection Server (10.15.80.224:40624).
Sep 7 14:19:54 harding-03 systemd: Starting qarsh Per-Connection Server (10.15.80.224:40624)...
Sep 7 14:19:54 harding-03 qarshd[10465]: Talking to peer ::ffff:10.15.80.224:40624 (IPv6)
Sep 7 14:19:54 harding-03 qarshd[10465]: Running cmdline: df -k /mnt/open_fsadm_resize
Sep 7 14:19:54 harding-03 systemd: Started qarsh Per-Connection Server (10.15.80.224:40626).
Sep 7 14:19:54 harding-03 systemd: Starting qarsh Per-Connection Server (10.15.80.224:40626)...
Sep 7 14:19:54 harding-03 qarshd[10469]: Talking to peer ::ffff:10.15.80.224:40626 (IPv6)
Sep 7 14:19:54 harding-03 qarshd[10469]: Running cmdline: df -k /mnt/open_fsadm_resize
Sep 7 14:19:54 harding-03 systemd: Started qarsh Per-Connection Server (10.15.80.224:40628).
Sep 7 14:19:54 harding-03 systemd: Starting qarsh Per-Connection Server (10.15.80.224:40628)...
Sep 7 14:19:54 harding-03 qarshd[10473]: Talking to peer ::ffff:10.15.80.224:40628 (IPv6)
Sep 7 14:19:54 harding-03 qarshd[10473]: Running cmdline: lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize
Sep 7 14:20:05 harding-03 sanlock[3343]: 2016-09-07 14:20:05-0500 1670 [5030]: s1 delta_renew read timeout 10 sec offset 0 /dev/mapper/global-lvmlock
Sep 7 14:20:05 harding-03 sanlock[3343]: 2016-09-07 14:20:05-0500 1670 [5030]: s1 renewal error -202 delta_length 10 last_success 1640
Sep 7 14:20:17 harding-03 sanlock[3343]: 2016-09-07 14:20:17-0500 1682 [3355]: write_sectors leader offset 69206016 rv -202 /dev/mapper/raid_sanity-lvmlock
Sep 7 14:20:17 harding-03 sanlock[3343]: 2016-09-07 14:20:17-0500 1682 [3355]: r178 paxos_acquire write_new_leader error -202 timeout 1 owner 1391 1 1671
Sep 7 14:20:17 harding-03 sanlock[3343]: 2016-09-07 14:20:17-0500 1682 [3355]: r178 acquire_token disk error -202 RETRACT_PAXOS
Sep 7 14:20:22 harding-03 sanlock[3343]: 2016-09-07 14:20:22-0500 1687 [3355]: r178 cmd_acquire 3,10,3359 acquire_token -202
Sep 7 14:20:22 harding-03 lvmlockd[3359]: 1473276022 S lvm_raid_sanity R VGLK lock_san acquire error -202

Version-Release number of selected component (if applicable):
3.10.0-498.el7.x86_64
lvm2-2.02.165-1.el7    BUILT: Wed Sep 7 11:04:22 CDT 2016
lvm2-libs-2.02.165-1.el7    BUILT: Wed Sep 7 11:04:22 CDT 2016
lvm2-cluster-2.02.165-1.el7    BUILT: Wed Sep 7 11:04:22 CDT 2016
device-mapper-1.02.134-1.el7    BUILT: Wed Sep 7 11:04:22 CDT 2016
device-mapper-libs-1.02.134-1.el7    BUILT: Wed Sep 7 11:04:22 CDT 2016
device-mapper-event-1.02.134-1.el7    BUILT: Wed Sep 7 11:04:22 CDT 2016
device-mapper-event-libs-1.02.134-1.el7    BUILT: Wed Sep 7 11:04:22 CDT 2016
device-mapper-persistent-data-0.6.3-1.el7    BUILT: Fri Jul 22 05:29:13 CDT 2016
cmirror-2.02.165-1.el7    BUILT: Wed Sep 7 11:04:22 CDT 2016
sanlock-3.4.0-1.el7    BUILT: Fri Jun 10 11:41:03 CDT 2016
sanlock-lib-3.4.0-1.el7    BUILT: Fri Jun 10 11:41:03 CDT 2016
lvm2-lockd-2.02.165-1.el7    BUILT: Wed Sep 7 11:04:22 CDT 2016

How reproducible:
Only once so far
I think dd was hitting the disk with enough io to delay the reads/writes needed to acquire the lock beyond the 10 sec io timeout. The io timeout caused the lock request to fail with a timeout error. lvmlockd should probably recognize when a lock request fails due to an io timeout and just retry it.
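For illustration only, here is a minimal sketch of the retry behaviour being suggested: if the acquire fails with the i/o-timeout error seen in the logs above (-202), back off briefly and try again instead of failing the command right away. The function names, the stubbed acquire call, and the fixed retry count are all made up for the example; the real lvmlockd code and its limits may differ.

#include <stdio.h>
#include <unistd.h>

#define ERR_IO_TIMEOUT (-202)   /* the sanlock acquire error seen in the logs above */
#define MAX_RETRIES    3        /* illustrative only; a real limit would be configurable */

/* Hypothetical stand-in for the real acquire call: fail twice with an i/o
 * timeout (as if dd were saturating the disk), then succeed. */
static int acquire_vg_lock(const char *vg_name)
{
        static int attempts;
        (void)vg_name;
        return (++attempts < 3) ? ERR_IO_TIMEOUT : 0;
}

/* Retry the acquire when the only problem was an i/o timeout. */
static int acquire_vg_lock_retry(const char *vg_name)
{
        int rv, tries = 0;

        do {
                rv = acquire_vg_lock(vg_name);
                if (rv != ERR_IO_TIMEOUT)
                        return rv;      /* success, or an error that isn't retriable */
                sleep(1);               /* brief backoff while the storage catches up */
        } while (++tries < MAX_RETRIES);

        return rv;                      /* timeout persisted; report it */
}

int main(void)
{
        int rv = acquire_vg_lock_retry("raid_sanity");
        printf("VGLK acquire result: %d\n", rv);
        return rv ? 1 : 0;
}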
I've seen -212 a few times as well. I'm assuming this is the exact same issue.

Oct 20 11:03:40 host-083 qarshd[15762]: Running cmdline: pvs --noheadings -o pv_name,pv_uuid,vg_name
Oct 20 11:03:40 host-083 sanlock[14324]: 2016-10-20 11:03:40-0500 85461 [14330]: r4 ballot 6 larger2 lver[1943] 7 dblock 6
Oct 20 11:03:40 host-083 sanlock[14324]: 2016-10-20 11:03:40-0500 85461 [14330]: r4 ballot 6 larger2 mbal[1943] 1944 dblock 3536
Oct 20 11:03:40 host-083 sanlock[14324]: 2016-10-20 11:03:40-0500 85461 [14330]: r4 ballot 6 larger2 inp[1943] 0 0 0 dblock 1536 1 85461
Oct 20 11:03:40 host-083 sanlock[14324]: 2016-10-20 11:03:40-0500 85461 [14330]: r4 paxos_acquire 6 ballot error -212
Oct 20 11:03:40 host-083 sanlock[14324]: 2016-10-20 11:03:40-0500 85461 [14330]: r4 acquire_token disk error -212
Oct 20 11:03:40 host-083 sanlock[14324]: 2016-10-20 11:03:40-0500 85461 [14330]: r4 cmd_acquire 2,9,14335 acquire_token -212
Oct 20 11:03:40 host-083 lvmlockd[14335]: 1476979420 S lvm_global R VGLK lock_san acquire error -212

3.10.0-513.el7.x86_64
lvm2-2.02.166-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
lvm2-libs-2.02.166-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
lvm2-cluster-2.02.166-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
device-mapper-1.02.135-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
device-mapper-libs-1.02.135-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
device-mapper-event-1.02.135-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
device-mapper-event-libs-1.02.135-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
device-mapper-persistent-data-0.6.3-1.el7    BUILT: Fri Jul 22 05:29:13 CDT 2016
cmirror-2.02.166-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
sanlock-3.4.0-1.el7    BUILT: Fri Jun 10 11:41:03 CDT 2016
sanlock-lib-3.4.0-1.el7    BUILT: Fri Jun 10 11:41:03 CDT 2016
lvm2-lockd-2.02.166-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
pushing this out to the next release
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.
Bugzilla is absolutely idiotic
https://sourceware.org/git/?p=lvm2.git;a=commitdiff;h=e52d2e3bd86a006bcc322649ac9cb7c52117b787

There are a number of sanlock errors which mean the lock couldn't be acquired, but the caller can retry. The two errors above (-202 and -212) have been added to this set.

-202: there were i/o timeouts while trying to acquire the lock.
-212: there was an (unusual) sort of lock contention and another host holds the lock.

All of the failures that can be retried will often be masked when a retry is successful, but if the condition is persistent enough, the retry limit (configurable) is reached and the error is reported by the command, as seen above.

Testing that produces any sort of lock contention and retries is sufficient here. Reproducing those two specific errors would not be possible and isn't necessary. Also, those error numbers won't be reported to the command any longer (they were sanlock-specific); instead the command should report the lvm-specific error number -221, which represents any lock manager error that can't be handled.
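As a rough sketch of the classification described above (not the actual lvm2 code; the linked commit is authoritative), the idea is a small predicate deciding whether an acquire failure is worth retrying, plus a mapping that hides the sanlock-specific numbers from the command and reports the generic -221 instead. The symbolic names below are invented for the example; only the numeric values -202, -212, and -221 come from this bug.

#include <stdio.h>

/* Error numbers quoted from this bug; the symbolic names are illustrative. */
#define SLK_AIO_TIMEOUT   (-202)  /* i/o timed out while acquiring the lock */
#define SLK_DBLOCK_OWNED  (-212)  /* unusual contention: another host holds the lock */
#define LVM_LM_ERROR      (-221)  /* generic "lock manager error" shown to the command */

/* Can the caller usefully retry this acquire failure? */
static int lm_retry_ok(int rv)
{
        switch (rv) {
        case SLK_AIO_TIMEOUT:
        case SLK_DBLOCK_OWNED:
                return 1;
        default:
                return 0;
        }
}

/* Once retries are exhausted, report one lvm-level error rather than the
 * sanlock-specific number. */
static int lm_result_for_command(int rv)
{
        return (rv < 0) ? LVM_LM_ERROR : rv;
}

int main(void)
{
        int raw = SLK_AIO_TIMEOUT;   /* pretend every retry hit the same timeout */

        printf("retriable: %d, reported to command: %d\n",
               lm_retry_ok(raw), lm_result_for_command(raw));
        return 0;
}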
Marking verified with the latest rpms. Saw no errors after 100 iterations.

3.10.0-826.el7.x86_64
lvm2-2.02.176-5.el7    BUILT: Wed Dec 6 04:13:07 CST 2017
lvm2-libs-2.02.176-5.el7    BUILT: Wed Dec 6 04:13:07 CST 2017
lvm2-cluster-2.02.176-5.el7    BUILT: Wed Dec 6 04:13:07 CST 2017
lvm2-lockd-2.02.176-5.el7    BUILT: Wed Dec 6 04:13:07 CST 2017
lvm2-python-boom-0.8.1-5.el7    BUILT: Wed Dec 6 04:15:40 CST 2017
cmirror-2.02.176-5.el7    BUILT: Wed Dec 6 04:13:07 CST 2017
device-mapper-1.02.145-5.el7    BUILT: Wed Dec 6 04:13:07 CST 2017
device-mapper-libs-1.02.145-5.el7    BUILT: Wed Dec 6 04:13:07 CST 2017
device-mapper-event-1.02.145-5.el7    BUILT: Wed Dec 6 04:13:07 CST 2017
device-mapper-event-libs-1.02.145-5.el7    BUILT: Wed Dec 6 04:13:07 CST 2017
device-mapper-persistent-data-0.7.3-3.el7    BUILT: Tue Nov 14 05:07:18 CST 2017
sanlock-3.6.0-1.el7    BUILT: Tue Dec 5 11:47:21 CST 2017
sanlock-lib-3.6.0-1.el7    BUILT: Tue Dec 5 11:47:21 CST 2017

============================================================
Iteration 100 of 100 started at Fri Jan 5 17:31:36 CST 2018
============================================================
SCENARIO (raid1) - [open_EXT4_on_raid_fsadm_resize_attempt]
Create raid, add fs, and then attempt to resize it while it's mounted
mckinley-03: lvcreate --activate ey --nosync --type raid1 -m 1 -n open_fsadm_resize -L 4G raid_sanity
  WARNING: New raid1 won't be synchronised. Don't read what you didn't write!
Placing an EXT4 on open_fsadm_resize volume
mke2fs 1.42.9 (28-Dec-2013)
Starting dd io to raid fs to be resized
Attempt to resize the open raid filesystem multiple times with lvextend/fsadm on mckinley-03
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
(lvextend -L +2G -r /dev/raid_sanity/open_fsadm_resize)
resize2fs 1.42.9 (28-Dec-2013)
perform raid scrubbing (lvchange --syncaction repair) on raid raid_sanity/open_fsadm_resize
Deactivating raid open_fsadm_resize... and removing
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0853