Bug 1398791

Summary: VG is not auto-activating with lvmetad enabled in automated backup script
Product: Red Hat Enterprise Linux 7
Reporter: brosser
Component: lvm2
Assignee: Peter Rajnoha <prajnoha>
lvm2 sub component: LVM Metadata / lvmetad
QA Contact: cluster-qe <cluster-qe>
Status: CLOSED NOTABUG
Docs Contact:
Severity: high
Priority: high
CC: agk, brosser, cmarthal, heinzm, jbrassow, jpittman, loberman, msnitzer, prajnoha, prockai, rhandlin, zkabelac
Version: 7.2
Flags: loberman: needinfo? (brosser)
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-13 19:14:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1420851
Attachments:
  Log Review from 10.5.16 (flags: none)
  Log Review from 9.29.16 (flags: none)
  lvmdump requested on 11-28 (flags: none)

Description brosser 2016-11-26 01:50:53 UTC
Description of problem:

The customer has a workflow where the 8 disks comprising prdvg are unmounted, vgremoved, re-synced from a clone via a vRO workflow, and re-attached/re-mounted so a backup can be taken. On one VM, however, the VG appears as partial, so it doesn't auto-activate (which causes their scripts to fail). All 8 PVs are attached, and after a pvscan --cache the VG appears normally and can be activated.

The disks are not automatically mounting after the following commands are executed (per their workflow/automated script):

umount /epic/prd01
systemctl mask epic-prd01.mount   # included to prevent auto-mount
vgchange -an prdvg
vgremove -f prdvg

VMware then removes the disks, takes a new clone, and attaches the new disks, which should auto-activate as prdvg. When their workflow fails (roughly 2 times in 7), all of the disks appear in lsblk but none of the 8 belong to a VG until they run "pvscan --cache" and "vgchange -ay prdvg" manually.

Currently we are not able to confirm what steps are taken to add the disks back. They use vRealize Orchestrator to call into a script on the OS that performs the steps previously mentioned (unmount, vgchange, vgremove) and then returns a code that tells vRO to proceed (or error out if any of those steps were unsuccessful). If the OS steps work as expected, their workflow pulls the disks, takes a new clone of the production system, then attaches those clone disks to the VM, where they expect prdvg to auto-activate (and auto-mount, if the mount unit is unmasked).
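
For context, the OS-side portion of that workflow presumably looks something like the following (a hypothetical sketch reconstructed from the description above; the mount unit name, prdvg and the individual commands come from this report, the error handling is assumed):

#!/bin/bash
# Fail fast so vRO receives a non-zero return code if any step breaks.
set -e

umount /epic/prd01
systemctl mask epic-prd01.mount   # prevent auto-mount while the disks are swapped
vgchange -an prdvg
vgremove -f prdvg

# A zero exit status tells vRO it can detach the disks and proceed with the clone.
exit 0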

(From the customer) The failures do seem to follow a pattern. For example, say the 8 disks appear as /dev/sdc through /dev/sdj. After the vgremove and the vRO disk removal, /dev/sdf still appears in lsblk, indicating that there may be stale metadata (or something similar). When the disks are added back, the disk with the PVID that previously matched /dev/sdf is now /dev/sdj (since the device names are unpredictable). At that point, LVM sees "duplicate" disks (because it doesn't know /dev/sdf was removed and the same PV is now /dev/sdj), so it marks the PV that was previously /dev/sdf as missing.
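
A quick way to check for this stale-device state once the disks are re-attached would be along these lines (a sketch; only prdvg and the device naming come from the report, the field selection is just a suggestion):

# Show each PV with its UUID and attributes; a duplicate PVID or a PV
# flagged as missing ("m" in pv_attr) should stand out here.
pvs -o pv_name,pv_uuid,vg_name,pv_attr

# Compare against what the kernel actually has attached.
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT

# Re-read the devices and let lvmetad drop any stale entries.
pvscan --cache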

It's been suggested that there may be issues with udev and how disk reads are signaled. To work around this, we have suggested the following:

 [1] edit lvm.conf to reflect the following:
use_lvmetad = 1  <<<<<<<<<<< CHANGE THIS VALUE FROM "1" TO "0"

 [2] Disable lvmetad

The customer has not been able to implement the above changes yet, but wants to get to the root of why auto-activation/mounting is not occurring when lvmetad is enabled on this system, while it works correctly on other systems with identical configurations.
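
For reference, the suggested workaround in steps [1] and [2] would look roughly like this on a RHEL 7 system (a sketch only; the unit names are the stock lvm2 ones, verify them on the customer's system before use):

# [1] Change use_lvmetad from 1 to 0 in lvm.conf.
sed -i 's/^\([[:space:]]*use_lvmetad[[:space:]]*=[[:space:]]*\)1/\10/' /etc/lvm/lvm.conf

# [2] Stop and disable the lvmetad daemon and its socket.
systemctl stop lvm2-lvmetad.service lvm2-lvmetad.socket
systemctl disable lvm2-lvmetad.service lvm2-lvmetad.socket

# With lvmetad off, LVM commands scan the disks directly.
vgscan
vgs prdvg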


How reproducible:

I have not been able to reproduce the issue. Per the customer, it happens at random, and they have not seen a trend that would identify a pattern. We have asked the customer to gather an 'lvmdump -u -l -s' on the problem system, before taking any additional steps, the next time the issue comes up.

Actual results:

N/A

Expected results:

The customer expects the automated workflow to work correctly with lvmetad enabled.

Additional info:

Comment 1 brosser 2016-11-26 01:51:45 UTC
Created attachment 1224504 [details]
Log Review from 10.5.16

Comment 2 brosser 2016-11-26 01:52:20 UTC
Created attachment 1224505 [details]
Log Review from 9.29.16

Comment 4 Peter Rajnoha 2016-11-28 09:32:43 UTC
For lvm2 autoactivation, it's important to have correct udev events - we need to check this first.

Would it be possible to collect lvm, systemd and udev debug logs by adding "debug" to the kernel command line before starting the system? Also, when the customer runs the script, they can call the "logger" command before the relevant steps to write a marker message that is clearly visible in the log. I suggest adding that before the vgchange, vgremove, and disk detach/attach calls. This would make it easier for us to match the sequences in the logs with the command sequence executed by the script.

Then, please, collect all logs by running "lvmdump -u -l -s" and attach it here, if possible (this will include lvm2 debug logs as well as systemd/journal and udev logs). Thanks.
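
To illustrate what is being asked for, the markers in the customer's script could look roughly like this (a sketch; the tag and message text are arbitrary, only the lvmdump flags come from the comment above):

# Write easily greppable markers into the journal around each step.
logger -t backup-workflow "STEP: vgchange -an prdvg"
vgchange -an prdvg

logger -t backup-workflow "STEP: vgremove -f prdvg"
vgremove -f prdvg

logger -t backup-workflow "STEP: requesting vRO disk detach/clone/attach"
# ... vRO detaches the disks, takes the clone, re-attaches them ...
logger -t backup-workflow "STEP: disks re-attached, expecting autoactivation of prdvg"

# Afterwards, collect udev, lvm2 debug and systemd/journal logs in one archive.
lvmdump -u -l -s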

Comment 5 brosser 2017-01-03 17:42:56 UTC
Created attachment 1236933 [details]
lvmdump requested on 11-28

The attachment is the lvmdump generated with the command listed in the 11-28-16 update.

Comment 6 brosser 2017-02-08 22:13:00 UTC
Hi, 

Has anyone had a chance to look at this issue yet?  The issue is pretty urgent for the customer, and we would like to know of any update if possible.  If a workaround is available, that may also work for the customer until we can confirm a more permanent solution.

Thank you in advance for your help.

Comment 7 Peter Rajnoha 2017-02-09 14:56:29 UTC
The log points to a hardware problem; from the systemd journal:

Dec 27 05:26:11.236673 et0958vodbcln01 kernel: sd 1:0:3:0: [sdk] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Dec 27 05:26:11.236755 et0958vodbcln01 kernel: sd 1:0:3:0: [sdk] CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Dec 27 05:26:11.236800 et0958vodbcln01 kernel: blk_update_request: I/O error, dev sdk, sector 0
Dec 27 05:26:11.236857 et0958vodbcln01 kernel: sd 1:0:3:0: [sdk] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Dec 27 05:26:11.236927 et0958vodbcln01 kernel: sd 1:0:3:0: [sdk] CDB: Read(10) 28 00 7c b9 77 80 00 00 08 00
Dec 27 05:26:11.236951 et0958vodbcln01 kernel: blk_update_request: I/O error, dev sdk, sector 2092529536
Dec 27 05:26:11.236355 et0958vodbcln01 lvm[2539473]: /dev/sdk: read failed after 0 of 4096 at 0: Input/output error
Dec 27 05:26:11.236414 et0958vodbcln01 lvm[2539473]: /dev/sdk: size is 2092529664 sectors
Dec 27 05:26:11.236683 et0958vodbcln01 lvm[2539473]: /dev/sdk: read failed after 0 of 4096 at 1071375122432: Input/output error
Dec 27 05:26:11.237213 et0958vodbcln01 lvm[2539473]: /dev/sdk: read failed after 0 of 4096 at 1071375179776: Input/output error
Dec 27 05:26:11.237446 et0958vodbcln01 lvm[2539473]: /dev/sdk: read failed after 0 of 4096 at 0: Input/output error
Dec 27 05:26:11.237777 et0958vodbcln01 lvm[2539473]: /dev/sdk: read failed after 0 of 4096 at 4096: Input/output error
Dec 27 05:26:11.238075 et0958vodbcln01 lvm[2539473]: /dev/sdk: read failed after 0 of 4096 at 0: Input/output error
Dec 27 05:26:11.238088 et0958vodbcln01 lvm[2539473]: /dev/sdk: No label detected
Dec 27 05:26:11.238291 et0958vodbcln01 lvm[2539473]: No PV label found on /dev/sdk.
Dec 27 05:26:11.238505 et0958vodbcln01 lvm[2539473]: Request to drop PV /dev/sdk in lvmetad did not find any matching object.
Dec 27 05:26:11.243595 et0958vodbcln01 kernel: sd 1:0:3:0: [sdk] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Dec 27 05:26:11.243679 et0958vodbcln01 kernel: sd 1:0:3:0: [sdk] CDB: Read(10) 28 00 7c b9 77 f0 00 00 08 00
Dec 27 05:26:11.243705 et0958vodbcln01 kernel: blk_update_request: I/O error, dev sdk, sector 2092529648
Dec 27 05:26:11.243779 et0958vodbcln01 kernel: sd 1:0:3:0: [sdk] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Dec 27 05:26:11.243852 et0958vodbcln01 kernel: sd 1:0:3:0: [sdk] CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Dec 27 05:26:11.243884 et0958vodbcln01 kernel: blk_update_request: I/O error, dev sdk, sector 0
Dec 27 05:26:11.243943 et0958vodbcln01 kernel: sd 1:0:3:0: [sdk] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Dec 27 05:26:11.244017 et0958vodbcln01 kernel: sd 1:0:3:0: [sdk] CDB: Read(10) 28 00 00 00 00 08 00 00 08 00
Dec 27 05:26:11.244043 et0958vodbcln01 kernel: blk_update_request: I/O error, dev sdk, sector 8
Dec 27 05:26:11.244113 et0958vodbcln01 kernel: sd 1:0:3:0: [sdk] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Dec 27 05:26:11.244180 et0958vodbcln01 kernel: sd 1:0:3:0: [sdk] CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Dec 27 05:26:11.244217 et0958vodbcln01 kernel: blk_update_request: I/O error, dev sdk, sector 0



Based on this, the record for the PV is removed from lvmetad, and hence the VG is then incomplete:

Dec 27 05:26:11.238505 et0958vodbcln01 lvm[2539473]: Request to drop PV /dev/sdk in lvmetad did not find any matching object.

And incomplete VGs can't be autoactivated.

Can you confirm with the customer that whenever the problem appears, there's also the error as pasted above in the journal?
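
One way for the customer to check this the next time the problem appears would be something like the following (a sketch; adjust the match patterns and time window as needed):

# Kernel-level I/O errors for the current boot.
journalctl -k -b | grep -E 'FAILED Result|blk_update_request: I/O error'

# Matching LVM messages, including the "drop PV" line that empties lvmetad.
journalctl -b | grep -E 'read failed|Request to drop PV'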

Comment 8 Peter Rajnoha 2017-02-09 15:01:45 UTC
If these are transient errors on a drive, they do not generate a proper uevent. For lvmetad to get the state back once the drive is usable again, it needs a refresh by running pvscan --cache, executed manually on the command line.

Of course, for a proper solution, the source of the drive's failed state should be identified.
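
Until that root cause is found, the manual recovery the customer already performs could be scripted as a fallback at the end of the workflow, roughly like this (a sketch; prdvg and the mount unit name come from this report):

# Re-read all devices and repopulate lvmetad.
pvscan --cache

# Activate the VG and bring the filesystem back once the VG is complete.
vgchange -ay prdvg
systemctl unmask epic-prd01.mount
systemctl start epic-prd01.mount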