Bug 1468919

Summary: LVM Metadata rolls back and causes disk corruption
Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Version: 3.6.10
Status: CLOSED WONTFIX
Severity: urgent
Priority: high
Reporter: Germano Veit Michel <gveitmic>
Assignee: Nir Soffer <nsoffer>
QA Contact: Raz Tamir <ratamir>
CC: aefrat, amureini, bazulay, ebenahar, gveitmic, gwatson, lsurette, nashok, nsoffer, srevivo, ycui, ykaul, ylavi
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
oVirt Team: Storage
Type: Bug
Regression: ---
Last Closed: 2017-07-10 11:36:56 UTC

Description Germano Veit Michel 2017-07-09 23:38:58 UTC
Description of problem:

In my understanding this will not happen in 4.0.7+, as lvmetad was disabled there. But I believe it's important to report it to engineering for deeper evaluation, as the consequences are quite severe and I'm not sure just masking lvmetad covers all cases.
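
(For reference, a quick way to check where a given host stands; a minimal sketch using the standard RHEL 7 unit names and lvm.conf location:)

# systemctl is-enabled lvm2-lvmetad.socket lvm2-lvmetad.service
# grep use_lvmetad /etc/lvm/lvm.conf

If the units report "masked" and lvm.conf has use_lvmetad = 0, the scenario below should not apply.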

A customer was complaining of random disk corruptions on RHV 3.6 (vdsm-4.17). At some point we noticed some LVs were missing from the Storage Domain as some VMs were throwing bad volume specification errors when starting up/snapshotting/migrating.

After some investigation, it came down to this:
1. lvmetad had much older metadata cached.
2. An lvm command run without --config 'global { use_lvmetad = 0 }' wrote that older cached metadata back to storage. The older metadata was missing several LVs.
3. Those LVs are now missing, but the respective VMs are still running.
4. There is now "free space" that can be allocated to new disks.
5. New VMs are created with extents overlapping those of the LVs from steps 3 and 4.
6. Lots of corruption (a way to look for such overlaps is sketched right after this list).
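
(As an aside: if older metadata backups exist, e.g. under /etc/lvm/archive or /etc/lvm/backup on the SPM host, the extent ranges of the lost LVs can be compared against what the current LVs occupy, for example with something like the following; 'myvg' is just the example VG used in the reproducer below:)

# lvs -o vg_name,lv_name,devices,seg_pe_ranges --config 'global { use_lvmetad = 0 }' myvg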

The problem:
- Old metadata sits in the lvmetad cache, ready to be written back to storage whenever some lvm command is run without the right options (a quick consistency check is sketched below).
- Support uses several internal procedures based on lvm commands for recovery:
  * live storage migration failures
  * snapshot failures
  * LUN resizing (in 3.6 this can be done through the GUI, but on 3.5 it can be done from RHEL 7).
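
For illustration, a minimal way to see whether lvmetad's cached view has drifted from what is actually on disk is to compare the VG sequence number with and without the daemon (same commands as used in the reproducer below; 'myvg' stands in for the storage domain VG):

# vgs -o +seq_no myvg
# vgs -o +seq_no --config 'global { use_lvmetad = 0 }' myvg

If the two sequence numbers differ, any metadata-modifying lvm command that goes through lvmetad risks pushing the stale copy back to the storage domain.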

Version-Release number of selected component (if applicable):
RHEL7 Host, RHV < 4.0.7

How reproducible:
100%

Steps to Reproduce:
1. Start with a RHEL 7.3 host and a blank 10G disk.

# lsblk /dev/vdb
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vdb  253:16   0  10G  0 disk 

2. Enable lvmetad

# systemctl start lvm2-lvmetad.service

3. Create a 5G PV on the 10G disk and a VG to act as the "RHV" Storage Domain

# pvcreate /dev/vdb --setphysicalvolumesize 5G
  Physical volume "/dev/vdb" successfully created.
# vgcreate myvg /dev/vdb
  Volume group "myvg" successfully created

4. Use LVM commands with "use_lvmetad = 0" to simulate VDSM

# lvcreate --config 'global { use_lvmetad = 0 }' myvg -L 1M
# lvcreate --config 'global { use_lvmetad = 0 }' myvg -L 1M
# lvcreate --config 'global { use_lvmetad = 0 }' myvg -L 1M
# vgs -o +seq_no --config 'global { use_lvmetad = 0 }' 
  VG   #PV #LV #SN Attr   VSize VFree Seq
  myvg   1   3   0 wz--n- 5.00g 4.98g   4

All good: we created 3 "RHV disks" (LVs), they are there, and the VG metadata seq_no is now 4.
Our VMs are running happily with their disks, and in a real setup the SPM VDSM would keep making more changes to the LVM metadata in the same way.

However, note that lvmetad still sees seq_no=1 and 0 LVs, so it's far behind:

# vgs -o +seq_no
  VG   #PV #LV #SN Attr   VSize VFree Seq
  myvg   1   0   0 wz--n- 5.00g 5.00g   1
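
(As an aside: at this point lvmetad's cache could be refreshed from disk with a rescan, which is the mitigation lvm's own warnings suggest; we deliberately do not do that here, so the rollback below can be demonstrated. A minimal sketch:)

# pvscan --cache /dev/vdb
# vgs -o +seq_no

After the rescan, the cached seq_no would match the on-disk value (4 in this example).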

5. Now we resize the "RHV Storage Domain" with a plain lvm command, i.e. one that goes through lvmetad

# pvresize /dev/vdb
  Physical volume "/dev/vdb" changed
  1 physical volume(s) resized / 0 physical volume(s) not resized

And... we ROLL BACK THE METADATA:
- seq_no is back to 2 (the stale cached value of 1, incremented)
- our 3 "RHV Disks" are gone

# vgs -o +seq_no
  VG   #PV #LV #SN Attr   VSize  VFree  Seq
  myvg   1   0   0 wz--n- 10.00g 10.00g   2

And it doesn't matter how we look at it; the stale metadata has been written to disk:

# vgs -o +seq_no --config 'global { use_lvmetad = 0 }' 
  VG   #PV #LV #SN Attr   VSize  VFree  Seq
  myvg   1   0   0 wz--n- 10.00g 10.00g   2

Actual results:
Metadata can be rolled back, causing several problems, including corruption.

Expected results:
- Does anything else need to be done to prevent this in newer versions?
- Shouldn't RHV implement some monitoring of SD metadata to warn users of problems instead of surfacing random failures? For example:
  * unintended vgcfgrestores
  * unintended metadata wipes (this happens on a weekly basis)
  * metadata tampering
- In this case, seq_no was rolled back on several Storage Domains. If RHV had somehow detected this and blocked new disk creations/extensions (a rough sketch of such a check follows this list), no data would have been lost.
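
Purely to illustrate that last point (this is not something vdsm does; the VG name, state file path and alerting are made up for the example), a trivial external check that flags a sequence number going backwards could look like this:

#!/bin/bash
# Illustrative only: warn if a VG's metadata sequence number ever decreases.
VG=myvg                                   # storage domain VG (example name)
STATE=/var/tmp/${VG}.last_seqno           # where the last seen seq_no is kept

current=$(vgs --noheadings -o seq_no --config 'global { use_lvmetad = 0 }' "$VG" | tr -d ' ')
last=$(cat "$STATE" 2>/dev/null)
last=${last:-0}

if [ "$current" -lt "$last" ]; then
    echo "WARNING: $VG metadata seq_no went from $last to $current - possible rollback" >&2
    exit 1
fi

echo "$current" > "$STATE"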

Comment 1 Germano Veit Michel 2017-07-09 23:45:07 UTC
lvremove does the same:

# lvs --config 'global { use_lvmetad = 0 }' 
  WARNING: Not using lvmetad because config setting use_lvmetad=0.
  WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).
  LV    VG   Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lvol0 myvg -wi-a----- 4.00m                                                    
  lvol1 myvg -wi-a----- 4.00m                                                    
  lvol2 myvg -wi-a----- 4.00m                                                    
  lvol3 myvg -wi-a----- 4.00m                                                    
  lvol4 myvg -wi-a----- 4.00m                                                    
  lvol5 myvg -wi-a----- 4.00m        

# lvs
  LV    VG   Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lvol0 myvg -wi-a----- 4.00m                                                    
  lvol1 myvg -wi-a----- 4.00m                                                    
  lvol2 myvg -wi-a----- 4.00m       
                                            
# lvremove /dev/myvg/lvol1

# lvs --config 'global { use_lvmetad = 0 }' 
  WARNING: Not using lvmetad because config setting use_lvmetad=0.
  WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).
  LV    VG   Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lvol0 myvg -wi-a----- 4.00m                                                    
  lvol2 myvg -wi-a----- 4.00m                                                    

# lvs
  LV    VG   Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lvol0 myvg -wi-a----- 4.00m                                                    
  lvol2 myvg -wi-a----- 4.00m

Comment 7 Nir Soffer 2017-07-10 10:04:27 UTC
Yes, it is possible to ruin your storage by running lvm commands on a host; this is
not supported.

Vdsm is the owner of all multipath devices on a host and the pvs, vgs, and lvs on these. Only vdsm, and in particular the spm host may modify the shared storage.

For the example in comment 0, we have supported automatic PV resizing since 3.6, so
there is no need to do this manually.

If you want to modify the storage directly, you are responsible for using the correct
lvm command line options, and you must ensure that the spm is not running; otherwise
you may corrupt the lvm metadata.
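
(For illustration only, not an endorsement of manual changes: the kind of options meant here are ones that bypass the host's lvmetad cache and restrict device scanning, along the lines of the --config used throughout this report. The filter below is a made-up example for the /dev/vdb reproducer; the exact config vdsm builds differs between versions.)

# pvresize --config 'devices { filter = [ "a|^/dev/vdb$|", "r|.*|" ] } global { use_lvmetad = 0 }' /dev/vdb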

For 3.6 I recommend disabling lvmetad manually, since it is not compatible with
ovirt shared storage.

Comment 8 Allon Mureinik 2017-07-10 10:25:36 UTC
(In reply to Nir Soffer from comment #7)
> Yes, it is possible to ruin your storage by running lvm commands on a host;
> this is not supported.
> 
> Vdsm is the owner of all multipath devices on a host and the pvs, vgs, and
> lvs on these. Only vdsm, and in particular the spm host may modify the
> shared storage.
> 
> For the example in comment 0, we have supported automatic PV resizing since
> 3.6, so there is no need to do this manually.
> 
> If you want to modify the storage directly, you are responsible for using
> the correct lvm command line options, and you must ensure that the spm is
> not running; otherwise you may corrupt the lvm metadata.
> 
> For 3.6 I recommend disabling lvmetad manually, since it is not compatible
> with ovirt shared storage.
I tend to agree, and I'd rather not introduce another backport to 3.6.z.


(In reply to Germano Veit Michel from comment #0)
> - Support uses several internal procedures based on lvm commands for recovery:
>   * live storage migration failures
>   * snapshot failures
>   * LUN resizing (in 3.6 this can be done through the GUI, but on 3.5 it can
> be done from RHEL 7).
Germano - can you please share these procedures (probably in a private comment)?
Engineering should review them and suggest the correct arguments for the lvm commands.

Comment 9 Allon Mureinik 2017-07-10 10:32:13 UTC
Yaniv - I suggest closing as WONTFIX based on comment 7 and comment 8 (and solving this by updating CEE's procedures).
Please confirm.

Comment 10 Yaniv Lavi 2017-07-10 11:36:56 UTC
(In reply to Allon Mureinik from comment #9)
> Yaniv - I suggest closing as WONTFIX based on comment 7 and comment 8 (and
> solve this by updating CEE's procedures).
> Please confirm.

Ack, let's work with CEE to make sure their flows are supportable ones.
Germano, please send an email to the tech list so we will be able to review the current solutions.

Comment 11 Germano Veit Michel 2017-07-10 23:06:42 UTC
Yes, WONTFIX is OK and logical.

Nir, I just need a way to safely disable lvmetad online. I'm afraid anything can trigger it to write that old stale metadata back to the storage. How do we safely disable it? Even in maintenance mode, if the storage is FC there is a risk, right? What do you recommend?

Regarding internal solutions using lvm commands without "use_lvmetad = 0", we probably have dozens. I will send an email to the support list to warn about this.

Comment 12 Nir Soffer 2017-09-04 13:35:14 UTC
Modifying lvm.conf like this:

global {
    ...
    # The lvmetad service is not compatible with ovirt shared storage. We
    # disable the lvm2-lvmetad.socket/service; this option tells lvm commands
    # not to try to access the disabled service.
    use_lvmetad = 0
    ...
}

This should prevent any lvm command from trying to use the lvmetad daemon.
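
(For completeness, a sketch of also stopping the daemon itself so nothing keeps a stale cache around; unit names are the standard RHEL 7 ones:)

# systemctl stop lvm2-lvmetad.service lvm2-lvmetad.socket
# systemctl mask lvm2-lvmetad.service lvm2-lvmetad.socket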