Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1446492

Summary: Storage domain in 4.1 RHV will go offline if LVM metadata was restored manually
Product: Red Hat Enterprise Virtualization Manager
Reporter: nijin ashok <nashok>
Component: vdsm
Assignee: Nir Soffer <nsoffer>
Status: CLOSED ERRATA
QA Contact: Kevin Alon Goldblatt <kgoldbla>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 4.1.0
CC: bcholler, cshao, eedri, fgarciad, gveitmic, gwatson, huzhao, jcoscia, jentrena, kgoldbla, lsurette, mgoldboi, mkalinin, nashok, qiyuan, ratamir, rhodain, sbonazzo, srevivo, tcarlin, tnisan, trichard, weiwang, yaniwang, ycui, ykaul, yzhao
Target Milestone: ovirt-4.2.0
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, an incorrect storage domain procedure could create invalid storage domain LVM metadata. When detected, the system would fail to activate the storage domain. Now, the system logs a warning when invalid storage domain metadata is detected, without failing the activation.
Story Points: ---
Clone Of:
Clones: 1450634, 1452984, 1454864 (view as bug list)
Environment:
Last Closed: 2018-05-15 17:51:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1450634, 1452984, 1454864

Description nijin ashok 2017-04-28 08:25:06 UTC
Description of problem:

If the customer's LVM metadata was restored manually without the --metadatacopies switch, the storage domain will fail to activate in RHV 4.1 because of a new check added as part of the StorageDomain.getInfo flow ( https://gerrit.ovirt.org/#/c/64433/ ), where vdsm reports to the engine which PV holds the active LVM metadata.

It therefore fails with the error below, since the LVM default for metadatacopies is 1.

=====
2017-04-28 01:55:51,276+0530 ERROR (jsonrpc/1) [storage.StoragePool] Couldn't read from master domain (sp:1393)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 1391, in getInfo
    msdInfo = self.masterDomain.getInfo()
  File "/usr/share/vdsm/storage/blockSD.py", line 1243, in getInfo
    info['vgMetadataDevice'] = self._manifest.getVgMetadataDevice()
  File "/usr/share/vdsm/storage/blockSD.py", line 487, in getVgMetadataDevice
    return os.path.basename(lvm.getVgMetadataPv(self.sdUUID))
  File "/usr/share/vdsm/storage/lvm.py", line 1430, in getVgMetadataPv
    (vgName, pvs))
UnexpectedVolumeGroupMetadata: Volume Group metadata isn't as expected: "reason=Expected one metadata pv in vg: 3cb67522-1df1-47e3-8e85-2c116e500590, vg pvs: [PV(uuid='s8jUp5-qetw-bvg9-3pyY-YWfp-qGGb-K4NyIQ', name='/dev/mapper/360014054ae1fdf75b074e488e9d803bd', size='12482248704', vg_name='3cb67522-1df1-47e3-8e85-2c116e500590', vg_uuid='vCWEue-hbwr-KL0X-D214-gUoQ-RCex-LtkGJA', pe_start='135266304', pe_count='93', pe_alloc_count='0', mda_count='1', dev_size='12884901888', mda_used_count='1', guid='360014054ae1fdf75b074e488e9d803bd'), 

PV(uuid='z3YlYS-sSfq-cRQF-Qb23-IJbh-LxAR-AmXuqJ', name='/dev/mapper/360014053f404fa44d844d9198cfee437', size='52210696192', vg_name='3cb67522-1df1-47e3-8e85-2c116e500590', vg_uuid='vCWEue-hbwr-KL0X-D214-gUoQ-RCex-LtkGJA', pe_start='135266304', pe_count='389', pe_alloc_count='47', mda_count='1', dev_size='52613349376', mda_used_count='1', guid='360014053f404fa44d844d9198cfee437')]"

Relevant code.

def getVgMetadataPv(vgName):
    pvs = _lvminfo.getPvs(vgName)
    mdpvs = [pv for pv in pvs
             if not isinstance(pv, Stub) and _isMetadataPv(pv)]
    if len(mdpvs) != 1:
        raise se.UnexpectedVolumeGroupMetadata("Expected one metadata pv in "
                                               "vg: %s, vg pvs: %s" %
                                               (vgName, pvs))
    return mdpvs[0].name


def _isMetadataPv(pv):
    return pv.mda_used_count == '2'

====
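
A quick way to see what _isMetadataPv is checking is to list the metadata area
counts of the PVs in the storage domain VG. The snippet below is an illustrative
diagnostic only (not vdsm code); it assumes the lvm2 report fields pv_mda_count
and pv_mda_used_count and the --select option are available, and it reuses the
VG name from the traceback above:

import subprocess

def pv_metadata_counts(vg_name):
    out = subprocess.check_output(
        ["pvs", "--noheadings", "--separator", "|",
         "-o", "pv_name,pv_mda_count,pv_mda_used_count",
         "--select", "vg_name=%s" % vg_name])
    for line in out.decode().splitlines():
        if not line.strip():
            continue
        name, mda_count, mda_used = (f.strip() for f in line.split("|"))
        print("%s mda_count=%s mda_used_count=%s" % (name, mda_count, mda_used))

# In the format vdsm expects, exactly one PV reports mda_used_count=2 and all
# other PVs report mda_used_count=0; in the failure above every PV reports
# mda_used_count=1.
pv_metadata_counts("3cb67522-1df1-47e3-8e85-2c116e500590")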

Since the number of metadata copies cannot be increased after a PV is created, the only option is to restore the metadata again with the correct options, which requires a complete downtime of the VMs.

Also, vdsm expects exactly one PV with active metadata; metadata on the other PVs must be disabled with pvchange --metadataignore y, or activation fails with the same error as above even if the PVs were created with 2 metadata copies.
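
For context, this expectation matches how vdsm itself initializes PVs when
creating a block storage domain: every PV gets two metadata areas that are
marked ignored, and the metadata areas of the first PV are then re-enabled, so
exactly one PV carries the active VG metadata. A rough sketch of that sequence
(a paraphrase for illustration, not the actual vdsm code path):

import subprocess

def init_storage_domain_pvs(devices, metadata_size_mb):
    # Create every PV with two metadata areas, both ignored.
    subprocess.check_call(
        ["pvcreate", "--metadatasize", "%sm" % metadata_size_mb,
         "--metadatacopies", "2", "--metadataignore", "y"] + list(devices))
    # Re-enable the metadata areas on the first PV only, so that PV ends up
    # with mda_used_count=2 and all the other PVs with mda_used_count=0.
    subprocess.check_call(["pvchange", "--metadataignore", "n", devices[0]])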
 

Version-Release number of selected component (if applicable):

vdsm-4.19.10.1-1.el7ev.x86_64


How reproducible:

100%

Steps to Reproduce:

Restore the LVM metadata manually without the --metadatacopies option.

Actual results:

The storage domain goes offline if the LVM metadata was restored manually.

Expected results:

Do not fail to activate the storage domain if mda_used_count is 1 for the PV.

Additional info:

Comment 6 Nir Soffer 2017-05-04 19:01:14 UTC
(In reply to nijin ashok from comment #0)
> Description of problem:
> 
> If the customer's LVM metadata was restored manually without the
> --metadatacopies switch, the storage domain will fail to activate in RHV 4.1
> because of a new check added as part of the StorageDomain.getInfo flow
> ( https://gerrit.ovirt.org/#/c/64433/ ), where vdsm reports to the engine
> which PV holds the active LVM metadata.

This is expected: RHV supports only the RHV storage domain format, and this
storage domain is not in that format.

> Since the number of metadata copies cannot be increased after a PV is
> created, the only option is to restore the metadata again with the correct
> options, which requires a complete downtime of the VMs.

Yes, this is what should be done: recreate the VG with the correct options.

> Also, vdsm expects exactly one PV with active metadata; metadata on the
> other PVs must be disabled with pvchange --metadataignore y, or activation
> fails with the same error as above even if the PVs were created with 2
> metadata copies.

Right, this is also part of the format.

> Expected results:
> 
> Do not fail to activate the storage domain if mda_used_count is 1 for the
> PV.

I don't think this expectation is feasible.

We never tested this format, and we cannot guarantee that anything
will work with such a storage domain. The best thing we can do is to fail
to activate this storage domain.

The new checks are required for removing PVs from a storage domain
(a new feature in 4.1). We can consider not failing to activate such a
storage domain and disabling this feature, but I don't see how we can
support such a system.

Liron, can the engine treat the new metadata keys as optional, and disable
removing PVs (with a warning) if the info is not available?

Comment 7 Allon Mureinik 2017-05-08 12:18:01 UTC
(In reply to Nir Soffer from comment #6)
> Liron, can the engine treat the new metadata keys as optional, and disable
> removing PVs (with a warning) if the info is not available?

A quick review of the engine's code seems to show that this is indeed the behavior. I think we're OK from the engine's side.

Comment 8 Nir Soffer 2017-05-08 12:39:49 UTC
I reproduced this issue by modifying vdsm to not use the --metadatacopies and
--metadataignore options when creating a storage domain, so it creates an
invalid storage domain with the same configuration as a badly restored storage
domain.

This is how the vdsm log looks now when we find such a storage domain:

1. Starting the getStorageDomainInfo request:

2017-05-08 15:24:56,054+0300 INFO  (jsonrpc/5) [dispatcher] Run and protect: getStorageDomainInfo(sdUUID=u'e529906a-f20f-4ad7-99e6-20242678a58e', options=None) (logUtils:51)

2. Warning about unsupported storage domain:

2017-05-08 15:24:56,258+0300 WARN  (jsonrpc/5) [storage.StorageDomain] Cannot get metadata device, this storage domain is unsupported: Volume Group metadata isn't as expected: u"reason=Expected one metadata pv in vg: e529906a-f20f-4ad7-99e6-20242678a58e, vg pvs: [PV(uuid='CTcOe0-uqQJ-c3lk-02SV-PYGf-cPvS-fPBjZ8', name='/dev/mapper/360014052e489b6f5ed34881ac5ed27fd', size='53418655744', vg_name='e529906a-f20f-4ad7-99e6-20242678a58e', vg_uuid='CXMs11-xiGr-JzAU-5PtP-PW1j-bkjO-2jKlgc', pe_start='142606336', pe_count='398', pe_alloc_count='39', mda_count='1', dev_size='53687091200', mda_used_count='1', guid='360014052e489b6f5ed34881ac5ed27fd'), PV(uuid='hGmFCu-i6Hm-LylO-EFLX-I5p2-hxWW-GyoTfZ', name='/dev/mapper/360014052de8e2b8007944a1a93a82c40', size='53418655744', vg_name='e529906a-f20f-4ad7-99e6-20242678a58e', vg_uuid='CXMs11-xiGr-JzAU-5PtP-PW1j-bkjO-2jKlgc', pe_start='142606336', pe_count='398', pe_alloc_count='0', mda_count='1', dev_size='53687091200', mda_used_count='1', guid='360014052de8e2b8007944a1a93a82c40'), PV(uuid='bWau5Z-C5Vb-Vt0G-Wjwp-2gYP-8X6m-Unocp3', name='/dev/mapper/36001405bfc49549313946daa720cb1ba', size='53418655744', vg_name='e529906a-f20f-4ad7-99e6-20242678a58e', vg_uuid='CXMs11-xiGr-JzAU-5PtP-PW1j-bkjO-2jKlgc', pe_start='142606336', pe_count='398', pe_alloc_count='0', mda_count='1', dev_size='53687091200', mda_used_count='1', guid='36001405bfc49549313946daa720cb1ba')]" (blockSD:1243)

3. Returning response without vgMetadataDevice key:

2017-05-08 15:24:56,258+0300 INFO  (jsonrpc/5) [dispatcher] Run and protect: getStorageDomainInfo, Return response: {'info': {'uuid': u'e529906a-f20f-4ad7-99e6-20242678a58e', 'vguuid': 'CXMs11-xiGr-JzAU-5PtP-PW1j-bkjO-2jKlgc', 'metadataDevice': '360014052e489b6f5ed34881ac5ed27fd', 'state': 'OK', 'version': '4', 'role': 'Regular', 'type': 'ISCSI', 'class': 'Data', 'pool': [], 'name': 'bad-sd'}} (logUtils:54)
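
The behaviour above can be summarised by the following standalone illustration
(not the actual vdsm patch; the names are simplified stand-ins for the code in
blockSD.py): treat the VG metadata device as optional, log a warning, and omit
the key instead of failing the whole call.

import logging

log = logging.getLogger("storage.StorageDomain")

class UnexpectedVolumeGroupMetadata(Exception):
    pass

def add_vg_metadata_device(manifest, info):
    # On a correctly formatted domain, report the PV holding the VG metadata.
    # On an unsupported domain, warn and leave the key out of the response.
    try:
        info['vgMetadataDevice'] = manifest.getVgMetadataDevice()
    except UnexpectedVolumeGroupMetadata as e:
        log.warning("Cannot get metadata device, this storage domain "
                    "is unsupported: %s", e)
    return info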

Comment 9 Nir Soffer 2017-05-08 13:04:24 UTC
Testing this change:
1. Create an invalid storage domain with one used metadata area on every PV
2. Run StorageDomain getInfo:

# vdsm-client StorageDomain getInfo storagedomainID=e529906a-f20f-4ad7-99e6-20242678a58e
{
    "uuid": "e529906a-f20f-4ad7-99e6-20242678a58e", 
    "vguuid": "CXMs11-xiGr-JzAU-5PtP-PW1j-bkjO-2jKlgc", 
    "metadataDevice": "360014052e489b6f5ed34881ac5ed27fd", 
    "state": "OK", 
    "version": "4", 
    "role": "Regular", 
    "type": "ISCSI", 
    "class": "Data", 
    "pool": [
        "6c99f4e5-8588-46f5-a818-e11151c1d19c"
    ], 
    "name": "bad-sd"
}

The call should succeed, not returning the vgMetadataDevice key.

Here is the same request for a good sd:

# vdsm-client StorageDomain getInfo storagedomainID=aed577ea-d1ca-4ebe-af80-f852c7ce59bb
{
    "uuid": "aed577ea-d1ca-4ebe-af80-f852c7ce59bb", 
    "type": "ISCSI", 
    "vguuid": "7T9sFi-okfz-JZON-xDUK-n0vH-OpyH-L7IjKO", 
    "metadataDevice": "360014052761af2654a94a70a60a7ee3f", 
    "state": "OK", 
    "version": "4", 
    "role": "Master", 
    "vgMetadataDevice": "360014052761af2654a94a70a60a7ee3f", 
    "class": "Data", 
    "pool": [
        "6c99f4e5-8588-46f5-a818-e11151c1d19c"
    ], 
    "name": "dumbo-iscsi-01"
}
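
A small, hypothetical helper (not part of vdsm) that scripts the same
comparison by calling vdsm-client and checking whether the returned JSON
contains the vgMetadataDevice key:

import json
import subprocess

def has_vg_metadata_device(sd_uuid):
    out = subprocess.check_output(
        ["vdsm-client", "StorageDomain", "getInfo",
         "storagedomainID=%s" % sd_uuid])
    return "vgMetadataDevice" in json.loads(out)

for sd_uuid in ("e529906a-f20f-4ad7-99e6-20242678a58e",   # bad-sd
                "aed577ea-d1ca-4ebe-af80-f852c7ce59bb"):  # dumbo-iscsi-01
    status = "ok" if has_vg_metadata_device(sd_uuid) else "unsupported format"
    print("%s: %s" % (sd_uuid, status))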

Comment 16 rhev-integ 2017-05-12 15:10:34 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found clone flags: ['rhevm-4.1.z', 'rhevm-4.2-ga'], ]

For more info please contact: rhv-devops

Comment 17 Nir Soffer 2017-05-12 15:15:29 UTC
Tal, this bug should be ready for QA, but the bot is complaining about the flags,
see comment 16. Can you help with this?

Comment 22 Kevin Alon Goldblatt 2017-05-28 15:15:50 UTC
Verified with the following code:
---------------------------------------
ovirt-engine-4.2.0-0.0.master.20170523140304.git04be891.el7.centos.noarch
vdsm-4.20.0-886.gitf9accf8.el7.centos.x86_64

Verified with the following scenario:
---------------------------------------
1. Modified the lvm.py file on the vdsm host in order to create an invalid storage domain with one used metadata area on every PV, as follows:

------------------------------------------
diff -uab lvm.py.orig lvm.py.bad
--- lvm.py.orig	2017-05-28 16:13:26.855334099 +0300
+++ lvm.py.bad	2017-05-28 18:06:45.883392405 +0300
@@ -722,9 +722,7 @@
     if options:
         cmd.extend(options)
     if metadataSize != 0:
-        cmd.extend(("--metadatasize", "%sm" % metadataSize,
-                    "--metadatacopies", "2",
-                    "--metadataignore", "y"))
+        cmd.extend(("--metadatasize", "%sm" % metadataSize))
     cmd.extend(devices)
     rc, out, err = _lvminfo.cmd(cmd, devices)
     return rc, out, err
@@ -984,12 +982,6 @@
     _checkpvsblksize(pvs)
 
     _initpvs(pvs, metadataSize, force)
-    # Activate the 1st PV metadata areas
-    cmd = ["pvchange", "--metadataignore", "n"]
-    cmd.append(pvs[0])
-    rc, out, err = _lvminfo.cmd(cmd, tuple(pvs))
-    if rc != 0:
-        raise se.PhysDevInitializationError(pvs[0])
 
     options = ["--physicalextentsize", "%dm" % VG_EXTENT_SIZE_MB]
     if initialTag:
------------------------------------------

2. Created a new block domain
3. Verified the warning in the vdsm.log as follows:

2017-05-28 18:07:44,672+0300 WARN  (jsonrpc/5) [storage.StorageDomain] Cannot get VG metadata device, this storage domain is unsupported: Volume Group metadata isn't as expected: "reason=Expected one metadata pv in vg: 8985b8d5-404c-4de8-b2d9-d8466b06ab77, vg pvs: [PV(uuid='4RJO7I-xzm0-VTKs-Lday-S1cl-UdrC-5PKa6V', name='/dev/mapper/3514f0c5a51600676', size='53418655744', vg_name='8985b8d5-404c-4de8-b2d9-d8466b06ab77', vg_uuid='Mnz1Dx-Vkeh-ibX0-BcHU-T2uu-1uht-9TucUe', pe_start='135266304', pe_count='398', pe_alloc_count='71', mda_count='1', dev_size='53687091200', mda_used_count='1', guid='3514f0c5a51600676')]" (blockSD:1243)

4. Verified creation of a new VM and disk on the invalid domain


Moving to VERIFIED

Comment 27 errata-xmlrpc 2018-05-15 17:51:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1489

Comment 28 Franta Kust 2019-05-16 13:08:18 UTC
BZ<2>Jira Resync