Bug 1450634 - [downstream clone - 4.1.2] Storage domain in 4.1 RHV will go offline if LVM metadata was restored manually
Summary: [downstream clone - 4.1.2] Storage domain in 4.1 RHV will go offline if LVM metadata was restored manually
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.1.0
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.1.2
Assignee: Nir Soffer
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Depends On: 1446492
Blocks:
 
Reported: 2017-05-14 09:16 UTC by rhev-integ
Modified: 2020-06-11 13:49 UTC
CC List: 31 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
An incorrect storage domain restore procedure can create invalid storage domain LVM metadata. Previously, when invalid storage domain LVM metadata was detected, the system failed to activate the storage domain. Now a warning is logged when an invalid storage domain is detected, and activation no longer fails.
Clone Of: 1446492
Environment:
Last Closed: 2017-05-24 11:25:22 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3019041 0 None None None 2017-05-14 09:19:51 UTC
Red Hat Product Errata RHEA-2017:1281 0 normal SHIPPED_LIVE VDSM bug fix and enhancement update 4.1.2 2017-05-24 15:18:53 UTC
oVirt gerrit 76580 0 master MERGED blockSD: Allow using badly restored storage domains 2020-04-27 22:59:55 UTC
oVirt gerrit 76644 0 ovirt-4.1 MERGED blockSD: Allow using badly restored storage domains 2020-04-27 22:59:55 UTC

Description rhev-integ 2017-05-14 09:16:13 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1446492 +++
======================================================================

Description of problem:

If the customer's LVM metadata was manually restored without the --metadatacopies switch, the storage domain will fail to activate in RHV 4.1, because a new check was added as part of the StorageDomain.getInfo flow ( https://gerrit.ovirt.org/#/c/64433/ ) in which vdsm reports to the engine the PV that holds the active LVM metadata.

Activation then fails with the error below, because the default number of metadata copies in LVM is 1.

=====
2017-04-28 01:55:51,276+0530 ERROR (jsonrpc/1) [storage.StoragePool] Couldn't read from master domain (sp:1393)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 1391, in getInfo
    msdInfo = self.masterDomain.getInfo()
  File "/usr/share/vdsm/storage/blockSD.py", line 1243, in getInfo
    info['vgMetadataDevice'] = self._manifest.getVgMetadataDevice()
  File "/usr/share/vdsm/storage/blockSD.py", line 487, in getVgMetadataDevice
    return os.path.basename(lvm.getVgMetadataPv(self.sdUUID))
  File "/usr/share/vdsm/storage/lvm.py", line 1430, in getVgMetadataPv
    (vgName, pvs))
UnexpectedVolumeGroupMetadata: Volume Group metadata isn't as expected: "reason=Expected one metadata pv in vg: 3cb67522-1df1-47e3-8e85-2c116e500590, vg pvs: [PV(uuid='s8jUp5-qetw-bvg9-3pyY-YWfp-qGGb-K4NyIQ', name='/dev/mapper/360014054ae1fdf75b074e488e9d803bd', size='12482248704', vg_name='3cb67522-1df1-47e3-8e85-2c116e500590', vg_uuid='vCWEue-hbwr-KL0X-D214-gUoQ-RCex-LtkGJA', pe_start='135266304', pe_count='93', pe_alloc_count='0', mda_count='1', dev_size='12884901888', mda_used_count='1', guid='360014054ae1fdf75b074e488e9d803bd'), 

PV(uuid='z3YlYS-sSfq-cRQF-Qb23-IJbh-LxAR-AmXuqJ', name='/dev/mapper/360014053f404fa44d844d9198cfee437', size='52210696192', vg_name='3cb67522-1df1-47e3-8e85-2c116e500590', vg_uuid='vCWEue-hbwr-KL0X-D214-gUoQ-RCex-LtkGJA', pe_start='135266304', pe_count='389', pe_alloc_count='47', mda_count='1', dev_size='52613349376', mda_used_count='1', guid='360014053f404fa44d844d9198cfee437')]"

Relevant code.

def getVgMetadataPv(vgName):
    pvs = _lvminfo.getPvs(vgName)
    mdpvs = [pv for pv in pvs
             if not isinstance(pv, Stub) and _isMetadataPv(pv)]
    if len(mdpvs) != 1:
        raise se.UnexpectedVolumeGroupMetadata("Expected one metadata pv in "
                                               "vg: %s, vg pvs: %s" %
                                               (vgName, pvs))
    return mdpvs[0].name


def _isMetadataPv(pv):
    return pv.mda_used_count == '2'

====
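
For reference, the metadata area counters that getVgMetadataPv() and _isMetadataPv() inspect (mda_count, mda_used_count) can be checked directly with LVM. A minimal sketch, for inspection only (run on the host and look at the PVs of the domain's VG):

# pvs -o pv_name,vg_name,pv_mda_count,pv_mda_used_count

In a correctly created block storage domain, exactly one PV is expected to report pv_mda_used_count=2 (this is what _isMetadataPv() checks), while the metadata areas of the remaining PVs are ignored. In the badly restored domain above, every PV reports mda_used_count='1', so no PV matches and UnexpectedVolumeGroupMetadata is raised.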

Since the number of metadata copies cannot be increased after a PV is created, the only option is to restore the metadata again with the correct options, which requires a complete downtime of the VMs.

Also, vdsm expects exactly one PV with active metadata; metadata on the other PVs must be disabled with 'pvchange --metadataignore y', or activation will fail with the same error as above even if the PVs were created with 2 metadata copies.
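
For illustration only (this is not a supported recovery procedure; the device path is a placeholder), disabling the metadata area on each extra PV looks like:

# pvchange --metadataignore y /dev/mapper/<extra-pv-wwid>

The PV that is supposed to carry the VG metadata cannot be fixed this way, because its number of metadata copies is set when the PV is created, which is why a full re-restore with downtime is needed, as noted above.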
 

Version-Release number of selected component (if applicable):

vdsm-4.19.10.1-1.el7ev.x86_64


How reproducible:

100%

Steps to Reproduce:

Restore the LVM metadata manually without the --metadatacopies option (see the sketch below).
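
A minimal sketch of such a manual restore using the standard LVM procedure (the PV UUID, backup file, device path, and VG name are placeholders); pvcreate defaults to a single metadata copy when --metadatacopies is not given:

# pvcreate --uuid <pv-uuid> --restorefile /etc/lvm/backup/<vg-name> /dev/mapper/<device>
# vgcfgrestore -f /etc/lvm/backup/<vg-name> <vg-name>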

Actual results:

The storage domain will go offline if the LVM metadata was restored manually.

Expected results:

Do not fail storage domain activation if mda_used_count is 1 for the PV.

Additional info:

(Originally by Nijin Ashok)

Comment 7 rhev-integ 2017-05-14 09:17:16 UTC
(In reply to nijin ashok from comment #0)
> Description of problem:
> 
> If the customer's LVM metadata was manually restored without the
> --metadatacopies switch, the storage domain will fail to activate in RHV 4.1,
> because a new check was added as part of the StorageDomain.getInfo flow
> ( https://gerrit.ovirt.org/#/c/64433/ ) in which vdsm reports to the engine
> the PV that holds the active LVM metadata.

This is expected. RHV supports only the RHV storage domain format, and this
storage domain is not in the correct format.

> Since the number of metadata copies cannot be increased after a PV is
> created, the only option is to restore the metadata again with the correct
> options, which requires a complete downtime of the VMs.

Yes, this is what should be done, recreate the vg with the correct options.

> Also, vdsm expects exactly one PV with active metadata; metadata on the
> other PVs must be disabled with 'pvchange --metadataignore y', or activation
> will fail with the same error as above even if the PVs were created with 2
> metadata copies.

Right, this is also part of the format.

> Expected results:
> 
> Do not fail storage domain activation if mda_used_count is 1 for the PV.

I don't think this expectation is feasible.

We never tested this format, and we cannot guarantee that anything
will work with such a storage domain. The best thing we can do is to fail
to activate this storage domain.

The new checks are required for removing PVs from a storage domain
(a new feature in 4.1). We can consider not failing to activate such a
storage domain and disabling this feature, but I don't see how we can
support such a system.

Liron, can engine treat the new metadata keys as optional, and disable
removing pvs (with a warning) if the info is not available?

(Originally by Nir Soffer)

Comment 8 rhev-integ 2017-05-14 09:17:26 UTC
(In reply to Nir Soffer from comment #6)
> Liron, can engine treat the new metadata keys as optional, and disable
> removing pvs (with a warning) if the info is not available?

A quick review of the engine's code seems to show that this is indeed the behavior. I think we're OK from the engine's side.

(Originally by Allon Mureinik)

Comment 9 rhev-integ 2017-05-14 09:17:37 UTC
I reproduced this issue by modifying vdsm to not use the --metadatacopies and
--metadataignore options when creating a storage domain, so it creates an invalid
storage domain with the same configuration as a badly restored storage domain.

This is how vdsm log looks now when we find such storage domain:

1. Starting getStorageDomainInfo request:

2017-05-08 15:24:56,054+0300 INFO  (jsonrpc/5) [dispatcher] Run and protect: getStorageDomainInfo(sdUUID=u'e529906a-f20f-4ad7-99e6-20242678a58e', options=None) (logUtils:51)

2. Warning about unsupported storage domain:

2017-05-08 15:24:56,258+0300 WARN  (jsonrpc/5) [storage.StorageDomain] Cannot get metadata device, this storage domain is unsupported: Volume Group metadata isn't as expected: u"reason=Expected one metadata pv in vg: e529906a-f20f-4ad7-99e6-20242678a58e, vg pvs: [PV(uuid='CTcOe0-uqQJ-c3lk-02SV-PYGf-cPvS-fPBjZ8', name='/dev/mapper/360014052e489b6f5ed34881ac5ed27fd', size='53418655744', vg_name='e529906a-f20f-4ad7-99e6-20242678a58e', vg_uuid='CXMs11-xiGr-JzAU-5PtP-PW1j-bkjO-2jKlgc', pe_start='142606336', pe_count='398', pe_alloc_count='39', mda_count='1', dev_size='53687091200', mda_used_count='1', guid='360014052e489b6f5ed34881ac5ed27fd'), PV(uuid='hGmFCu-i6Hm-LylO-EFLX-I5p2-hxWW-GyoTfZ', name='/dev/mapper/360014052de8e2b8007944a1a93a82c40', size='53418655744', vg_name='e529906a-f20f-4ad7-99e6-20242678a58e', vg_uuid='CXMs11-xiGr-JzAU-5PtP-PW1j-bkjO-2jKlgc', pe_start='142606336', pe_count='398', pe_alloc_count='0', mda_count='1', dev_size='53687091200', mda_used_count='1', guid='360014052de8e2b8007944a1a93a82c40'), PV(uuid='bWau5Z-C5Vb-Vt0G-Wjwp-2gYP-8X6m-Unocp3', name='/dev/mapper/36001405bfc49549313946daa720cb1ba', size='53418655744', vg_name='e529906a-f20f-4ad7-99e6-20242678a58e', vg_uuid='CXMs11-xiGr-JzAU-5PtP-PW1j-bkjO-2jKlgc', pe_start='142606336', pe_count='398', pe_alloc_count='0', mda_count='1', dev_size='53687091200', mda_used_count='1', guid='36001405bfc49549313946daa720cb1ba')]" (blockSD:1243)

3. Returning response without vgMetadataDevice key:

2017-05-08 15:24:56,258+0300 INFO  (jsonrpc/5) [dispatcher] Run and protect: getStorageDomainInfo, Return response: {'info': {'uuid': u'e529906a-f20f-4ad7-99e6-20242678a58e', 'vguuid': 'CXMs11-xiGr-JzAU-5PtP-PW1j-bkjO-2jKlgc', 'metadataDevice': '360014052e489b6f5ed34881ac5ed27fd', 'state': 'OK', 'version': '4', 'role': 'Regular', 'type': 'ISCSI', 'class': 'Data', 'pool': [], 'name': 'bad-sd'}} (logUtils:54)

(Originally by Nir Soffer)

Comment 10 rhev-integ 2017-05-14 09:17:48 UTC
Testing this change:
1. Create an invalid storage domain with one used metadata area on every PV
2. Run StorageDomain getInfo:

# vdsm-client StorageDomain getInfo storagedomainID=e529906a-f20f-4ad7-99e6-20242678a58e
{
    "uuid": "e529906a-f20f-4ad7-99e6-20242678a58e", 
    "vguuid": "CXMs11-xiGr-JzAU-5PtP-PW1j-bkjO-2jKlgc", 
    "metadataDevice": "360014052e489b6f5ed34881ac5ed27fd", 
    "state": "OK", 
    "version": "4", 
    "role": "Regular", 
    "type": "ISCSI", 
    "class": "Data", 
    "pool": [
        "6c99f4e5-8588-46f5-a818-e11151c1d19c"
    ], 
    "name": "bad-sd"
}

The call should succeed without returning the vgMetadataDevice key.

Here is the same request for a good sd:

# vdsm-client StorageDomain getInfo storagedomainID=aed577ea-d1ca-4ebe-af80-f852c7ce59bb
{
    "uuid": "aed577ea-d1ca-4ebe-af80-f852c7ce59bb", 
    "type": "ISCSI", 
    "vguuid": "7T9sFi-okfz-JZON-xDUK-n0vH-OpyH-L7IjKO", 
    "metadataDevice": "360014052761af2654a94a70a60a7ee3f", 
    "state": "OK", 
    "version": "4", 
    "role": "Master", 
    "vgMetadataDevice": "360014052761af2654a94a70a60a7ee3f", 
    "class": "Data", 
    "pool": [
        "6c99f4e5-8588-46f5-a818-e11151c1d19c"
    ], 
    "name": "dumbo-iscsi-01"
}

(Originally by Nir Soffer)

Comment 17 rhev-integ 2017-05-14 09:18:56 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found clone flags: ['rhevm-4.1.z', 'rhevm-4.2-ga'], ]

For more info please contact: rhv-devops

(Originally by rhev-integ)

Comment 18 rhev-integ 2017-05-14 09:19:05 UTC
Tal, this bug should be ready for QA, but the bot is complaining about the flags,
see comment 16. Can you help with this?

(Originally by Nir Soffer)

Comment 26 Kevin Alon Goldblatt 2017-05-22 13:34:28 UTC
Verified with the following code:
------------------------------------------
ovirt-engine-4.1.2.2-0.1.el7.noarch
rhevm-4.1.2.2-0.1.el7.noarch
vdsm-4.19.14-1.el7ev.x86_64

Verified with the following scenario:
-----------------------------------------
1. Created an invalid storage domain with one used metadata area on every PV
2. Created a VM on the storage domain
3. Set the storage domain to maintenance
4. Activated the storage domain
5. Started the VM previously created
6. Created a new VM on the storage domain

Moving to VERIFIED!

Comment 28 errata-xmlrpc 2017-05-24 11:25:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1281

