Bug 1714154

Summary: [downstream clone - 4.3.4] When a storage domain is updated to V5 during a DC upgrade, if there are volumes with metadata that has been reset then the upgrade fails
Product: Red Hat Enterprise Virtualization Manager Reporter: RHV bug bot <rhv-bugzilla-bot>
Component: vdsm Assignee: Nir Soffer <nsoffer>
Status: CLOSED ERRATA QA Contact: Shir Fishbain <sfishbai>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 4.3.1 CC: aefrat, dfediuck, emarcus, gveitmic, lsurette, nsoffer, srevivo, tnisan, ycui
Target Milestone: ovirt-4.3.4 Keywords: ZStream
Target Release: 4.3.1 Flags: lsvaty: testing_plan_complete-
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: vdsm-4.30.17 Doc Type: Bug Fix
Doc Text:
Converting a storage domain to the V5 format failed when there were partly deleted volumes with cleared metadata remaining in the storage domain following an unsuccessful delete volume operation. In this release, storage domain conversion succeeds even when partly deleted volumes with cleared metadata remain in the storage domain.
Story Points: ---
Clone Of: 1713724 Environment:
Last Closed: 2019-06-20 14:48:41 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1713724    
Bug Blocks:    

Description RHV bug bot 2019-05-27 09:17:46 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1713724 +++
======================================================================

Description of problem:

If a storage domain has volume metadata that has been reset (it contains "NONE=######...."), then an attempt to upgrade the DC will fail. This appears to leave the DC in a non-responsive state: there is no SPM, no new VMs can be started, etc.

VDSM reports "MetaDataKeyNotFoundError" for each volume metadata area that is in this state.

The volume metadata may end up in this state as a result of a failed live merge at some point in the past: the volume should have been removed, its metadata was reset, but the removal itself failed.
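
For diagnosing an affected domain, a minimal sketch along these lines can scan the
metadata LV for slots in this state; the device path, slot count, and 512-byte slot
size (matching the reproduction code in comment 9) are assumptions, and this is not
vdsm's own tooling:

    import sys

    SLOT_SIZE = 512        # V4 block domains use 512-byte metadata slots
    CLEARED = b"NONE="     # cleared slots start with NONE=#####...

    def find_cleared_slots(md_dev, max_slots=2000):
        """Scan the metadata LV and return slot numbers whose metadata was reset."""
        cleared = []
        with open(md_dev, "rb") as f:
            for slot in range(max_slots):
                f.seek(slot * SLOT_SIZE)
                block = f.read(SLOT_SIZE)
                if not block:
                    break  # reached the end of the LV
                if block.startswith(CLEARED):
                    cleared.append(slot)
        return cleared

    if __name__ == "__main__":
        # Example (hypothetical path): /dev/vg-name/metadata
        for slot in find_cleared_slots(sys.argv[1]):
            print("slot %d: cleared metadata" % slot)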


Version-Release number of selected component (if applicable):

RHV 4.3.3


How reproducible:

I assume 100%, but I am going to try to reproduce this myself.


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

(Originally by Gordon Watson)

Comment 8 RHV bug bot 2019-05-27 09:18:01 UTC
Gordon, can you attach:
- lvm commands output from sosreport?
- copy of the first MiB of the metadata LV from a domain with this issue.

The attached patch should fix this issue, but it is not tested yet. I can test
it on Sunday, but we can save time if you can test it during the weekend.

(Originally by Nir Soffer)

Comment 9 RHV bug bot 2019-05-27 09:18:03 UTC
## Testing the fix

1. Create a DC/cluster with compatibility version 4.2
2. Create new iSCSI/FC V4 storage domain
3. Add some floating disks
4. Clear the metadata of one or more disks (see "How to clear volume metadata" below)
5. Upgrade the cluster and DC to version 4.3

Expected result:
- Storage domain upgraded to V5 successfully
- A warning about the cleared metadata is logged for each volume with this issue

Should be tested first with 4.3.3, reproducing this issue, and then with
a build including this fix, verifying that upgrading the storage domain to V5 succeeds.


## How to clear volume metadata

This state can no longer occur since 4.2.5, and reproducing it on an older version
is impractical, so we have to clear the metadata manually:

1. Find the volume metadata slot using:

  lvs -o vg_name,lv_name,tags

   The MD_<n> tag in the output gives the slot number (e.g. MD_42 is slot 42);
   a helper for finding and verifying the slot is sketched after step 2 below.

2. Write cleared metadata to the volume metadata area

    import os

    with open("/dev/vg-name/metadata", "rb+") as f:
        f.seek(42 * 512)  # slot 42 from the MD_42 tag, 512-byte metadata blocks
        f.write(b"NONE=" + (b"#" * 502) + b"\nEOF\n")  # 512 bytes of cleared metadata
        os.fsync(f.fileno())
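
As a convenience for steps 1-2, here is a hedged sketch that resolves the slot number
from the LV's MD_<n> tag and reads the slot back to confirm the cleared format; the
vg/lv names are placeholders and this is illustrative only, not part of vdsm:

    import subprocess

    def metadata_slot(vg, lv):
        """Return the slot number from the LV's MD_<n> tag (step 1)."""
        out = subprocess.check_output(
            ["lvs", "--noheadings", "-o", "lv_tags", "%s/%s" % (vg, lv)])
        for tag in out.decode().strip().split(","):
            if tag.startswith("MD_"):
                return int(tag[3:])
        raise LookupError("no MD_<n> tag on %s/%s" % (vg, lv))

    def verify_cleared(vg, slot):
        """Read the slot back after step 2 and check the cleared format."""
        with open("/dev/%s/metadata" % vg, "rb") as f:
            f.seek(slot * 512)
            block = f.read(512)
        assert block == b"NONE=" + (b"#" * 502) + b"\nEOF\n", block

    # Usage with placeholder names:
    #   slot = metadata_slot("vg-name", "lv-name")
    #   ... write the cleared metadata for this slot as in step 2 ...
    #   verify_cleared("vg-name", slot)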

(Originally by Nir Soffer)

Comment 10 RHV bug bot 2019-05-27 09:18:04 UTC
Trying to target 4.3.4, since this issue breaks upgrades to 4.3 and does not
have an easy workaround.

(Originally by Nir Soffer)

Comment 17 RHV bug bot 2019-05-27 09:18:16 UTC
Also tested installing the temp fix vdsm-4.30.16-2.gitfb7cdef.el7.x86_64 before upgrading the DC to 4.3, and it looks good.
SD/DC upgraded to V5/4.3 without issues and all hosts are up.

(Originally by Avihai Efrat)

Comment 19 Shir Fishbain 2019-06-02 14:01:57 UTC
Verified

The storage domain upgrades to V5 successfully.

vdsm-4.30.17-1.el7ev.x86_64
ovirt-engine-4.3.4.2-0.1.el7.noarch

Comment 24 errata-xmlrpc 2019-06-20 14:48:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1567

Comment 25 Daniel Gur 2019-08-28 13:13:52 UTC
sync2jira

Comment 26 Daniel Gur 2019-08-28 13:18:05 UTC
sync2jira