Bug 1553133
Description
oliver.albl
2018-03-08 11:54:11 UTC
Can we have the logs, please? Created attachment 1405862 [details]
engine and vdsm logs
Oliver, can you share the first 129 MiB of the bad PV? You can copy it like this:

dd if=/dev/mapper/xxxyyy of=pv.md bs=1M count=129 iflag=direct conv=fsync

And then compress the generated file (e.g. xz pv.md). The output may be small enough to attach to the bug; if not, please share it via Dropbox or a similar service. It will also help if you share the backups of this vg metadata from /etc/lvm/backup. The problem with the backup is that you will have to search for the latest backup on all the hosts running as SPM in your setup. The last host used as SPM with this vg will have the most up to date backup info. The metadata of the pv should not contain any sensitive info.

I already ran vgcfgrestore on the LUNs to be able to recover vms. Would the PV data still be relevant?

(In reply to oliver.albl from comment #4)
> I already ran vgcfgrestore on the LUNs to be able to recover vms. Would the
> PV data still be relevant?

The PV metadata contains all the changes made to the pv, including the date, the host, and the lvm version for every change. If we are lucky, we will have a complete history of this PV. If we are less lucky, some of the history will have been deleted by the restore operation. I hope we will have enough info to understand why the PV disappeared. You can see an example of the data here: https://bugzilla.redhat.com/attachment.cgi?id=1385115 and here you can see how to list all the entries: https://bugzilla.redhat.com/show_bug.cgi?id=1500395#c3 Bug 1500395 is a similar case; the conclusion there was that another host running a different lvm version accessed the LUN and removed the vg/pv.

Created attachment 1407244 [details]
dd dump of two luns
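On the backup side, the most recent copy of the VG metadata will be on whichever host last acted as SPM, so the same check has to be repeated on every candidate host. Below is a minimal Python sketch (not part of vdsm; it assumes the standard LVM locations /etc/lvm/backup/<vgname> and /etc/lvm/archive/<vgname>_*.vg, and takes the VG name as an argument) that prints the newest backup and archive entry found on a host:

# Sketch: print the newest LVM metadata backup/archive files for a VG.
# Run on every host that may have acted as SPM; standard LVM paths assumed.
import glob
import os
import sys

def newest(paths):
    # (mtime, path) of the most recently modified existing file, or None.
    found = [(os.path.getmtime(p), p) for p in paths if os.path.isfile(p)]
    return max(found) if found else None

vg_name = sys.argv[1]

backup = newest(["/etc/lvm/backup/%s" % vg_name])
archive = newest(glob.glob("/etc/lvm/archive/%s_*.vg" % vg_name))

for label, entry in (("backup", backup), ("archive", archive)):
    if entry:
        print("%s: %s (mtime %d)" % (label, entry[1], entry[0]))
    else:
        print("%s: not found" % label)

The host whose files have the newest mtime is the one whose backup is worth attaching.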
Thanks for the dump Oliver! In LUN204 we see this history only from Tue Mar 6 23:19:46 until Wed Mar 7 23:09:00, is this the date when you restored the metadata? $ grep -a '# Generated by LVM2 version' LUN204 | head # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Mar 6 23:19:46 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Mar 6 23:19:47 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Mar 6 23:19:48 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Mar 6 23:19:50 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Mar 6 23:19:52 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Mar 6 23:19:52 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Mar 6 23:19:54 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Mar 6 23:19:57 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Mar 6 23:19:59 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Mar 6 23:20:02 2018 $ grep -a '# Generated by LVM2 version' LUN204 | tail # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Mar 7 23:07:56 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Mar 7 23:08:06 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Mar 7 23:08:43 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Mar 7 23:08:45 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Mar 7 23:08:46 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Mar 7 23:08:48 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Mar 7 23:08:57 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Mar 7 23:08:58 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Mar 7 23:08:59 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Mar 7 23:09:00 2018 We also have some strange metadata lines: # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Mar 7 00:52:45 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Feb 27 09:28:12 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Mar 7 00:52:48 2018 Maybe this is leftover from older metadata - do you remember something special about Tue Feb 27 09:28:12 2018? # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Mar 7 01:50:32 2018 # Generated by LVM2 version 2.02.171(237e56b4-9f3a-436a-84b6-0a2b09133b54 { # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Mar 7 01:50:33 2018 Looks like leftover form previous metadata version, not wiped when writing new metadata. # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Mar 7 03:16:37 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Mon M237e56b4-9f3a-436a-84b6-0a2b09133b54 { # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Mar 7 03:16:38 2018 Another leftover? Oliver, can provide timeline - when the domains started to fail? what did you change in your deployment and when in this time frame? Nir, I have the following details for LUN 204: March 8th 2018 around 0.55am CET - Lost access to storage domain on LUN 204 March 8th 2018 around 6.40am CET - Tried to activate the storage domain March 8th 2018 around 1.00pm CET - Ran vgcfgrestore with creation_time Mon Mar 5 12:54:56 2018. We run automated tasks against this system, creating/running/dropping vms and snapshots. I do not see anything special at Tue Feb 27 09:28:12 2018. 
Looking at LUN 205: 1. Lot of history from Feb 12, 14, 27, 28, $ grep -a '# Generated by LVM2 version' LUN205 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Mon Feb 12 13:36:57 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Mon Feb 12 13:36:58 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Mon Feb 12 13:36:58 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Mon Feb 12 13:36:58 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Mon Feb 12 13:36:59 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Mon Feb 12 13:36:59 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Mon Feb 12 13:36:59 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Mon Feb 12 13:36:59 2018 ... # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 14 12:23:38 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 14 12:23:38 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 14 12:23:39 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 14 12:23:40 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 14 12:23:40 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 14 12:23:40 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 14 12:23:41 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 14 12:23:41 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 14 12:23:42 2018 ... # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Feb 27 09:29:03 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Feb 27 09:29:28 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Feb 27 09:29:44 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Feb 27 09:29:44 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Feb 27 09:30:06 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Feb 27 09:30:09 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Feb 27 09:34:00 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Feb 27 09:34:15 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Feb 27 09:34:29 2018 ... # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 28 00:00:01 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 28 00:01:15 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 28 00:01:17 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 28 00:01:24 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 28 00:01:27 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 28 00:01:38 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 28 00:01:40 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 28 00:01:41 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 28 00:01:44 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 28 00:01:44 2018 2. Then history from Mar 8 - is this the point when you restored the vg? 
# Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Thu Mar 8 13:37:30 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Thu Mar 8 13:53:37 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Thu Mar 8 13:53:41 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Thu Mar 8 13:53:51 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Thu Mar 8 13:53:54 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Thu Mar 8 13:54:30 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Thu Mar 8 13:54:34 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Thu Mar 8 13:54:41 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Thu Mar 8 13:54:45 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Thu Mar 8 13:57:02 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Thu Mar 8 13:57:11 2018 ... # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Thu Mar 8 15:44:56 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Thu Mar 8 15:44:56 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Thu Mar 8 15:47:27 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Thu Mar 8 15:47:27 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Thu Mar 8 15:47:33 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Thu Mar 8 15:47:40 2018 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Fri Mar 9 11:54:42 2018 3. Looking at hosts creating the changes $ egrep -a "creation_host = \".+\"\s+# Linux" LUN205 | sort | uniq -c 71 creation_host = "cloud5.sq.fabasoft.com" # Linux cloud5.sq.fabasoft.com 3.10.0-693.11.6.el7.x86_64 #1 SMP Thu Jan 4 01:06:37 UTC 2018 x86_64 1771 creation_host = "fabavmhost002" # Linux fabavmhost002 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 62 creation_host = "fabavmhost003" # Linux fabavmhost003 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 1 creation_host = "fabavmhost008" # Linux fabavmhost008 3.10.0-693.11.6.el7.x86_64 #1 SMP Thu Jan 4 01:06:37 UTC 2018 x86_64 1 creation_host = "fabavmhost015" # Linux fabavmhost015 3.10.0-693.11.6.el7.x86_64 #1 SMP Thu Jan 4 01:06:37 UTC 2018 x86_64 1 creation_host = "fabavmhost019" # Linux fabavmhost019 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 1 creation_host = "fabavmhost021" # Linux fabavmhost021 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 1 creation_host = "fabavmhost030" # Linux fabavmhost030 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 1 creation_host = "sqdx11.sq.fabasoft.com" # Linux sqdx11.sq.fabasoft.com 3.10.0-693.11.6.el7.x86_64 #1 SMP Thu Jan 4 01:06:37 UTC 2018 x86_64 Are these hosts part of the oVirt DC? did you run SPM on all of these hosts? 
Lets look at the changes of the hosts that did 1 change: # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 14 13:46:43 2018 creation_host = "sqdx11.sq.fabasoft.com" # Linux sqdx11.sq.fabasoft.com 3.10.0-693.11.6.el7.x86_64 #1 SMP Thu Jan 4 01:06:37 UTC 2018 x86_64 creation_time = 1518612403 # Wed Feb 14 13:46:43 2018 seqno = 73 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 28 10:27:10 2018 creation_host = "fabavmhost021" # Linux fabavmhost021 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 creation_time = 1519810030 # Wed Feb 28 10:27:10 2018 seqno = 1483 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 28 10:38:34 2018 creation_host = "fabavmhost015" # Linux fabavmhost015 3.10.0-693.11.6.el7.x86_64 #1 SMP Thu Jan 4 01:06:37 UTC 2018 x86_64 creation_time = 1519810714 # Wed Feb 28 10:38:34 2018 seqno = 1765 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 14 14:45:29 2018 creation_host = "fabavmhost008" # Linux fabavmhost008 3.10.0-693.11.6.el7.x86_64 #1 SMP Thu Jan 4 01:06:37 UTC 2018 x86_64 creation_time = 1518615929 # Wed Feb 14 14:45:29 2018 seqno = 74 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 28 10:42:10 2018 creation_host = "fabavmhost019" # Linux fabavmhost019 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 creation_time = 1519810930 # Wed Feb 28 10:42:10 2018 seqno = 74 # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Thu Mar 8 13:37:30 2018 creation_host = "fabavmhost030" # Linux fabavmhost030 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 creation_time = 1520512650 # Thu Mar 8 13:37:30 2018 seqno = 75 Note: we have 2 entries with the same seqno! (seqno = 74) Looking at seqno numbers: $ grep -a 'seqno = ' LUN205 seqno = 1 seqno = 2 seqno = 3 seqno = 4 seqno = 5 seqno = 6 seqno = 7 seqno = 8 seqno = 9 seqno = 10 ... seqno = 1837 seqno = 1838 seqno = 1839 seqno = 1840 seqno = 1841 seqno = 1842 seqno = 1843 seqno = 1844 seqno = 1845 seqno = 1846 seqno = 1847 << note the switch here seqno = 74 seqno = 75 seqno = 76 seqno = 77 seqno = 78 seqno = 79 seqno = 80 seqno = 81 seqno = 82 seqno = 83 ... seqno = 124 seqno = 125 seqno = 126 seqno = 127 seqno = 128 seqno = 129 seqno = 130 seqno = 131 seqno = 132 seqno = 133 seqno = 134 seqno = 135 seqno = 136 After seqno = 1847 we started over. 
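This check can also be scripted so it is easy to repeat on other dumps: walk the raw metadata area, collect every seqno in on-disk order, and report the places where the sequence goes backwards (such as the 1847 -> 74 switch above) or repeats. A minimal sketch, assuming a dump file like LUN205 produced with the dd command given earlier in this bug; it only automates the grep workflow shown here:

# Sketch: flag seqno resets/repeats in a raw LVM metadata dump.
# Usage: python check_seqno.py LUN205
import re
import sys

with open(sys.argv[1], "rb") as f:
    data = f.read()

# seqno entries appear in on-disk order inside the metadata area.
seqnos = [int(m.group(1)) for m in re.finditer(br"seqno = (\d+)", data)]

prev = None
for i, seqno in enumerate(seqnos):
    if prev is not None and seqno <= prev:
        print("entry %d: seqno went from %d to %d" % (i, prev, seqno))
    prev = seqno

print("total seqno entries: %d" % len(seqnos))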
Lets look at this version: # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 28 10:42:10 2018 creation_host = "fabavmhost002" # Linux fabavmhost002 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 creation_time = 1519810930 # Wed Feb 28 10:42:10 2018 b83c159c-4ad6-4613-ba16-bab95ccd10c0 { id = "813oRe-64r8-mloU-k9G2-LFsS-dXSG-hpN4kf" seqno = 1847 format = "lvm2" status = ["RESIZEABLE", "READ", "WRITE"] flags = [] tags = ["MDT_LEASETIMESEC=60", "MDT_VERSION=4", "MDT_SDUUID=b83c159c-4ad6-4613-ba16-bab95ccd10c0", "MDT_LOGBLKSIZE=512", "MDT_DESCRIPTION=FABAVMHOST_LUN_205", "MDT_LEASERETRIES=3", "MDT_IOOPTIMEOUTSEC=10", "MDT_LOCKRENEWALINTERVALSEC=5", "MDT_ROLE=Regular", "MDT_PHYBLKSIZE=512", "MDT_PV0=pv:3600601603cc045005d83815aff3a307c&44&uuid:pyN2Sr-QOh0-8QSX-tShd-x3oI-dh9B-Ck59H3&44&pestart:0&44&pecount:65533&44&mapoffset:0", "MDT_CLASS=Data", "MDT_TYPE=FCP", "MDT_LOCKPOLICY=", "MDT_VGUUID=813oRe-64r8-mloU-k9G2-LFsS-dXSG-hpN4kf", "RHAT_storage_domain", "MDT__SHA_CKSUM=536309f4105827dc86058e897c68a613873223a5", "MDT_POOL_UUID=5849b030-626e-47cb-ad90-3ce782d831b3"] extent_size = 262144 max_lv = 0 max_pv = 0 metadata_copies = 0 physical_volumes { pv0 { id = "pyN2Sr-QOh0-8QSX-tShd-x3oI-dh9B-Ck59H3" device = "/dev/mapper/3600601603cc045005d83815aff3a307c" ... Next version is seqo = 74: # Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Wed Feb 28 10:42:10 2018 creation_host = "fabavmhost019" # Linux fabavmhost019 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 creation_time = 1519810930 # Wed Feb 28 10:42:10 2018 b83c159c-4ad6-4613-ba16-bab95ccd10c0 { id = "813oRe-64r8-mloU-k9G2-LFsS-dXSG-hpN4kf" seqno = 74 format = "lvm2" status = ["RESIZEABLE", "READ", "WRITE"] flags = [] tags = ["MDT_LEASETIMESEC=60", "MDT_VERSION=4", "MDT_SDUUID=b83c159c-4ad6-4613-ba16-bab95ccd10c0", "MDT_LOGBLKSIZE=512", "MDT_DESCRIPTION=FABAVMHOST_LUN_205", "MDT_LEASERETRIES=3", "MDT_IOOPTIMEOUTSEC=10", "MDT_LOCKRENEWALINTERVALSEC=5", "MDT_ROLE=Regular", "MDT_PHYBLKSIZE=512", "MDT_PV0=pv:3600601603cc045005d83815aff3a307c&44&uuid:pyN2Sr-QOh0-8QSX-tShd-x3oI-dh9B-Ck59H3&44&pestart:0&44&pecount:65533&44&mapoffset:0", "MDT_CLASS=Data", "MDT_TYPE=FCP", "MDT_LOCKPOLICY=", "MDT_VGUUID=813oRe-64r8-mloU-k9G2-LFsS-dXSG-hpN4kf", "RHAT_storage_domain", "MDT__SHA_CKSUM=536309f4105827dc86058e897c68a613873223a5", "MDT_POOL_UUID=5849b030-626e-47cb-ad90-3ce782d831b3"] extent_size = 262144 max_lv = 0 max_pv = 0 metadata_copies = 0 physical_volumes { pv0 { id = "pyN2Sr-QOh0-8QSX-tShd-x3oI-dh9B-Ck59H3" device = "/dev/mapper/3600601603cc045005d83815aff3a307c" ... Note that both metadata changes happened in the *same* second (1519810930 # Wed Feb 28 10:42:10 2018) from these hosts: - fabavmhost002 - fabavmhost019 Only one of them could be the SPM at this time. Can you extract messages from engine log about the SPM? (Storage Pool Manager runs on Host xxx.yyy (Address: xxx.yyy), Data Center v42.) Also we need from both hosts - /var/log/vdsm/vdsm.log - /var/log/messages - /var/log/sanlock.log Around that time - Wed Feb 28 10:42:10 2018. For LUN 205: February 02nd 2018 around 02.40pm CET - Imported LUN 205 that was detached from another oVirt installation. February 28th 2018 around 10.40am CET - Lost access to storage domain on LUN 205. In this case we also saw messages regaring the storage system. We did not see those messages when we lost access to storage domain on LUN 204. 
March 3rd 2018 around 05.00pm CET - Managed to switch the storage domain from inactive to maintenance mode March 8th 2018 around 01.30pm CET - Ran vgcfgrestore with creation_time Mon Feb 26 14:55:43 2018 All hosts named FABAVMHOST* are part of one oVirt DC. All hosts named *.sq.fabasoft.com are part of another oVirt DC (we detached the storage domain and imported it). Here are the SPM events: # select log_time,severity,message from audit_log where message like '%Storage Pool%Data Center Production%' order by 1; log_time | severity | message ----------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------- 2018-02-12 08:23:46.455+01 | 0 | Storage Pool Manager runs on Host fabavmhost013.fabagl.fabasoft.com (Address: fabavmhost013.fabagl.fabasoft.com), Data Center Production. 2018-02-14 14:44:56.353+01 | 0 | Storage Pool Manager runs on Host fabavmhost008.fabagl.fabasoft.com (Address: fabavmhost008.fabagl.fabasoft.com), Data Center Production. 2018-02-15 17:16:08.397+01 | 0 | Storage Pool Manager runs on Host fabavmhost008.fabagl.fabasoft.com (Address: fabavmhost008.fabagl.fabasoft.com), Data Center Production. 2018-02-15 17:45:22.47+01 | 0 | Storage Pool Manager runs on Host fabavmhost008.fabagl.fabasoft.com (Address: fabavmhost008.fabagl.fabasoft.com), Data Center Production. 2018-02-15 17:49:43.772+01 | 0 | Storage Pool Manager runs on Host fabavmhost008.fabagl.fabasoft.com (Address: fabavmhost008.fabagl.fabasoft.com), Data Center Production. 2018-02-16 11:54:34.9+01 | 0 | Storage Pool Manager runs on Host fabavmhost002.fabagl.fabasoft.com (Address: fabavmhost002.fabagl.fabasoft.com), Data Center Production. 2018-02-28 10:42:52.152+01 | 0 | Storage Pool Manager runs on Host fabavmhost002.fabagl.fabasoft.com (Address: fabavmhost002.fabagl.fabasoft.com), Data Center Production. 2018-02-28 10:43:02.328+01 | 0 | Storage Pool Manager runs on Host fabavmhost002.fabagl.fabasoft.com (Address: fabavmhost002.fabagl.fabasoft.com), Data Center Production. 2018-02-28 10:43:09.621+01 | 0 | Storage Pool Manager runs on Host fabavmhost002.fabagl.fabasoft.com (Address: fabavmhost002.fabagl.fabasoft.com), Data Center Production. 2018-02-28 10:43:48.412+01 | 0 | Storage Pool Manager runs on Host fabavmhost020.fabagl.fabasoft.com (Address: fabavmhost020.fabagl.fabasoft.com), Data Center Production. 2018-02-28 10:44:15.516+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-02-28 10:44:34.17+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-02-28 10:44:42.878+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-02-28 11:23:58.188+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-02-28 11:25:48.534+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-01 17:31:40.922+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 
2018-03-01 21:40:47.256+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-02 15:02:12.661+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-02 16:54:28.204+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-06 15:21:46.471+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-06 16:40:48.419+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-06 16:51:38.168+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:54:07.208+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:54:19.464+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:54:25.753+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:54:31.151+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:54:47.339+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:54:56.896+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:55:06.913+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:55:15.058+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:55:20.855+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:55:28.16+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:55:34.43+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:55:49.68+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:55:55.388+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:56:01.199+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 
2018-03-08 00:56:12.104+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:56:18.472+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:56:30.562+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:56:40.967+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:56:53.223+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:57:00.707+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:57:17.991+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:57:29.793+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 00:58:09.52+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-08 06:40:13.027+01 | 0 | Storage Pool Manager runs on Host fabavmhost003.fabagl.fabasoft.com (Address: fabavmhost003.fabagl.fabasoft.com), Data Center Production. 2018-03-10 11:06:02.589+01 | 0 | Storage Pool Manager runs on Host fabavmhost002.fabagl.fabasoft.com (Address: fabavmhost002.fabagl.fabasoft.com), Data Center Production. I attached the logs from fabavmhost002 and fabavmhost019 (vdsm logs from fabavmhost019 are not available any more). Created attachment 1407476 [details]
Logfiles fabavmhost002 and fabavmhost019
Created attachment 1423463 [details] qe-logs (In reply to Yaniv Kaul from comment #12) > Yaniv, I'm moving this to 4.2.4 - we are not going to be able to have it for > 4.2.3. > Elad - was this reproduced by QE? Storage or Scale QE? Might be. Examined our latest tier3 automation execution. I'm observing the following, not sure it's related. It might be related to an issue with a LUN that has been unmasked from the hosts while it was part of a storage connection for an iSCSI domain in the setup. Nir, please take a look. engine.log: 2018-04-14 19:24:09,971+03 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-29518) [7b4b5cfb] EVENT_ID: SYSTEM_DEACTIVATED_STORAGE_DOMAIN(970), Storage Domain sd_TestCase10147_1418564266 (Data Center golden_env_mixed) was deactivated by system because it's not visible by any of the hosts. From around the same time, the following appears in vdsm: 2018-04-14 19:22:44,628+0300 ERROR (hsm/init) [storage.Multipath] Could not resize device 3600a09803830447a4f244c4657595264 (multipath:98) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/storage/multipath.py", line 96, in resize_devices _resize_if_needed(guid) File "/usr/lib/python2.7/site-packages/vdsm/storage/multipath.py", line 120, in _resize_if_needed supervdsm.getProxy().resizeMap(name) File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__ return callMethod() File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda> **kwargs) File "<string>", line 2, in resizeMap File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod raise convert_to_error(kind, result) Error: Resizing map 'dm-147' failed: out='fail\n' err='' 2018-04-14 19:22:44,629+0300 DEBUG (hsm/init) [storage.SamplingMethod] Returning last result (misc:395) (In reply to Elad from comment #13) > Created attachment 1423463 [details] > qe-logs > > (In reply to Yaniv Kaul from comment #12) > > Yaniv, I'm moving this to 4.2.4 - we are not going to be able to have it for > > 4.2.3. > > Elad - was this reproduced by QE? Storage or Scale QE? > > Might be. Examined our latest tier3 automation execution. I'm observing the > following, not sure it's related. It might be related to an issue with a LUN > that has been unmasked from the hosts while it was part of a storage > connection for an iSCSI domain in the setup. Nir, please take a look. > > engine.log: > > 2018-04-14 19:24:09,971+03 WARN > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > (EE-ManagedThreadFactory-engine-Thread-29518) [7b4b5cfb] EVENT_ID: > SYSTEM_DEACTIVATED_STORAGE_DOMAIN(970), > Storage Domain sd_TestCase10147_1418564266 (Data Center golden_env_mixed) > was deactivated by system because it's not visible by any of the hosts. I don't think this is related, if you changed a lun configuration and it became invisible on all hosts, it does not mean that it was corrupted. > From around the same time, the following appears in vdsm: > > Error: Resizing map 'dm-147' failed: out='fail\n' err='' > 2018-04-14 19:22:44,629+0300 DEBUG (hsm/init) [storage.SamplingMethod] > Returning last result (misc:395) This is like to happen when a lun is not available. Elad, I think we need to write a specific script for this test, creating many thin clones from template. Do we have such test? 
We reproduced it if a storage domain which is configured properly becomes deactivated after some time, and we cannot activate it again. Did you try to activate the storage domain after fixing the lun zoning? (In reply to Nir Soffer from comment #14) > (In reply to Elad from comment #13) > > Created attachment 1423463 [details] > > qe-logs > > > > (In reply to Yaniv Kaul from comment #12) > > > Yaniv, I'm moving this to 4.2.4 - we are not going to be able to have it for > > > 4.2.3. > > > Elad - was this reproduced by QE? Storage or Scale QE? > > > > Might be. Examined our latest tier3 automation execution. I'm observing the > > following, not sure it's related. It might be related to an issue with a LUN > > that has been unmasked from the hosts while it was part of a storage > > connection for an iSCSI domain in the setup. Nir, please take a look. > > > > engine.log: > > > > 2018-04-14 19:24:09,971+03 WARN > > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > > (EE-ManagedThreadFactory-engine-Thread-29518) [7b4b5cfb] EVENT_ID: > > SYSTEM_DEACTIVATED_STORAGE_DOMAIN(970), > > Storage Domain sd_TestCase10147_1418564266 (Data Center golden_env_mixed) > > was deactivated by system because it's not visible by any of the hosts. > > I don't think this is related, if you changed a lun configuration and it > became > invisible on all hosts, it does not mean that it was corrupted. > > > From around the same time, the following appears in vdsm: > > > > Error: Resizing map 'dm-147' failed: out='fail\n' err='' > > 2018-04-14 19:22:44,629+0300 DEBUG (hsm/init) [storage.SamplingMethod] > > Returning last result (misc:395) > > This is like to happen when a lun is not available. > > Elad, I think we need to write a specific script for this test, creating many > thin clones from template. Do we have such test? We can create one Natalie, please create one, python sdk is an option. > We reproduced it if a storage domain which is configured properly becomes > deactivated after some time, and we cannot activate it again. > > Did you try to activate the storage domain after fixing the lun zoning? Yes (In reply to Elad from comment #15) > (In reply to Nir Soffer from comment #14) > > (In reply to Elad from comment #13) > > > Created attachment 1423463 [details] > > > qe-logs > > > > > > (In reply to Yaniv Kaul from comment #12) > > > > Yaniv, I'm moving this to 4.2.4 - we are not going to be able to have it for > > > > 4.2.3. > > > > Elad - was this reproduced by QE? Storage or Scale QE? > > > > > > Might be. Examined our latest tier3 automation execution. I'm observing the > > > following, not sure it's related. It might be related to an issue with a LUN > > > that has been unmasked from the hosts while it was part of a storage > > > connection for an iSCSI domain in the setup. Nir, please take a look. > > > > > > engine.log: > > > > > > 2018-04-14 19:24:09,971+03 WARN > > > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > > > (EE-ManagedThreadFactory-engine-Thread-29518) [7b4b5cfb] EVENT_ID: > > > SYSTEM_DEACTIVATED_STORAGE_DOMAIN(970), > > > Storage Domain sd_TestCase10147_1418564266 (Data Center golden_env_mixed) > > > was deactivated by system because it's not visible by any of the hosts. > > > > I don't think this is related, if you changed a lun configuration and it > > became > > invisible on all hosts, it does not mean that it was corrupted. 
> > > > > From around the same time, the following appears in vdsm: > > > > > > Error: Resizing map 'dm-147' failed: out='fail\n' err='' > > > 2018-04-14 19:22:44,629+0300 DEBUG (hsm/init) [storage.SamplingMethod] > > > Returning last result (misc:395) > > > > This is like to happen when a lun is not available. > > > > Elad, I think we need to write a specific script for this test, creating many > > thin clones from template. Do we have such test? Nir, isn't it the same as creating a VM pool? > We can create one > Natalie, please create one, python sdk is an option. > > > We reproduced it if a storage domain which is configured properly becomes > > deactivated after some time, and we cannot activate it again. > > > > Did you try to activate the storage domain after fixing the lun zoning? > Yes (In reply to Natalie Gavrielov from comment #16) > > > Elad, I think we need to write a specific script for this test, creating many > > > thin clones from template. Do we have such test? > > Nir, isn't it the same as creating a VM pool? Maybe, we need to ask the reporter what is the flow that reproduces this issue. Oliver, can you explain what is "creating many thin clones from template"? Are you using the ovirt SDK? can you show an example code you use to create these clones, so we can reproduce this in the lab, and include this flow in our automated tests? (In reply to Nir Soffer from comment #17) > (In reply to Natalie Gavrielov from comment #16) > > > > Elad, I think we need to write a specific script for this test, creating many > > > > thin clones from template. Do we have such test? > > > > Nir, isn't it the same as creating a VM pool? > > Maybe, we need to ask the reporter what is the flow that reproduces this > issue. > > Oliver, can you explain what is "creating many thin clones from template"? > > Are you using the ovirt SDK? can you show an example code you use to create > these clones, so we can reproduce this in the lab, and include this flow in > our > automated tests? Hi, we are using the ovirt vagrant provider https://www.ovirt.org/develop/projects/proposals/vagrant-provider which uses the Ruby SDK. But I think it should be the same just using the SDK with the following flow: - Create thin clones from Template - Use the clones (-> extends the disk) - Remove the clones (-> discard) Please see Christian's comment 18 (In reply to christian.grundmann from comment #18) > (In reply to Nir Soffer from comment #17) > > (In reply to Natalie Gavrielov from comment #16) > > > > > Elad, I think we need to write a specific script for this test, creating many > > > > > thin clones from template. Do we have such test? > > > > > > Nir, isn't it the same as creating a VM pool? > > > > Maybe, we need to ask the reporter what is the flow that reproduces this > > issue. > > > > Oliver, can you explain what is "creating many thin clones from template"? > > > > Are you using the ovirt SDK? can you show an example code you use to create > > these clones, so we can reproduce this in the lab, and include this flow in > > our > > automated tests? > > > Hi, > we are using the ovirt vagrant provider > https://www.ovirt.org/develop/projects/proposals/vagrant-provider > which uses the Ruby SDK. > But I think it should be the same just using the SDK with the following flow: > - Create thin clones from Template > - Use the clones (-> extends the disk) > - Remove the clones (-> discard) Thanks for the above a few more questions: Whats the rate that these actions were performed - was there any concurrency involved? 
What is the distribution of the templates/VMs to SDs? Were all disks extended at same time? if not what level of concurrency was involved in the disks extension? Were all VMs removed at same time? if not what level of concurrency was invoked in the VM removal? Yes there is concurrency in each step. We create up to 50 VMs with a few seconds delay The VMs are created on the same SD as the templates. The disk will get extended at the same time because the use cases in every VM is the same. The removal has the same concurrency as the creation. I recreated the described above scenario with a help of Jmeter on my Environment. I monitored engine log and vdsm logs per each host (3 in my env). No ERRORS were found in neither of logs. No failures were seen from the UI. What my Jmeter script did (all tasks were executed concurrently as 50 users) : Create 50 VMs from 1 HDD Thin template with RampUp 120 Sec concurrently Start those VMs Validate they are up Power them Down Validate they are down Extend Disk per each VM from 10G to 11G Remove Those VMs logs collected by ovirt-logs-collector can be found in private message below. (In reply to Ilan Zuckerman from comment #23) > I recreated the described above scenario with a help of Jmeter on my > Environment. > > I monitored engine log and vdsm logs per each host (3 in my env). > > No ERRORS were found in neither of logs. > No failures were seen from the UI. > > What my Jmeter script did (all tasks were executed concurrently as 50 users) > : > > Create 50 VMs from 1 HDD Thin template with RampUp 120 Sec concurrently > Start those VMs > Validate they are up > Power them Down > Validate they are down > Extend Disk per each VM from 10G to 11G > Remove Those VMs I'm not sure what you did. Can you attach a python script using oVirt SDK that we can run on any setup, and understand what call were made and how they were issued? Also, what do you mean by concurrency of 50 users? Please show the command line used to get the concurrent calls. Another important detail - on the reporter system wipe-after-delete and discard-after-delete were enabled. To test this we need storage supporting discard, and we need to choose these options when creating the disks. Finally, what was the storage used in this setup? > (In reply to Nir Soffer from comment #17) > we are using the ovirt vagrant provider > https://www.ovirt.org/develop/projects/proposals/vagrant-provider > which uses the Ruby SDK. > But I think it should be the same just using the SDK with the following flow: > - Create thin clones from Template > - Use the clones (-> extends the disk) What do you mean by use the clones? What is the workload running on the vm? How much data is written to the vm disk? which rate? How much time the vim is running before it is shut down and removed? > - Remove the clones (-> discard) Do you use both discard-after-delete and wipe-after-delete? Can you share the script, so we can make sure what we do in the lab matches what you did on your system? (In reply to Ilan Zuckerman from comment #23) > Create 50 VMs from 1 HDD Thin template with RampUp 120 Sec concurrently > Start those VMs > Validate they are up > Power them Down > Validate they are down > Extend Disk per each VM from 10G to 11G This does not do any LVM extension. It is not clear what Christian means by "Use the clones (-> extend the disk)" (see comment 27). I assume that what we need to do is to run code on the vm that will write data to the disk, causing the disk to be extended. 
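As a starting point for the SDK-based script requested above, the per-VM flow could look roughly like the following sketch using the Python SDK (ovirtsdk4). The engine URL, credentials, cluster and template names are placeholders, and the in-guest write workload that actually triggers the LV extensions is only marked as a comment; this is an illustration of the flow, not the reporter's script.

# Sketch of the per-VM flow: create a thin clone from a template,
# start it, (write data inside it), shut it down and remove it.
import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",  # placeholder
    username="admin@internal",
    password="secret",
    ca_file="ca.pem",
)
vms_service = connection.system_service().vms_service()

def wait_for_status(vm_service, status, timeout=600):
    deadline = time.time() + timeout
    while vm_service.get().status != status:
        if time.time() > deadline:
            raise RuntimeError("timed out waiting for %s" % status)
        time.sleep(5)

def basic_flow(name):
    # Create a thin clone from the template (qcow2 layer over the base).
    vm = vms_service.add(types.Vm(
        name=name,
        cluster=types.Cluster(name="Default"),        # placeholder
        template=types.Template(name="extend-test"),  # placeholder
    ))
    vm_service = vms_service.vm_service(vm.id)
    wait_for_status(vm_service, types.VmStatus.DOWN)
    # Start the VM and wait until it is up.
    vm_service.start()
    wait_for_status(vm_service, types.VmStatus.UP)
    # Write enough data inside the guest (e.g. via ssh + dd) to force
    # repeated automatic extension of the thin disk -- not implemented here.
    # Shut down, then delete; with discard enabled the LV is discarded.
    vm_service.stop()
    wait_for_status(vm_service, types.VmStatus.DOWN)
    vm_service.remove()

basic_flow("extend-test-vm-001")
connection.close()

Running basic_flow concurrently from many threads, against the same template and storage domain, would approximate the reported workload.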
Lets wait for more info from Christian on this. > Remove Those VMs Since all extensions happen on the SPM host, maybe we can simulate this by simply running concurrent lvextend commands on a single host, and in the same time concurrent lvchange --refersh on other hosts, and we don't need to script engine. If we find an issue in LVM, we will need such script that reproduce the issue from the shell without dependency on oVirt, so LVM developer can reproduce this easily. (In reply to Nir Soffer from comment #27) > > (In reply to Nir Soffer from comment #17) > > we are using the ovirt vagrant provider > > https://www.ovirt.org/develop/projects/proposals/vagrant-provider > > which uses the Ruby SDK. > > But I think it should be the same just using the SDK with the following flow: > > - Create thin clones from Template > > - Use the clones (-> extends the disk) > > What do you mean by use the clones? Running Software and Tests inside the VMs > What is the workload running on the vm? High CPU Usage (8 Cores) High Memory Usage(30GB) Mixed Windows and Linux Clients and Servers (CentOS, Ubuntu, Server 2012R2, Server 2016, Windows 10) > How much data is written to the vm disk? which rate? Around 10-15 GB in the first 5-10 minutes after VM is created/started > How much time the vm is running before it is shut down and removed? Around 30-60 minutes > Do you use both discard-after-delete and wipe-after-delete? We only use discard-after-delete > Can you share the script, so we can make sure what we do in the lab matches > what > you did on your system? No, not really our Scripts/vagrant Boxes are very specifiy to our Uses Cases. testing jira sync, please ignore Based on comment 29, I think we need to reproduce this again like this. The test should: - create lot of vms in the same time (stress the SPM) - perform lot of automatic disk extension in the same time (stress the SPM) - keep the host cpu busy (this may effect lvm or SPM) - perform many disk discards in the same time (I suspect this is related to the issue) ## Setup - Create storage domain on a FC storage supporting discard, and enable discard on this storage domain. - Create vm with 30g raw disk - Make sure the vm has at least 15G free in the root file system. If not, resize the root file system as needed. - Create template from this vm. ## Basic flow for every vm 1. Create vm based on the template. Make sure the vm disk is marked for discard-after-delete 2. Start the vm 3. Wait until the vm is up 4. Perform io in the vm, to cause extension of the disk Connect to the vm using ssh, and run: dd if=/dev/urandom of=/var/tmp/data bs=1M count=15360 oflag=direct conv=fsync We can run additional stuff in the vm during the test to simulate what the reporter is doing, but I don't think this is related. 5. Wait until the dd command is finished. The vm disk will be extended by 1G 15 times while dd is running. It is possible that the vm will pause during the test, but the system should resume it. 6. Shut down the vm 7. delete the vm Make sure the vm disk is discarded when the vm is deleted. You can see messages about discarding the volume in vdsm log, and using "ps" command you will see that vdsm is running "blkdiscard /dev/vgname/lvname" for this volume. I would first test that running this flow wit one vm in a loop for a hour or so works as expected. ## Complete flow 1. Start basic flow in 50 threads, with 10 seconds delay between threads. 2. When a vm is deleted, repeat the basic flow for every thread 3. Run this for a day or so. 4. 
collect logs from all hosts and engine for inspection ## Version to test We know that discarding disks before 4.2.6 could cause delays in other processes, this was reported by the same reporter of this bug. - Reproduce with the version reported in this bug (oVirt 4.2.1) or 4.2.3 if you want to test with RHV. - Test with 4.2.6 - we improved the way volume are discarded in this version. It would be best if the test will written using oVirt python SDK, so we can inspect the test scripts before we run the test. Elad, this looks like a work for scale team, or at least using one of them servers. Moving to scale as per comment #31 Daniel, this is test only bug. If we can reproduce this we will make further investigation. If we cannot reproduce this, we will close it. See comment 31, explaning how it should be tested. We are working on reproducing the bug. Please advise the following questions : 1. At the reproduction it say's "Create vm with 30g raw disk", as for http://www.ovirt.org/documentation/admin-guide/chap-Virtual_Machine_Disks/ we can onlly use QCOW2 with Sparse for the extends, please advise if we can use QCOW2 with Sparse for the bug reproduction. 2. At the reproduction it say's to extend the root file system - is it O.k to extend a disk other then root file system or it's a must to extend the disk the root FS resides on ? 3. FC storage is a must or can we use iscsi SD ? 4. Please advise it will be test only on 4.2.6 since we do not have 4.2.3 currently. (In reply to guy chen from comment #35) > We are working on reproducing the bug. > Please advise the following questions : > > 1. At the reproduction it say's "Create vm with 30g raw disk", as for > http://www.ovirt.org/documentation/admin-guide/chap-Virtual_Machine_Disks/ > we can onlly use QCOW2 with Sparse for the extends, please advise if we can > use QCOW2 with Sparse for the bug reproduction. Read comment 31 again. "Create vm with 30g raw disk" refer to the template. You should create a vm based on this template. It will have qcow2 disk with the template as the base image. > 2. At the reproduction it say's to extend the root file system - is it O.k > to extend a disk other then root file system or it's a must to extend the > disk the root FS resides on ? I don't know what is "At the reproduction". Please read again comment 31. If the instructions are not clear, quote here the text which is not clear. > 3. FC storage is a must or can we use iscsi SD ? Better to use FC so we can create more load (disks will be extended and deleted faster, so we get more iterations per hour, or need less hours to reproduce. > 4. Please advise it will be test only on 4.2.6 since we do not have 4.2.3 > currently. I think we should test with older version, since this increases the chance to reproduce, and it can show if changes in 4.2.6 are related, as I explained in comment 31. Thanks Nir. Regarding number 2, this is the relevant section : " - Create vm with 30g raw disk - Make sure the vm has at least 15G free in the root file system. If not, resize the root file system as needed. " Can i have have instead of 1 disk with 30g a VM with 2 disks of 30g, one is bootable and on the other i will do the extend on ? (In reply to guy chen from comment #37) > Can i have have instead of 1 disk with 30g a VM with 2 disks of 30g, one is > bootable and on the other i will do the extend on ? You can, but I don't see why it better. 
Preparing the base image is something you can do once when you create the template, then all vms will use the same image with at least 15g free in the root file system. Nir please take a look and suggest (In reply to guy chen from comment #39) > Run the scenario as discussed. With original vdsm version, or latest 4.2? > During the removal of the VMS started to get IO error's, after that got read > only FS that prevents me doing any actions on the SD, logs attached. > > -=>>grep ERROR /var/log/vdsm/vdsm.log ... > 2018-09-20 06:49:26,346+0000 ERROR (jsonrpc/0) [storage.Dispatcher] FINISH > getStoragePoolInfo error=(-202, 'Unable to read resource owners', 'IO > timeout') (dispatcher:86) Looks like your storage was not accessible or very slow here... > 2018-09-20 06:51:02,174+0000 ERROR (check/loop) [storage.Monitor] Error > checking path /dev/f1168ddf-917c-4ef4-b553-e034ee2983ec/metadata > (monitor:498) Storage monitoring failed to access this domain metadata lv... ... > 2018-09-20 07:21:12,023+0000 ERROR (Thread-22) [storage.TaskManager.Task] > (Task='846d39ec-cd8e-41cb-914f-e8a789ca2ef7') Unexpected error (task:875) > 2018-09-20 07:21:12,027+0000 ERROR (Thread-22) [storage.Dispatcher] FINISH > downloadImageFromStream error=Can't persist task: > (u'846d39ec-cd8e-41cb-914f-e8a789ca2ef7: cannot access/create taskdir > /rhev/data-center/98dea9a8-0a7e-11e8-b98d-b8ca3a638904/mastersd/master/tasks/ > 846d39ec-cd8e-41cb-914f-e8a789ca2ef7: [Errno 30] Read-only file system',) > (dispatcher:82) Looks like storage was overloaded by the test, and you did not configure multiapth so the master file system became read only. Did you run this setup with FC and multiple paths to storage? iSCSI? do you have iSCSI multipathing configured? re-applying need info for comment 44 https://bugzilla.redhat.com/show_bug.cgi?id=1553133#c44 Looking at vdsm logs, it looks like we reproduced the original issue in the last test run. The test was started at 07:53:24 (host time), with SPM on host b01-h17-r620. 1. First extend (b01-h17): -=>>xzgrep 'extending volume' /var/log/vdsm/vdsm.log.3 | head 2018-10-07 07:55:10,948+0000 INFO (mailbox-spm/0) [storage.SPM.Messages.Extend] processRequest: extending volume b074676d-dcba-4c0a-92d3-3f32ab8edb15 in domain f1168ddf-917c-4ef4-b553-e034ee2983ec (pool 98dea9a8-0a7e-11e8-b98d-b8ca3a638904) to size 2048 (mailbox:161) 2. 
The first error in the SPM log is: 2018-10-07 08:02:13,304+0000 ERROR (mailbox-spm/4) [storage.SPM.Messages.Extend] processRequest: Exception caught while trying to extend volume: ab2c4b0c-2709-4726-83bd-5671ff5697f1 in domain: f1168ddf-917c-4ef4-b553-e034ee2983ec (mailbox:172) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/storage/mailbox.py", line 166, in processRequest pool.extendVolume(volume['domainID'], volume['volumeID'], size) File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper return method(self, *args, **kwargs) File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1420, in extendVolume sdCache.produce(sdUUID).extendVolume(volumeUUID, size, isShuttingDown) File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1564, in extendVolume lvm.extendLV(self.sdUUID, volumeUUID, size) # , isShuttingDown) File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1224, in extendLV extent_size = int(vg.extent_size) File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 82, in __getattr__ raise AttributeError("Failed reload: %s" % self.name) AttributeError: Failed reload: f1168ddf-917c-4ef4-b553-e034ee2983ec This error means that this VG was invalidated and we failed to reload VG metadata. Unfortunately the original error when we failed to reload the metadata was never logged, so we don't have any info on the first error. 3. Trying to access this VG using vgs fail with: # vgs f1168ddf-917c-4ef4-b553-e034ee2983ec -=>>vgs f1168ddf-917c-4ef4-b553-e034ee2983ec /dev/mapper/3600a098038304437415d4b6a59676d51: Checksum error at offset 22089216 Couldn't read volume group metadata from /dev/mapper/3600a098038304437415d4b6a59676d51. Metadata location on /dev/mapper/3600a098038304437415d4b6a59676d51 at 22089216 has invalid summary for VG. Failed to read metadata summary from /dev/mapper/3600a098038304437415d4b6a59676d51 Failed to scan VG from /dev/mapper/3600a098038304437415d4b6a59676d51 Volume group "f1168ddf-917c-4ef4-b553-e034ee2983ec" not found Cannot process volume group f1168ddf-917c-4ef4-b553-e034ee2983ec 4. 
Using lsblk we can still see the contents of the corrupted VG: sdd 8:48 0 1T 0 disk └─3600a098038304437415d4b6a59676d51 253:4 0 1T 0 mpath ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-metadata 253:17 0 512M 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-ids 253:18 0 128M 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-leases 253:19 0 2G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-outbox 253:20 0 128M 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-xleases 253:21 0 1G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-inbox 253:22 0 128M 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-master 253:23 0 1G 0 lvm /rhev/data-center/mnt/blockSD/f1168ddf-917c-4ef4-b553-e034ee2983ec/master ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-5f94fc20--93ce--4694--9e75--6e367eecd7a3 253:38 0 50G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-e1d81892--57e3--4c9c--bef2--bfae08fa4c67 253:39 0 10G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-cb2c6de9--82b2--4467--9d7b--696d5523eff6 253:40 0 1G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-dfc7d9b7--b916--4565--8aa0--98f73f814aee 253:41 0 9G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-aae97ceb--af8d--4d8a--ace5--496cb93f09ae 253:42 0 1G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-b9582616--78cc--440e--b1f0--2f2e195faeb1 253:43 0 8G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-025d51bd--4d25--4393--850f--d1da09cf1c49 253:44 0 1G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-5afb917d--0e51--4cde--9608--0563be327d95 253:45 0 6G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-e4e4e606--c6b2--4503--b4f7--6eae0a1d0e42 253:46 0 4G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-b074676d--dcba--4c0a--92d3--3f32ab8edb15 253:47 0 16G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-68243011--06df--4e86--8ad4--714b89170d5b 253:48 0 1G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-d1c2a11e--75c8--4851--903e--ce251552793a 253:49 0 1G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-452169f5--0967--473a--8a1e--a4b3922ab915 253:50 0 3G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-4a5db397--2d03--4fa8--a1aa--842ac2dd2c8a 253:51 0 1G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-4530f234--1204--42a8--905e--e4ba284bf045 253:52 0 1G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-85db5760--6e55--40b4--9319--d7cbc11ab493 253:53 0 1G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-0a584504--e2a6--4df2--a03c--81bfbef59c23 253:54 0 1G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-ca7e84cb--7ab0--4102--b38e--a6b63fe65943 253:55 0 1G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-96f491a1--922c--4297--a075--39c5b757ecdd 253:56 0 16G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-80faa090--d967--4a4c--a017--a354b2f6416c 253:57 0 1G 0 lvm ├─f1168ddf--917c--4ef4--b553--e034ee2983ec-77c695cd--7126--44c2--b5d0--dd7989d6526d 253:58 0 1G 0 lvm └─f1168ddf--917c--4ef4--b553--e034ee2983ec-915b0e60--ac97--4c4a--aad6--a619663857ee 253:60 0 16G 0 lvm The attached patch (https://gerrit.ovirt.org/c/94751/) improve logging, showing detailed errors when reloading vgs and lvs fail. Created attachment 1491367 [details]
vdsm log from SPM during test
Created attachment 1491368 [details]
dump of lvm metadata area 1
This was created using:
dd if=/dev/mapper/3600a098038304437415d4b6a59676d51 \
of=3600a098038304437415d4b6a59676d51.metadata.1 \
bs=1M count=129 iflag=direct
Created attachment 1491369 [details]
dump of lvm metadata area 2
This was created using:
dd if=/dev/mapper/3600a098038304437415d4b6a59676d51 \
of=3600a098038304437415d4b6a59676d51.metadata.2 \
bs=1M skip=$((8191*128)) count=128 iflag=direct
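The skip value is not magic: the LUN is 1 TiB (see the lsblk output earlier), i.e. 8192 x 128 MiB, so skipping 8191 x 128 MiB with bs=1M reads exactly the last 128 MiB of the device, which is where this second metadata area was dumped from. A small helper sketch (an illustration only, equivalent to blockdev --getsize64 plus some arithmetic) for computing the same skip for a device of any size:

# Sketch: compute the bs=1M skip that selects the last 128 MiB of a device,
# matching the "metadata area 2" dump above. Device path is an argument.
import os
import sys

def last_128m_skip(device):
    fd = os.open(device, os.O_RDONLY)
    try:
        size_bytes = os.lseek(fd, 0, os.SEEK_END)  # block device size in bytes
    finally:
        os.close(fd)
    return size_bytes // (1024 * 1024) - 128  # use with bs=1M count=128

# For a 1 TiB LUN this prints 1048448, which equals 8191 * 128.
print(last_128m_skip(sys.argv[1]))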
Created attachment 1491370 [details]
lvm backup for the corrupted vg
The patch only improves logging, we still investigating the root cause. I want to do another run with the attached patch on all hosts, and a clean LUN. This may help to detect the reason for this corruption. Please do this: 1. Install vdsm on all hosts from this build: https://jenkins.ovirt.org/job/vdsm_master_build-artifacts-on-demand-el7-x86_64/1018/artifact/exported-artifacts/ 2. Add a new LUN for this test 3. Zero the entire LUN - we don't want old junk data to be mixed with lvm metadata: blkdiscard -z -p 32m /dev/mapper/xxxyyy 4. Create new storage domain from this LUN - best in a new DC. We don't want to touch the old corrupted VG. 5. Change vdsm log level to DEBUG - edit /etc/vdsm/logger.conf, and set: [logger_root] level=DEBUG handlers=syslog,logthread propagate=0 [logger_vds] level=DEBUG handlers=syslog,logthread qualname=vds propagate=0 [logger_storage] level=DEBUG handlers=logthread qualname=storage propagate=0 6. Run the test again. If we are lucky we will reproduce this again and we will have better logs and cleaner metadata to investigate. Removing Performance and TestOnly since this not about performance, and looks like we reproduce the issue, so this is not test only bug. Created attachment 1511697 [details] logs-kaminario Reproduced with Kaminario storage. Storage migration failed with Failed reload domain: 2018-12-04 15:20:20,033+0200 ERROR (monitor/9745f66) [storage.Monitor] Error checking domain 9745f66e-8d8e-4257-a652-4346601ecd4f (monitor:424) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 405, in _checkDomainStatus self.domain.selftest() File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 48, in __getattr__ return getattr(self.getRealDomain(), attrName) File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain return self._cache._realProduce(self._sdUUID) File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce domain = self._findDomain(sdUUID) File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain return findMethod(sdUUID) File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1637, in findDomain return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID)) File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 958, in __init__ manifest = self.manifestClass(sdUUID) File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 405, in __init__ sd.StorageDomainManifest.__init__(self, sdUUID, domaindir, metadata) File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 327, in __init__ self._domainLock = self._makeDomainLock() File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 530, in _makeDomainLock domVersion = self.getVersion() File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 401, in getVersion return self.getMetaParam(DMDK_VERSION) File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 398, in getMetaParam return self._metadata[key] File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 91, in __getitem__ return dec(self._dict[key]) File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 201, in __getitem__ with self._accessWrapper(): File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__ return self.gen.next() File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 154, in _accessWrapper self.refresh() File 
"/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 232, in refresh lines = self._metaRW.readlines() File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 271, in readlines for tag in vg.tags: File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 81, in __getattr__ raise AttributeError("Failed reload: %s" % self.name) AttributeError: Failed reload: 9745f66e-8d8e-4257-a652-4346601ecd4f After that, the VG is reported as [unknown] with Checksum error at offset.. 2018-12-04 15:20:40,419+0200 ERROR (monitor/9745f66) [storage.Monitor] Error checking domain 9745f66e-8d8e-4257-a652-4346601ecd4f (monitor:424) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 405, in _checkDomainStatus self.domain.selftest() File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 48, in __getattr__ return getattr(self.getRealDomain(), attrName) File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain return self._cache._realProduce(self._sdUUID) File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce domain = self._findDomain(sdUUID) File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain return findMethod(sdUUID) File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1637, in findDomain return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID)) File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1577, in findDomainPath raise se.StorageDomainDoesNotExist(sdUUID) StorageDomainDoesNotExist: Storage domain does not exist: (u'9745f66e-8d8e-4257-a652-4346601ecd4f',) [root@puma45 ~]# pvs /dev/mapper/20024f40058540033: Checksum error at offset 65298432 Couldn't read volume group metadata. /dev/mapper/20024f40058540033: Checksum error at offset 16492605014016 Couldn't read volume group metadata. /dev/mapper/20024f40058540033: Checksum error at offset 65298432 Couldn't read volume group metadata. /dev/mapper/20024f40058540033: Checksum error at offset 16492605014016 Couldn't read volume group metadata. /dev/mapper/20024f40058540033: Checksum error at offset 65298432 Couldn't read volume group metadata. /dev/mapper/20024f40058540033: Checksum error at offset 16492605014016 Couldn't read volume group metadata. WARNING: PV /dev/mapper/20024f40058540033 is marked in use but no VG was found using it. WARNING: PV /dev/mapper/20024f40058540033 might need repairing. PV VG Fmt Attr PSize PFree /dev/mapper/20024f40058540033 [unknown] lvm2 u-- 15.00t 0 /dev/mapper/360002ac0000000000000009600021f6b f30b23ee-d99c-43a5-9794-fd22010e8d46 lvm2 a-- <12.00t 7.97t /dev/sda3 VolGroup01 lvm2 a-- 461.07g 0 This issue keeps repeating also in RHV automation on Netapp as reported in bug 1637405 and now we see it also with other storages. vdsm-4.20.28-1.el7ev.x86_64 kernel - 3.10.0-862.11.6.el7.x86_64 device-mapper-multipath-0.4.9-119.el7_5.1.x86_64 lvm2-2.02.177-4.el7.x86_64 I could reproduce this issue outside of oVirt using the attached reproducer script. The script simulate extend flow in oVirt system using 2 hosts: - one host run manger extend flow - this host runs commands like lvcreate, lvextend, and lvremove - the other host run regular host refresh flow - this host runs only "lvchange --available y" once and many "lvchange --refresh" commands. we expect both to be readonly. Here is results of example run: 1. 
1. Create the test VG:

   # python extend.py create-vg test-vg /dev/mapper/{36001405560f9d98242841cea375c479e,36001405847ec541476043d4ac9260d41,360014055fc8e87631db462caf5970d32,360014052557433e46b14abd91cfaae1d}
   2019-01-04 16:29:31,213 INFO    (MainThread) creating vg test-vg

2. Start the manager flow on this host:

   # time python extend.py run-manager test-vg /dev/mapper/{36001405560f9d98242841cea375c479e,36001405847ec541476043d4ac9260d41,360014055fc8e87631db462caf5970d32,360014052557433e46b14abd91cfaae1d} > extend-run-manager.log 2>&1

3. After a few seconds, start the regular flow on another host:

   # time python extend.py run-regular test-vg /dev/mapper/{36001405560f9d98242841cea375c479e,36001405847ec541476043d4ac9260d41,360014055fc8e87631db462caf5970d32,360014052557433e46b14abd91cfaae1d} > extend-run-regular.log 2>&1

The manager flow fails after 6 seconds with a corrupted VG. Here are errors from the manager node log:

Traceback (most recent call last):
  File "extend.py", line 310, in manager_worker
    lvm.extend_lv(lv_name, options.lv_size_mb)
  File "extend.py", line 157, in extend_lv
    full_name)
  File "extend.py", line 225, in run
    return run(cmd)
  File "extend.py", line 341, in run
    raise CommandError(cmd, p.returncode, out, err)
CommandError: Command ['lvm', 'lvextend', '--config', ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=[ "a|^/dev/mapper/36001405560f9d98242841cea375c479e$|", "a|^/dev/mapper/36001405847ec541476043d4ac9260d41$|", "a|^/dev/mapper/360014055fc8e87631db462caf5970d32$|", "a|^/dev/mapper/360014052557433e46b14abd91cfaae1d$|", "r|.*|" ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 } ', '--autobackup', 'n', '--size', '+128m', 'test-vg/worker-003'] failed rc=5 out= err=
  /dev/mapper/36001405560f9d98242841cea375c479e: Checksum error at offset 1994752
  Couldn't read volume group metadata from /dev/mapper/36001405560f9d98242841cea375c479e.
  Metadata location on /dev/mapper/36001405560f9d98242841cea375c479e at 1994752 has invalid summary for VG.
  Failed to read metadata summary from /dev/mapper/36001405560f9d98242841cea375c479e
  Failed to scan VG from /dev/mapper/36001405560f9d98242841cea375c479e
  Volume group "test-vg" not found
  Cannot process volume group test-vg

...

2019-01-04 16:30:45,661 ERROR   (worker-038) failed
Traceback (most recent call last):
  File "extend.py", line 310, in manager_worker
    lvm.extend_lv(lv_name, options.lv_size_mb)
  File "extend.py", line 157, in extend_lv
    full_name)
  File "extend.py", line 225, in run
    return run(cmd)
  File "extend.py", line 341, in run
    raise CommandError(cmd, p.returncode, out, err)
CommandError: Command ['lvm', 'lvextend', '--config', ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=[ "a|^/dev/mapper/36001405560f9d98242841cea375c479e$|", "a|^/dev/mapper/36001405847ec541476043d4ac9260d41$|", "a|^/dev/mapper/360014055fc8e87631db462caf5970d32$|", "a|^/dev/mapper/360014052557433e46b14abd91cfaae1d$|", "r|.*|" ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 } ', '--autobackup', 'n', '--size', '+128m', 'test-vg/worker-038'] failed rc=5 out= err=
  Scan of VG test-vg from /dev/mapper/36001405560f9d98242841cea375c479e found metadata seqno 143 vs previous 142.
  /dev/mapper/36001405560f9d98242841cea375c479e: Checksum error at offset 2432000
  Couldn't read volume group metadata from /dev/mapper/36001405560f9d98242841cea375c479e.
  Metadata location on /dev/mapper/36001405560f9d98242841cea375c479e at 2432000 has invalid summary for VG.
  Failed to read metadata summary from /dev/mapper/36001405560f9d98242841cea375c479e
  Failed to scan VG from /dev/mapper/36001405560f9d98242841cea375c479e
  VG info not found after rescan of test-vg
  Volume group "test-vg" not found
  Cannot process volume group test-vg

Here are some errors from the regular node run:

2019-01-04 16:30:45,691 ERROR   (worker-034) failed to refresh worker-034
Traceback (most recent call last):
  File "extend.py", line 276, in regular_worker
    lvm.refresh_lv(lv_name)
  File "extend.py", line 162, in refresh_lv
    return self.run("lvchange", "--refresh", full_name)
  File "extend.py", line 225, in run
    return run(cmd)
  File "extend.py", line 341, in run
    raise CommandError(cmd, p.returncode, out, err)
CommandError: Command ['lvm', 'lvchange', '--config', ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=[ "a|^/dev/mapper/36001405560f9d98242841cea375c479e$|", "a|^/dev/mapper/36001405847ec541476043d4ac9260d41$|", "a|^/dev/mapper/360014055fc8e87631db462caf5970d32$|", "a|^/dev/mapper/360014052557433e46b14abd91cfaae1d$|", "r|.*|" ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 } ', '--refresh', 'test-vg/worker-034'] failed rc=5 out= err=
  /dev/mapper/36001405560f9d98242841cea375c479e: Checksum error at offset 2432000
  Couldn't read volume group metadata from /dev/mapper/36001405560f9d98242841cea375c479e.
  Metadata location on /dev/mapper/36001405560f9d98242841cea375c479e at 2432000 has invalid summary for VG.
  Failed to read metadata summary from /dev/mapper/36001405560f9d98242841cea375c479e
  Failed to scan VG from /dev/mapper/36001405560f9d98242841cea375c479e
  VG info not found after rescan of test-vg
  Volume group "test-vg" not found
  Cannot process volume group test-vg

...

2019-01-04 16:30:45,729 ERROR   (worker-015) failed
Traceback (most recent call last):
  File "extend.py", line 272, in regular_worker
    lvm.activate_lv(lv_name)
  File "extend.py", line 171, in activate_lv
    return self.run("lvchange", "--available", "y", full_name)
  File "extend.py", line 225, in run
    return run(cmd)
  File "extend.py", line 341, in run
    raise CommandError(cmd, p.returncode, out, err)
CommandError: Command ['lvm', 'lvchange', '--config', ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=[ "a|^/dev/mapper/36001405560f9d98242841cea375c479e$|", "a|^/dev/mapper/36001405847ec541476043d4ac9260d41$|", "a|^/dev/mapper/360014055fc8e87631db462caf5970d32$|", "a|^/dev/mapper/360014052557433e46b14abd91cfaae1d$|", "r|.*|" ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 } ', '--available', 'y', 'test-vg/worker-015'] failed rc=5 out= err=
  Scan of VG test-vg from /dev/mapper/36001405560f9d98242841cea375c479e found metadata seqno 88 vs previous 87.
  /dev/mapper/36001405560f9d98242841cea375c479e: Checksum error at offset 2432000
  Couldn't read volume group metadata from /dev/mapper/36001405560f9d98242841cea375c479e.
  Metadata location on /dev/mapper/36001405560f9d98242841cea375c479e at 2432000 has invalid summary for VG.
  Failed to read metadata summary from /dev/mapper/36001405560f9d98242841cea375c479e
  Failed to scan VG from /dev/mapper/36001405560f9d98242841cea375c479e
  VG info not found after rescan of test-vg
  Recovery of volume group "test-vg" failed.
  Cannot process volume group test-vg

My theory is that the "lvchange --refresh" command sees inconsistent metadata on storage, since the storage is being modified very quickly by the manager host, and then tries to "recover" the metadata:

  Recovery of volume group "test-vg" failed.

This corrupts the VG metadata, since only the manager host may change the VG metadata.

Tested on CentOS 7.6:

# rpm -qa | grep lvm2
lvm2-2.02.180-10.el7_6.2.x86_64
udisks2-lvm2-2.7.3-8.el7.x86_64
lvm2-libs-2.02.180-10.el7_6.2.x86_64

David, do we have a way to ensure that lvchange is read-only, like "lvs --readonly"? Or maybe we need to use the (deprecated) global/locking_type=4 option for all commands on regular nodes?

Created attachment 1518421 [details]
reproducer script
Created attachment 1518422 [details]
log from reproducer script running manager flow
Created attachment 1518423 [details]
log from reproducer script running regular node flow
In RHEL 7, locking_type=4 is still the way to do this (it's deprecated in RHEL 8).

Reading a VG has this hidden and dangerous mis-feature of writing the VG if it happens to see something wrong with it. I've been meaning to fix this for a long time, but there was a lot of preparatory work required. It happens that over the past couple of weeks I finally began working on this and am nearly finished. Reading a VG will no longer write it and will just use a good copy of the metadata. An explicit metadata repair command will be required to fix corrupt or stale metadata on a PV.

Created attachment 1518426 [details]
reproducer script with locking_type 4 change
The same reproducer script using locking_type=1 for manager flows and locking_type=4 for regular node flows.
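For illustration only, here is a rough sketch of how the per-role --config string might be built. The helper name is hypothetical and the attached script may differ; the point is that only global/locking_type changes between the two roles (locking_type=1 on the manager, locking_type=4 on regular nodes), matching the config strings visible in the command errors elsewhere in this bug:

    # Hypothetical helper, not taken from the attached script.
    def lvm_config(devices, read_only):
        # Regular-node (read-only) commands use locking_type=4 so a read can
        # never turn into a metadata "recovery" write; the manager keeps
        # locking_type=1 and remains the only writer.
        locking_type = 4 if read_only else 1
        filters = " ".join('"a|^%s$|",' % dev for dev in devices)
        return (
            ' devices { preferred_names = ["^/dev/mapper/"]'
            ' ignore_suspended_devices=1 write_cache_state=0'
            ' disable_after_error_count=3 filter=[ %s "r|.*|" ] }'
            ' global { locking_type=%d prioritise_write_locks=1'
            ' wait_for_locks=1 use_lvmetad=0 }'
            ' backup { retain_min=50 retain_days=0 } '
        ) % (filters, locking_type)

A regular-node refresh would then run something like ['lvm', 'lvchange', '--config', lvm_config(devices, read_only=True), '--refresh', lv_name].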
Created attachment 1518427 [details]
log from reproducer script with locking_type 4 running manager flow
Created attachment 1518429 [details]
log from reproducer script with locking_type 4 running regular node flow
This log from the regular node shows about a 36% error rate refreshing LVs while the LVs
are being manipulated on the manager node.
$ xzgrep -E 'INFO.+refreshing' extend-run-regular.log.xz | wc -l
54822
$ xzgrep -E 'ERROR.+failed to refresh' extend-run-regular.log.xz | wc -l
19785
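(For reference, 19785 failed refreshes out of 54822 refresh attempts is 19785 / 54822 ≈ 0.36, which is where the ~36% figure above comes from.)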
You can also try setting --config global/metadata_read_only=1 in the refresh command.

(In reply to David Teigland from comment #66)

Thanks David, I wonder how this issue did not pop up earlier.

I added locking_type=4 to the new reproducer script (attachment 1518426 [details]). With this, the script completed successfully creating, extending and deleting 500 LVs, with 50 concurrent threads on the manager node and 50 concurrent refresh threads on the regular node. You can see the log in attachment 1518427 [details].

However, on the regular node about 36% of the read-only lvchange operations failed, as can be seen in attachment 1518429 [details]. Most of the errors were in "lvchange --refresh":

$ xzgrep CommandError extend-run-regular.log.xz | grep '\--refresh' | wc -l
19785

$ xzgrep CommandError extend-run-regular.log.xz | grep '\--available' | wc -l
11

Here are some example failures:

CommandError: Command ['lvm', 'lvchange', '--config', ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=[ "a|^/dev/mapper/36001405560f9d98242841cea375c479e$|", "a|^/dev/mapper/36001405847ec541476043d4ac9260d41$|", "a|^/dev/mapper/360014055fc8e87631db462caf5970d32$|", "a|^/dev/mapper/360014052557433e46b14abd91cfaae1d$|", "r|.*|" ] } global { locking_type=4 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 } ', '--refresh', 'test-vg/worker-006'] failed rc=5 out= err=
  Scan of VG test-vg from /dev/mapper/36001405560f9d98242841cea375c479e found metadata seqno 109 vs previous 108.
  Read-only locking type set. Write locks are prohibited.
  Recovery of volume group "test-vg" failed.
  Cannot process volume group test-vg

CommandError: Command ['lvm', 'lvchange', '--config', ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=[ "a|^/dev/mapper/36001405560f9d98242841cea375c479e$|", "a|^/dev/mapper/36001405847ec541476043d4ac9260d41$|", "a|^/dev/mapper/360014055fc8e87631db462caf5970d32$|", "a|^/dev/mapper/360014052557433e46b14abd91cfaae1d$|", "r|.*|" ] } global { locking_type=4 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 } ', '--available', 'y', 'test-vg/worker-001'] failed rc=5 out= err=
  Scan of VG test-vg from /dev/mapper/36001405560f9d98242841cea375c479e found metadata seqno 99 vs previous 98.
  Read-only locking type set. Write locks are prohibited.
  Recovery of volume group "test-vg" failed.
  Cannot process volume group test-vg

If we apply locking_type=4 to regular nodes in vdsm, we will have a new issue with these refresh failures. They may happen randomly in many flows, with unpredictable results. Way better than corrupting VG metadata, but still an issue.

It looks like "lvchange --refresh" is not trying hard enough to deal with concurrent remote updates: it does multiple reads from storage and fails if the storage was changed between the reads. I would expect that, when running with locking_type=4, it would retry the read from storage if the storage was changed in the middle of the operation, and fail only after a reasonable number of retries.

For now I think we can treat all failures of read-only commands on regular nodes as temporary, log a warning, and retry the operation 2-3 times before failing the flow. Based on the info in this bug, the corruption rate was once every few weeks, so these errors would probably be very rare, and the cost of retrying the operation would be negligible.
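A minimal sketch of that retry idea, with hypothetical names (this is not vdsm code): treat a failing read-only lvm command as a transient error and retry it a few times with an increasing delay before failing the flow.

    import logging
    import subprocess
    import time

    def run_readonly(cmd, retries=3, delay=1.0):
        # Retry read-only lvm commands (e.g. "lvchange --refresh" with
        # locking_type=4) that fail because another host changed the VG
        # metadata while this command was scanning it.
        for attempt in range(1, retries + 1):
            p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            out, err = p.communicate()
            if p.returncode == 0:
                return out
            logging.warning("attempt %d/%d failed: rc=%d err=%r",
                            attempt, retries, p.returncode, err)
            time.sleep(delay)
            delay *= 2  # back off, giving the writer time to finish the update
        raise RuntimeError("%s failed after %d attempts" % (cmd, retries))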
David, can we configure lvm to be more robust regarding remote updates in RHEL 7? Is there any advantage to using --config global/metadata_read_only=1?

Before the scanning improvements in 7.6, lvm would ignore the checksum errors during the scan and continue with reading the VG. When reading the VG it would repeat reading the disks and might not see the same problem, or it would potentially try to repair the metadata. Without locking_type=4 or readonly, that repair could end up corrupting the metadata.

The scanning/reading of metadata has needed an overhaul for a long time. I completed the scanning rewrite, and now I am doing the reading rewrite. 7.6 fell between the two, and this seems to be one effect of the incomplete transition.

The way vdsm uses lvm, without locking between reads and writes on different nodes, we naturally expect to see issues like this. The new reading code will allow this to work because it will just take and use a good copy of the metadata when doing reporting or activation/refresh commands (and it won't ever try to repair something while reading it). Even when this work is done, if you are unlucky, all the copies of metadata seen by the refresh command could be partially written and fail with a checksum error, leaving the command with nothing to use. We could look into adding an automatic retry for this case, but that's still no guarantee, it just reduces the chances. For now I suspect that retrying the refresh command in vdsm is the only option.

(In reply to David Teigland from comment #70)
> You can also try setting --config global/metadata_read_only=1 in the refresh
> command.

I tried metadata_read_only=1, and it is not allowed for lvchange --refresh:

CommandError: Command ['lvm', 'lvchange', '--config', ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=[ "a|^/dev/mapper/36001405560f9d98242841cea375c479e$|", "a|^/dev/mapper/36001405847ec541476043d4ac9260d41$|", "a|^/dev/mapper/360014055fc8e87631db462caf5970d32$|", "a|^/dev/mapper/360014052557433e46b14abd91cfaae1d$|", "r|.*|" ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 metadata_read_only=1 } backup { retain_min=50 retain_days=0 } ', '--refresh', 'test-vg/worker-006'] failed rc=5 out= err=
  Operation prohibited while global/metadata_read_only is set.
  Failed to suspend test-vg/worker-006.

So locking_type=4 seems to be the only way for RHEL 7.

(In reply to David Teigland from comment #72)
> The way vdsm uses lvm without locking between reads and writes between
> nodes, we naturally expect to see issues like this. The new reading code
> will allow this to work because it will just take and use a good copy of the
> metadata when doing reporting or activation/refresh commands (and it won't
> ever try to repair something while reading it.) Even when this work is
> done, if you are unlucky, all the copies of metadata seen by the refresh
> command could be partially written and fail with a checksum error,

How can we see partly written metadata? I thought that lvm does this:

- take the metadata lock
- write the new metadata content to the next area in the circular buffer
- if the write was successful, write a new metadata header pointing to the new metadata content
- release the metadata lock

With this flow partial metadata is not possible, and metadata changes are atomic.

Correct, but you are modifying and reading the metadata from different hosts, right?
In that case you do not have a lock to prevent reading at any point during the modification of the buffer or header.

We need to check whether limiting the number of concurrent lvm commands (https://gerrit.ovirt.org/c/66932/) can also lower the chance of this issue. This could be an easy change that we can ship for 4.2.z. I don't think we can support locking_type=4 in 4.2.z.

Moving to 4.2.9 given comment #75; when the 4.2.8 async version is available we'll try to push the patch limiting concurrent commands there, after it has been tested.

I tested limiting the number of concurrent lvm operations to 10, and it does not prevent the corruption.

I created an improved simulator for locking type 4, simulating better what the real system does during the extend flow. The simulator has 2 modes:

- run-manager - simulates the manager node flows
- run-regular - simulates the regular node extend flow

The run-manager command starts a socket server, waiting for client commands. The run-regular command starts multiple workers based on the --concurrency option. Every worker connects to the server and performs this flow (see also the sketch after the simulator attachment below):

    for every iteration:
        create LV
        activate LV
        repeat 20 times:
            extend LV
            refresh LV
            sleep --extend-delay seconds
        deactivate LV
        remove LV

The read-write operations (create LV, extend LV, remove LV) are performed on the manager node. The read-only operations (activate LV, refresh LV, deactivate LV) are performed locally.

Read-only commands are performed using locking_type=4, ensuring that LVM does not attempt to recover the metadata if a command finds the storage in an inconsistent state. The command will fail in this case.

To handle the expected failures of read-only commands, we use a retry mechanism with exponential back-off. When a command fails, we wait and retry the command; each time the command fails we double the wait time, to let the manager finish pending metadata updates.

I tested creating 50,000 LVs, and then extending and refreshing them 20 times, a total of 1,000,000 successful operations at a rate of 29.7 operations per second. I also did some short runs creating 500 LVs, to check the retry failure rate at lower operation rates. Here is a table with the results from all runs:

    extend-delay   extends   retries  max-retry  total-time  extend-rate  retry-rate
    ---------------------------------------------------------------------------------
               1   1000000    237250          8       33677        29.69      21.56%
               2     10000      1027          5         491        20.36       9.33%
               5     10000       221          2        1138         8.78       2.00%
              10     10000       100          2        2220         4.50       0.90%
              20     10000        35          1        4382         2.28       0.31%
              40     10000         8          1        8585         1.16       0.09%

This shows that locking type 4 works well even under extreme load, but we must be prepared to handle failures of lvm commands running in read-only mode.

The next step is to support locking type 4 in vdsm:

- When the host runs as SPM, all lvm commands must run in read-write mode, recovering the VG metadata if needed.
- When the host is not the SPM, all lvm commands must run in read-only mode so they do not try to recover the VG metadata.
- When running in read-only mode, retry failed commands. Testing shows that 1-2 retries are enough under normal load.
- When the host changes its read-only mode, wait for ongoing commands running in the old mode.
- Creating and removing storage domains can be done on any host. These flows should be able to override the lvm read-only mode if they do not run on the SPM.

Open issues:

- Before the SPM can start, it must be able to activate the special LVs. If there is an issue with the VG metadata, activating the special LVs may fail, since we must activate them in read-only mode.
- To acquire the SPM lease, we must be able to activate the ids and leases special volumes, but we may fail to activate them in read-only mode.

Created attachment 1519091 [details]
locking type 4 simulator
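As referenced in the flow description above, here is a rough sketch of what each simulator worker does. The names are hypothetical and the attached script may differ in detail:

    import time

    def worker_flow(manager, lvm, lv_name, iterations, extend_delay):
        # Read-write operations (create/extend/remove) are requested from the
        # manager node over the socket; read-only operations (activate/refresh/
        # deactivate) run locally with locking_type=4 and are retried on failure.
        for _ in range(iterations):
            manager.create_lv(lv_name)
            lvm.activate_lv(lv_name)          # local, read-only
            for _ in range(20):
                manager.extend_lv(lv_name)
                lvm.refresh_lv(lv_name)       # local, read-only
                time.sleep(extend_delay)
            lvm.deactivate_lv(lv_name)        # local, read-only
            manager.remove_lv(lv_name)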
Created attachment 1519092 [details]
log for running 50 threads, 1000 lvs per thread, 1 second extend delay
Created attachment 1519093 [details]
log for running 50 threads, 100 lvs per thread, 2 second extend delay
Created attachment 1519094 [details]
log for running 50 threads, 100 lvs per thread, 5 second extend delay
Created attachment 1519095 [details]
log for running 50 threads, 100 lvs per thread, 10 second extend delay
Created attachment 1519096 [details]
log for running 50 threads, 100 lvs per thread, 20 second extend delay
Created attachment 1519097 [details]
log for running 50 threads, 100 lvs per thread, 40 second extend delay
Nir, does this mean support will also need to use locking_type=4 when issuing LVM commands that modify the metadata, and we should also use metadata_read_only when issuing commands that just read data? Maybe this is another reason to get those tools merged...

(In reply to Germano Veit Michel from comment #85)

Yes, using raw lvm commands on a system without an lvm filter is risky. See also the changes in the vdsm sos plugin:
https://gerrit.ovirt.org/c/96774/
https://github.com/sosreport/sos/issues/1533

Maybe we need to have /usr/share/vdsm/lvm/lvm.conf with the proper settings for accessing shared storage. This could be used by lvm commands like this:

    LVM_SYSTEM_DIR=/usr/share/vdsm/lvm lvs -o +tags

Or add a "vdsm-tool lvm" wrapper that will do the right thing.

Tal, Nir, as the 4.2.8 final build has already arrived, should we change the target to 4.3?

(In reply to Daniel Gur from comment #87)
> Tal, Nir As 4.2.8 Final build is already arrived should we change the Target
> top 4.3?

This is targeted to the 4.2.8 async release.

Looking forward to a resolution of this issue; RE: WoRH CASE 02265434.

(In reply to Nir Soffer from comment #86)
> Yes, using raw lvm commands on a system without lvm filter is risky.
> See also changes in vdsm sos plugin:
> https://gerrit.ovirt.org/c/96774/
> https://github.com/sosreport/sos/issues/1533
>
> Maybe we need to have /usr/share/vdsm/lvm/lvm.conf with the proper setting
> for accessing shared storage. This can be used by lvm commands like this:
>
> LVM_SYSTEM_DIR=/usr/share/vdsm/lvm lvs -o +tags
>
> Or add "vdsm-tool lvm" wrapper that will do the right thing.

Thanks Nir. Nice catch on the lvm2 plugin too. Ideally I want all support to use the snapshot-tools, so that no manual lvm command is run and everything is done via the vdsm API. But in the meantime, I'll start writing an article explaining these changes and how we should run LVM commands. We already use some options for filters and lvmetad in our "base" syntax.

Targeting to 4.2.8-2 when it becomes available.

Is this still to be fixed? If so, can you please re-target to 4.2.8-3?

(In reply to Sandro Bonazzola from comment #92)
> Is this still to be fixed? If so, can you please re-target to 4.2.8-3 ?

We are working on the master fix. When we have something for master we will evaluate again whether this can be delivered in 4.2.

Hi Tal, this bug is not in the 4.2.8-4 bug list (taken from the 'RHV Bugzilla Searches' doc) [1]. Please re-target.

[1] https://bugzilla.redhat.com/buglist.cgi?action=wrap&classification=Red%20Hat&f2=cf_ovirt_team&f3=OP&f4=product&f5=classification&f6=CP&j3=OR&list_id=10078056&o2=notsubstring&o4=equals&o5=equals&resolution=---&resolution=CURRENTRELEASE&resolution=ERRATA&target_milestone=ovirt-4.2.8-4&v2=docs&v4=Red%20Hat%20Enterprise%20Virtualization%20Manager&v5=oVirt

Can't, no 4.2.9 version yet. I suspect we might have to change the product to RHEV, Aileen?

Please move to 4.4.

(In reply to Allie DeVolder from comment #107)

Can we get vdsm logs from all hosts in the DC at the time of the error? Or, if this is easier, the output of:

    xzgrep WARN /var/log/vdsm/vdsm.log*

and the estimated time of the first error about a bad checksum.

If the issue is indeed LVM corrupting its metadata during a read-only command such as "lvs" or "lvchange --refresh", we expect to see warnings like:

    WARNING: Inconsistent metadata found for VG xxx-yyy-zzz - updating to use version 42

To see these warnings, you need to run vdsm 4.30.20.
The log was added in:

commit 0d6d17dbe3d926460b13f0e31f4c97d55572d09a
Author: Nir Soffer <nsoffer>
Date:   Wed May 29 00:26:11 2019 +0300

    lvm: Log LVM warnings for successful commands

nijin, please check comment 108. Did we see these warnings in other cases?

To fix this in RHV we need to use locking_type=4 in all LVM commands running on a non-SPM host. However, the pvs command cannot be used with locking_type=4 in some cases. We depend on bug 1809660; once this fix is available in RHEL 7.7 we can require it and use locking_type=4.

The ovirt-4.3 branch includes all the patches required to fix the data corruption except the issue with the pvs command, which can be resolved only when we have a build for bug 1809660. QE can start testing this build now to check for regressions:
https://jenkins.ovirt.org/job/vdsm_standard-on-merge/2405/artifact/build-artifacts.el7.x86_64/

Note that this version includes some unwanted side effects, like retrying lvm read-only commands when they should not be retried, which may cause delays in some flows. We are working on fixing these issues.

The fix is included in ovirt-4.3.9, in vdsm-4.30.43.

Note: On CentOS the pvs command uses locking_type=1 instead of locking_type=4. This is unsafe and may cause data corruption. However, the pvs command is rarely used, so the risk is very low. Once lvm2-2.02.186-7.el7_8.1 is available in CentOS, we will release an update fixing this issue.

*** Bug 1693075 has been marked as a duplicate of this bug. ***

Due to the change in the locking mechanism in LVM for RHEL 8, this issue does not exist in RHV 4.4 (also tested by the RHV scale team), so we are closing this 4.4 bug. For tracking the 4.3.9 fix, see bug 1811391.

As QE validated this in 4.4, I changed it to closed current-release.

David, can you describe how you tested this on RHEL 8.2?

1. Used the latest updated extend.py script.

2. On the first host:

   python2 extend.py create-vg <VG_NAME> /dev/mapper/<LUN_ID>
   firewall-cmd --add-port=8000/tcp
   python2 extend.py run-manager -p 8000 <VG_NAME> /dev/mapper/<LUN_ID> 2>1_run-manager.log

3. On the second host:

   python2 extend.py run-regular --iterations 1000 -p 8000 <First Host hostname/IP> <VG_NAME> /dev/mapper/<LUN_ID> 2>1_run-regular.log

*** Bug 1692230 has been marked as a duplicate of this bug. ***