Bug 1500395 - One iSCSI storage domain has become inactive after Cannot refresh Logical Volume error
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.1.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.2.1
Target Release: ---
Assignee: Nir Soffer
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-10-10 13:45 UTC by Filip Brychta
Modified: 2020-05-20 23:19 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-01-26 02:15:20 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
vdsm, messages, engine logs (1.61 MB, application/x-gzip)
2017-10-10 13:45 UTC, Filip Brychta
lvm metadata for version 561 (25.00 KB, text/plain)
2018-01-23 21:38 UTC, Nir Soffer

Description Filip Brychta 2017-10-10 13:45:34 UTC
Created attachment 1336774 [details]
vdsm, messages, engine logs

Description of problem:
One iSCSI storage domain has become inactive after the following error in the vdsm log:
LogicalVolumeRefreshError: Cannot refresh Logical Volume: ('lvchange --refresh 30a57b16-33b0-43c7-becd-50c574e86b54/metadata failed',)
2017-09-26 05:31:37,250-0400 INFO  (monitor/30a57b1) [storage.Monitor] Domain 30a57b16-33b0-43c7-becd-50c574e86b54 became INVALID (monitor:456)

Storage domain is still available in engine db:
see attached db.out

LUN is still available:
see attached lsscsi.out which contains output of lsscsi -is

The physical volume is still accessible, but it is not part of any volume group - see the attached lvm.out, which contains the output of pvs, vgs, and lvs
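
For reference, the same check can be repeated on the host with something like
the following (a sketch; the multipath device name is taken from the attached
lvm.out):

    # an empty VG column means the PV does not belong to any volume group
    pvs -o pv_name,vg_name,pv_size /dev/mapper/360080e50002d65400000b4ea596efd30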


Version-Release number of selected component (if applicable):
Red Hat Virtualization Manager Version: 4.1.3.5-0.1.el7

Host:
OS Version:
RHEL - 7.3 - 7.el7
OS Description:
Red Hat Enterprise Linux
Kernel Version:
3.10.0 - 514.26.2.el7.x86_64
KVM Version:
2.6.0 - 28.el7_3.12
LIBVIRT Version:
libvirt-2.0.0-10.el7_3.9
VDSM Version:
vdsm-4.19.20-1.el7ev
SPICE Version:
0.12.4 - 20.el7_3
GlusterFS Version:
[N/A]
CEPH Version:
librbd1-0.94.5-1.el7

How reproducible:
Unknown


Additional info:
Relevant vdsm, engine, and system logs and command outputs are attached.
sosreport from one host is available here: http://web.bc.jonqe.lab.eng.bos.redhat.com/logs/sosreport-b10.jonqe.lab.eng.bos.redhat.com-20171010064005.tar.xz
output of ovirt-log-collector is available here:
http://web.bc.jonqe.lab.eng.bos.redhat.com/logs/sosreport-LogCollector-20171010082742.tar.xz

The problem started at: 2017-09-26 05:31:37,221-0400 ERROR (monitor/30a57b1) [storage.Monitor] Error checking domain 30a57b16-33b0-43c7-becd-50c574e86b54 (monitor:426)
(attached vdsm.log.41)
Note that the ovirt-log-collector output contains logs for another issue, which happened on Oct 3 12:42:01 and is not relevant to this BZ. Logs newer than 28 Sep can be ignored.
There is no issue visible on the IBM storage side.

Comment 1 Nir Soffer 2018-01-21 22:58:47 UTC
We don't have any info about the refresh error; I filed bug 1536880 to improve
this.


We can see that the metadata lv was refreshed successfully 5 minutes before:

2017-09-26 05:26:35,069-0400 INFO  (monitor/30a57b1) [storage.LVM] Refreshing LVs (vg=30a57b16-33b0-43c7-becd-50c574e86b54, lvs=['metadata', 'leases', 'ids', 'inbox', 'outbox', 'xleases', 'master']) (lvm:1319)


Reading from the metadata lv was successful during this time:

2017-09-26 05:30:39,832-0400 INFO  (periodic/58) [vdsm.api] FINISH repoStats return={...,
u'30a57b16-33b0-43c7-becd-50c574e86b54': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.000535195', 'lastCheck': '2.8', 'valid': True},
...} from=internal (api:52)

2017-09-26 05:30:51,817-0400 INFO  (jsonrpc/6) [vdsm.api] FINISH repoStats return={...,
u'30a57b16-33b0-43c7-becd-50c574e86b54': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.416647', 'lastCheck': '4.8', 'valid': True},
...} from=::ffff:10.16.23.33,56374, flow_id=6f60b7de (api:52)


2017-09-26 05:31:07,237-0400 INFO  (jsonrpc/1) [vdsm.api] FINISH repoStats return={..., 
u'30a57b16-33b0-43c7-becd-50c574e86b54': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.116993', 'lastCheck': '0.2', 'valid': True},
...} from=::ffff:10.16.23.33,56374, flow_id=432e2930 (api:52)


2017-09-26 05:31:09,948-0400 INFO  (periodic/53) [vdsm.api] FINISH repoStats return={...,
u'30a57b16-33b0-43c7-becd-50c574e86b54': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.116993', 'lastCheck': '2.9', 'valid': True},
...} from=internal (api:52)

2017-09-26 05:31:22,516-0400 INFO  (jsonrpc/2) [vdsm.api] FINISH repoStats return={...,
u'30a57b16-33b0-43c7-becd-50c574e86b54': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.000701047', 'lastCheck': '5.5', 'valid': True},
...} from=::ffff:10.16.23.33,56374, flow_id=6dc3fb59 (api:52)


Then refreshing the lv fails, and from this point monitoring fails every 10 seconds,
since we failed to produce a domain. We don't have any info about why the refresh failed:

2017-09-26 05:31:37,221-0400 ERROR (monitor/30a57b1) [storage.Monitor] Error checking domain 30a57b16-33b0-43c7-becd-50c574e86b54 (monitor:426)
...
  File "/usr/share/vdsm/storage/lvm.py", line 1326, in refreshLVs
    raise se.LogicalVolumeRefreshError("%s failed" % list2cmdline(cmd))
LogicalVolumeRefreshError: Cannot refresh Logical Volume: ('lvchange --refresh 30a57b16-33b0-43c7-becd-50c574e86b54/metadata failed',)


This causes the domain to become INVALID, which is expected when a basic
operation like refreshing an lv fails:

2017-09-26 05:31:37,250-0400 INFO  (monitor/30a57b1) [storage.Monitor] Domain 30a57b16-33b0-43c7-becd-50c574e86b54 became INVALID (monitor:456)
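
If this happens again, the failing refresh can be re-run manually with verbose
LVM output to capture the reason (a sketch; run on the affected host, vdsm's
LVM filter options are omitted here):

    # the same refresh that vdsm runs, with maximum LVM verbosity
    lvchange --refresh -vvvv 30a57b16-33b0-43c7-becd-50c574e86b54/metadata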


In 4.2 we don't refresh the metadata in this flow, unless it was extended on the SPM
(very unlikely). So the system will not fail in the same way, but if we cannot access
storage, the system will fail later in another way.

We don't see any failure in /var/log/messages around this time.

In lvm.out we cannot see the storage domain, and the LUN does not
belong to any vg:

  PV                                            VG                                   Fmt  Attr PSize   PFree 
  /dev/mapper/360080e50002d65400000b4ea596efd30                                      lvm2 ---    2.73t  2.73t

It looks like the pv holding the metadata lv was corrupted, so lvm cannot detect
this vg. We keep lvm metadata only on the first pv, in the first extent
(1 MiB-129 MiB), and in the last extent.

There is no info that can explain why the pv was corrupted.

Filip, can you copy the first 129 MiB of this LUN and share it?

You can copy it using:

    dd if=/dev/mapper/360080e50002d65400000b4ea596efd30 bs=1M count=129 \
        iflag=direct of=360080e50002d65400000b4ea596efd30-md

Is it possible that the LUN was accessed from some other host by mistake,
corrupting the lvm metadata? Is it accessible on machines other than the
RHV hosts?

Comment 2 Filip Brychta 2018-01-22 16:31:17 UTC
You can get the data here - http://web.bc.jonqe.lab.eng.bos.redhat.com/360080e50002d65400000b4ea596efd30-md

More people have access to those LUNs, so corruption of the lvm metadata by mistake is possible but not very probable. As far as I know, nobody except RHEV was interacting with those LUNs.

Not sure if this is relevant, but the corrupted domain was used as the Target Storage Domain for VM Leases by several VMs.


If it really looks like corruption by mistake, please don't waste much time on this. Nothing important was running there so we can just destroy and re-create that storage domain.

Comment 3 Nir Soffer 2018-01-23 21:36:45 UTC
(In reply to Filip Brychta from comment #2)
> http://web.bc.jonqe.lab.eng.bos.redhat.com/360080e50002d65400000b4ea596efd30-
> md

Thanks Filip!

We have 562 metadata versions in this vg:

$ grep -a '# Generated by LVM2 version' 360080e50002d65400000b4ea596efd30-md | wc -l
562

$ grep -a '# Generated by LVM2 version' 360080e50002d65400000b4ea596efd30-md | head
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:32 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:33 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:35 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:35 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:36 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:36 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:36 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:37 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:39 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:40 2017

$ grep -a '# Generated by LVM2 version' 360080e50002d65400000b4ea596efd30-md | tail
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 03:11:14 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:26:10 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:26:12 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:26:13 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:27:41 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:27:41 2017
# Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Sep 26 05:27:44 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:27:50 2017
# Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Sep 26 05:27:53 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:28:00 2017
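
The fragment for a particular version can be located in the raw dump by its
sequence number (a sketch, assuming GNU grep; -a treats the binary dump as text
and -n prints the matching line number):

$ grep -a -n 'seqno = 561' 360080e50002d65400000b4ea596efd30-md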


This is the metadata of version 561 (one before last):


# Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Sep 26 05:27:53 2017

contents = "Text Format Volume Group"
version = 1

description = ""

creation_host = "b07.jonqe.lab.eng.bos.redhat.com"      # Linux b07.jonqe.lab.eng.bos.redhat.com 3.10.0-693.el7.x86_64 #1 SMP Thu Jul 6 19:56:57 EDT 2017 x86_64
creation_time = 1506418073      # Tue Sep 26 05:27:53 2017

^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@30a57b16-33b0-43c7-becd-50c574e86b54 {
id = "0hhT8i-MrVx-4O43-Bpmz-bRSe-qFaP-pgyAD9"
seqno = 561
format = "lvm2"
status = ["RESIZEABLE", "READ", "WRITE"]
flags = []
tags = ["MDT_LEASETIMESEC=60", "MDT_VERSION=4",
"MDT_PV0=pv:360080e50002d65400000b4ea596efd30&44&uuid:O2WSMq-W0if-JQNJ-fq8g-5YkE-Znr9-fUkoqq&44&pestart:0&44&pecount:22348&44&mapoffset:0",
"MDT_LOCKPOLICY=", "MDT_TYPE=ISCSI", "MDT_LOGBLKSIZE=512", "MDT_SDUUID=30a57b16-33b0-43c7-becd-50c574e86b54",
"MDT_LEASERETRIES=3", "MDT_IOOPTIMEOUTSEC=10", "MDT_LOCKRENEWALINTERVALSEC=5", "MDT_ROLE=Regular", "MDT_PHYBLKSIZE=512",
"MDT_VGUUID=0hhT8i-MrVx-4O43-Bpmz-bRSe-qFaP-pgyAD9", "MDT_CLASS=Data", "MDT_DESCRIPTION=BC_userspace2", "RHAT_storage_domain",
"MDT_POOL_UUID=00000001-0001-0001-0001-000000000311", "MDT__SHA_CKSUM=052616d282bb34083f7b9adb6af41ca018d52cde"]
extent_size = 262144
max_lv = 0
max_pv = 0
metadata_copies = 0

physical_volumes {

pv0 {
id = "O2WSMq-W0if-JQNJ-fq8g-5YkE-Znr9-fUkoqq"
device = "/dev/mapper/360080e50002d65400000b4ea596efd30"

status = ["ALLOCATABLE"]
flags = []
dev_size = 5859022848
pe_start = 264192
pe_count = 22348
}
}
logical_volumes {

metadata {
id = "HERyXk-jEkV-K01q-zWR3-SOl1-tn34-cvqIyt"
status = ["READ", "WRITE", "VISIBLE"]
flags = []
creation_time = 1500625593
creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
segment_count = 1

segment1 {
start_extent = 0
extent_count = 4

type = "striped"
stripe_count = 1

stripes = [
"pv0", 0
]
}
}

outbox {
id = "pNpOWD-cE8b-N3VV-1oNc-uSZ9-hBRt-Rab5Zq"
status = ["READ", "WRITE", "VISIBLE"]
flags = []
creation_time = 1500625595
creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
segment_count = 1

segment1 {
start_extent = 0
extent_count = 1

type = "striped"
stripe_count = 1

stripes = [
"pv0", 4
]
}
}

xleases {
id = "dpYJs5-DTDu-feHC-eSbp-JgvA-XpKU-dlo44R"
status = ["READ", "WRITE", "VISIBLE"]
flags = []
creation_time = 1500625595
creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
segment_count = 1

segment1 {
start_extent = 0
extent_count = 8

type = "striped"
stripe_count = 1

stripes = [
"pv0", 5
]
}
}

leases {
id = "Jg0Qdd-rIet-H6Ne-GsYM-JGw9-Y3sf-yK5jh6"
status = ["READ", "WRITE", "VISIBLE"]
flags = []
creation_time = 1500625596
creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
segment_count = 1

segment1 {
start_extent = 0
extent_count = 16

type = "striped"
stripe_count = 1

stripes = [
"pv0", 13
]
}
}

ids {
id = "sPXwXd-gMvc-W3ub-YlLr-uycc-BAqU-TkkeJw"
status = ["READ", "WRITE", "VISIBLE"]
flags = []
creation_time = 1500625596
creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
segment_count = 1

segment1 {
start_extent = 0
extent_count = 1

type = "striped"
stripe_count = 1

stripes = [
"pv0", 29
]
}
}

inbox {
id = "hzCPkH-4u3R-udzr-J7D3-8rsP-RzF3-Kk5y99"
status = ["READ", "WRITE", "VISIBLE"]
flags = []
creation_time = 1500625596
creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
segment_count = 1

segment1 {
start_extent = 0
extent_count = 1

type = "striped"
stripe_count = 1

stripes = [
"pv0", 30
]
}
}

master {
id = "8j2uqL-sTT9-FRwT-KIkA-OzVS-ZqAM-ZVnceS"
status = ["READ", "WRITE", "VISIBLE"]
flags = []
creation_time = 1500625597
creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
segment_count = 1

segment1 {
start_extent = 0
extent_count = 8

type = "striped"
stripe_count = 1

stripes = [
"pv0", 31
]
}
}

62f87a59-4826-4fc0-a074-811a94807cbd {
id = "L4GBvQ-SdWe-ni4G-y2XX-hWlt-ZIZg-DPY0P2"
status = ["READ", "WRITE", "VISIBLE"]
flags = []
tags = ["MD_4", "PU_00000000-0000-0000-0000-000000000000", "IU_ca7a2d81-3b95-4757-8d6d-1762d9b9059e"]
creation_time = 1500629019
creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
segment_count = 1

segment1 {
start_extent = 0
extent_count = 1

type = "striped"
stripe_count = 1

stripes = [
"pv0", 39
]
}
}

[trimmed more logical volumes]

And this is the metadata of the last version:

# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:28:00 2017

contents = "Text Format Volume Group"
version = 1

description = ""

creation_host = "b09.jonqe.lab.eng.bos.redhat.com"      # Linux b09.jonqe.lab.eng.bos.redhat.com 3.10.0-514.26.2.el7.x86_64 #1 SMP Fri Jun 30 05:26:04 UTC 2017 x86_64
creation_time = 1506418080      # Tue Sep 26 05:28:00 2017

^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^...


It looks like someone ran pvcreate or vgcreate on this LUN from host
b09.jonqe.lab.eng.bos.redhat.com at Tue Sep 26 05:28:00 2017,
deleting the vg metadata. This machine runs LVM2 2.02.166, while
the version before that was generated by LVM2 2.02.171.

So it looks like somebody who had access to this LUN tried to
use it and deleted the vg metadata created by RHV.
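
For the record, recovery was not attempted here; the usual approach would be to
extract an intact metadata version (e.g. seqno 561) from the dump into a file
and restore it with the standard LVM tools. A rough sketch only, assuming the
extracted text is saved as a valid backup file (the name metadata-seqno-561.txt
is hypothetical):

    # hypothetical recovery sketch - NOT performed in this bug.
    # Recreating the PV with its original UUID is only needed if the
    # PV header was also overwritten:
    pvcreate --uuid O2WSMq-W0if-JQNJ-fq8g-5YkE-Znr9-fUkoqq \
        --restorefile metadata-seqno-561.txt \
        /dev/mapper/360080e50002d65400000b4ea596efd30
    # restore the VG metadata from the extracted backup text
    vgcfgrestore -f metadata-seqno-561.txt 30a57b16-33b0-43c7-becd-50c574e86b54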

Comment 4 Nir Soffer 2018-01-23 21:38:37 UTC
Created attachment 1385115 [details]
lvm metadata for version 561

Fragment from the metadata area of /dev/mapper/360080e50002d65400000b4ea596efd30
showing version 561 from Tue Sep 26 05:27:53 2017.

Comment 5 Nir Soffer 2018-01-23 22:02:20 UTC
Zdenek, can you check comment 3 and confirm my theory?

The first 129 MiB of the vg are available at:
http://web.bc.jonqe.lab.eng.bos.redhat.com/360080e50002d65400000b4ea596efd30-md

RHV uses extent size of 128 MiB, so this includes the entire metadata area.
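
To double-check which host wrote each metadata version, the creation_host lines
can be listed from the same dump (a sketch, assuming GNU grep):

$ grep -a 'creation_host = ' 360080e50002d65400000b4ea596efd30-md | tail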

Comment 6 Filip Brychta 2018-01-25 09:17:32 UTC
Thank you for the investigation. I discussed it with all the people who have access to b09, and nobody remembers doing anything related to LVM there. The login logs have been rotated out, though, so I can't confirm that nobody logged in that day.

Feel free to close this as not a bug. Thank you for the help, and sorry for the false alarm. If it happens again, I will first make sure it is not human error.

Comment 7 Nir Soffer 2018-01-26 02:15:20 UTC
Filip, thanks for reporting this. Please reopen if you have data suggesting that
this was not a user error.

Comment 8 Franta Kust 2019-05-16 12:55:15 UTC
BZ<2>Jira re-sync

