Created attachment 1336774 [details]
vdsm, messages, engine logs

Description of problem:
One iSCSI storage domain became inactive after the following error in the vdsm log:

LogicalVolumeRefreshError: Cannot refresh Logical Volume: ('lvchange --refresh 30a57b16-33b0-43c7-becd-50c574e86b54/metadata failed',)
2017-09-26 05:31:37,250-0400 INFO (monitor/30a57b1) [storage.Monitor] Domain 30a57b16-33b0-43c7-becd-50c574e86b54 became INVALID (monitor:456)

The storage domain is still present in the engine db: see attached db.out.
The LUN is still available: see attached lsscsi.out, which contains the output of lsscsi -is.
The physical volume is still accessible but it is not part of any volume group: see attached lvm.out, which contains the output of pvs, vgs and lvs.

Version-Release number of selected component (if applicable):
Red Hat Virtualization Manager Version: 4.1.3.5-0.1.el7

Host:
OS Version: RHEL - 7.3 - 7.el7
OS Description: Red Hat Enterprise Linux
Kernel Version: 3.10.0 - 514.26.2.el7.x86_64
KVM Version: 2.6.0 - 28.el7_3.12
LIBVIRT Version: libvirt-2.0.0-10.el7_3.9
VDSM Version: vdsm-4.19.20-1.el7ev
SPICE Version: 0.12.4 - 20.el7_3
GlusterFS Version: [N/A]
CEPH Version: librbd1-0.94.5-1.el7

How reproducible:
Unknown

Additional info:
Relevant vdsm, engine and system logs and command outputs are attached.

sosreport from one host is available here:
http://web.bc.jonqe.lab.eng.bos.redhat.com/logs/sosreport-b10.jonqe.lab.eng.bos.redhat.com-20171010064005.tar.xz

Output of ovirt-log-collector is available here:
http://web.bc.jonqe.lab.eng.bos.redhat.com/logs/sosreport-LogCollector-20171010082742.tar.xz

The problem started at:
2017-09-26 05:31:37,221-0400 ERROR (monitor/30a57b1) [storage.Monitor] Error checking domain 30a57b16-33b0-43c7-becd-50c574e86b54 (monitor:426)
(attached vdsm.log.41)

Note that the output of ovirt-log-collector also covers another issue, which happened on Oct 3 12:42:01 and is not relevant for this bz. Logs newer than 28 Sep can be ignored.
There is no issue visible on the IBM storage side.
We don't have any info about the refresh error; I filed bug 1536880 to improve this.

We can see that the metadata lv was refreshed successfully 5 minutes before:

2017-09-26 05:26:35,069-0400 INFO (monitor/30a57b1) [storage.LVM] Refreshing LVs (vg=30a57b16-33b0-43c7-becd-50c574e86b54, lvs=['metadata', 'leases', 'ids', 'inbox', 'outbox', 'xleases', 'master']) (lvm:1319)

Reading from the metadata lv was successful during this time:

2017-09-26 05:30:39,832-0400 INFO (periodic/58) [vdsm.api] FINISH repoStats return={..., u'30a57b16-33b0-43c7-becd-50c574e86b54': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.000535195', 'lastCheck': '2.8', 'valid': True}, ...} from=internal (api:52)
2017-09-26 05:30:51,817-0400 INFO (jsonrpc/6) [vdsm.api] FINISH repoStats return={..., u'30a57b16-33b0-43c7-becd-50c574e86b54': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.416647', 'lastCheck': '4.8', 'valid': True}, ...} from=::ffff:10.16.23.33,56374, flow_id=6f60b7de (api:52)
2017-09-26 05:31:07,237-0400 INFO (jsonrpc/1) [vdsm.api] FINISH repoStats return={..., u'30a57b16-33b0-43c7-becd-50c574e86b54': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.116993', 'lastCheck': '0.2', 'valid': True}, ...} from=::ffff:10.16.23.33,56374, flow_id=432e2930 (api:52)
2017-09-26 05:31:09,948-0400 INFO (periodic/53) [vdsm.api] FINISH repoStats return={..., u'30a57b16-33b0-43c7-becd-50c574e86b54': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.116993', 'lastCheck': '2.9', 'valid': True}, ...} from=internal (api:52)
2017-09-26 05:31:22,516-0400 INFO (jsonrpc/2) [vdsm.api] FINISH repoStats return={..., u'30a57b16-33b0-43c7-becd-50c574e86b54': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.000701047', 'lastCheck': '5.5', 'valid': True}, ...} from=::ffff:10.16.23.33,56374, flow_id=6dc3fb59 (api:52)

Then refreshing the lv fails, and from this point monitoring fails every 10 seconds, since we failed to produce a domain. We don't have any info on why the refresh failed:

2017-09-26 05:31:37,221-0400 ERROR (monitor/30a57b1) [storage.Monitor] Error checking domain 30a57b16-33b0-43c7-becd-50c574e86b54 (monitor:426)
...
  File "/usr/share/vdsm/storage/lvm.py", line 1326, in refreshLVs
    raise se.LogicalVolumeRefreshError("%s failed" % list2cmdline(cmd))
LogicalVolumeRefreshError: Cannot refresh Logical Volume: ('lvchange --refresh 30a57b16-33b0-43c7-becd-50c574e86b54/metadata failed',)

This causes the domain to become INVALID, which is expected when a basic operation like refreshing an lv fails:

2017-09-26 05:31:37,250-0400 INFO (monitor/30a57b1) [storage.Monitor] Domain 30a57b16-33b0-43c7-becd-50c574e86b54 became INVALID (monitor:456)

In 4.2 we don't refresh the metadata in this flow, unless it was extended on the SPM (very unlikely). So the system will not fail in the same way, but if we cannot access storage, the system will fail later in another way.

We don't see any failure in /var/log/messages around this time.

In lvm.out we cannot see the storage domain, and the LUN does not belong to any vg:

  PV                                            VG Fmt  Attr PSize PFree
  /dev/mapper/360080e50002d65400000b4ea596efd30    lvm2 ---  2.73t 2.73t

It looks like the pv holding the metadata lv was corrupted, so lvm cannot detect this vg. We keep lvm metadata only on the first pv, in the first extent (1m-129m), and in the last extent. There is no info that can explain why the pv was corrupted.

Filip, can you copy the first 129 MiB of this LUN and share it? You can copy it using:

dd if=/dev/mapper/360080e50002d65400000b4ea596efd30 bs=1M count=129 \
    iflag=direct of=360080e50002d65400000b4ea596efd30-md

Is it possible that the LUN was accessed from some other host by mistake, corrupting the lvm metadata? Is it accessible on other machines besides the RHV hosts?
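Once the dump exists, it can be sanity-checked for LVM text metadata before uploading. A minimal sketch; the real input would be the 360080e50002d65400000b4ea596efd30-md file produced by the dd command above, but a tiny stand-in file is fabricated here so the commands are runnable anywhere:

```shell
# Fabricate a stand-in for the dumped metadata area (the real file is
# the 129 MiB dd output; LVM text metadata is embedded in binary data,
# hence grep -a below).
printf '%s\n' \
  '# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:32 2017' \
  '# Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Sep 26 05:27:53 2017' \
  > md-sample.img

# Count the metadata version headers present in the area; a dump of an
# active vg should contain many of these comments.
grep -ac '# Generated by LVM2 version' md-sample.img
```

On a real dump, dropping the -c flag lists each metadata version together with the LVM version and timestamp that wrote it, which is useful for building a timeline of who touched the vg.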
You can get the data here:
http://web.bc.jonqe.lab.eng.bos.redhat.com/360080e50002d65400000b4ea596efd30-md

More people have access to those LUNs, so corruption of the lvm metadata by mistake is possible but not very probable. As far as I know, nobody except RHV was interacting with those LUNs.

Not sure if this is relevant, but the corrupted domain was used as the Target Storage Domain for VM Leases by several VMs.

If it really looks like corruption by mistake, please don't waste much time on this. Nothing important was running there, so we can just destroy and re-create that storage domain.
(In reply to Filip Brychta from comment #2)
> http://web.bc.jonqe.lab.eng.bos.redhat.com/360080e50002d65400000b4ea596efd30-
> md

Thanks Filip!

We have 562 metadata versions in this vg:

$ grep -a '# Generated by LVM2 version' 360080e50002d65400000b4ea596efd30-md | wc -l
562

$ grep -a '# Generated by LVM2 version' 360080e50002d65400000b4ea596efd30-md | head
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:32 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:33 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:35 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:35 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:36 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:36 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:36 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:37 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:39 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:40 2017

$ grep -a '# Generated by LVM2 version' 360080e50002d65400000b4ea596efd30-md | tail
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 03:11:14 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:26:10 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:26:12 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:26:13 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:27:41 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:27:41 2017
# Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Sep 26 05:27:44 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:27:50 2017
# Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Sep 26 05:27:53 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:28:00 2017

This is the metadata of version 561 (one before last):

# Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Sep 26 05:27:53 2017

contents = "Text Format Volume Group"
version = 1
description = ""
creation_host = "b07.jonqe.lab.eng.bos.redhat.com"	# Linux b07.jonqe.lab.eng.bos.redhat.com 3.10.0-693.el7.x86_64 #1 SMP Thu Jul 6 19:56:57 EDT 2017 x86_64
creation_time = 1506418073	# Tue Sep 26 05:27:53 2017

^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ [long run of NUL bytes trimmed]

30a57b16-33b0-43c7-becd-50c574e86b54 {
	id = "0hhT8i-MrVx-4O43-Bpmz-bRSe-qFaP-pgyAD9"
	seqno = 561
	format = "lvm2"
	status = ["RESIZEABLE", "READ", "WRITE"]
	flags = []
	tags = ["MDT_LEASETIMESEC=60", "MDT_VERSION=4", "MDT_PV0=pv:360080e50002d65400000b4ea596efd30&44&uuid:O2WSMq-W0if-JQNJ-fq8g-5YkE-Znr9-fUkoqq&44&pestart:0&44&pecount:22348&44&mapoffset:0", "MDT_LOCKPOLICY=", "MDT_TYPE=ISCSI", "MDT_LOGBLKSIZE=512", "MDT_SDUUID=30a57b16-33b0-43c7-becd-50c574e86b54", "MDT_LEASERETRIES=3", "MDT_IOOPTIMEOUTSEC=10", "MDT_LOCKRENEWALINTERVALSEC=5", "MDT_ROLE=Regular", "MDT_PHYBLKSIZE=512", "MDT_VGUUID=0hhT8i-MrVx-4O43-Bpmz-bRSe-qFaP-pgyAD9", "MDT_CLASS=Data", "MDT_DESCRIPTION=BC_userspace2", "RHAT_storage_domain", "MDT_POOL_UUID=00000001-0001-0001-0001-000000000311", "MDT__SHA_CKSUM=052616d282bb34083f7b9adb6af41ca018d52cde"]
	extent_size = 262144
	max_lv = 0
	max_pv = 0
	metadata_copies = 0

	physical_volumes {
		pv0 {
			id = "O2WSMq-W0if-JQNJ-fq8g-5YkE-Znr9-fUkoqq"
			device = "/dev/mapper/360080e50002d65400000b4ea596efd30"
			status = ["ALLOCATABLE"]
			flags = []
			dev_size = 5859022848
			pe_start = 264192
			pe_count = 22348
		}
	}

	logical_volumes {
		metadata {
			id = "HERyXk-jEkV-K01q-zWR3-SOl1-tn34-cvqIyt"
			status = ["READ", "WRITE", "VISIBLE"]
			flags = []
			creation_time = 1500625593
			creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
			segment_count = 1
			segment1 {
				start_extent = 0
				extent_count = 4
				type = "striped"
				stripe_count = 1
				stripes = [ "pv0", 0 ]
			}
		}
		outbox {
			id = "pNpOWD-cE8b-N3VV-1oNc-uSZ9-hBRt-Rab5Zq"
			status = ["READ", "WRITE", "VISIBLE"]
			flags = []
			creation_time = 1500625595
			creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
			segment_count = 1
			segment1 {
				start_extent = 0
				extent_count = 1
				type = "striped"
				stripe_count = 1
				stripes = [ "pv0", 4 ]
			}
		}
		xleases {
			id = "dpYJs5-DTDu-feHC-eSbp-JgvA-XpKU-dlo44R"
			status = ["READ", "WRITE", "VISIBLE"]
			flags = []
			creation_time = 1500625595
			creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
			segment_count = 1
			segment1 {
				start_extent = 0
				extent_count = 8
				type = "striped"
				stripe_count = 1
				stripes = [ "pv0", 5 ]
			}
		}
		leases {
			id = "Jg0Qdd-rIet-H6Ne-GsYM-JGw9-Y3sf-yK5jh6"
			status = ["READ", "WRITE", "VISIBLE"]
			flags = []
			creation_time = 1500625596
			creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
			segment_count = 1
			segment1 {
				start_extent = 0
				extent_count = 16
				type = "striped"
				stripe_count = 1
				stripes = [ "pv0", 13 ]
			}
		}
		ids {
			id = "sPXwXd-gMvc-W3ub-YlLr-uycc-BAqU-TkkeJw"
			status = ["READ", "WRITE", "VISIBLE"]
			flags = []
			creation_time = 1500625596
			creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
			segment_count = 1
			segment1 {
				start_extent = 0
				extent_count = 1
				type = "striped"
				stripe_count = 1
				stripes = [ "pv0", 29 ]
			}
		}
		inbox {
			id = "hzCPkH-4u3R-udzr-J7D3-8rsP-RzF3-Kk5y99"
			status = ["READ", "WRITE", "VISIBLE"]
			flags = []
			creation_time = 1500625596
			creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
			segment_count = 1
			segment1 {
				start_extent = 0
				extent_count = 1
				type = "striped"
				stripe_count = 1
				stripes = [ "pv0", 30 ]
			}
		}
		master {
			id = "8j2uqL-sTT9-FRwT-KIkA-OzVS-ZqAM-ZVnceS"
			status = ["READ", "WRITE", "VISIBLE"]
			flags = []
			creation_time = 1500625597
			creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
			segment_count = 1
			segment1 {
				start_extent = 0
				extent_count = 8
				type = "striped"
				stripe_count = 1
				stripes = [ "pv0", 31 ]
			}
		}
		62f87a59-4826-4fc0-a074-811a94807cbd {
			id = "L4GBvQ-SdWe-ni4G-y2XX-hWlt-ZIZg-DPY0P2"
			status = ["READ", "WRITE", "VISIBLE"]
			flags = []
			tags = ["MD_4", "PU_00000000-0000-0000-0000-000000000000", "IU_ca7a2d81-3b95-4757-8d6d-1762d9b9059e"]
			creation_time = 1500629019
			creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
			segment_count = 1
			segment1 {
				start_extent = 0
				extent_count = 1
				type = "striped"
				stripe_count = 1
				stripes = [ "pv0", 39 ]
			}
		}

[trimmed more logical volumes]

And this is the metadata of the last version:

# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:28:00 2017

contents = "Text Format Volume Group"
version = 1
description = ""
creation_host = "b09.jonqe.lab.eng.bos.redhat.com"	# Linux b09.jonqe.lab.eng.bos.redhat.com 3.10.0-514.26.2.el7.x86_64 #1 SMP Fri Jun 30 05:26:04 UTC 2017 x86_64
creation_time = 1506418080	# Tue Sep 26 05:28:00 2017

^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ [rest of the area is NUL bytes]...

It looks like someone ran pvcreate or vgcreate on this LUN from host b09.jonqe.lab.eng.bos.redhat.com at Tue Sep 26 05:28:00 2017, deleting the vg metadata. That machine runs LVM2 2.02.166, while the version before that was written by LVM2 2.02.171.
So it looks like somebody who had access to this LUN tried to use it and deleted the vg metadata created by RHV.
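For completeness: had recovery been attempted instead of re-creating the domain, the usual approach is to carve the last intact metadata text out of the dump and feed it to vgcfgrestore. Below is a rough, hypothetical sketch of the carving step only; the file and vg name (testvg) are fabricated stand-ins, real dumps interleave versions with NUL padding (so a pass through tr -d '\0' would come first), and the vgcfgrestore call is shown only as a comment, not run:

```shell
# Stand-in for the text extracted from a dumped metadata area: two
# metadata versions of a vg named "testvg". Real LVM metadata nests
# sub-blocks with indentation, so the top-level closing brace is the
# one at column 0 - the awk below relies on that.
cat > md-area.txt <<'EOF'
testvg {
id = "AAAA"
seqno = 560
}
# Generated by LVM2 version 2.02.166: Tue Sep 26 05:27:50 2017
testvg {
id = "AAAA"
seqno = 561
}
# Generated by LVM2 version 2.02.171: Tue Sep 26 05:27:53 2017
EOF

# Carve the LAST complete "testvg { ... }" block out of the area.
awk '$1 == "testvg" && $2 == "{" { buf = ""; inblk = 1 }
     inblk { buf = buf $0 "\n" }
     $0 == "}" { inblk = 0; last = buf }
     END { printf "%s", last }' md-area.txt > restore.txt

# A recovery attempt (NOT run here; requires the real vg) would then be:
#   vgcfgrestore -f restore.txt testvg
grep 'seqno' restore.txt
```

This only works when an intact pre-corruption version survives in the metadata ring buffer; in this bug the newest version had already replaced the vg, which is exactly why the older version 561 was the interesting one.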
Created attachment 1385115 [details]
lvm metadata for version 561

Fragment from the metadata area of /dev/mapper/360080e50002d65400000b4ea596efd30 showing version 561 from Tue Sep 26 05:27:53 2017.
Zdenek, can you check comment 3 and confirm my theory?

The first 129 MiB of the vg are available at:
http://web.bc.jonqe.lab.eng.bos.redhat.com/360080e50002d65400000b4ea596efd30-md

RHV uses an extent size of 128 MiB, so this includes the entire metadata area.
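The 129 MiB figure is consistent with the pe_start value in the metadata dump above (264192 sectors of 512 bytes): data extents only start at the first physical extent, so copying up to that offset captures the whole first metadata area. A quick check of the arithmetic:

```shell
# pe_start and extent_size from the vg metadata above, in 512-byte sectors.
pe_start=264192
extent_size=262144

# Convert to MiB: sectors * 512 bytes, divided by 1 MiB.
echo $(( pe_start * 512 / 1024 / 1024 ))      # first extent starts at 129 MiB
echo $(( extent_size * 512 / 1024 / 1024 ))   # each extent is 128 MiB
```

So dd with bs=1M count=129 copies everything in front of the first extent, i.e. the pv label, header, and the full metadata ring buffer.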
Thank you for the investigation. I discussed it with all the people with access to b09 and nobody remembers doing anything related to LVM there. Login logs have already rolled over, though, so I can't confirm that nobody logged in that day.

Feel free to close as not a bug. Thank you for the help and sorry for the false alarm. If it happens again, I will first make sure it's not human error.
Filip, thanks for reporting this. Please reopen if you have data suggesting that this was not a user error.