Bug 1500395

| Summary: | One iSCSI storage domain has become inactive after "Cannot refresh Logical Volume" error | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Filip Brychta <fbrychta> |
| Component: | vdsm | Assignee: | Nir Soffer <nsoffer> |
| Status: | CLOSED NOTABUG | QA Contact: | Raz Tamir <ratamir> |
| Severity: | high | Priority: | high |
| Version: | 4.1.3 | Target Milestone: | ovirt-4.2.1 |
| Hardware: | Unspecified | OS: | Unspecified |
| Type: | Bug | oVirt Team: | Storage |
| Last Closed: | 2018-01-26 02:15:20 UTC | CC: | amureini, bazulay, fbrychta, lsurette, srevivo, ycui, ykaul, ylavi, zkabelac |
Description
Filip Brychta
2017-10-10 13:45:34 UTC
We don't have any info about the refresh error; I filed bug 1536880 to improve this.

We can see that the metadata lv was refreshed successfully 5 minutes before:

2017-09-26 05:26:35,069-0400 INFO (monitor/30a57b1) [storage.LVM] Refreshing LVs (vg=30a57b16-33b0-43c7-becd-50c574e86b54, lvs=['metadata', 'leases', 'ids', 'inbox', 'outbox', 'xleases', 'master']) (lvm:1319)

Reading from the metadata lv was successful during this time:

2017-09-26 05:30:39,832-0400 INFO (periodic/58) [vdsm.api] FINISH repoStats return={..., u'30a57b16-33b0-43c7-becd-50c574e86b54': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.000535195', 'lastCheck': '2.8', 'valid': True}, ...} from=internal (api:52)
2017-09-26 05:30:51,817-0400 INFO (jsonrpc/6) [vdsm.api] FINISH repoStats return={..., u'30a57b16-33b0-43c7-becd-50c574e86b54': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.416647', 'lastCheck': '4.8', 'valid': True}, ...} from=::ffff:10.16.23.33,56374, flow_id=6f60b7de (api:52)
2017-09-26 05:31:07,237-0400 INFO (jsonrpc/1) [vdsm.api] FINISH repoStats return={..., u'30a57b16-33b0-43c7-becd-50c574e86b54': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.116993', 'lastCheck': '0.2', 'valid': True}, ...} from=::ffff:10.16.23.33,56374, flow_id=432e2930 (api:52)
2017-09-26 05:31:09,948-0400 INFO (periodic/53) [vdsm.api] FINISH repoStats return={..., u'30a57b16-33b0-43c7-becd-50c574e86b54': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.116993', 'lastCheck': '2.9', 'valid': True}, ...} from=internal (api:52)
2017-09-26 05:31:22,516-0400 INFO (jsonrpc/2) [vdsm.api] FINISH repoStats return={..., u'30a57b16-33b0-43c7-becd-50c574e86b54': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.000701047', 'lastCheck': '5.5', 'valid': True}, ...} from=::ffff:10.16.23.33,56374, flow_id=6dc3fb59 (api:52)

Then refreshing the lv fails, and from this point monitoring fails every 10 seconds, since we failed to produce a domain. We don't have any info about why the refresh failed:

2017-09-26 05:31:37,221-0400 ERROR (monitor/30a57b1) [storage.Monitor] Error checking domain 30a57b16-33b0-43c7-becd-50c574e86b54 (monitor:426)
...
  File "/usr/share/vdsm/storage/lvm.py", line 1326, in refreshLVs
    raise se.LogicalVolumeRefreshError("%s failed" % list2cmdline(cmd))
LogicalVolumeRefreshError: Cannot refresh Logical Volume: ('lvchange --refresh 30a57b16-33b0-43c7-becd-50c574e86b54/metadata failed',)

This causes the domain to become INVALID, which is expected when a basic operation like refreshing an lv fails:

2017-09-26 05:31:37,250-0400 INFO (monitor/30a57b1) [storage.Monitor] Domain 30a57b16-33b0-43c7-becd-50c574e86b54 became INVALID (monitor:456)

In 4.2 we don't refresh the metadata lv in this flow, unless it was extended on the SPM (very unlikely). So the system will not fail in the same way, but if we cannot access storage, the system will fail later in another way.

We don't see any failure in /var/log/messages around this time. In lvm.out we cannot see the storage domain, and the LUN does not belong to any vg:

PV                                             VG   Fmt   Attr  PSize  PFree
/dev/mapper/360080e50002d65400000b4ea596efd30       lvm2  ---   2.73t  2.73t

It looks like the pv holding the metadata lv was corrupted, so lvm cannot detect this vg. We keep lvm metadata only on the first pv, in the first extent (1 MiB to 129 MiB) and in the last extent. There is no info that can explain why the pv was corrupted.
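For context, the refresh that failed here is vdsm invoking lvchange. Below is a minimal sketch of that flow, not the actual vdsm code (the real refreshLVs also passes vdsm's lvm config and device filter, and the exception class here is a stand-in):

```python
import subprocess

class LogicalVolumeRefreshError(Exception):
    """Stand-in for vdsm's se.LogicalVolumeRefreshError."""

def refresh_lvs(vg_name, lv_names):
    # Refresh the lvs with a single "lvchange --refresh vg/lv ..." call
    # and raise on failure, as refreshLVs (lvm.py:1326 above) does.
    cmd = ["lvchange", "--refresh"]
    cmd.extend("%s/%s" % (vg_name, lv) for lv in lv_names)
    rc = subprocess.call(cmd)
    if rc != 0:
        # lvm's stderr is not part of the raised error, which is why the
        # log above gives no reason for the failure; improving this is
        # what bug 1536880 is about.
        raise LogicalVolumeRefreshError(
            "%s failed" % subprocess.list2cmdline(cmd))

# The failing call in this bug was equivalent to:
#     refresh_lvs("30a57b16-33b0-43c7-becd-50c574e86b54", ["metadata"])
```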
Filip, can you copy the first 129 MiB of this LUN and share it? You can copy it using:

    dd if=/dev/mapper/360080e50002d65400000b4ea596efd30 bs=1M count=129 \
        iflag=direct of=360080e50002d65400000b4ea596efd30-md

Is it possible that the LUN was accessed from some other host by mistake, corrupting the lvm metadata? Is it accessible on machines other than the RHV hosts?

You can get the data here: http://web.bc.jonqe.lab.eng.bos.redhat.com/360080e50002d65400000b4ea596efd30-md

More people have access to those LUNs, so corruption of the lvm metadata by mistake is possible, but not very probable. As far as I know, nobody except RHV was interacting with those LUNs. Not sure if this is relevant, but the corrupted domain was used as the Target Storage Domain for VM Lease by several VMs.

If it really looks like corruption by mistake, please don't waste much time on this. Nothing important was running there, so we can just destroy and re-create that storage domain.

(In reply to Filip Brychta from comment #2)
> http://web.bc.jonqe.lab.eng.bos.redhat.com/360080e50002d65400000b4ea596efd30-md

Thanks Filip! We have 562 metadata versions in this vg:

$ grep -a '# Generated by LVM2 version' 360080e50002d65400000b4ea596efd30-md | wc -l
562

$ grep -a '# Generated by LVM2 version' 360080e50002d65400000b4ea596efd30-md | head
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:32 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:33 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:35 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:35 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:36 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:36 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:36 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:37 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:39 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Fri Jul 21 04:26:40 2017

$ grep -a '# Generated by LVM2 version' 360080e50002d65400000b4ea596efd30-md | tail
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 03:11:14 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:26:10 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:26:12 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:26:13 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:27:41 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:27:41 2017
# Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Sep 26 05:27:44 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:27:50 2017
# Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Sep 26 05:27:53 2017
# Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:28:00 2017

This is the metadata of version 561 (one before last):

# Generated by LVM2 version 2.02.171(2)-RHEL7 (2017-05-03): Tue Sep 26 05:27:53 2017
contents = "Text Format Volume Group"
version = 1
description = ""
creation_host = "b07.jonqe.lab.eng.bos.redhat.com"	# Linux b07.jonqe.lab.eng.bos.redhat.com 3.10.0-693.el7.x86_64 #1 SMP Thu Jul 6 19:56:57 EDT 2017 x86_64
creation_time = 1506418073	# Tue Sep 26 05:27:53 2017
[long run of NUL (^@) padding trimmed]

30a57b16-33b0-43c7-becd-50c574e86b54 {
    id = "0hhT8i-MrVx-4O43-Bpmz-bRSe-qFaP-pgyAD9"
    seqno = 561
    format = "lvm2"
    status = ["RESIZEABLE", "READ", "WRITE"]
    flags = []
    tags = ["MDT_LEASETIMESEC=60", "MDT_VERSION=4", "MDT_PV0=pv:360080e50002d65400000b4ea596efd30&44&uuid:O2WSMq-W0if-JQNJ-fq8g-5YkE-Znr9-fUkoqq&44&pestart:0&44&pecount:22348&44&mapoffset:0", "MDT_LOCKPOLICY=", "MDT_TYPE=ISCSI", "MDT_LOGBLKSIZE=512", "MDT_SDUUID=30a57b16-33b0-43c7-becd-50c574e86b54", "MDT_LEASERETRIES=3", "MDT_IOOPTIMEOUTSEC=10", "MDT_LOCKRENEWALINTERVALSEC=5", "MDT_ROLE=Regular", "MDT_PHYBLKSIZE=512", "MDT_VGUUID=0hhT8i-MrVx-4O43-Bpmz-bRSe-qFaP-pgyAD9", "MDT_CLASS=Data", "MDT_DESCRIPTION=BC_userspace2", "RHAT_storage_domain", "MDT_POOL_UUID=00000001-0001-0001-0001-000000000311", "MDT__SHA_CKSUM=052616d282bb34083f7b9adb6af41ca018d52cde"]
    extent_size = 262144
    max_lv = 0
    max_pv = 0
    metadata_copies = 0

    physical_volumes {
        pv0 {
            id = "O2WSMq-W0if-JQNJ-fq8g-5YkE-Znr9-fUkoqq"
            device = "/dev/mapper/360080e50002d65400000b4ea596efd30"
            status = ["ALLOCATABLE"]
            flags = []
            dev_size = 5859022848
            pe_start = 264192
            pe_count = 22348
        }
    }

    logical_volumes {
        metadata {
            id = "HERyXk-jEkV-K01q-zWR3-SOl1-tn34-cvqIyt"
            status = ["READ", "WRITE", "VISIBLE"]
            flags = []
            creation_time = 1500625593
            creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
            segment_count = 1
            segment1 {
                start_extent = 0
                extent_count = 4
                type = "striped"
                stripe_count = 1
                stripes = [ "pv0", 0 ]
            }
        }

        outbox {
            id = "pNpOWD-cE8b-N3VV-1oNc-uSZ9-hBRt-Rab5Zq"
            status = ["READ", "WRITE", "VISIBLE"]
            flags = []
            creation_time = 1500625595
            creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
            segment_count = 1
            segment1 {
                start_extent = 0
                extent_count = 1
                type = "striped"
                stripe_count = 1
                stripes = [ "pv0", 4 ]
            }
        }

        xleases {
            id = "dpYJs5-DTDu-feHC-eSbp-JgvA-XpKU-dlo44R"
            status = ["READ", "WRITE", "VISIBLE"]
            flags = []
            creation_time = 1500625595
            creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
            segment_count = 1
            segment1 {
                start_extent = 0
                extent_count = 8
                type = "striped"
                stripe_count = 1
                stripes = [ "pv0", 5 ]
            }
        }

        leases {
            id = "Jg0Qdd-rIet-H6Ne-GsYM-JGw9-Y3sf-yK5jh6"
            status = ["READ", "WRITE", "VISIBLE"]
            flags = []
            creation_time = 1500625596
            creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
            segment_count = 1
            segment1 {
                start_extent = 0
                extent_count = 16
                type = "striped"
                stripe_count = 1
                stripes = [ "pv0", 13 ]
            }
        }

        ids {
            id = "sPXwXd-gMvc-W3ub-YlLr-uycc-BAqU-TkkeJw"
            status = ["READ", "WRITE", "VISIBLE"]
            flags = []
            creation_time = 1500625596
            creation_host = "b02.jonqe.lab.eng.bos.redhat.com"
            segment_count = 1
            segment1 {
                start_extent = 0
                extent_count = 1
                type = "striped"
                stripe_count = 1
                stripes = [ "pv0", 29 ]
            }
        }
"hzCPkH-4u3R-udzr-J7D3-8rsP-RzF3-Kk5y99" status = ["READ", "WRITE", "VISIBLE"] flags = [] creation_time = 1500625596 creation_host = "b02.jonqe.lab.eng.bos.redhat.com" segment_count = 1 segment1 { start_extent = 0 extent_count = 1 type = "striped" stripe_count = 1 stripes = [ "pv0", 30 ] } } master { id = "8j2uqL-sTT9-FRwT-KIkA-OzVS-ZqAM-ZVnceS" status = ["READ", "WRITE", "VISIBLE"] flags = [] creation_time = 1500625597 creation_host = "b02.jonqe.lab.eng.bos.redhat.com" segment_count = 1 segment1 { start_extent = 0 extent_count = 8 type = "striped" stripe_count = 1 stripes = [ "pv0", 31 ] } } 62f87a59-4826-4fc0-a074-811a94807cbd { id = "L4GBvQ-SdWe-ni4G-y2XX-hWlt-ZIZg-DPY0P2" status = ["READ", "WRITE", "VISIBLE"] flags = [] tags = ["MD_4", "PU_00000000-0000-0000-0000-000000000000", "IU_ca7a2d81-3b95-4757-8d6d-1762d9b9059e"] creation_time = 1500629019 creation_host = "b02.jonqe.lab.eng.bos.redhat.com" segment_count = 1 segment1 { start_extent = 0 extent_count = 1 type = "striped" stripe_count = 1 stripes = [ "pv0", 39 ] } } [trimmed more logical volumes] And this is the metadata of last version: # Generated by LVM2 version 2.02.166(2)-RHEL7 (2016-11-16): Tue Sep 26 05:28:00 2017 contents = "Text Format Volume Group" version = 1 description = "" creation_host = "b09.jonqe.lab.eng.bos.redhat.com" # Linux b09.jonqe.lab.eng.bos.redhat.com 3.10.0-514.26.2.el7.x86_64 #1 SMP Fri Jun 30 05:26:04 UTC 2017 x86_64 creation_time = 1506418080 # Tue Sep 26 05:28:00 2017 ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^... Looks like someone run pvcreate or vgcreate on this LUN from host b09.jonqe.lab.eng.bos.redhat.com at Tue Sep 26 05:28:00 2017, deleting the vg metadata. This machine run LVM2 2.02.166, while the version before that run LVM2 2.02.171. So this looks like somebody which had access to this LUN tryied to used it and deleted the vg metadata created by RHV. Created attachment 1385115 [details]
lvm metadata for version 561
Fragment from the metadata area of /dev/mapper/360080e50002d65400000b4ea596efd30
showing version 561 from Tue Sep 26 05:27:53 2017.
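The grep analysis above can also be reproduced over the attached dump with a short script. A minimal sketch, assuming the dump file name from the dd command; note that the creation_host lines come from the metadata bodies while the "# Generated by" lines are per-copy trailers, so the two lists line up only approximately:

```python
def scan_metadata_dump(path):
    # Scan a raw dump of the lvm metadata area, equivalent to:
    #   grep -a '# Generated by LVM2 version' <dump>
    # and also collect the creation_host of each metadata copy, to see
    # which host and which LVM build wrote the most recent versions.
    with open(path, "rb") as f:
        data = f.read()
    lines = data.split(b"\n")
    headers = [ln for ln in lines if b"# Generated by LVM2 version" in ln]
    hosts = [ln for ln in lines if b'creation_host = "' in ln]

    print("metadata versions found:", len(headers))
    for ln in headers[-5:]:  # like `... | tail`
        print(ln.decode("ascii", "replace").strip("\x00 \t"))
    for ln in hosts[-5:]:
        print(ln.decode("ascii", "replace").strip("\x00 \t"))

scan_metadata_dump("360080e50002d65400000b4ea596efd30-md")
```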
Zdenek, can you check comment 3 and confirm my theory? The first 129 MiB of the vg are available at:

http://web.bc.jonqe.lab.eng.bos.redhat.com/360080e50002d65400000b4ea596efd30-md

RHV uses an extent size of 128 MiB, so this includes the entire metadata area.

Thank you for the investigation. I discussed it with all the people who have access to b09 and nobody remembers doing anything related to LVM there. The login logs have already been rotated, though, so I can't confirm that nobody logged in that day. Feel free to close as not a bug. Thank you for the help, and sorry for the false alarm. If it happens again, I will first make sure it's not human error.

Filip, thanks for reporting this. Please reopen if you have data suggesting that this was not a user error.
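As a closing sanity check on the 129 MiB figure: the version-561 metadata above gives pe_start and extent_size in 512-byte sectors, and a quick computation confirms that the first physical extent starts at 129 MiB, so the lvm label and the whole metadata area sit below the offset covered by the dd command above:

```python
pe_start = 264192       # sectors, from the pv0 section above
extent_size = 262144    # sectors, from the vg section above

print(pe_start * 512 // 2**20)     # 129 -> first extent starts at 129 MiB
print(extent_size * 512 // 2**20)  # 128 -> extents are 128 MiB, as noted
```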