Bug 1768821 - Cannot activate/deactivate storage domain
Summary: Cannot activate/deactivate storage domain
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.30.33
Hardware: x86_64
OS: Linux
high
high
Target Milestone: ovirt-4.4.2
: ---
Assignee: Amit Bawer
QA Contact: Avihai
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-11-05 10:16 UTC by oliver.albl
Modified: 2023-09-14 05:45 UTC
6 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2020-07-01 06:04:47 UTC
oVirt Team: Storage
Embargoed:
sbonazzo: ovirt-4.4?
mtessun: planning_ack+


Attachments
Logfiles (19.07 MB, application/gzip)
2019-11-05 10:16 UTC, oliver.albl
no flags Details
engine.log (4.14 MB, application/gzip)
2019-12-04 05:28 UTC, oliver.albl
no flags Details
vdsm.log from SPM (3.19 MB, application/gzip)
2019-12-04 05:29 UTC, oliver.albl
no flags Details
metadata (96.12 KB, application/x-7z-compressed)
2019-12-05 17:39 UTC, oliver.albl
no flags Details
/var/log/messages snippet from one scan (350.04 KB, text/plain)
2019-12-08 12:31 UTC, oliver.albl
no flags Details
MD2 and LVM backup (114.70 KB, application/x-7z-compressed)
2019-12-09 16:12 UTC, oliver.albl
no flags Details
Engine Log from startup (4.64 MB, text/plain)
2019-12-11 09:16 UTC, oliver.albl
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1553133 0 urgent CLOSED Creating many thin clones corrupts vg metadata 2024-06-13 20:51:20 UTC
Red Hat Bugzilla 1780910 0 unspecified CLOSED Using "Ignore OVF update failure" on maintenance puts SD into Inactive state 2021-02-22 00:41:40 UTC

Description oliver.albl 2019-11-05 10:16:38 UTC
Created attachment 1632916 [details]
Logfiles

Description of problem:
Cannot activate or deactivate storage domain after experiencing very low I/O performance.

Version-Release number of selected component (if applicable):
4.3.6.7 (including CentOS hosts 7.7 1908)

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
Cannot activate or deactivate storage domain

Expected results:


Additional info:
https://lists.ovirt.org/archives/list/users@ovirt.org/thread/XIFBJG4BYAXT4KKRDGYGHBXKD36E2I52/

Comment 1 oliver.albl 2019-11-05 20:33:36 UTC
The storage vendor suspects an HBA/slow-drain device as a possible reason for the slow I/O, but this is not yet confirmed. The storage connection seems to be stable, but we still have the storage domain that we cannot activate or deactivate. As a result, it seems that oVirt engine is trying to rescan the SCSI bus every 5 to 10 minutes, reporting “VM <name> is not responding” from time to time, but without noticeable impact.

During the I/O performance impact I got oVirt alerts like “Storage domains with IDs [8e3b0009-d364-440f-8bb0-69b79f568935, a07a6172-b9a4-4f08-8883-31e345f5d349, a794a775-fd70-4655-b310-9363c88dad0d, ae63f4a0-e66f-4655-9e0f-4fddc0fcf5ea, bf121271-13c3-4710-9fd4-3a7253613846, c8d6d586-b021-4eb6-a170-5018d0c44e86, e86138b0-b15d-4165-8eed-5234f59a2792, ecc71a64-62c1-43f4-bf1f-3bc1b22c7a8a, f2e6539b-85c3-45ba-b0d6-e8bf3f087454, fdfa45ea-ca21-4d41-b7a8-65feebb4f1af, 0d59c7e5-f545-4daa-8490-5ecb3ec942ee, 1cd2cb38-e9a6-4235-896a-8f17de84196d, 4796ca5f-1f3d-47f0-90d0-1cd7df606056, 4c009a43-716f-4936-b2bd-73f2f3de88e5, 6e5cd8b8-fc21-481c-b163-d83bdcd995a7, 7963ab71-9d1b-4079-8806-348c1d1cf309] could not be synchronized. To synchronize them, please move them to maintenance and then activate.”.

Is this something we need to address?

Comment 2 Tal Nisan 2019-11-06 10:54:02 UTC
Not sure; with extremely slow I/O it's quite expected. What do you think, Nir?

Comment 3 oliver.albl 2019-11-13 09:41:36 UTC
Is there a way to get the storage domain active again or to deactivate and remove it?

Comment 4 Nir Soffer 2019-11-13 20:01:37 UTC
(In reply to Tal Nisan from comment #2)
> Not sure, with extremely slow I/O it's quite expected, what do you think Nir?

We need to check the logs and understand why moving to maintenance or activating
fails.

Comment 5 oliver.albl 2019-11-20 05:33:41 UTC
The continuously running scans from 50 hosts are starting to become an impediment under higher load (VMs get paused due to a "no storage space" error, ...). If there is no way to activate/recover the storage domain, I have to remove/destroy it. If there is anything I can do/test/investigate while I have the current setup, please let me know.

Comment 6 Amit Bawer 2019-12-03 22:22:07 UTC
Oliver, please also provide:

1. engine.log and vdsm.log files from an attempt to deactivate/activate the unresponsive storage domain.
2. An engine DB dump:  engine-backup --mode=backup --scope=db

Comment 7 oliver.albl 2019-12-04 05:28:28 UTC
Created attachment 1641879 [details]
engine.log

Comment 8 oliver.albl 2019-12-04 05:29:06 UTC
Created attachment 1641880 [details]
vdsm.log from SPM

Comment 9 oliver.albl 2019-12-04 05:34:06 UTC
Added logfiles to BZ. DB backup (>100 MB) is available here: https://at.cloud.fabasoft.com/folio/public/2xauqvz5mmtcc07v7lsdccfsep. Sent password via E-Mail.

Comment 10 oliver.albl 2019-12-05 08:03:17 UTC
Let me give an update on the current situation:
- I cannot activate the SD
- I cannot take the SD in maintenance mode
- SD has state "Inactive"
- I see "/usr/bin/python2 /usr/libexec/vdsm/fc-scan" running every 5 minutes resulting in additional 10K read IOPS on the array

Comment 11 Amit Bawer 2019-12-05 16:39:47 UTC
Root cause: Checksum error in VG metadata e653f283-d793-4b7e-b8d2-8b5ccb809238, seems related to bug https://bugzilla.redhat.com/1553133

Relevant vgs info:

  /dev/mapper/3600601603cc045003df2775c8c8cddeb: Checksum error at offset 8795972626944
  Couldn't read volume group metadata from /dev/mapper/3600601603cc045003df2775c8c8cddeb.
  Metadata location on /dev/mapper/3600601603cc045003df2775c8c8cddeb at 8795972626944 has invalid summary for VG.
  Failed to read metadata summary from /dev/mapper/3600601603cc045003df2775c8c8cddeb
  Failed to scan VG from /dev/mapper/3600601603cc045003df2775c8c8cddeb
  Volume group "e653f283-d793-4b7e-b8d2-8b5ccb809238" not found
  Cannot process volume group e653f283-d793-4b7e-b8d2-8b5ccb809238

Relevant mpath info:

sdn                                                                                     8:208   0     8T  0 disk  
└─3600601603cc045003df2775c8c8cddeb                                                   253:15    0     8T  0 mpath 
  ├─e653f283--d793--4b7e--b8d2--8b5ccb809238-metadata                                 253:2003  0   512M  0 lvm   
  ├─e653f283--d793--4b7e--b8d2--8b5ccb809238-outbox                                   253:2004  0   128M  0 lvm   
  ├─e653f283--d793--4b7e--b8d2--8b5ccb809238-xleases                                  253:2005  0     1G  0 lvm   
  ├─e653f283--d793--4b7e--b8d2--8b5ccb809238-leases                                   253:2006  0     2G  0 lvm   
  ├─e653f283--d793--4b7e--b8d2--8b5ccb809238-ids                                      253:2007  0   128M  0 lvm   
  ├─e653f283--d793--4b7e--b8d2--8b5ccb809238-inbox                                    253:2008  0   128M  0 lvm   
  ├─e653f283--d793--4b7e--b8d2--8b5ccb809238-master                                   253:2009  0     1G  0 lvm   
  ├─e653f283--d793--4b7e--b8d2--8b5ccb809238-1d9cb5dd--5456--4ddc--ab57--a7396d324b38 253:214   0  47.9G  0 lvm   
  ├─e653f283--d793--4b7e--b8d2--8b5ccb809238-29d8aaa7--9ea5--4601--aef3--30ed2cac892d 253:218   0  62.4G  0 lvm   
  └─e653f283--d793--4b7e--b8d2--8b5ccb809238-146950eb--bb2f--4685--a52f--26892168f512 253:234   0  48.6G  0 lvm   

This causes monitoring to render the storage domain inactive and makes it unavailable for VDSM operations:

2019-11-04 17:22:50,342+0100 INFO  (jsonrpc/2) [vdsm.api] START deleteImage(sdUUID=u'e653f283-d793-4b7e-b8d2-8b5ccb809238', spUUID=u'5849b030-626e-47cb-ad90-3ce782d831b3', imgUUID=u'142d735a-853d-472f-be4c-853aa8da7d9b', postZero=u'false', force=u'false', discard=False) from=::ffff:ipaddress,48328, flow_id=16f798b9, task_id=ab564c4a-e4f8-4b90-b996-01668b446d71 (api:48)
2019-11-04 17:22:51,080+0100 WARN  (jsonrpc/2) [storage.LVM] Reloading VGs failed (vgs=[u'e653f283-d793-4b7e-b8d2-8b5ccb809238'] rc=5 out=[] err=['  /dev/mapper/3600601603cc045003df2775c8c8cddeb: Checksum error at offset 8795972626944', "  Couldn't read volume group metadata from /dev/mapper/3600601603cc045003df2775c8c8cddeb.", '  Metadata location on /dev/mapper/3600601603cc045003df2775c8c8cddeb at 8795972626944 has invalid summary for VG.', '  Failed to read metadata summary from /dev/mapper/3600601603cc045003df2775c8c8cddeb', '  Failed to scan VG from /dev/mapper/3600601603cc045003df2775c8c8cddeb', '  Volume group "e653f283-d793-4b7e-b8d2-8b5ccb809238" not found', '  Cannot process volume group e653f283-d793-4b7e-b8d2-8b5ccb809238']) (lvm:470)
2019-11-04 17:22:51,081+0100 INFO  (jsonrpc/2) [vdsm.api] FINISH deleteImage error=Storage domain does not exist: (u'e653f283-d793-4b7e-b8d2-8b5ccb809238',) from=::ffff:ipaddress,48328, flow_id=16f798b9, task_id=ab564c4a-e4f8-4b90-b996-01668b446d71 (api:52)
2019-11-04 17:22:51,081+0100 ERROR (jsonrpc/2) [storage.TaskManager.Task] (Task='ab564c4a-e4f8-4b90-b996-01668b446d71') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in deleteImage
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1496, in deleteImage
    dom = sdCache.produce(sdUUID=sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce
    domain.getRealDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1822, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1680, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'e653f283-d793-4b7e-b8d2-8b5ccb809238',)

2019-12-04 05:01:08,837+0100 WARN  (monitor/e653f28) [storage.LVM] Reloading VGs failed (vgs=[u'e653f283-d793-4b7e-b8d2-8b5ccb809238'] rc=5 out=[] err=['  /dev/mapper/3600601603cc045003df2775c8c8cddeb: Checksum error at offset 8795972626944', "  Couldn't read volume group metadata from /dev/mapper/3600601603cc045003df2775c8c8cddeb.", '  Metadata location on /dev/mapper/3600601603cc045003df2775c8c8cddeb at 8795972626944 has invalid summary for VG.', '  Failed to read metadata summary from /dev/mapper/3600601603cc045003df2775c8c8cddeb', '  Failed to scan VG from /dev/mapper/3600601603cc045003df2775c8c8cddeb', '  Volume group "e653f283-d793-4b7e-b8d2-8b5ccb809238" not found', '  Cannot process volume group e653f283-d793-4b7e-b8d2-8b5ccb809238']) (lvm:470)
2019-12-04 05:01:08,839+0100 ERROR (monitor/e653f28) [storage.Monitor] Error checking domain e653f283-d793-4b7e-b8d2-8b5ccb809238 (monitor:425)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 406, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 48, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1822, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1680, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'e653f283-d793-4b7e-b8d2-8b5ccb809238',)



David, what steps should be considered here for fixing the corrupted VG metadata, given that the lvm2 version here is 2.02?
Is it possible to utilize lvm2 2.03 to fix the issue here?


Oliver, could you please provide:

1. The first 128 MB of the bad VG metadata:

dd if=/dev/mapper/3600601603cc045003df2775c8c8cddeb of=md1 bs=128M count=1 skip=4096 iflag=skip_bytes
gzip md1 

2. The WARN messages grepped from all hosts' vdsm logs, so we can look up the LVM commands.
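
For item 2, a hypothetical helper run on each host (assumes gzip-rotated logs; use xzgrep instead if the rotated files end in .xz):

zgrep -h "WARN" /var/log/vdsm/vdsm.log* > /tmp/$(hostname)-vdsm-warn.log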


- The issue with switching the SD into maintenance is being checked.


- fc-scan is part of the HBA monitoring and is executed periodically, once every 300 seconds.
  See the repo_stats_cache_refresh_timeout option in /usr/share/doc/vdsm-*/vdsm.conf.sample.
  Defaults are overridden in /etc/vdsm/vdsm.conf.
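  As an illustration (assuming the option lives in the [irs] section, as in the sample config), stretching the interval to 600 seconds would look like:

  # /etc/vdsm/vdsm.conf
  [irs]
  repo_stats_cache_refresh_timeout = 600

  followed by a restart of vdsmd on the host.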

Comment 12 David Teigland 2019-12-05 17:12:58 UTC
If you only have lvm 2.02, you are left with the old method which involves getting the latest VG metadata backup file and running pvcreate with --uuid and --restorefile, followed by vgcfgrestore.  (You might want to dd the metadata to some test devices to test the pvcreate/vgcfgrestore commands first.)

With a recent 2.03 version, you would have the vgck --updatemetadata command which would likely repair this by finding and using a good copy of metadata from another metadata area.  (It might be possible to compile a recent lvm version for this system and run vgck from that version of lvm without installing it.  libdevicemapper incompatibilities might make that troublesome.)

If vgck --updatemetadata cannot repair it, then a very recent upstream version of lvm includes the pvck --repair command which could.
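
A rough sketch of the old-method recovery for this particular VG (device path and VG name are from this bug; the restore file should be whichever /etc/lvm/backup copy turns out to be newest, the PV UUID placeholder must be taken from that file, and pvcreate may additionally need force flags if the device still carries a PV label). Run with --test first:

RESTORE=/etc/lvm/backup/e653f283-d793-4b7e-b8d2-8b5ccb809238   # newest backup copy across hosts
DEV=/dev/mapper/3600601603cc045003df2775c8c8cddeb
# dry run: recreate the PV using the UUID recorded in the backup, then restore the VG metadata
pvcreate --test --uuid <PV-UUID-from-backup> --restorefile "$RESTORE" "$DEV"
vgcfgrestore --test --file "$RESTORE" e653f283-d793-4b7e-b8d2-8b5ccb809238
# repeat both commands without --test once the dry run looks clean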

Comment 13 oliver.albl 2019-12-05 17:39:20 UTC
Created attachment 1642445 [details]
metadata

Attached metadata from requested SD. Used 7zip to compress because gzip was over 24 MB. Please let me know if you need another format.

Comment 14 oliver.albl 2019-12-05 17:43:44 UTC
(In reply to Amit Bawer from comment #11)
> - fc-scan is part of the HBA monitoring and is executed periodically, once every 300 seconds.
>   See the repo_stats_cache_refresh_timeout option in /usr/share/doc/vdsm-*/vdsm.conf.sample.
>   Defaults are overridden in /etc/vdsm/vdsm.conf.

Thanks for explaining! I did not see HBA related entries in /var/log/messages or increased read IOPS when fc-scan is running with all SDs active. I guess the inactive SD causes this fc-scan behavior?

Comment 15 Amit Bawer 2019-12-08 12:01:19 UTC
Per the logging question from comment #14, it is likely you are experiencing timeout exceptions in periodic command executions for inactive SDs. Could you provide a snippet?

Comment 16 oliver.albl 2019-12-08 12:31:01 UTC
Created attachment 1643010 [details]
/var/log/messages snippet from one scan

Comment 17 Amit Bawer 2019-12-09 07:25:55 UTC
(In reply to oliver.albl from comment #16)
> Created attachment 1643010 [details]
> /var/log/messages snippet from one scan

Those driver messages are of no special significance [1]

[1] https://access.redhat.com/solutions/3878941

Comment 18 oliver.albl 2019-12-09 07:58:32 UTC
The messages themselves are not, but it seems the additional 10K read IOPS on the array are related to them (occasionally resulting in "vm has been paused due to no storage space" events). I think the best way to proceed would be to get the SD into maintenance mode or active again. This should address both the driver messages and the IOPS.

Comment 19 Amit Bawer 2019-12-09 15:24:05 UTC
(In reply to oliver.albl from comment #18)
> The messages itself not but it seems the additional 10K read IOPS on the
> array are related to then (occasionally resulting in "vm has been paused due
> to no storage space" events). I think the best way to proceed would be to
> get the SD in maintenance mode or active again. This should address driver
> messages and IOPS.

- To put the SD into maintenance, https://bugzilla.redhat.com/1780910 should be addressed first.
- To activate the SD we should fix the VG issue, following one of the courses advised in comment #12.
  We should first remove the SD from monitoring so that no locking_type=1 LVM commands run while we are trying to fix the VG.

Looking at the dumped VG md provided, it seems riddled with a recurring foreign UUID "e653f283-d793-4b7e-b8d2-8b5ccb809238" at unexpected places.

Maybe we need to gather the dumped VG MD a little differently to rule out the possibility that this is the corruption itself:

  dd if=/dev/mapper/3600601603cc045003df2775c8c8cddeb of=md2 bs=1M count=129 iflag=direct conv=fsync

and then compress the md2 file.

If possible also share the last backup of this VG metadata from /etc/lvm/backup
similarly to comment #3 in https://bugzilla.redhat.com/1553133

Comment 20 oliver.albl 2019-12-09 16:12:36 UTC
Created attachment 1643345 [details]
MD2 and LVM backup

Got LVM backups on 50 hosts within 10 minutes; attached the last one.

Comment 21 Amit Bawer 2019-12-10 15:01:28 UTC
Thanks for gathering the dumps Oliver

Checking md2:

No repeating seq numbers here, but we have almost 500 records of VG activity

$ grep -a "seqno" md2 | wc -l 
495
$ grep -a "seqno" md2 |uniq | wc -l 
495


Averaging ~20 LVM operations per minute:

...
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:00 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:01 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:02 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:03 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:03 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:04 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:04 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:06 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:07 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:07 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:08 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:08 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:09 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:11 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:11 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:11 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:12 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:13 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:13 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:13 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:14 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:15 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:15 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:17 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:17 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:18 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:19 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:20 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:22 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:22 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:23 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:23 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:24 2019
# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 15:01:25 2019
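
For reference, a hypothetical way to reproduce that per-minute rate from the dump (assuming the timestamp is the next-to-last whitespace-separated field of each "Generated by" line, as in the excerpt above):

grep -a "Generated by LVM2" md2 | awk '{split($(NF-1), t, ":"); print t[1]":"t[2]}' | sort | uniq -c | sort -rn | head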

Looking for inconsistencies, we find a foreign VG e653f283-d793-4b7e-b8d2-8b5ccb809238 section starting in the middle
of the fabavmhost082 record:

# Generated by LVM2 version 2.02.185(2)-RHEL7 (2019-05-13): Mon Nov  4 14:33:19 2019

contents = "Text Format Volume Group"
version = 1

description = ""

creation_host = "fabavmhost082" # Linux fabavmhost082 3.10.0-1062.1.2.el7.x86_64 #1 SMP Mon Sep 30 14:19:46 UTC 2019 x86_64
creation_time = 1572874399  # Mon Nov  4 14:33:19 2019

^@striped"
stripe_count = 1

stripes = [
"pv0", 62252
]
}
segment2 {
start_ee653f283-d793-4b7e-b8d2-8b5ccb809238 {
id = "T4CnQi-vMAb-sHp5-VMkq-my7j-BNwZ-Ik1M0X"
seqno = 1584381
format = "lvm2"
status = ["RESIZEABLE", "READ", "WRITE"]
flags = []
tags = ["MDT_LEASETIMESEC=60", "MDT_DESCRIPTION=FABAVMHOST_LUN_219", "MDT_PV0=pv:3600601603cc045003df2775c8c8cddeb&44&uuid:AgCLHY-mSSg-RtNx-L7yO-AkJn-sqpk-YKkv0L&44&pestart:0&44&pecount:65533&44&mapoffset:0", "MDT_LOCKPOLICY=", "MDT_VGUUID=T4CnQi-vMAb-sHp5-VMkq-my7j-BNwZ-Ik1M0X", "MDT_LEASERETRIES=3", "MDT_IOOPTIMEOUTSEC=10", "MDT_LOCKRENEWALINTERVALSEC=5", "MDT_ROLE=Regular", "MDT_SDUUID=e653f283-d793-4b7e-b8d2-8b5ccb809238", "MDT_CLASS=Data", "MDT_TYPE=FCP", "RHAT_storage_domain", "MDT_POOL_UUID=5849b030-626e-47cb-ad90-3ce782d831b3", "MDT__SHA_CKSUM=9b7fd4ca261758224047a0c0b61a58d1e21031b4", "MDT_VERSION=5", "MDT_BLOCK_SIZE=512", "MDT_ALIGNMENT=1048576"]
extent_size = 262144
max_lv = 0
max_pv = 0
metadata_copies = 0


The "fabavmhost082" host record above is cut by the "e653f283-d793-4b7e-b8d2-8b5ccb809238 {" section at offset 85088768, which is a 512 block size multiple.

Oliver, we'll need the WARN log lines from your env to try and locate the LVM command attempting to change the VG metadata.

Comment 22 Amit Bawer 2019-12-10 16:48:23 UTC
Hi Oliver,

To manually put the inactive SD into maintenance status, you'll have to manipulate the DB entry for the domain,
and change its status to '6' (Maintenance).

In the example below the problematic storage_id is 505bfaad-bdaf-45fa-a586-a842fdab75fd:


[root@localhost ~]# su - postgres
-bash-4.2$ LD_LIBRARY_PATH=/opt/rh/rh-postgresql10/root/lib64/   /opt/rh/rh-postgresql10/root/usr/bin/psql engine 
psql (10.6)
Type "help" for help.

engine=# select * from storage_pool_iso_map;
              storage_id              |           storage_pool_id            | status 
--------------------------------------+--------------------------------------+--------
 505bfaad-bdaf-45fa-a586-a842fdab75fd | 293333d8-1753-11ea-906e-525400a1903a |      3
 49b02aa4-6ad8-4685-ae25-4bb9bf85547d | 293333d8-1753-11ea-906e-525400a1903a |      3
(2 rows)


engine=# update storage_pool_iso_map set status=6 where storage_id='49b02aa4-6ad8-4685-ae25-4bb9bf85547d';
UPDATE 1
engine=# 

engine=# select * from storage_pool_iso_map;
              storage_id              |           storage_pool_id            | status 
--------------------------------------+--------------------------------------+--------
 49b02aa4-6ad8-4685-ae25-4bb9bf85547d | 293333d8-1753-11ea-906e-525400a1903a |      3
 505bfaad-bdaf-45fa-a586-a842fdab75fd | 293333d8-1753-11ea-906e-525400a1903a |      6
(2 rows)

engine=# 


Once you see the status turn to "maintenance" for the broken SD, you can fix the VG metadata.

Comment 23 Amit Bawer 2019-12-10 16:50:24 UTC
(In reply to Amit Bawer from comment #22)
> engine=# update storage_pool_iso_map set status=6 where storage_id='49b02aa4-6ad8-4685-ae25-4bb9bf85547d';

Correction, the update statement should target the problematic storage_id from the example:

engine=# update storage_pool_iso_map set status=6 where storage_id='505bfaad-bdaf-45fa-a586-a842fdab75fd';

Comment 24 oliver.albl 2019-12-10 17:52:57 UTC
Log rotation keeps the last 100 VDSM logs, which only go back to December 1st (while the incident occurred on November 4th). The available VDSM logs for November 4th would be in the "Logfiles" attachment from November 5th in this BZ.

Comment 25 oliver.albl 2019-12-10 17:54:21 UTC
(In reply to Amit Bawer from comment #23)

Can I update the database while the engine is running?

Comment 26 Amit Bawer 2019-12-10 20:05:28 UTC
(In reply to oliver.albl from comment #25)
> Can I update the database while the engine is running?

This is better done when the engine is down.

Comment 27 oliver.albl 2019-12-11 05:31:17 UTC
After performing the following steps, SD is set to active before becoming inactive again:

stop ovirt-engine

engine=# select * from storage_pool_iso_map where storage_id = 'e653f283-d793-4b7e-b8d2-8b5ccb809238';
              storage_id              |           storage_pool_id            | status
--------------------------------------+--------------------------------------+--------
 e653f283-d793-4b7e-b8d2-8b5ccb809238 | 5849b030-626e-47cb-ad90-3ce782d831b3 |      4
(1 row)

engine=# update storage_pool_iso_map set status=6 where storage_id='e653f283-d793-4b7e-b8d2-8b5ccb809238';
UPDATE 1

engine=# select * from storage_pool_iso_map where storage_id = 'e653f283-d793-4b7e-b8d2-8b5ccb809238';
              storage_id              |           storage_pool_id            | status
--------------------------------------+--------------------------------------+--------
 e653f283-d793-4b7e-b8d2-8b5ccb809238 | 5849b030-626e-47cb-ad90-3ce782d831b3 |      6
(1 row)

start ovirt-engine

engine=# select * from storage_pool_iso_map where storage_id = 'e653f283-d793-4b7e-b8d2-8b5ccb809238';
              storage_id              |           storage_pool_id            | status
--------------------------------------+--------------------------------------+--------
 e653f283-d793-4b7e-b8d2-8b5ccb809238 | 5849b030-626e-47cb-ad90-3ce782d831b3 |      3
(1 row)

wait a couple of minutes

engine=# select * from storage_pool_iso_map where storage_id = 'e653f283-d793-4b7e-b8d2-8b5ccb809238';
              storage_id              |           storage_pool_id            | status
--------------------------------------+--------------------------------------+--------
 e653f283-d793-4b7e-b8d2-8b5ccb809238 | 5849b030-626e-47cb-ad90-3ce782d831b3 |      4

Comment 28 Amit Bawer 2019-12-11 09:03:44 UTC
(In reply to oliver.albl from comment #27)
> After performing the following steps, SD is set to active before becoming inactive again:

Could you provide your engine.log from the startup?

Comment 29 oliver.albl 2019-12-11 09:16:51 UTC
Created attachment 1643822 [details]
Engine Log from startup

Startup at 06:22:42

Comment 30 Amit Bawer 2019-12-17 11:43:27 UTC
Hi Oliver,

Please do the following:

1. Put the SD into maintenance with "Ignore OVF Update". This is not expected to succeed, but it will notify the hosts that the SD should be inactive.
2. Right after the operation in 1 is done, change the SD status in the DB without restarting the engine (per comment #23):

engine=# update storage_pool_iso_map set status=6 where storage_id='YOUR FAILING SD UID'; 

This should put the SD into maintenance status before the engine auto-recovery attempts to make it active and determines it to be inactive again.

Comment 31 oliver.albl 2019-12-17 11:56:02 UTC
Hi Amit,

just to make sure I got this right: "Right after the operation in 1" means after the SD is set to inactive again, correct?

Comment 32 Amit Bawer 2019-12-17 13:21:26 UTC
(In reply to oliver.albl from comment #31)
> Hi Amit,
> 
> just to make sure I got this right: "Right after the operation in 1" means,
> that after the SD is set to inactive again, correct?

Let the engine events log first confirm that the operation of step 1 is done; after that, change the SD status in the DB.
The recoveries and the reports are periodic processes, so you might need to retry steps 1-2 a couple of times to make it work.

Comment 33 oliver.albl 2019-12-17 13:41:23 UTC
Looking at the last try, there is no confirmation of the operation in the engine's event log, just errors:
   2019-12-04 05:55:50.844+01    |        2 | VDSM command SetVolumeDescriptionVDS failed: Could not acquire resource. Probably resource factory threw an exception.: ()
   2019-12-04 05:55:50.88+01     |        1 | Failed to update OVF disks 77c64b39-fe50-4d05-b77f-8131ad1f95f9, OVF data isn't updated on those OVF stores (Data Center DC, Storage Domain VMHOST_LUN_219).
   2019-12-04 05:55:50.887+01    |        1 | Failed to update VMs/Templates OVF data for Storage Domain VMHOST_LUN_219 in Data Center DC.
   2019-12-04 05:55:53.09+01     |        2 | Failed to deactivate Storage Domain VMHOST_LUN_219 (Data Center DC).
   2019-12-04 06:00:59.721+01    |        1 | Storage Domain VMHOST_LUN_219 (Data Center DC) was deactivated by system because it's not visible by any of the hosts.

So I would proceed with:
1. Put the SD into maintenance with "Ignore OVF Update". This is not expected to succeed, but it will notify the hosts that the SD should be inactive.

2. Check the engine's events log and wait for the "Failed to deactivate Storage Domain VMHOST_LUN_219 (Data Center DC)" event:
   2019-12-04 05:55:50.844+01    |        2 | VDSM command SetVolumeDescriptionVDS failed: Could not acquire resource. Probably resource factory threw an exception.: ()
   2019-12-04 05:55:50.88+01     |        1 | Failed to update OVF disks 77c64b39-fe50-4d05-b77f-8131ad1f95f9, OVF data isn't updated on those OVF stores (Data Center DC, Storage Domain VMHOST_LUN_219).
   2019-12-04 05:55:50.887+01    |        1 | Failed to update VMs/Templates OVF data for Storage Domain VMHOST_LUN_219 in Data Center DC.
==>2019-12-04 05:55:53.09+01     |        2 | Failed to deactivate Storage Domain VMHOST_LUN_219 (Data Center DC).

3. Change the SD status in the DB without restarting the engine. This should be before the event:
   2019-12-04 06:00:59.721+01    |        1 | Storage Domain VMHOST_LUN_219 (Data Center DC) was deactivated by system because it's not visible by any of the hosts.

Comment 34 Amit Bawer 2019-12-17 14:01:11 UTC
(In reply to oliver.albl from comment #33)

In my reproduction attempt I've used the DB hack after (3) happened, but you could try this workaround as well.

Comment 35 oliver.albl 2019-12-17 15:14:06 UTC
Looks good. SD is in maintenance and I did not see any peak IOPS in the last 20 minutes. Will this state survive the next engine restart?

Comment 36 Amit Bawer 2019-12-17 16:13:01 UTC
(In reply to oliver.albl from comment #35)
> Looks good. SD is in maintenance and I did not see any peak IOPS in the last
> 20 minutes. Will this state survive the next engine restart?

It should, but the next step should be fixing the VG corruption for this SD per David's instructions in comment #12.

Comment 37 Amit Bawer 2019-12-30 09:50:16 UTC
Hi Oliver,

Were you able to recover the storage domain?

Comment 38 oliver.albl 2020-01-03 15:16:15 UTC
Hi Amit,

  not yet. Planned for next week.

Comment 39 oliver.albl 2020-01-03 15:33:09 UTC
Just to make sure - I will run the following commands on SPM:

VG_NAME="ee653f283-d793-4b7e-b8d2-8b5ccb809238"
VG_UUID="T4CnQi-vMAb-sHp5-VMkq-my7j-BNwZ-Ik1M0X"
LVM_BACKUP=~/most-recent-lvm-backup.conf

pvcreate $VG_NAME --uuid $VG_UUID --restorefile $LVM_BACKUP --verbose --test

vgcfgrestore --file $LVM_BACKUP $VG_NAME --verbose --test

Comment 40 Amit Bawer 2020-01-05 09:14:38 UTC
(In reply to oliver.albl from comment #39)
> Just to make sure - I will run the following commands on SPM:
> 
> VG_NAME="ee653f283-d793-4b7e-b8d2-8b5ccb809238"
> VG_UUID="T4CnQi-vMAb-sHp5-VMkq-my7j-BNwZ-Ik1M0X"
> LVM_BACKUP=~/most-recent-lvm-backup.conf

Just to clarify, the most up-to-date backup file isn't necessarily on your current SPM; it should be on the last host
which successfully modified the VG before it became broken.

> 
> pvcreate $VG_NAME --uuid $VG_UUID --restorefile $LVM_BACKUP --verbose --test
> 
> vgcfgrestore --file $LVM_BACKUP $VG_NAME --verbose --test
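
A quick, illustrative check to run on each host to see which one holds the newest copy (assumes the default LVM backup location):

ls -l --time-style=full-iso /etc/lvm/backup/e653f283-d793-4b7e-b8d2-8b5ccb809238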

Comment 41 oliver.albl 2020-01-05 10:01:27 UTC
I will take the most up-to-date backup file from the last host which successfully modified the VG before it became broken.

Can you confirm that these commands, executed on the SPM, are the correct way to proceed:

VG_NAME="ee653f283-d793-4b7e-b8d2-8b5ccb809238"
VG_UUID="T4CnQi-vMAb-sHp5-VMkq-my7j-BNwZ-Ik1M0X"
LVM_BACKUP=~/most-recent-lvm-backup.conf

pvcreate $VG_NAME --uuid $VG_UUID --restorefile $LVM_BACKUP --verbose --test

vgcfgrestore --file $LVM_BACKUP $VG_NAME --verbose --test

Activate storage domain

Comment 42 Amit Bawer 2020-01-05 10:46:45 UTC
(In reply to oliver.albl from comment #41)
> Can you confirm that these commands, executed on the SPM, are the correct way to proceed:
> [...]
> pvcreate $VG_NAME --uuid $VG_UUID --restorefile $LVM_BACKUP --verbose --test
> vgcfgrestore --file $LVM_BACKUP $VG_NAME --verbose --test
> Activate storage domain

Using the --test flag causes a dry run, which is fine for verifying that everything should work. It will not perform the actual recovery of the VG; to do that, you'll have to drop the --test flag.

Adding David in case there are further comments.

Comment 43 Amit Bawer 2020-01-13 12:53:08 UTC
Hi Oliver,

No input from David so far; I think you can carry on with the plan. Thanks

Comment 44 Amit Bawer 2020-06-30 13:17:06 UTC
No feedback from the user so far.
The root cause is fixed in bug 1553133.
IMO this can be closed.

Comment 45 oliver.albl 2020-07-01 04:55:29 UTC
We resolved the issue and in the meantime were able to remove the storage domain. Closing is okay. Thank you for your help!

Comment 46 Red Hat Bugzilla 2023-09-14 05:45:30 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

