Bug 2244641 - During the storage domain import, the engine will fail to find OVF_STORE if there is also a ConnectStoragePoolVDSCommand request
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.5.3
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.3-async
Target Release: ---
Assignee: Benny Zlotnik
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-10-17 13:46 UTC by nijin ashok
Modified: 2024-01-31 14:00 UTC
CC: 3 users

Fixed In Version: ovirt-engine-4.5.3.10
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-01-31 14:00:24 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github oVirt ovirt-engine pull 893 0 None open core: Fail when getVolumeInfo fails during OVF read 2023-11-27 10:04:36 UTC
Github oVirt ovirt-engine pull 903 0 None Merged backport: core: Fail when getVolumeInfo fails during OVF read 2024-01-21 15:47:11 UTC

Description nijin ashok 2023-10-17 13:46:47 UTC
Description of problem:

Storage domain d1f38c8f was imported and attached to the DC:

~~~
2023-10-17 13:16:44,739Z INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (EE-ManagedThreadFactory-engine-Thread-101987) [2a91458c] START, ActivateStorageDomainVDSCommand( ActivateStorageDomainVDSCommandParameters:{storagePoolId='4a9a3278-0134-11ee-9af7-525400acea40', ignoreFailoverLimit='false', storageDomainId='d1f38c8f-2308-469b-b6e0-1e511da7c1ba'}), log id: 13575ec8
~~~

In between, the SPM changed and a `ConnectStoragePoolVDSCommand` was initiated. The domain map also reports this domain as `d1f38c8f-2308-469b-b6e0-1e511da7c1ba=active`, although it is not yet active:

~~~
2023-10-17 13:16:47,693Z INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engine-Thread-102008) [2a91458c] START, ConnectStoragePoolVDSCommand(HostName = dhcp131-94.gsslab.pnq2.redhat.com, ConnectStoragePoolVDSCommandParameters:{hostId='8b08f230-698c-4691-b8cc-8698c8a063a6', vdsId='8b08f230-698c-4691-b8cc-8698c8a063a6', storagePoolId='4a9a3278-0134-11ee-9af7-525400acea40', masterVersion='1'}), log id: 2844965b

2023-10-17 13:16:47,693Z INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engine-Thread-102008) [2a91458c] Executing with domain map: {845d23c1-92cb-4ebb-8dcb-7f1a1c9252c8=active, b05f55d1-8219-4539-b7d0-b2ee2e7249ff=active, d1f38c8f-2308-469b-b6e0-1e511da7c1ba=active}
~~~

So VDSM started monitoring this domain:

~~~
2023-10-17 13:16:09,408+0000 INFO  (jsonrpc/1) [vdsm.api] START connectStoragePool(spUUID='4a9a3278-0134-11ee-9af7-525400acea40', hostID=2, msdUUID='b05f55d1-8219-4539-b7d0-b2ee2e7249ff', masterVersion=1, domainsMap={'845d23c1-92cb-4ebb-8dcb-7f1a1c9252c8': 'active', 'b05f55d1-8219-4539-b7d0-b2ee2e7249ff': 'active', 'd1f38c8f-2308-469b-b6e0-1e511da7c1ba': 'active'}) from=::ffff:10.74.130.139,47104, flow_id=3dab8248, task_id=bfc517ec-eaae-4570-b324-7899e56e1f30 (api:48)

2023-10-17 13:16:11,761+0000 INFO  (jsonrpc/1) [storage.monitor] Start monitoring d1f38c8f-2308-469b-b6e0-1e511da7c1ba (monitor:193)

2023-10-17 13:16:17,257+0000 INFO  (periodic/1) [vdsm.api] FINISH repoStats return.... 'd1f38c8f-2308-469b-b6e0-1e511da7c1ba': {'code': 0, 'lastCheck': '4.7', 'delay': '0.000396777', 'valid': True, 'version': 5, 'acquired': False, 'actual': True}} from=internal, task_id=cf2a7d54-71de-401a-b50b-8048aed97e06 (api:54)
~~~

Then `attachStorageDomain` started, which moved this domain's status to "Attached":

~~~
2023-10-17 13:16:13,250+0000 INFO  (jsonrpc/0) [vdsm.api] START attachStorageDomain(sdUUID='d1f38c8f-2308-469b-b6e0-1e511da7c1ba', spUUID='4a9a3278-0134-11ee-9af7-525400acea40') from=::ffff:10.74.130.139,47282, flow_id=11d78e7d-1b58-4cb4-81a1-f4b62259754d, task_id=7d72421a-4c93-44cb-b54a-22bc1d3e58b4 (api:48)

2023-10-17 13:16:34,716+0000 INFO  (jsonrpc/0) [storage.storagepoolmemorybackend] new storage pool master version 1 and domains map {'845d23c1-92cb-4ebb-8dcb-7f1a1c9252c8': 'Active', 'b05f55d1-8219-4539-b7d0-b2ee2e7249ff': 'Active', 'd1f38c8f-2308-469b-b6e0-1e511da7c1ba': 'Attached'} (spbackends:434)
~~~

At the end of attachStorageDomain, VDSM runs updateMonitoringThreads [1], which computes monitoredDomains - activeDomains and stops the monitor for this domain, since it is now in "Attached" status:

~~~
2023-10-17 13:16:34,730+0000 INFO  (jsonrpc/0) [storage.monitor] Stop monitoring d1f38c8f-2308-469b-b6e0-1e511da7c1ba (shutdown=False) (monitor:268)
2023-10-17 13:16:34,738+0000 INFO  (monitor/d1f38c8) [storage.check] Stop checking '/dev/d1f38c8f-2308-469b-b6e0-1e511da7c1ba/metadata' (check:135)
2023-10-17 13:16:37,741+0000 INFO  (monitor/d1f38c8) [storage.blocksd] Tearing down domain d1f38c8f-2308-469b-b6e0-1e511da7c1ba (blockSD:997)
~~~

Stopping the monitor also tears down the domain, which deactivates all LVs in that VG. The getVolumeInfo calls issued after attachStorageDomain therefore all fail, because the metadata LV is no longer active.
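The monitoredDomains - activeDomains step that triggers the teardown can be sketched as a simple set difference (a minimal illustrative sketch; function and variable names here are assumptions, not VDSM's actual code):

```python
def update_monitoring(monitored_domains, domains_map):
    """Return (to_stop, to_start) given the pool's current domain map.

    Monitoring is stopped for every domain that is currently monitored
    but no longer "active" in the domain map -- which is why a domain
    that just moved to "Attached" loses its monitor (and gets torn
    down, deactivating its LVs) in the middle of the import flow.
    """
    active = {sd for sd, status in domains_map.items()
              if status.lower() == "active"}
    to_stop = set(monitored_domains) - active   # e.g. the "Attached" domain
    to_start = active - set(monitored_domains)
    return to_stop, to_start
```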

~~~
2023-10-17 13:16:38,150+0000 INFO  (jsonrpc/6) [vdsm.api] START getVolumeInfo(sdUUID='d1f38c8f-2308-469b-b6e0-1e511da7c1ba', spUUID='4a9a3278-0134-11ee-9af7-525400acea40', imgUUID='2391895b-6797-483e-a807-36fe0c513f9d', volUUID='3060d6b6-445a-4b4a-9dec-e8632216e9f7') from=::ffff:10.74.130.139,47104, flow_id=11d78e7d-1b58-4cb4-81a1-f4b62259754d, task_id=d227efac-17c8-4df1-8a0f-b4f728e75bdd (api:48)

2023-10-17 13:16:38,160+0000 ERROR (jsonrpc/6) [storage.volumemanifest] Internal block device read failure: 'name=/dev/d1f38c8f-2308-469b-b6e0-1e511da7c1ba/metadata, offset=1064960, size=512' (blockVolume:102)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockVolume.py", line 100, in getMetadata
    lines = sd.read_metadata_block(slot).splitlines()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockSD.py", line 928, in read_metadata_block
    sc.METADATA_SIZE)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/misc.py", line 101, in readblock
    raise se.MiscBlockReadException(name, offset, size)
vdsm.storage.exception.MiscBlockReadException: Internal block device read failure: 'name=/dev/d1f38c8f-2308-469b-b6e0-1e511da7c1ba/metadata, offset=1064960, size=512'

2023-10-17 13:16:38,255+0000 INFO  (jsonrpc/0) [storage.volumemanifest] Info request: sdUUID=d1f38c8f-2308-469b-b6e0-1e511da7c1ba imgUUID=f16de09d-81ac-4f4e-9d6f-d8c9c4f5dde3 volUUID = b679c6e0-4afb-43d9-a41e-ea1b7e2a28fb  (volume:243)
2023-10-17 13:16:38,266+0000 ERROR (jsonrpc/0) [storage.volumemanifest] Internal block device read failure: 'name=/dev/d1f38c8f-2308-469b-b6e0-1e511da7c1ba/metadata, offset=1097728, size=512' (blockVolume:102)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockVolume.py", line 100, in getMetadata
    lines = sd.read_metadata_block(slot).splitlines()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockSD.py", line 928, in read_metadata_block
    sc.METADATA_SIZE)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/misc.py", line 101, in readblock
    raise se.MiscBlockReadException(name, offset, size)
vdsm.storage.exception.MiscBlockReadException: Internal block device read failure: 'name=/dev/d1f38c8f-2308-469b-b6e0-1e511da7c1ba/metadata, offset=1097728, size=512'


Engine logs:

2023-10-17 13:16:38,139Z INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedThreadFactory-engine-Thread-101987) [11d78e7d-1b58-4cb4-81a1-f4b62259754d] START, GetVolumeInfoVDSCommand(HostName = rhelh, GetVolumeInfoVDSCommandParameters:{hostId='964891f5-cb62-431b-bcbd-2903c94c2f79', storagePoolId='4a9a3278-0134-11ee-9af7-525400acea40', storageDomainId='d1f38c8f-2308-469b-b6e0-1e511da7c1ba', imageGroupId='2391895b-6797-483e-a807-36fe0c513f9d', imageId='3060d6b6-445a-4b4a-9dec-e8632216e9f7'}), log id: 68b7362c
2023-10-17 13:16:38,155Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedThreadFactory-engine-Thread-101987) [11d78e7d-1b58-4cb4-81a1-f4b62259754d] Failed building DiskImage: candidate can not be null please use static method createGuidFromString

2023-10-17 13:16:38,243Z INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedThreadFactory-engine-Thread-101987) [11d78e7d-1b58-4cb4-81a1-f4b62259754d] START, GetVolumeInfoVDSCommand(HostName = rhelh, GetVolumeInfoVDSCommandParameters:{hostId='964891f5-cb62-431b-bcbd-2903c94c2f79', storagePoolId='4a9a3278-0134-11ee-9af7-525400acea40', storageDomainId='d1f38c8f-2308-469b-b6e0-1e511da7c1ba', imageGroupId='f16de09d-81ac-4f4e-9d6f-d8c9c4f5dde3', imageId='b679c6e0-4afb-43d9-a41e-ea1b7e2a28fb'}), log id: 446fb053
2023-10-17 13:16:38,260Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedThreadFactory-engine-Thread-101987) [11d78e7d-1b58-4cb4-81a1-f4b62259754d] Failed building DiskImage: candidate can not be null please use static method createGuidFromString
2023-10-17 13:16:38,260Z INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedThreadFactory-engine-Thread-101987) [11d78e7d-1b58-4cb4-81a1-f4b62259754d] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand' return value '
VolumeInfoReturn:{status='Status [code=0, message=Done]'}
~~~

Since all GetVolumeInfo calls failed, the engine is unable to find the OVF_STORE:

~~~
2023-10-17 13:16:39,299Z WARN  [org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand] (EE-ManagedThreadFactory-engine-Thread-101987) [] There are no OVF_STORE disks on storage domain id d1f38c8f-2308-469b-b6e0-1e511da7c1ba
~~~

As a result, the engine cannot read the VM information from the OVF_STORE.



Version-Release number of selected component (if applicable):

rhvm-4.5.3.8-2.el8ev.noarch

How reproducible:

100%

Steps to Reproduce:

Attach a storage domain to the DC and immediately after it starts to attach, switch the SPM to another host.

Actual results:

During the storage domain import, the engine will fail to find OVF_STORE if there is also a ConnectStoragePoolVDSCommand request.

Expected results:


Additional info:


[1] https://github.com/oVirt/vdsm/blob/1938e3cb68bd658c53b042335fc830414fc238bd/lib/vdsm/storage/sp.py#L1054
    https://github.com/oVirt/vdsm/blob/1938e3cb68bd658c53b042335fc830414fc238bd/lib/vdsm/storage/sp.py#L1575

Comment 3 Arik 2023-10-18 07:38:08 UTC
Benny, please take a look at this one

Comment 4 Benny Zlotnik 2023-11-26 09:53:22 UTC
I think the problem is in [1]:

        List<Disk> unregisteredDisks = new ArrayList<>();
        for (Guid unregisteredDiskId : imagesList) {
            GetUnregisteredDiskQueryParameters unregQueryParams = new GetUnregisteredDiskQueryParameters(
                    unregisteredDiskId, getStorageDomainId(), getStoragePoolId());
            QueryReturnValue unregQueryReturn = runInternalQuery(QueryType.GetUnregisteredDisk,
                    unregQueryParams);
            if (unregQueryReturn.getSucceeded()) {
                unregisteredDisks.add(unregQueryReturn.getReturnValue());
            } else {
                log.debug("Could not get populated disk: {}", unregQueryReturn.getExceptionString());
            }
        }
        getQueryReturnValue().setReturnValue(unregisteredDisks);


ovirt-engine successfully got the imagesList from VDSM, but the subsequent GetUnregisteredDisk queries (which eventually call getVolumeInfo) failed because the LVs were deactivated.
So ultimately unregisteredDisks remained empty, because none of the queries succeeded, and the empty list was passed along, leading to this log message:
2023-10-17 13:16:39,299Z WARN  [org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand] (EE-ManagedThreadFactory-engine-Thread-101987) [] There are no OVF_STORE disks on storage domain id d1f38c8f-2308-469b-b6e0-1e511da7c1ba


I think this is conceptually similar to bug 2126602, and we need to fail the SD attachment operation and have it retried manually later


[1] https://github.com/oVirt/ovirt-engine/blob/d48b5f123e2d5980d937dba426373b508445e715/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/storage/disk/image/GetUnregisteredDisksQuery.java#L81
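The direction of the fix (per the linked PR title, "Fail when getVolumeInfo fails during OVF read") can be sketched as follows. This is a hypothetical simplification in Python for brevity; the real change is in the Java GetUnregisteredDisksQuery, and the names below (VolumeInfoError, get_unregistered_disks) are illustrative, not the actual identifiers:

```python
class VolumeInfoError(Exception):
    """Raised when a disk's volume info cannot be read."""


def get_unregistered_disks(image_ids, query_disk):
    """Collect unregistered disks, failing loudly on query errors.

    query_disk(image_id) returns (succeeded, disk). The old behavior
    logged a debug message and skipped failed queries, so when every
    getVolumeInfo call failed the result was an empty list and the
    attach flow wrongly concluded there were no OVF_STORE disks.
    The fix propagates the failure so the attachment can be retried.
    """
    disks = []
    for image_id in image_ids:
        succeeded, disk = query_disk(image_id)
        if not succeeded:
            raise VolumeInfoError(
                "getVolumeInfo failed for image %s" % image_id)
        disks.append(disk)
    return disks
```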

Comment 6 Casper (RHV QE bot) 2024-01-21 16:00:27 UTC
This bug has low overall severity and is not going to be further verified by QE.

Comment 9 Casper (RHV QE bot) 2024-01-31 14:00:24 UTC
This bug has low overall severity and passed an automated regression suite, and is not going to be further verified by QE.

