Description of problem:

When powering up a HostedEngine environment, the host that the HostedEngine VM runs on is not connected to all Storage Domains of the DC, only to hosted_storage, yet it is in Up status. This host gets the SPM role without seeing all SDs (hosted_storage is master), and several commands start failing because the host is not connected to the other SDs.

It happens in this situation:
1. Enable global maintenance
2. Migrate Hosted-Engine to Host X
3. Shutdown the Hosted-Engine
4. Power cycle the host
5. Start Hosted-Engine on Host X
6. HostedEngine goes up on Host X
   - Host X is still in Up state in the DB since step 2
7. Host X is in Up state, but the engine did not send connectStorageServer commands to connect to all other SDs; it is only connected to hosted_storage (master).

I think it happens because the HE goes up on a host that was set to Up in the DB before the shutdown, so the engine doesn't send the connect storage commands to it. If the HE goes up on another host that was not Up in the DB during shutdown, the problem does not seem to happen.

Host in Up status after powering up the env:

# vdsm-client Host getStorageDomains
[
    "4d38d22c-88c1-4054-a942-15bc51cd8214"   <-- hosted_storage
]

Expected SDs to be connected to if host is in Up status:

# vdsm-client Host getStorageDomains
[
    "f7eeca0e-b360-4d88-959a-1e0e0730f846",
    "4d38d22c-88c1-4054-a942-15bc51cd8214",
    "8b5fc9dc-019b-4c8a-ba89-7b0dc19c5186",
    "c9d1a566-d436-4fd7-82c1-f886ca239a14",
    "72114491-e7e4-4680-b095-7d3b83a967c7"
]

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.7.4-1.el7.noarch
vdsm-4.20.43-1.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
As above

Actual results:
Host is in Up status, but only connected to hosted_storage

Expected results:
If Host is in Up status, it must be connected to the entire Pool.

Additional information:

1. ovirt-engine starts

2018-11-21 11:58:02,652+10 INFO [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 44) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.
2018-11-21 11:58:05,557+10 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (ServerService Thread Pool -- 41) [] Initialize vdsBroker 'host1.rhvlab:54321'
2018-11-21 11:58:07,598+10 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to host1.rhvlab/192.168.100.1
2018-11-21 11:58:12,266+10 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoAsyncVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-5) [] START, GetHardwareInfoAsyncVDSCommand(HostName = host1.rhvlab, VdsIdAndVdsVDSCommandParametersBase:{hostId='8b0876b5-6e38-464f-a018-a93c91d27724', vds='Host[host1.rhvlab,8b0876b5-6e38-464f-a018-a93c91d27724]'}), log id: 14d9690b

2. warnings about SDs not connected:

2018-11-21 11:58:13,616+10 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-4) [6bb8e8de] domain '72114491-e7e4-4680-b095-7d3b83a967c7:Export' in problem 'NOT_REPORTED'. vds: 'host1.rhvlab'
2018-11-21 11:58:13,643+10 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-4) [6bb8e8de] domain 'c9d1a566-d436-4fd7-82c1-f886ca239a14:NFS' in problem 'NOT_REPORTED'. vds: 'host1.rhvlab'
2018-11-21 11:58:13,652+10 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-4) [6bb8e8de] domain 'f7eeca0e-b360-4d88-959a-1e0e0730f846:iSCSI' in problem 'NOT_REPORTED'. vds: 'host1.rhvlab'
2018-11-21 11:58:13,676+10 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-4) [6bb8e8de] domain '8b5fc9dc-019b-4c8a-ba89-7b0dc19c5186:ISO' in problem 'NOT_REPORTED'. vds: 'host1.rhvlab'

3. connect storage pool

2018-11-21 11:58:17,993+10 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-37) [8d9bb30] START, ConnectStoragePoolVDSCommand(HostName = host1.rhvlab, ConnectStoragePoolVDSCommandParameters:{hostId='8b0876b5-6e38-464f-a018-a93c91d27724', vdsId='8b0876b5-6e38-464f-a018-a93c91d27724', storagePoolId='bcced3da-e61d-11e8-9e0a-52540015c1ff', masterVersion='1'}), log id: 717c0c84

4. spm start on this host

2018-11-21 11:58:18,650+10 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-37) [8d9bb30] START, SpmStartVDSCommand(HostName = host1.rhvlab, SpmStartVDSCommandParameters:{hostId='8b0876b5-6e38-464f-a018-a93c91d27724', storagePoolId='bcced3da-e61d-11e8-9e0a-52540015c1ff', prevId='-1', prevLVER='4', storagePoolFormatType='V4', recoveryMode='Manual', SCSIFencing='false'}), log id: 416a4529

5. Various operations start failing because the host is not connected to several SDs, but the engine thinks it is in Up state and sends commands that require those SDs to be connected

2018-11-21 12:00:49,605+10 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-5) [aefb9ca2-8ac3-4ba2-bd53-79e2f21299e3] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host1.rhvlab command GetVolumeInfoVDS failed: Storage domain does not exist: (u'c9d1a566-d436-4fd7-82c1-f886ca239a14',)
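
To make the gap between the two getStorageDomains outputs in the report explicit, here is a minimal Python sketch that lists the domains the host should see but does not. It assumes the command output has been saved as plain JSON (the "<-- hosted_storage" annotation removed); the function name missing_domains is hypothetical and is not part of vdsm or the engine.

```python
import json


def missing_domains(reported_json, expected):
    """Return the expected storage domain UUIDs that the host did not report."""
    reported = set(json.loads(reported_json))
    return sorted(set(expected) - reported)


# Output of `vdsm-client Host getStorageDomains` on the affected host,
# taken from the report with the hand-written annotation removed.
reported_json = '["4d38d22c-88c1-4054-a942-15bc51cd8214"]'

# Domains the report lists as expected for a host in Up status.
expected = [
    "f7eeca0e-b360-4d88-959a-1e0e0730f846",
    "4d38d22c-88c1-4054-a942-15bc51cd8214",
    "8b5fc9dc-019b-4c8a-ba89-7b0dc19c5186",
    "c9d1a566-d436-4fd7-82c1-f886ca239a14",
    "72114491-e7e4-4680-b095-7d3b83a967c7",
]

for sd in missing_domains(reported_json, expected):
    print("not connected:", sd)
```

With the values from the report, this prints the four non-hosted_storage domains, the same set that shows up as NOT_REPORTED in the engine log.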
(In reply to Germano Veit Michel from comment #0)
> It happens on this situation:
> 1. Enable global maintenance
> 2. Migrate Hosted-Engine to Host X
> 3. Shutdown the Hosted-Engine
> 4. Power cycle the host
> 5. Start Hosted-Engine on Host X

Here the engine itself is also going to start. AFAIK, after the engine starts there is a kind of grace period during which the engine should try to reconcile host statuses before taking any further action. We should investigate why this wasn't enough in the reported case.

> 6. HostedEngine goes up on Host X
> - Host X is still in Up state in the DB since step 2
> 7. Host X is in Up state, but engine did not send connectStorageServer
> commands to connect to all other SDs, it is only connected to hosted_storage
> (master).
(In reply to Germano Veit Michel from comment #0)

Please use:

Host A = initial Hosted-Engine Host
Host X = New Hosted-Engine Host

> It happens on this situation:
> 1. Enable global maintenance
> 2. Migrate Hosted-Engine to Host X
> 3. Shutdown the Hosted-Engine

Which one? A or X?

> 4. Power cycle the host

Which one? A or X?

> 5. Start Hosted-Engine on Host X
> 6. HostedEngine goes up on Host X
> - Host X is still in Up state in the DB since step 2
> 7. Host X is in Up state, but engine did not send connectStorageServer
> commands to connect to all other SDs, it is only connected to hosted_storage
> (master).
(In reply to Eli Mesika from comment #2)
> (In reply to Germano Veit Michel from comment #0)
>
> Please use :
>
> Host A = initial Hosted-Engine Host
> Host X = New Hosted-Engine Host
>
> > It happens on this situation:
> > 1. Enable global maintenance
> > 2. Migrate Hosted-Engine to Host X
> > 3. Shutdown the Hosted-Engine
>
> which one ? A or X ???

X

> > 4. Power cycle the host
>
> which one ? A or X ???

X

> > 5. Start Hosted-Engine on Host X
> > 6. HostedEngine goes up on Host X
> > - Host X is still in Up state in the DB since step 2
> > 7. Host X is in Up state, but engine did not send connectStorageServer
> > commands to connect to all other SDs, it is only connected to hosted_storage
> > (master).

In fact, I think any order will reproduce this. I get this bug every single time I power up my test environment.
This warning is related to storage.
Please look at IrsProxy::addDomainInProblemData.
It seems that in this case the host must go to non-operational.

Tal, can you take a look and move this to storage?
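
For triage, the IrsProxy warnings can be pulled out of engine.log and grouped per host. The sketch below is only an illustration: the regular expression is derived from the WARN lines quoted in comment #0, and the helper name domains_in_problem is hypothetical. It does not reproduce what IrsProxy::addDomainInProblemData does inside the engine.

```python
import re

# Pattern derived from the IrsProxy WARN lines quoted in comment #0.
PATTERN = re.compile(
    r"domain '(?P<uuid>[0-9a-f-]{36}):(?P<name>[^']+)' in problem "
    r"'(?P<problem>[A-Z_]+)'\. vds: '(?P<host>[^']+)'"
)


def domains_in_problem(lines):
    """Map each host to the (name, uuid, problem) tuples found in the log lines."""
    result = {}
    for line in lines:
        match = PATTERN.search(line)
        if match:
            entry = (match.group("name"), match.group("uuid"), match.group("problem"))
            result.setdefault(match.group("host"), []).append(entry)
    return result


# Two of the WARN lines from the report, used as sample input.
sample = [
    "2018-11-21 11:58:13,616+10 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] "
    "(EE-ManagedThreadFactory-engine-Thread-4) [6bb8e8de] domain "
    "'72114491-e7e4-4680-b095-7d3b83a967c7:Export' in problem 'NOT_REPORTED'. vds: 'host1.rhvlab'",
    "2018-11-21 11:58:13,643+10 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] "
    "(EE-ManagedThreadFactory-engine-Thread-4) [6bb8e8de] domain "
    "'c9d1a566-d436-4fd7-82c1-f886ca239a14:NFS' in problem 'NOT_REPORTED'. vds: 'host1.rhvlab'",
]

for host, problems in domains_in_problem(sample).items():
    for name, uuid, problem in problems:
        print(host, name, uuid, problem)
```

A host that is Up in the DB while several domains are reported this way is the state described in comment #0.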
This bug has not been marked as blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.
Closing the upstream bug in favor of the downstream one, to concentrate efforts.

*** This bug has been marked as a duplicate of bug 1772688 ***