Bug 1486360

Summary: After restart of ovirt-engine service master storage is in unknown state for short period of time
Product: [oVirt] ovirt-engine Reporter: Lukas Svaty <lsvaty>
Component: Backend.Core Assignee: Nobody <nobody>
Status: CLOSED WONTFIX QA Contact: meital avital <mavital>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.1.4.2 CC: amureini, bugs, lsvaty, ylavi
Target Milestone: --- Keywords: AutomationBlocker, Regression
Target Release: --- Flags: lsvaty: planning_ack?
lsvaty: devel_ack?
lsvaty: testing_ack+
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-08-31 07:16:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Lukas Svaty 2017-08-29 14:49:13 UTC
Description of problem:
After a restart of the ovirt-engine service, master storage is in the Unknown state for a short period of time. This period is really short, and I believe Apache has restarted right after, so by the time an admin logs into the admin portal to check the storage domains, they are already in the UP state.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20170823165744.git116f435.el7.centos.noarch

How reproducible:

Steps to Reproduce:
1. service ovirt-engine restart && tail -f /var/log/ovirt-engine/engine.log 
2. check for log messages of storage changing state

Actual results:
Storage goes to the Unknown state for a short period of time.

Expected results:
No action on hosts, storages or other engine entities.

Additional info:
2017-08-29 17:47:58,785+03 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler6) [1522e02f] Storage Pool '5c8d7813-df84-45f8-b380-0a369d0475e7' - Updating Storage Domain '9a1d0c82-7cbd-4174-8b44-50b7682697be' status from 'Active' to 'Unknown', reason: null
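
Step 2 of the reproduction (checking the log for storage state changes) could be automated with a small filter such as the sketch below. The pattern is modeled only on the single log line quoted above, so it is an assumption that other engine versions emit the same wording; `find_transitions` is a hypothetical helper name, not part of oVirt.

```python
import re

# Pattern modeled on the one engine.log line quoted in this report.
# Assumption: other engine versions may word this message differently.
TRANSITION_RE = re.compile(
    r"Storage Pool '(?P<pool>[^']+)' - Updating Storage Domain '(?P<domain>[^']+)' "
    r"status from '(?P<old>[^']+)' to '(?P<new>[^']+)'"
)


def find_transitions(lines):
    """Yield (pool, domain, old_status, new_status) for each status-change line."""
    for line in lines:
        match = TRANSITION_RE.search(line)
        if match:
            yield match.group("pool", "domain", "old", "new")
```

Feeding it the lines of /var/log/ovirt-engine/engine.log after a restart would list every domain status transition that was logged.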

Comment 2 Yaniv Kaul 2017-08-30 07:23:24 UTC
I don't see a real user problem here. Close-WONTFIX?

Comment 3 Lukas Svaty 2017-08-30 08:29:37 UTC
From the manual point of view, I do not see a problem either. However, in automation, where we do not rely on slow interaction with the web admin, once the engine reports DC UP and hosts UP after the restart, it should not go to the Unknown state after that. Because of this, a few flows are hard to achieve, such as:

Using AnySDK
1. Edit config by engine-config
2. Restart engine
3. Check storage is UP (just as a dummy WA for this bug)
4. Create VM

Although the storage is in the UP state right after the engine restart, it goes to the Unknown state during the creation of the VM (disk), which fails the VM creation. This is relevant not only for QA automation but also for community users who use engine-config to configure custom properties that they plan to add to VMs, for example for hooks.
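
Step 3 of the flow above (the dummy workaround) amounts to polling a status check until it passes. A minimal sketch follows, assuming a hypothetical `predicate` callable that would wrap whatever SDK call reports the storage status; no real SDK call is shown here. As described in this report, one successful check is not enough on its own, because the storage can still move to Unknown afterwards.

```python
import time


def wait_until(predicate, timeout=60.0, interval=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns True or `timeout` seconds elapse.

    `predicate` is a hypothetical callable (for example, one that asks the
    engine whether the storage domain is up). `sleep` and `clock` are
    injectable so the helper can be exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```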

Comment 4 Yaniv Kaul 2017-08-30 08:40:23 UTC
(In reply to Lukas Svaty from comment #3)
> From the manual point of view, I do not see a problem as well. However in
> automation where we do not rely on slow interaction with web admin, once the
> engine is reporting DC UP and hosts UP after the restart, it should not go
> to Unknown state after that. Due to these few flow are hard to achieve such
> as
> 
> Using AnySDK
> 1. Edit config by engine-config
> 2. Restart engine
> 3. Check storage is UP (just as a dummy WA for this bug)
> 4. Create VM
> 
> While after the restart of the engine storage is in UP state, during the
> creation of VM (disk) it goes to the unknown state, which fails the VM
> creation. This is relevant not only for QA automation but also for the
> community who uses engine-config to configure custom properties for VMs
> which they are planning to add to VM, for example for hooks.

I completely understand the scenario - yet I don't think a lot of people would be hit by this issue in real life.
I think a short sleep could be a good workaround - or the real workaround you've suggested above.

Does VM creation fail, or the disks?

BTW, disk creation can fail at any time when the storage is not up. 
So defensive programming would add a sanity check to verify that the storage is up before trying to create the disk. I personally think that's overkill.
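
A minimal sketch of such a sanity check, for illustration only: `get_storage_status` and `create_disk` are hypothetical callables standing in for the real SDK calls, and the `'Active'` status string is taken from the engine log line in the description. The check is still racy, since the status can change between the check and the create call.

```python
class StorageNotUp(Exception):
    """Raised when the storage domain does not report the expected status."""


def create_disk_checked(get_storage_status, create_disk, expected="Active"):
    """Call create_disk() only if the storage domain currently reports `expected`.

    Both callables are hypothetical placeholders for SDK calls. This is a
    check-then-act guard, so it narrows the failure window without closing it.
    """
    status = get_storage_status()
    if status != expected:
        raise StorageNotUp(f"storage domain status is {status!r}")
    return create_disk()
```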

How long does it take to go back to Up?

Comment 5 Lukas Svaty 2017-08-30 08:56:26 UTC
It goes to the Unknown state around 10 seconds after the first health check (checking that DC and host status are up) following the restart, and recovers about 15 seconds after that first health check, so the window is about 5 seconds. (Not an accurate measurement at all; I can retest if needed.)
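
If a retest is needed, the window could be measured from periodic status samples rather than estimated. The sketch below assumes a list of (timestamp, status) samples collected by some hypothetical poller; `unknown_windows` is an illustrative helper, not an existing tool.

```python
def unknown_windows(samples, bad_status="Unknown"):
    """Return (start, end) timestamp pairs during which status == bad_status.

    `samples` is an iterable of (timestamp_seconds, status) ordered by time.
    A window that is still open at the last sample is not reported.
    """
    windows = []
    start = None
    for timestamp, status in samples:
        if status == bad_status and start is None:
            start = timestamp          # window opens
        elif status != bad_status and start is not None:
            windows.append((start, timestamp))  # window closes
            start = None
    return windows
```

The resolution of the result is limited by the polling interval, so sampling once per second would be needed to confirm a window of roughly 5 seconds.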


VM add fails on: 

17:17:57 2017-08-29 17:17:57,036 ERROR  NOTE: Test failed on setup phase!
17:17:57 2017-08-29 17:17:57,036 ERROR Result: FAILED
17:17:57 2017-08-29 17:17:57,036 ERROR  ERR: Failed to create element NOT as expected:
17:17:57 	Status: 400
17:17:57 	Reason: Bad Request
17:17:57 	Detail: [Cannot add VM: Storage Domain cannot be accessed.
17:17:57 -Please check that at least one Host is operational and Data Center state is up.]

Based on the 400, I don't think disk creation was even started. I'm not sure what would happen if the storage dropped in the middle of disk creation.

Comment 6 Allon Mureinik 2017-08-31 07:16:53 UTC
(In reply to Lukas Svaty from comment #3)
> From the manual point of view, I do not see a problem as well. However in
> automation where we do not rely on slow interaction with web admin, once the
> engine is reporting DC UP and hosts UP after the restart, it should not go
> to Unknown state after that. Due to these few flow are hard to achieve such
> as
This isn't the way the system behaves.
We don't just assume the storage is up, we monitor it (which takes time, as you noted), and make sure it's really accessible.

Comment 7 Lukas Svaty 2017-08-31 15:25:13 UTC
So what is the desired check for automation purpose to start working with engine entities?

1. DC is up (it goes down after storage goes to unknown)
2. Storage is up (it goes to unknown right after it is up after restart)
3. Hosts are up (storage goes down a few seconds after the hosts are polled as up following the restart)
4. Web admin is accessible (Again too soon)
5. Health page (Way too soon)

I believe we need some kind of check that lets automation say the restart was successful and the engine can now be used for API calls because the environment is in a stable state.
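
One possible shape for such a check on the automation side, as a sketch only: require every status check to pass on several consecutive polls, so that a brief dip to Unknown resets the count. The `checks` callables are hypothetical wrappers around DC, storage, and host status queries, and the default numbers are placeholders rather than recommended values.

```python
import time


def wait_until_stable(checks, required_consecutive=5, interval=5.0,
                      max_polls=60, sleep=time.sleep):
    """Return True once all checks pass on `required_consecutive` polls in a row.

    `checks` is a list of hypothetical zero-argument callables (for example
    DC up, storage up, hosts up). Any failing poll resets the streak.
    Returns False if `max_polls` polls complete without a long enough streak.
    """
    streak = 0
    for _ in range(max_polls):
        if all(check() for check in checks):
            streak += 1
            if streak >= required_consecutive:
                return True
        else:
            streak = 0
        sleep(interval)
    return False
```

With the timing reported in comment 5, the streak would need to span more than the roughly 15 seconds it takes the storage to recover after the first health check.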

Comment 8 Allon Mureinik 2017-11-19 10:36:56 UTC
Storage+DC are up.