Bug 1482454

Summary: Restarting the engine causes all DCs to go to Non Responsive status for a couple of seconds
Product: [oVirt] ovirt-engine Reporter: Avihai <aefrat>
Component: BLL.Storage    Assignee: Tal Nisan <tnisan>
Status: CLOSED WONTFIX QA Contact: Elad <ebenahar>
Severity: low Docs Contact:
Priority: unspecified    
Version: 4.2.0    CC: aefrat, bugs, ebenahar, mburman
Target Milestone: ---    Keywords: Automation, AutomationBlocker
Target Release: ---    Flags: sbonazzo: ovirt-4.3-
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-07-16 08:42:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
engine & vdsm logs (flags: none)

Description Avihai 2017-08-17 10:29:35 UTC
Restarting the engine causes all DCs to go to Non Responsive status for a couple of seconds

Description of problem:
Restarting the engine causes all DCs on that engine to go to Non Responsive status for a couple of seconds, so every operation you try to perform during those seconds fails.

I encountered this while running the automation qcow2_v3 TP: in the TEARDOWN phase, after restarting the engine, the tests fail because the DC is in a non-responsive/problematic status for those several seconds.

Version-Release number of selected component (if applicable):
4.2.0-0.0.master.20170813134654.gitaee967b.el7.centos

How reproducible:
100%


Steps to Reproduce:
1. Create a DC, cluster & SD and verify that all are in Active state
2. Restart the engine (2017-08-17 11:55:49)
3. Wait ~16 seconds after the restart

Actual results:
The DC status changed to 'Non Responsive' at 2017-08-17 11:56:04.

About 2 seconds later the DC went up again.
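
For illustration (not part of the original report), a minimal polling sketch that could be run once the engine is serving the API again, to record the status toggle. It assumes the ovirtsdk4 Python SDK, a placeholder engine URL and credentials; the DC name 'golden_env_mixed' is taken from the engine log below.

# Sketch only: poll the DC status via ovirtsdk4 and print each change.
# URL, username and password are placeholders.
import time
import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # placeholder
    username='admin@internal',
    password='password',                                # placeholder
    insecure=True,
)
dcs_service = connection.system_service().data_centers_service()

last_status = None
deadline = time.time() + 60          # watch for one minute after the restart
while time.time() < deadline:
    dc = dcs_service.list(search='name=golden_env_mixed')[0]
    if dc.status != last_status:     # e.g. UP -> (brief non-up status) -> UP
        print(time.strftime('%H:%M:%S'), dc.status)
        last_status = dc.status
    time.sleep(1)
connection.close()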


Expected results:
The DC status should not change after an engine restart.


Additional info:
Engine log:
2017-08-17 11:56:04,626+03 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler1) [3acb1def] Storage Pool 'de8c0e96-c0f8-4803-864e-ecf78f2ceb94' - Updating Storage Domain 'd5b34a49-8efc-4793-bc3c-a83c26419910' status from 'Active' to 'Unknown', reason: null
2017-08-17 11:56:04,627+03 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler1) [3acb1def] Storage Pool 'de8c0e96-c0f8-4803-864e-ecf78f2ceb94' - Updating Storage Domain '38c5362b-4df4-42d6-9707-49ec23012fc4' status from 'Active' to 'Unknown', reason: null
2017-08-17 11:56:04,628+03 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler1) [3acb1def] Storage Pool 'de8c0e96-c0f8-4803-864e-ecf78f2ceb94' - Updating Storage Domain '1842421a-81fd-4639-a795-5228ba726fac' status from 'Active' to 'Unknown', reason: null
2017-08-17 11:56:04,629+03 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler1) [3acb1def] Storage Pool 'de8c0e96-c0f8-4803-864e-ecf78f2ceb94' - Updating Storage Domain 'c1a9a5f1-90cc-42df-b23a-09acb294544c' status from 'Active' to 'Unknown', reason: null
2017-08-17 11:56:04,630+03 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler1) [3acb1def] Storage Pool 'de8c0e96-c0f8-4803-864e-ecf78f2ceb94' - Updating Storage Domain 'd920169d-0a4a-45dc-b542-452a32076b7b' status from 'Active' to 'Unknown', reason: null
2017-08-17 11:56:04,631+03 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler1) [3acb1def] Storage Pool 'de8c0e96-c0f8-4803-864e-ecf78f2ceb94' - Updating Storage Domain '7015ea1d-7434-426d-a2f1-22ca9fe31d3a' status from 'Active' to 'Unknown', reason: null
2017-08-17 11:56:04,639+03 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler1) [3acb1def] Storage Pool 'de8c0e96-c0f8-4803-864e-ecf78f2ceb94' - Updating Storage Domain '0394335d-2462-48af-9a24-72a21a05dcba' status from 'Active' to 'Unknown', reason: null
2017-08-17 11:56:04,931+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [3acb1def] EVENT_ID: SYSTEM_CHANGE_STORAGE_POOL_STATUS_PROBLEMATIC(980), Invalid status on Data Center golden_env_mixed. Setting status to Non Responsive.
2017-08-17 11:56:04,937+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [4cac1232] EVENT_ID: SYSTEM_CHANGE_STORAGE_POOL_STATUS_PROBLEMATIC(980), Invalid status on Data Center dc_upgrade_4_0_to_4_1. Setting status to Non Responsive.

Comment 1 Avihai 2017-08-17 10:30:18 UTC
Created attachment 1314628 [details]
engine & vdsm logs

Comment 2 Avihai 2017-08-17 10:35:33 UTC
This issue fails all of my tests that include an engine restart: the DC is down for ~2 seconds, and the automation test teardown operations (detach DC, remove DC, ...) fail because the DC is not available.

Comment 3 Yaniv Kaul 2017-08-20 08:00:40 UTC
1. Is this a regression?
2. I would say that you either need to improve your tests to check for status, or wait a while after the engine restart. Let's assume we fix it and it's not 'Non Responsive': what status do you expect it to be? It won't be 'Active' for a while, for sure.

Comment 4 Raz Tamir 2017-08-20 08:19:46 UTC
As Avihai is on PTO for a few days,

The issue is not in checking the status, but the fact that restarting the ovirt-engine service triggers re-initialization of the data center.

Comment 5 Allon Mureinik 2017-08-24 12:05:55 UTC
(In reply to Raz Tamir from comment #4)
> As Avihai is on PTO for few days,
> 
> The issue is not in checking the status, but the fact that restarting the
> ovirt-engine service triggers re-initialization of the data center.

I can't see any re-initialization in the logs (although if you point me to something I'm missing that would be great).
As far as I can see, the engine just marks statuses as "unknown" until it gets confirmation that they are up.

Comment 6 Yaniv Kaul 2017-09-05 07:10:32 UTC
(In reply to Allon Mureinik from comment #5)
> (In reply to Raz Tamir from comment #4)
> > As Avihai is on PTO for few days,
> > 
> > The issue is not in checking the status, but the fact that restarting the
> > ovirt-engine service triggers re-initialization of the data center.
> 
> I can't see any re-initialization in the logs (although if you point me to
> something I'm missing that would be great).
> As far as I can see, the engine just marks statuses as "unknown" until it
> gets confirmation that they are up.

CLOSE-NOTABUG / WONTFIX / DEFERRED?

Comment 7 Avihai 2017-09-05 11:46:09 UTC
(In reply to Yaniv Kaul from comment #3)
> 1. Is this a regression?
No. I checked, and this also occurs on 4.1.

> 2. I would say that you either need to improve your tests to check for
> status, or wait a while after engine restart. let's assume we fix it and
> it's not 'non-responsive' . What status do you expect it to be? It won't be
> 'Active' for a while for sure.

To clarify, the issues are:

1) The DC goes to a 'Non Responsive' state after the engine restart - why? We restarted the engine, not VDSM.

2) After the engine restart, the DC state goes like this:
A) DC reaches the 'Active' state
B) DC goes to 'Unknown'
C) DC changes back to 'Active'

Currently, after an engine restart the automation waits for the DC to be in the 'Active' state, but since the DC then goes from 'Active' to 'Unknown', the automation tries to perform actions on the DC and fails because the DC is not available.

Sure, I can change the automation tests to wait through these state changes, but the question is whether these DC state changes are by design or not.

IMHO, it does not look reasonable for the DC to be 'Active' and then go to 'Non Responsive'.

I would expect that:
1) After an engine restart, the DC should not go to a 'Non Responsive' state at all.

2) If by design the DC has to go through a state other than 'Active', please move the DC to 'Active' only when it is finally ready to work, and avoid toggling states ('Active' -> 'Unknown' -> 'Active').
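
A possible automation-side workaround, sketched here under the same assumptions as the earlier snippet (ovirtsdk4, placeholder connection): instead of acting as soon as the DC first reports 'up', wait until it has stayed 'up' for a short grace period so the brief 'Active' -> 'Unknown' -> 'Active' toggle is absorbed before teardown. The grace-period length is an arbitrary illustration, not a recommended value.

# Sketch only: proceed with teardown (detach DC, remove DC, ...) only after the DC
# has reported UP continuously for `stable_for` seconds. `dcs_service` is the
# data-centers service from an ovirtsdk4 Connection, as in the earlier snippet.
import time
from ovirtsdk4 import types

def wait_for_stable_dc(dcs_service, dc_name, stable_for=20, timeout=180):
    stable_since = None
    deadline = time.time() + timeout
    while time.time() < deadline:
        dc = dcs_service.list(search='name=%s' % dc_name)[0]
        if dc.status == types.DataCenterStatus.UP:
            stable_since = stable_since or time.time()
            if time.time() - stable_since >= stable_for:
                return
        else:
            stable_since = None      # status toggled; restart the grace period
        time.sleep(2)
    raise RuntimeError('DC %s did not stabilize within %s seconds' % (dc_name, timeout))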

Comment 8 Tal Nisan 2018-07-16 08:42:21 UTC
Closing old bugs, feel free to reopen if still needed.

Comment 9 Michael Burman 2018-07-29 14:38:56 UTC
This bug is still alive and is affecting our automation tests.
A fresh bug has been filed as BZ 1609565.
Please fix this issue.