Restarting the engine causes all DCs to become non-responsive for a couple of seconds

Description of problem:
Restarting the engine causes all DCs on that engine to go into a 'Non Responsive' status for a couple of seconds, so every operation performed during those seconds fails. I encountered this while running the automation qcow2_v3 TP: in the TEARDOWN phase, after restarting the engine, the tests fail while the DC is in the non-responsive/problematic status for those few seconds.

Version-Release number of selected component (if applicable):
4.2.0-0.0.master.20170813134654.gitaee967b.el7.centos

How reproducible:
100%

Steps to Reproduce:
1. Create a DC, cluster & SD and see that everything is in an active state.
2. Restart the engine (2017-08-17 11:55:49).
3. Wait ~16 seconds after the restart.

Actual results:
The DC status changes to 'Non Responsive' (2017-08-17 11:56:04). About 2 seconds later the DC goes up again. (A polling sketch for observing this transition follows the log excerpt below.)

Expected results:
The DC status should not change after an engine restart.

Additional info:
Engine log:
2017-08-17 11:56:04,626+03 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler1) [3acb1def] Storage Pool 'de8c0e96-c0f8-4803-864e-ecf78f2ceb94' - Updating Storage Domain 'd5b34a49-8efc-4793-bc3c-a83c26419910' status from 'Active' to 'Unknown', reason: null
2017-08-17 11:56:04,627+03 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler1) [3acb1def] Storage Pool 'de8c0e96-c0f8-4803-864e-ecf78f2ceb94' - Updating Storage Domain '38c5362b-4df4-42d6-9707-49ec23012fc4' status from 'Active' to 'Unknown', reason: null
2017-08-17 11:56:04,628+03 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler1) [3acb1def] Storage Pool 'de8c0e96-c0f8-4803-864e-ecf78f2ceb94' - Updating Storage Domain '1842421a-81fd-4639-a795-5228ba726fac' status from 'Active' to 'Unknown', reason: null
2017-08-17 11:56:04,629+03 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler1) [3acb1def] Storage Pool 'de8c0e96-c0f8-4803-864e-ecf78f2ceb94' - Updating Storage Domain 'c1a9a5f1-90cc-42df-b23a-09acb294544c' status from 'Active' to 'Unknown', reason: null
2017-08-17 11:56:04,630+03 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler1) [3acb1def] Storage Pool 'de8c0e96-c0f8-4803-864e-ecf78f2ceb94' - Updating Storage Domain 'd920169d-0a4a-45dc-b542-452a32076b7b' status from 'Active' to 'Unknown', reason: null
2017-08-17 11:56:04,631+03 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler1) [3acb1def] Storage Pool 'de8c0e96-c0f8-4803-864e-ecf78f2ceb94' - Updating Storage Domain '7015ea1d-7434-426d-a2f1-22ca9fe31d3a' status from 'Active' to 'Unknown', reason: null
2017-08-17 11:56:04,639+03 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler1) [3acb1def] Storage Pool 'de8c0e96-c0f8-4803-864e-ecf78f2ceb94' - Updating Storage Domain '0394335d-2462-48af-9a24-72a21a05dcba' status from 'Active' to 'Unknown', reason: null
2017-08-17 11:56:04,931+03 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [3acb1def] EVENT_ID: SYSTEM_CHANGE_STORAGE_POOL_STATUS_PROBLEMATIC(980), Invalid status on Data Center golden_env_mixed. Setting status to Non Responsive.
2017-08-17 11:56:04,937+03 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [4cac1232] EVENT_ID: SYSTEM_CHANGE_STORAGE_POOL_STATUS_PROBLEMATIC(980), Invalid status on Data Center dc_upgrade_4_0_to_4_1. Setting status to Non Responsive.
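
For reference, here is a minimal polling sketch for observing the transition described above from outside the TP. It assumes the ovirt-engine-sdk-python v4 API; the engine URL, credentials, DC name and the 30-second window are illustrative placeholders, not values taken from this report.

import time
import ovirtsdk4 as sdk

def dc_status(dc_name):
    """Open a short-lived connection and return the DC status, or None if the
    engine API is unreachable (expected while the engine is restarting)."""
    try:
        connection = sdk.Connection(
            url='https://ENGINE_FQDN/ovirt-engine/api',  # placeholder engine URL
            username='admin@internal',
            password='PASSWORD',                         # placeholder
            insecure=True,
        )
        try:
            dcs = connection.system_service().data_centers_service().list(
                search='name=%s' % dc_name)
            return dcs[0].status if dcs else None
        finally:
            connection.close()
    except sdk.Error:
        return None

# Run this while restarting the engine on the engine host (e.g. with
# `systemctl restart ovirt-engine`) to see the 'Active' -> 'Unknown'/
# 'Non Responsive' -> 'Active' toggle described above.
last = object()
for _ in range(30):  # watch for ~30 seconds around the restart
    status = dc_status('golden_env_mixed')
    if status != last:
        print('%s: DC status -> %s' % (time.strftime('%H:%M:%S'), status))
        last = status
    time.sleep(1)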
Created attachment 1314628 [details] engine & vdsm logs
This issue fails all my tests that include an engine restart, as the DC is down for ~2 seconds and the automation test teardown operations (detach DC, remove DC...) fail because the DC is not available.
1. Is this a regression?
2. I would say that you either need to improve your tests to check for the status, or wait a while after the engine restart. Let's assume we fix it and it's not 'Non Responsive': what status do you expect it to be? It won't be 'Active' for a while, for sure. (A sketch of such a status check follows below.)
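
To illustrate option 2, here is a hedged sketch of a "wait for the DC to settle" helper that the tests could use. It assumes ovirt-engine-sdk-python v4; the timeout, poll interval and the requirement of a few consecutive UP checks are illustrative choices, not part of the existing automation.

import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

def wait_for_dc_settled(connection, dc_name, timeout=120, interval=2, settle=3):
    """Poll the data center until it has reported UP for `settle` consecutive
    checks, raising if that does not happen within `timeout` seconds."""
    dcs_service = connection.system_service().data_centers_service()
    deadline = time.time() + timeout
    consecutive_up = 0
    while time.time() < deadline:
        try:
            dcs = dcs_service.list(search='name=%s' % dc_name)
            up = bool(dcs) and dcs[0].status == types.DataCenterStatus.UP
        except sdk.Error:
            up = False  # the engine API may still be coming back up
        consecutive_up = consecutive_up + 1 if up else 0
        if consecutive_up >= settle:
            return dcs[0]
        time.sleep(interval)
    raise RuntimeError('DC %s did not settle to UP within %ss' % (dc_name, timeout))

Requiring several consecutive UP polls is meant to avoid the race where the DC briefly reports 'Active' and then flips to 'Unknown' a moment later, which is the toggle reported further down in this bug.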
As Avihai is on PTO for a few days: the issue is not in checking the status, but the fact that restarting the ovirt-engine service triggers a re-initialization of the data center.
(In reply to Raz Tamir from comment #4)
> As Avihai is on PTO for a few days: the issue is not in checking the
> status, but the fact that restarting the ovirt-engine service triggers a
> re-initialization of the data center.

I can't see any re-initialization in the logs (although if you point me to something I'm missing, that would be great). As far as I can see, the engine just marks the statuses as "unknown" until it gets confirmation that they are up.
(In reply to Allon Mureinik from comment #5)
> (In reply to Raz Tamir from comment #4)
> > As Avihai is on PTO for a few days: the issue is not in checking the
> > status, but the fact that restarting the ovirt-engine service triggers a
> > re-initialization of the data center.
>
> I can't see any re-initialization in the logs (although if you point me to
> something I'm missing, that would be great). As far as I can see, the
> engine just marks the statuses as "unknown" until it gets confirmation
> that they are up.

CLOSE-NOTABUG / WONTFIX / DEFERRED?
(In reply to Yaniv Kaul from comment #3)
> 1. Is this a regression?
No. I checked, and this also occurs on 4.1.

> 2. I would say that you either need to improve your tests to check for the
> status, or wait a while after the engine restart. Let's assume we fix it
> and it's not 'Non Responsive': what status do you expect it to be? It
> won't be 'Active' for a while, for sure.
To clarify, the issues are:
1) The DC goes to a 'Non Responsive' state after an engine restart - why? We did not restart VDSM, only the engine.
2) After the engine restart, the DC state goes like this:
A) The DC reaches an 'Active' state.
B) The DC goes to 'Unknown'.
C) The DC changes back to 'Active'.

After an engine restart the automation currently waits for an 'Active' DC state, but since the DC then goes from 'Active' to 'Unknown', the automation tries to perform actions on the DC and fails because the DC is not available.

Sure, I can change the automation tests to wait for these state changes, but the question is whether these DC state changes are by design or not. IMHO, it does not look reasonable for the DC to be 'Active' and then go to 'Non Responsive'.

I would expect that:
1) After an engine restart, the DC does not go to a 'Non Responsive' state at all.
2) If, by design, the DC has to go to a state other than 'Active', then the DC should be set to 'Active' only when it is finally ready to work, avoiding the toggling of states ('Active' -> 'Unknown' -> 'Active').

(A retry sketch for the teardown window follows this comment.)
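
For the teardown operations specifically, one possible mitigation while the toggle exists is to retry the failing call across the short window. This is only a sketch under the assumption that the automation uses ovirt-engine-sdk-python v4; the attempt count, delay and the remove() usage shown are illustrative, not the actual test code.

import time
import ovirtsdk4 as sdk

def retry_transient(operation, attempts=5, delay=3):
    """Run `operation`, retrying a few times if the engine reports an error,
    to ride out the transient 'Unknown'/'Non Responsive' window."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except sdk.Error as err:
            if attempt == attempts:
                raise
            print('attempt %d failed (%s), retrying in %ds' % (attempt, err, delay))
            time.sleep(delay)

# Illustrative usage (dc_id and the force flag are placeholders):
# dc_service = connection.system_service().data_centers_service() \
#                        .data_center_service(dc_id)
# retry_transient(lambda: dc_service.remove(force=True))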
Closing old bugs, feel free to reopen if still needed.
This bug is still alive and affecting our automation tests. The fresh bug is BZ 1609565. Please fix this issue.