Bug 1397830

Summary: Manual fence for hypervisor not working for 30+ minutes when hypervisor went down
Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: 3.6.9
Hardware: All
OS: All
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: high
Target Milestone: ---
Target Release: ---
Assignee: Martin Perina <mperina>
QA Contact: Petr Matyáš <pmatyas>
Reporter: nijin ashok <nashok>
Docs Contact:
CC: gklein, lsurette, lsvaty, mgoldboi, michal.skrivanek, mperina, nashok, oourfali, pkliczew, rbalakri, Rhev-m-bugs, sfroemer, srevivo, ykaul
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-04-19 08:05:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description nijin ashok 2016-11-23 12:35:54 UTC
Description of problem:

When the hypervisor went down, it stayed in the "Connecting" status for more than 30 minutes. Power management is not configured for the host. While performing "confirm host has been rebooted", the engine log showed "VDS_CANNOT_CLEAR_VMS_WRONG_STATUS". The vdsNotResponding treatment was entered only after 30 minutes; after that, manual fencing worked and the VMs were marked as Down.

Version-Release number of selected component (if applicable):

rhevm-3.6.9.2-0.1.el6.noarch


How reproducible:

Observed in a customer environment.

Steps to Reproduce:

1. Manually crash the hypervisor so that it is not accessible. 

2. "confirm host has been rebooted" doesn't clear the VMs status to down.


Actual results:

"confirm host has been rebooted" doesn't clear the VMs status to down.


Expected results:

"confirm host has been rebooted" should make the VMs down.

Additional info:

Comment 3 Oved Ourfali 2016-11-30 09:54:35 UTC
Michal - when would we get this error?
(VDS_CANNOT_CLEAR_VMS_WRONG_STATUS)

Comment 4 Michal Skrivanek 2016-11-30 10:35:50 UTC
When the host is not in the NonResponsive state (and there are VMs to clear). So if it was, e.g., Connecting or anything else, the action is not executed.
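
For illustration, a minimal standalone sketch of that guard (this is not the actual ovirt-engine code; the class, enum, and method names are hypothetical stand-ins for the status check described above):

public class ManualFenceGuard {

    // Simplified stand-in for the engine's host status values.
    enum VdsStatus { UP, CONNECTING, NON_RESPONSIVE, MAINTENANCE }

    // Per comment 4: "confirm host has been rebooted" may proceed only
    // when the host is already NonResponsive; any other status (e.g.
    // Connecting) yields VDS_CANNOT_CLEAR_VMS_WRONG_STATUS.
    static boolean canClearVms(VdsStatus status) {
        return status == VdsStatus.NON_RESPONSIVE;
    }

    public static void main(String[] args) {
        // While the host lingers in Connecting, as in this bug, the
        // manual fence is rejected; only after the non-responding
        // treatment flips the status does it succeed.
        System.out.println(canClearVms(VdsStatus.CONNECTING));     // false
        System.out.println(canClearVms(VdsStatus.NON_RESPONSIVE)); // true
    }
}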

Comment 7 Piotr Kliczewski 2016-12-06 09:37:02 UTC
I can see that the engine tries to connect during this period. In the logs:

2016-11-10 09:35:00,046 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (DefaultQuartzScheduler_Worker-64) [] Autorecovering 1 hosts
2016-11-10 09:35:00,046 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (DefaultQuartzScheduler_Worker-64) [] Autorecovering hosts id: f326e97e-ab09-4110-8148-5c00343589f5 , name : lxf101s001
2016-11-10 09:35:00,049 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (DefaultQuartzScheduler_Worker-64) [5b097bca] Lock Acquired to object 'EngineLock:{exclusiveLocks='[f326e97e-ab09-4110-8148-5c00343589f5=<VDS, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2016-11-10 09:35:00,064 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (DefaultQuartzScheduler_Worker-64) [5b097bca] Running command: ActivateVdsCommand internal: true. Entities affected :  ID: f326e97e-ab09-4110-8148-5c00343589f5 Type: VDSAction group MANIPULATE_HOST with role type ADMIN
2016-11-10 09:35:00,064 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (DefaultQuartzScheduler_Worker-64) [5b097bca] Before acquiring lock in order to prevent monitoring for host 'lxf101s001' from data-center 'Default'
2016-11-10 09:35:00,064 INFO  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-64) [5b097bca] Failed to acquire lock and wait lock 'EngineLock:{exclusiveLocks='[f326e97e-ab09-4110-8148-5c00343589f5=<VDS_INIT, >]', sharedLocks='null'}'


and plenty of:

[org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-64) [5b097bca] Failed to acquire lock and wait lock 'EngineLock:{exclusiveLocks='[f326e97e-ab09-4110-8148-5c00343589f5=<VDS_INIT, >]', sharedLocks='null'}'

The locking issue could be the cause here.
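
To make the suspected failure mode concrete, here is a minimal standalone sketch (all names are hypothetical; this is not the engine's InMemoryLockManager): one flow holds an exclusive per-host lock indefinitely, so every periodic activation attempt fails, which matches the repeated "Failed to acquire lock and wait lock" lines above.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockContentionSketch {

    // Per-host exclusive locks, keyed by host UUID (loosely analogous
    // to the engine's in-memory VDS_INIT lock seen in the log).
    static final Map<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();

    static ReentrantLock lockFor(String hostId) {
        return LOCKS.computeIfAbsent(hostId, id -> new ReentrantLock());
    }

    public static void main(String[] args) throws InterruptedException {
        String hostId = "f326e97e-ab09-4110-8148-5c00343589f5";

        // Some flow acquires the host's exclusive lock and never releases
        // it -- the kind of stuck holder suspected in this bug.
        Thread holder = new Thread(() -> {
            lockFor(hostId).lock();
            try {
                Thread.sleep(Long.MAX_VALUE); // holds the lock forever
            } catch (InterruptedException ignored) {
            } finally {
                lockFor(hostId).unlock();
            }
        });
        holder.setDaemon(true);
        holder.start();
        Thread.sleep(100); // let the holder grab the lock first

        // Each periodic auto-recovery attempt then fails, matching the
        // repeated "Failed to acquire lock and wait lock" log lines.
        for (int attempt = 1; attempt <= 3; attempt++) {
            if (lockFor(hostId).tryLock(1, TimeUnit.SECONDS)) {
                try {
                    System.out.println("attempt " + attempt + ": host activated");
                } finally {
                    lockFor(hostId).unlock();
                }
            } else {
                System.out.println("attempt " + attempt
                        + ": failed to acquire lock for host " + hostId);
            }
        }
    }
}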

Comment 8 Yaniv Kaul 2017-02-13 18:50:55 UTC
Any news on this?

Comment 9 Martin Perina 2017-03-14 20:58:00 UTC
We were not able to reproduce the issue, nor to understand from the logs how to reproduce it. We suspect there was some race which caused the locking issue, and that's why the non-responding treatment was not executed for the host.

Did it happen only once, or was this issue seen in the customer's environment multiple times? If multiple times, is it possible to also share logs from the other hosts? If not, then I suggest closing this as WORKSFORME.

Comment 10 Lukas Svaty 2017-03-22 09:15:59 UTC
If you are able to reproduce this issue in the future, please attach the following logs so we don't miss it again: engine, server, ui, javascript-console.

Comment 11 Lukas Svaty 2017-03-22 11:03:00 UTC
Please ignore my last comment, wrong bug.

Comment 12 nijin ashok 2017-03-27 15:01:43 UTC
(In reply to Martin Perina from comment #9)
> Did it happen only once, or was this issue seen in the customer's
> environment multiple times? If multiple times, is it possible to also share
> logs from the other hosts? If not, then I suggest closing this as WORKSFORME.

Sorry, I was out of office for a week. I think this has only happened once for the customer. To confirm, I am setting needinfo on Steffen, as he is the customer's TAM, knows the environment, and has more frequent contact with the customer.

Steffen, can you please answer Martin's questions?

Comment 13 Moran Goldboim 2017-04-19 08:05:15 UTC
(In reply to nijin ashok from comment #12)
> (In reply to Martin Perina from comment #9)
> > Did it happen only once, or was this issue seen in the customer's
> > environment multiple times? If multiple times, is it possible to also
> > share logs from the other hosts? If not, then I suggest closing this as
> > WORKSFORME.
> 
> Sorry, I was out of office for a week. I think this has only happened once
> for the customer. To confirm, I am setting needinfo on Steffen, as he is
> the customer's TAM, knows the environment, and has more frequent contact
> with the customer.
> 
> Steffen, can you please answer Martin's questions?

We tried to reproduce this issue internally without success. I'm closing this bug since we don't have the data we need to solve it. Please reopen if the requested data can be collected.