Bug 1397830 - Manual fence for hypervisor not working for 30+ minutes when hypervisor went down
Summary: Manual fence for hypervisor not working for 30+ minutes when hypervisor went down
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.6.9
Hardware: All
OS: All
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Martin Perina
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-11-23 12:35 UTC by nijin ashok
Modified: 2020-03-11 15:24 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-19 08:05:15 UTC
oVirt Team: Infra
Target Upstream Version:



Description nijin ashok 2016-11-23 12:35:54 UTC
Description of problem:

When the hypervisor went down, it stayed in the "Connecting" state for more than 30 minutes. Power management is not configured for the host. While executing "confirm host has been rebooted", the engine log showed "VDS_CANNOT_CLEAR_VMS_WRONG_STATUS". The host entered the vdsNotResponding treatment only after 30 minutes; after that, manual fencing worked and the VMs were marked as Down.

Version-Release number of selected component (if applicable):

rhevm-3.6.9.2-0.1.el6.noarch


How reproducible:

Observed once in a customer environment.

Steps to Reproduce:

1. Manually crash the hypervisor so that it is not accessible. 

2. "confirm host has been rebooted" doesn't clear the VMs status to down.


Actual results:

"confirm host has been rebooted" doesn't clear the VMs status to down.


Expected results:

"confirm host has been rebooted" should make the VMs down.

Additional info:

Comment 3 Oved Ourfali 2016-11-30 09:54:35 UTC
Michal - when would we get this error?
(VDS_CANNOT_CLEAR_VMS_WRONG_STATUS)

Comment 4 Michal Skrivanek 2016-11-30 10:35:50 UTC
It occurs when the host is not in the NonResponsive state (and there are VMs to clear). So if the host was in, e.g., the Connecting state or any other state, the action is not executed.
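
A minimal Java sketch of that status gate, for illustration only. The class and method names below are assumptions, not the actual ovirt-engine code; it just shows why a host stuck in Connecting blocks the action:

// Hypothetical sketch of the validation described above; names are
// illustrative, not the real engine classes.
public class ConfirmRebootGateSketch {

    enum VdsStatus { Up, Connecting, NonResponsive, Down }

    // "Confirm host has been rebooted" is only allowed when the host is
    // NonResponsive; any other status (e.g. Connecting) is rejected with
    // VDS_CANNOT_CLEAR_VMS_WRONG_STATUS.
    static boolean canClearVms(VdsStatus hostStatus, int vmsToClear) {
        if (vmsToClear > 0 && hostStatus != VdsStatus.NonResponsive) {
            System.out.println("VDS_CANNOT_CLEAR_VMS_WRONG_STATUS");
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Host stuck in Connecting, as in this report: the action is refused.
        System.out.println(canClearVms(VdsStatus.Connecting, 3));     // false
        System.out.println(canClearVms(VdsStatus.NonResponsive, 3));  // true
    }
}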

Comment 7 Piotr Kliczewski 2016-12-06 09:37:02 UTC
I see that the engine keeps trying to connect during this period. The logs show:

2016-11-10 09:35:00,046 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (DefaultQuartzScheduler_Worker-64) [] Autorecovering 1 hosts
2016-11-10 09:35:00,046 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (DefaultQuartzScheduler_Worker-64) [] Autorecovering hosts id: f326e97e-ab09-4110-8148-5c00343589f5 , name : lxf101s001
2016-11-10 09:35:00,049 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (DefaultQuartzScheduler_Worker-64) [5b097bca] Lock Acquired to object 'EngineLock:{exclusiveLocks='[f326e97e-ab09-4110-8148-5c00343589f5=<VDS, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2016-11-10 09:35:00,064 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (DefaultQuartzScheduler_Worker-64) [5b097bca] Running command: ActivateVdsCommand internal: true. Entities affected :  ID: f326e97e-ab09-4110-8148-5c00343589f5 Type: VDSAction group MANIPULATE_HOST with role type ADMIN
2016-11-10 09:35:00,064 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (DefaultQuartzScheduler_Worker-64) [5b097bca] Before acquiring lock in order to prevent monitoring for host 'lxf101s001' from data-center 'Default'
2016-11-10 09:35:00,064 INFO  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-64) [5b097bca] Failed to acquire lock and wait lock 'EngineLock:{exclusiveLocks='[f326e97e-ab09-4110-8148-5c00343589f5=<VDS_INIT, >]', sharedLocks='null'}'


and plenty of:

[org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-64) [5b097bca] Failed to acquire lock and wait lock 'EngineLock:{exclusiveLocks='[f326e97e-ab09-4110-8148-5c00343589f5=<VDS_INIT, >]', sharedLocks='null'}'

The locking issue could be the cause here.
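
For illustration, here is a minimal Java sketch of the kind of exclusive-lock contention these log lines suggest. The real engine uses an InMemoryLockManager keyed by entity id plus lock group (e.g. VDS_INIT); everything below is a simplified assumption, not the engine source:

// Toy model of exclusive-lock contention; not the real InMemoryLockManager.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LockContentionSketch {
    private final Map<String, String> exclusiveLocks = new ConcurrentHashMap<>();

    // Returns true only if nobody holds the key yet (no waiting here,
    // unlike the engine's "acquire lock and wait" variant).
    boolean tryAcquire(String key, String owner) {
        return exclusiveLocks.putIfAbsent(key, owner) == null;
    }

    void release(String key, String owner) {
        exclusiveLocks.remove(key, owner);
    }

    public static void main(String[] args) {
        LockContentionSketch lm = new LockContentionSketch();
        String key = "f326e97e-ab09-4110-8148-5c00343589f5:VDS_INIT";

        // A stuck recovery flow acquires the lock and never releases it...
        lm.tryAcquire(key, "AutoRecovery/ActivateVdsCommand");

        // ...so the non-responding treatment keeps failing to acquire it,
        // matching the repeated "Failed to acquire lock and wait lock" lines.
        System.out.println(lm.tryAcquire(key, "VdsNotRespondingTreatment")); // false
    }
}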

Comment 8 Yaniv Kaul 2017-02-13 18:50:55 UTC
Any news on this?

Comment 9 Martin Perina 2017-03-14 20:58:00 UTC
We were not able to reproduce the issue, nor could we work out from the logs how to reproduce it. We suspect there was a race that caused the locking issue, which is why the non-responding treatment was not executed for the host.

Did it happen only once, or has this issue been seen in the customer's environment multiple times? If multiple times, is it possible to also share logs from the other hosts? If not, then I suggest closing this as WORKSFORME.

Comment 10 Lukas Svaty 2017-03-22 09:15:59 UTC
If you are able to reproduce this issue in the future, please attach the following logs so we don't miss anything: engine, server, ui, javascript-console.

Comment 11 Lukas Svaty 2017-03-22 11:03:00 UTC
Please ignore my last comment, wrong bug.

Comment 12 nijin ashok 2017-03-27 15:01:43 UTC
(In reply to Martin Perina from comment #9)
> Did it happen only once, or has this issue been seen in the customer's
> environment multiple times? If multiple times, is it possible to also share
> logs from the other hosts? If not, then I suggest closing this as WORKSFORME.

Sorry, I was out of the office for a week. I think this has only happened once for the customer. To confirm, I am setting needinfo on Steffen, as he is the customer's TAM, knows the environment, and has more frequent contact with the customer.

Steffen, can you please answer Martin's questions?

Comment 13 Moran Goldboim 2017-04-19 08:05:15 UTC
(In reply to nijin ashok from comment #12)
> (In reply to Martin Perina from comment #9)
> > Did it happen only once, or has this issue been seen in the customer's
> > environment multiple times? If multiple times, is it possible to also
> > share logs from the other hosts? If not, then I suggest closing this as
> > WORKSFORME.
> 
> Sorry, I was out of the office for a week. I think this has only happened
> once for the customer. To confirm, I am setting needinfo on Steffen, as he
> is the customer's TAM, knows the environment, and has more frequent contact
> with the customer.
> 
> Steffen, can you please answer Martin's questions?

We tried to reproduce this issue internally without success. I'm closing this bug since we don't have the data we need to solve it. Please reopen if the requested data can be collected.

