Description of problem:
When the hypervisor went down, it appears to have stayed in the "Connecting" state for more than 30 minutes. Power management is not configured for the host. When running "confirm host has been rebooted", the engine log shows "VDS_CANNOT_CLEAR_VMS_WRONG_STATUS". vdsNotResponding was entered only after 30 minutes; after that, manual fencing worked and the VM was marked as down.

Version-Release number of selected component (if applicable):
rhevm-3.6.9.2-0.1.el6.noarch

How reproducible:
Was observed in customer environment.

Steps to Reproduce:
1. Manually crash the hypervisor so that it is not accessible.
2. Run "confirm host has been rebooted"; it doesn't clear the VMs' status to down.

Actual results:
"confirm host has been rebooted" doesn't clear the VMs' status to down.

Expected results:
"confirm host has been rebooted" should mark the VMs as down.

Additional info:
Michal - when would we get this error? (VDS_CANNOT_CLEAR_VMS_WRONG_STATUS)
When the host is not in the NonResponsive state (and there are VMs to clear). So if it was in e.g. Connecting or any other state, the action is not executed.
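To make the condition above concrete, here is a minimal sketch in Python. It is not the engine's code: the enum, the function name and the `running_vms` parameter are hypothetical, and only the status names and the error code come from this thread.

```python
from enum import Enum
from typing import Optional


class HostStatus(Enum):
    # Only the two statuses named in this thread; the real engine has more.
    CONNECTING = "Connecting"
    NON_RESPONSIVE = "NonResponsive"


def validate_confirm_rebooted(status: HostStatus, running_vms: int) -> Optional[str]:
    """Hypothetical precondition check for "confirm host has been rebooted".

    Returns the error code when the action would be rejected, else None.
    """
    # Rejected when there are VMs to clear and the host is not NonResponsive.
    if running_vms > 0 and status is not HostStatus.NON_RESPONSIVE:
        return "VDS_CANNOT_CLEAR_VMS_WRONG_STATUS"
    return None
```

Under these assumptions, a host stuck in Connecting with running VMs is rejected, which matches what the reporter saw during the first 30 minutes.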
I see that the engine tries to connect during this period of time. I see in the logs:

2016-11-10 09:35:00,046 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (DefaultQuartzScheduler_Worker-64) [] Autorecovering 1 hosts
2016-11-10 09:35:00,046 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (DefaultQuartzScheduler_Worker-64) [] Autorecovering hosts id: f326e97e-ab09-4110-8148-5c00343589f5 , name : lxf101s001
2016-11-10 09:35:00,049 INFO [org.ovirt.engine.core.bll.ActivateVdsCommand] (DefaultQuartzScheduler_Worker-64) [5b097bca] Lock Acquired to object 'EngineLock:{exclusiveLocks='[f326e97e-ab09-4110-8148-5c00343589f5=<VDS, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2016-11-10 09:35:00,064 INFO [org.ovirt.engine.core.bll.ActivateVdsCommand] (DefaultQuartzScheduler_Worker-64) [5b097bca] Running command: ActivateVdsCommand internal: true. Entities affected : ID: f326e97e-ab09-4110-8148-5c00343589f5 Type: VDSAction group MANIPULATE_HOST with role type ADMIN
2016-11-10 09:35:00,064 INFO [org.ovirt.engine.core.bll.ActivateVdsCommand] (DefaultQuartzScheduler_Worker-64) [5b097bca] Before acquiring lock in order to prevent monitoring for host 'lxf101s001' from data-center 'Default'
2016-11-10 09:35:00,064 INFO [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-64) [5b097bca] Failed to acquire lock and wait lock 'EngineLock:{exclusiveLocks='[f326e97e-ab09-4110-8148-5c00343589f5=<VDS_INIT, >]', sharedLocks='null'}'

and plenty of:

[org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-64) [5b097bca] Failed to acquire lock and wait lock 'EngineLock:{exclusiveLocks='[f326e97e-ab09-4110-8148-5c00343589f5=<VDS_INIT, >]', sharedLocks='null'}'

The locking issue could be the cause here.
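To gauge how many of these lock failures occurred, and for which host, the log can be scanned with a short script. This is a rough sketch that assumes the message format matches the lines quoted above; the function name and the regular expression are illustrative and not part of any oVirt tooling.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the "Failed to acquire lock and wait lock" message as quoted in this
# comment and captures the locked object ID and the lock group (e.g. VDS_INIT).
LOCK_FAILURE = re.compile(
    r"Failed to acquire lock and wait lock "
    r"'EngineLock:\{exclusiveLocks='\[([0-9a-f-]+)=<(\w+),"
)


def count_lock_failures(lines: Iterable[str]) -> Counter:
    """Count lock-acquisition failures per (object ID, lock group)."""
    counts: Counter = Counter()
    for line in lines:
        match = LOCK_FAILURE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

It can be fed an open file object for the engine log (the path depends on the installation). A high count for a single host ID under `VDS_INIT` would be consistent with the suspected locking problem, though it would not prove it.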
Any news on this?
We were not able to reproduce the issue, nor to understand from the host logs how to reproduce it. We suspect that there was some race which caused the locking issue, and that's why non-responding treatment was not executed for the host. Did it happen only once, or was this issue seen in the customer's environment more times? If more times, is it possible to share logs from different hosts as well? If not, then I suggest closing this as WORKSFORME.
In case you will be able to reproduce this issue in the future, please attach these logs, so we won't miss it anymore - engine, server, ui, javascript-console
ignore my last comment pls, wrong bug
(In reply to Martin Perina from comment #9)
> Did it happen only once or was this issue seen in customers environment more
> times? If more times, is it possible to share logs also from different
> hosts? If not than I suggest to close this as WORKSFORME.

Sorry, I was out of office for a week. I think this has only happened once for the customer. To confirm, I am setting needinfo to Steffen, as he is the customer's TAM, knows the environment, and has more frequent contact with the customer.

Steffen, can you please answer Martin's questions?
(In reply to nijin ashok from comment #12)
> (In reply to Martin Perina from comment #9)
> > Did it happen only once or was this issue seen in customers environment more
> > times? If more times, is it possible to share logs also from different
> > hosts? If not than I suggest to close this as WORKSFORME.
>
> Sorry, was out office for a week. I think this has only happened once for
> the customer. To confirm, I am setting needinfo to Steffen as he is the TAM
> of customer who knows the environment and have more frequent contact with
> customer.
>
> Steffen, Can you please answer Martin's questions?

We tried to reproduce this issue internally without success. I'm closing this bug since we don't have the data we need to solve it. Please reopen if the requested data can be collected.