| Summary: | Manual fence for hypervisor not working for 30+ minutes when hypervisor went down | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | nijin ashok <nashok> |
| Component: | ovirt-engine | Assignee: | Martin Perina <mperina> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Petr Matyáš <pmatyas> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | ||
| Version: | 3.6.9 | CC: | gklein, lsurette, lsvaty, mgoldboi, michal.skrivanek, mperina, nashok, oourfali, pkliczew, rbalakri, Rhev-m-bugs, sfroemer, srevivo, ykaul |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | All | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2017-04-19 08:05:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
nijin ashok
2016-11-23 12:35:54 UTC
Michal - when would we get this error? (VDS_CANNOT_CLEAR_VMS_WRONG_STATUS)

When the host is not in NonResponsive state (and there are VMs to clear). So if it was e.g. Connecting or anything else, the action is not executed.

I see that the engine tries to connect during this period of time. I see in the logs:
2016-11-10 09:35:00,046 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (DefaultQuartzScheduler_Worker-64) [] Autorecovering 1 hosts
2016-11-10 09:35:00,046 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (DefaultQuartzScheduler_Worker-64) [] Autorecovering hosts id: f326e97e-ab09-4110-8148-5c00343589f5 , name : lxf101s001
2016-11-10 09:35:00,049 INFO [org.ovirt.engine.core.bll.ActivateVdsCommand] (DefaultQuartzScheduler_Worker-64) [5b097bca] Lock Acquired to object 'EngineLock:{exclusiveLocks='[f326e97e-ab09-4110-8148-5c00343589f5=<VDS, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2016-11-10 09:35:00,064 INFO [org.ovirt.engine.core.bll.ActivateVdsCommand] (DefaultQuartzScheduler_Worker-64) [5b097bca] Running command: ActivateVdsCommand internal: true. Entities affected : ID: f326e97e-ab09-4110-8148-5c00343589f5 Type: VDSAction group MANIPULATE_HOST with role type ADMIN
2016-11-10 09:35:00,064 INFO [org.ovirt.engine.core.bll.ActivateVdsCommand] (DefaultQuartzScheduler_Worker-64) [5b097bca] Before acquiring lock in order to prevent monitoring for host 'lxf101s001' from data-center 'Default'
2016-11-10 09:35:00,064 INFO [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-64) [5b097bca] Failed to acquire lock and wait lock 'EngineLock:{exclusiveLocks='[f326e97e-ab09-4110-8148-5c00343589f5=<VDS_INIT, >]', sharedLocks='null'}'
and plenty of:
[org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-64) [5b097bca] Failed to acquire lock and wait lock 'EngineLock:{exclusiveLocks='[f326e97e-ab09-4110-8148-5c00343589f5=<VDS_INIT, >]', sharedLocks='null'}'
The locking issue could be the cause here.
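
To check how often the lock failure quoted above occurs, and for which host and lock group, the log can be scanned with a short script. This is a minimal sketch, assuming the log lines follow the InMemoryLockManager format shown in this comment; the function name `count_lock_failures` and the regular expression are illustrative and not part of ovirt-engine.

```python
import re
from collections import Counter

# Pattern derived from the InMemoryLockManager message quoted in this bug.
# It captures the locked entity id and the lock group (e.g. VDS_INIT).
LOCK_FAILURE = re.compile(
    r"Failed to acquire lock and wait lock "
    r"'EngineLock:\{exclusiveLocks='\[(?P<entity>[0-9a-f-]+)=<(?P<group>\w+),"
)


def count_lock_failures(lines):
    """Count lock-acquisition failures per (entity id, lock group)."""
    counts = Counter()
    for line in lines:
        match = LOCK_FAILURE.search(line)
        if match:
            counts[(match.group("entity"), match.group("group"))] += 1
    return counts
```

The function accepts any iterable of lines, so an open file handle for the engine log can be passed directly. A high count for a single host id under `VDS_INIT` would be consistent with the suspicion that host monitoring was blocked by a lock that was never released.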
Any news on this?

We were not able to reproduce the issue, nor could we understand from the host logs how to reproduce it. We suspect that there was some race which caused the locking issue, and that is why non-responding treatment was not executed for the host. Did it happen only once, or was this issue seen in the customer's environment more times? If more times, is it possible to share logs from different hosts as well? If not, then I suggest closing this as WORKSFORME.

In case you are able to reproduce this issue in the future, please attach these logs so we won't miss it anymore: engine, server, ui, javascript-console

Ignore my last comment please, wrong bug.

(In reply to Martin Perina from comment #9)
> Did it happen only once or was this issue seen in customers environment more
> times? If more times, is it possible to share logs also from different
> hosts? If not than I suggest to close this as WORKSFORME.

Sorry, I was out of office for a week. I think this has happened only once for the customer. To confirm, I am setting needinfo to Steffen, as he is the customer's TAM, knows the environment, and has more frequent contact with the customer.

Steffen, can you please answer Martin's questions?

(In reply to nijin ashok from comment #12)
> (In reply to Martin Perina from comment #9)
> > Did it happen only once or was this issue seen in customers environment more
> > times? If more times, is it possible to share logs also from different
> > hosts? If not than I suggest to close this as WORKSFORME.
>
> Sorry, was out office for a week. I think this has only happened once for
> the customer. To confirm, I am setting needinfo to Steffen as he is the TAM
> of customer who knows the environment and have more frequent contact with
> customer.
>
> Steffen, Can you please answer Martin's questions?

We tried to reproduce this issue internally without success. I'm closing this bug since we don't have the data we need to solve it. Please reopen if the requested data can be collected.