Created attachment 746843 [details]
logs

Description of problem:
In a two-host cluster with multiple iSCSI domains, put the HSM host in maintenance, block connectivity from the HSM host to the domains, then activate the host.

Version-Release number of selected component (if applicable):
sf16

How reproducible:
100%

Steps to Reproduce:
1. In a two-host cluster with multiple iSCSI storage domains, put the HSM host in maintenance
2. Block connectivity to the storage from the HSM host
3. Activate the host

Actual results:
The host does not change status from Maintenance to Unassigned (the user thinks nothing is happening). If we try to activate the host again, we get a CanDoAction failure saying the same action is already in progress.

Expected results:
The host should change state to Unassigned.

Additional info: logs

2013-05-12 13:55:30,758 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-45) No string for UNASSIGNED type. Use default Log
2013-05-12 13:55:32,135 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (ajp-/127.0.0.1:8702-2) Lock Acquired to object EngineLock [exclusiveLocks= key: 2982e993-2ca5-42bb-86ed-8db10986c47e value: VDS , sharedLocks= ]
2013-05-12 13:55:32,158 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (pool-4-thread-47) [491d7e8a] Running command: ActivateVdsCommand internal: false. Entities affected : ID: 2982e993-2ca5-42bb-86ed-8db10986c47e Type: VDS
2013-05-12 13:55:32,173 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-4-thread-47) [491d7e8a] START, SetVdsStatusVDSCommand(HostName = gold-vdsc, HostId = 2982e993-2ca5-42bb-86ed-8db10986c47e, status=Unassigned, nonOperationalReason=NONE), log id: 7779db55
2013-05-12 13:55:34,248 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-54) No string for UNASSIGNED type. Use default Log
2013-05-12 13:55:34,252 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-54) No string for UNASSIGNED type. Use default Log
2013-05-12 13:55:35,625 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (ajp-/127.0.0.1:8702-9) Failed to Acquire Lock to object EngineLock [exclusiveLocks= key: 2982e993-2ca5-42bb-86ed-8db10986c47e value: VDS , sharedLocks= ]
2013-05-12 13:55:35,626 WARN  [org.ovirt.engine.core.bll.ActivateVdsCommand] (ajp-/127.0.0.1:8702-9) CanDoAction of action ActivateVds failed. Reasons:VAR__ACTION__ACTIVATE,VAR__TYPE__HOST,ACTION_TYPE_FAILED_OBJECT_LOCKED
2013-05-12 13:55:40,879 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-64) No string for UNASSIGNED type. Use default Log
2013-05-12 13:55:44,321 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-75) No string for UNASSIGNED type. Use default Log
2013-05-12 13:55:44,324 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-75) No string for UNASSIGNED type. Use default Log
2013-05-12 13:55:51,020 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-84) No string for UNASSIGNED type. Use default Log
2013-05-12 13:55:54,454 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-93) No string for UNASSIGNED type. Use default Log
2013-05-12 13:55:54,457 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-93) No string for UNASSIGNED type. Use default Log
2013-05-12 13:55:59,410 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (ajp-/127.0.0.1:8702-2) Failed to Acquire Lock to object EngineLock [exclusiveLocks= key: 2982e993-2ca5-42bb-86ed-8db10986c47e value: VDS , sharedLocks= ]
2013-05-12 13:55:59,410 WARN  [org.ovirt.engine.core.bll.ActivateVdsCommand] (ajp-/127.0.0.1:8702-2) CanDoAction of action ActivateVds failed. Reasons:VAR__ACTION__ACTIVATE,VAR__TYPE__HOST,ACTION_TYPE_FAILED_OBJECT_LOCKED
2013-05-12 13:56:01,099 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-4) No string for UNASSIGNED type. Use default Log
2013-05-12 13:56:04,527 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-14) No string for UNASSIGNED type. Use default Log
2013-05-12 13:56:04,531 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-14) No string for UNASSIGNED type. Use default Log
2013-05-12 13:56:11,295 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-13) [5ea40e1d] No string for UNASSIGNED type. Use default Log
From what I saw in the code, ActivateVdsCommand sends the SetVdsStatus VDS command with Unassigned (this can be seen in the log as well). However, SetVdsStatusVdsCommand does not update the database. From what I saw, this happens if the vdsManager is null for the VDS (looking at ResourceManager, it returns a null VDS manager for a VDS that does not exist in _vdsManagersDict). Not sure why it is not there - AddVds should have added it to the dictionary, but who removed it? Michael, what are your thoughts on this? Thanks
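The suspected failure mode above can be sketched in a few lines of Java. This is an illustration only, not the actual engine source: the class, method, and map names below are made up, standing in for ResourceManager, _vdsManagersDict, and SetVdsStatusVdsCommand from the comment above.

```java
import java.util.HashMap;
import java.util.Map;

public class VdsStatusSketch {
    public enum Status { MAINTENANCE, UNASSIGNED }

    // Stands in for _vdsManagersDict: host id -> current status.
    static final Map<String, Status> managers = new HashMap<>();

    // Mirrors the suspected bug: if the manager entry for this host is
    // missing, the status update is silently skipped, so the DB (here, the
    // map) never reflects the new status and the caller gets no error.
    public static boolean setVdsStatus(String vdsId, Status status) {
        if (!managers.containsKey(vdsId)) {
            return false; // no VdsManager in the dictionary -> update dropped
        }
        managers.put(vdsId, status);
        return true;
    }
}
```

With the entry missing, the call reports failure and the host stays in its old status, matching the observed "stuck in Maintenance" behavior.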
After talking with Dafna and Haim - we don't think this is a blocker for RC.
By Roy Golan: We could not reproduce this problem. Can you please double check it is still happening in your environment ?
*** Bug 963546 has been marked as a duplicate of this bug. ***
Same scenario reproduced on RHEVM 3.3 - IS11 environment:

RHEVM: rhevm-3.3.0-0.16.master.el6ev.noarch
PythonSDK: rhevm-sdk-python-3.3.0.11-1.el6ev.noarch
VDSM: vdsm-4.12.0-72.git287bb7e.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
SANLOCK: sanlock-2.8-1.el6.x86_64

How reproducible:
unknown

Steps to Reproduce (based on BZ1002049):
1. Create an iSCSI Data Center (DC) with two Hosts connected to 3 Storage Domains (SDs)
   * On the first storage server (EMC) - one SD
   * On the second storage server (XIO) - two SDs
   See diagram below:
------------------------------------------------------------------
[V] Host_01 (SPM) _____ connected _________ SD_01 (EMC)
[V] Host_02 _______|               |_______ SD_02 (XIO)
                                   |_______ SD_03 (XIO) - Master
------------------------------------------------------------------
2. From both Hosts, block connectivity to SD_01 via iptables
3. Status of the whole environment:
   Host_01 - Unassigned
   Host_02 (SPM) - UP
   SD_01 - Active
   SD_02 - Active
   SD_03 - Active
4. Remove the iptables rules

Actual results:
Host is stuck forever in "Unassigned" mode

Impact on user:
The rest of the environment continues to work normally.

Workaround:
Restart the "ovirt-engine" service

Additional info:
Rebooting the host doesn't help.
The "Confirm 'Host has been rebooted'" option doesn't work.

/var/log/ovirt-engine/engine.log
2013-08-28 15:05:18,322 WARN  [org.ovirt.engine.core.bll.storage.FenceVdsManualyCommand] (ajp-/127.0.0.1:8702-4) [32c8a243] CanDoAction of action FenceVdsManualy failed. Reasons :VAR__TYPE__HOST,VAR__ACTION__MANUAL_FENCE,ACTION_TYPE_FAILED_VDS_NOT_MATCH_VALID_STATUS
2013-08-28 15:05:18,322 INFO  [org.ovirt.engine.core.bll.storage.FenceVdsManualyCommand] (ajp-/127.0.0.1:8702-4) [32c8a243] Lock freed to object EngineLock [exclusiveLocks= key: 9576d8ca-4466-46e6-bebc-ccd922075ac6 value: VDS_FENCE , sharedLocks= ]

/var/log/vdsm/vdsm.log
I was able to reproduce it only after putting one host into maintenance and then activating it. Then I saw my engine fail to acquire the monitoring lock (VDS_INIT) because it wasn't released by the previous monitoring cycle.

My host id starts with 2b294:

2013-09-09 09:48:15,275 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-69) Before acquiring lock EngineLock [exclusiveLocks= key: 2b294ef6-7e1c-404c-94b5-5b9529a48c64 value: VDS_INIT , sharedLocks= ]
2013-09-09 09:48:15,277 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-69) Successed acquiring lock EngineLock [exclusiveLocks= key: 2b294ef6-7e1c-404c-94b5-5b9529a48c64 value: VDS_INIT , sharedLocks= ] succeeded

******* there is no log for releasing the lock for 2b294 *******
******* 30 seconds later *******

2013-09-09 09:48:48,299 DEBUG [org.ovirt.engine.core.bll.gluster.GlusterSyncJob] (DefaultQuartzScheduler_Worker-60) Refreshing Gluster Data [lightweight]
2013-09-09 09:48:48,297 DEBUG [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (DefaultQuartzScheduler_Worker-79) Load Balancer timer entered.
2013-09-09 09:48:48,313 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-69) Before acquiring lock EngineLock [exclusiveLocks= key: 2b294ef6-7e1c-404c-94b5-5b9529a48c64 value: VDS_INIT , sharedLocks= ]
2013-09-09 09:48:48,327 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-70) Before releasing a lock EngineLock [exclusiveLocks= key: 4a581347-0f48-4fbc-a64f-1ca7d4b5bd01 value: VDS_INIT , sharedLocks= ]
2013-09-09 09:48:48,330 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-69) Failed to acquire lock. Exclusive lock is taken for key: 2b294ef6-7e1c-404c-94b5-5b9529a48c64 , value: VDS_INIT

I'm trying to figure out now how the lock wasn't released; the logs show nothing special.
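The lock-leak pattern described above can be sketched with a minimal in-memory exclusive lock. This is an illustration of the general acquire/release contract, not the real InMemoryLockManager code; the class and method names are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class LockSketch {
    // Stands in for the exclusive lock table keyed by host id + VDS_INIT.
    static final Set<String> exclusiveLocks = ConcurrentHashMap.newKeySet();

    // Returns false if the lock is already held, which is exactly the
    // "Failed to acquire lock. Exclusive lock is taken" line in the log.
    public static boolean acquire(String key) {
        return exclusiveLocks.add(key);
    }

    public static void release(String key) {
        exclusiveLocks.remove(key);
    }

    // A monitoring cycle that releases in a finally block never leaks the
    // lock; a cycle that skips release() on an exception path would leave
    // the key held and starve every subsequent cycle.
    public static boolean runCycle(String key, Runnable body) {
        if (!acquire(key)) {
            return false;
        }
        try {
            body.run();
        } finally {
            release(key);
        }
        return true;
    }
}
```

If the previous cycle for host 2b294 had followed the `runCycle` shape, the 30-seconds-later acquisition would have succeeded; the missing "releasing" log line suggests it did not.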
Vlad, I need a reproducer with logs (there's a problem with the logs attached): engine.log set to DEBUG, server.log, and a thread dump. I'm still having problems reproducing this.
Vlad will update
Due to a network problem and BZ1021561, I can't add the relevant logs right now; next week I will run the same test and add the relevant logs in DEBUG mode.
I reproduced the scenario. Attached engine.log in DEBUG mode.

RHEVM 3.3 - IS20 environment:

Host OS: RHEL 6.5

RHEVM: rhevm-3.3.0-0.28.beta1.el6ev.noarch
PythonSDK: rhevm-sdk-python-3.3.0.17-1.el6ev.noarch
VDSM: vdsm-4.13.0-0.5.beta1.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-29.el6.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.414.el6.x86_64
SANLOCK: sanlock-2.8-1.el6.x86_64
Created attachment 816848 [details] ## Logs rhevm, vdsm, libvirt, thread dump, superVdsm
tigris02 is stuck in Unassigned because the monitoring task (VdsRuntimeInfo) fails to re-schedule itself: the Quartz scheduler EJB is flagged as shut down. EJB components shut down when the server is going down or the ear is undeployed, but here this seems to happen without any intervention. I have also seen a "DestroyJavaVM" thread in the thread dump, which is also something that appears on shutdown and, I think, when a classloader is unloading (maybe during undeploy).

Juan, have you seen this behaviour before?

*** here's what the log says when trying to schedule ***

2013-10-28 17:11:19,221 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-81) EJBComponentUnavailableException: JBAS014559: Invocation cannot proceed as component is shutting down:
org.jboss.as.ejb3.component.EJBComponentUnavailableException: JBAS014559: Invocation cannot proceed as component is shutting down
    at org.jboss.as.ejb3.component.interceptors.ShutDownInterceptorFactory$1.processInvocation(ShutDownInterceptorFactory.java:59) [jboss-as-ejb3.jar:7.3.0.Final-redhat-6]
    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.2.Final-redhat-1]
    at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59) [jboss-as-ejb3.jar:7.3.0.Final-redhat-6]
    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.2.Final-redhat-1]
    at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50) [jboss-as-ee.jar:7.3.0.Final-redhat-6]
    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.2.Final-redhat-1]
    at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45) [jboss-as-ee.jar:7.3.0.Final-redhat-6]
    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.2.Final-redhat-1]
    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.2.Final-redhat-1]
    at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165) [jboss-as-ee.jar:7.3.0.Final-redhat-6]
    at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:182) [jboss-as-ee.jar:7.3.0.Final-redhat-6]
    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.2.Final-redhat-1]
    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.2.Final-redhat-1]
    at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:72) [jboss-as-ee.jar:7.3.0.Final-redhat-6]
    at org.ovirt.engine.core.common.businessentities.IVdsEventListener$$$view6.addExternallyManagedVms(Unknown Source)
    at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.saveDataToDb(VdsUpdateRunTimeInfo.java:173) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.Refresh(VdsUpdateRunTimeInfo.java:361) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsManager.OnTimer(VdsManager.java:237) [vdsbroker.jar:]
    at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source) [:1.7.0_45]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_45]
    at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_45]
    at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [scheduler.jar:]
    at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]

2013-10-28 17:11:19,415 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler_Worker-83) failed to reschedule the job:
org.quartz.SchedulerException: The Scheduler has been shutdown.
    at org.quartz.core.QuartzScheduler.validateState(QuartzScheduler.java:749) [quartz.jar:]
    at org.quartz.core.QuartzScheduler.rescheduleJob(QuartzScheduler.java:1060) [quartz.jar:]
    at org.quartz.impl.StdScheduler.rescheduleJob(StdScheduler.java:312) [quartz.jar:]
    at org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl.rescheduleAJob(SchedulerUtilQuartzImpl.java:331) [scheduler.jar:]
    at org.ovirt.engine.core.utils.timer.FixedDelayJobListener.jobWasExecuted(FixedDelayJobListener.java:91) [scheduler.jar:]
    at org.quartz.core.QuartzScheduler.notifyJobListenersWasExecuted(QuartzScheduler.java:1936) [quartz.jar:]
    at org.quartz.core.JobRunShell.notifyJobListenersComplete(JobRunShell.java:361) [quartz.jar:]
    at org.quartz.core.JobRunShell.run(JobRunShell.java:235) [quartz.jar:]
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]
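The failure mode in the trace above can be illustrated with the JDK executor API instead of Quartz (a sketch, not the engine's actual scheduler code; the class and method names below are made up): once the scheduler is shut down, any attempt to (re)schedule the monitoring job is refused, so the job simply stops running.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class RescheduleSketch {

    // Mirrors the condition behind "The Scheduler has been shutdown.":
    // scheduling against a stopped scheduler cannot succeed.
    public static boolean tryReschedule(ExecutorService scheduler, Runnable job) {
        if (scheduler.isShutdown()) {
            return false;
        }
        try {
            scheduler.submit(job);
            return true;
        } catch (RejectedExecutionException e) {
            return false; // shutdown raced with the isShutdown() check
        }
    }

    // Demonstrates both paths: scheduling works before shutdown, fails after.
    public static String demo() {
        ExecutorService scheduler = Executors.newSingleThreadExecutor();
        boolean before = tryReschedule(scheduler, () -> { });
        scheduler.shutdown();
        boolean after = tryReschedule(scheduler, () -> { });
        return before + "," + after;
    }
}
```

This is why the host never leaves Unassigned: the cycle that would have re-evaluated its status is never scheduled again.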
JBoss is restarting, so ignore comment 16; this is expected.
(In reply to vvyazmin from comment #14)
> I reproduce scenario.
> Attached engine.log in DEBUG mode
>
> RHEVM 3.3 - IS20 environment:
>
> Host OS: RHEL 6.5
>
> RHEVM: rhevm-3.3.0-0.28.beta1.el6ev.noarch
> PythonSDK: rhevm-sdk-python-3.3.0.17-1.el6ev.noarch
> VDSM: vdsm-4.13.0-0.5.beta1.el6ev.x86_64
> LIBVIRT: libvirt-0.10.2-29.el6.x86_64
> QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.414.el6.x86_64
> SANLOCK: sanlock-2.8-1.el6.x86_64

According to comments #16 and #17 it looks like you restarted the engine when trying to reproduce the issue. Is this correct?
No, I did not manually restart the engine at any point (I would have noted it in the reproduction steps).
(In reply to Dafna Ron from comment #19)
> no. I did not manually restart engine at any point (I would write it down in
> the reproduction).

Sorry Dafna (I know you haven't) - the needinfo was targeted at Vlad.

Vlad, can you please answer comment #18?
This is a real issue, as I suspected: the engine unloads the EJB ear, and this leads to failures when getting beans to react. We need to see why it happens and how this differs from a manual restart of the engine, which shouldn't change much here.

Juan, since this seems to be a real issue, have you seen JBoss reloading the ear in some cases without manual intervention?
The only situation where JBoss will reload an application is if the deployment marker file is touched. In our case it is "/var/lib/ovirt-engine/deployments/engine.ear.deployed". Is this file being touched manually or by some automatic process?

To check if this is the issue, I would suggest disabling scanning after the initial startup: in "/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.xml.in" change "scan-interval" to 0, then restart the engine and repeat the test.

I would also suggest checking the timestamp of the "/var/lib/ovirt-engine/deployments/engine.ear.deployed" file right after restarting the engine, before running the test, and after running the test. It shouldn't change. If it does, then we need to know what is touching it.
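The timestamp check suggested above can be sketched in Java. This is a hypothetical diagnostic helper (the class and method names are made up); the demo below uses a temp file standing in for the real marker at /var/lib/ovirt-engine/deployments/engine.ear.deployed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

public class MarkerCheck {

    public static FileTime timestampOf(Path marker) throws IOException {
        return Files.getLastModifiedTime(marker);
    }

    // True if the marker was touched between the two reads, which is the
    // signal the deployment scanner uses to decide to redeploy the ear.
    public static boolean wasTouched(FileTime before, FileTime after) {
        return !before.equals(after);
    }

    // Self-contained demonstration: the marker is untouched at first,
    // then explicitly touched, and the comparison detects it.
    public static boolean demo() {
        try {
            Path marker = Files.createTempFile("engine.ear", ".deployed");
            FileTime before = timestampOf(marker);
            boolean untouched = !wasTouched(before, timestampOf(marker));
            Files.setLastModifiedTime(marker, FileTime.fromMillis(before.toMillis() + 5000));
            boolean touched = wasTouched(before, timestampOf(marker));
            Files.delete(marker);
            return untouched && touched;
        } catch (IOException e) {
            return false;
        }
    }
}
```

Note that, as the following comments point out, a content checksum is useless here because the marker file is always empty; only the modification time carries information.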
After disabling the scanning ("scan-interval" set to 0) and rerunning the test, the host became "Unassigned". Running cksum on "/var/lib/ovirt-engine/deployments/engine.ear.deployed" returns the same output before and after the test.
Looks like the scanning has an effect; we should probably disable it, as we don't use it in production environments (it may be useful in development environments). The checksum (the output of the cksum command) isn't useful in this case, as the content of the file never changes: it is always empty. I suggest repeating the test with scanning enabled, as originally installed, and then checking the timestamp of the .deployed file. You can check the timestamp with the "ls -l" or "stat" commands.
I set scan-interval to 5000 (the default) and reran the test, but again the host is "Unassigned".

I am using is21.
(In reply to Aharon Canan from comment #25)
> I set scan-intreval set to 5000 (like default)
> rerun the test but again the host is "unassigned
>
> I am using is21

Please also provide the results when scanning is disabled, to see whether that caused it.
I will be able to provide it this coming Wednesday.
As I wrote in comment #23, I am also attaching the logs.
Created attachment 823877 [details] logs
The state changed to Unassigned and then to Non-Operational.
So, as far as I understand, this is exactly what should happen; so what is the bug?
Barak,

I probably didn't manage to reproduce it, but following comment #21 there is a real issue here.

Please let me know if you need more info.
(In reply to Aharon Canan from comment #33)
> I probably didn't manage to reproduce,
> but following comment #21 there is real issue here.

The phenomenon observed in comment #21 is general JBoss behavior and was not reproduced either; see comment #22 for the explanation. However, if there were a real issue there we would have encountered it many times, so it looks like an environment-specific issue (that has nothing to do with the bug description).

> please let me know if you need more info.

Moving this bug to CLOSED WORKSFORME.
I can't see anything special in the logs. The situation is that you have a host with no access to the storage domain, the host is in maintenance mode and doesn't switch mode to unassigned?
(In reply to Liran Zelkha from comment #39)
> I can't see anything special in the logs. The situation is that you have a
> host with no access to the storage domain, the host is in maintenance mode
> and doesn't switch mode to unassigned?

The hosts are stuck in Unassigned mode (the admin tried to activate them, but one of the SDs is blocked by iptables).
But if the SD is not available you can't start the hosts. So even after the SD is up the host is unassigned?
(In reply to Liran Zelkha from comment #41)
> But if the SD is not available you can't start the hosts. So even after the
> SD is up the host is unassigned?

The host must become Non-Operational (not stay stuck in Unassigned), shouldn't it? Both the maintenance and activate buttons are unavailable while a host is in the Unassigned state. I don't think this is expected.
From the scenario I'm testing here, I see that a host is NonOperational, and when I activate it, it becomes unassigned for 1-2 minutes, and then switches back to NonOperational. Is that the scenario you have? Can you send the host name, so I'll know what to look for in the logs? Should the fix be that hosts that are unassigned can be moved to maintenance mode, or should the fix be that NonOperational-->Activate failed-->NonOperational (and not to unassigned)?
(In reply to Liran Zelkha from comment #43)
> From the scenario I'm testing here, I see that a host is NonOperational, and
> when I activate it, it becomes unassigned for 1-2 minutes, and then switches
> back to NonOperational. Is that the scenario you have?

The above is standard behaviour. If this is the case then this bug should be CLOSED NOTABUG.

> Can you send the host name, so I'll know what to look for in the logs?
> Should the fix be that hosts that are unassigned can be moved to maintenance
> mode,

If this is a transient status, then the above should not be relevant.

> or should the fix be that NonOperational-->Activate
> failed-->NonOperational (and not to unassigned)?

Unassigned happens due to the host recovery mechanism (which moves the host to Unassigned).

Roman & Pavel - could you please check whether "stuck in unassigned" is transient for 2-3 minutes, or stuck forever?
(In reply to Barak from comment #44)
> Roman & Pavel - could you please check whether "stuck in unassigned" is
> transient for 2-3 minutes ? or stck forever ?

In this case the hypervisors got stuck forever - all of them; only one was up.
Hi Roman, Can you please send the host name that was stuck? Or the one that wasn't stuck - just so I can trace it in the logs?
Can we get the engine log from the point in time when it happened, for a specific host?
(In reply to Barak from comment #48)
> can we get the engine log from the point in time it had happened ?
> to a specific host ?

The only logs we have are the ones attached to this case.
This bug is referenced in ovirt-engine-3.4.0-beta3 logs. Moving to ON_QA
Checking the code and logs again, it seems that VdsUpdateRuntimeInfo does "understand" that the host should be moved to NonOperational, but some flag in VdsManager is set to true, so the host is not actually updated. I'll try to fix that.
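The behavior described here can be sketched as a guard flag suppressing a status commit. This is a hypothetical illustration only (the names below are invented; the real flag in VdsManager is not named in this comment):

```java
public class MonitorGate {
    public enum Status { UNASSIGNED, NON_OPERATIONAL }

    // The monitor computes the right target status, but when the guard
    // flag is set the commit is skipped and the host keeps its old status,
    // matching the "stuck in Unassigned" symptom.
    public static Status refresh(boolean updateSuppressed, Status current) {
        Status target = Status.NON_OPERATIONAL; // what the monitor decided
        if (updateSuppressed) {
            return current; // flag set to true -> status never written
        }
        return target;
    }
}
```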
3.4.0-0.3.master.el6ev. The host moves to Non-Operational.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2014-0506.html