Bug 962180 - engine: host stuck on Unassigned when moving from status Maintenance when storage is not available from the host
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.4.0
Assigned To: Liran Zelkha
Leonid Natapov
infra
Keywords: Reopened, ZStream
Duplicates: 963546 (view as bug list)
Depends On:
Blocks: 902971 1019461 1068926 rhev3.4beta 1142926
Reported: 2013-05-12 06:58 EDT by Dafna Ron
Modified: 2016-02-10 14:00 EST (History)
20 users

See Also:
Fixed In Version: ovirt-3.4.0-beta3
Doc Type: Bug Fix
Doc Text:
Previously, a host moving from maintenance to active when storage domains are not available would become stuck in Unassigned mode. Now, the specific scenario in 'VdsUpdateRuntimeInfo' that did not save NonOperational status to the database has been resolved, and the host moves successfully from Unassigned to Non-Operational.
Story Points: ---
Clone Of:
: 1068926 (view as bug list)
Environment:
Last Closed: 2014-06-09 10:59:06 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
logs (542.61 KB, application/x-gzip)
2013-05-12 06:58 EDT, Dafna Ron
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm (5.75 MB, application/x-gzip)
2013-10-28 11:50 EDT, vvyazmin@redhat.com
logs (2.20 MB, application/x-gzip)
2013-11-14 06:20 EST, Aharon Canan


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 24969 None None None Never
oVirt gerrit 25298 None None None Never

Description Dafna Ron 2013-05-12 06:58:54 EDT
Created attachment 746843 [details]
logs

Description of problem:

In a two-host cluster with multiple iSCSI domains, put the HSM host in maintenance.
Block connectivity from the HSM host to the domains.
Activate the host.

Version-Release number of selected component (if applicable):

sf16

How reproducible:

100%

Steps to Reproduce:
1. In a two-host cluster with multiple iSCSI storage domains, put the HSM host in maintenance
2. Block connectivity to the storage from the HSM host
3. Activate the host
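Step 2 is typically done with iptables on the HSM host. A minimal sketch - the portal IP below is hypothetical (substitute the real storage server address), and the commands are only printed rather than executed so the sketch is safe to run anywhere:

```shell
# Hypothetical iSCSI portal address - substitute the real storage server IP.
PORTAL=10.35.64.10
RULE="OUTPUT -d $PORTAL -p tcp --dport 3260 -j DROP"
# Block iSCSI traffic from the HSM host (run the printed command as root there):
echo "iptables -A $RULE"
# Restore connectivity afterwards:
echo "iptables -D $RULE"
```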
 
Actual results:

The host does not change status from Maintenance to Unassigned (so the user thinks nothing is happening). If we try to activate the host again, we get a CanDoAction failure saying the same action is already in progress.

Expected results:

The host should change state to Unassigned.


Additional info: logs

2013-05-12 13:55:30,758 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-45) No string for UNASSIGNED type. Use default Log
2013-05-12 13:55:32,135 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (ajp-/127.0.0.1:8702-2) Lock Acquired to object EngineLock [exclusiveLocks= key: 2982e993-2ca5-42bb-86ed-8db10986c47e value: VDS
, sharedLocks= ]
2013-05-12 13:55:32,158 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (pool-4-thread-47) [491d7e8a] Running command: ActivateVdsCommand internal: false. Entities affected :  ID: 2982e993-2ca5-42bb-86ed-8db10986c47e Type: VDS
2013-05-12 13:55:32,173 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-4-thread-47) [491d7e8a] START, SetVdsStatusVDSCommand(HostName = gold-vdsc, HostId = 2982e993-2ca5-42bb-86ed-8db10986c47e, status=Unassigned, nonOperationalReason=NONE), log id: 7779db55
2013-05-12 13:55:34,248 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-54) No string for UNASSIGNED type. Use default Log
2013-05-12 13:55:34,252 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-54) No string for UNASSIGNED type. Use default Log
2013-05-12 13:55:35,625 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (ajp-/127.0.0.1:8702-9) Failed to Acquire Lock to object EngineLock [exclusiveLocks= key: 2982e993-2ca5-42bb-86ed-8db10986c47e value: VDS
, sharedLocks= ]
2013-05-12 13:55:35,626 WARN  [org.ovirt.engine.core.bll.ActivateVdsCommand] (ajp-/127.0.0.1:8702-9) CanDoAction of action ActivateVds failed. Reasons:VAR__ACTION__ACTIVATE,VAR__TYPE__HOST,ACTION_TYPE_FAILED_OBJECT_LOCKED
2013-05-12 13:55:40,879 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-64) No string for UNASSIGNED type. Use default Log
2013-05-12 13:55:44,321 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-75) No string for UNASSIGNED type. Use default Log
2013-05-12 13:55:44,324 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-75) No string for UNASSIGNED type. Use default Log
2013-05-12 13:55:51,020 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-84) No string for UNASSIGNED type. Use default Log
2013-05-12 13:55:54,454 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-93) No string for UNASSIGNED type. Use default Log
2013-05-12 13:55:54,457 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-93) No string for UNASSIGNED type. Use default Log
2013-05-12 13:55:59,410 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (ajp-/127.0.0.1:8702-2) Failed to Acquire Lock to object EngineLock [exclusiveLocks= key: 2982e993-2ca5-42bb-86ed-8db10986c47e value: VDS
, sharedLocks= ]
2013-05-12 13:55:59,410 WARN  [org.ovirt.engine.core.bll.ActivateVdsCommand] (ajp-/127.0.0.1:8702-2) CanDoAction of action ActivateVds failed. Reasons:VAR__ACTION__ACTIVATE,VAR__TYPE__HOST,ACTION_TYPE_FAILED_OBJECT_LOCKED
2013-05-12 13:56:01,099 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-4) No string for UNASSIGNED type. Use default Log
2013-05-12 13:56:04,527 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-14) No string for UNASSIGNED type. Use default Log
2013-05-12 13:56:04,531 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-14) No string for UNASSIGNED type. Use default Log
2013-05-12 13:56:11,295 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-13) [5ea40e1d] No string for UNASSIGNED type. Use default Log
Comment 2 Yair Zaslavsky 2013-05-13 05:04:16 EDT
From what I saw in the code - ActivateVdsCommand sends a SetVdsStatus VDS command with Unassigned (this can be seen in the log as well).
However, SetVdsStatusVdsCommand does not update the database.
From what I saw - this happens if vdsManager is null for the VDS (looking at ResourceManager, it returns a null VDS manager for a VDS that does not exist in _vdsManagersDict).
Not sure why it is not there - AddVds should have added it to the dictionary, but who removed it?

Michael,what are your thoughts about this?

Thanks
Comment 3 Yair Zaslavsky 2013-05-13 06:37:06 EDT
After talking with Dafna and Haim, we don't think this is a blocker for the RC.
Comment 5 Barak 2013-08-12 10:49:16 EDT
By Roy Golan:

We could not reproduce this problem.

Can you please double check that it is still happening in your environment?
Comment 6 Barak 2013-08-12 11:25:19 EDT
*** Bug 963546 has been marked as a duplicate of this bug. ***
Comment 7 vvyazmin@redhat.com 2013-08-28 08:42:50 EDT
Same scenario reproduced on RHEVM 3.3 - IS11 environment:

RHEVM:  rhevm-3.3.0-0.16.master.el6ev.noarch
PythonSDK:  rhevm-sdk-python-3.3.0.11-1.el6ev.noarch
VDSM:  vdsm-4.12.0-72.git287bb7e.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
SANLOCK:  sanlock-2.8-1.el6.x86_64

How reproducible:
unknown 

Steps to Reproduce:
Based on BZ1002049

1. Create an iSCSI Data Center (DC) with two Hosts connected to 3 Storage Domains (SDs)
* On the first Storage Server (EMC) - one SD
* On the second Storage Server (XIO) - two SDs
See the diagram below:
------------------------------------------------------------------
[V] Host_01 (SPM) _____ connected _________ SD_01 (EMC)
[V] Host_02 _______|                 |_______ SD_02 (XIO)
                                     |_______ SD_03 (XIO) - Master
------------------------------------------------------------------
2. From both Hosts block connectivity to SD_01 via iptables
3. Status of the whole environment:
Host_01 - Unassigned
Host_02 (SPM) - UP

SD_01 - Active
SD_02 - Active
SD_03 - Active
4. Remove iptables

Actual results:
Host stuck forever in "Unassigned" mode

Impact on user:
The whole environment continues to work normally.

Workaround:
service ovirt-engine restart

Additional info:
Rebooting the host doesn't help.
The "Confirm 'Host has been rebooted'" option doesn't work either.

/var/log/ovirt-engine/engine.log
2013-08-28 15:05:18,322 WARN  [org.ovirt.engine.core.bll.storage.FenceVdsManualyCommand] (ajp-/127.0.0.1:8702-4) [32c8a243] CanDoAction of action FenceVdsManualy failed. Reasons
:VAR__TYPE__HOST,VAR__ACTION__MANUAL_FENCE,ACTION_TYPE_FAILED_VDS_NOT_MATCH_VALID_STATUS
2013-08-28 15:05:18,322 INFO  [org.ovirt.engine.core.bll.storage.FenceVdsManualyCommand] (ajp-/127.0.0.1:8702-4) [32c8a243] Lock freed to object EngineLock [exclusiveLocks= key:
 9576d8ca-4466-46e6-bebc-ccd922075ac6 value: VDS_FENCE
, sharedLocks= ]

/var/log/vdsm/vdsm.log
Comment 8 Roy Golan 2013-09-09 07:18:24 EDT
I was able to reproduce it only after putting one host into maintenance and activating it.

Then I saw my engine fail to acquire the monitoring lock (VDS_INIT) because it wasn't released by the previous monitoring cycle:


my host id starts with 2b294:


2013-09-09 09:48:15,275 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-69) Before acquiring lock EngineLock [exclusiveLocks= key: 2b294ef6-7e1c-404c-94b5-5b9529a48c64 value: VDS_INIT
, sharedLocks= ]
2013-09-09 09:48:15,277 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-69) Successed acquiring lock EngineLock [exclusiveLocks= key: 2b294ef6-7e1c-404c-94b5-5b9529a48c64 value: VDS_INIT
, sharedLocks= ] succeeded 



******* there is no log for releasing the lock for 2b294 ************

******* 30 seconds later ********

2013-09-09 09:48:48,299 DEBUG [org.ovirt.engine.core.bll.gluster.GlusterSyncJob] (DefaultQuartzScheduler_Worker-60) Refreshing Gluster Data [lightweight]
2013-09-09 09:48:48,297 DEBUG [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (DefaultQuartzScheduler_Worker-79) Load Balancer timer entered.
2013-09-09 09:48:48,313 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-69) Before acquiring lock EngineLock [exclusiveLocks= key: 2b294ef6-7e1c-404c-94b5-5b9529a48c64 value: VDS_INIT
, sharedLocks= ]
2013-09-09 09:48:48,327 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-70) Before releasing a lock EngineLock [exclusiveLocks= key: 4a581347-0f48-4fbc-a64f-1ca7d4b5bd01 value: VDS_INIT
, sharedLocks= ]
2013-09-09 09:48:48,330 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-69) Failed to acquire lock. Exclusive lock is taken for key: 2b294ef6-7e1c-404c-94b5-5b9529a48c64 , value: VDS_INIT



Trying to figure out now how the lock was not released; the logs show nothing special.
Comment 9 Roy Golan 2013-09-10 02:48:26 EDT
Vlad, I need a reproducer with logs (there's a problem with the logs attached): engine.log set to DEBUG, server.log, and a thread dump.

I'm still having problems reproducing this.
Comment 11 Aharon Canan 2013-09-10 07:46:28 EDT
Vlad will update
Comment 12 vvyazmin@redhat.com 2013-10-23 04:14:49 EDT
Due to a network problem and BZ1021561, I can't add the relevant logs right now; next week I will run the same test and add the relevant logs in DEBUG mode.
Comment 14 vvyazmin@redhat.com 2013-10-28 11:49:18 EDT
I reproduced the scenario.
Attached engine.log in DEBUG mode.

RHEVM 3.3 - IS20 environment:

Host OS: RHEL 6.5

RHEVM:  rhevm-3.3.0-0.28.beta1.el6ev.noarch
PythonSDK:  rhevm-sdk-python-3.3.0.17-1.el6ev.noarch
VDSM:  vdsm-4.13.0-0.5.beta1.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-29.el6.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.414.el6.x86_64
SANLOCK:  sanlock-2.8-1.el6.x86_64
Comment 15 vvyazmin@redhat.com 2013-10-28 11:50:11 EDT
Created attachment 816848 [details]
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm
Comment 16 Roy Golan 2013-10-29 11:31:57 EDT
tigris02 is stuck in Unassigned because the monitoring task (VdsRuntimeInfo) is failing to re-schedule itself, because the Quartz Scheduler EJB is flagged as shut down.

EJB components shut down when the server is going down or when the ear is undeployed, but this seems to happen without any intervention.

I also saw a "DestroyJavaJvm" thread in the thread dump, which is something that appears on shutdown, and I think also when a classloader is unloading (maybe during undeploy).

Juan have you seen this behaviour before?

*** here's what the log says when trying to schedule ***

2013-10-28 17:11:19,221 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-81) EJBComponentUnavailableException: JBAS014559: Invocation cannot proceed as component is shutting down: org.jboss.as.ejb3.component.EJBComponentUnavailableException: JBAS014559: Invocation cannot proceed as component is shutting down
	at org.jboss.as.ejb3.component.interceptors.ShutDownInterceptorFactory$1.processInvocation(ShutDownInterceptorFactory.java:59) [jboss-as-ejb3.jar:7.3.0.Final-redhat-6]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.2.Final-redhat-1]
	at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59) [jboss-as-ejb3.jar:7.3.0.Final-redhat-6]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.2.Final-redhat-1]
	at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50) [jboss-as-ee.jar:7.3.0.Final-redhat-6]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.2.Final-redhat-1]
	at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45) [jboss-as-ee.jar:7.3.0.Final-redhat-6]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.2.Final-redhat-1]
	at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.2.Final-redhat-1]
	at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165) [jboss-as-ee.jar:7.3.0.Final-redhat-6]
	at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:182) [jboss-as-ee.jar:7.3.0.Final-redhat-6]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.2.Final-redhat-1]
	at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.2.Final-redhat-1]
	at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:72) [jboss-as-ee.jar:7.3.0.Final-redhat-6]
	at org.ovirt.engine.core.common.businessentities.IVdsEventListener$$$view6.addExternallyManagedVms(Unknown Source)
	at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.saveDataToDb(VdsUpdateRunTimeInfo.java:173) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.Refresh(VdsUpdateRunTimeInfo.java:361) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.VdsManager.OnTimer(VdsManager.java:237) [vdsbroker.jar:]
	at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source) [:1.7.0_45]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_45]
	at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_45]
	at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [scheduler.jar:]
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]

2013-10-28 17:11:19,415 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler_Worker-83) failed to reschedule the job: org.quartz.SchedulerException: The Scheduler has been shutdown.
	at org.quartz.core.QuartzScheduler.validateState(QuartzScheduler.java:749) [quartz.jar:]
	at org.quartz.core.QuartzScheduler.rescheduleJob(QuartzScheduler.java:1060) [quartz.jar:]
	at org.quartz.impl.StdScheduler.rescheduleJob(StdScheduler.java:312) [quartz.jar:]
	at org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl.rescheduleAJob(SchedulerUtilQuartzImpl.java:331) [scheduler.jar:]
	at org.ovirt.engine.core.utils.timer.FixedDelayJobListener.jobWasExecuted(FixedDelayJobListener.java:91) [scheduler.jar:]
	at org.quartz.core.QuartzScheduler.notifyJobListenersWasExecuted(QuartzScheduler.java:1936) [quartz.jar:]
	at org.quartz.core.JobRunShell.notifyJobListenersComplete(JobRunShell.java:361) [quartz.jar:]
	at org.quartz.core.JobRunShell.run(JobRunShell.java:235) [quartz.jar:]
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]
Comment 17 Roy Golan 2013-10-29 11:36:52 EDT
JBoss is restarting, so ignore comment 16; this is expected.
Comment 18 Barak 2013-10-29 15:02:10 EDT
(In reply to vvyazmin@redhat.com from comment #14)
> I reproduce scenario. 
> Attached  engine.log in DEBUG mode
> 
> RHEVM 3.3 - IS20 environment:
> 
> Host OS: RHEL 6.5
> 
> RHEVM:  rhevm-3.3.0-0.28.beta1.el6ev.noarch
> PythonSDK:  rhevm-sdk-python-3.3.0.17-1.el6ev.noarch
> VDSM:  vdsm-4.13.0-0.5.beta1.el6ev.x86_64
> LIBVIRT:  libvirt-0.10.2-29.el6.x86_64
> QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.414.el6.x86_64
> SANLOCK:  sanlock-2.8-1.el6.x86_64

According to comments #16 and #17 it looks like you have restarted the engine when trying to reproduce the issue.
Is this correct ?
Comment 19 Dafna Ron 2013-10-30 06:49:40 EDT
no. I did not manually restart engine at any point (I would write it down in the reproduction).
Comment 20 Barak 2013-10-30 07:38:56 EDT
(In reply to Dafna Ron from comment #19)
> no. I did not manually restart engine at any point (I would write it down in
> the reproduction).

Sorry Dafna (I know you haven't), the needinfo was targeted to Vlad
Vlad can you please answer comment #18
Comment 21 Roy Golan 2013-10-30 10:49:04 EDT
This is a real issue, like I suspected: the engine unloads the EJB ear, and this leads to failures getting beans to react.

We need to see why this happens and how it differs from a manual restart of the engine, which shouldn't change much here.

Juan, since this seems to be a real issue, have you seen JBoss reloading the ear in some cases without manual intervention?
Comment 22 Juan Hernández 2013-10-30 11:26:26 EDT
The only situation where JBoss will reload an application is if the deployment marker file is touched. In our case it is "/var/lib/ovirt-engine/deployments/engine.ear.deployed". Is this file being touched manually or by some automatic process?

To check if this is the issue, I would suggest disabling scanning after the initial startup: in "/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.xml.in" change "scan-interval" to 0, then restart the engine and repeat the test.

I would also suggest checking the timestamp of the "/var/lib/ovirt-engine/deployments/engine.ear.deployed" file right after restarting the engine, before running the test, and after running the test. It shouldn't change; if it does, then we need to know what is touching it.
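The timestamp check above can be scripted. This is only a sketch: it uses a temporary stand-in file so it can run anywhere; on a real setup, point MARKER at /var/lib/ovirt-engine/deployments/engine.ear.deployed and run the reproduction test between the two stat calls.

```shell
# Stand-in for /var/lib/ovirt-engine/deployments/engine.ear.deployed
MARKER=$(mktemp)
BEFORE=$(stat -c %Y "$MARKER")   # mtime right after engine restart
# ... run the reproduction test here ...
AFTER=$(stat -c %Y "$MARKER")    # mtime after the test
if [ "$BEFORE" = "$AFTER" ]; then
    echo "marker untouched - no redeploy happened"
else
    echo "marker touched - something redeployed the engine"
fi
rm -f "$MARKER"
```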
Comment 23 Aharon Canan 2013-11-06 06:40:29 EST
After disabling the scanning ("scan-interval" set to 0) and rerunning the test, the host became "unassigned".

Running cksum on "/var/lib/ovirt-engine/deployments/engine.ear.deployed" returns the same output before and after the test.
Comment 24 Juan Hernández 2013-11-06 06:48:33 EST
Looks like the scanning has an effect; we should probably disable it, as we don't use it in production environments (it may be useful in development environments).

The checksum (the output of the cksum command) isn't meaningful in this case, as the content of the file never changes - it is always empty.

I suggest repeating the test with scanning enabled, as originally installed, and then checking the timestamp of the .deployed file. You can check the timestamp with the "ls -l" or "stat" commands.
Comment 25 Aharon Canan 2013-11-06 06:58:46 EST
I set scan-interval to 5000 (like the default) and reran the test, but again the host is "unassigned".

I am using is21.
Comment 26 Eli Mesika 2013-11-08 02:57:00 EST
(In reply to Aharon Canan from comment #25)
>  I set scan-intreval set to 5000 (like default) 
> rerun the test but again the host is "unassigned
> 
> I am using is21

Please also provide the results when scanning is disabled, to see if that is what caused it.
Comment 27 Aharon Canan 2013-11-11 11:05:50 EST
I will be able to provide this on the coming Wednesday.
Comment 28 Aharon Canan 2013-11-14 06:19:11 EST
As I wrote in comment #23.

I am also attaching the logs.
Comment 29 Aharon Canan 2013-11-14 06:20:05 EST
Created attachment 823877 [details]
logs
Comment 30 Aharon Canan 2013-11-14 06:24:38 EST
state changed to unassigned and then to non-op
Comment 31 Barak 2013-12-02 09:22:11 EST
So AFAIU this is exactly what should happen.
So what is the bug?
Comment 33 Aharon Canan 2013-12-17 04:27:31 EST
Barak, 

I probably didn't manage to reproduce it,
but following comment #21 there is a real issue here.

please let me know if you need more info.
Comment 35 Barak 2013-12-25 09:33:33 EST
(In reply to Aharon Canan from comment #33)
> Barak, 
> 
> I probably didn't manage to reproduce, 
> but following comment #21 there is real issue here.

The phenomenon observed in comment #21 is general JBoss behavior and was not reproduced either; see comment #22 for an explanation.

However, if there were a real issue there we would have encountered it many times, so it looks like a specific environment issue (that has nothing to do with the bug description).


> 
> please let me know if you need more info.

Moving this bug to CLOSED WORKSFORME
Comment 39 Liran Zelkha 2014-01-14 04:21:27 EST
I can't see anything special in the logs. The situation is that you have a host with no access to the storage domain, the host is in maintenance mode, and it doesn't switch to Unassigned?
Comment 40 Pavel Zhukov 2014-01-14 04:32:21 EST
(In reply to Liran Zelkha from comment #39)
> I can't see anything special in the logs. The situation is that you have a
> host with no access to the storage domain, the host is in maintenance mode
> and doesn't switch mode to unassigned?

Hosts are stuck in Unassigned mode (admin tried to activate them but one of the SD is blocked by iptables).
Comment 41 Liran Zelkha 2014-01-14 04:40:40 EST
But if the SD is not available you can't start the hosts. So even after the SD is up, the host is Unassigned?
Comment 42 Pavel Zhukov 2014-01-14 05:31:07 EST
(In reply to Liran Zelkha from comment #41)
> But if the SD is not available you can't start the hosts. So even after the
> SD is up the host is unassigned?

The host must become Non-Operational (not stay stuck in Unassigned), shouldn't it? Both the Maintenance and Activate buttons are unavailable when a host is in the Unassigned state. I don't think this is expected.
Comment 43 Liran Zelkha 2014-01-14 08:42:27 EST
From the scenario I'm testing here, I see that a host is NonOperational, and when I activate it, it becomes unassigned for 1-2 minutes, and then switches back to NonOperational. Is that the scenario you have? 
Can you send the host name, so I'll know what to look for in the logs?
Should the fix be that hosts that are unassigned can be moved to maintenance mode, or should the fix be that NonOperational-->Activate failed-->NonOperational (and not to unassigned)?
Comment 44 Barak 2014-01-19 07:35:33 EST
(In reply to Liran Zelkha from comment #43)
> From the scenario I'm testing here, I see that a host is NonOperational, and
> when I activate it, it becomes unassigned for 1-2 minutes, and then switches
> back to NonOperational. Is that the scenario you have? 

The above is standard behaviour.
If this is the case then this bug should be CLOSED NOTABUG.


> Can you send the host name, so I'll know what to look for in the logs?
> Should the fix be that hosts that are unassigned can be moved to maintenance
> mode, 

In case this is a transient status, the above should not be relevant.

> or should the fix be that NonOperational-->Activate
> failed-->NonOperational (and not to unassigned)?

Unassigned happens due to the host recovery mechanism (which moves the host to Unassigned ....)

Roman & Pavel - could you please check whether "stuck in unassigned" is transient for 2-3 minutes, or stuck forever?
Comment 45 Roman Hodain 2014-01-20 08:19:49 EST
(In reply to Barak from comment #44)
> (In reply to Liran Zelkha from comment #43)
> > From the scenario I'm testing here, I see that a host is NonOperational, and
> > when I activate it, it becomes unassigned for 1-2 minutes, and then switches
> > back to NonOperational. Is that the scenario you have? 
> 
> The above is a standard behaviour.
> If this is the case than this bug shoud be CLOSED NOTABUG
> 
> 
> > Can you send the host name, so I'll know what to look for in the logs?
> > Should the fix be that hosts that are unassigned can be moved to maintenance
> > mode, 
> 
> in case this is a transient status than the above should not be relevant
> 
> > or should the fix be that NonOperational-->Activate
> > failed-->NonOperational (and not to unassigned)?
> 
> unassigned is happening due to the host recovery mechanism (moves the host
> to unassigned ....)
> 
> Roman & Pavel - could you please check whether "stuck in unassigned" is
> transient for 2-3 minutes ? or stck forever ?

In this case the hypervisors got stuck forever - all of them; only one was up.
Comment 46 Liran Zelkha 2014-01-20 14:29:12 EST
Hi Roman,

Can you please send the host name that was stuck? Or the one that wasn't stuck - just so I can trace it in the logs?
Comment 48 Barak 2014-01-22 09:26:29 EST
can we get the engine log from the point in time it had happened ?
to a specific host ?
Comment 49 Roman Hodain 2014-01-23 08:18:52 EST
(In reply to Barak from comment #48)
> can we get the engine log from the point in time it had happened ?
> to a specific host ?

The only logs we have are the ones attached to this case.
Comment 50 Sandro Bonazzola 2014-02-19 07:27:36 EST
This bug is referenced in ovirt-engine-3.4.0-beta3 logs. Moving to ON_QA
Comment 55 Liran Zelkha 2014-02-24 15:02:44 EST
Checking the code and logs again, it seems like VdsUpdateRuntimeInfo does "understand" that the host should be moved to NonOperational, but some flag in VdsManager is set to true, and so the host is not actually updated. I'll try to fix that.
Comment 56 Leonid Natapov 2014-03-11 09:43:41 EDT
3.4.0-0.3.master.el6ev. Host moves to Non-Operational.
Comment 57 errata-xmlrpc 2014-06-09 10:59:06 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0506.html
