Bug 965972 - [rhevm] Hosts stuck in status "Unassigned", "Connecting", "Non Responding" - when the connection from host to DC is restricted for 5 minutes
Summary: [rhevm] Hosts stuck in status "Unassigned", "Connecting", "Non Responding" - ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.3.0
Assignee: Liron Aravot
QA Contact: Leonid Natapov
URL:
Whiteboard: storage
Depends On:
Blocks: 3.3snap2
Reported: 2013-05-22 08:08 UTC by vvyazmin@redhat.com
Modified: 2016-02-10 20:26 UTC (History)

Fixed In Version: is22
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-05-26 10:40:26 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
amureini: Triaged+


Attachments
## Logs rhevm (190.99 KB, application/x-gzip)
2013-05-22 08:08 UTC, vvyazmin@redhat.com
rhevm screenshot (208.36 KB, image/png)
2013-09-10 06:28 UTC, lijin
log files (840.38 KB, application/zip)
2013-09-10 06:29 UTC, lijin
vdsm log file (2.31 MB, application/zip)
2013-10-09 10:16 UTC, lijin


Links
System | ID | Private | Priority | Status | Summary | Last Updated
oVirt gerrit | 20400 | 0 | 'None' | MERGED | core: deadlock and related changes in domain monitoring/InitVdsOnUp | 2020-11-20 20:40:54 UTC
oVirt gerrit | 20895 | 0 | 'None' | MERGED | core: deadlock and related changes in domain monitoring/InitVdsOnUp | 2020-11-20 20:40:32 UTC

Description vvyazmin@redhat.com 2013-05-22 08:08:41 UTC
Created attachment 751556 [details]
## Logs rhevm

Description of problem:
Hosts are stuck in status "Unassigned", "Connecting", or "Non Responding" - Failed to acquire a permit within 5 MINUTES

Version-Release number of selected component (if applicable):
RHEVM 3.2 - SF17 environment:

RHEVM: rhevm-3.2.0-10.26.rc.el6ev.noarch
VDSM: vdsm-4.10.2-18.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.4.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an iSCSI DC with 150 hosts (in my case, 150 fake hosts) connected to one SD.
2. Block the connection from 92 hosts to the SD (via iptables).
3. Remove the iptables restriction.
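
The report does not give the exact iptables rules that were used. As a rough sketch only, steps 2 and 3 could be expressed as a pair of rules built like this. The OUTPUT chain, the TCP port 3260 (the default iSCSI target port) and the address 192.0.2.10 are assumptions for illustration, not values taken from the bug.

```python
ISCSI_PORT = 3260  # default iSCSI target port (assumed; not stated in the report)


def block_rule(storage_ip, port=ISCSI_PORT):
    """Build iptables arguments that drop outgoing TCP traffic to the storage."""
    return ["iptables", "-A", "OUTPUT", "-d", storage_ip,
            "-p", "tcp", "--dport", str(port), "-j", "DROP"]


def unblock_rule(storage_ip, port=ISCSI_PORT):
    """Same rule specification with -D, which deletes the rule added above."""
    rule = block_rule(storage_ip, port)
    rule[1] = "-D"
    return rule


# 192.0.2.10 is a documentation-only placeholder address.
print(" ".join(block_rule("192.0.2.10")))
print(" ".join(unblock_rule("192.0.2.10")))
```

The helpers only build the command lines; running them on each host (as root) is left out on purpose, since the original setup used fake hosts and may have blocked traffic differently.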
  
Actual results:
Hosts remain stuck in "Unassigned", "Connecting", or "Non Responding" (waited 2 hours)

Expected results:
Hosts reconnect to the SD successfully

Impact on user:

Workaround:
Solution from Antonio Hernandez Fernandez:
1. Edit file /usr/share/ovirt-engine/service/engine-service.xml.in
2. On line 153:
 <strict-max-pool name="slsb-strict-max-pool" max-pool-size="20" instance-acquisition-timeout="5" instance-acquisition-timeout-unit="MINUTES"/>
change max-pool-size="20" to max-pool-size="100".

3. Restart RHEVM (service ovirt-engine stop && service ovirt-engine start) 
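
For anyone scripting step 2, the attribute change can be sketched as a text substitution on the quoted element. This is a minimal illustration that operates on the line from the report as a string; the helper name `set_max_pool_size` is made up here, and the pattern assumes `name` appears before `max-pool-size`, as it does in the quoted line.

```python
import re

# The pool definition quoted in the workaround, copied from the report.
LINE = ('<strict-max-pool name="slsb-strict-max-pool" max-pool-size="20" '
        'instance-acquisition-timeout="5" '
        'instance-acquisition-timeout-unit="MINUTES"/>')


def set_max_pool_size(text, pool_name, new_size):
    """Return text with max-pool-size changed on the named strict-max-pool."""
    pattern = re.compile(
        r'(<strict-max-pool\b[^>]*\bname="%s"[^>]*\bmax-pool-size=")\d+(")'
        % re.escape(pool_name)
    )
    # A function replacement avoids ambiguity between group refs and digits.
    return pattern.sub(lambda m: m.group(1) + str(new_size) + m.group(2), text)


print(set_max_pool_size(LINE, "slsb-strict-max-pool", 100))
```

Only the matching pool element is touched; text that does not contain that pool name is returned unchanged.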

Additional info:

/var/log/ovirt-engine/engine.log
2013-05-21 14:54:48,309 ERROR [org.jboss.as.ejb3.invocation] (QuartzScheduler_Worker-23) JBAS014134: EJB Invocation failed on component VdsEventListener for method public abstract void org.ovirt.engine.core.common.businessentities.IVdsEventListener.handleVdsVersion(org.ovirt.engine.core.compat.Guid): javax.ejb.EJBException: JBAS014516: Failed to acquire a permit within 5 MINUTES
	at org.jboss.as.ejb3.pool.strictmax.StrictMaxPool.get(StrictMaxPool.java:109) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:47) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInNoTx(CMTTxInterceptor.java:209) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.as.ejb3.tx.CMTTxInterceptor.supports(CMTTxInterceptor.java:361) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:192) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ejb3.component.interceptors.ShutDownInterceptorFactory$1.processInvocation(ShutDownInterceptorFactory.java:42) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:181) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:72) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.ovirt.engine.core.common.businessentities.IVdsEventListener$$$view6.handleVdsVersion(Unknown Source) [engine-common.jar:]
	at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.AfterRefreshTreatment(VdsUpdateRunTimeInfo.java:429) [engine-vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.VdsManager.OnTimer(VdsManager.java:251) [engine-vdsbroker.jar:]
	at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) [:1.7.0_19]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_19]
	at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_19]
	at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [engine-scheduler.jar:]
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz-2.1.2.jar:]
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz-2.1.2.jar:]
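
The trace ends in StrictMaxPool.get, which fits the max-pool-size and instance-acquisition-timeout settings quoted in the workaround: a caller waits for a free pooled instance and gives up when the timeout expires. The following toy model in Python shows that behaviour with a counting semaphore. It is an analogy written for this report, not the JBoss implementation, and the class and function names are invented.

```python
import threading


class PermitTimeout(Exception):
    """Raised when no pooled instance becomes free in time."""


class StrictMaxPoolSketch:
    """Toy pool: at most max_pool_size callers may hold an instance at once."""

    def __init__(self, max_pool_size, timeout_seconds):
        self._permits = threading.BoundedSemaphore(max_pool_size)
        self._timeout = timeout_seconds

    def get(self):
        # Block for up to the timeout waiting for a free permit.
        if not self._permits.acquire(timeout=self._timeout):
            raise PermitTimeout(
                "Failed to acquire a permit within %s seconds" % self._timeout
            )

    def release(self):
        self._permits.release()


def demo():
    pool = StrictMaxPoolSketch(max_pool_size=2, timeout_seconds=0.05)
    pool.get()
    pool.get()          # both permits are now held and never released
    try:
        pool.get()      # a third caller has to wait, then gives up
    except PermitTimeout as exc:
        return str(exc)
    return "acquired"


print(demo())
```

In this model, a larger pool only delays the error if the callers holding permits never return them, which may be why raising max-pool-size helps as a workaround without addressing the underlying cause.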


/var/log/vdsm/vdsm.log

Comment 1 vvyazmin@redhat.com 2013-05-22 10:35:39 UTC
Succeeded in reproducing the same scenario with 53 hosts

Comment 2 Ayal Baron 2013-05-23 12:37:39 UTC
What happens when using 2 domains? (This is the best practice documented for customers, especially with such a large number of hosts.)

Comment 3 vvyazmin@redhat.com 2013-05-26 10:40:26 UTC
Problem was solved in RHEVM 3.2 - SF17.1 environment:

RHEVM: rhevm-3.2.0-11.28.el6ev.noarch
VDSM: vdsm-4.10.2-21.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

Comment 4 lijin 2013-09-10 06:26:38 UTC
I hit the same issue: the state of one host is stuck in "unassigned".

Steps:
Connect two hosts in one DC, boot 7 Windows guests, and run iozone in these guests for two weeks.

packages info:
rhevm-3.2.0-11.30.el6ev.noarch
vdsm-4.10.2-22.0.el6ev.x86_64
libvirt-0.10.2-18.el6_4.8.x86_64
qemu-kvm-rhev-0.12.1.2-2.375.el6.x86_64
kernel-2.6.32-369.el6.x86_64
sanlock-2.6-2.el6.x86_64

I will upload the screenshot and log file later.

Comment 5 lijin 2013-09-10 06:28:16 UTC
Created attachment 795845 [details]
rhevm screenshot

Comment 6 lijin 2013-09-10 06:29:19 UTC
Created attachment 795847 [details]
log files

Comment 7 Liron Aravot 2013-09-29 14:14:56 UTC
lijin, can you please provide:
1. vdsm logs of the problematic host
2. exact scenario - number of domains/hosts and their statuses, what exactly was done to reach that situation.
3. just to be sure, are you referring to the 9/9?

thanks

Comment 8 lijin 2013-10-09 10:15:50 UTC
(In reply to Liron Aravot from comment #7)
>  lijin , can you please provide -
> 1. vdsm logs of the problematic host
> 2. exact scenario - number of domains/hosts and their statuses, what exactly
> was done to reach that situation.
> 3. just to be sure, are you referring to the 9/9?
> 
> thanks

Sorry for the late response:
1. I will upload the vdsm log later.

2. Scenario:
  1) There are two hosts (first: 10.66.11.0, second: 10.66.11.50) in the specific DC.
  2) 7 Windows guests were running on the first host (10.66.11.0); no virtual machines were running on the second host (10.66.11.50).
  3) iozone ran in these guests for two weeks.
  4) After about two weeks, almost all the guests were in "migration" status; the first host was stuck in the "unassigned" state, and the second host stayed "up".

3. Regarding this question, do you want to confirm whether this issue is 100% reproducible? I hit this issue only once; the host was stuck in "unassigned" status for a few days, and we have not run tests on rhevm 3.2 recently.

Thanks for your effort.
If you have any questions, please let me know.

Comment 9 lijin 2013-10-09 10:16:38 UTC
Created attachment 809839 [details]
vdsm log file

Comment 10 Liron Aravot 2013-10-22 13:19:28 UTC
The issue is an in-memory deadlock;
further description is provided in the patch that fixes it.

Moving to post.
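
The comment above does not describe the deadlock itself; the details are in the patch. Purely as a generic illustration of how an in-memory circular wait produces permit timeouts, the sketch below plays two actors in a single thread: one holds a lock and needs a pool permit, the other holds the only permit and needs the lock. The lock, the pool size of 1 and all names are hypothetical and are not taken from the engine code.

```python
import threading

WAIT = 0.05  # seconds; stands in for the 5-minute timeout in the report

permits = threading.BoundedSemaphore(1)  # a pool with a single instance
monitor_lock = threading.Lock()          # hypothetical in-memory lock


def simulate_circular_wait():
    """Play two actors in one thread so the outcome is deterministic."""
    # Actor A takes the in-memory lock, actor B takes the only pool permit.
    monitor_lock.acquire()
    permits.acquire()
    try:
        # A now needs a permit that B holds; B needs the lock that A holds.
        a_got_permit = permits.acquire(timeout=WAIT)
        b_got_lock = monitor_lock.acquire(timeout=WAIT)
        return a_got_permit, b_got_lock
    finally:
        # Undo the two initial acquisitions so the demo can be re-run.
        permits.release()
        monitor_lock.release()


print(simulate_circular_wait())
```

Neither side can make progress until a timeout fires, so with a timeout measured in minutes the affected work appears stuck.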

Comment 13 Leonid Natapov 2013-11-27 12:54:34 UTC
is24.2, vdsm-4.13.0-0.8.beta1.el6ev.x86_64. After blocking the connection from the host to storage with iptables for 20 minutes, the DC became operational once the connection was restored.

Comment 14 Itamar Heim 2014-01-21 22:31:06 UTC
Closing - RHEV 3.3 Released

Comment 15 Itamar Heim 2014-01-21 22:31:09 UTC
Closing - RHEV 3.3 Released

Comment 16 Itamar Heim 2014-01-21 22:33:59 UTC
Closing - RHEV 3.3 Released

