Bug 965972 - [rhevm] Hosts stuck in status "Unassigned", "Connecting", "Non Responding" - when restrict connection from host to DC, for 5 minutes
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: x86_64 Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.3.0
Assigned To: Liron Aravot
QA Contact: Leonid Natapov
Whiteboard: storage
Keywords: Reopened
Depends On:
Blocks: 3.3snap2
Reported: 2013-05-22 04:08 EDT by vvyazmin@redhat.com
Modified: 2016-02-10 15:26 EST

See Also:
Fixed In Version: is22
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-05-26 06:40:26 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
amureini: Triaged+


Attachments
## Logs rhevm (190.99 KB, application/x-gzip)
2013-05-22 04:08 EDT, vvyazmin@redhat.com
rhevm screenshot (208.36 KB, image/png)
2013-09-10 02:28 EDT, lijin
log files (840.38 KB, application/zip)
2013-09-10 02:29 EDT, lijin
vdsm log file (2.31 MB, application/zip)
2013-10-09 06:16 EDT, lijin


External Trackers
Tracker       ID     Priority  Status  Summary  Last Updated
oVirt gerrit  20400  None      None    None     Never
oVirt gerrit  20895  None      None    None     Never

Description vvyazmin@redhat.com 2013-05-22 04:08:41 EDT
Created attachment 751556
## Logs rhevm

Description of problem:
Hosts are stuck in status "Unassigned", "Connecting", "Non Responding" - Failed to acquire a permit within 5 MINUTES

Version-Release number of selected component (if applicable):
RHEVM 3.2 - SF17 environment:

RHEVM: rhevm-3.2.0-10.26.rc.el6ev.noarch
VDSM: vdsm-4.10.2-18.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.4.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an iSCSI DC with 150 hosts (in my case, 150 fake hosts) connected to one SD
2. Block the connection from 92 hosts to the SD (via iptables)
3. Remove the restriction from iptables
  
Actual results:
Hosts remain stuck in "Unassigned", "Connecting", "Non Responding" (waited 2 hours)

Expected results:
Hosts reconnect to the SD successfully

Impact on user:

Workaround:
Solution from Antonio Hernandez Fernandez:
1. Edit the file /usr/share/ovirt-engine/service/engine-service.xml.in
2. On line 153, which reads:
 <strict-max-pool name="slsb-strict-max-pool" max-pool-size="20" instance-acquisition-timeout="5" instance-acquisition-timeout-unit="MINUTES"/>
change max-pool-size to "100".

3. Restart RHEVM (service ovirt-engine stop && service ovirt-engine start)

Additional info:

/var/log/ovirt-engine/engine.log
2013-05-21 14:54:48,309 ERROR [org.jboss.as.ejb3.invocation] (QuartzScheduler_Worker-23) JBAS014134: EJB Invocation failed on component VdsEventListener for method public abstract void org.ovirt.engine.core.common.businessentities.IVdsEventListener.handleVdsVersion(org.ovirt.engine.core.compat.Guid): javax.ejb.EJBException: JBAS014516: Failed to acquire a permit within 5 MINUTES
	at org.jboss.as.ejb3.pool.strictmax.StrictMaxPool.get(StrictMaxPool.java:109) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:47) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInNoTx(CMTTxInterceptor.java:209) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.as.ejb3.tx.CMTTxInterceptor.supports(CMTTxInterceptor.java:361) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:192) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ejb3.component.interceptors.ShutDownInterceptorFactory$1.processInvocation(ShutDownInterceptorFactory.java:42) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:181) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:72) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.ovirt.engine.core.common.businessentities.IVdsEventListener$$$view6.handleVdsVersion(Unknown Source) [engine-common.jar:]
	at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.AfterRefreshTreatment(VdsUpdateRunTimeInfo.java:429) [engine-vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.VdsManager.OnTimer(VdsManager.java:251) [engine-vdsbroker.jar:]
	at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) [:1.7.0_19]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_19]
	at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_19]
	at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [engine-scheduler.jar:]
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz-2.1.2.jar:]
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz-2.1.2.jar:]
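
Editorial illustration of the error above: the trace shows StrictMaxPool.get giving up after the configured instance-acquisition-timeout because no slot of the max-pool-size pool became free. A minimal sketch of that kind of bounded, timed acquisition, assuming a semaphore-based pool. The class StrictMaxPoolSketch and its methods are hypothetical names for illustration and are not the JBoss implementation:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a strict-max pool; not the JBoss StrictMaxPool source. */
public class StrictMaxPoolSketch {
    private final Semaphore permits;   // one permit per pooled instance
    private final long timeout;        // plays the role of instance-acquisition-timeout
    private final TimeUnit unit;       // plays the role of instance-acquisition-timeout-unit

    public StrictMaxPoolSketch(int maxPoolSize, long timeout, TimeUnit unit) {
        this.permits = new Semaphore(maxPoolSize, true);
        this.timeout = timeout;
        this.unit = unit;
    }

    /** Waits up to the timeout for a free slot, then fails like the logged error. */
    public void acquire() {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a permit", e);
        }
        if (!acquired) {
            throw new IllegalStateException(
                "Failed to acquire a permit within " + timeout + " " + unit);
        }
    }

    /** Returns a slot to the pool; callers that never reach this starve the others. */
    public void release() {
        permits.release();
    }

    public int available() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        // Pool of 2 with a short timeout so the demo finishes quickly.
        StrictMaxPoolSketch pool = new StrictMaxPoolSketch(2, 100, TimeUnit.MILLISECONDS);
        pool.acquire();
        pool.acquire();
        try {
            pool.acquire(); // third caller: no permit becomes free in time
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        pool.release();
        pool.acquire();     // succeeds once a slot has been released
        System.out.println("acquired after release");
    }
}
```

In this sketch, a larger pool only helps when callers eventually release their slots; if every holder is blocked indefinitely, a bigger pool may merely delay the timeout.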


/var/log/vdsm/vdsm.log
Comment 1 vvyazmin@redhat.com 2013-05-22 06:35:39 EDT
Successfully reproduced the same scenario with 53 hosts
Comment 2 Ayal Baron 2013-05-23 08:37:39 EDT
What happens when using 2 domains? (This is the best practice documented for customers, especially with this many hosts.)
Comment 3 vvyazmin@redhat.com 2013-05-26 06:40:26 EDT
The problem was solved in the RHEVM 3.2 - SF17.1 environment:

RHEVM: rhevm-3.2.0-11.28.el6ev.noarch
VDSM: vdsm-4.10.2-21.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64
Comment 4 lijin 2013-09-10 02:26:38 EDT
I hit the same issue: one host is stuck in the "Unassigned" state.

Steps:
Connect two hosts in one DC, boot 7 Windows guests, and run iozone in these guests for two weeks.

packages info:
rhevm-3.2.0-11.30.el6ev.noarch
vdsm-4.10.2-22.0.el6ev.x86_64
libvirt-0.10.2-18.el6_4.8.x86_64
qemu-kvm-rhev-0.12.1.2-2.375.el6.x86_64
kernel-2.6.32-369.el6.x86_64
sanlock-2.6-2.el6.x86_64

I will upload the screenshot and log file later.
Comment 5 lijin 2013-09-10 02:28:16 EDT
Created attachment 795845
rhevm screenshot
Comment 6 lijin 2013-09-10 02:29:19 EDT
Created attachment 795847
log files
Comment 7 Liron Aravot 2013-09-29 10:14:56 EDT
lijin, can you please provide:
1. vdsm logs of the problematic host
2. exact scenario - number of domains/hosts and their statuses, what exactly was done to reach that situation.
3. just to be sure, are you referring to the 9/9?

thanks
Comment 8 lijin 2013-10-09 06:15:50 EDT
(In reply to Liron Aravot from comment #7)
>  lijin , can you please provide -
> 1. vdsm logs of the problematic host
> 2. exact scenario - number of domains/hosts and their statuses, what exactly
> was done to reach that situation.
> 3. just to be sure, are you referring to the 9/9?
> 
> thanks

Sorry for the late response:
1. I will upload the vdsm log later.

2. Scenario:
  1) There are two hosts (first: 10.66.11.0, second: 10.66.11.50) in the specific DC.
  2) 7 Windows guests were running on the first host, 10.66.11.0; no virtual machines were running on the second host, 10.66.11.50.
  3) iozone was run in these guests for two weeks.
  4) After about two weeks, almost all the guests were in "migration" status; the first host was stuck in the "Unassigned" state, while the second host stayed "Up".

3. Regarding this question, do you want to confirm whether this issue is 100% reproducible? I hit this issue only once; the host was stuck in "Unassigned" status for a few days, and we have not run tests on RHEVM 3.2 recently.

Thanks for your effort.
If you have any questions, please let me know.
Comment 9 lijin 2013-10-09 06:16:38 EDT
Created attachment 809839
vdsm log file
Comment 10 Liron Aravot 2013-10-22 09:19:28 EDT
The issue is an in-memory deadlock;
a further description is provided in the patch that fixes it.

Moving to POST.
Comment 13 Leonid Natapov 2013-11-27 07:54:34 EST
Tested with vdsm-4.13.0-0.8.beta1.el6ev.x86_64 (is24.2). After blocking the connection from host to storage using iptables for 20 minutes, the DC became operational once the connection was restored.
Comment 14 Itamar Heim 2014-01-21 17:31:06 EST
Closing - RHEV 3.3 Released
