Bug 965972

Summary: [rhevm] Hosts stuck in status "Unassigned", "Connecting", "Non Responding" - when connection from host to DC is restricted, for 5 minutes
Product: Red Hat Enterprise Virtualization Manager
Reporter: vvyazmin <vvyazmin>
Component: ovirt-engine
Assignee: Liron Aravot <laravot>
Status: CLOSED CURRENTRELEASE
QA Contact: Leonid Natapov <lnatapov>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.2.0
CC: acanan, acathrow, amureini, bcao, dyasny, iheim, laravot, lijin, lpeer, lyarwood, Rhev-m-bugs, scohen, tnisan, yeylon, ykaul
Target Milestone: ---
Keywords: Reopened
Target Release: 3.3.0
Flags: amureini: Triaged+
Hardware: x86_64
OS: Unspecified
Whiteboard: storage
Fixed In Version: is22
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-05-26 10:40:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1032811
Attachments (Description: Flags):
## Logs rhevm: none
rhevm screenshot: none
log files: none
vdsm log file: none

Description vvyazmin@redhat.com 2013-05-22 08:08:41 UTC
Created attachment 751556 [details]
## Logs rhevm

Description of problem:
Hosts stuck in status "Unassigned", "Connecting", "Non Responding" - Failed to acquire a permit within 5 MINUTES

Version-Release number of selected component (if applicable):
RHEVM 3.2 - SF17 environment:

RHEVM: rhevm-3.2.0-10.26.rc.el6ev.noarch
VDSM: vdsm-4.10.2-18.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.4.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an iSCSI DC with 150 hosts (in my case 150 fake hosts) connected to one SD
2. Block the connection from 92 hosts to the SD (via iptables)
3. Remove the restriction from iptables
  
Actual results:
Hosts remain stuck in "Unassigned", "Connecting", "Non Responding" (waited 2 hours)

Expected results:
Hosts reconnect to the SD successfully

Impact on user:

Workaround:
Solution from Antonio Hernandez Fernandez:
1. Edit file /usr/share/ovirt-engine/service/engine-service.xml.in
2. On line 153, in the following element:
 <strict-max-pool name="slsb-strict-max-pool" max-pool-size="20" instance-acquisition-timeout="5" instance-acquisition-timeout-unit="MINUTES"/>
change max-pool-size="20" to max-pool-size="100"

3. Restart RHEVM (service ovirt-engine stop && service ovirt-engine start) 
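
To illustrate why raising max-pool-size relieves the symptom, here is a minimal sketch of a strict-max style pool built on a counting semaphore. It is not the JBoss implementation: the class name StrictPoolSketch and its methods are hypothetical, and the only behavior taken from this report is that a caller waits up to the configured timeout for a permit and fails with "Failed to acquire a permit within ..." when none is released in time. The constructor arguments stand in for max-pool-size, instance-acquisition-timeout and instance-acquisition-timeout-unit.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a strict-max pool; NOT the JBoss StrictMaxPool code. */
public class StrictPoolSketch {
    private final Semaphore permits; // one permit per pooled instance
    private final long timeout;      // assumed analogue of instance-acquisition-timeout
    private final TimeUnit unit;     // assumed analogue of instance-acquisition-timeout-unit

    public StrictPoolSketch(int maxPoolSize, long timeout, TimeUnit unit) {
        this.permits = new Semaphore(maxPoolSize, true);
        this.timeout = timeout;
        this.unit = unit;
    }

    /** Waits up to the timeout for a free permit, then gives up with an exception. */
    public void acquire() throws InterruptedException {
        if (!permits.tryAcquire(timeout, unit)) {
            throw new IllegalStateException(
                "Failed to acquire a permit within " + timeout + " " + unit);
        }
    }

    /** Returns a permit to the pool so that a waiting caller can proceed. */
    public void release() {
        permits.release();
    }

    /** Number of permits currently free. */
    public int available() {
        return permits.availablePermits();
    }

    public static void main(String[] args) throws InterruptedException {
        // Tiny pool and short timeout so the exhaustion is visible quickly.
        StrictPoolSketch pool = new StrictPoolSketch(2, 100, TimeUnit.MILLISECONDS);
        pool.acquire();
        pool.acquire();
        try {
            pool.acquire(); // third caller: no permit is released in time
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        pool.release();
        pool.acquire();     // succeeds once a permit has been returned
        System.out.println("acquired after release");
    }
}
```

If the holders of the permits never release them, a larger pool only postpones exhaustion, so the workaround should be read as mitigation rather than a fix.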

Additional info:

/var/log/ovirt-engine/engine.log
2013-05-21 14:54:48,309 ERROR [org.jboss.as.ejb3.invocation] (QuartzScheduler_Worker-23) JBAS014134: EJB Invocation failed on component VdsEventListener for method public abstract void org.ovirt.engine.core.common.businessentities.IVdsEventListener.handleVdsVersion(org.ovirt.engine.core.compat.Guid): javax.ejb.EJBException: JBAS014516: Failed to acquire a permit within 5 MINUTES
	at org.jboss.as.ejb3.pool.strictmax.StrictMaxPool.get(StrictMaxPool.java:109) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:47) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInNoTx(CMTTxInterceptor.java:209) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.as.ejb3.tx.CMTTxInterceptor.supports(CMTTxInterceptor.java:361) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:192) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ejb3.component.interceptors.ShutDownInterceptorFactory$1.processInvocation(ShutDownInterceptorFactory.java:42) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:181) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:72) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.ovirt.engine.core.common.businessentities.IVdsEventListener$$$view6.handleVdsVersion(Unknown Source) [engine-common.jar:]
	at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.AfterRefreshTreatment(VdsUpdateRunTimeInfo.java:429) [engine-vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.VdsManager.OnTimer(VdsManager.java:251) [engine-vdsbroker.jar:]
	at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) [:1.7.0_19]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_19]
	at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_19]
	at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [engine-scheduler.jar:]
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz-2.1.2.jar:]
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz-2.1.2.jar:]


/var/log/vdsm/vdsm.log

Comment 1 vvyazmin@redhat.com 2013-05-22 10:35:39 UTC
Succeeded in reproducing the same scenario with 53 hosts

Comment 2 Ayal Baron 2013-05-23 12:37:39 UTC
What happens when using 2 domains? (That is the best practice documented for customers, especially with such a large number of hosts.)

Comment 3 vvyazmin@redhat.com 2013-05-26 10:40:26 UTC
Problem was solved in RHEVM 3.2 - SF17.1 environment:

RHEVM: rhevm-3.2.0-11.28.el6ev.noarch
VDSM: vdsm-4.10.2-21.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

Comment 4 lijin 2013-09-10 06:26:38 UTC
I hit the same issue: the state of one host is stuck in "unassigned".

steps:
Connect two hosts in one DC, boot 7 Windows guests, and run iozone in these guests for two weeks.

packages info:
rhevm-3.2.0-11.30.el6ev.noarch
vdsm-4.10.2-22.0.el6ev.x86_64
libvirt-0.10.2-18.el6_4.8.x86_64
qemu-kvm-rhev-0.12.1.2-2.375.el6.x86_64
kernel-2.6.32-369.el6.x86_64
sanlock-2.6-2.el6.x86_64

I will upload the screenshot and log file later.

Comment 5 lijin 2013-09-10 06:28:16 UTC
Created attachment 795845 [details]
rhevm screenshot

Comment 6 lijin 2013-09-10 06:29:19 UTC
Created attachment 795847 [details]
log files

Comment 7 Liron Aravot 2013-09-29 14:14:56 UTC
lijin, can you please provide -
1. vdsm logs of the problematic host
2. exact scenario - number of domains/hosts and their statuses, what exactly was done to reach that situation.
3. just to be sure, are you referring to the 9/9?

thanks

Comment 8 lijin 2013-10-09 10:15:50 UTC
(In reply to Liron Aravot from comment #7)
>  lijin , can you please provide -
> 1. vdsm logs of the problematic host
> 2. exact scenario - number of domains/hosts and their statuses, what exactly
> was done to reach that situation.
> 3. just to be sure, are you referring to the 9/9?
> 
> thanks

Sorry for the late response:
1. I will upload the vdsm log later.

2. Scenario:
  1) There are two hosts (first: 10.66.11.0, second: 10.66.11.50) in the specific DC.
  2) 7 Windows guests were running on the first host (10.66.11.0); no virtual machines were running on the second host (10.66.11.50).
  3) iozone ran in these guests for two weeks.
  4) After about two weeks, almost all the guests were in "migration" status; the first host was stuck in the "unassigned" state, while the second host stayed "up".

3. Regarding this question, do you want to confirm whether this issue is 100% reproducible? I hit this issue only once; the host was stuck in "unassigned" status for a few days, and we have not run tests on rhevm 3.2 recently.

Thanks for your effort.
If you have any questions, please let me know.

Comment 9 lijin 2013-10-09 10:16:38 UTC
Created attachment 809839 [details]
vdsm log file

Comment 10 Liron Aravot 2013-10-22 13:19:28 UTC
The issue is an in-memory deadlock;
further description is provided in the solving patch.

Moving to post.

Comment 13 Leonid Natapov 2013-11-27 12:54:34 UTC
vdsm-4.13.0-0.8.beta1.el6ev.x86_64, is24.2. After blocking the connection from the host to storage using iptables for 20 min, the DC became operational once the connection was restored.

Comment 14 Itamar Heim 2014-01-21 22:31:06 UTC
Closing - RHEV 3.3 Released

Comment 15 Itamar Heim 2014-01-21 22:31:09 UTC
Closing - RHEV 3.3 Released

Comment 16 Itamar Heim 2014-01-21 22:33:59 UTC
Closing - RHEV 3.3 Released