Bug 965972
Summary: | [rhevm] Hosts stuck in status "Unassigned", "Connecting", "Non Responding" - when restrict connection from host to DC, for 5 minutes | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | vvyazmin <vvyazmin> | ||||||||||
Component: | ovirt-engine | Assignee: | Liron Aravot <laravot> | ||||||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Leonid Natapov <lnatapov> | ||||||||||
Severity: | urgent | Docs Contact: | |||||||||||
Priority: | unspecified | ||||||||||||
Version: | 3.2.0 | CC: | acanan, acathrow, amureini, bcao, dyasny, iheim, laravot, lijin, lpeer, lyarwood, Rhev-m-bugs, scohen, tnisan, yeylon, ykaul | ||||||||||
Target Milestone: | --- | Keywords: | Reopened | ||||||||||
Target Release: | 3.3.0 | Flags: | amureini: Triaged+ | ||||||||||
Hardware: | x86_64 | ||||||||||||
OS: | Unspecified | ||||||||||||
Whiteboard: | storage | ||||||||||||
Fixed In Version: | is22 | Doc Type: | Bug Fix | ||||||||||
Doc Text: | Story Points: | --- | |||||||||||
Clone Of: | Environment: | ||||||||||||
Last Closed: | 2013-05-26 10:40:26 UTC | Type: | Bug | ||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||
Documentation: | --- | CRM: | |||||||||||
Verified Versions: | Category: | --- | |||||||||||
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
Embargoed: | |||||||||||||
Bug Depends On: | |||||||||||||
Bug Blocks: | 1032811 | ||||||||||||
Attachments: |
|
Succeeded in reproducing the same scenario with 53 hosts.

What happens when using 2 domains? (This is the best practice documented for customers, especially with such numbers of hosts.)

Problem was solved in RHEVM 3.2 - SF17.1 environment:
RHEVM: rhevm-3.2.0-11.28.el6ev.noarch
VDSM: vdsm-4.10.2-21.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

I hit the same issue: one host is stuck in the "Unassigned" state.
Steps: Connect two hosts in one DC, boot 7 Windows guests, and run iozone in these guests for two weeks.
Package info:
rhevm-3.2.0-11.30.el6ev.noarch
vdsm-4.10.2-22.0.el6ev.x86_64
libvirt-0.10.2-18.el6_4.8.x86_64
qemu-kvm-rhev-0.12.1.2-2.375.el6.x86_64
kernel-2.6.32-369.el6.x86_64
sanlock-2.6-2.el6.x86_64
I will upload the screenshot and log file later.

Created attachment 795845 [details]
rhevm screenshot
Created attachment 795847 [details]
log files
lijin, can you please provide:
1. vdsm logs of the problematic host
2. exact scenario - number of domains/hosts and their statuses, and what exactly was done to reach that situation
3. just to be sure, are you referring to the 9/9?
thanks

(In reply to Liron Aravot from comment #7)
> lijin , can you please provide -
> 1. vdsm logs of the problematic host
> 2. exact scenario - number of domains/hosts and their statuses, what exactly
> was done to reach that situation.
> 3. just to be sure, are you referring to the 9/9?
>
> thanks

Sorry for the late response:
1. I will upload the vdsm log later.
2. Scenario:
   1) There are two hosts (first: 10.66.11.0, second: 10.66.11.50) in the specific DC.
   2) 7 Windows guests were running on the first host, 10.66.11.0; no virtual machines were running on the second host, 10.66.11.50.
   3) Ran iozone in these guests for two weeks.
   4) After about two weeks, almost all the guests were in "migration" status; the first host was stuck in the "Unassigned" state, while the second host stayed "Up".
3. For this question, do you want to confirm whether this issue is 100% reproducible? I hit this issue only once; the host was stuck in "Unassigned" status for a few days, and we have not run tests on RHEVM 3.2 recently.

Thanks for your effort. If you have any questions, please let me know.

Created attachment 809839 [details]
vdsm log file
The issue is an in-memory deadlock; further description is provided in the fixing patch. Moving to POST.

vdsm-4.13.0-0.8.beta1.el6ev.x86_64. is 24.2
After blocking the connection from host to storage using iptables for 20 min, the DC became operational after the connection was restored.

Closing - RHEV 3.3 Released |
Created attachment 751556 [details]
## Logs rhevm

Description of problem:
Hosts stuck in status "Unassigned", "Connecting", "Non Responding" - Failed to acquire a permit within 5 MINUTES

Version-Release number of selected component (if applicable):
RHEVM 3.2 - SF17 environment:
RHEVM: rhevm-3.2.0-10.26.rc.el6ev.noarch
VDSM: vdsm-4.10.2-18.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.4.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an iSCSI DC with 150 hosts (in my case, 150 fake hosts) connected to one SD
2. Block the connection (via iptables) from 92 hosts to the SD
3. Remove the restriction from iptables

Actual results:
Hosts stuck in "Unassigned", "Connecting", "Non Responding" (waited 2 hours)

Expected results:
Hosts succeed in connecting to the SD

Impact on user:

Workaround:
Solution from Antonio Hernandez Fernandez:
1. Edit the file /usr/share/ovirt-engine/service/engine-service.xml.in
2. In line 153:
<strict-max-pool name="slsb-strict-max-pool" max-pool-size="20" instance-acquisition-timeout="5" instance-acquisition-timeout-unit="MINUTES"/>
change to max-pool-size="100"
3. Restart RHEVM (service ovirt-engine stop && service ovirt-engine start)

Additional info:

/var/log/ovirt-engine/engine.log

2013-05-21 14:54:48,309 ERROR [org.jboss.as.ejb3.invocation] (QuartzScheduler_Worker-23) JBAS014134: EJB Invocation failed on component VdsEventListener for method public abstract void org.ovirt.engine.core.common.businessentities.IVdsEventListener.handleVdsVersion(org.ovirt.engine.core.compat.Guid): javax.ejb.EJBException: JBAS014516: Failed to acquire a permit within 5 MINUTES
    at org.jboss.as.ejb3.pool.strictmax.StrictMaxPool.get(StrictMaxPool.java:109) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
    at org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:47) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInNoTx(CMTTxInterceptor.java:209) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.supports(CMTTxInterceptor.java:361) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:192) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
    at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
    at org.jboss.as.ejb3.component.interceptors.ShutDownInterceptorFactory$1.processInvocation(ShutDownInterceptorFactory.java:42) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
    at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
    at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
    at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-2]
    at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
    at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:181) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-2]
    at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:72) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
    at org.ovirt.engine.core.common.businessentities.IVdsEventListener$$$view6.handleVdsVersion(Unknown Source) [engine-common.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.AfterRefreshTreatment(VdsUpdateRunTimeInfo.java:429) [engine-vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsManager.OnTimer(VdsManager.java:251) [engine-vdsbroker.jar:]
    at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) [:1.7.0_19]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_19]
    at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_19]
    at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [engine-scheduler.jar:]
    at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz-2.1.2.jar:]
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz-2.1.2.jar:]

/var/log/vdsm/vdsm.log
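
For readers unfamiliar with the "Failed to acquire a permit" error in the trace above, the following is a minimal Python sketch of how a pool with a strict maximum size and an acquisition timeout behaves when more callers arrive than there are permits. It is not the JBoss implementation and not the fix that went into this bug; the `ToyStrictMaxPool` class and the `simulate` helper are hypothetical names used only for illustration, and the timings are scaled down to fractions of a second instead of 5 minutes. The numbers 20 and 100 come from the workaround, and 92 from the reproduction steps.

```python
import threading
import time


class ToyStrictMaxPool:
    """Hypothetical toy analogue of a strict-max pool (not JBoss code).

    At most `max_size` callers may run at once; a caller that cannot get
    a permit within `acquire_timeout` seconds fails with TimeoutError.
    """

    def __init__(self, max_size, acquire_timeout):
        self._permits = threading.BoundedSemaphore(max_size)
        self._timeout = acquire_timeout

    def invoke(self, work):
        # Wait for a free permit, but only up to the configured timeout.
        if not self._permits.acquire(timeout=self._timeout):
            raise TimeoutError(
                f"Failed to acquire a permit within {self._timeout} s"
            )
        try:
            return work()
        finally:
            self._permits.release()


def simulate(max_size, callers, hold_seconds, acquire_timeout):
    """Start `callers` threads that each hold a permit for `hold_seconds`.

    Returns how many callers failed to acquire a permit in time.
    """
    pool = ToyStrictMaxPool(max_size, acquire_timeout)
    failures = []
    lock = threading.Lock()

    def caller():
        try:
            # Stand-in for a call that blocks on an unreachable host.
            pool.invoke(lambda: time.sleep(hold_seconds))
        except TimeoutError:
            with lock:
                failures.append(1)

    threads = [threading.Thread(target=caller) for _ in range(callers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(failures)


if __name__ == "__main__":
    # 92 slow callers against the default pool size and the workaround size.
    for size in (20, 100):
        failed = simulate(size, callers=92, hold_seconds=1.0, acquire_timeout=0.1)
        print(f"max-pool-size={size}: {failed} callers timed out")
```

In this toy model, raising the pool size above the number of simultaneously blocked callers makes the timeouts disappear, which is consistent with the workaround helping; it says nothing about the underlying deadlock that the later patch addressed.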