Description of problem:
While running automation tests, at some point the hosts are stuck in 'Connecting' status while the hosts are up and VDSM is responding (vdsClient).

The error in the engine:
2015-09-01 05:41:03,631 - MainThread - api_utils - ERROR - Failed to syncAction element NOT as expected:
02:21:26.429 Status: 503
02:21:26.429 Reason: Service Temporarily Unavailable

Version-Release number of selected component (if applicable):
vdsm-4.17.3-1.el7ev.noarch
rhevm-3.6.0-0.12.master.el6.noarch

How reproducible:
Sometimes

Steps to Reproduce:
1. Run tier1 automation
Created attachment 1069090 [details] rhevm logs
Created attachment 1069091 [details] vdsm logs
It looks like after running:

Calling 'Host.setupNetworks' in bridge with {u'bondings': {}, u'networks': {u'sw162': {u'remove': u'true'}}, u'options': {u'connectivityCheck': u'true', u'connectivityTimeout': 60}}

we lost connectivity from the engine and were not able to reconnect.

Not sure if it is related, but in the engine log I can see:

2015-09-01 03:37:32,013 WARN [org.ovirt.engine.core.vdsbroker.vdsbroker.SetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-8) [host_nics_syncAction_b7d812f9-d3cc-40a9] Host 'host_mixed_1' ('5d1e1eb1-6aa4-4d51-836c-c309d0850a99') doesn't contain Supported Cluster Versions, therefore 'defaultRoute' will not be sent via the SetupNetworks

It seems to be a network issue, not an infra issue.
Dan, can you please take a look at the logs and see whether there are any issues with setup networks?
In https://gerrit.ovirt.org/#/c/43615/ I made Vdsm require an explicit defaultRoute=True on the management network, since Engine<=3.3 is no longer supported. However, if even Engine<=3.5 does not always send defaultRoute (e.g. for clusterLevels<=3.3), we'll need to carry the legacy code further on.
I don't think this is related to defaultRoute=True, since when we hit this bug I have an ssh connection to the host (from a different subnet) and the only way to fix this is to restart the engine service.
Meni, please reproduce the bug with my patch, to prove your point.
Is your patch in the latest QE build?
No, it is not yet merged.
I have a setup with this if you want to take a look. I can ssh to the host and vdsClient is responsive.
Set the old title back since I got it on two environments again.
Created attachment 1072915 [details] env1 hosts logs
Created attachment 1072916 [details] env1 engine logs
Created attachment 1072917 [details] env2 hosts logs
Created attachment 1072918 [details] env2 engine logs
To reproduce: run the host_network_api job.
Just to be clear: sometimes when one of the hosts in the cluster goes through setupNetworks, all the other hosts in the DC go into 'Connecting' status. Restarting the engine solves it, hence this looks like an infra bug; moving to engine.
OK, how can I reproduce it?
Looking at the latest engine log I can see that there was only one host failing at a time. Initially it was 10.35.128.28 and after some time it was 10.35.128.31. After some time the engine stopped connecting to it. It would be good to see what is going on at the message level, so please provide the engine log at debug level so we can understand the full picture.
Is it related to Bug 1262051?
I thought it was, but now we suspect it is an infra issue, so we need more information to make sure.
Please provide engine log in debug.
We suspected that after fixing BZ #1256446 this problem would be gone. We retested with a patched environment and we got it again. The issue is that when we run setupNetworks we see the host status changing from 'Up' -> 'Connecting' -> 'Up'. These status changes are triggered by a connection policy reset and this is expected behavior. The problem is that sometimes the host status change is so slow that when we attempt to run the next setupNetworks test it fails due to the host being in 'Connecting' state. The host recovers in a few seconds, but it is too late. I suggest updating the SetupNetworks command to wait for the host status change before assuming that the operation succeeded.
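Until such a change lands, the tests could also simply poll the host status and only start the next setupNetworks once the host is back to 'Up'. A minimal sketch of that wait, assuming the 3.6 REST API exposes the host at /api/hosts/<id> with a status/state field (the engine URL, credentials, and function name below are made up; the real automation presumably has its own api_utils helpers for this):

import time
import requests

def wait_for_host_up(engine_url, auth, host_id, timeout=120, interval=5):
    # Poll the REST API until the host reports 'up' again.
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get("%s/api/hosts/%s" % (engine_url, host_id),
                            auth=auth,
                            headers={"Accept": "application/json"},
                            verify=False)  # test setups often use self-signed certs
        resp.raise_for_status()
        state = resp.json().get("status", {}).get("state", "")
        if state == "up":
            return True
        time.sleep(interval)
    return False

# usage (values are illustrative only):
# wait_for_host_up("https://engine.example.com", ("admin@internal", "secret"),
#                  "41857f47-d3db-4491-b0d8-a2c96b9d037f")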
As far as we know this issue has been there before, so it is not a regression. Removing the blocker flag. Putting needinfo on Alona and Dan to share their thoughts.
Sounds to me like the tests need to be changed.
Moving to network to decide whether the test should change, or setup networks logic should change.
Meni, looking at comment #24, I think that fixing the tests to fit the new, slower nature of SetupNetworks in 3.6 is the way to go.
This bug is about all hosts being stuck in 'Connecting' status, and the only way to fix it is to restart the engine service. We have another bug about hosts moving from 'Up' to 'Connecting' and back to 'Up' (https://bugzilla.redhat.com/show_bug.cgi?id=1262051), and that one is related to the speed of the operation.
(In reply to Meni Yakove from comment #29)
> This bug is about all hosts being stuck in 'Connecting' status, and the only
> way to fix it is to restart the engine service. We have another bug about
> hosts moving from 'Up' to 'Connecting' and back to 'Up'
> (https://bugzilla.redhat.com/show_bug.cgi?id=1262051), and that one is
> related to the speed of the operation.

It sounds like you won't see this issue anymore once Bug 1262051 is fixed (or the tests are changed to fit the new, slower nature of setup networks). Please verify that once it is fixed. In addition, Piotr - please re-examine to make sure there is no additional issue hiding here.
From what I see in the logs, we have a protection issue. There is a host monitoring cycle which runs just before setupNetworks is triggered. When we lose connectivity due to the network configuration changes, host monitoring never receives a response. When the vdsTimeout occurs, the connection is reset even though it could be healthy at that time. We need to make sure that there is no message exchange while we run setupNetworks.
Restoring needinfo. Can we get the engine log in debug (like in comment #22)?
Still waiting for Nelly to run the tests.
Created attachment 1088817 [details] Debug log engine
Attached debug log of the engine
In order to provide a thread dump, please send SIGQUIT to the engine process by:

kill -3 <java-pid> > output.txt

Once you run ps to check the pid, please make sure that -Xrs is not set. This flag tells the jvm to ignore SIGQUIT.
I can see about 3 minutes with no log entries in the log:

2015-11-02 18:56:18,815 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-91) [] START, GetStatsVDSCommand(HostName = host_mixed_1, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='41857f47-d3db-4491-b0d8-a2c96b9d037f', vds='Host[host_mixed_1,41857f47-d3db-4491-b0d8-a2c96b9d037f]'}), log id: 3d3a7e99
2015-11-02 18:56:18,815 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient] (DefaultQuartzScheduler_Worker-91) [] Message sent: SEND reply-to:jms.topic.vdsm_responses destination:jms.topic.vdsm_requests content-length:98 {"jsonrpc":"2.0","method":"Host.getStats","params":{},"id":"427d60bc-479f-4d87-9ca2-d6384c882b2a"}^@
2015-11-02 18:59:06,607 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc":"2.0","error":{"message":"Vds timeout occured","code":"10.35.117.44:2033222199"},"id":null}
2015-11-02 18:59:06,608 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc":"2.0","error":{"message":"Vds timeout occured","code":"10.35.117.42:2062834906"},"id":null}

Do you see high load on the host which could stall the jvm for such a long time?
I have this:

ovirt 30817 30815 99 01:07 ? 00:01:01 ovirt-engine -server -XX:+TieredCompilation -Xms1024M -Xmx1024M -XX:PermSize=256m -XX:MaxPermSize=256m -Djava.awt.headless=true -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Djsse.enableSNIExtension=false -Djava.security.krb5.conf=/etc/ovirt-engine/krb5.conf -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/ovirt-engine/dump -Djava.util.logging.manager=org.jboss.logmanager -Dlogging.configuration=file:///var/lib/ovirt-engine/jboss_runtime/config/ovirt-engine-logging.properties -Dorg.jboss.resolver.warning=true -Djboss.modules.system.pkgs=org.jboss.byteman -Djboss.modules.write-indexes=false -Djboss.server.default.config=ovirt-engine -Djboss.home.dir=/usr/share/jbossas -Djboss.server.base.dir=/usr/share/ovirt-engine -Djboss.server.data.dir=/var/lib/ovirt-engine -Djboss.server.log.dir=/var/log/ovirt-engine -Djboss.server.config.dir=/var/lib/ovirt-engine/jboss_runtime/config -Djboss.server.temp.dir=/var/lib/ovirt-engine/jboss_runtime/tmp -Djboss.controller.temp.dir=/var/lib/ovirt-engine/jboss_runtime/tmp -jar /usr/share/jbossas/jboss-modules.jar -mp /var/lib/ovirt-engine/jboss_runtime/modules/00-modules-common:/var/lib/ovirt-engine/jboss_runtime/modules/01-ovirt-engine-extension-aaa-jdbc-modules:/var/lib/ovirt-engine/jboss_runtime/modules/02-ovirt-engine-extension-aaa-ldap-modules:/var/lib/ovirt-engine/jboss_runtime/modules/03-ovirt-engine-extension-aaa-misc-modules:/var/lib/ovirt-engine/jboss_runtime/modules/04-jbossas-modules -jaxpmodule javax.xml.jaxp-provider org.jboss.as.standalone -c ovirt-engine.xml

and when I run kill -3 30817 > output.txt, the txt file is empty and the process is still up.
In order to take a thread dump we need to send the SIGQUIT signal by:

kill -3 <engine-pid>

If there is no output, we could have standard output redirected or the -Xrs flag set.
In order to get the thread dump we can do one of the following:

1. Add the following lines to ovirt-engine.py in the self._engineArgs.extend([ section:

self._engineArgs.extend([
    .......
    '-XX:+UnlockDiagnosticVMOptions',
    '-XX:+LogVMOutput',
    '-XX:LogFile=/tmp/jvm.log',
])

A restart of the engine is required, and when kill -3 is sent the thread dump is redirected to the /tmp/jvm.log file.

2. Check your java version by running java -version and, if it is 1.7, install the java-1.7.0-openjdk-devel package. After the installation we can run:

jstack <engine-pid>

We can redirect the output to a file by:

kill -3 <pid> > output.txt or jstack <pid> > output.txt
(In reply to Piotr Kliczewski from comment #39)
> In order to take a thread dump we need to send the SIGQUIT signal by:
>
> kill -3 <engine-pid>
>
> If there is no output, we could have standard output redirected or the -Xrs
> flag set.
> In order to get the thread dump we can do one of the following:
> 1. Add the following lines to ovirt-engine.py in the
> self._engineArgs.extend([ section:
>
> self._engineArgs.extend([
>     .......
>     '-XX:+UnlockDiagnosticVMOptions',
>     '-XX:+LogVMOutput',
>     '-XX:LogFile=/tmp/jvm.log',
> ])
>
> A restart of the engine is required, and when kill -3 is sent the thread
> dump is redirected to the /tmp/jvm.log file.
>
> 2. Check your java version by running java -version and, if it is 1.7,
> install the java-1.7.0-openjdk-devel package. After the installation we can
> run:
>
> jstack <engine-pid>
>
> We can redirect the output to a file by:
>
> kill -3 <pid> > output.txt or jstack <pid> > output.txt

Piotr, with this process, will jvm.log only be created when I run kill -3? We have a jvm.log but we didn't face the connecting issue. Is it possible that the file is created on every engine restart?
The file is created at engine start time. It only contains jvm information. The thread dump is only added to the file when you run kill -3.
So I need a way to generate the file name. kill -9 > <file_name> is not working.
The filename is provided in the jvm args as in comment #40. When you start the engine with those args, the file is created with jvm info only, but when you send SIGQUIT the thread dump is written to the file provided in the LogFile arg. I suggest using -3, not -9.
I use -3 (-9 was a typo). We have a patch that checks if all hosts are in 'Connecting'; then we run kill -3 <java_pid> and restart the engine, so in this case we will never have the right file. Is this correct?
it is correct.
ok, I need a way to save each file with a different name.
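For reference, one way to do that (a sketch only; it assumes jstack is available as described in comment #39, and the path and function name are made up):

import subprocess
import time

def save_thread_dump(engine_pid, out_dir="/tmp"):
    # Write a jstack thread dump to a new, timestamped file on every call,
    # so dumps taken before an engine restart are not overwritten.
    out_file = "%s/engine-threaddump-%s.txt" % (out_dir,
                                                time.strftime("%Y%m%d-%H%M%S"))
    with open(out_file, "w") as f:
        subprocess.check_call(["jstack", str(engine_pid)], stdout=f)
    return out_file

Alternatively, after each kill -3 the fixed /tmp/jvm.log could simply be copied to a timestamped name before the engine is restarted.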
Bug tickets must have version flags set prior to targeting them to a release. Please ask the maintainer to set the correct version flags and only then set the target milestone.
We haven't seen this bug for a while.
Created attachment 1110135 [details] integration-engine sos report

But it re-surfaced in the RHEV-CI integration-engine system; re-opening and attaching the engine SOS report. The bug seems to affect multiple hosts across different clusters with different cluster levels (both 3.5 and 3.6). Will attach host sos reports soon.
Created attachment 1110136 [details] 3.6 cluster host report (new-ish rhel 7.2 install)
Created attachment 1110137 [details] 3.6 cluster host report
Created attachment 1110139 [details] integration-engine DB logs
Hopefully we will be able to have it in.
Looks OK, the jobs don't get stuck anymore.