Bug 1259000 - Hosts stuck in 'connecting' status while running automation tests
Status: CLOSED CURRENTRELEASE
Product: ovirt-engine
Classification: oVirt
Component: BLL.Infra
Version: 3.6.2
Hardware: x86_64 Linux
Priority: unspecified  Severity: high
Target Milestone: ovirt-3.6.2
Target Release: 3.6.2.5
Assigned To: Piotr Kliczewski
QA Contact: Meni Yakove
Keywords: Automation, AutomationBlocker, Reopened
Depends On:
Blocks:
Reported: 2015-09-01 13:24 EDT by Meni Yakove
Modified: 2016-02-18 06:03 EST
19 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-18 06:03:05 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt-3.6.z+
ylavi: planning_ack+
rule-engine: devel_ack+
rule-engine: testing_ack+


Attachments
rhevm logs (557.39 KB, application/x-bzip) - 2015-09-01 13:26 EDT, Meni Yakove
vdsm logs (431.22 KB, application/x-bzip) - 2015-09-01 13:26 EDT, Meni Yakove
env1 hosts logs (480.00 KB, application/x-bzip) - 2015-09-13 11:36 EDT, Meni Yakove
env1 engine logs (746.97 KB, application/x-bzip) - 2015-09-13 11:37 EDT, Meni Yakove
env2 hosts logs (1.95 MB, application/x-bzip) - 2015-09-13 11:38 EDT, Meni Yakove
env2 engine logs (685.89 KB, application/x-bzip) - 2015-09-13 11:38 EDT, Meni Yakove
Debug log engine (14.73 MB, application/x-gzip) - 2015-11-03 02:34 EST, Nelly Credi
integration-engine sos report (16.49 MB, application/x-xz) - 2015-12-29 04:52 EST, Barak Korren
3.6 cluster host report (new-ise rhel 7.2 install) (17.45 MB, application/x-xz) - 2015-12-29 04:58 EST, Barak Korren
3.6 cluster host report (18.90 MB, application/x-xz) - 2015-12-29 05:01 EST, Barak Korren
integration-engine DB logs (2.20 MB, application/x-xz) - 2015-12-29 05:38 EST, Barak Korren


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 45712 master MERGED Revert "net/api: drop support for legacy mgmt networks" Never
oVirt gerrit 46237 ovirt-3.6 MERGED Revert "net/api: drop support for legacy mgmt networks" Never
oVirt gerrit 51513 master MERGED i/o thread blocked during connection 2016-01-08 07:35 EST
oVirt gerrit 51552 master MERGED Update version of jsonrpc to 1.1.6 2016-01-11 04:48 EST
oVirt gerrit 51553 ovirt-engine-3.6 MERGED Update version of jsonrpc to 1.1.6 2016-01-11 04:54 EST
oVirt gerrit 51554 ovirt-engine-3.6.2 MERGED Update version of jsonrpc to 1.1.6 2016-01-11 05:10 EST

Description Meni Yakove 2015-09-01 13:24:54 EDT
Description of problem:
While running automation tests, at some point the hosts get stuck in 'connecting' status even though the hosts are up and VDSM is responding (vdsClient).

The error in the engine:
2015-09-01 05:41:03,631 - MainThread - api_utils - ERROR - Failed to syncAction element NOT as expected:
02:21:26.429 	Status: 503
02:21:26.429 	Reason: Service Temporarily Unavailable

Version-Release number of selected component (if applicable):
vdsm-4.17.3-1.el7ev.noarch
rhevm-3.6.0-0.12.master.el6.noarch

How reproducible:
sometimes

Steps to Reproduce:
1. Run tier1 automation
Comment 1 Meni Yakove 2015-09-01 13:26:13 EDT
Created attachment 1069090 [details]
rhevm logs
Comment 2 Meni Yakove 2015-09-01 13:26:42 EDT
Created attachment 1069091 [details]
vdsm logs
Comment 3 Piotr Kliczewski 2015-09-02 04:22:36 EDT
It looks like after running:

Calling 'Host.setupNetworks' in bridge with {u'bondings': {}, u'networks': {u'sw162': {u'remove': u'true'}}, u'options': {u'connectivityCheck': u'true', u'connectivityTimeout': 60}}

we lost connectivity from the engine and were not able to reconnect. Not sure if it is related, but in the engine log I can see:

2015-09-01 03:37:32,013 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-8) [host_nics_syncAction_b7d812f9-d3cc-40a9] Host 'host_mixed_1' ('5d1e1eb1-6aa4-4d51-836c-c309d0850a99') doesn't contain Supported Cluster Versions, therefore 'defaultRoute' will not be sent via the SetupNetworks

It seems to be a network issue, not an infra issue.
Comment 4 Piotr Kliczewski 2015-09-02 04:29:39 EDT
Dan, can you please take a look at the logs and see whether there are any issues with setup networks?
Comment 5 Dan Kenigsberg 2015-09-03 05:38:25 EDT
In https://gerrit.ovirt.org/#/c/43615/ I made Vdsm require an explicit defaultRoute=True on the management network, since Engine<=3.3 are no longer supported.

However, even Engine<=3.5 does not always send defaultRoute (e.g. for clusterLevels<=3.3).

We'll need to carry the legacy code further on.
Comment 6 Meni Yakove 2015-09-08 14:17:22 EDT
I don't think this is related to defaultRoute=True, since when we hit this bug I have an SSH connection to the host (from a different subnet) and the only way to fix it is to restart the engine service.
Comment 7 Dan Kenigsberg 2015-09-09 06:02:00 EDT
Meni, please reproduce the bug with my patch, to prove your point.
Comment 8 Meni Yakove 2015-09-10 13:28:13 EDT
Is your patch in the latest QE build?
Comment 9 Dan Kenigsberg 2015-09-13 04:51:34 EDT
No, it is not yet merged.
Comment 10 Meni Yakove 2015-09-13 09:55:41 EDT
I have a setup exhibiting this if you want to take a look.
I can SSH to the host and vdsClient is responsive.
Comment 11 Meni Yakove 2015-09-13 11:15:25 EDT
Set back the old title since I hit it on two environments again.
Comment 12 Meni Yakove 2015-09-13 11:36:15 EDT
Created attachment 1072915 [details]
env1 hosts logs
Comment 13 Meni Yakove 2015-09-13 11:37:04 EDT
Created attachment 1072916 [details]
env1 engine logs
Comment 14 Meni Yakove 2015-09-13 11:38:18 EDT
Created attachment 1072917 [details]
env2 hosts logs
Comment 15 Meni Yakove 2015-09-13 11:38:57 EDT
Created attachment 1072918 [details]
env2 engine logs
Comment 16 Meni Yakove 2015-09-13 11:40:54 EDT
Reproduce: run host_network_api job
Comment 17 Barak 2015-09-17 07:41:59 EDT
Just to be clear:
sometimes when one of the hosts in the cluster goes through setupNetworks, all other hosts in the DC go into 'connecting' status.

Restarting the engine solves it, hence this looks like an infra bug;
moving to engine.
Comment 18 Piotr Kliczewski 2015-09-17 07:54:09 EDT
OK, how can I reproduce it?
Comment 19 Piotr Kliczewski 2015-09-17 08:00:37 EDT
Looking at the latest engine log, I can see that only one host was failing at a time. Initially it was 10.35.128.28, and after some time it was 10.35.128.31; eventually the engine stopped connecting to it. It would be good to see what is going on at the message level, so please provide an engine log at debug level so we can understand the full picture.
Comment 20 Barak 2015-09-17 08:02:05 EDT
Is it related to Bug 1262051?
Comment 21 Piotr Kliczewski 2015-09-17 08:05:56 EDT
I thought it was, but now we suspect it is an infra issue, so we need more information to make sure.
Comment 22 Piotr Kliczewski 2015-09-17 08:06:23 EDT
Please provide an engine log at debug level.
Comment 24 Piotr Kliczewski 2015-09-24 07:30:20 EDT
We suspected that this problem would be gone after fixing BZ #1256446. We retested with a patched environment and hit it again. The issue is that when we run setupNetworks we see the host status changing from 'Up' -> 'Connecting' -> 'Up'.
These status changes are triggered by a connection policy reset, and that is expected behavior.

The problem is that sometimes the host status change is so slow that when we attempt to run the next setupNetworks test it fails because the host is still in 'Connecting' state. The host recovers within a few seconds, but it is too late.

I suggest updating the SetupNetworks command to wait for the host status change before assuming the operation succeeded.
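
For illustration, a minimal polling sketch of that suggestion (this is not the test framework's or the engine's actual code; get_host_state is a hypothetical callable that returns the current host status, e.g. via the oVirt REST API):

import time

def wait_for_host_up(get_host_state, timeout=120, interval=2):
    """Poll the host status until it returns to 'up', or fail after `timeout` seconds."""
    deadline = time.time() + timeout
    state = None
    while time.time() < deadline:
        state = get_host_state()  # hypothetical callable supplied by the caller
        if state == 'up':
            return
        # 'connecting' is expected transiently after the connection policy reset
        time.sleep(interval)
    raise RuntimeError("host did not return to 'up' within %ss (last state: %s)"
                       % (timeout, state))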
Comment 25 Oved Ourfali 2015-09-24 07:34:19 EDT
As far as we know this issue has been there before, so it's not a regression.
Removing blocker flag.
Putting needinfo on Alona and Dan to share their thoughts.
Comment 26 Oved Ourfali 2015-09-24 07:39:14 EDT
Sounds to me like the tests need to be changed.
Comment 27 Oved Ourfali 2015-10-11 12:42:14 EDT
Moving to network to decide whether the test should change, or setup networks logic should change.
Comment 28 Barak 2015-10-15 09:04:53 EDT
Meni, looking at comment #24, I think the right fix is to adjust the tests to fit the new, slower nature of SetupNetworks in 3.6.
Comment 29 Meni Yakove 2015-10-25 06:06:58 EDT
This bug is about all hosts getting stuck in 'connecting' status, where the only way to fix it is to restart the engine service.
We have another bug about hosts moving from 'up' to 'connecting' and back to 'up' (https://bugzilla.redhat.com/show_bug.cgi?id=1262051), and that one is related to the speed of the operation.
Comment 30 Oved Ourfali 2015-10-26 06:01:47 EDT
(In reply to Meni Yakove from comment #29)
> this bug is about all hosts are stuck in connecting status and the only way
> to fix it is to restart engine service.
> We have another bug about hosts move from 'up' to 'connecting' and back to
> 'up'(https://bugzilla.redhat.com/show_bug.cgi?id=1262051) and this is
> related to the speed of the operation.

It sounds like you won't see this issue anymore once Bug 1262051 is fixed (or the tests are changed to fit the new, slower nature of setup networks).
Please verify that once it is fixed.

In addition, Piotr - please re-examine to make sure there is no additional issue hiding here.
Comment 31 Piotr Kliczewski 2015-10-26 06:36:11 EDT
From what I see in the logs we have a protection issue. There is a host monitoring cycle running just before setupNetworks is triggered. When we lose connectivity due to the network configuration changes, host monitoring never receives a response. When the vdsTimeout occurs the connection is reset, even though it could be healthy by that time.

We need to make sure that there is no message exchange while we run setupNetworks.
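
As a rough sketch of that protection idea (an assumption about how it could look, not the engine's actual implementation), the monitoring exchange and setupNetworks could be serialized per host, with monitoring skipping a cycle while a reconfiguration is in flight:

import threading

class HostChannel:
    """Sketch: suppress the periodic monitoring exchange while setupNetworks runs."""
    def __init__(self):
        self._lock = threading.Lock()  # serializes message exchange for one host

    def monitor_cycle(self, poll):
        # Skip this cycle rather than racing a network reconfiguration.
        if self._lock.acquire(False):  # non-blocking acquire
            try:
                poll()  # e.g. the Host.getStats request
            finally:
                self._lock.release()

    def setup_networks(self, do_setup):
        with self._lock:  # no monitoring traffic while connectivity may drop
            do_setup()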
Comment 32 Piotr Kliczewski 2015-10-26 06:50:10 EDT
Restoring needinfo. Can we get an engine log in debug (like in comment #22)?
Comment 33 Meni Yakove 2015-11-02 04:36:11 EST
Still waiting for Nelly to run the tests.
Comment 34 Nelly Credi 2015-11-03 02:34 EST
Created attachment 1088817 [details]
Debug log engine
Comment 35 Nelly Credi 2015-11-03 02:38:50 EST
Attached debug log of the engine
Comment 36 Piotr Kliczewski 2015-11-03 03:44:14 EST
In order to provide a thread dump, please send SIGQUIT to the engine process:
kill -3 <java-pid> > output.txt

Once you run ps to check the pid, please make sure that -Xrs is not set. This flag tells the JVM to ignore SIGQUIT.
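
A small hypothetical helper for the -Xrs check (assumes Linux /proc; not part of any existing tooling):

import sys

def jvm_ignores_sigquit(pid):
    # /proc/<pid>/cmdline holds the NUL-separated arguments of the running JVM
    with open('/proc/%d/cmdline' % pid, 'rb') as f:
        args = f.read().split(b'\0')
    return b'-Xrs' in args

if __name__ == '__main__':
    print('-Xrs set (SIGQUIT ignored):', jvm_ignores_sigquit(int(sys.argv[1])))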
Comment 37 Piotr Kliczewski 2015-11-03 05:56:34 EST
I can see 3 minutes with no log entries in the log:

2015-11-02 18:56:18,815 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-91) [] START, GetStatsVDSCommand(HostName = host_mixed_1, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='41857f47-d3db-4491-b0d8-a2c96b9d037f', vds='Host[host_mixed_1,41857f47-d3db-4491-b0d8-a2c96b9d037f]'}), log id: 3d3a7e99
2015-11-02 18:56:18,815 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient] (DefaultQuartzScheduler_Worker-91) [] Message sent: SEND
reply-to:jms.topic.vdsm_responses
destination:jms.topic.vdsm_requests
content-length:98

{"jsonrpc":"2.0","method":"Host.getStats","params":{},"id":"427d60bc-479f-4d87-9ca2-d6384c882b2a"}^@
2015-11-02 18:59:06,607 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc":"2.0","error":{"message":"Vds timeout occured","code":"10.35.117.44:2033222199"},"id":null}
2015-11-02 18:59:06,608 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc":"2.0","error":{"message":"Vds timeout occured","code":"10.35.117.42:2062834906"},"id":null}

Do you see high load on the host that could stall the JVM for such a long time?
Comment 38 Meni Yakove 2015-11-03 18:12:46 EST
I have this:

ovirt    30817 30815 99 01:07 ?        00:01:01 ovirt-engine -server -XX:+TieredCompilation -Xms1024M -Xmx1024M -XX:PermSize=256m -XX:MaxPermSize=256m -Djava.awt.headless=true -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Djsse.enableSNIExtension=false -Djava.security.krb5.conf=/etc/ovirt-engine/krb5.conf -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/ovirt-engine/dump -Djava.util.logging.manager=org.jboss.logmanager -Dlogging.configuration=file:///var/lib/ovirt-engine/jboss_runtime/config/ovirt-engine-logging.properties -Dorg.jboss.resolver.warning=true -Djboss.modules.system.pkgs=org.jboss.byteman -Djboss.modules.write-indexes=false -Djboss.server.default.config=ovirt-engine -Djboss.home.dir=/usr/share/jbossas -Djboss.server.base.dir=/usr/share/ovirt-engine -Djboss.server.data.dir=/var/lib/ovirt-engine -Djboss.server.log.dir=/var/log/ovirt-engine -Djboss.server.config.dir=/var/lib/ovirt-engine/jboss_runtime/config -Djboss.server.temp.dir=/var/lib/ovirt-engine/jboss_runtime/tmp -Djboss.controller.temp.dir=/var/lib/ovirt-engine/jboss_runtime/tmp -jar /usr/share/jbossas/jboss-modules.jar -mp /var/lib/ovirt-engine/jboss_runtime/modules/00-modules-common:/var/lib/ovirt-engine/jboss_runtime/modules/01-ovirt-engine-extension-aaa-jdbc-modules:/var/lib/ovirt-engine/jboss_runtime/modules/02-ovirt-engine-extension-aaa-ldap-modules:/var/lib/ovirt-engine/jboss_runtime/modules/03-ovirt-engine-extension-aaa-misc-modules:/var/lib/ovirt-engine/jboss_runtime/modules/04-jbossas-modules -jaxpmodule javax.xml.jaxp-provider org.jboss.as.standalone -c ovirt-engine.xml

and when I run kill -3 30817 > output.txt the txt file is empty and the process is still up.
Comment 39 Piotr Kliczewski 2015-11-04 03:36:57 EST
In order to take a thread dump we need to send the SIGQUIT signal:

kill -3 <engine-pid>

If there is no output, standard output may be redirected or the -Xrs flag may be set.
To get the thread dump we can do one of the following:
1. Add the following lines to ovirt-engine.py in this section:

self._engineArgs.extend([
    .......
    '-XX:+UnlockDiagnosticVMOptions',
    '-XX:+LogVMOutput',
    '-XX:LogFile=/tmp/jvm.log',
])

A restart of the engine is required; when kill -3 is sent, the thread dump is written to the /tmp/jvm.log file.

2. Check your java version by running java -version; if it is 1.7, install the java-1.7.0-openjdk-devel package. After the installation we can run:

jstack <engine-pid>

We can redirect the output to a file by:

kill -3 <pid> > output.txt or jstack <pid> > output.txt
Comment 40 Meni Yakove 2015-11-11 10:50:42 EST
(In reply to Piotr Kliczewski from comment #39)
> In order to take a thread dump we need to send SIGQUIT signal by:
> 
> kill -3 <engine-pid>
> 
> If there is no output we could have standard output redirected or -Xrs flag
> set.
> In order to get the thread dump we can do:
> 1. Add following lines to ovirt-engine.py for section         
> 
> self._engineArgs.extend([
>             .......
>             '-XX:+UnlockDiagnosticVMOptions',
>             '-XX:+LogVMOutput',
>             '-XX:LogFile=/tmp/jvm.log',
> ])
> 
> Restart of the engine is required and when kill -3 is send thread dump is
> redirected to /tmp/jvm.log file.
> 
> 2. Check you java version by running java -version and if it is 1.7 install
> java-1.7.0-openjdk-devel package. After the installation we can run:
> 
> jstack <engine-pid>
> 
> We can redirect the output to a file by:
> 
> kill -3 <pid> > output.txt or jstack <pid> > output.text

Piotr, with this process will jvm.log be created only when I run kill -3? We have jvm.log but we didn't hit the connecting issue. Is it possible that the file is created on every engine restart?
Comment 41 Piotr Kliczewski 2015-11-12 03:25:05 EST
The file is created at engine start time. It only contains jvm information. The thread dump is only added to the file when you run kill -3.
Comment 42 Meni Yakove 2015-11-12 04:43:18 EST
So I need a way to generate the file name.
kill -9 > <file_name> is not working.
Comment 43 Piotr Kliczewski 2015-11-12 04:51:33 EST
The filename is provided in the jvm args as in comment #40.
When you start the engine with these args the file is created with only JVM info, but when you send SIGQUIT it writes the thread dump to the file provided in the LogFile arg.

I suggest using -3, not -9.
Comment 44 Meni Yakove 2015-11-12 08:21:25 EST
I use -3 (-9 was a typo). We have a patch that checks if all hosts are in 'connecting'; then we run kill -3 <java_pid> and restart the engine, so in this case we will never have the right file. Is this correct?
Comment 45 Piotr Kliczewski 2015-11-12 09:02:10 EST
it is correct.
Comment 46 Meni Yakove 2015-11-12 10:24:07 EST
OK, I need a way to save each file with a different name.
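
One possible workaround, purely as an assumption on my side: copy the fixed-name dump aside with a timestamp before restarting the engine, so each occurrence keeps its own file. A minimal sketch:

import shutil
import time

def preserve_dump(src='/tmp/jvm.log'):
    # A timestamped copy survives the jvm.log rewrite on the next engine start.
    dst = '%s.%s' % (src, time.strftime('%Y%m%d-%H%M%S'))
    shutil.copy2(src, dst)
    return dst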
Comment 47 Red Hat Bugzilla Rules Engine 2015-12-01 19:25:31 EST
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.
Comment 48 Nelly Credi 2015-12-02 08:59:59 EST
We haven't seen this bug for a while.
Comment 49 Barak Korren 2015-12-29 04:52 EST
Created attachment 1110135 [details]
integration-engine sos report

Bug re-surfaced in the RHEV-CI integration-engine system; re-opening and attaching engine SOS report.
The bug seems to affect multiple hosts across different clusters with different cluster levels (both 3.5 and 3.6). Will attach host SOS reports soon.
Comment 50 Barak Korren 2015-12-29 04:58 EST
Created attachment 1110136 [details]
3.6 cluster host report (new-ise rhel 7.2 install)
Comment 51 Barak Korren 2015-12-29 05:01 EST
Created attachment 1110137 [details]
3.6 cluster host report
Comment 52 Barak Korren 2015-12-29 05:38 EST
Created attachment 1110139 [details]
integration-engine DB logs
Comment 53 Barak Korren 2015-12-29 05:52:56 EST
Bug re-surfaced in the RHEV-CI integration-engine system; re-opening and attaching engine SOS report.
The bug seems to affect multiple hosts across different clusters with different cluster levels (both 3.5 and 3.6).
Comment 54 Piotr Kliczewski 2016-01-08 08:50:24 EST
Hopefully we will be able to have it in.
Comment 55 Meni Yakove 2016-01-20 10:47:06 EST
Looks OK, jobs don't get stuck anymore.
