Bug 1259000 - Hosts stuck in 'connecting' status while running automation tests
Status: CLOSED CURRENTRELEASE
Product: ovirt-engine
Classification: oVirt
Component: BLL.Infra
Version: 3.6.2
Hardware: x86_64 Linux
Priority: unspecified  Severity: high
Target Milestone: ovirt-3.6.2
Target Release: 3.6.2.5
Assigned To: Piotr Kliczewski
QA Contact: Meni Yakove
Keywords: Automation, AutomationBlocker, Reopened
Depends On:
Blocks:
Reported: 2015-09-01 13:24 EDT by Meni Yakove
Modified: 2016-02-18 06:03 EST
19 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-18 06:03:05 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt-3.6.z+
ylavi: planning_ack+
rule-engine: devel_ack+
rule-engine: testing_ack+


Attachments
rhevm logs (557.39 KB, application/x-bzip) - 2015-09-01 13:26 EDT, Meni Yakove
vdsm logs (431.22 KB, application/x-bzip) - 2015-09-01 13:26 EDT, Meni Yakove
env1 hosts logs (480.00 KB, application/x-bzip) - 2015-09-13 11:36 EDT, Meni Yakove
env1 engine logs (746.97 KB, application/x-bzip) - 2015-09-13 11:37 EDT, Meni Yakove
env2 hosts logs (1.95 MB, application/x-bzip) - 2015-09-13 11:38 EDT, Meni Yakove
env2 engine logs (685.89 KB, application/x-bzip) - 2015-09-13 11:38 EDT, Meni Yakove
Debug log engine (14.73 MB, application/x-gzip) - 2015-11-03 02:34 EST, Nelly Credi
integration-engine sos report (16.49 MB, application/x-xz) - 2015-12-29 04:52 EST, Barak Korren
3.6 cluster host report (new-ise rhel 7.2 install) (17.45 MB, application/x-xz) - 2015-12-29 04:58 EST, Barak Korren
3.6 cluster host report (18.90 MB, application/x-xz) - 2015-12-29 05:01 EST, Barak Korren
integration-engine DB logs (2.20 MB, application/x-xz) - 2015-12-29 05:38 EST, Barak Korren


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 45712 master MERGED Revert "net/api: drop support for legacy mgmt networks" Never
oVirt gerrit 46237 ovirt-3.6 MERGED Revert "net/api: drop support for legacy mgmt networks" Never
oVirt gerrit 51513 master MERGED i/o thread blocked during connection 2016-01-08 07:35 EST
oVirt gerrit 51552 master MERGED Update version of jsonrpc to 1.1.6 2016-01-11 04:48 EST
oVirt gerrit 51553 ovirt-engine-3.6 MERGED Update version of jsonrpc to 1.1.6 2016-01-11 04:54 EST
oVirt gerrit 51554 ovirt-engine-3.6.2 MERGED Update version of jsonrpc to 1.1.6 2016-01-11 05:10 EST

Description Meni Yakove 2015-09-01 13:24:54 EDT
Description of problem:
While running automation tests, at some point the hosts get stuck in 'connecting' status even though the hosts are up and VDSM is responding (vdsClient).

The error in the engine:
2015-09-01 05:41:03,631 - MainThread - api_utils - ERROR - Failed to syncAction element NOT as expected:
02:21:26.429 	Status: 503
02:21:26.429 	Reason: Service Temporarily Unavailable

Version-Release number of selected component (if applicable):
vdsm-4.17.3-1.el7ev.noarch
rhevm-3.6.0-0.12.master.el6.noarch

How reproducible:
sometimes

Steps to Reproduce:
1. Run tier1 automation
Comment 1 Meni Yakove 2015-09-01 13:26:13 EDT
Created attachment 1069090 [details]
rhevm logs
Comment 2 Meni Yakove 2015-09-01 13:26:42 EDT
Created attachment 1069091 [details]
vdsm logs
Comment 3 Piotr Kliczewski 2015-09-02 04:22:36 EDT
It looks like after running:

Calling 'Host.setupNetworks' in bridge with {u'bondings': {}, u'networks': {u'sw162': {u'remove': u'true'}}, u'options': {u'connectivityCheck': u'true', u'connectivityTimeout': 60}}

we lost connectivity from the engine and were not able to reconnect. Not sure if it is related, but in the engine log I can see:

2015-09-01 03:37:32,013 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-8) [host_nics_syncAction_b7d812f9-d3cc-40a9] Host 'host_mixed_1' ('5d1e1eb1-6aa4-4d51-836c-c309d0850a99') doesn't contain Supported Cluster Versions, therefore 'defaultRoute' will not be sent via the SetupNetworks

It seems to be a network issue, not an infra issue.
Comment 4 Piotr Kliczewski 2015-09-02 04:29:39 EDT
Dan, can you please take a look at the logs and see whether there are any issues with setup networks?
Comment 5 Dan Kenigsberg 2015-09-03 05:38:25 EDT
In https://gerrit.ovirt.org/#/c/43615/ I made Vdsm require an explicit defaultRoute=True on the management network, since Engine<=3.3 are no longer supported.

However, even Engine<=3.5 does not always send defaultRoute (e.g. for clusterLevels<=3.3).

We'll need to carry the legacy code further on.
Comment 6 Meni Yakove 2015-09-08 14:17:22 EDT
I don't think this is related to defaultRoute=True, since when we hit this bug I have an SSH connection to the host (from a different subnet) and the only way to fix it is to restart the engine service.
Comment 7 Dan Kenigsberg 2015-09-09 06:02:00 EDT
Meni, please reproduce the bug with my patch, to prove your point.
Comment 8 Meni Yakove 2015-09-10 13:28:13 EDT
Is your patch in the latest QE build?
Comment 9 Dan Kenigsberg 2015-09-13 04:51:34 EDT
No, it is not yet merged.
Comment 10 Meni Yakove 2015-09-13 09:55:41 EDT
I have a setup exhibiting this if you want to take a look.
I can SSH to the host and vdsClient is responsive.
Comment 11 Meni Yakove 2015-09-13 11:15:25 EDT
Set back the old title since I hit it on two environments again.
Comment 12 Meni Yakove 2015-09-13 11:36:15 EDT
Created attachment 1072915 [details]
env1 hosts logs
Comment 13 Meni Yakove 2015-09-13 11:37:04 EDT
Created attachment 1072916 [details]
env1 engine logs
Comment 14 Meni Yakove 2015-09-13 11:38:18 EDT
Created attachment 1072917 [details]
env2 hosts logs
Comment 15 Meni Yakove 2015-09-13 11:38:57 EDT
Created attachment 1072918 [details]
env2 engine logs
Comment 16 Meni Yakove 2015-09-13 11:40:54 EDT
Reproduce: run host_network_api job
Comment 17 Barak 2015-09-17 07:41:59 EDT
Just to be clear:
sometimes when one of the hosts in the cluster goes through setupNetworks, all other hosts in the DC go into 'connecting' status.

Restarting the engine solves it, hence this looks like an infra bug;
moving to engine.
Comment 18 Piotr Kliczewski 2015-09-17 07:54:09 EDT
OK, how can I reproduce it?
Comment 19 Piotr Kliczewski 2015-09-17 08:00:37 EDT
Looking at the latest engine log, I can see that only one host was failing at a time. Initially it was 10.35.128.28, and after some time it was 10.35.128.31; eventually the engine stopped connecting to it. It would be good to see what is going on at the message level, so please provide an engine log at debug level so we can understand the full picture.
Comment 20 Barak 2015-09-17 08:02:05 EDT
Is it related to Bug 1262051?
Comment 21 Piotr Kliczewski 2015-09-17 08:05:56 EDT
I thought it was, but now we suspect it is an infra issue, so we need more information to make sure.
Comment 22 Piotr Kliczewski 2015-09-17 08:06:23 EDT
Please provide an engine log at debug level.
Comment 24 Piotr Kliczewski 2015-09-24 07:30:20 EDT
We suspected that this problem would be gone after fixing BZ #1256446. We retested with a patched environment and hit it again. The issue is that when we run setupNetworks we see the host status changing from 'Up' -> 'Connecting' -> 'Up'.
These status changes are triggered by a connection policy reset, and that is expected behavior.

The problem is that sometimes the host status change is so slow that when we attempt to run the next setupNetworks test it fails because the host is still in 'Connecting' state. The host recovers within a few seconds, but it is too late.

I suggest updating the SetupNetworks command to wait for the host status change before assuming the operation succeeded.
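
For illustration, a minimal polling sketch of that suggestion (this is not the test framework's or the engine's actual code; get_host_state is a hypothetical callable that returns the current host status, e.g. via the oVirt REST API):

import time

def wait_for_host_up(get_host_state, timeout=120, interval=2):
    """Poll the host status until it returns to 'up', or fail after `timeout` seconds."""
    deadline = time.time() + timeout
    state = None
    while time.time() < deadline:
        state = get_host_state()  # hypothetical callable supplied by the caller
        if state == 'up':
            return
        # 'connecting' is expected transiently after the connection policy reset
        time.sleep(interval)
    raise RuntimeError("host did not return to 'up' within %ss (last state: %s)"
                       % (timeout, state))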
Comment 25 Oved Ourfali 2015-09-24 07:34:19 EDT
As far as we know this issue has been there before, so it's not a regression.
Removing blocker flag.
Putting needinfo on Alona and Dan to share their thoughts.
Comment 26 Oved Ourfali 2015-09-24 07:39:14 EDT
Sounds to me like the tests need to be changed.
Comment 27 Oved Ourfali 2015-10-11 12:42:14 EDT
Moving to network to decide whether the test should change, or setup networks logic should change.
Comment 28 Barak 2015-10-15 09:04:53 EDT
Meni, looking at comment #24, I think the right fix is to adjust the tests to fit the new, slower nature of SetupNetworks in 3.6.
Comment 29 Meni Yakove 2015-10-25 06:06:58 EDT
This bug is about all hosts getting stuck in 'connecting' status, where the only way to fix it is to restart the engine service.
We have another bug about hosts moving from 'up' to 'connecting' and back to 'up' (https://bugzilla.redhat.com/show_bug.cgi?id=1262051), and that one is related to the speed of the operation.
Comment 30 Oved Ourfali 2015-10-26 06:01:47 EDT
(In reply to Meni Yakove from comment #29)
> this bug is about all hosts are stuck in connecting status and the only way
> to fix it is to restart engine service.
> We have another bug about hosts move from 'up' to 'connecting' and back to
> 'up'(https://bugzilla.redhat.com/show_bug.cgi?id=1262051) and this is
> related to the speed of the operation.

It sounds like you won't see this issue anymore once Bug 1262051 is fixed (or the tests are changed to fit the new, slower nature of setup networks).
Please verify that once it is fixed.

In addition, Piotr - please re-examine to make sure there is no additional issue hiding here.
Comment 31 Piotr Kliczewski 2015-10-26 06:36:11 EDT
From what I see in the logs we have a protection issue. There is a host monitoring cycle running just before setupNetworks is triggered. When we lose connectivity due to the network configuration changes, host monitoring never receives a response. When the vdsTimeout occurs the connection is reset, even though it could be healthy by that time.

We need to make sure that there is no message exchange while we run setupNetworks.
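
As a rough sketch of that protection idea (an assumption about how it could look, not the engine's actual implementation), the monitoring exchange and setupNetworks could be serialized per host, with monitoring skipping a cycle while a reconfiguration is in flight:

import threading

class HostChannel:
    """Sketch: suppress the periodic monitoring exchange while setupNetworks runs."""
    def __init__(self):
        self._lock = threading.Lock()  # serializes message exchange for one host

    def monitor_cycle(self, poll):
        # Skip this cycle rather than racing a network reconfiguration.
        if self._lock.acquire(False):  # non-blocking acquire
            try:
                poll()  # e.g. the Host.getStats request
            finally:
                self._lock.release()

    def setup_networks(self, do_setup):
        with self._lock:  # no monitoring traffic while connectivity may drop
            do_setup()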
Comment 32 Piotr Kliczewski 2015-10-26 06:50:10 EDT
Restoring needinfo. Can we get an engine log in debug (like in comment #22)?
Comment 33 Meni Yakove 2015-11-02 04:36:11 EST
Still waiting for Nelly to run the tests.
Comment 34 Nelly Credi 2015-11-03 02:34 EST
Created attachment 1088817 [details]
Debug log engine
Comment 35 Nelly Credi 2015-11-03 02:38:50 EST
Attached debug log of the engine
Comment 36 Piotr Kliczewski 2015-11-03 03:44:14 EST
In order to provide a thread dump, please send SIGQUIT to the engine process:
kill -3 <java-pid> > output.txt

Once you run ps to check the pid, please make sure that -Xrs is not set. This flag tells the JVM to ignore SIGQUIT.
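
A small hypothetical helper for the -Xrs check (assumes Linux /proc; not part of any existing tooling):

import sys

def jvm_ignores_sigquit(pid):
    # /proc/<pid>/cmdline holds the NUL-separated arguments of the running JVM
    with open('/proc/%d/cmdline' % pid, 'rb') as f:
        args = f.read().split(b'\0')
    return b'-Xrs' in args

if __name__ == '__main__':
    print('-Xrs set (SIGQUIT ignored):', jvm_ignores_sigquit(int(sys.argv[1])))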
Comment 37 Piotr Kliczewski 2015-11-03 05:56:34 EST
I can see 3 minutes with no log entries in the log:

2015-11-02 18:56:18,815 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-91) [] START, GetStatsVDSCommand(HostName = host_mixed_1, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='41857f47-d3db-4491-b0d8-a2c96b9d037f', vds='Host[host_mixed_1,41857f47-d3db-4491-b0d8-a2c96b9d037f]'}), log id: 3d3a7e99
2015-11-02 18:56:18,815 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient] (DefaultQuartzScheduler_Worker-91) [] Message sent: SEND
reply-to:jms.topic.vdsm_responses
destination:jms.topic.vdsm_requests
content-length:98

{"jsonrpc":"2.0","method":"Host.getStats","params":{},"id":"427d60bc-479f-4d87-9ca2-d6384c882b2a"}^@
2015-11-02 18:59:06,607 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc":"2.0","error":{"message":"Vds timeout occured","code":"10.35.117.44:2033222199"},"id":null}
2015-11-02 18:59:06,608 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc":"2.0","error":{"message":"Vds timeout occured","code":"10.35.117.42:2062834906"},"id":null}

Do you see high load on the host that could stall the JVM for such a long time?
Comment 38 Meni Yakove 2015-11-03 18:12:46 EST
I have this:

ovirt    30817 30815 99 01:07 ?        00:01:01 ovirt-engine -server -XX:+TieredCompilation -Xms1024M -Xmx1024M -XX:PermSize=256m -XX:MaxPermSize=256m -Djava.awt.headless=true -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Djsse.enableSNIExtension=false -Djava.security.krb5.conf=/etc/ovirt-engine/krb5.conf -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/ovirt-engine/dump -Djava.util.logging.manager=org.jboss.logmanager -Dlogging.configuration=file:///var/lib/ovirt-engine/jboss_runtime/config/ovirt-engine-logging.properties -Dorg.jboss.resolver.warning=true -Djboss.modules.system.pkgs=org.jboss.byteman -Djboss.modules.write-indexes=false -Djboss.server.default.config=ovirt-engine -Djboss.home.dir=/usr/share/jbossas -Djboss.server.base.dir=/usr/share/ovirt-engine -Djboss.server.data.dir=/var/lib/ovirt-engine -Djboss.server.log.dir=/var/log/ovirt-engine -Djboss.server.config.dir=/var/lib/ovirt-engine/jboss_runtime/config -Djboss.server.temp.dir=/var/lib/ovirt-engine/jboss_runtime/tmp -Djboss.controller.temp.dir=/var/lib/ovirt-engine/jboss_runtime/tmp -jar /usr/share/jbossas/jboss-modules.jar -mp /var/lib/ovirt-engine/jboss_runtime/modules/00-modules-common:/var/lib/ovirt-engine/jboss_runtime/modules/01-ovirt-engine-extension-aaa-jdbc-modules:/var/lib/ovirt-engine/jboss_runtime/modules/02-ovirt-engine-extension-aaa-ldap-modules:/var/lib/ovirt-engine/jboss_runtime/modules/03-ovirt-engine-extension-aaa-misc-modules:/var/lib/ovirt-engine/jboss_runtime/modules/04-jbossas-modules -jaxpmodule javax.xml.jaxp-provider org.jboss.as.standalone -c ovirt-engine.xml

and when I run kill -3 30817 > output.txt the txt file is empty and the process is still up.
Comment 39 Piotr Kliczewski 2015-11-04 03:36:57 EST
In order to take a thread dump we need to send the SIGQUIT signal:

kill -3 <engine-pid>

If there is no output, standard output may be redirected or the -Xrs flag may be set.
To get the thread dump we can do one of the following:
1. Add the following lines to ovirt-engine.py in this section:

self._engineArgs.extend([
    .......
    '-XX:+UnlockDiagnosticVMOptions',
    '-XX:+LogVMOutput',
    '-XX:LogFile=/tmp/jvm.log',
])

A restart of the engine is required; when kill -3 is sent, the thread dump is written to the /tmp/jvm.log file.

2. Check your java version by running java -version; if it is 1.7, install the java-1.7.0-openjdk-devel package. After the installation we can run:

jstack <engine-pid>

We can redirect the output to a file by:

kill -3 <pid> > output.txt or jstack <pid> > output.txt
Comment 40 Meni Yakove 2015-11-11 10:50:42 EST
(In reply to Piotr Kliczewski from comment #39)
> In order to take a thread dump we need to send SIGQUIT signal by:
> 
> kill -3 <engine-pid>
> 
> If there is no output we could have standard output redirected or -Xrs flag
> set.
> In order to get the thread dump we can do:
> 1. Add following lines to ovirt-engine.py for section         
> 
> self._engineArgs.extend([
>             .......
>             '-XX:+UnlockDiagnosticVMOptions',
>             '-XX:+LogVMOutput',
>             '-XX:LogFile=/tmp/jvm.log',
> ])
> 
> Restart of the engine is required and when kill -3 is send thread dump is
> redirected to /tmp/jvm.log file.
> 
> 2. Check you java version by running java -version and if it is 1.7 install
> java-1.7.0-openjdk-devel package. After the installation we can run:
> 
> jstack <engine-pid>
> 
> We can redirect the output to a file by:
> 
> kill -3 <pid> > output.txt or jstack <pid> > output.text

Piotr, with this process will jvm.log be created only when I run kill -3? We have jvm.log but we didn't hit the connecting issue. Is it possible that the file is created on every engine restart?
Comment 41 Piotr Kliczewski 2015-11-12 03:25:05 EST
The file is created at engine start time. It only contains jvm information. The thread dump is only added to the file when you run kill -3.
Comment 42 Meni Yakove 2015-11-12 04:43:18 EST
So I need a way to generate the file name.
kill -9 > <file_name> is not working.
Comment 43 Piotr Kliczewski 2015-11-12 04:51:33 EST
The filename is provided in the jvm args as in comment #40.
When you start the engine with these args the file is created with only JVM info, but when you send SIGQUIT it writes the thread dump to the file provided in the LogFile arg.

I suggest using -3, not -9.
Comment 44 Meni Yakove 2015-11-12 08:21:25 EST
I use -3 (-9 was a typo). We have a patch that checks if all hosts are in 'connecting'; then we run kill -3 <java_pid> and restart the engine, so in this case we will never have the right file. Is this correct?
Comment 45 Piotr Kliczewski 2015-11-12 09:02:10 EST
it is correct.
Comment 46 Meni Yakove 2015-11-12 10:24:07 EST
OK, I need a way to save each file with a different name.
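
One possible workaround, purely as an assumption on my side: copy the fixed-name dump aside with a timestamp before restarting the engine, so each occurrence keeps its own file. A minimal sketch:

import shutil
import time

def preserve_dump(src='/tmp/jvm.log'):
    # A timestamped copy survives the jvm.log rewrite on the next engine start.
    dst = '%s.%s' % (src, time.strftime('%Y%m%d-%H%M%S'))
    shutil.copy2(src, dst)
    return dst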
Comment 47 Red Hat Bugzilla Rules Engine 2015-12-01 19:25:31 EST
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.
Comment 48 Nelly Credi 2015-12-02 08:59:59 EST
We haven't seen this bug for a while.
Comment 49 Barak Korren 2015-12-29 04:52 EST
Created attachment 1110135 [details]
integration-engine sos report

Bug re-surfaced in the RHEV-CI integration-engine system; re-opening and attaching engine SOS report.
The bug seems to affect multiple hosts across different clusters with different cluster levels (both 3.5 and 3.6). Will attach host SOS reports soon.
Comment 50 Barak Korren 2015-12-29 04:58 EST
Created attachment 1110136 [details]
3.6 cluster host report (new-ise rhel 7.2 install)
Comment 51 Barak Korren 2015-12-29 05:01 EST
Created attachment 1110137 [details]
3.6 cluster host report
Comment 52 Barak Korren 2015-12-29 05:38 EST
Created attachment 1110139 [details]
integration-engine DB logs
Comment 53 Barak Korren 2015-12-29 05:52:56 EST
Bug re-surfaced in the RHEV-CI integration-engine system; re-opening and attaching engine SOS report.
The bug seems to affect multiple hosts across different clusters with different cluster levels (both 3.5 and 3.6).
Comment 54 Piotr Kliczewski 2016-01-08 08:50:24 EST
Hopefully we will be able to have it in.
Comment 55 Meni Yakove 2016-01-20 10:47:06 EST
Looks OK, jobs don't get stuck anymore.
