Bug 1582379 - All hosts stuck in connecting/not responding state until engine restarted
Summary: All hosts stuck in connecting/not responding state until engine restarted
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm-jsonrpc-java
Classification: oVirt
Component: Core
Version: ---
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ovirt-4.2.7
Target Release: 1.4.15
Assignee: Ravi Nori
QA Contact: Pavol Brilla
URL:
Whiteboard:
Duplicates: 1641836 1657852
Depends On:
Blocks:
 
Reported: 2018-05-25 04:03 UTC by Germano Veit Michel
Modified: 2021-12-10 16:28 UTC
CC List: 10 users

Fixed In Version: v1.4.15
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-02 14:30:44 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.2+


Links
Red Hat Issue Tracker RHV-44218 (last updated 2021-12-10 16:28:49 UTC)
Red Hat Knowledge Base (Solution) 3678501: RHV: after network outage, some hypervisors are still in Not-Responding state. (last updated 2019-06-11 06:16:37 UTC)
oVirt gerrit 93530 (master, MERGED): All hosts stuck in connecting/not responding state until engine restarted (last updated 2020-12-01 20:51:38 UTC)

Description Germano Veit Michel 2018-05-25 04:03:23 UTC
Description of problem:

I think I reproduced by chance one of those weird instances where all hosts go non-responding and the only solution is to restart the engine. Here is what happened:

2018-05-25 12:32:35: Selected 5 VMs in the UI and ran all of them at the same time.

2018-05-25 12:32:39: Realized I didn't want those VMs to start; they were all still selected and powering up, so I powered them off via the GUI.

2018-05-25 12:32:47: First heartbeat exceeded (SPM).

2018-05-25 12:36:29: The other 2 hosts go to Not Responding and get stuck there.

From 12:40 to 12:50:

I'm looking at the logs, taking thread dumps, testing connectivity, even restarting vdsm on the hosts. Nothing helps and there are no obvious problems (except maybe vdsm on h3: the SPM restarted without fencing).

2018-05-25 12:51:23: Used gdb to extract a coredump of the JVM; the engine was paused until 12:58.

2018-05-25 12:58: Java heap dump taken.

After restarting the engine, everything is up and green.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.3.5-1.el7.centos.noarch
vdsm-jsonrpc-java-1.4.12-1.el7.centos.noarch

How reproducible:
0%; tried the VM start/power-off sequence a few more times but did not hit it again.

Actual results:
All unresponsive until engine restarted

Expected results:
All responsive

Comment 11 Nir Soffer 2018-05-30 14:25:57 UTC
Germano, it sounds like vdsm crashed (e.g. segfault) on host 3, and we would
like to understand why. Can you attach /var/log/messages from this host, and 
the relevant abrt crash reports?

This probably needs a separate bug, feel free to open one for vdsm.

Comment 12 Germano Veit Michel 2018-05-31 04:35:39 UTC
(In reply to Nir Soffer from comment #11)
> Germano, it sounds like vdsm crashed (e.g. segfault) on host 3, and we would
> like to understand why. Can you attach /var/log/messages from this host, and 
> the relevant abrt crash reports?
> 
> This probably needs a separate bug, feel free to open one for vdsm.

Yes, this is the reboot I mentioned in comment #0.

It was not vdsm that crashed. The host had a kernel panic in kvm and rebooted; this host doesn't have much memory and there is some cache flushing involved. I'll take a better look and submit a kernel bug later if necessary.

So I think the only thing to be done here is to make the engine more resilient to vdsm/host failures?
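
For illustration, one shape "more resilient" could take on the engine side is a heartbeat watchdog that recycles a stalled connection instead of waiting for an engine restart. This is only a rough sketch; HostClient, lastHeartbeatMillis and closeAndReconnect are hypothetical placeholders, not the vdsm-jsonrpc-java API:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative watchdog: if a host misses its heartbeat, tear the
// connection down and retry instead of leaving it stuck "connecting".
// HostClient and its methods are placeholders for this sketch only.
final class HeartbeatWatchdogSketch {

    interface HostClient {
        long lastHeartbeatMillis();
        void closeAndReconnect();
    }

    private final Map<String, HostClient> clients = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void register(String hostName, HostClient client) {
        clients.put(hostName, client);
    }

    void start(long heartbeatTimeoutMillis) {
        scheduler.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            clients.forEach((hostName, client) -> {
                if (now - client.lastHeartbeatMillis() > heartbeatTimeoutMillis) {
                    // Heartbeat exceeded: recycle the connection so the host
                    // does not stay non-responsive until the engine restarts.
                    client.closeAndReconnect();
                }
            });
        }, heartbeatTimeoutMillis, heartbeatTimeoutMillis, TimeUnit.MILLISECONDS);
    }
}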

Comment 13 Piotr Kliczewski 2018-06-04 07:49:16 UTC
(In reply to Germano Veit Michel from comment #12)
> 
> So I think the only thing to be done here is to make the engine more
> resilient to vdsm/host failures?

Yes, now we need to reproduce it and see exactly how SSLEngine behaves in a similar situation, and handle it correctly.
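
For reference, the handling in question sits around the JDK SSLEngine unwrap loop. Below is a minimal, generic sketch of defensive unwrap handling using the plain javax.net.ssl API; the PeerGoneListener callback is a hypothetical placeholder, and this is not the actual vdsm-jsonrpc-java reactor code:

import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLEngineResult;
import javax.net.ssl.SSLException;
import java.nio.ByteBuffer;

// Sketch: decode incoming TLS records without letting an abrupt peer
// failure (e.g. a host kernel panic) leave the channel half-open forever.
final class SslReadSketch {

    // Hypothetical callback so the caller can mark the host
    // non-responsive and schedule a clean reconnect.
    interface PeerGoneListener { void onPeerGone(Throwable cause); }

    static void unwrapAll(SSLEngine engine, ByteBuffer netIn, ByteBuffer appIn,
                          PeerGoneListener listener) {
        try {
            while (netIn.hasRemaining()) {
                SSLEngineResult result = engine.unwrap(netIn, appIn);
                switch (result.getStatus()) {
                    case OK:
                        break;                  // keep draining the network buffer
                    case BUFFER_UNDERFLOW:
                        return;                 // need more network data first
                    case BUFFER_OVERFLOW:
                        appIn = enlarge(appIn); // grow the application buffer and retry
                        break;
                    case CLOSED:
                        engine.closeInbound();  // peer is gone; do not keep waiting
                        listener.onPeerGone(null);
                        return;
                }
            }
        } catch (SSLException e) {
            // A broken handshake or truncated record must not leave the
            // connection stuck in "connecting": surface it and reconnect.
            listener.onPeerGone(e);
        }
    }

    private static ByteBuffer enlarge(ByteBuffer buf) {
        ByteBuffer bigger = ByteBuffer.allocate(buf.capacity() * 2);
        buf.flip();
        bigger.put(buf);
        return bigger;
    }
}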

Comment 14 Moran Goldboim 2018-06-21 13:10:47 UTC
Reducing the priority since this is a corner case; however, we would probably like to target this one to 4.3, pending Piotr's analysis.

Comment 15 Piotr Kliczewski 2018-06-21 14:10:45 UTC
Ravi, did you try to reproduce the issue?

Comment 16 Ravi Nori 2018-06-21 16:54:14 UTC
@Piotr

I was unable to reproduce the issue.

Comment 17 Piotr Kliczewski 2018-06-22 09:15:42 UTC
Germano, any ideas how to reproduce?

Comment 18 Germano Veit Michel 2018-06-24 23:44:27 UTC
(In reply to Piotr Kliczewski from comment #17)
> Germano, any ideas how to reproduce?

Unfortunately, no. I did try a few more times to repeat what I was doing as per comment #0, with no luck. And I've been using the same environment for some time, and it has all been fine.

Can't you attempt to force such a situation by modifying the code?

Comment 19 Piotr Kliczewski 2018-06-25 08:19:05 UTC
(In reply to Germano Veit Michel from comment #18)
> 
> Can't you attempt to force such a situation by modifying the code?

Let's try to do it. I will talk to Ravi about what needs to be done.

Comment 20 Pavol Brilla 2018-09-10 11:14:23 UTC
Verification steps?

Comment 21 Ravi Nori 2018-09-10 11:56:17 UTC
Verification steps

1. Have the host in Up status
2. Kill the vdsm host
3. Start the vdsm host and make sure vdsm is running

The host should move back to Up status.

Repeat the above 20 times to make sure everything works (a status-polling sketch follows below).
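
As a convenience for repeating the loop above, here is a rough sketch of polling the host status via the engine REST API. The endpoint path, host id, credentials and the exact XML status element are assumptions for a test environment, not part of the verification instructions; it also assumes the engine CA is trusted by the JVM.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Polls GET /ovirt-engine/api/hosts/<id> and reports whether the host
// status reads "up". Engine URL, host id and credentials are placeholders.
public final class HostStatusPoller {

    public static void main(String[] args) throws Exception {
        String engine = "https://engine.example.com/ovirt-engine/api/hosts/HOST_ID";
        String auth = Base64.getEncoder()
                .encodeToString("admin@internal:password".getBytes(StandardCharsets.UTF_8));

        for (int attempt = 1; attempt <= 60; attempt++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(engine).openConnection();
            conn.setRequestProperty("Authorization", "Basic " + auth);
            conn.setRequestProperty("Accept", "application/xml");

            StringBuilder body = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    body.append(line);
                }
            }

            // The element name is an assumption; it may differ between API versions.
            if (body.toString().contains("<status>up</status>")) {
                System.out.println("Host is Up after attempt " + attempt);
                return;
            }
            Thread.sleep(5000); // wait before polling again
        }
        System.out.println("Host did not come back Up");
    }
}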

Comment 22 Pavol Brilla 2018-09-12 10:58:02 UTC
35 times, 0 issues; the host always came back up.

Comment 23 Germano Veit Michel 2018-11-01 00:28:10 UTC
*** Bug 1641836 has been marked as a duplicate of this bug. ***

Comment 24 Sandro Bonazzola 2018-11-02 14:30:44 UTC
This bugzilla is included in the oVirt 4.2.7 release, published on November 2nd, 2018.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.7 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 25 Martin Perina 2018-12-11 10:41:29 UTC
*** Bug 1657852 has been marked as a duplicate of this bug. ***

