Bug 1582379 - All hosts stuck in connecting/not responding state until engine restarted
Summary: All hosts stuck in connecting/not responding state until engine restarted
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm-jsonrpc-java
Classification: oVirt
Component: Core
Version: ---
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ovirt-4.2.7
Target Release: 1.4.15
Assignee: Ravi Nori
QA Contact: Pavol Brilla
URL:
Whiteboard:
Duplicates: 1641836 1657852
Depends On:
Blocks:
 
Reported: 2018-05-25 04:03 UTC by Germano Veit Michel
Modified: 2021-12-10 16:28 UTC
CC List: 10 users

Fixed In Version: v1.4.15
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-02 14:30:44 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.2+


Links
Red Hat Issue Tracker RHV-44218 (last updated 2021-12-10 16:28:49 UTC)
Red Hat Knowledge Base (Solution) 3678501: RHV: after network outage, some hypervisors are still in Not-Responding state. (last updated 2019-06-11 06:16:37 UTC)
oVirt gerrit 93530 (master, MERGED): All hosts stuck in connecting/not responding state until engine restarted (last updated 2020-12-01 20:51:38 UTC)

Description Germano Veit Michel 2018-05-25 04:03:23 UTC
Description of problem:

I think I reproduced by chance one of those weird instances where all hosts go non-responding and the only solution is to restart the engine. Here is what happened:

2018-05-25 12:32:35: Selected 5 VMs in the UI and ran all of them at the same time.

2018-05-25 12:32:39: Realized I didn't want those VMs to start; they were all still selected and powering up, so I powered them off via the GUI.

2018-05-25 12:32:47: First heartbeat exceeded (SPM).

2018-05-25 12:36:29: The other 2 hosts go to Not Responding and get stuck there.

From 12:40 to 12:50:

I'm looking at the logs, taking thread dumps, testing connectivity, even restarting vdsm on the hosts. Nothing helps and there are no obvious problems (except maybe vdsm on h3: the SPM restarted without fencing).

2018-05-25 12:51:23: Used gdb to extract a coredump of the JVM; the engine was paused until 12:58.

2018-05-25 12:58: Java heap dump taken.

After restarting the engine, everything is up and green.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.3.5-1.el7.centos.noarch
vdsm-jsonrpc-java-1.4.12-1.el7.centos.noarch

How reproducible:
0%; tried the VM start/power-off sequence a few more times but did not hit it again.

Actual results:
All unresponsive until engine restarted

Expected results:
All responsive

Comment 11 Nir Soffer 2018-05-30 14:25:57 UTC
Germano, it sounds like vdsm crashed (e.g. segfault) on host 3, and we would
like to understand why. Can you attach /var/log/messages from this host, and 
the relevant abrt crash reports?

This probably needs a separate bug, feel free to open one for vdsm.

Comment 12 Germano Veit Michel 2018-05-31 04:35:39 UTC
(In reply to Nir Soffer from comment #11)
> Germano, it sounds like vdsm crashed (e.g. segfault) on host 3, and we would
> like to understand why. Can you attach /var/log/messages from this host, and 
> the relevant abrt crash reports?
> 
> This probably needs a separate bug, feel free to open one for vdsm.

Yes, this is the reboot I mentioned in comment #0.

It was not vdsm that crashed. The host had a kernel panic in kvm and rebooted; this host doesn't have much memory and there is some cache flushing involved. I'll take a better look and submit a kernel bug later if necessary.

So I think the only thing to be done here is to make the engine more resilient to vdsm/host failures?
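
For illustration, one shape "more resilient" could take on the engine side is a heartbeat watchdog that recycles a stalled connection instead of waiting for an engine restart. This is only a rough sketch; HostClient, lastHeartbeatMillis and closeAndReconnect are hypothetical placeholders, not the vdsm-jsonrpc-java API:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative watchdog: if a host misses its heartbeat, tear the
// connection down and retry instead of leaving it stuck "connecting".
// HostClient and its methods are placeholders for this sketch only.
final class HeartbeatWatchdogSketch {

    interface HostClient {
        long lastHeartbeatMillis();
        void closeAndReconnect();
    }

    private final Map<String, HostClient> clients = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void register(String hostName, HostClient client) {
        clients.put(hostName, client);
    }

    void start(long heartbeatTimeoutMillis) {
        scheduler.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            clients.forEach((hostName, client) -> {
                if (now - client.lastHeartbeatMillis() > heartbeatTimeoutMillis) {
                    // Heartbeat exceeded: recycle the connection so the host
                    // does not stay non-responsive until the engine restarts.
                    client.closeAndReconnect();
                }
            });
        }, heartbeatTimeoutMillis, heartbeatTimeoutMillis, TimeUnit.MILLISECONDS);
    }
}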

Comment 13 Piotr Kliczewski 2018-06-04 07:49:16 UTC
(In reply to Germano Veit Michel from comment #12)
> 
> So I think the only thing to be done here is to make the engine more
> resilient to vdsm/host failures?

Yes, now we need to reproduce it and see exactly how SSLEngine behaves in a similar situation, and handle it correctly.
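
For reference, the handling in question sits around the JDK SSLEngine unwrap loop. Below is a minimal, generic sketch of defensive unwrap handling using the plain javax.net.ssl API; the PeerGoneListener callback is a hypothetical placeholder, and this is not the actual vdsm-jsonrpc-java reactor code:

import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLEngineResult;
import javax.net.ssl.SSLException;
import java.nio.ByteBuffer;

// Sketch: decode incoming TLS records without letting an abrupt peer
// failure (e.g. a host kernel panic) leave the channel half-open forever.
final class SslReadSketch {

    // Hypothetical callback so the caller can mark the host
    // non-responsive and schedule a clean reconnect.
    interface PeerGoneListener { void onPeerGone(Throwable cause); }

    static void unwrapAll(SSLEngine engine, ByteBuffer netIn, ByteBuffer appIn,
                          PeerGoneListener listener) {
        try {
            while (netIn.hasRemaining()) {
                SSLEngineResult result = engine.unwrap(netIn, appIn);
                switch (result.getStatus()) {
                    case OK:
                        break;                  // keep draining the network buffer
                    case BUFFER_UNDERFLOW:
                        return;                 // need more network data first
                    case BUFFER_OVERFLOW:
                        appIn = enlarge(appIn); // grow the application buffer and retry
                        break;
                    case CLOSED:
                        engine.closeInbound();  // peer is gone; do not keep waiting
                        listener.onPeerGone(null);
                        return;
                }
            }
        } catch (SSLException e) {
            // A broken handshake or truncated record must not leave the
            // connection stuck in "connecting": surface it and reconnect.
            listener.onPeerGone(e);
        }
    }

    private static ByteBuffer enlarge(ByteBuffer buf) {
        ByteBuffer bigger = ByteBuffer.allocate(buf.capacity() * 2);
        buf.flip();
        bigger.put(buf);
        return bigger;
    }
}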

Comment 14 Moran Goldboim 2018-06-21 13:10:47 UTC
Reducing the priority since this is a corner case; however, we would probably like to target this one to 4.3, pending Piotr's analysis.

Comment 15 Piotr Kliczewski 2018-06-21 14:10:45 UTC
Ravi, did you try to reproduce the issue?

Comment 16 Ravi Nori 2018-06-21 16:54:14 UTC
@Piotr

I was unable to reproduce the issue.

Comment 17 Piotr Kliczewski 2018-06-22 09:15:42 UTC
Germano, any ideas how to reproduce?

Comment 18 Germano Veit Michel 2018-06-24 23:44:27 UTC
(In reply to Piotr Kliczewski from comment #17)
> Germano, any ideas how to reproduce?

Unfortunately, no. I did try a few more times to repeat what I was doing as per comment #0, with no luck. And I've been using the same environment for some time, and it has all been fine.

Can't you attempt to force such a situation by modifying the code?

Comment 19 Piotr Kliczewski 2018-06-25 08:19:05 UTC
(In reply to Germano Veit Michel from comment #18)
> 
> Can't you attempt to force such a situation by modifying the code?

Let's try to do it. I will talk to Ravi about what needs to be done.

Comment 20 Pavol Brilla 2018-09-10 11:14:23 UTC
Verification steps?

Comment 21 Ravi Nori 2018-09-10 11:56:17 UTC
Verification steps

1. Have the host in Up status
2. Kill the vdsm host
3. Start the vdsm host and make sure vdsm is running

The host should move back to Up status.

Repeat the above 20 times to make sure everything works (a status-polling sketch follows below).
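
As a convenience for repeating the loop above, here is a rough sketch of polling the host status via the engine REST API. The endpoint path, host id, credentials and the exact XML status element are assumptions for a test environment, not part of the verification instructions; it also assumes the engine CA is trusted by the JVM.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Polls GET /ovirt-engine/api/hosts/<id> and reports whether the host
// status reads "up". Engine URL, host id and credentials are placeholders.
public final class HostStatusPoller {

    public static void main(String[] args) throws Exception {
        String engine = "https://engine.example.com/ovirt-engine/api/hosts/HOST_ID";
        String auth = Base64.getEncoder()
                .encodeToString("admin@internal:password".getBytes(StandardCharsets.UTF_8));

        for (int attempt = 1; attempt <= 60; attempt++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(engine).openConnection();
            conn.setRequestProperty("Authorization", "Basic " + auth);
            conn.setRequestProperty("Accept", "application/xml");

            StringBuilder body = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    body.append(line);
                }
            }

            // The element name is an assumption; it may differ between API versions.
            if (body.toString().contains("<status>up</status>")) {
                System.out.println("Host is Up after attempt " + attempt);
                return;
            }
            Thread.sleep(5000); // wait before polling again
        }
        System.out.println("Host did not come back Up");
    }
}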

Comment 22 Pavol Brilla 2018-09-12 10:58:02 UTC
35 times, 0 issues; the host always came back up.

Comment 23 Germano Veit Michel 2018-11-01 00:28:10 UTC
*** Bug 1641836 has been marked as a duplicate of this bug. ***

Comment 24 Sandro Bonazzola 2018-11-02 14:30:44 UTC
This bugzilla is included in the oVirt 4.2.7 release, published on November 2nd, 2018.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.7 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 25 Martin Perina 2018-12-11 10:41:29 UTC
*** Bug 1657852 has been marked as a duplicate of this bug. ***

