Description of problem:
I think I reproduced by chance one of those weird instances where all hosts go non-responsive and the only solution is to restart the engine. Here is what happened:

2018-05-25 12:32:35: Selected 5 VMs in the UI and ran all of them at the same time.
2018-05-25 12:32:39: Realized I didn't want those VMs to start; they were still all selected, powering up. I powered them off via the GUI.
2018-05-25 12:32:47: First heartbeat exceeded (SPM).
2018-05-25 12:36:29: The other 2 hosts go to Not Responding and get stuck there.
From 12:40 to 12:50: I'm looking at the logs, taking thread dumps, testing connectivity, even restarting vdsms. Nothing helps, no obvious problems (except maybe vdsm on h3 - the SPM restarted without fencing).
2018-05-25 12:51:23: Used gdb to extract a core dump of the JVM; the engine was paused until 12:58.
2018-05-25 12:58: Java heap dump.

After restarting the engine, everything is up and green.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.3.5-1.el7.centos.noarch
vdsm-jsonrpc-java-1.4.12-1.el7.centos.noarch

How reproducible:
0%. Tried the start/power-off of VMs a few more times, did not hit it again.

Actual results:
All hosts unresponsive until the engine is restarted.

Expected results:
All hosts responsive.
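The diagnostics mentioned in the timeline (thread dump, gdb core dump, Java heap dump) can be scripted so they are captured consistently during an incident. A minimal sketch, assuming the standard JDK tools `jstack` and `jmap` plus gdb's `gcore` are on PATH; the function name, the output directory, and the injectable `runner` parameter are illustrative, not from the report:

```python
import subprocess

def capture_jvm_diagnostics(pid, out_dir="/var/tmp", runner=subprocess.run):
    """Capture a thread dump, core dump, and heap dump of a hung JVM.

    Note: gcore and jmap pause the target JVM while they run; in the
    report above, the engine was paused for several minutes while the
    core dump was taken.
    """
    # Thread dump: jstack -l prints all thread stacks with lock info.
    runner(["jstack", "-l", str(pid)], check=True)
    # Core dump via gdb's gcore; writes <out_dir>/engine-core.<pid>.
    runner(["gcore", "-o", f"{out_dir}/engine-core", str(pid)], check=True)
    # Heap dump in binary hprof format for offline analysis.
    runner(
        ["jmap", f"-dump:live,format=b,file={out_dir}/engine-heap.hprof",
         str(pid)],
        check=True,
    )
```

The `runner` argument lets you record or dry-run the commands instead of executing them, which is also how the sketch can be exercised without a live JVM.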
Germano, it sounds like vdsm crashed (e.g. segfault) on host 3, and we would like to understand why. Can you attach /var/log/messages from this host, and the relevant abrt crash reports? This probably needs a separate bug; feel free to open one for vdsm.
(In reply to Nir Soffer from comment #11) > Germano, it sounds like vdsm crashed (e.g. segfault) on host 3, and we would > like to understand why. Can you attach /var/log/messages from this host, and > the relevant abrt crash reports? > > This probably needs a separate bug, feel free to open one for vdsm. Yes, this is the reboot I mentioned in comment #0. It was not vdsm that crashed. The host had a kernel panic in kvm and rebooted; this host doesn't have much memory and there is some cache flushing involved. I'll take a better look and submit a kernel bug later if necessary. So I think the only thing to be done here is to make the engine more resilient to vdsm/host failures?
(In reply to Germano Veit Michel from comment #12) > > So I think the only thing to be done here is to make the engine more > resilient to vdsm/host failures? Yes, now we need to reproduce and see exactly how SSLEngine behaves in a similar situation, and handle it correctly.
Reducing the priority since this is a corner case; however, we would probably like to target this one to 4.3, pending Piotr's analysis.
Ravi, Did you try to reproduce the issue?
@Piotr I was unable to reproduce the issue.
Germano, Any ideas how to reproduce?
(In reply to Piotr Kliczewski from comment #17) > Germano, Any ideas how to reproduce? Unfortunately no. I did try a few more times to repeat what I was doing as per comment #0, with no luck. And I've been using the same environment for some time, and it has all been good. Can't you attempt to force such a situation by modifying the code?
(In reply to Germano Veit Michel from comment #18) > > Can't you attempt to force such a situation by modifying the code? Let's try to do it. I will talk to Ravi what needs to be done.
Verification steps?
Verification steps:
1. Have vdsm in Up status.
2. Kill vdsm on the host.
3. Start vdsm on the host and make sure vdsm is running.

The host should move back to Up status. Repeat the above 20 times to make sure everything works.
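The verification loop above can be scripted. A minimal sketch, assuming the vdsm systemd unit is named `vdsmd` and that host status is checked by a caller-supplied function (in real use, a query against the engine's REST API); the function and parameter names are illustrative, not from the bug:

```python
import subprocess

def verify_host_recovery(check_host_up, iterations=20, runner=subprocess.run):
    """Kill and restart vdsm `iterations` times, verifying recovery each time.

    `check_host_up` is a callable returning True once the engine shows the
    host as Up again; `runner` is injectable so the loop can be dry-run or
    tested without touching systemd.
    """
    for i in range(1, iterations + 1):
        # Step 2: kill vdsm hard, simulating the crash scenario.
        runner(["systemctl", "kill", "--signal=SIGKILL", "vdsmd"], check=True)
        # Step 3: start vdsm again so the engine can reconnect.
        runner(["systemctl", "start", "vdsmd"], check=True)
        if not check_host_up():
            raise RuntimeError(f"host did not return to Up on iteration {i}")
    return iterations
```

Injecting a polling implementation of `check_host_up` (e.g. one that retries a hosts API query for a minute or two) keeps the loop from flagging a false failure while the engine is still reconnecting.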
35 times, 0 issues; the host always came back Up.
*** Bug 1641836 has been marked as a duplicate of this bug. ***
This bugzilla is included in oVirt 4.2.7 release, published on November 2nd 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.7 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
*** Bug 1657852 has been marked as a duplicate of this bug. ***