Bug 1493176 - RHVH stuck on startup after 'probing EDD... ok' step
Summary: RHVH stuck on startup after 'probing EDD... ok' step
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-node
Classification: oVirt
Component: General
Version: 4.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ovirt-4.2.1
Target Release: 4.2
Assignee: Yuval Turgeman
QA Contact: Qin Yuan
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-09-19 14:09 UTC by Nelly Credi
Modified: 2018-02-12 11:50 UTC

Fixed In Version: imgbased-1.0.6
Doc Type: Bug Fix
Doc Text:
Cause: On shutdown, while imgbase-copy-bootfiles.service copies the kernel and initrd from /boot to /boot/ovirt-<version>, systemd sends SIGTERM to the service and all its subprocesses.
Consequence: /boot/ovirt-<version> holds a corrupted initrd (or sometimes kernel), and the system fails to boot.
Fix: Set the service's TimeoutStopSec to infinity and KillMode to none, and try to catch and ignore SIGTERM in the script itself.
Result: systemd won't kill the processes and will let the copy finish successfully.
Clone Of:
Environment:
Last Closed: 2018-02-12 11:50:08 UTC
oVirt Team: Node
Embargoed:
talayan: needinfo-
rule-engine: ovirt-4.2+
cshao: testing_ack+


Attachments
journalctl.dump (786.21 KB, text/plain)
2017-10-17 08:24 UTC, Tareq Alayan
no flags


Links
System        ID     Private  Priority  Status  Summary                                          Last Updated
oVirt gerrit  81900  0        'None'    MERGED  Try to block `kill` from imgbase-copy-bootfiles  2020-03-11 06:02:27 UTC
oVirt gerrit  82262  0        'None'    MERGED  Try to block `kill` from imgbase-copy-bootfiles  2020-03-11 06:02:27 UTC
oVirt gerrit  82263  0        'None'    MERGED  Try to block `kill` from imgbase-copy-bootfiles  2020-03-11 06:02:27 UTC
oVirt gerrit  86060  0        'None'    MERGED  Use temp files when copying kernel and initrd    2020-03-11 06:02:27 UTC
oVirt gerrit  86348  0        'None'    MERGED  Use temp files when copying kernel and initrd    2020-03-11 06:02:27 UTC
oVirt gerrit  86505  0        'None'    MERGED  Use temp files when copying kernel and initrd    2020-03-11 06:02:27 UTC

Description Nelly Credi 2017-09-19 14:09:47 UTC
Description of problem:
RHVH won't start after reboot.


Version-Release number of selected component (if applicable):
We have been chasing this issue for a while now.
The current version is RHVH-4.1-20170914.1.

How reproducible:
It happens a lot in our automation; Virt QE didn't see it at all.

Steps to Reproduce:
1. Install RHVH
2. Cycle the host multiple times until the issue reproduces

Actual results:
The host won't start;
it is stuck after the 'probing EDD... ok' step.

Expected results:
Host should start

Additional info:
According to Yuval T, this happens on cold reboot and is related to the imgbase-copy-bootfiles script, but I'll let him give his input.

Comment 1 Ryan Barry 2017-09-19 14:18:46 UTC
Yuval got some results this morning, though I'll let him chime in.

It seems that systemd may be killing the script. I wonder if we can simply set TimeoutSec=30

Comment 2 Yuval Turgeman 2017-09-19 15:15:20 UTC
imgbase-copy-bootfiles copies the kernel and initrd on shutdown from /boot to /boot/rhvh..., and while this copy is being done systemd kills the unit's processes (cp), leaving a partial initrd (or kernel) file under /boot/rhvh, and making the system unbootable.  Something like the following:

# ls -l /boot/initramfs-3.10.0-693.2.2.el7.x86_64.img /boot/rhvh-4.1-0.20170914.0+1/initramfs-3.10.0-693.2.2.el7.x86_64.img 

-rw-------. 1 root root 59685039 Sep 19 16:32 /boot/initramfs-3.10.0-693.2.2.el7.x86_64.img
-rw-------. 1 root root 59685039 Sep 19 16:50 /boot/rhvh-4.1-0.20170914.0+1/initramfs-3.10.0-693.2.2.el7.x86_64.img

# /usr/sbin/imgbase-copy-bootfiles shutdown & while [ 1 ]; do killall -9 cp; done 2>/dev/null
<ctrl-c>

# ls -l /boot/initramfs-3.10.0-693.2.2.el7.x86_64.img /boot/rhvh-4.1-0.20170914.0+1/initramfs-3.10.0-693.2.2.el7.x86_64.img

-rw-------. 1 root root 59685039 Sep 19 16:32 /boot/initramfs-3.10.0-693.2.2.el7.x86_64.img
-rw-------. 1 root root        0 Sep 19 16:54 /boot/rhvh-4.1-0.20170914.0+1/initramfs-3.10.0-693.2.2.el7.x86_64.img

Adding KillMode=none to the imgbase-copy-bootfiles service unit seems to solve this.
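
As a rough illustration of the script-side half of this idea (ignoring SIGTERM while the copy runs), here is a minimal shell sketch. It is not the actual imgbase-copy-bootfiles code; the helper name copy_ignoring_term and the temporary files are made up for the example. An ignored signal disposition is inherited by child processes, so a cp started inside the subshell ignores SIGTERM as well. SIGKILL cannot be ignored this way.

```shell
#!/bin/sh
# Hypothetical helper (not part of imgbased): copy SRC to DST while SIGTERM
# is ignored, so a TERM sent during the copy does not interrupt cp.
copy_ignoring_term() {
    (
        # Run in a subshell so the trap does not leak into the caller.
        # An empty trap action means "ignore"; children such as cp inherit it.
        trap '' TERM
        cp -- "$1" "$2"
    )
}

# Small self-contained demonstration using throwaway files.
demo_src=$(mktemp)
demo_dst=$(mktemp)
printf 'initrd payload\n' > "$demo_src"

copy_ignoring_term "$demo_src" "$demo_dst"

# Compare source and destination byte for byte.
if cmp -s "$demo_src" "$demo_dst"; then
    result=identical
else
    result=differs
fi
echo "$result"

rm -f -- "$demo_src" "$demo_dst"
```

On the unit side, KillMode=none tells systemd not to kill the unit's processes when the service is stopped, which is why the two measures complement each other.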

Comment 3 Petr Balogh 2017-09-20 11:08:49 UTC
Hi,

Can we somehow apply this change in the kickstart template in Foreman, or should we wait for a new build where it will be fixed?

Thanks, Petr

Comment 4 Yuval Turgeman 2017-09-24 09:16:11 UTC
(In reply to Petr Balogh from comment #3)
> Hi,
> 
> Can we somehow apply this change in the kickstart template in Foreman, or
> should we wait for a new build where it will be fixed?
> 
> Thanks, Petr

I think it's very rare. The fix is not in the kickstart but in the system itself;
check out the patch, it should solve this issue.

Comment 5 Qin Yuan 2017-10-11 08:59:32 UTC
I tried to reproduce this issue:

1. Installed RHVH-4.1-20170914.1-RHVH-x86_64-dvd1.iso on a Dell PowerEdge R730 many times.
2. Cold booted the Dell PowerEdge R730 installed with RHVH-4.1-20170914.1 many times.

The issue didn't occur.

Nelly, could you help verify this bug?

Comment 6 Sandro Bonazzola 2017-10-11 11:12:33 UTC
Should be in oVirt 4.1.7 RC3

Comment 7 Tareq Alayan 2017-10-16 14:21:47 UTC
It is stuck again on a host when trying to install rhvh-4.1-0.20171012.0.
I tried installing on 3 hosts, and 2 out of 3 installed successfully.

Comment 8 Red Hat Bugzilla Rules Engine 2017-10-16 14:22:41 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 12 Tareq Alayan 2017-10-17 08:24:39 UTC
Created attachment 1339618
journalctl.dump

journalctl dump attached

Comment 13 Tareq Alayan 2017-10-17 08:25:56 UTC
Ryan, I attached the journalctl output.

Comment 14 Yaniv Kaul 2017-10-19 10:13:04 UTC
Reducing severity as it doesn't reproduce on many hosts.

Comment 15 Sandro Bonazzola 2017-10-20 06:08:51 UTC
Re-targeting to 4.1.8, as this is not a 4.1.7 blocker.

Comment 16 Yuval Turgeman 2017-12-07 11:10:55 UTC
I think this one is fixed, can we close it?

Comment 17 Nelly Credi 2017-12-07 11:16:18 UTC
I believe so; we haven't seen it for a while now.

Comment 18 Nelly Credi 2017-12-28 09:41:04 UTC
Yuval can probably add more details,
but it looks like only SIGTERM was handled, while in PM tests a different signal is sent (SIGKILL?), causing the issue to reproduce in these tests.
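
SIGKILL cannot be caught or ignored, so signal handling alone cannot protect the copy in that case. The later patches linked to this bug are titled "Use temp files when copying kernel and initrd". The following is a minimal sketch of that general technique, not the actual imgbased code: copy to a temporary file in the destination directory, then rename it over the final name. A rename within one file system replaces the target atomically, so an interrupted copy leaves the previous file intact instead of a truncated one. The function name atomic_copy and the file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of imgbased): copy SRC to DST through a
# temporary file created next to DST, then rename it into place.
atomic_copy() {
    src=$1
    dst=$2
    # Create the temp file in the destination directory so the final
    # rename stays on the same file system.
    tmp=$(mktemp "$(dirname "$dst")/.tmp.XXXXXX") || return 1
    # If the copy fails, drop the partial temp file; DST is left untouched.
    if ! cp -- "$src" "$tmp"; then
        rm -f -- "$tmp"
        return 1
    fi
    sync                      # flush written data before the rename
    mv -f -- "$tmp" "$dst"    # same-file-system rename replaces DST in one step
}

# Small self-contained demonstration in a throwaway directory.
workdir=$(mktemp -d)
printf 'new initrd\n' > "$workdir/src.img"
printf 'old initrd\n' > "$workdir/dst.img"

atomic_copy "$workdir/src.img" "$workdir/dst.img"

copied=$(cat "$workdir/dst.img")
# Count directory entries to confirm no temp file was left behind.
leftovers=$(ls -A "$workdir" | wc -l | tr -d ' ')
echo "content: $copied, entries: $leftovers"

rm -rf -- "$workdir"
```

If the process is killed before the rename, only a stray temporary file remains, and the boot files under the final names are still the complete older copies.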

Comment 19 Moran Goldboim 2018-01-02 10:39:26 UTC
(In reply to Nelly Credi from comment #18)
> Yuval can probably add more details, but it looks like only SIGTERM was
> handled, while in PM tests a different signal is sent (SIGKILL?), causing
> the issue to reproduce in these tests.

Nelly, following the latest iterations on this bug, do you know how often it reproduces and on what percentage of the systems?

We would like to understand the current status here, thanks.

Comment 20 Nelly Credi 2018-01-02 13:24:58 UTC
At the moment it reproduces during some SLA PM test (AFAIK it happens every time).
Once it happens the host cannot recover,
so it causes more failures in other tests.

Comment 21 Qin Yuan 2018-01-30 08:40:21 UTC
Hi Tareq,

Can you help verify this bug? We cannot reproduce it on our machines.

The latest 4.2 ISO containing the new patch is RHVH-4.2-20180128.0-RHVH-x86_64-dvd1.iso. (The 4.1 ISO RHVH-4.1-20180128.0-RHVH-x86_64-dvd1.iso also contains the new patch.)

Comment 25 Tareq Alayan 2018-02-04 14:03:43 UTC
It didn't reproduce with RHVH-4.2-20180203.0-RHVH-x86_64-dvd1.iso.

Comment 26 cshao 2018-02-05 02:42:25 UTC
Verified this bug according to comment 25.

Comment 27 Sandro Bonazzola 2018-02-12 11:50:08 UTC
This bugzilla is included in oVirt 4.2.1 release, published on Feb 12th 2018.

Since the problem described in this bug report should be resolved in oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

