Bug 1493176
| Summary: | RHVH stuck on startup after 'probing EDD... ok' step | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [oVirt] ovirt-node | Reporter: | Nelly Credi <ncredi> | ||||
| Component: | General | Assignee: | Yuval Turgeman <yturgema> | ||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Qin Yuan <qiyuan> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | urgent | ||||||
| Version: | 4.1 | CC: | bugs, cshao, lveyde, mgoldboi, ncredi, pbalogh, rbarry, talayan | ||||
| Target Milestone: | ovirt-4.2.1 | Keywords: | AutomationBlocker | ||||
| Target Release: | 4.2 | Flags: | talayan: needinfo-, rule-engine: ovirt-4.2+, cshao: testing_ack+ | ||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | imgbased-1.0.6 | Doc Type: | Bug Fix | ||||
| Doc Text: |
Cause:
On shutdown, while imgbase-copy-bootfiles.service copies the kernel and initrd from /boot to /boot/ovirt-<version>, systemd sends the TERM signal to the service and all its subprocesses.
Consequence:
/boot/ovirt-<version> holds a corrupted initrd (or sometimes kernel), and the system fails to boot.
Fix:
Set the service's TimeoutStopSec to infinity and KillMode to none, and have the script itself catch and ignore SIGTERM.
Result:
systemd won't kill the processes and lets the copy finish successfully.
|
Story Points: | --- | ||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2018-02-12 11:50:08 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | Node | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
|
||||||
|
Description
Nelly Credi
2017-09-19 14:09:47 UTC
Yuval got some results this morning, though I'll let him chime in. It seems that systemd may be killing the script. I wonder if we can simply set TimeoutSec=30.

imgbase-copy-bootfiles copies the kernel and initrd on shutdown from /boot to /boot/rhvh..., and while this copy is being done systemd kills the unit's processes (cp), leaving a partial initrd (or kernel) file under /boot/rhvh, and making the system unbootable. Something like the following:

# ls -l /boot/initramfs-3.10.0-693.2.2.el7.x86_64.img /boot/rhvh-4.1-0.20170914.0+1/initramfs-3.10.0-693.2.2.el7.x86_64.img
-rw-------. 1 root root 59685039 Sep 19 16:32 /boot/initramfs-3.10.0-693.2.2.el7.x86_64.img
-rw-------. 1 root root 59685039 Sep 19 16:50 /boot/rhvh-4.1-0.20170914.0+1/initramfs-3.10.0-693.2.2.el7.x86_64.img

# /usr/sbin/imgbase-copy-bootfiles shutdown & while [ 1 ]; do killall -9 cp; done 2>/dev/null
<ctrl-c>

# ls -l /boot/initramfs-3.10.0-693.2.2.el7.x86_64.img /boot/rhvh-4.1-0.20170914.0+1/initramfs-3.10.0-693.2.2.el7.x86_64.img
-rw-------. 1 root root 59685039 Sep 19 16:32 /boot/initramfs-3.10.0-693.2.2.el7.x86_64.img
-rw-------. 1 root root 0 Sep 19 16:54 /boot/rhvh-4.1-0.20170914.0+1/initramfs-3.10.0-693.2.2.el7.x86_64.img

Adding KillMode=none to the imgbase-copy-bootfiles service unit seems to solve this.

Hi,

can we somehow use/change this in kickstart template in foreman? Or we should wait for new build where it will be fixed?

Thanks, Petr

(In reply to Petr Balogh from comment #3)
> Hi,
>
> can we somehow use/change this in kickstart template in foreman? Or we
> should wait for new build where it will be fixed?
>
> Thanks, Petr

I think it's very rare, it's not a fix in the kickstart but in the system itself - check out the patch, it should solve this issue.

I tried to reproduce this issue:
1. Install RHVH-4.1-20170914.1-RHVH-x86_64-dvd1.iso on Dell PowerEdge R730 for many times.
2. Cold boot Dell PowerEdge R730 installed with RHVH-4.1-20170914.1 for many times.
The issue didn't occur.
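The before/after `ls -l` comparison in the description can be scripted. The following is only a sketch under stated assumptions: `boot_copy_matches` is a hypothetical helper name, it is not part of imgbased, and the demo runs on temporary stand-in files rather than the real /boot. It treats a copy as good only if it is non-empty and byte-identical to its source.

```shell
#!/bin/sh
# Hypothetical helper (not part of imgbased): succeeds only if the copy
# is non-empty and byte-identical to the source file.
boot_copy_matches() {
    src=$1
    copy=$2
    [ -s "$copy" ] && cmp -s "$src" "$copy"
}

# Demo on temporary stand-in files, not on the real /boot.
tmpdir=$(mktemp -d)
printf 'fake initramfs payload\n' > "$tmpdir/initramfs.img"
cp "$tmpdir/initramfs.img" "$tmpdir/good.img"
: > "$tmpdir/truncated.img"   # zero-byte file, like the interrupted copy

if boot_copy_matches "$tmpdir/initramfs.img" "$tmpdir/good.img"; then
    good_result=ok
else
    good_result=corrupt
fi

if boot_copy_matches "$tmpdir/initramfs.img" "$tmpdir/truncated.img"; then
    bad_result=ok
else
    bad_result=corrupt
fi

echo "complete copy:  $good_result"
echo "truncated copy: $bad_result"
rm -rf "$tmpdir"
```

On an affected host the same check could be pointed at the two initramfs paths shown above. A size-only comparison would already catch the zero-byte case; `cmp` also catches a partial write that happens to be non-empty.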
Nelly, could you help to verify this bug? Should be in oVirt 4.1.7 RC3.

It is stuck again on some host when trying to install rhvh-4.1-0.20171012.0. I did try to install on 3 hosts and 2 out of 3 successfully installed.

Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Created attachment 1339618
journalctl.dump

journalctl dump attached

Ryan, I attached the journalctl output.

Reducing severity as it doesn't reproduce on many hosts.

Re-targeting to 4.1.8, not being a 4.1.7 blocker.

I think this one is fixed, can we close it?

I believe so, we haven't seen it for a while now.

Yuval can probably add more details, but it looks like only SIGTERM was handled, while in PM tests a different signal is sent (SIGKILL?), causing the issue to reproduce in these tests.

(In reply to Nelly Credi from comment #18)
> yuval can probably add more details,
> but looks like only SIGTERM was handled, while in PM tests a different
> signal is sent (SIGKILL?) causing the issue to reproduce in these tests

Nelly, following latest iterations around this bug, do you know how often it reproduces and on what percentage of the systems? We would like to understand the current status here, thanks.

At the moment it reproduces during some SLA PM test (afaik it happens every time). Once it happens the host cannot recover, so it is causing more failures in other tests.

Hi Tareq,
Can you help to verify this bug, as we cannot reproduce it with our machines? The latest 4.2 iso containing the new patch is RHVH-4.2-20180128.0-RHVH-x86_64-dvd1.iso. (4.1 iso RHVH-4.1-20180128.0-RHVH-x86_64-dvd1.iso also contains the new patch)

It didn't reproduce with RHVH-4.2-20180203.0-RHVH-x86_64-dvd1.iso.

Verify this bug according to #c25.

This bugzilla is included in oVirt 4.2.1 release, published on Feb 12th 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report. |
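The SIGTERM handling discussed in this thread can be illustrated with a small shell sketch. This is not the imgbased script; it assumes a `sleep` as a stand-in for the slow `cp`, and the file names under the temporary directory are made up for the demo. A child shell ignores SIGTERM with `trap '' TERM`, the parent sends SIGTERM while the stand-in copy is still running, and the child still finishes.

```shell
#!/bin/sh
# Sketch only: a child shell ignores SIGTERM while a stand-in "copy" runs.
tmpdir=$(mktemp -d)
out="$tmpdir/copy-result"    # hypothetical marker written when the work completes
ready="$tmpdir/ready"        # hypothetical marker: trap is installed

(
    trap '' TERM             # ignore SIGTERM from here on (inherited by sleep)
    : > "$ready"             # tell the parent the trap is in place
    sleep 2                  # stand-in for the slow cp of kernel/initrd
    echo finished > "$out"
) &
child=$!

# Wait until the child has installed its trap, then send SIGTERM.
until [ -e "$ready" ]; do sleep 1; done
kill -TERM "$child" 2>/dev/null
wait "$child"

# SIGKILL, unlike SIGTERM, cannot be caught or ignored by any process,
# so a trap like the one above gives no protection against kill -9.
result=$(cat "$out")
echo "copy result: $result"
rm -rf "$tmpdir"
```

The closing comment in the sketch is the relevant limitation for the SIGKILL question raised above: ignoring SIGTERM in the script only helps when SIGTERM is the signal actually sent.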