Bug 1931443 - After reboot a RHV-H host fails to boot displaying error: ../../grub-core/loader/i386/pc/linux.c:170:invalid magic number
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: imgbased
Version: 4.4.3
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ovirt-4.4.6
Target Release: 4.4.6
Assignee: Asaf Rachmani
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-02-22 12:47 UTC by Sam Wachira
Modified: 2024-03-25 18:12 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-03 10:24:29 UTC
oVirt Team: Node
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 5829141 0 None None None 2021-02-24 16:49:58 UTC
Red Hat Product Errata RHSA-2021:2239 0 None None None 2021-06-03 10:25:14 UTC
oVirt gerrit 113306 0 None MERGED bootsetup: copy kernel to boot partition. 2021-04-22 07:29:55 UTC

Internal Links: 2139408

Description Sam Wachira 2021-02-22 12:47:12 UTC
Description of problem:
After a reboot, a RHV-H host fails to boot past the GRUB2 screen, displaying the following errors:

error: ../../grub-core/loader/i386/pc/linux.c:170:invalid magic number
error: ../../grub-core/loader/i386/pc/linux.c:1418:you need to load the kernel first.

This issue sometimes occurs when a host is placed in maintenance and then rebooted.
It also happens sometimes after upgrading a RHV-H host.
It does not affect all hosts in a cluster, nor does it affect the same host every time.

This issue has occurred unexpectedly in the customer environment multiple times (see linked cases) and once in my RHV 4.4.3 lab.

We still don't know the cause, but it appears that a task/process is truncating the initramfs and vmlinuz files in /boot and /boot/rhvh--4.4.x.x-x.YYYYMMDD.0+1 to 0 bytes.
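A quick way to check for this symptom from a rescue shell (a sketch only; `check_truncated` is a hypothetical helper name, and the filename patterns match the files listed in this report):

```shell
# List any 0-byte vmlinuz/initramfs files under a boot directory,
# which is the symptom described above. The directory argument
# defaults to /boot; -maxdepth 2 also covers the per-layer
# /boot/rhvh-* subdirectories.
check_truncated() {
  local boot_dir="${1:-/boot}"
  find "$boot_dir" -maxdepth 2 \
       \( -name 'vmlinuz-*' -o -name 'initramfs-*.img' \) \
       -size 0 -print
}
```

Any path this prints is a truncated boot file; no output means the kernel and initramfs files are non-empty.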


Version-Release number of selected component (if applicable):
RHV-H 4.4.3
RHV-H 4.3.7
RHV-H 4.3.6

How reproducible:
100% (with the manual truncation steps below; the customer-environment steps reproduce only intermittently)

Steps to Reproduce (only reproduces sometimes in customer environments):
1. Place host in maintenance.
2. Reboot host.
or
1. Place host in maintenance.
2. Upgrade RHV-H to a newer version.
3. Reboot the host.

Steps to Reproduce (reproduces reliably):
1. Place host in maintenance.
2. Truncate the initramfs and vmlinuz files in the /boot directory to 0 bytes:
[root@rhvh1 ~]# > /boot/initramfs-4.18.0-240.1.1.el8_3.x86_64.img
[root@rhvh1 ~]# > /boot/vmlinuz-4.18.0-240.1.1.el8_3.x86_64
[root@rhvh1 ~]# > /boot/rhvh-4.4.3.1-0.20201116.0+1/initramfs-4.18.0-240.1.1.el8_3.x86_64.img
[root@rhvh1 ~]# > /boot/rhvh-4.4.3.1-0.20201116.0+1/vmlinuz-4.18.0-240.1.1.el8_3.x86_64
3. Reboot host


Actual results:

After GRUB screen, the following errors are displayed.
error: ../../grub-core/loader/i386/pc/linux.c:170:invalid magic number.
error: ../../grub-core/loader/i386/pc/linux.c:1418:you need to load the kernel first.
Press any key to continue…

Host is unable to find the kernel to boot.
Pressing any key reboots the host, and the same errors are displayed again.


Expected results:
RHV-H host should boot.


Additional info:

1. After booting a rescue ISO and inspecting /boot, the initramfs and vmlinuz files in /boot/rhvh-4.4.3.1-0.20201116.0+1/ turn out to be 0 bytes in size.

[root@rhvh1 ~]# ls -l /boot/rhvh-4.4.3.1-0.20201116.0+1/
total 4.1M
drwxr-xr-x. 2 root root 4.0K Feb 3 15:43 .
dr-xr-xr-x. 7 root root 4.0K Jan 6 16:24 ..
-rw-r--r--. 1 root root 186K Oct 16 19:52 config-4.18.0-240.1.1.el8_3.x86_64
-rw-------. 1 root root    0 Jan 6 16:04 initramfs-4.18.0-240.1.1.el8_3.x86_64.img
-rw-------. 1 root root 3.9M Oct 16 17:52 System.map-4.18.0-240.1.1.el8_3.x86_64
-rwxr-xr-x. 1 root root    0 Oct 16 17:52 vmlinuz-4.18.0-240.1.1.el8_3.x86_64
-rw-r--r--. 1 root root  172 Oct 16 19:51 .vmlinuz-4.18.0-240.1.1.el8_3.x86_64.hmac

Recovering from this issue requires copying the initramfs and vmlinuz files from /boot to /boot/rhvh-4.4.3.1-0.20201116.0+1.

However, in some cases the initramfs and vmlinuz files in /boot are also truncated to 0 bytes, so the files need to be recovered from a backup.
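That recovery step can be sketched as follows (illustrative only; `recover_boot_files` is a hypothetical helper name, and KVER and LAYER are the kernel version and layer directory from this report — adjust them for the affected host):

```shell
KVER="4.18.0-240.1.1.el8_3.x86_64"
LAYER="/boot/rhvh-4.4.3.1-0.20201116.0+1"

# Copy the kernel and initramfs from /boot into the layer directory,
# but only when the source files are non-empty: a 0-byte source means
# the top-level copies are truncated too and must come from a backup.
recover_boot_files() {
  local kver="$1" layer="$2" boot="${3:-/boot}" f
  for f in "vmlinuz-$kver" "initramfs-$kver.img"; do
    if [ -s "$boot/$f" ]; then
      cp -p "$boot/$f" "$layer/$f"
    else
      echo "WARNING: $boot/$f is empty; restore it from a backup first" >&2
      return 1
    fi
  done
}

# Then, from a rescue shell with /boot mounted:
#   recover_boot_files "$KVER" "$LAYER"
```

The `-s` test is what distinguishes the recoverable case (healthy files in /boot) from the one that needs a backup.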

Comment 2 peyu 2021-02-23 06:15:33 UTC
Hi Sam, 
I noticed that you reproduced this issue once in the RHHI environment. Is the customer's environment also RHHI?

Comment 3 Sam Wachira 2021-02-23 10:54:43 UTC
Hi Pengshan,
Both environments are running RHHI-V.

In my environment, I encountered this issue by accident after rebooting a RHV-H host from RHV-M UI.
So far we have not been able to identify the cause, so it is difficult to reproduce exactly.

Comment 4 peyu 2021-02-24 02:12:11 UTC
Hi Sandro,
According to Comment 3, could you move this bug to the RHHI team?

Comment 6 Marina Kalinin 2021-02-24 18:49:22 UTC
(In reply to peyu from comment #4)
> Hi Sandro,
> According to Comment 3, could you move this bug to the RHHI team?

I am not sure that is the right move. At least on the Eng side, it feels to me it should be investigated by Node Engineering first. But it can be moved to RHHI-V QE. My 5 cents.

Comment 7 Sandro Bonazzola 2021-03-02 08:38:42 UTC
Since this is not easily reproducible and seems to happen only on a few occasions, reducing the priority and re-targeting to 4.4.6.

Comment 9 Nikolai Sednev 2021-04-22 07:22:58 UTC
Looks like we have a documented workaround here: https://access.redhat.com/solutions/5829141

Comment 10 Sandro Bonazzola 2021-04-22 07:29:59 UTC
This shouldn't happen anymore even in 4.4.5 thanks to https://gerrit.ovirt.org/113306
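For illustration only: going by the subject line of that change ("bootsetup: copy kernel to boot partition"), the idea might look roughly like the sketch below. The function name and layout are hypothetical, not the actual imgbased code — see the gerrit link for the real patch.

```shell
# Hypothetical sketch of the fix's idea: during boot setup, copy the
# kernel and initramfs from the layer's root filesystem into the boot
# partition, skipping any empty (truncated) files, so GRUB always has
# a usable kernel to load.
copy_kernel_to_boot() {
  local lv_root="$1" boot="$2" f
  for f in "$lv_root"/boot/vmlinuz-* "$lv_root"/boot/initramfs-*.img; do
    if [ -s "$f" ]; then
      cp -p "$f" "$boot/"
    fi
  done
}
```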

Comment 13 SATHEESARAN 2021-05-05 04:46:52 UTC
Verified with RHV 4.4.6 (RHVH-4.4-20210426.1-RHVH-x86_64-dvd1.iso) with the following steps:

Created the hyperconverged setup with 3 nodes.
Rebooted one of the nodes repeatedly in a loop, 200 times, with an interval of 10 minutes between reboots.
Every time, the RHVH node booted up properly; no issue was reported.

Comment 25 errata-xmlrpc 2021-06-03 10:24:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Virtualization Host security update [ovirt-4.4.6]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2239

