1574029 – CFME image for EC2 is not booting when using newer instance types(c5): dracut-initqueue: Warning: Could not boot.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Appliance
Sub Component:
Version:	5.9.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	5.10.0
Assignee:	Satoe Imaishi
QA Contact:	Matouš Mojžíš
Docs Contact:
URL:
Whiteboard:	ec2:appliance
Depends On:
Blocks:

Reported:	2018-05-02 17:16 UTC by Matouš Mojžíš
Modified:	2019-02-07 23:02 UTC
CC List:	8 users
Fixed In Version:	5.10.0.14
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-02-07 23:01:55 UTC
Category:	---
Cloudforms Team:	AWS
Target Upstream Version:
Embargoed:

Attachments
broken booting (172.98 KB, image/png) 2018-05-02 17:16 UTC, Matouš Mojžíš

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2019:0212	0	None	None	None	2019-02-07 23:02:05 UTC

Description Matouš Mojžíš 2018-05-02 17:16:44 UTC

Created attachment 1430229
broken booting

Description of problem:


Version-Release number of selected component (if applicable):
5.9.2.4

How reproducible:
Always

Steps to Reproduce:
1. Deployed an instance from the 5.9.2.4 CFME image.
2. The instance was stuck at boot.

Actual results:


Expected results:


Additional info:

Comment 2 Joe Rafaniello 2018-05-02 18:27:45 UTC

According to https://access.redhat.com/solutions/2515741

This happens if it can't find a logical volume:  "A logical volume cannot be found on the system"

Matous:
1) does this happen every time you reboot?
2) does it print more if you wait?  Specifically, something like this after the "Could not boot"
[  200.161146]  dracut-initqueue[779]: Warning: /dev/vgroot/root does not exist
[  200.170895]  systemd[1]: Starting Dracut Emergency Shell...

3) Have you tried redeploying in ec2 again (maybe there was a problem importing into ec2?)
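
If the full console log can be retrieved, the missing device can be extracted from it rather than read off a screenshot. A minimal sketch, assuming the AWS CLI is configured and that `aws ec2 get-console-output` returns the boot log for the instance; `missing_devices` and `INSTANCE_ID` are hypothetical names used only for illustration:

```shell
#!/bin/sh
# Read a console log on stdin and print each device that dracut-initqueue
# reported as missing ("Warning: <device> does not exist"), one per line.
missing_devices() {
    sed -n 's/.*dracut-initqueue\[[0-9]*]: Warning: \(.*\) does not exist.*/\1/p' | sort -u
}

# Usage (not executed here; INSTANCE_ID is a placeholder):
#   aws ec2 get-console-output --instance-id "$INSTANCE_ID" --output text | missing_devices
```

An empty result would suggest the boot did not reach the point where dracut names the missing device, or that the log was truncated.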

Comment 3 Joe Rafaniello 2018-05-02 18:28:53 UTC

Note, I don't believe we changed our kickstarts for generating the logical volumes in a long time.

Does this happen with all 5.9 versions?
Does it happen with 5.8?

Comment 4 Matouš Mojžíš 2018-05-03 13:52:43 UTC

Joe,

1) Yes
2) It does, but I don't have access to the logs in EC2 apart from the latest screenshot of the instance logs.
3) Yes, I tried it in a different region. Same issue.

I test deployment of most versions, and the latest ones I tested, 5.8.4.2 and 5.9.2.2, work correctly.

Comment 7 Joe Rafaniello 2018-05-03 14:37:52 UTC

Matous, are the successful deployments to the same environment? What about the failures?  Is there anything common when it works vs when it doesn't?  Perhaps there's an environmental problem causing the inability to find the logical volumes.

Comment 8 Matouš Mojžíš 2018-05-03 15:02:06 UTC

Joe,

The successful deployments are to the same environment.
The only thing the failures have in common is that they deploy 5.9.2.4. Other versions work correctly.

Comment 9 Joe Rafaniello 2018-05-03 15:12:32 UTC

Matous, how many 5.9.2.4 deployments did you do and how many failed in this way?

Did any other version fail in this way in the same environment?

Comment 10 Matouš Mojžíš 2018-05-03 15:14:45 UTC

Joe,

I tried to deploy 5.9.2.4 four times. It failed every time. I also tried reimporting the image to EC2. It still didn't work.

All other versions work correctly in the same environment.
I test the image with exactly the same steps for every version.

Comment 11 Joe Rafaniello 2018-05-03 15:52:49 UTC

Matous, can you try the steps here:  https://access.redhat.com/solutions/2515741 ?

Note, there are comments at the bottom with some useful information.

In a private comment, can you share the IP/credentials of a 5.9.2.2 appliance?  I'd like to compare its rpm package list with the one Satoe produced from 5.9.2.4.  Maybe the kernel or something in the OS is different?

Comment 13 Joe Rafaniello 2018-05-03 17:41:15 UTC

The differences in rpms from 5.9.2.3 (working for Matous) to 5.9.2.4 (working for Satoe, not working for Matous):

diff --git a/Users/joerafaniello/Downloads/5923_rpm.txt b/Users/joerafaniello/Downloads/5924_rpm.txt
index b3890db62..36eed519f 100644
--- a/Users/joerafaniello/Downloads/5923_rpm.txt
+++ b/Users/joerafaniello/Downloads/5924_rpm.txt
@@ -1,10 +1,10 @@
 acl-2.2.51-14.el7.x86_64
 adcli-0.8.1-4.el7.x86_64
-ansible-2.4.3.0-1.el7ae.noarch
-ansible-tower-server-3.2.3-1.el7at.x86_64
-ansible-tower-setup-3.2.3-1.el7at.x86_64
-ansible-tower-venv-ansible-3.2.3-1.el7at.x86_64
-ansible-tower-venv-tower-3.2.3-1.el7at.x86_64
+ansible-2.4.4.0-1.el7ae.noarch
+ansible-tower-server-3.2.4-1.el7at.x86_64
+ansible-tower-setup-3.2.4-1.el7at.x86_64
+ansible-tower-venv-ansible-3.2.4-1.el7at.x86_64
+ansible-tower-venv-tower-3.2.4-1.el7at.x86_64
 apr-1.4.8-3.el7_4.1.x86_64
 apr-util-1.5.2-6.el7.x86_64
 at-3.1.13-23.el7.x86_64
@@ -32,11 +32,11 @@ bzip2-libs-1.0.6-13.el7.x86_64
 ca-certificates-2017.2.20-71.el7.noarch
 c-ares-1.10.0-3.el7.x86_64
 certmonger-0.78.4-3.el7.x86_64
-cfme-5.9.2.3-1.el7cf.x86_64
-cfme-appliance-5.9.2.3-1.el7cf.x86_64
-cfme-appliance-common-5.9.2.3-1.el7cf.x86_64
-cfme-appliance-tools-5.9.2.3-1.el7cf.x86_64
-cfme-gemset-5.9.2.3-1.el7cf.x86_64
+cfme-5.9.2.4-1.el7cf.x86_64
+cfme-appliance-5.9.2.4-1.el7cf.x86_64
+cfme-appliance-common-5.9.2.4-1.el7cf.x86_64
+cfme-appliance-tools-5.9.2.4-1.el7cf.x86_64
+cfme-gemset-5.9.2.4-1.el7cf.x86_64
 checkpolicy-2.5-6.el7.x86_64
 chkconfig-1.7.4-1.el7.x86_64
 chrony-3.2-2.el7.x86_64

Comment 14 Matouš Mojžíš 2018-05-04 15:20:40 UTC

I have found what's wrong.
The 5.9.2.2 image doesn't allow me to use c5.xlarge, so I used a c4.xlarge instance for it.
The 5.9.2.4 image allows me to use a c5.xlarge instance, and the image is broken for this instance type.
When I run the 5.9.2.4 image on c4.xlarge, it works correctly.
The difference between c4 and c5 is newer processors and added ENA support.
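
One possible reason an image is refused on c5 is that the AMI is not flagged for ENA, which C5 instance types require. That is an assumption about the 5.9.2.2 image, not something confirmed in this report. A minimal sketch for checking the flag, assuming a configured AWS CLI; `ami_supports_ena` and the AMI ID are hypothetical, and the comparison against `True` assumes the CLI's text output format for booleans:

```shell
#!/bin/sh
# Succeed (exit status 0) if the given AMI reports EnaSupport as true.
ami_supports_ena() {
    ami_id=$1
    flag=$(aws ec2 describe-images --image-ids "$ami_id" \
        --query 'Images[].EnaSupport' --output text)
    [ "$flag" = "True" ]
}

# Usage (not executed here; the AMI ID is a placeholder):
#   if ami_supports_ena ami-0123456789abcdef0; then echo "ENA flagged"; fi
```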

Comment 16 Joe Rafaniello 2018-05-04 15:52:36 UTC

Wow, nice find Matous.

Related:  https://bugs.centos.org/view.php?id=14107

From above link (credit to hhsnow user):

  If you're rolling your own CentOS 7 AMI and are encountering issues booting it on the c5's (but works with other instance families), try running these two before taking the AMI:

  yum install dracut-config-generic
  dracut -f

  From AWS support:
  In order to have the correct initramfs to be created with your script I had to install a package named "dracut-config-generic". This package provides the configuration to turn off the host specific initramfs generation with dracut and generates a generic image by default. This package includes a single configuration file for dracut:

  | $ cat /usr/lib/dracut/dracut.conf.d/02-generic-image.conf
  | hostonly="no"

  After I installed this package and rebuilt the initramfs with "dracut -f" it started to boot normally on C5s. You can note the size difference between the initramfs files:

  * Before the installation of "dracut-config-generic":
  -rw------- 1 root root 19719505 Nov 13 16:28 initramfs-3.10.0-693.5.2.el7.x86_64.img

  * After the installation of "dracut-config-generic":
  -rw------- 1 root root 46942834 Nov 15 01:01 initramfs-3.10.0-693.5.2.el7.x86_64.img
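
The size difference suggests the generic initramfs carries drivers that the host-only one left out. A minimal sketch for confirming that, assuming `lsinitrd` (shipped with dracut) is available on the appliance; `missing_drivers` is a hypothetical helper, and the module names `nvme` and `ena` are assumptions based on C5 instances exposing EBS volumes as NVMe devices and using ENA networking:

```shell
#!/bin/sh
# Read a file listing on stdin (for example the output of lsinitrd) and print
# the names of the requested kernel modules that do not appear in it.
missing_drivers() {
    listing=$(cat)
    missing=""
    for mod in "$@"; do
        # Module files are named <mod>.ko, optionally compressed (e.g. .ko.xz).
        if ! printf '%s\n' "$listing" | grep -Eq "/${mod}\.ko(\.[a-z0-9]+)?\$"; then
            missing="$missing $mod"
        fi
    done
    # Space-separated names without a leading space; empty if none are missing.
    printf '%s\n' "${missing# }"
}

# Usage on a running appliance (not executed here):
#   lsinitrd "/boot/initramfs-$(uname -r).img" | missing_drivers nvme ena
```

Running it before and after `dracut -f` would show whether the rebuild actually added the modules.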


Also, for hhsnow user:

  Is ENA enabled for your instance? Via command line: 
  aws ec2 describe-instances --instance-ids {instance_id} --query 'Reservations[].Instances[].EnaSupport'

  If not (stop instance and):
  aws ec2 modify-instance-attribute --instance-id {instance_id} --ena-support

  To make the AMI automatically support ENA, you need to enable ENA on the instance before you create the image.
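
The two commands quoted above can be combined into one check-then-enable step. A minimal sketch, assuming a configured AWS CLI and an instance that is already stopped; `ensure_ena` is a hypothetical wrapper, and the comparison against `True` assumes the CLI's text output format for booleans:

```shell
#!/bin/sh
# ensure_ena INSTANCE_ID
# Prints "already-enabled" if the instance reports EnaSupport, otherwise
# sets the attribute and prints "enabled". The instance must be stopped
# before the attribute can be modified.
ensure_ena() {
    instance_id=$1
    state=$(aws ec2 describe-instances --instance-ids "$instance_id" \
        --query 'Reservations[].Instances[].EnaSupport' --output text)
    if [ "$state" = "True" ]; then
        echo "already-enabled"
    else
        aws ec2 modify-instance-attribute --instance-id "$instance_id" --ena-support \
            && echo "enabled"
    fi
}

# Usage (not executed here; the instance ID is a placeholder):
#   ensure_ena i-0123456789abcdef0
```

Setting the flag only tells EC2 that the guest can use ENA; the image still needs the driver in its initramfs to boot.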

Comment 17 Joe Rafaniello 2018-07-23 13:54:34 UTC

Hi Matouš,

I haven't heard back on this bug.  Is this still an issue?  Did my links help?  This is still marked urgent; can it be lowered or closed?

Comment 18 Joe Rafaniello 2018-07-31 16:57:51 UTC

Closing for now. Please reopen if this is still a problem and you find out what we're doing wrong for this platform.  I'm not sure what we need to do to fix this.

Comment 19 Matouš Mojžíš 2018-08-01 15:09:59 UTC

Hi Joe,

It took me so long because I had some issues installing and running the ENA drivers for this image.

I was able to install the ENA drivers on a CFME 5.10 image by following this tutorial:
https://github.com/amzn/amzn-drivers/tree/master/kernel/linux/ena

After installing the drivers, I also ran:
  yum install dracut-config-generic
  dracut -f

After that, c5 instance types started working with this image.
I think we should distribute the CFME image for Amazon with these ENA drivers.
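
To confirm that the package actually switched dracut away from host-only mode before rebuilding, the config snippets can be inspected. A minimal sketch; `effective_hostonly` is a hypothetical helper, and it simplifies dracut's real merge rules by reading the given directories in argument order and letting later files (sorted by name) override earlier ones:

```shell
#!/bin/sh
# Print the last hostonly= value found in the *.conf files of the given
# directories, or "unset" if no file sets it.
effective_hostonly() {
    value=""
    for dir in "$@"; do
        for f in "$dir"/*.conf; do
            [ -f "$f" ] || continue
            # Take the last hostonly= assignment in the file and strip quotes.
            v=$(sed -n 's/^[[:space:]]*hostonly=//p' "$f" | tail -n 1 | tr -d "\"'")
            [ -n "$v" ] && value=$v
        done
    done
    printf '%s\n' "${value:-unset}"
}

# Usage on an appliance (not executed here):
#   effective_hostonly /usr/lib/dracut/dracut.conf.d /etc/dracut.conf.d
```

A result of `no` after installing dracut-config-generic would match the 02-generic-image.conf content quoted earlier in this report.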

So setting back to ASSIGNED and lowering severity.

Comment 20 Joe Rafaniello 2018-08-06 14:10:58 UTC

Satoe,

Can you review comment #19?  I don't know if these amazon drivers are already packaged or if we would need to create rpms for them and if we can redistribute them.  I didn't look to see if there were additional build packages needed to do this.

If this is something you think you can work on, we can reassign.  If this needs prioritization, we can have others review the need for this enhancement/bug fix.

For now, it can stay in my queue until we know what we can and will do.

Thanks,
Joe

Comment 21 Satoe Imaishi 2018-08-17 21:44:31 UTC

The ENA driver should already be included in RHEL, so I'm not sure why it needs to be added. I'll have to do some research and testing to figure out exactly what's needed.

Comment 22 CFME Bot 2018-08-22 18:22:14 UTC

New commit detected on ManageIQ/manageiq-appliance-build/master:

https://github.com/ManageIQ/manageiq-appliance-build/commit/75895b3eedc09cb960b29326065e86f902403366
commit 75895b3eedc09cb960b29326065e86f902403366
Author:     Satoe Imaishi <simaishi>
AuthorDate: Tue Aug 21 15:20:55 2018 -0400
Commit:     Satoe Imaishi <simaishi>
CommitDate: Tue Aug 21 15:20:55 2018 -0400

    Add dracut-config-generic for EC2
    https://bugzilla.redhat.com/show_bug.cgi?id=1574029

 kickstarts/partials/packages/includes.ks.erb | 2 +
 1 file changed, 2 insertions(+)

Comment 24 Taylor Owen 2018-08-28 04:02:24 UTC

Just as a data point, I was also having this issue on the t3 instance types. It boots just fine on the t2 instances though.

Comment 25 Matouš Mojžíš 2018-09-17 13:51:44 UTC

Verified in 5.10.0.15. The CFME appliance boots in EC2 on newer instance types; I tried c5 and c5d.
These instance types also bring other changes, such as disk naming: https://bugzilla.redhat.com/show_bug.cgi?id=1629853

Comment 27 errata-xmlrpc 2019-02-07 23:01:55 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:0212
