Created attachment 1430229
broken booting

Description of problem:

Version-Release number of selected component (if applicable):
5.9.2.4

How reproducible:
Always

Steps to Reproduce:
1. Deployed an instance from the 5.9.2.4 CFME image.
2. The instance got stuck at boot.
3.

Actual results:

Expected results:

Additional info:
According to https://access.redhat.com/solutions/2515741

This happens if it can't find a logical volume: "A logical volume cannot be found on the system"

Matous:
1) Does this happen every time you reboot?
2) Does it print more if you wait? Specifically, something like this after the "Could not boot":
[ 200.161146] dracut-initqueue[779]: Warning: /dev/vgroot/root does not exist
[ 200.170895] systemd[1]: Starting Dracut Emergency Shell...
3) Have you tried redeploying in ec2 again (maybe there was a problem importing into ec2?)
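For anyone who only has the EC2 console output to go on, here is a minimal sketch for pulling the missing device paths out of a saved log. The function name `missing_devices` is made up for illustration, and the pattern assumes the dracut-initqueue warning format quoted above.

```shell
# Hypothetical helper: print each device path that dracut-initqueue
# reported as missing, one per line, reading the console log on stdin.
missing_devices() {
    sed -n 's|.*dracut-initqueue\[[0-9]*]: Warning: \(/dev/[^ ]*\) does not exist.*|\1|p' | sort -u
}

# Possible usage (the instance ID is a placeholder):
#   aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text | missing_devices
```

If this prints the root logical volume, the initramfs never saw the disk that holds it, which points at the initramfs contents rather than at the volume group layout.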
Note, I don't believe we have changed our kickstarts for generating the logical volumes in a long time. Does this happen with all 5.9 versions? Does it happen with 5.8?
Joe,
1) Yes.
2) It does, but I don't have access to logs in EC2 except the latest screenshot of the instance logs.
3) Yes, I tried it in a different region. Same issue.

I test deployment of most versions. The latest ones I tested, 5.8.4.2 and 5.9.2.2, are working correctly.
Matous, are the successful deployments to the same environment? What about the failures? Is there anything common when it works vs when it doesn't? Perhaps there's an environmental problem causing the inability to find the logical volumes.
Joe, the successful deployments are to the same environment. The only thing the failures have in common is the 5.9.2.4 image. Other versions work correctly.
Matous, how many 5.9.2.4 deployments did you do and how many failed in this way? Did any other version fail in this way in the same environment?
Joe, I tried to deploy 5.9.2.4 four times. It failed every time. I also tried to reimport the image to ec2. It still didn't work. All other versions work correctly in the same environment. I test the image with exactly the same steps for every version.
Matous, can you try the steps here: https://access.redhat.com/solutions/2515741 ? Note, there are comments at the bottom with some useful information. In a private comment, can you share the ip/credentials of a 5.9.2.2. I'd like to compare the rpm package list with one that Satoe did with 5.9.2.4. Maybe the kernel or something in the OS is different?
The differences in rpms from 5.9.2.3 (working for Matous) to 5.9.2.4 (working for Satoe, not working for Matous):

diff --git a/Users/joerafaniello/Downloads/5922_rpm.txt b/Users/joerafaniello/Downloads/5924_rpm.txt
index b3890db62..36eed519f 100644
--- a/Users/joerafaniello/Downloads/5923_rpm.txt
+++ b/Users/joerafaniello/Downloads/5924_rpm.txt
@@ -1,10 +1,10 @@
 acl-2.2.51-14.el7.x86_64
 adcli-0.8.1-4.el7.x86_64
-ansible-2.4.3.0-1.el7ae.noarch
-ansible-tower-server-3.2.3-1.el7at.x86_64
-ansible-tower-setup-3.2.3-1.el7at.x86_64
-ansible-tower-venv-ansible-3.2.3-1.el7at.x86_64
-ansible-tower-venv-tower-3.2.3-1.el7at.x86_64
+ansible-2.4.4.0-1.el7ae.noarch
+ansible-tower-server-3.2.4-1.el7at.x86_64
+ansible-tower-setup-3.2.4-1.el7at.x86_64
+ansible-tower-venv-ansible-3.2.4-1.el7at.x86_64
+ansible-tower-venv-tower-3.2.4-1.el7at.x86_64
 apr-1.4.8-3.el7_4.1.x86_64
 apr-util-1.5.2-6.el7.x86_64
 at-3.1.13-23.el7.x86_64
@@ -32,11 +32,11 @@ bzip2-libs-1.0.6-13.el7.x86_64
 ca-certificates-2017.2.20-71.el7.noarch
 c-ares-1.10.0-3.el7.x86_64
 certmonger-0.78.4-3.el7.x86_64
-cfme-5.9.2.3-1.el7cf.x86_64
-cfme-appliance-5.9.2.3-1.el7cf.x86_64
-cfme-appliance-common-5.9.2.3-1.el7cf.x86_64
-cfme-appliance-tools-5.9.2.3-1.el7cf.x86_64
-cfme-gemset-5.9.2.3-1.el7cf.x86_64
+cfme-5.9.2.4-1.el7cf.x86_64
+cfme-appliance-5.9.2.4-1.el7cf.x86_64
+cfme-appliance-common-5.9.2.4-1.el7cf.x86_64
+cfme-appliance-tools-5.9.2.4-1.el7cf.x86_64
+cfme-gemset-5.9.2.4-1.el7cf.x86_64
 checkpolicy-2.5-6.el7.x86_64
 chkconfig-1.7.4-1.el7.x86_64
 chrony-3.2-2.el7.x86_64
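A comparison like this can be reproduced roughly as follows. This is only a sketch: the helper name `changed_packages` and the file names are assumptions, and each listing is assumed to come from `rpm -qa | sort` run on the respective appliance.

```shell
# On each appliance, save a sorted package list first, for example:
#   rpm -qa | sort > 5923_rpm.txt

# Hypothetical helper: print only the package lines that differ between
# two sorted listings, dropping the diff headers and context lines.
changed_packages() {
    diff -u "$1" "$2" | grep -E '^[+-][^+-]'
}

# Possible usage:
#   changed_packages 5923_rpm.txt 5924_rpm.txt
```

Lines starting with `-` exist only in the first listing and lines starting with `+` only in the second, so a kernel or dracut change between builds would show up here if there were one.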
I have found what's wrong. The 5.9.2.2 image doesn't allow me to use c5.xlarge, so I used a c4.xlarge instance for it. The 5.9.2.4 image allows me to use a c5.xlarge instance, and the image is broken for this instance type. When I run the 5.9.2.4 image on c4.xlarge, it works correctly. The difference between c4 and c5 is newer processors and added ENA support.
Wow, nice find Matous.

Related: https://bugs.centos.org/view.php?id=14107

From above link (credit to hhsnow user):

If you're rolling your own CentOS 7 AMI and are encountering issues booting it on the c5's (but works with other instance families), try running these two before taking the AMI:

yum install dracut-config-generic
dracut -f

From AWS support:

In order to have the correct initramfs to be created with your script I had to install a package named "dracut-config-generic". This package provides the configuration to turn off the host specific initramfs generation with dracut and generates a generic image by default. This package includes a single configuration file for dracut:

| $ cat /usr/lib/dracut/dracut.conf.d/02-generic-image.conf
| hostonly="no"

After I installed this package and rebuild the initramfs with "dracut -f" it started to boot normally on C5s. You can note the size difference between the initramfs files:

* Before the installation of "dracut-config-generic":
-rw------- 1 root root 19719505 Nov 13 16:28 initramfs-3.10.0-693.5.2.el7.x86_64.img

* After the installation of "dracut-config-generic":
-rw------- 1 root root 46942834 Nov 15 01:01 initramfs-3.10.0-693.5.2.el7.x86_64.img

Also, for hhsnow user:

Is ENA enabled for your instance? Via command line:

aws ec2 describe-instances --instance-ids {instance_id} --query 'Reservations[].Instances[].EnaSupport'

If not (stop instance and):

aws ec2 modify-instance-attribute --instance-id {instance_id} --ena-support

To make the AMI automatically support ENA, you need to enable ENA on the instance before you create the image.
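One way to confirm that the rebuilt initramfs is actually generic is to look for specific drivers in it. A rough sketch follows; the helper name `initramfs_has_module` is invented here, and checking for `nvme` rests on the assumption that the missing piece is the NVMe driver, since C5 instances expose EBS volumes as NVMe devices.

```shell
# Hypothetical helper: succeed if the lsinitrd listing on stdin contains
# the named kernel module (plain, .xz or .gz compressed).
initramfs_has_module() {
    grep -Eq "/$1\.ko(\.xz|\.gz)?\$"
}

# Possible usage on the appliance, after running "dracut -f":
#   lsinitrd "/boot/initramfs-$(uname -r).img" | initramfs_has_module nvme && echo "nvme present"
#   lsinitrd "/boot/initramfs-$(uname -r).img" | initramfs_has_module ena  && echo "ena present"
```

A host-only initramfs built on a machine without NVMe disks would be expected to fail the first check, which would match both the smaller image size quoted above and the "does not exist" warning for the root volume.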
Hi Matouš, I haven't heard back on this bug. Is this still an issue? Did my links provide any help? This is still marked urgent. Can it be lowered or closed?
Closing for now, please reopen if this is still a problem and you find out what we're doing wrong for this platform. I'm not sure what actions we need to do to fix this.
Hi Joe,

I had some issues installing and running ENA drivers for this image, which is why it took me so long. I was able to install ENA drivers on the CFME 5.10 image with this tutorial: https://github.com/amzn/amzn-drivers/tree/master/kernel/linux/ena

After installing the drivers, I also ran:

yum install dracut-config-generic
dracut -f

Then c5 instance types started working for this image. I think we should distribute the CFME image for Amazon with these ENA drivers. So setting back to ASSIGNED and lowering severity.
Satoe, Can you review comment #19? I don't know if these amazon drivers are already packaged or if we would need to create rpms for them and if we can redistribute them. I didn't look to see if there were additional build packages needed to do this. If this is something you think you can work on, we can reassign. If this needs prioritization, we can have others review the need for this enhancement/bug fix. For now, it can stay in my queue until we know what we can and will do. Thanks, Joe
The ENA driver should already be included in RHEL, so I'm not sure why the driver needs to be added. I'll have to do some research and testing to figure out exactly what's needed.
New commit detected on ManageIQ/manageiq-appliance-build/master:
https://github.com/ManageIQ/manageiq-appliance-build/commit/75895b3eedc09cb960b29326065e86f902403366

commit 75895b3eedc09cb960b29326065e86f902403366
Author:     Satoe Imaishi <simaishi>
AuthorDate: Tue Aug 21 15:20:55 2018 -0400
Commit:     Satoe Imaishi <simaishi>
CommitDate: Tue Aug 21 15:20:55 2018 -0400

    Add dracut-config-generic for EC2

    https://bugzilla.redhat.com/show_bug.cgi?id=1574029

 kickstarts/partials/packages/includes.ks.erb | 2 +
 1 file changed, 2 insertions(+)
Just as a data point, I was also having this issue on the t3 instance types. It boots just fine on the t2 instances though.
Verified in 5.10.0.15. The CFME appliance boots in EC2 environments with newer instance types (tried c5 and c5d). These instance types also bring other changes, such as disk naming: https://bugzilla.redhat.com/show_bug.cgi?id=1629853
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2019:0212