Created attachment 1112289 [details]
Screenshot of the deployment node console

Description of problem:
When deploying overcloud nodes from a director node with the OSP 7.2 packages and images, the Ironic nodes often remain in the "wait call-back" state until the timeout period expires. Nova scheduler will reshuffle the nodes if enough are available per role, and the process continues until the overall stack timeout is reached and the deployment fails.

Looking at the console output of the overcloud nodes stuck in "wait call-back", the failure occurs during the install_bootloader phase of the dracut pre-mount script /lib/dracut/hooks/pre-mount/50-init.sh. The errors returned are:

"mount: you must specify the filesystem type"
"Failed to mount root partition /dev/sda on /mnt/rootfs"

Version-Release number of selected component (if applicable):

How reproducible:
The behavior is inconsistent. In our last attempt, using a lab with 12 overcloud nodes, 4 of the nodes reached the active state, while the other 8 failed while attempting to install the bootloader.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
We are seeing this behavior both on a lab with a director node freshly installed with RHEL 7.2 and OSP 7.2 packages, and on a lab where the director node was upgraded from OSP 7.1 to 7.2. A screenshot of the behavior is attached.
I'm seeing the same behavior at another customer on 7.2. They were running successfully on 7.0 or 7.1 (I'm not sure exactly which release they were on before), but as soon as they upgraded to 7.2 they started to hit somewhat random instances of this bug.

A workaround is to disable localboot, which can be done with:

nova flavor-key control unset capabilities:boot_option

This should be run against any flavors being used (replace "control" with the name of the other flavors). To turn localboot back on, do:

nova flavor-key control set capabilities:boot_option=local

Note that this introduces a dependency on the undercloud for even booting the overcloud nodes, so it is not a permanent solution, but it may be useful as a temporary workaround. It can also help verify that anyone seeing this behavior is hitting the same bug.

Also note that I'm told this problem goes away with the new IPA ramdisk that will be used in OSP 8. Unfortunately that is not available in OSP 7, so it doesn't help existing deployments.
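Since the unset has to be repeated per flavor, the workaround above can be scripted; here is a minimal sketch, printed as a dry run. The flavor names are assumptions — substitute whatever flavors are actually mapped to your roles.

```shell
# Dry run: print the nova flavor-key commands that would disable localboot
# for each deployment flavor. The flavor names below are placeholder
# assumptions; replace them with the flavors actually assigned to your roles.
FLAVORS="control compute ceph-storage"
for flavor in $FLAVORS; do
    echo nova flavor-key "$flavor" unset capabilities:boot_option
done
```

Dropping the leading `echo` executes the commands against the undercloud; the same loop with `set capabilities:boot_option=local` instead of `unset capabilities:boot_option` re-enables localboot afterwards.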
Latest: a random number of nodes (in the active state) will get stuck after the overcloud-full image is laid down and the node is rebooted. If we identify and restart the hung nodes, the install continues. Looking at /var/log/messages on one of the nodes, it appears the link state of some of the 10G NICs is detected at inconsistent intervals.

For reference, here is our network layout:

eno1:   link, unused
eno2:   link, provisioning NIC
eno3:   no link
eno4:   no link
eno49:  link, bond0
eno50:  link, bond1
ens2f0: link, bond0
ens2f1: link, bond1

VLAN mappings:

bond0: External network, Internal network
bond1: Storage network, Storage backend network, Tenant network

First sign of the problem:

Jan 7 16:14:55 localhost dhcp-all-interfaces.sh: Inspecting interface: eno49...No link detected, skipping
Jan 7 16:14:55 localhost systemd: dhcp-interface: main process exited, code=exited, status=1/FAILURE
Jan 7 16:14:55 localhost systemd: Failed to start DHCP interface eno49.
Jan 7 16:14:55 localhost systemd: Unit dhcp-interface entered failed state.
Jan 7 16:14:55 localhost systemd: dhcp-interface failed.

It then gets the device mapping wrong, since the templates include eno49 (as nic3):

Jan 7 16:14:55 localhost os-collect-config: [2016/01/07 04:14:55 PM] [INFO] nic1 mapped to: eno1
Jan 7 16:14:55 localhost os-collect-config: [2016/01/07 04:14:55 PM] [INFO] nic2 mapped to: eno2
Jan 7 16:14:55 localhost os-collect-config: [2016/01/07 04:14:55 PM] [INFO] nic3 mapped to: eno50
Jan 7 16:14:55 localhost os-collect-config: [2016/01/07 04:14:55 PM] [INFO] nic4 mapped to: ens2f0
Jan 7 16:14:55 localhost os-collect-config: [2016/01/07 04:14:55 PM] [INFO] nic5 mapped to: ens2f1

Then, in the middle of os-collect-config doing its thing, eno49 magically comes up...
Jan 7 16:14:56 localhost NetworkManager[989]: <info> (eno49): link disconnected
Jan 7 16:14:56 localhost NetworkManager[989]: <info> (eno49): link connected
Jan 7 16:14:56 localhost kernel: ixgbe 0000:04:00.0 eno49: NIC Link is Up 10 Gbps, Flow Control: RX/TX

...then everything falls apart:

Jan 7 16:15:01 localhost NetworkManager[989]: <info> (ens2f0): enslaved to non-master-type device ovs-system; ignoring
Jan 7 16:15:01 localhost NetworkManager[989]: <info> (ens2f0): link disconnected
Jan 7 16:15:01 localhost NetworkManager[989]: <info> (eno50): enslaved to non-master-type device ovs-system; ignoring
Jan 7 16:15:01 localhost NetworkManager[989]: <info> (eno50): link connected
Jan 7 16:15:02 localhost os-collect-config: [2016/01/07 04:15:02 PM] [INFO] running ifup on interface: nic6
Jan 7 16:15:02 localhost kdumpctl: No memory reserved for crash kernel.
Jan 7 16:15:02 localhost kdumpctl: Starting kdump: [FAILED]
Jan 7 16:15:02 localhost systemd: kdump.service: main process exited, code=exited, status=1/FAILURE
Jan 7 16:15:02 localhost systemd: Failed to start Crash recovery kernel arming.
Jan 7 16:15:02 localhost systemd: Startup finished in 1.635s (kernel) + 3.783s (initrd) + 24.873s (userspace) = 30.292s.
Jan 7 16:15:02 localhost systemd: Unit kdump.service entered failed state.
Jan 7 16:15:02 localhost systemd: kdump.service failed.
Jan 7 16:15:02 localhost /etc/sysconfig/network-scripts/ifup-eth: Device nic6 does not seem to be present, delaying initialization.
Jan 7 16:15:02 localhost os-collect-config: Traceback (most recent call last):
Jan 7 16:15:02 localhost os-collect-config: File "/usr/bin/os-net-config", line 10, in <module>
Jan 7 16:15:02 localhost os-collect-config: sys.exit(main())
Jan 7 16:15:02 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 187, in main
Jan 7 16:15:02 localhost os-collect-config: activate=not opts.no_activate)
Jan 7 16:15:02 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 402, in apply
Jan 7 16:15:02 localhost os-collect-config: self.ifup(interface)
Jan 7 16:15:02 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 150, in ifup
Jan 7 16:15:02 localhost os-collect-config: self.execute(msg, '/sbin/ifup', interface)
Jan 7 16:15:02 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 130, in execute
Jan 7 16:15:02 localhost os-collect-config: processutils.execute(cmd, *args, **kwargs)
Jan 7 16:15:02 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py", line 266, in execute
Jan 7 16:15:02 localhost os-collect-config: cmd=sanitized_cmd)
Jan 7 16:15:02 localhost os-collect-config: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Jan 7 16:15:02 localhost os-collect-config: Command: /sbin/ifup nic6
Jan 7 16:15:02 localhost os-collect-config: Exit code: 1
Jan 7 16:15:02 localhost os-collect-config: Stdout: u'ERROR : [/etc/sysconfig/network-scripts/ifup-eth] Device nic6 does not seem to be present, delaying initialization.\n'
Jan 7 16:15:02 localhost os-collect-config: Stderr: u''
Jan 7 16:15:02 localhost os-collect-config: + RETVAL=1
Jan 7 16:15:02 localhost os-collect-config: + [[ 1 == 2 ]]
Jan 7 16:15:02 localhost os-collect-config: + [[ 1 != 0 ]]
Jan 7 16:15:02 localhost os-collect-config: + echo 'ERROR: os-net-config configuration failed.'
Jan 7 16:15:02 localhost os-collect-config: ERROR: os-net-config configuration failed.
Jan 7 16:15:02 localhost os-collect-config: + exit 1
Jan 7 16:15:02 localhost os-collect-config: [2016-01-07 16:15:02,211] (os-refresh-config) [ERROR] during configure phase. [Command '['dib-run-parts', '/usr/libexec/os-refresh-config/configure.d']' returned non-zero exit status 1]
Jan 7 16:15:02 localhost os-collect-config: [2016-01-07 16:15:02,211] (os-refresh-config) [ERROR] Aborting...
Jan 7 16:15:02 localhost os-collect-config: 2016-01-07 16:15:02.215 4463 ERROR os-collect-config [-] Command failed, will not cache new data. Command 'os-refresh-config' returned non-zero exit status 1
Jan 7 16:15:02 localhost os-collect-config: 2016-01-07 16:15:02.215 4463 WARNING os-collect-config [-] Sleeping 30.00 seconds before re-exec.

And it just sits there until you reboot the node. In this case I need to use the numbered NIC scheme in the templates because one of the hosts has the 10G NICs in a different PCI slot, which changes the udev names.
We are hitting this same bug at a customer site while performing a fresh install. I'm attempting the workaround mentioned by Ben.
Workaround does not work for a fresh install.
A verified workaround is to use the 7.1 deploy ramdisk and 7.1 deploy kernel. Use 7.2 for the discovery and overcloud images.
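For anyone trying this workaround, the per-node change can be sketched as a dry run along these lines. The image UUIDs and the node list are placeholder assumptions; on a real undercloud the node UUIDs would come from `ironic node-list` after uploading the 7.1 deploy images to Glance.

```shell
# Dry run: print the ironic node-update commands that would point each node's
# driver_info at the 7.1 deploy kernel/ramdisk. The UUIDs and node list are
# placeholder assumptions, not values from this environment.
DEPLOY_KERNEL="<7.1-bm-deploy-kernel-glance-uuid>"
DEPLOY_RAMDISK="<7.1-bm-deploy-ramdisk-glance-uuid>"
NODES="node-uuid-1 node-uuid-2"   # normally parsed from `ironic node-list`
for node in $NODES; do
    echo ironic node-update "$node" replace \
        "driver_info/deploy_kernel=$DEPLOY_KERNEL" \
        "driver_info/deploy_ramdisk=$DEPLOY_RAMDISK"
done
```

Removing the `echo` would apply the change; the 7.2 discovery and overcloud images stay untouched.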
I have the same issue; I opened a bug about it in December: https://bugzilla.redhat.com/show_bug.cgi?id=1292598 (we can probably close it). The error message is:

///lib/dracut/hooks/pre-mount/50-init.sh@210(install_bootloader) partprobe /dev/sda
dracut-pre-mount Error: Error informing the kernel about modifications to partition /dev/sda1 - Device or resource busy. This means Linux won't know about any changes you made to /dev/sda1 until you reboot -- so you shouldn't mount it or use it in any way before rebooting.
I confirm it works with the 7.1 deploy ramdisk and 7.1 deploy kernel.
I see 3 problems mentioned in this thread; let's concentrate on the initial one: "mount: you must specify the filesystem type" / "Failed to mount root partition /dev/sda on /mnt/rootfs". Please create separate reports for the other issues (especially the network one, which actually seems to belong to os_net_config).
With respect to the comment on 01-06-2016: when the flavor is changed to disable localboot, please confirm whether the node properties need to be updated as well, otherwise the basic capability checks will fail during deployment. Thank you
*** Bug 1287689 has been marked as a duplicate of this bug. ***
Two suggested workarounds from the duplicate bug 1287689:

1. "I used the upstream ramdisk/kernel and I was able to get the baremetal nodes to install."

2. "I hit this in a virt environment as well when trying to deploy an overcloud for a 2nd time (delete and redeploy). I worked around it by recreating the overcloud nodes image files:

for image in $(ls /var/lib/libvirt/images/ | grep baremetalbrbm); do qemu-img create -f qcow2 /var/lib/libvirt/images/$image 41G; done"
Ruchika, it's not required, but it's recommended that you update the nodes as well.

Could anyone with this problem please get full logs from the failing deploy image? Grab /run/initramfs/rdsosreport.txt and the output of journalctl. You can use curl to push these files to any remote location (e.g. FTP).
My needinfo for logs still stands, please don't remove it. It would also be great to grab the ironic-conductor logs, from /var/log/ironic or via journalctl -u openstack-ironic-conductor.
Created attachment 1116713 [details]
rdsosreport from overcloud node
(In reply to Dmitry Tantsur from comment #15) > Ruchika, it's not required, but it's recommended that you update nodes as > well. > > > Could anyone with this problem please get full logs from the failing deploy > image? Grab /run/initramfs/rdsosreport.txt and output of journalctl. You can > use curl to push this files to any remote location (e.g. FTP). Dmitry, I uploaded an rdsosreport generated from the deploy image. I have separate journalctl output, but it is in the rdsosreport as well so I didn't upload it.
(In reply to John Browning from comment #7)
> Verified workaround is to use the 7.1 deploy ramdisk & 7.1 deploy kernel.
> Use 7.2 for discover & overcloud image.

May I ask how it is possible that QA didn't detect this?
Hi Felipe. It was tested, and we run CI against these versions. The issue seems to be a combination of specific hardware, RHEL 7.2, and the deploy ramdisk/kernel. Since it appears to be a hardware-specific issue, we did not catch it in our testing. A workaround for 7.2 is stated here; we are working on a solution for 7.3, and for OSP 8 this should be solved by IPA replacing these disk images.
Hi Jaromir, it's not only specific hardware: I've hit this issue in a virtual environment as well.
I have hit the same error:

'Failed to mount root partition /dev/sda on /mnt/rootfs'

After the partprobe command (line 210) executed in 50-init.sh (initramfs), I added code to sleep for a few seconds, and the deployment then succeeded.

For the deployment to succeed, /dev/sda2 must be mounted, but the device files for the partitions (/dev/sda1, /dev/sda2) do not yet exist at that point, so the mount fails. Once created, the device files look like this:

ls /dev/sda*
/dev/sda /dev/sda1 /dev/sda2

I think that, to mount reliably, it is necessary to wait until the partition device files have been created.

Image: deploy-ramdisk-ironic.initramfs
Script: /lib/dracut/hooks/pre-mount/50-init.sh
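The race described above (partprobe returning before the partition device nodes have been created) could be closed more robustly than with a fixed sleep by polling for the node. A minimal sketch, assuming a helper of my own naming rather than the shipped 50-init.sh code:

```shell
# Sketch: poll until a partition device node exists before mounting it,
# instead of a fixed sleep after partprobe. The function name, retry count,
# and use of udevadm are assumptions, not the actual ramdisk script.
wait_for_dev() {
    dev=$1
    tries=${2:-10}
    i=0
    while [ "$i" -lt "$tries" ]; do
        [ -e "$dev" ] && return 0
        # Flush pending udev events if udevadm is available, then wait a beat.
        command -v udevadm >/dev/null 2>&1 && udevadm settle --timeout=5 || true
        sleep 1
        i=$((i + 1))
    done
    [ -e "$dev" ]
}
```

In the ramdisk this would be called right after partprobe, e.g. `wait_for_dev /dev/sda2 || exit 1`, before attempting the mount on /mnt/rootfs.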
Thanks Nicolas, I will raise it with our CI team to find out where the issue is.
Hello, this is also affecting our deployment. For Red Hat folks, please see https://access.redhat.com/support/cases/#/case/01571628 for more information, sosreports, etc.

It would be really good to understand how CI missed this, as it has cost us around 2 weeks of engineering time while we chased various potential causes like old UEFI, bad firmware, RAID config, and an incorrect undercloud deployment. Thanks
Hi Mike, Could you please provide an update on the progress of the bug? It is affecting CEE customers. Thank you, Julia Team Lead GSS EMEA
(In reply to Dmitry Tantsur from comment #10)
> I see 3 problems mentioned in this thread, let's concentrate on the initial
> problem: "mount: you must specify the filesystem type" "Failed to mount root
> partition /dev/sda on /mnt/rootfs". Please create separate reports for other
> issues (especially network one which seems to belong to os_net_config
> actually).

I noticed this problem in the OSP 8 beta with the deployment ramdisk contained in deploy-ramdisk-ironic-8.0-20151203.1-beta-2.tar. Although I am booting to a remote iSCSI LUN hosted on a storage array, I still ran into the problem described in this thread.

A workaround for me was to destroy the LUN and re-create it, so that during provisioning the deployment ramdisk sees nothing but a raw block device. I haven't tested it, but a similar fix for local disk (assuming that's what you're using) may be to wipe the partition table on the nodes to be deployed to before launching the overcloud deployment. Comments #4 and #5 appear to imply this also happens on a fresh install, though, so take my comments for what they're worth.
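As a hedged sketch of that untested local-disk equivalent: zeroing the start of the disk makes it look raw to the deploy ramdisk. The helper name is mine, the target path is an assumption, and this is destructive — it erases the partition table.

```shell
# Sketch: zero the first MiB of a disk (MBR/GPT headers plus boot sector) so
# the deploy ramdisk sees an apparently raw block device. DESTRUCTIVE: this
# erases the partition table. Helper name and target path are assumptions.
wipe_disk_start() {
    target=$1
    dd if=/dev/zero of="$target" bs=1M count=1 conv=notrunc 2>/dev/null
}

# Usage on a node to be redeployed (NOT on the director!):
# wipe_disk_start /dev/sda
```

Note that GPT also keeps a backup header at the end of the disk, so `wipefs -a` or `sgdisk --zap-all` would be more thorough where those tools are available.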
This bug is ON_QA and planned for the 7.3 release of OSP director. For OSP 8, the plan is to use the ironic-python-agent image for deployment.
(In reply to Mike Burns from comment #34) > This bug is ON_QA and planned for the 7.3 release of OSP director. For OSP > 8, the plan is to use the ironic-python-agent image for deployment. Thanks, is there an ETA on either please?
Hello, I have disabled UEFI boot and switched to Legacy. I have re-created the RAID arrays on all nodes. I have changed the boot order to boot from PXE first; this goes through a second PXE boot/install but allows all nodes to boot correctly.

My suspicion here is multiple issues — possibly flashing the controller firmware has helped. Perhaps also patchy UEFI support, or patchy UEFI firmware?
Verified: the last set of images includes the deploy-ramdisk-ironic.tar used for 7.1 GA, where this issue wasn't reported. Verified the ability to deploy successfully with this set of images. Verified that the sha1 checksum is the same for the deploy-ramdisk-ironic.tar used in 7.1 GA and the latest deploy-ramdisk-ironic.tar provided to QE.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0264.html