Bug 1296330 - Unable to consistently deploy overcloud nodes with OSP 7.2: Failed to mount root partition
Summary: Unable to consistently deploy overcloud nodes with OSP 7.2: Failed to mount root partition
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: y3
Target Release: 7.0 (Kilo)
Assignee: Mike Burns
QA Contact: Alexander Chuzhoy
URL:
Whiteboard:
Duplicates: 1287689 (view as bug list)
Depends On:
Blocks: 1299084
 
Reported: 2016-01-06 22:06 UTC by Jeremy
Modified: 2023-02-22 23:02 UTC
CC List: 40 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
An issue with the OpenStack Platform director 7.2 ramdisk and kernel image caused provisioning failures with the following error:

mount: you must specify the filesystem type
Failed to mount root partition /dev/sda on /mnt/rootfs

This update reverts the ramdisk and kernel image to the OpenStack Platform director 7.1 images. Using these images, the director now provisions Overcloud nodes without failure.

NOTE: An alternative workaround is to disable the localboot option for the different node types. For example, to disable localboot for Controller nodes, run:
$ nova flavor-key control unset capabilities:boot_option
Clone Of:
: 1299084 (view as bug list)
Environment:
Last Closed: 2016-02-18 16:48:20 UTC
Target Upstream Version:
Embargoed:


Attachments
Screenshot of the deployment node console (40.94 KB, image/png) - 2016-01-06 22:06 UTC, Jeremy
rdsosreport from overcloud node (331.39 KB, text/plain) - 2016-01-20 20:02 UTC, Matt Wisch


Links
Red Hat Knowledge Base (Solution) 2157861 - Last Updated 2016-02-11 11:51:28 UTC
Red Hat Product Errata RHBA-2016:0264 - SHIPPED_LIVE - Red Hat Enterprise Linux OSP 7 director Bug Fix Advisory - Last Updated 2016-02-18 21:41:29 UTC

Description Jeremy 2016-01-06 22:06:24 UTC
Created attachment 1112289 [details]
Screenshot of the deployment node console

Description of problem:
When attempting to deploy overcloud nodes using a director node with the OSP 7.2 packages and images, the Ironic nodes often end up in the wait call-back state until the timeout period has expired.  The Nova scheduler will re-shuffle the nodes if enough are available per role, and the process continues until the overall stack timeout is reached and the deployment fails.

Looking at the console output of the overcloud nodes stuck in wait call-back, the failure occurs during the install_bootloader phase of the dracut pre-mount script /lib/dracut/hooks/pre-mount/50-init.sh.  The error returned is "mount: you must specify the filesystem type" "Failed to mount root partition /dev/sda on /mnt/rootfs"


Version-Release number of selected component (if applicable):


How reproducible:
The behavior is inconsistent.  In our last attempt using a lab with 12 overcloud nodes, 4 of the nodes made it to an active state, while the other 8 failed while attempting to install the bootloader.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

We are experiencing this behavior on a lab with a director node freshly installed with RHEL 7.2 and OSP 7.2 packages, as well as a lab where the director node was upgraded from OSP 7.1 to 7.2.

A screenshot of the behavior is attached.

Comment 1 Ben Nemec 2016-01-06 23:57:52 UTC
I'm seeing the same behavior at another customer on 7.2.  They were successfully running on 7.0 or 7.1 (I'm not sure exactly which release they were on before), but as soon as they upgraded to 7.2 they started to get somewhat random instances of this bug.

A workaround is to disable localboot, which can be done with:

nova flavor-key control unset capabilities:boot_option

This should be run against any flavors being used (replace "control" with the name of the other flavors).
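A minimal sketch of applying this across several flavors at once (the flavor names control, compute and ceph-storage are assumptions based on the default deployment roles; substitute whatever flavors are actually in use):

for flavor in control compute ceph-storage; do
    nova flavor-key $flavor unset capabilities:boot_option
done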

To turn localboot back on, do:

nova flavor-key control set capabilities:boot_option=local

Note that this introduces a dependency on the undercloud for even booting the overcloud nodes, so it is not a permanent solution, but as a temporary workaround it may be useful.  It can also help to verify that anyone seeing this behavior is hitting the same bug.

Also note that I'm told this problem goes away with the new IPA ramdisk that will be used in OSP 8.  Unfortunately that is not available in OSP 7 so it doesn't help existing deployments.

Comment 2 Jeremy 2016-01-07 22:50:43 UTC
Latest: 

A random number of nodes (in the active state) will get stuck after the overcloud-full image is laid down and the node is rebooted.  If we identify and restart the hung nodes, the install will continue.

Looking at /var/log/messages on one of the nodes, it looks as if the node detects the link state of some of the 10 Gb NICs at inconsistent intervals.

For reference, here is our network layout:

eno1: link, unused
eno2: link, provisioning NIC
eno3: no link
eno4: no link
eno49: link, bond0
eno50: link, bond1
ens2f0: link, bond0
ens2f1: link, bond1

VLAN Mappings:
bond0: External Network, Internal Network
bond1: Storage network, Storage backend network, Tenant network

First sign of the problem...

Jan  7 16:14:55 localhost dhcp-all-interfaces.sh: Inspecting interface: eno49...No link detected, skipping
Jan  7 16:14:55 localhost systemd: dhcp-interface: main process exited, code=exited, status=1/FAILURE
Jan  7 16:14:55 localhost systemd: Failed to start DHCP interface eno49.
Jan  7 16:14:55 localhost systemd: Unit dhcp-interface entered failed state.
Jan  7 16:14:55 localhost systemd: dhcp-interface failed.

.. it then messes up the device mapping as the templates include eno49 (as nic3)..

Jan  7 16:14:55 localhost os-collect-config: [2016/01/07 04:14:55 PM] [INFO] nic1 mapped to: eno1
Jan  7 16:14:55 localhost os-collect-config: [2016/01/07 04:14:55 PM] [INFO] nic2 mapped to: eno2
Jan  7 16:14:55 localhost os-collect-config: [2016/01/07 04:14:55 PM] [INFO] nic3 mapped to: eno50
Jan  7 16:14:55 localhost os-collect-config: [2016/01/07 04:14:55 PM] [INFO] nic4 mapped to: ens2f0
Jan  7 16:14:55 localhost os-collect-config: [2016/01/07 04:14:55 PM] [INFO] nic5 mapped to: ens2f1

..then in the middle of os-collect-config doing its thing, eno49 magically comes up...

Jan  7 16:14:56 localhost NetworkManager[989]: <info>  (eno49): link disconnected
Jan  7 16:14:56 localhost NetworkManager[989]: <info>  (eno49): link connected
Jan  7 16:14:56 localhost kernel: ixgbe 0000:04:00.0 eno49: NIC Link is Up 10 Gbps, Flow Control: RX/TX

..then everything falls apart..

Jan  7 16:15:01 localhost NetworkManager[989]: <info>  (ens2f0): enslaved to non-master-type device ovs-system; ignoring
Jan  7 16:15:01 localhost NetworkManager[989]: <info>  (ens2f0): link disconnected
Jan  7 16:15:01 localhost NetworkManager[989]: <info>  (eno50): enslaved to non-master-type device ovs-system; ignoring
Jan  7 16:15:01 localhost NetworkManager[989]: <info>  (eno50): link connected
Jan  7 16:15:02 localhost os-collect-config: [2016/01/07 04:15:02 PM] [INFO] running ifup on interface: nic6
Jan  7 16:15:02 localhost kdumpctl: No memory reserved for crash kernel.
Jan  7 16:15:02 localhost kdumpctl: Starting kdump: [FAILED]
Jan  7 16:15:02 localhost systemd: kdump.service: main process exited, code=exited, status=1/FAILURE
Jan  7 16:15:02 localhost systemd: Failed to start Crash recovery kernel arming.
Jan  7 16:15:02 localhost systemd: Startup finished in 1.635s (kernel) + 3.783s (initrd) + 24.873s (userspace) = 30.292s.
Jan  7 16:15:02 localhost systemd: Unit kdump.service entered failed state.
Jan  7 16:15:02 localhost systemd: kdump.service failed.
Jan  7 16:15:02 localhost /etc/sysconfig/network-scripts/ifup-eth: Device nic6 does not seem to be present, delaying initialization.
Jan  7 16:15:02 localhost os-collect-config: Traceback (most recent call last):
Jan  7 16:15:02 localhost os-collect-config: File "/usr/bin/os-net-config", line 10, in <module>
Jan  7 16:15:02 localhost os-collect-config: sys.exit(main())
Jan  7 16:15:02 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 187, in main
Jan  7 16:15:02 localhost os-collect-config: activate=not opts.no_activate)
Jan  7 16:15:02 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 402, in apply
Jan  7 16:15:02 localhost os-collect-config: self.ifup(interface)
Jan  7 16:15:02 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 150, in ifup
Jan  7 16:15:02 localhost os-collect-config: self.execute(msg, '/sbin/ifup', interface)
Jan  7 16:15:02 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 130, in execute
Jan  7 16:15:02 localhost os-collect-config: processutils.execute(cmd, *args, **kwargs)
Jan  7 16:15:02 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py", line 266, in execute
Jan  7 16:15:02 localhost os-collect-config: cmd=sanitized_cmd)
Jan  7 16:15:02 localhost os-collect-config: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Jan  7 16:15:02 localhost os-collect-config: Command: /sbin/ifup nic6
Jan  7 16:15:02 localhost os-collect-config: Exit code: 1
Jan  7 16:15:02 localhost os-collect-config: Stdout: u'ERROR    : [/etc/sysconfig/network-scripts/ifup-eth] Device nic6 does not seem to be present, delaying initialization.\n'
Jan  7 16:15:02 localhost os-collect-config: Stderr: u''
Jan  7 16:15:02 localhost os-collect-config: + RETVAL=1
Jan  7 16:15:02 localhost os-collect-config: + [[ 1 == 2 ]]
Jan  7 16:15:02 localhost os-collect-config: + [[ 1 != 0 ]]
Jan  7 16:15:02 localhost os-collect-config: + echo 'ERROR: os-net-config configuration failed.'
Jan  7 16:15:02 localhost os-collect-config: ERROR: os-net-config configuration failed.
Jan  7 16:15:02 localhost os-collect-config: + exit 1
Jan  7 16:15:02 localhost os-collect-config: [2016-01-07 16:15:02,211] (os-refresh-config) [ERROR] during configure phase. [Command '['dib-run-parts', '/usr/libexec/os-refresh-config/configure.d']' returned non-zero exit status 1]
Jan  7 16:15:02 localhost os-collect-config: [2016-01-07 16:15:02,211] (os-refresh-config) [ERROR] Aborting...
Jan  7 16:15:02 localhost os-collect-config: 2016-01-07 16:15:02.215 4463 ERROR os-collect-config [-] Command failed, will not cache new data. Command 'os-refresh-config' returned non-zero exit status 1
Jan  7 16:15:02 localhost os-collect-config: 2016-01-07 16:15:02.215 4463 WARNING os-collect-config [-] Sleeping 30.00 seconds before re-exec.


And it just sits there until you reboot the node.  In this case I need to use the numbered NIC scheme in the templates because one of the hosts has the 10 Gb NICs in a different PCI slot, which changes the udev names.
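As a side note on the nicN abstraction: newer versions of os-net-config support a persistent mapping file that pins the nicN aliases to specific device names or MAC addresses, which would stop the mapping from shifting when a link is not detected. Whether the os-net-config shipped with OSP 7.2 honours this is an assumption that would need checking. A minimal sketch of such a mapping in /etc/os-net-config/mapping.yaml, using the interface names from the layout above as examples:

interface_mapping:
  nic1: eno1
  nic2: eno2
  nic3: eno49
  nic4: eno50
  nic5: ens2f0
  nic6: ens2f1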

Comment 4 John Browning 2016-01-12 00:45:31 UTC
We are hitting this same bug at a customer site while performing a fresh install.
I'm attempting the workaround mentioned by Ben.

Comment 5 John Browning 2016-01-12 00:47:38 UTC
Workaround does not work for a fresh install.

Comment 7 John Browning 2016-01-12 18:23:10 UTC
Verified workaround: use the 7.1 deploy ramdisk and 7.1 deploy kernel.
Use the 7.2 images for discovery and the overcloud image.
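A rough sketch of how that swap might be done on the undercloud, assuming the default director image names and path (bm-deploy-kernel, bm-deploy-ramdisk, /home/stack/images); this is an outline of the usual image-upload workflow, not a procedure confirmed in this bug:

# extract the 7.1 deploy-ramdisk-ironic tarball over the 7.2 files in /home/stack/images,
# then remove the old deploy images from glance and re-upload:
glance image-delete bm-deploy-kernel
glance image-delete bm-deploy-ramdisk
openstack overcloud image upload --image-path /home/stack/images
# refresh the deploy kernel/ramdisk references on the registered nodes:
openstack baremetal configure boot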

Comment 8 Nicolas Hicher 2016-01-13 13:51:22 UTC
I have the same issue; I opened a bug in December about it: https://bugzilla.redhat.com/show_bug.cgi?id=1292598 (we can probably close it).

The error message is:

///lib/dracut/hooks/pre-mount/50-init.sh@210(install_bootloader) partprobe /dev/sda                                                                            
dracut-pre-mount Error: Error informing the kernel about modifications to partition /dev/sda1 - Device or resource busy.  This means Linux won't know about any changes you made to /dev/sda1 until you reboot -- so you shouldn't mount it or use it in any way before rebooting.

Comment 9 Nicolas Hicher 2016-01-13 15:11:06 UTC
I confirm it's ok with 7.1 deploy ramdisk & 7.1 deploy kernel.

Comment 10 Dmitry Tantsur 2016-01-15 12:52:42 UTC
I see three problems mentioned in this thread; let's concentrate on the initial one: "mount: you must specify the filesystem type" "Failed to mount root partition /dev/sda on /mnt/rootfs". Please create separate reports for the other issues (especially the network one, which actually seems to belong to os_net_config).

Comment 12 Ruchika K 2016-01-16 17:53:30 UTC
With respect to the comment from 2016-01-06: when the flavor is set to disable localboot, please confirm whether the node properties need to be updated as well, or else the basic capability checks will fail during deployment.

Thank you

Comment 13 Dmitry Tantsur 2016-01-19 11:36:20 UTC
*** Bug 1287689 has been marked as a duplicate of this bug. ***

Comment 14 Dmitry Tantsur 2016-01-19 11:39:28 UTC
Two suggested workarounds from the duplicate bug 1287689:

1. "I used the upstream ramdisk/kernel and I was able to get the baremetal nodes to install."

2. "I hit this in a virt environment as well when trying to deploy an overcloud for a 2nd time (delete and redeploy). I worked around it by recreating the overcloud nodes image files:

for image in $(ls /var/lib/libvirt/images/ | grep baremetalbrbm); do qemu-img create -f qcow2 /var/lib/libvirt/images/$image 41G; done"

Comment 15 Dmitry Tantsur 2016-01-19 11:42:18 UTC
Ruchika, it's not required, but it's recommended that you update nodes as well.
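For illustration, a hypothetical way to drop boot_option from a node's capabilities with the ironic CLI; the remaining capability values shown are only examples, so check the node's current properties first and keep everything except boot_option:local:

ironic node-update <node-uuid> replace properties/capabilities='profile:control,cpu_arch:x86_64'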


Could anyone with this problem please get full logs from the failing deploy image? Grab /run/initramfs/rdsosreport.txt and the output of journalctl. You can use curl to push these files to any remote location (e.g. an FTP server).
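For example (the FTP host, path and credentials below are placeholders):

journalctl > /run/initramfs/journal.txt
curl -T /run/initramfs/rdsosreport.txt ftp://ftp.example.com/incoming/ --user myuser:mypass
curl -T /run/initramfs/journal.txt ftp://ftp.example.com/incoming/ --user myuser:mypass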

Comment 19 Dmitry Tantsur 2016-01-19 13:21:42 UTC
My needinfo for logs still stands; please don't remove it. It would also be great to grab the ironic-conductor logs from /var/log/ironic or via journalctl -u openstack-ironic-conductor.

Comment 21 Matt Wisch 2016-01-20 20:02:28 UTC
Created attachment 1116713 [details]
rdsosreport from overcloud node

Comment 22 Matt Wisch 2016-01-20 20:10:52 UTC
(In reply to Dmitry Tantsur from comment #15)
> Ruchika, it's not required, but it's recommended that you update nodes as
> well.
> 
> 
> Could anyone with this problem please get full logs from the failing deploy
> image? Grab /run/initramfs/rdsosreport.txt and output of journalctl. You can
> use curl to push this files to any remote location (e.g. FTP).

Dmitry, I uploaded an rdsosreport generated from the deploy image.  I have separate journalctl output, but it is in the rdsosreport as well so I didn't upload it.

Comment 24 Felipe Alfaro Solana 2016-01-22 09:14:52 UTC
(In reply to John Browning from comment #7)
> Verified workaround is to use the 7.1 deploy ramdisk & 7.1 deploy kernel.
> Use 7.2 for discover & overcloud image.

May I ask how it is possible that QA didn't detect this?

Comment 25 Jaromir Coufal 2016-01-22 14:50:59 UTC
Hi Felipe. It was tested and we run CI against these versions. The issue seems to be a combination of specific hardware, RHEL 7.2, and the deploy ramdisk/kernel images. Since it appears to be a hardware-specific issue, we did not catch it in our testing. A workaround for 7.2 is stated here, we are working on a solution for 7.3, and for OSP 8 this should be solved by IPA replacing these images.

Comment 26 Nicolas Hicher 2016-01-22 15:25:13 UTC
Hi Jaromir,

It's not only with specific hardware; I've hit this issue in a virtual environment too.

Comment 27 Dai Saito 2016-01-24 10:38:34 UTC
I also hit the same error.
Error message: 'Failed to mount root partition /dev/sda on /mnt/rootfs'

After the partprobe command (at line 210) executed in 50-init.sh (initramfs), I added code to sleep for a few seconds. Then the deployment was successful.

For the deployment to succeed, it is probably necessary to mount /dev/sda2, but the device files for the partitions (/dev/sda1, /dev/sda2) do not yet exist at that point, so I think that is why the mount fails.

device files:
ls /dev/sda*
/dev/sda
/dev/sda1
/dev/sda2

I think that to mount reliably, it is necessary to wait until the partition device files have been created.

Image:  deploy-ramdisk-ironic.initramfs
Script: /lib/dracut/hooks/pre-mount/50-init.sh
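A minimal sketch of the kind of wait described above, inserted after the partprobe call in 50-init.sh; the use of udevadm settle and the 10-second retry limit are assumptions for illustration, not the shipped fix:

partprobe /dev/sda
# give udev a chance to create the partition device nodes before mounting
udevadm settle --timeout=10
i=0
while [ ! -b /dev/sda2 ] && [ "$i" -lt 10 ]; do
    sleep 1
    i=$((i+1))
done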

Comment 28 Jaromir Coufal 2016-01-27 10:51:26 UTC
Thanks, Nicolas. I will run this by our CI team to find out where the issue is.

Comment 30 Christopher Brown 2016-01-28 10:19:27 UTC
Hello,

This is also affecting our deployment. For Red Hat folks, please see:

https://access.redhat.com/support/cases/#/case/01571628

for more information, sosreports, etc.

It would be really good to understand how CI missed this, as it has cost us around two weeks of engineering time while we chased various potential causes such as old UEFI, bad firmware, RAID configuration, an incorrect undercloud deployment, etc.

Thanks

Comment 32 Iuliia Ievstignieieva 2016-01-28 13:59:53 UTC
Hi Mike,

Could you please provide an update on the progress of the bug? It is affecting CEE customers.

Thank you,

Julia
Team Lead
GSS EMEA

Comment 33 Dave Cain 2016-01-28 14:51:17 UTC
(In reply to Dmitry Tantsur from comment #10)
> I see 3 problems mentioned in this thread, let's concentrate on the initial
> problem: "mount: you must specify the filesystem type" "Failed to mount root
> partition /dev/sda on /mnt/rootfs". Please create separate reports for other
> issues (especially network one which seems to belong to os_net_config
> actually).

I noticed this problem in the OSP8 beta with the deployment ramdisk contained in the deploy-ramdisk-ironic-8.0-20151203.1-beta-2.tar.

Although I am booting to a remote iSCSI LUN hosted on a storage array, I still ran into the problem described in this thread.  A workaround for me was to destroy the LUN and re-create it, so that during provisioning the deployment ramdisk sees nothing but a raw block device.  I haven't tested, but a similar fix for local disk (assuming that's what you're using) may be to wipe the partition table on the nodes to be deployed to, before launching the overcloud deployment.

Comments #4 and #5 appear to imply this also happens on a fresh install, so take my comments for what they're worth.
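A hypothetical way to do that wipe from a shell on the node before deployment (this destroys any data on the disk, so only use it on nodes whose local disk is disposable):

# clear filesystem, RAID and partition-table signatures from the disk
wipefs -a /dev/sda
# or, more bluntly, zero out the first megabyte of the disk
dd if=/dev/zero of=/dev/sda bs=1M count=1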

Comment 34 Mike Burns 2016-01-28 14:53:10 UTC
This bug is ON_QA and planned for the 7.3 release of OSP director.  For OSP 8, the plan is to use the ironic-python-agent image for deployment.

Comment 37 Christopher Brown 2016-01-28 14:59:35 UTC
(In reply to Mike Burns from comment #34)
> This bug is ON_QA and planned for the 7.3 release of OSP director.  For OSP
> 8, the plan is to use the ironic-python-agent image for deployment.

Thanks, is there an ETA on either please?

Comment 39 Christopher Brown 2016-01-28 23:36:44 UTC
Hello,

I have disabled UEFI boot and switched to Legacy.
I have re-created the RAID arrays on all nodes.
I have changed the boot order to boot from PXE first; this goes through a second PXE boot/install, but it allows all nodes to boot correctly.

My suspicion here is multiple issues - possibly flashing the controller firmware has helped. Also perhaps patchy UEFI support? Or patchy UEFI firmware?

Comment 42 Alexander Chuzhoy 2016-02-10 16:20:45 UTC
Verified:

The last set of images includes the deploy-ramdisk-ironic.tar used for 7.1 GA, where this issue wasn't reported.
Verified the ability to deploy successfully with this set of images.

Verified that the sha1 checksum is the same for both the deploy-ramdisk-ironic.tar used in 7.1 GA and the latest deploy-ramdisk-ironic.tar provided to QE.
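For reference, that comparison can be done with sha1sum (the paths below are illustrative):

sha1sum /path/to/7.1GA/deploy-ramdisk-ironic.tar /path/to/latest/deploy-ramdisk-ironic.tar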

Comment 44 errata-xmlrpc 2016-02-18 16:48:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0264.html

