Bug 1612033 - Overcloud nodes with pre-defined RAID won't deploy
Summary: Overcloud nodes with pre-defined RAID won't deploy
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Dmitry Tantsur
QA Contact: mlammon
URL:
Whiteboard:
Depends On: 1608955
Blocks:
 
Reported: 2018-08-03 10:06 UTC by Irina Petrova
Modified: 2023-09-15 00:11 UTC
CC: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-18 13:09:12 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Issue Tracker OSP-11520 (last updated 2021-12-10 17:00:27 UTC)

Description Irina Petrova 2018-08-03 10:06:07 UTC
Description of problem:

HP ProLiant Gen9 servers with pre-configured RAID 1 won't deploy with RHEL 7.5 images (which, AFAIK, is the only option for RHOSP-13 at the moment).

We're tracking the issue from the kernel POV in BZ 1608955.

However, since OpenStack relies on Ironic for the deployment, we'd like to raise awareness and get feedback on whether the issue can be addressed at the RHOSP level by any means, potentially working around any kernel constraints and/or in coordination with the latest RHEL kernel changes.

Initial feedback on the issue:
~~~
When deploying with the RAID controller (in UEFI mode), introspection completes successfully, but looking at the introspection data I can't see that the director recognizes the RAID I have configured on the server (I configured the RAID in the UEFI setup).

When it comes to the deployment itself, my server fails to finish the iPXE process and keeps looping (the server restarts and tries to iPXE again); I think it doesn't work because there is a conflict between the configuration on the server (the RAID) and the introspection data (two separate disks).

I have set the boot_mode capability on the server to uefi.
~~~
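
For reference, the boot_mode capability mentioned in the quote above is normally set along these lines (a sketch; the node UUID is a placeholder):

~~~
# Typical way to pin a node to UEFI boot (node UUID is a placeholder):
openstack baremetal node set <node-uuid> \
    --property capabilities='boot_mode:uefi'
~~~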

Follow-up:
~~~
*) Introspection does not work in UEFI mode (neither with the clean 'stock' init ramdisk nor with the modified ramdisk with added drivers); this is with iPXE.
*) Introspection works in Legacy mode. 
   >> However, in Legacy mode, introspection finds two disks when it should see just one (i.e. the 2 disks that make up the RAID 1).
~~~

Additional info:

There's been another test on RHOSP-10, with RHEL 7.3, and the result is the same (introspection fails in a RAID 1 setup).

Hence, here's the list of tested kernels:
*) 7.5 kernel 3.10.0-862
*) 7.3 kernel 3.10.0-514

Tests with PXE (as opposed to iPXE) in UEFI mode were still to be done.

Comment 1 Irina Petrova 2018-08-05 16:17:47 UTC
Update: we're past introspection now.

After they re-installed the Director node, the 'alloc highmem for initrd' failure is gone:

~~~~
1. We used Lenny's kernel + original initramfs = Introspection fails (no highmem error).
2. We used original kernel + original initramfs = Introspection fails (no highmem error).
3. We used original kernel + edited initramfs = Introspection successful (no highmem error), but the output of 'openstack baremetal introspection data save 1a4e30da-b6dc-499d-ba87-0bd8a3819bc0 | jq ".inventory.disks"' produces 2 disks (/dev/sda and /dev/sdb). Is this a normal situation, or should it produce only one disk (/dev/sda), since the RAID controller is enabled?
~~~~
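
For what it's worth, the disk count in that introspection data can be checked directly, e.g.:

~~~
# Count the disks stored by introspection (same node UUID as above):
openstack baremetal introspection data save 1a4e30da-b6dc-499d-ba87-0bd8a3819bc0 \
    | jq '.inventory.disks | length'
~~~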

Now we're trying to have them configure the iLO driver properly, so that it recognizes the SW RAID:

http://specs.openstack.org/openstack/ironic-specs/specs/approved/ironic-generic-raid-interface.html
https://docs.openstack.org/ironic/latest/admin/raid.html#raid
https://docs.openstack.org/ironic/latest/admin/drivers/ilo.html

Any straightforward must-dos would be extremely helpful, as the information is spread across the three docs above.
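
From my reading of those docs, the generic flow boils down to roughly the following. This is only a sketch (node UUID and disk layout are placeholders), and it requires a driver whose RAID interface actually supports it:

~~~
# Declare the desired RAID layout (placeholders; see the docs above):
openstack baremetal node set <node-uuid> --target-raid-config \
    '{"logical_disks": [{"size_gb": "MAX", "raid_level": "1", "is_root_volume": true}]}'

# The layout is applied during manual cleaning:
openstack baremetal node clean <node-uuid> --clean-steps \
    '[{"interface": "raid", "step": "delete_configuration"},
      {"interface": "raid", "step": "create_configuration"}]'
~~~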

As I understand it, they must use the ilo driver if they want SW RAID. Is that correct?

From BZ 1494361, we see that SW RAID was *not* an option with RHOSP-10. Do we have a generic way to configure SW RAID in RHOSP-13, or, again, *must* they use the ilo driver as per the upstream documentation quoted above?

Comment 2 Dmitry Tantsur 2018-08-06 09:02:53 UTC
Hi! The ironic driver has no effect on how in-band introspection recognizes disks. The documentation you're linking to is about creating HW RAID.

Ironic introspection merely uses the output of lsblk, so I guess the question is why lsblk does not recognize the RAID.
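
Roughly, the ramdisk relies on something like this (an illustration, not the exact agent invocation):

~~~
# What the introspection ramdisk roughly looks at on the node:
lsblk --pairs --bytes --output NAME,TYPE,SIZE,ROTA
# Firmware/"fake" RAID members still show up as TYPE="disk" here,
# which is why two disks get reported.
~~~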

Comment 5 Irina Petrova 2018-08-08 10:15:41 UTC
(In reply to Dmitry Tantsur from comment #2)

Hey! :)

Is there any form of logging we can enable in order to see what exactly is going on there?

By the way, do we support SW RAID to begin with?

Comment 7 Dmitry Tantsur 2018-08-09 08:59:04 UTC
Hi all,

Software RAID is not currently supported by Ironic in any of the OSP releases. Hardware RAID will work if the kernel supports it and lsblk returns the RAID disks as normal disks. However, hardware RAID is not officially supported by OSP and requires a support exception.

I'm not sure what exactly is happening, though. Is it just software RAID, hardware RAID, or hybrid hardware-assisted RAID? I see references both to "software RAID" and to a "RAID controller". Ironic only works with RAID that is somehow abstracted away by the operating system: if the operating system sees and reports two disks, Ironic will report and use two disks.

Could you please clarify the exact RAID type used and your expectations of Ironic?

Comment 8 Irina Petrova 2018-08-09 09:13:29 UTC
Hi Dmitry,


Let's clarify the RAID terminology.

Hardware RAID
The hardware-based array manages the RAID subsystem independently from the host. It **presents a single disk** per RAID array to the host.

Software RAID
Software RAID implements the various RAID levels in the **kernel disk (block device) code**. It offers the cheapest possible solution, as expensive disk controller cards or hot-swap chassis are not required.

In between these two there is a thing called Firmware RAID, also colloquially known as 'fake' RAID: 'fake' because the RAID controller cannot provide real Hardware RAID and instead just tells the OS to configure a Software RAID.

Firmware RAID
Firmware RAID, also known as ATARAID, is a **type of software RAID** where the RAID sets can be configured using a firmware-based menu.

## source: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/ch-raid


Aviv (TAM) passed along the information that the customer has this card:
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c04406959

From the link above, under [Key features]: the card is a Smart Array **RAID engine running in the OS driver**. Additionally, judging from the core problem, we can say that their HP RAID controller (card) does (a variation of) Software RAID.

So, the customer has a RAID controller that **canNOT provide Hardware RAID**.


Using these definitions, I would say that we do support real HW RAID in OSP, simply because at the OS level you see it as a normal disk (just one device!). The controller hides everything else, and from the OS you really cannot differentiate between a regular disk and the device presented by the controller.
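
To illustrate (this is NOT output from the customer's system, just the expected shape of the lsblk data):

~~~
# Real hardware RAID 1 -- the controller hides the members:
#   NAME="sda" TYPE="disk" SIZE="300069052416"
#
# Software/firmware RAID 1 -- the kernel sees both members (plus an md
# device once the array is assembled):
#   NAME="sda" TYPE="disk" SIZE="300069052416"
#   NAME="sdb" TYPE="disk" SIZE="300069052416"
#   NAME="md127" TYPE="raid1" SIZE="300069052416"
~~~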

I would argue that *if* it does work, they might configure a SW RAID (with the links I provided above and/or in combination with the ansible driver you mentioned on the prio thread internally [1]). Should that work, they would of course need an SE (Support Exception).


If we all agree on this, I think we can just close this Bug.


[1] http://tripleo.org/install/advanced_deployment/ansible_deploy_interface.html

Comment 9 Dmitry Tantsur 2018-08-09 09:39:40 UTC
> In between these two, we have a thing called Firmware RAID, or also colloquially known as 'fake' RAID. 'Fake' because it's a RAID controller that cannot provide a real Hardware RAID but instead just tells the OS to configure a Software RAID.

Right, this is what I meant by "hybrid"; I forgot the right word.

> I would argue that *if* it does work, they might configure a SW RAID (with the links I provided above and/or in combination with the ansible driver you pushed on the prio thread internally [1]). Should that work, they would of course need a SE (Support Exception).

So, software RAID of any kind (purely software or firmware) will not work automatically, because lsblk will still report two disks. Furthermore, on deployment Ironic wipes the target block device, so any traces of software RAID will be gone.

Indeed, using a heavily customized deployment process with the ansible deploy interface may help here. Please sync with Ramon on whether he may allow a support exception in this case (I'm +1 from the dev standpoint).
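
If the ansible route is pursued, the switch is roughly the following (assuming the interface is enabled on the undercloud; see the tripleo doc linked as [1] above):

~~~
# Switch the node to the ansible deploy interface (node UUID is a placeholder):
openstack baremetal node set <node-uuid> --deploy-interface ansible
~~~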

Comment 12 Dmitry Tantsur 2018-08-13 12:34:03 UTC
Hi! Can someone please provide the current status on two separate issues:

1. Introspection returns 2 disks instead of one.

I understand it's annoying, but is it critical for you? I'm afraid it may be too difficult technically to make introspection consider software/firmware RAID.

2. Deployment fails on the iPXE stage.

Note that this cannot be caused by wrong introspection data; there must be something else. Could you make sure it does boot in UEFI mode by looking at its screen during boot? Is it possible to screencast the console during boot?

Comment 13 Irina Petrova 2018-08-13 16:42:16 UTC
Hi Dmitry.

I thought #2 was clear from:

(Irina Petrova from comment #1)
> Update: we're past introspection now.
> 
> After they have re-installed the Director node, the 'alloc highmem for
> initrd' failure is gone:
> 
> ~~~~
> 1. We used Lenny's kernel + original initramfs = Introspection fails (no
> highmem error).
> 2. We used original kernel + original initramfs = Introspection fails (no
> highmem error).
> 3. We used original kernel + edited initramfs = Introspection successful (no
> highmem error), but the output of 'openstack baremetal introspection
> data save 1a4e30da-b6dc-499d-ba87-0bd8a3819bc0 | jq ".inventory.disks"'
> produces 2 disks (/dev/sda and /dev/sdb). Is this a normal situation, or
> should it produce only one disk (/dev/sda), since the RAID controller is
> enabled?
> ~~~~

That is:

Problem: Deployment fails on the iPXE stage.
Resolution: re-install the Undercloud and retry. Introspection succeeds.
RCA: Unknown.


As for #1 ('Introspection returns 2 disks instead of one'), I haven't heard a word from them since I told them we don't actually support that out of the box.

However, I just noticed a new rhos-prio thread about this.

I'll update you if I get any new info.

Comment 14 Bob Fournier 2018-08-27 20:10:02 UTC
I think this is no longer an issue? Can we close this, or do we still want to pursue it?

Comment 15 Red Hat Bugzilla 2023-09-15 00:11:19 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

