Bug 1933256 - _get_partition does not get the right root_partition on nodes with software raid devices
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic-python-agent
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: low
Target Milestone: ---
Assignee: Julia Kreger
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-02-26 10:30 UTC by Jaison Raju
Modified: 2022-02-01 21:42 UTC
CC: 2 users

Fixed In Version: openstack-ironic-python-agent-7.0.2-0.20210615022843.0756f04.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-02-01 21:37:40 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
pdb data (26.35 KB, text/plain)
2021-02-26 10:30 UTC, Jaison Raju


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 686580 0 None MERGED Software RAID: Identify the root fs via its UUID from image metadata 2021-03-18 20:14:29 UTC
OpenStack gerrit 686585 0 None MERGED Software RAID: Use UUID to find root fs 2021-03-18 20:14:29 UTC
Red Hat Issue Tracker OSP-2078 0 None None None 2022-02-01 21:42:54 UTC

Description Jaison Raju 2021-02-26 10:30:56 UTC
Created attachment 1759463 [details]
pdb data

Description of problem:
On nodes with pre-existing software RAID devices, IPA tends to pick a /dev/md* device as the default root device.
In this case, out of /dev/md125, /dev/md126, /dev/md127 and another 20-30 disks, IPA selects /dev/md125 as the default root device.
However, the _get_partition function in ironic_python_agent/extensions/image.py does not locate the partition that actually carries the "img-rootfs" filesystem, which is /dev/md125p2.
Instead, the function returns /dev/md125p1.
As a result, the deployment fails: /dev/md125p1 is mounted under a tmp directory, and IPA does not find /dev in that mount point.
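The failure mode can be illustrated with a minimal sketch. This is an illustration of the reported behaviour, not the actual IPA source; the is_block_device predicate is a hypothetical stand-in for the os.path/stat checks the agent performs:

```python
def get_md_root_partition_buggy(device, uuid, is_block_device):
    """Sketch of the reported behaviour: the partition suffix 'p1' is
    hardcoded, so the uuid argument is never consulted and the root
    filesystem on /dev/md125p2 is missed."""
    md_partition = device + 'p1'  # always assumes the first partition
    if not is_block_device(md_partition):
        raise RuntimeError('no partition found on %s' % device)
    return md_partition

# Even with both partitions present, the UUID of the real root fs
# (which lives on /dev/md125p2, labelled 'img-rootfs') is ignored:
part = get_md_root_partition_buggy(
    '/dev/md125', '0ec3dea5-f293-4729-b676-5d38a611ce81',
    lambda p: p in ('/dev/md125p1', '/dev/md125p2'))
# part is '/dev/md125p1'
```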

Version-Release number of selected component (if applicable):
SUPERMICRO 6049P
RHOSP 16.1.3 GA

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
For a detailed investigation, I logged into the IPA environment via ssh and manually started IPA with pdb breakpoints at a few points.

After the tmp directory was created, I confirmed the mounted partition:
--------------------------------------------------------
[root@host-192-168-24-38 ~]# df | grep tmp                                                                                                                                                                                                   
devtmpfs       197159460       0 197159460   0% /dev
tmpfs          197436444       0 197436444   0% /dev/shm
tmpfs          197436444   10448 197425996   1% /run
tmpfs          197436444       0 197436444   0% /sys/fs/cgroup
tmpfs           39487288       0  39487288   0% /run/user/0
/dev/md125p1         492     492         0 100% /tmp/tmpt94qs6e2
[root@host-192-168-24-38 ~]# ls -l /dev/md125
md125    md125p1  md125p2


[root@host-192-168-24-38 ~]# tune2fs -l /dev/md125p1
tune2fs 1.45.4 (23-Sep-2019)
tune2fs: Bad magic number in super-block while trying to open /dev/md125p1
/dev/md125p1 contains a iso9660 file system labelled 'config-2'
[root@host-192-168-24-38 ~]# tune2fs -l /dev/md125p2
tune2fs 1.45.4 (23-Sep-2019)
tune2fs: Bad magic number in super-block while trying to open /dev/md125p2
/dev/md125p2 contains a xfs file system labelled 'img-rootfs'
[root@host-192-168-24-38 ~]# blkid /dev/md125p2
/dev/md125p2: LABEL="img-rootfs" UUID="0ec3dea5-f293-4729-b676-5d38a611ce81" TYPE="xfs" PARTUUID="6722008c-02"

[root@host-192-168-24-38 ~]# ls /tmp/tmpt94qs6e2/ -a
.  ..  ec2  openstack
--------------------------------------------------------
Please find the detailed pdb data attached.
Note:
1. "img-rootfs" is on /dev/md125p2.
2. The UUID of "img-rootfs" is "0ec3dea5-f293-4729-b676-5d38a611ce81".
3. IPA recognizes this UUID and passes it correctly to _get_partition along with the disk.

Comment 1 Jaison Raju 2021-02-26 11:32:38 UTC
I noticed something after adding pdb breakpoints in _get_partition.
It seems this function settles on the wrong partition despite receiving the correct UUID:
------------------------------------------------------------------------------------------------------------
2021-02-26 05:48:42.829 2599 DEBUG ironic_lib.utils [-] Execution completed, command line is "mdadm --detail /dev/md125" execute /usr/lib/python3.6/site-packages/ironic_lib/utils.py:101                                                    
2021-02-26 05:48:42.830 2599 DEBUG ironic_lib.utils [-] Command stdout is: "/dev/md125:
           Version : 1.2                                                                                                                                                                                                                     
     Creation Time : Fri Feb 26 04:38:33 2021
        Raid Level : raid1
        Array Size : 971910144 (926.89 GiB 995.24 GB)
     Used Dev Size : 971910144 (926.89 GiB 995.24 GB)
      Raid Devices : 2

     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Fri Feb 26 05:48:39 2021
             State : clean, resyncing
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

     Resync Status : 38% complete

              Name : 2
              UUID : 2c6a5e72:fcf61094:02275431:b11b11c5
            Events : 1160

    Number   Major   Minor   RaidDevice State
       0      66      211        0      active sync   /dev/sdat3
       1      66      227        1      active sync   /dev/sdau3
" execute /usr/lib/python3.6/site-packages/ironic_lib/utils.py:103
2021-02-26 05:48:42.830 2599 DEBUG ironic_lib.utils [-] Command stderr is: "" execute /usr/lib/python3.6/site-packages/ironic_lib/utils.py:104                                                                                               
2021-02-26 05:48:42.830 2599 DEBUG root [-] /dev/md125 is an md device is_md_device /usr/lib/python3.6/site-packages/ironic_python_agent/hardware.py:200                                                                                     
> /usr/lib/python3.6/site-packages/ironic_python_agent/extensions/image.py(70)_get_partition()
-> md_partition = device + 'p1'
(Pdb) locals()                                                                                                                                                                                                                               
{'uuid': '0ec3dea5-f293-4729-b676-5d38a611ce81', 'device': '/dev/md125'}
(Pdb) n
> /usr/lib/python3.6/site-packages/ironic_python_agent/extensions/image.py(71)_get_partition()
-> if (not os.path.exists(md_partition) or
(Pdb) n
> /usr/lib/python3.6/site-packages/ironic_python_agent/extensions/image.py(72)_get_partition()
-> not stat.S_ISBLK(os.stat(md_partition).st_mode)):
(Pdb) locals()
{'uuid': '0ec3dea5-f293-4729-b676-5d38a611ce81', 'device': '/dev/md125', 'md_partition': '/dev/md125p1'}
(Pdb) n
> /usr/lib/python3.6/site-packages/ironic_python_agent/extensions/image.py(78)_get_partition()
-> LOG.debug("Found md device with partition %s", md_partition)
(Pdb) n
2021-02-26 05:49:35.248 2599 DEBUG ironic_python_agent.extensions.image [-] Found md device with partition /dev/md125p1 _get_partition /usr/lib/python3.6/site-packages/ironic_python_agent/extensions/image.py:78                           
> /usr/lib/python3.6/site-packages/ironic_python_agent/extensions/image.py(79)_get_partition()
-> return md_partition
(Pdb) 2021-02-26 05:50:11.463 2599 INFO ironic_python_agent.agent [-] heartbeat successful
2021-02-26 05:50:11.463 2599 INFO ironic_python_agent.agent [-] sleeping before next heartbeat, interval: 136.20341590356992

--Return--
> /usr/lib/python3.6/site-packages/ironic_python_agent/extensions/image.py(79)_get_partition()->'/dev/md125p1'
-> return md_partition
(Pdb) locals()
{'uuid': '0ec3dea5-f293-4729-b676-5d38a611ce81', 'device': '/dev/md125', 'md_partition': '/dev/md125p1', '__return__': '/dev/md125p1'}
------------------------------------------------------------------------------------------------------------

[root@host-192-168-24-86 ~]# lsblk -PbioKNAME,UUID,PARTUUID,TYPE | grep md125  
KNAME="md125" UUID="" PARTUUID="" TYPE="raid1"
KNAME="md125" UUID="" PARTUUID="" TYPE="raid1"
KNAME="md125p1" UUID="2021-02-26-10-38-08-00" PARTUUID="dc23973e-01" TYPE="md"
KNAME="md125p1" UUID="2021-02-26-10-38-08-00" PARTUUID="dc23973e-01" TYPE="md"
KNAME="md125p2" UUID="0ec3dea5-f293-4729-b676-5d38a611ce81" PARTUUID="dc23973e-02" TYPE="md"
KNAME="md125p2" UUID="0ec3dea5-f293-4729-b676-5d38a611ce81" PARTUUID="dc23973e-02" TYPE="md"

I will no longer have this hardware next week, so I am trying to collect as much data as possible now.
That said, I think this should be easily reproducible on any node with software RAID.
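The upstream fix (gerrit 686585, "Software RAID: Use UUID to find root fs") selects the root filesystem by its UUID rather than by a hardcoded partition suffix. A rough sketch of that idea, driven by the key/value lsblk output shown above; the parsing code here is an assumption for illustration, not the merged implementation:

```python
import re

def find_partition_by_uuid(lsblk_pairs_output, uuid):
    """Scan `lsblk -Pb -o KNAME,UUID,...` key/value output for the line
    whose UUID matches, and return the corresponding device path."""
    for line in lsblk_pairs_output.splitlines():
        # Each line is a series of KEY="value" pairs; build a dict from them.
        fields = dict(re.findall(r'(\w+)="([^"]*)"', line))
        if fields.get('UUID') == uuid:
            return '/dev/' + fields['KNAME']
    return None

# Sample taken from the lsblk output in this report:
sample = '''KNAME="md125" UUID="" PARTUUID="" TYPE="raid1"
KNAME="md125p1" UUID="2021-02-26-10-38-08-00" PARTUUID="dc23973e-01" TYPE="md"
KNAME="md125p2" UUID="0ec3dea5-f293-4729-b676-5d38a611ce81" PARTUUID="dc23973e-02" TYPE="md"'''

root = find_partition_by_uuid(sample, '0ec3dea5-f293-4729-b676-5d38a611ce81')
# root == '/dev/md125p2', the partition labelled 'img-rootfs'
```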

Comment 2 Steve Baker 2021-03-02 20:47:43 UTC
Could you attach the agent logs for this?

Comment 9 Julia Kreger 2021-06-08 20:47:39 UTC
Given the complexity, the fact that software RAID was a feature still being worked on upstream when the Train release was cut, and that the fix requires an API change between the conductor and the agent, we are going to let this land in 17, where it is already merged, rather than attempt a backport to 16.1. Backporting code in this critical and complex path carries a higher risk than simply letting the fix arrive with the next version release.

Comment 11 Julia Kreger 2022-02-01 21:37:40 UTC
Given that this issue will be fixed in 17, and that software RAID usage is not a supported case in OSP 16.x, I'm closing this out as a next-release item. If you have any questions or concerns, please let us know.

