Bug 1398288

Summary: Specifying root device hints for VMs leads to introspection/instance boot failure
Product: Red Hat OpenStack Reporter: Sergii Nozhka <snozhka>
Component: openstack-ironic-python-agentAssignee: Lucas Alvares Gomes <lmartins>
Status: CLOSED NOTABUG QA Contact: Raviv Bar-Tal <rbartal>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: unspecifiedCC: augol, chih-hsien.chien, dtantsur, ftaylor, jraju, lmartins, mburns, pbandark, rhel-osp-director-maint, rmahroua, sacpatil, sasha, slinaber, snozhka, srevivo
Target Milestone: ---Keywords: Reopened, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-04-27 10:59:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
ramdisk logs
none
ironic-inspector log none

Description Sergii Nozhka 2016-11-24 11:51:45 UTC
Description of problem:
Adding serial root_device property to ironic node with several disks leads to introspection/instance boot failure.

Version-Release number of selected component (if applicable):
python-ironicclient-1.7.0-1.el7ost.noarch
openstack-ironic-inspector-4.2.0-3.el7ost.noarch
openstack-ironic-conductor-6.2.1-5.el7ost.noarch
python-ironic-lib-2.1.1-2.el7ost.noarch
python-ironic-inspector-client-1.9.0-1.el7ost.noarch
puppet-ironic-9.4.1-1.el7ost.noarch
openstack-ironic-common-6.2.1-5.el7ost.noarch
openstack-ironic-api-6.2.1-5.el7ost.noarch

How reproducible:
Steps to reproduce:
1. Create VM with at least 2 disks and assign serial to both:
    virt-install --ram 8192 --vcpus 4 --os-variant rhel7 --disk path=/var/lib/libvirt/images/root-hints-disk1-1.qcow2,device=disk,bus=virtio,format=qcow2,serial=4000cca77fc4dba1 --disk path=/var/lib/libvirt/images/root-hints-disk1-2.qcow2,device=disk,bus=virtio,format=qcow2,serial=5001c60016ea71ad --import --noautoconsole --vnc --network network:data --network network:management --network network:external --name root-hints-1

2. Add created VM into ironic list
3. Set root_device property of added node to one of desired serial numbers:
    ironic node-update root-hints-1 add properties/root_device='{"serial": "5001c60016ea71ad"}'
4. Try to introspect node.

Actual result: Introspection failed.
Expected result: Introspection passed.


Additional info:

The ironic-inspector log contains the following:

    'root_disk': {
        'rotational': True,
        'vendor': '0x1af4',
        'name': '/dev/vdb',
        'wwn_vendor_extension': None,
        'wwn_with_extension': None,
        'model': '',
        'wwn': None,
        'serial': None,
        'size': 21474836480
    }

Comment 1 Dmitry Tantsur 2016-11-24 11:53:56 UTC
Please specify the exact failure.

Comment 2 Sergii Nozhka 2016-11-24 12:02:22 UTC
(In reply to Dmitry Tantsur from comment #1)
> Please specify the exact failure.

2016-11-24 06:57:29.772 6882 ERROR ironic.drivers.modules.inspector [req-4527dbe4-6bc0-41ec-a5d7-1dd650be4280 - - - - -] Inspection failed for node bfe5f0ee-96d6-46c1-bf0c-2a6f5d7a628c with error: No disks satisfied root device hints

Please refer http://pastebin.test.redhat.com/433352.

Comment 3 Dmitry Tantsur 2016-11-24 12:54:11 UTC
Please attach openstack-ironic-inspector logs and ramdisk logs (see http://tripleo.org/troubleshooting/troubleshooting.html#accessing-logs-from-the-ramdisk)

Comment 4 Sergii Nozhka 2016-11-24 13:03:57 UTC
Created attachment 1223852 [details]
ramdisk logs

Comment 5 Sergii Nozhka 2016-11-24 13:04:26 UTC
Created attachment 1223853 [details]
ironic-inspector log

Comment 8 Lucas Alvares Gomes 2016-11-28 10:29:05 UTC
We did some investigation upstream on this problem. What is happening is that, setting attributes such as "serial" or "wwn" in the libvirt XML doesn't work with some distros, apparently, such attributes doesn't get propagated to the host OS and Ironic can't read it to use with the root device hints (it uses udev to read it, same with lsblk).

When used with a fedora 24 deploy ramdisk or on a real baremetal, the code in Ironic works.

Comment 9 Lucas Alvares Gomes 2016-11-29 11:08:56 UTC
So, after some investigation. Apparently we need the virtio_scsi driver to be loaded (which is by default on centOS7 that I'm testing at the moment) and the disk controller should be "scsi" (instead of virtio or ide). 

Then the attributes from libvirt gets propagated to the OS and the deploy ramdisk can make use of them for the root device hints mechanism.

There's another problem found while testing, apparently, specifying both WWN and Serial will cause libvirt to overwrite the serial number with the WWN, e.g: 

By setting:

 ...
 <serial>123456789</serial>
 <wwn>5001c60016ea71ad</wwn>
 ...

The following appears on lsblk:

[centos@testing ~]$ lsblk -okname,serial,wwn    
KNAME SERIAL           WWN
sda   5001c60016ea71ad 0x5001c60016ea71ad
sda1                   0x5001c60016ea71ad
sda2                   0x5001c60016ea71ad

Comment 10 Dmitry Tantsur 2016-12-13 09:53:33 UTC
*** Bug 1404116 has been marked as a duplicate of this bug. ***

Comment 13 Pratik Pravin Bandarkar 2016-12-21 15:22:10 UTC
I tried reproducing the issue on RHOS10 setup. 

From /var/log/ironic/deploy/278c47dd-dbff-4fa4-a8be-1f215faab11b_5228609d-b22c-419a-afdc-107e93676537_2016-12-21-15:01:44  :

<snip>
Dec 21 10:01:43 host-192-0-2-15 ironic-python-agent[579]: 2016-12-21 10:01:43.625 579 DEBUG root [-] Root device hint serial=QM00005 does not match the device /dev/sda value of qm00005 match /usr/lib/python2.7/site-packages/ironic_python_agent/hardware.py:653
Dec 21 10:01:43 host-192-0-2-15 ironic-python-agent[579]: 2016-12-21 10:01:43.626 579 DEBUG root [-] Root device hint serial=QM00005 does not match the device /dev/sdb value of qm00007 match /usr/lib/python2.7/site-packages/ironic_python_agent/hardware.py:653
Dec 21 10:01:43 host-192-0-2-15 ironic-python-agent[579]: 2016-12-21 10:01:43.626 579 ERROR root [-] Unexpected error dispatching get_os_install_device to manager <ironic_python_agent.hardware.GenericHardwareManager object at 0x3848190>: Error finding the disk or partition device to deploy the image onto: No suitable device was found for deployment using these hints {u'serial': u'QM00005'}
Dec 21 10:01:43 host-192-0-2-15 ironic-python-agent[579]: 2016-12-21 10:01:43.626 579 ERROR root Traceback (most recent call last):
Dec 21 10:01:43 host-192-0-2-15 ironic-python-agent[579]: 2016-12-21 10:01:43.626 579 ERROR root   File "/usr/lib/python2.7/site-packages/ironic_python_agent/hardware.py", line 1062, in dispatch_to_managers
Dec 21 10:01:43 host-192-0-2-15 ironic-python-agent[579]: 2016-12-21 10:01:43.626 579 ERROR root     return getattr(manager, method)(*args, **kwargs)
Dec 21 10:01:43 host-192-0-2-15 ironic-python-agent[579]: 2016-12-21 10:01:43.626 579 ERROR root   File "/usr/lib/python2.7/site-packages/ironic_python_agent/hardware.py", line 688, in get_os_install_device
Dec 21 10:01:43 host-192-0-2-15 ironic-python-agent[579]: 2016-12-21 10:01:43.626 579 ERROR root     "deployment using these hints %s" % root_device_hints)
Dec 21 10:01:43 host-192-0-2-15 ironic-python-agent[579]: 2016-12-21 10:01:43.626 579 ERROR root DeviceNotFound: Error finding the disk or partition device to deploy the image onto: No suitable device was found for deployment using these hints {u'serial': u'QM00005'}
</snip>


Here we can see:
<snip>
Dec 21 10:01:43 host-192-0-2-15 ironic-python-agent[579]: 2016-12-21 10:01:43.625 579 DEBUG root [-] Root device hint serial=QM00005 does not match the device /dev/sda value of qm00005 match /usr/lib/python2.7/site-packages/ironic_python_agent/hardware.py:653  <======
</snip>

Serial device which I specified was "QM00005". But its expecting "qm00005"

After going through the code, we(sachin,jaison,me) identified that serial value is getting updated with lower case:


 647                 if hint_value != current_value:
 648                     LOG.debug("Root device hint %(hint)s=%(value)s does not "
 649                               "match the device %(device)s value of "
 650                               "%(current)s", {
 651                                   'hint': hint,
 652                                   'value': hint_value, 'device': device,
 653                                   'current': current_value})
 654                     return False
 655                 return True
[...]
 668                     if isinstance(value, six.string_types):
 669                         value = utils.normalize(value)
 670 
__

def normalize(string):
    """Return a normalized string.

    Take a urlencoded value from Ironic and urldecode it.

    :param string: a urlencoded string
    :returns: a normalized version of passed in string
    """
    return parse.unquote(string).lower().strip()
________________________

- As a fix we made below changes in ironic-python-agent.initramfs:
<snip>
 647                 if hint_value.lower() != current_value:
 648                     LOG.debug("Root device hint %(hint)s=%(value)s does not "
 649                               "match the device %(device)s value of "
 650                               "%(current)s", {
 651                                   'hint': hint,
 652                                   'value': hint_value, 'device': device,
 653                                   'current': current_value})
 654                     return False
 655                 return True
[...]
</snip>

- rebuild the initramfs file and uploaded it glance. Currently I am redeploying the overcloud setup. I will update the bz with the result by tomorrow.

Comment 14 Pratik Pravin Bandarkar 2016-12-22 13:31:01 UTC
The deployment was successful:

[root@overcloud-compute-1 heat-admin]# fdisk  -l /dev/sd
sda   sda1  sda2  sdb   sdb1  sdb2  
[root@overcloud-compute-1 heat-admin]# fdisk  -l /dev/sda

Disk /dev/sda: 44.0 GB, 44023414784 bytes, 85983232 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000ed482

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048        4095        1024   83  Linux
/dev/sda2   *        4096    85979879    42987892   83  Linux

[root@overcloud-compute-1 heat-admin]# ceph -s
    cluster 082ccc78-c7a4-11e6-a4c8-525400d086af
     health HEALTH_OK
     monmap e1: 1 mons at {overcloud-controller-0=192.168.125.30:6789/0}
            election epoch 3, quorum 0 overcloud-controller-0
     osdmap e15: 2 osds: 2 up, 2 in
            flags sortbitwise
      pgmap v1317: 224 pgs, 6 pools, 1711 kB data, 106 objects
            78528 kB used, 30621 MB / 30697 MB avail
                 224 active+clean

[root@overcloud-compute-1 heat-admin]# ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED 
    30697M     30621M       78528k          0.25 
POOLS:
    NAME        ID     USED      %USED     MAX AVAIL     OBJECTS 
    rbd         0          0         0        15310M           0 
    metrics     1      1711k      0.01        15310M         102 
    images      2          0         0        15310M           0 
    backups     3          0         0        15310M           0 
    volumes     4        113         0        15310M           4 
    vms         5          0         0        15310M           0 

____________________

With patch mentioned in comment #13, test case was failing. Hence, we can go with below patch:

diff --git a/ironic_python_agent/hardware.py b/ironic_python_agent/hardware.py
index aad4777125bd..54a8489c2471 100644
--- a/ironic_python_agent/hardware.py
+++ b/ironic_python_agent/hardware.py
@@ -643,6 +643,8 @@ class GenericHardwareManager(HardwareManager):
                             'Current value: "%(value)s"; and type: "%(type)s"',
                             {'value': hint_value, 'type': type(hint_value)})
                         return False
+                elif hint == 'serial':
+                    hint_value = utils.normalize(hint_value)
 
                 if hint_value != current_value:
                     LOG.debug("Root device hint %(hint)s=%(value)s does not "

For patch and test case, refer: https://gist.githubusercontent.com/psachin/0410aed410881a09c95701f469e4d325/raw/b30177a9bb484617cb1202890222adc933782c47/ironic_python_agent.diff

Thanks to sachin++ for guidance.

Comment 15 Pratik Pravin Bandarkar 2016-12-22 15:01:59 UTC
added additional test case to handle special characters for "serial":

https://gist.github.com/pbandark/a4c1c3c480718e627aac40b9e2e878ae

Comment 17 Razique Mahroua 2017-02-07 23:06:45 UTC
I haven't test the disk, but here are my findings: 
  - VirtIO disks are not compatible with overcloud deployments when the VM has more than one disk.
  - The serial number, when added along the scsi device type allows me to use the --property root_device='{"serial":"'serial'"}' $node_id feature, thus forcing the installation on the disk
  - When size is added as a property (root_device ='{"size": ">=20"}'), this is not taken into account, and Heat installs the OS randomly (vdb in my case)
  - Using --root-device=vda doesn't work either, Heat ends up using a random disk.

Upon close look of the data being retrieved, I noticed that my primary disk (vda) shows as vdd in the metadata. 

This confirms Jaison Raju's finding regarding how Ironic handles the disk.

I'm going to try the fix suggested in comment 14.

Comment 18 Razique Mahroua 2017-02-07 23:08:23 UTC
tested the fix*
(In reply to Razique Mahroua from comment #17)
> I haven't test the disk, but here are my findings: 
>   - VirtIO disks are not compatible with overcloud deployments when the VM
> has more than one disk.
>   - The serial number, when added along the scsi device type allows me to
> use the --property root_device='{"serial":"'serial'"}' $node_id feature,
> thus forcing the installation on the disk
>   - When size is added as a property (root_device ='{"size": ">=20"}'), this
> is not taken into account, and Heat installs the OS randomly (vdb in my case)
>   - Using --root-device=vda doesn't work either, Heat ends up using a random
> disk.
> 
> Upon close look of the data being retrieved, I noticed that my primary disk
> (vda) shows as vdd in the metadata. 
> 
> This confirms Jaison Raju's finding regarding how Ironic handles the disk.
> 
> I'm going to try the fix suggested in comment 14.

Comment 20 Dmitry Tantsur 2017-04-25 10:04:10 UTC
Pratik, Chen, could you please move your case issues to a new bug? This bug is about VMs specifically, and I'd like to close it, because it seems to be about VM configuration, not bug in Ironic. Please provide your hardware vendor and model, and output of lsblk. Thanks!

Comment 22 Sachin 2017-04-27 10:18:08 UTC
Dmitry,

The bug is certainly in Ironic. See comment#18, Razique tested the fix, I don't understand why the bug was closed? 

> Please provide your hardware vendor and model, and output of lsblk.

See comment#9

Comment 23 Dmitry Tantsur 2017-04-27 10:59:56 UTC
The initial bug was about libvirtd not exposting serial/wwn correctly, and introspection failing due to that. This is NOT your case. Please open a separate bug with the serial number case (in)sensitivity explanation. Please refrain from reopening this bug, unless you want libvirtd behaviour somehow fixed (which is not something our team can do).

In this new bug, please do provide lsblk output for *your* case with the following command line:

 sudo lsblk -Pbdi -oKNAME,MODEL,SIZE,ROTE,TYPE

Please also attach introspection data for the nodes. Ideally, please fetch deployment logs, they should end up in /var/log/ironic/deploy. Thanks!