Bug 1244328 - iscsi initiatorname is identical for all overcloud nodes
Summary: iscsi initiatorname is identical for all overcloud nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: y3
Target Release: 7.0 (Kilo)
Assignee: Derek Higgins
QA Contact: Dan Yasny
URL:
Whiteboard:
Duplicates: 1288423
Depends On:
Blocks: 1290377 1299906 1304415 1309819
 
Reported: 2015-07-17 19:16 UTC by Marian Krcmarik
Modified: 2019-12-16 04:49 UTC
CC: 20 users

Fixed In Version: openstack-tripleo-heat-templates-0.8.6-120.el7ost
Doc Type: Bug Fix
Doc Text:
The iSCSI initiator name was the same on all Compute nodes in an Overcloud, which caused live migration of instances to fail. This fix generates a unique iSCSI initiator name on each Compute node during Overcloud deployment. Now live migration over iSCSI succeeds.
Clone Of:
Clones: 1304415 1309819
Environment:
Last Closed: 2016-11-29 17:47:51 UTC
Target Upstream Version:
Embargoed:




Links:
OpenStack gerrit 275890 - MERGED - Makes the iSCSI initiator name unique for compute nodes (last updated 2020-09-01 23:51:25 UTC)
Red Hat Knowledge Base (Solution) 2075663 (last updated 2018-02-08 11:11:08 UTC)
Red Hat Product Errata RHBA-2016:0264 - SHIPPED_LIVE - Red Hat Enterprise Linux OSP 7 director Bug Fix Advisory (last updated 2016-02-18 21:41:29 UTC)

Description Marian Krcmarik 2015-07-17 19:16:41 UTC
Description of problem:
blockdev fails to see the logged-in iSCSI target (the device of a volume attached to an instance) after the instance is live-migrated to a different node, and nova-rootwrap fails (see the traceback below):

- iscsiadm -m session shows the target as logged in

- It's even possible to list the device files with ls:
# ll /dev/disk/by-path/ip-192.0.2.9:3260-iscsi-iqn.2010-10.org.openstack:volume-1a6cc490-1c74-41ea-95b4-2b4a106f534d-lun-0
lrwxrwxrwx. 1 root root 9 Jul 17 13:30 /dev/disk/by-path/ip-192.0.2.9:3260-iscsi-iqn.2010-10.org.openstack:volume-1a6cc490-1c74-41ea-95b4-2b4a106f534d-lun-0 -> ../../sdb
# ll /dev/sdb
brw-rw----. 1 qemu qemu 8, 16 Jul 17 13:30 /dev/sdb

- The instance even seems to start correctly, and nova list returns ACTIVE status for the VM.

- default configuration values of cinder were used (iscsi_helper=tgtadm)

- NFS was used as shared storage for instances.

Version-Release number of selected component (if applicable):
$ rpm -qa | grep openstack
openstack-dashboard-theme-2015.1.0-10.el7ost.noarch
openstack-ceilometer-common-2015.1.0-6.el7ost.noarch
openstack-ceilometer-alarm-2015.1.0-6.el7ost.noarch
openstack-neutron-ml2-2015.1.0-11.el7ost.noarch
openstack-swift-proxy-2.3.0-1.el7ost.noarch
openstack-neutron-2015.1.0-11.el7ost.noarch
openstack-heat-common-2015.1.0-4.el7ost.noarch
openstack-heat-api-cfn-2015.1.0-4.el7ost.noarch
openstack-nova-api-2015.1.0-14.el7ost.noarch
openstack-keystone-2015.1.0-4.el7ost.noarch
openstack-swift-object-2.3.0-1.el7ost.noarch
python-django-openstack-auth-1.2.0-3.el7ost.noarch
redhat-access-plugin-openstack-7.0.0-0.el7ost.noarch
openstack-nova-compute-2015.1.0-14.el7ost.noarch
openstack-ceilometer-central-2015.1.0-6.el7ost.noarch
openstack-heat-api-2015.1.0-4.el7ost.noarch
openstack-nova-cert-2015.1.0-14.el7ost.noarch
openstack-nova-scheduler-2015.1.0-14.el7ost.noarch
openstack-glance-2015.1.0-6.el7ost.noarch
openstack-neutron-lbaas-2015.1.0-5.el7ost.noarch
openstack-selinux-0.6.35-3.el7ost.noarch
openstack-swift-2.3.0-1.el7ost.noarch
openstack-nova-common-2015.1.0-14.el7ost.noarch
openstack-ceilometer-collector-2015.1.0-6.el7ost.noarch
openstack-ceilometer-compute-2015.1.0-6.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.0-4.el7ost.noarch
openstack-nova-conductor-2015.1.0-14.el7ost.noarch
openstack-cinder-2015.1.0-3.el7ost.noarch
openstack-neutron-metering-agent-2015.1.0-11.el7ost.noarch
openstack-swift-container-2.3.0-1.el7ost.noarch
python-openstackclient-1.0.3-2.el7ost.noarch
openstack-puppet-modules-2015.1.8-3.el7ost.noarch
openstack-swift-plugin-swift3-1.7-3.el7ost.noarch
openstack-neutron-common-2015.1.0-11.el7ost.noarch
openstack-heat-engine-2015.1.0-4.el7ost.noarch
openstack-nova-novncproxy-2015.1.0-14.el7ost.noarch
openstack-neutron-openvswitch-2015.1.0-11.el7ost.noarch
openstack-swift-account-2.3.0-1.el7ost.noarch
openstack-dashboard-2015.1.0-10.el7ost.noarch
openstack-ceilometer-notification-2015.1.0-6.el7ost.noarch
openstack-ceilometer-api-2015.1.0-6.el7ost.noarch
openstack-nova-console-2015.1.0-14.el7ost.noarch
openstack-utils-2014.2-1.el7ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Attach a volume to an instance on a deployment that uses NFS as shared storage for instances.
2. Live-migrate the instance to a different compute node.
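
For reference, a hedged sketch of these steps using the CLI clients from that era; the volume name, UUIDs, and destination host below are placeholders, not values from this report:

# Hypothetical reproduction sketch (placeholders in angle brackets):
cinder create --display-name test-vol 1
nova volume-attach <instance-uuid> <volume-uuid> auto
nova live-migration <instance-uuid> <destination-compute-host>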

Actual results:
nova-rootwrap fails: blockdev reports that there is no such device, even though the iSCSI target is logged in and ls can list the device files.

Expected results:
Live migration succeeds and the attached volume's device remains accessible on the destination node (blockdev can open it).

Additional info:
Command: sudo nova-rootwrap /etc/nova/rootwrap.conf blockdev --getsize64 /dev/disk/by-path/ip-192.0.2.9:3260-iscsi-iqn.2010-10.org.openstack:volume-1a6cc490-1c74-41ea-95b4-2b4a106f534d-lun-0
Exit code: 1
Stdout: u''
Stderr: u'blockdev: cannot open /dev/disk/by-path/ip-192.0.2.9:3260-iscsi-iqn.2010-10.org.openstack:volume-1a6cc490-1c74-41ea-95b4-2b4a106f534d-lun-0: No such device or address\n'
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task Traceback (most recent call last):
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task   File "/usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py", line 224, in run_periodic_tasks
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task     task(self, context)
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6247, in update_available_resource
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task     rt.update_available_resource(context)
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 376, in update_available_resource
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task     resources = self.driver.get_available_resource(self.nodename)
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5006, in get_available_resource
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task     disk_over_committed = self._get_disk_over_committed_size_total()
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6192, in _get_disk_over_committed_size_total
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task     self._get_instance_disk_info(dom.name(), xml))
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6145, in _get_instance_disk_info
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task     dk_size = lvm.get_volume_size(path)
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/lvm.py", line 172, in get_volume_size
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task     run_as_root=True)
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/utils.py", line 55, in execute
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task     return utils.execute(*args, **kwargs)
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task   File "/usr/lib/python2.7/site-packages/nova/utils.py", line 213, in execute
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task     return processutils.execute(*cmd, **kwargs)
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py", line 233, in execute
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task     cmd=sanitized_cmd)
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task ProcessExecutionError: Unexpected error while running command.
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task Command: sudo nova-rootwrap /etc/nova/rootwrap.conf blockdev --getsize64 /dev/disk/by-path/ip-192.0.2.9:3260-iscsi-iqn.2010-10.org.openstack:volume-1a6cc490-1c74-41ea-95b4-2b4a106f534d-lun-0
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task Exit code: 1
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task Stdout: u''
2015-07-17 13:42:28.077 1416 TRACE nova.openstack.common.periodic_task Stderr: u'blockdev: cannot open /dev/disk/by-path/ip-192.0.2.9:3260-iscsi-iqn.2010-10.org.openstack:volume-1a6cc490-1c74-41ea-95b4-2b4a106f534d-lun-0: No such device or address\n

Comment 3 Marian Krcmarik 2015-07-22 17:08:42 UTC
I am changing the component to OSP director. The main reason is that I cannot reproduce this bug on an environment that was created by packstack.

More details: it seems blockdev cannot see the device because authentication to the iSCSI target fails after live migration. Using the targetcli tool on the controller, I can see the ACL created for the particular target before migrating the instance; once the instance with the attached volume is migrated, there is no ACL for the iSCSI target. I have no idea which component is responsible for setting up the authentication properly, but I noticed that the iSCSI initiator name is the same on all compute nodes. I am not sure whether this is intentional or whether it has any impact, but on a packstack setup the iSCSI initiator names of the compute nodes are different, and initiator names should be unique anyway. That's why I am moving this to osp-d for triage. My setup is based on virt-env.

Before migration:
o- iscsi ..........................................................[Targets: 4]
  o- iqn.2010-10.org.openstack:volume-11218701-7f0b-4431-ade7-101c7cf20c6e [TPGs: 1]
  | o- tpg1 ........................................[no-gen-acls, auth per-acl]
  |   o- acls......................................................... [ACLs: 1]
  |   | o- iqn.1994-05.com.redhat:4a52e5aa22c ..... [1-way auth, Mapped LUNs: 1]
  |   |   o- mapped_lun0  [lun0 block/iqn.2010-10.org.openstack:volume-11218701-7f0b-4431-ade7-101c7cf20c6e (rw)]
  |   o- luns ...................................................... [LUNs: 1]
  |   | o- lun0  [block/iqn.2010-10.org.openstack:volume-11218701-7f0b-4431-ade7-101c7cf20c6e (/dev/cinder-volumes/volume-11218701-7f0b-4431-ade7-101c7cf20c6e)]
  |   o- portals .................................................................. [Portals: 1]
  |     o- 0.0.0.0:3260 ........................................................................... [OK]
  o- iqn.2010-10.org.openstack:volume-1a6cc490-1c74-41ea-95b4-2b4a106f534d .................... [TPGs: 1]

After migration:
o- iscsi ........................................................ [Targets: 4]
  o- iqn.2010-10.org.openstack:volume-11218701-7f0b-4431-ade7-101c7cf20c6e .................... [TPGs: 1]
  | o- tpg1 ....................................... [no-gen-acls, auth per-acl]
  |   o- acls ................................................... [ACLs: 0]
  |   o- luns ........................................................ [LUNs: 1]
  |   | o- lun0  [block/iqn.2010-10.org.openstack:volume-11218701-7f0b-4431-ade7-101c7cf20c6e (/dev/cinder-volumes/volume-11218701-7f0b-4431-ade7-101c7cf20c6e)]
  |   o- portals ............................................. [Portals: 1]
  |     o- 0.0.0.0:3260 ............................................ [OK]

The iSCSI initiator name on both compute nodes is: iqn.1994-05.com.redhat:4a52e5aa22c
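
A minimal way to gather the same evidence, assuming the LIO/targetcli backend shown above and heat-admin SSH access to the nodes:

# On the controller exporting the volume: list the targets and their ACLs
# (the before/after trees above come from this view).
sudo targetcli ls /iscsi

# On each compute node: print the configured initiator name; identical values
# across compute nodes indicate this problem.
sudo cat /etc/iscsi/initiatorname.iscsi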

Comment 4 Mike Burns 2015-08-19 16:58:23 UTC
To summarize, it seems the issue here is that all compute nodes have the same iSCSI initiator name.  

Basil/Jarda, any opinions on how critical this is?

Comment 7 Christian Horn 2015-12-07 08:40:32 UTC
related/DUP?
bz1288423 - iscsi initiatorname is identical for all overcloud nodes

Comment 8 Jaromir Coufal 2015-12-08 10:28:57 UTC
What is the impact for the end user here?

Comment 9 Marian Krcmarik 2015-12-08 11:26:12 UTC
(In reply to Jaromir Coufal from comment #8)
> What is the impact for the end user here?

Quite a weird question. Anyway, it has been a while since I played with this, but as far as I remember (and as described in the comments above), after migration the proper iSCSI ACL is not created and rootwrap reports a failure, which causes any subsequent live migration of the instance to fail; most likely the volume was also not accessible in the instance after the first live migration.

Comment 13 Perry Myers 2016-01-31 01:51:29 UTC
*** Bug 1288423 has been marked as a duplicate of this bug. ***

Comment 15 Hugh Brock 2016-02-03 14:38:53 UTC
Cloned this against director 8

Comment 16 James Slagle 2016-02-03 15:11:05 UTC
this is what the iscsi-initiator-utils rpm does in %post:

%post
/sbin/ldconfig

%systemd_post iscsi.service iscsi-shutdown.service iscsid.service iscsid.socket                                                                                                                                                                                                

if [ $1 -eq 1 ]; then
        if [ ! -f %{_sysconfdir}/iscsi/initiatorname.iscsi ]; then
                echo "InitiatorName=`/usr/sbin/iscsi-iname`" > %{_sysconfdir}/iscsi/initiatorname.iscsi
        fi  
        # enable socket activation and persistant session startup by default
        /bin/systemctl enable iscsi.service >/dev/null 2>&1 || :
        /bin/systemctl enable iscsid.socket >/dev/null 2>&1 || :
fi


So the name ends up the same on all the nodes, since they are deployed from the same image and the name is generated at RPM install time.

I think we just need to add the above logic to our puppet manifests so that the initiator name gets regenerated when puppet is run.
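
For illustration, a minimal shell sketch of that regeneration logic, guarded by a one-shot marker file (/etc/iscsi/.initiator_reset in this sketch); this is a sketch of the idea, not the merged patch itself:

# The %post above runs once at image build time, so every node deployed from
# that image shares the same name and the "file exists" check never triggers
# a regeneration on the nodes; a separate marker file lets puppet regenerate
# the name exactly once per node.
if [ ! -f /etc/iscsi/.initiator_reset ]; then
    echo "InitiatorName=$(/usr/sbin/iscsi-iname)" > /etc/iscsi/initiatorname.iscsi
    touch /etc/iscsi/.initiator_reset
fi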

Comment 17 Rhys Oxenham 2016-02-03 19:09:42 UTC
I wrote this quick patch.

https://review.openstack.org/#/c/275890/

Comment 18 Derek Higgins 2016-02-04 10:31:35 UTC
The patch looks good to me. I'm trying to reproduce the bug; once I do, I'll try out your patch. I'd mainly like to see what happens to VMs on an existing cloud if we change the initiator name during their life cycle.

Comment 19 Derek Higgins 2016-02-09 13:08:05 UTC
Not having shared storage set up, I reproduced this using a volume-backed VM: live migration failed because the host being migrated to could not connect to the iSCSI target. I then changed the initiator name on both compute nodes (while a new VM was running); the VM continued to run and live migration started working.

So the suggested patch should fix the bug as reported: new deployments won't exhibit the problem, and the live migration attempted above would have worked, AIUI.

Following this, I live-migrated the VM back to the host where it was started, and the initiator name reported by targetcli is the original one from before the change. From the looks of it, the change in initiator name only took effect on the compute node that hadn't yet been used.

Eric, I'm still digging into this (currently redeploying with 3 compute nodes so I can try a more complex example). Do you know if anything needs to be run for a change in initiator name to take effect on compute nodes where the original initiator name had already been used?

Comment 20 Eric Harney 2016-02-09 16:13:45 UTC
(In reply to Derek Higgins from comment #19)

I believe Nova/os-brick will read the new initiator name at attach time, but it's possible that you need to restart the iscsi service on the compute node to ensure that it reloads the config and matches what's being used by Nova.  I'm having trouble finding documentation about what the expected behavior is here.

Comment 21 Derek Higgins 2016-02-09 23:18:07 UTC
The patch attached works for new deployments; live migration works as expected.

For existing deployments it becomes a little more complicated; the following needs to happen on each compute node before live migration is attempted:

# Set the InitiatorName (if /etc/iscsi/.initiator_reset doesn't exist)
/bin/echo InitiatorName=$(/usr/sbin/iscsi-iname) > /etc/iscsi/initiatorname.iscsi

# make sure the new InitiatorName is picked up
systemctl restart iscsid
systemctl restart openstack-nova-compute

Only after doing this have I been able to live-migrate volume-backed VMs, created both before and after the InitiatorName change, onto compute nodes that had been used previously.

Assuming any upgrade will involve live migration, then depending on how it is orchestrated, the three lines above may become part of the patch to overcloud_compute.pp or part of an upgrade script.

Comment 23 Dan Yasny 2016-02-11 16:30:47 UTC
[stack@instack ~]$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
| ID                                   | Name                    | Status | Task State | Power State | Networks            |
+--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
| fe8c1d14-d63b-442e-91b8-a6c68f1214ce | overcloud-cephstorage-0 | ACTIVE | -          | Running     | ctlplane=192.0.2.7  |
| 44344478-0866-40f4-9a27-a9fa84343119 | overcloud-compute-0     | ACTIVE | -          | Running     | ctlplane=192.0.2.11 |
| 69335882-c373-4956-9823-47d95cd1ed4b | overcloud-compute-1     | ACTIVE | -          | Running     | ctlplane=192.0.2.9  |
| 3c691e1a-6a7e-48f1-bb28-fc2b7e953d15 | overcloud-controller-0  | ACTIVE | -          | Running     | ctlplane=192.0.2.12 |
| abd350f8-54e2-420f-9e3e-d7f4081ed51c | overcloud-controller-1  | ACTIVE | -          | Running     | ctlplane=192.0.2.10 |
| 58483250-538b-4bdb-861b-f7cc98f7d08d | overcloud-controller-2  | ACTIVE | -          | Running     | ctlplane=192.0.2.8  |
+--------------------------------------+-------------------------+--------+------------+-------------+---------------------+


[stack@instack ~]$ for i in `nova list|grep ctlplane|cut -d"=" -f2 |cut -d' ' -f1`; do echo $i; ssh heat-admin@$i cat /etc/iscsi/initiatorname.iscsi; done
192.0.2.7
InitiatorName=iqn.1994-05.com.redhat:9d4e9e8d8fe
192.0.2.11
InitiatorName=iqn.1994-05.com.redhat:8950acdea36
192.0.2.9
InitiatorName=iqn.1994-05.com.redhat:7c3107a5d62
192.0.2.12
InitiatorName=iqn.1994-05.com.redhat:9d4e9e8d8fe
192.0.2.10
InitiatorName=iqn.1994-05.com.redhat:9d4e9e8d8fe
192.0.2.8
InitiatorName=iqn.1994-05.com.redhat:9d4e9e8d8fe


[stack@instack ~]$ rpm -qa |grep tripleo
openstack-tripleo-image-elements-0.9.6-10.el7ost.noarch
openstack-tripleo-common-0.0.1.dev6-6.git49b57eb.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-119.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1-5.el7ost.noarch
openstack-tripleo-0.0.7-0.1.1664e566.el7ost.noarch

Looks like the compute nodes have different initiator names. All the other hosts still have the same initiator name.

Setting the BZ to verified; I will report the non-unique names on the other nodes in a separate BZ.

Comment 24 James Slagle 2016-02-11 19:03:40 UTC
This will need reverification, as there was an addition to the initial fix: on overcloud update, the iscsid and openstack-nova-compute services need to be restarted after the initiator name has been set, so that dependency has been added to the compute puppet manifest.

Comment 25 Rhys Oxenham 2016-02-11 21:07:24 UTC
@Dan - it's intended that only the compute nodes have their iSCSI initiator IDs changed, hence four out of your six nodes have the original, identical ID and the computes are random. This is not a bug.

Comment 27 Dan Yasny 2016-02-12 15:20:35 UTC
(In reply to Rhys Oxenham from comment #25)
> @Dan - it's intended that only the compute nodes have their iSCSI initiator
> IDs changed, hence four out of your six nodes have the original, identical
> ID and the computes are random. This is not a bug.

I understand that; this is why I opened the other bug at low priority. If additional disks are ever to be added via iSCSI on those other, non-compute nodes (cinder, glance, ceph, and especially swift), we don't want to run into problems when we can have an easy fix now. The severity for now is low, but that's still not the way iSCSI initiators should be autoconfigured anywhere.

Comment 28 Rhys Oxenham 2016-02-12 15:22:32 UTC
(In reply to Dan Yasny from comment #27)
> (In reply to Rhys Oxenham from comment #25)
> > @Dan - it's intended that only the compute nodes have their iSCSI initiator
> > IDs changed, hence four out of your six nodes have the original, identical
> > ID and the computes are random. This is not a bug.
> 
> I understand that; this is why I opened the other bug at low priority. If
> additional disks are ever to be added via iSCSI on those other, non-compute
> nodes (cinder, glance, ceph, and especially swift), we don't want to run
> into problems when we can have an easy fix now. The severity for now is low,
> but that's still not the way iSCSI initiators should be autoconfigured anywhere.

Got it, then I was the one that confused your intentions - apologies.

Comment 29 Dan Yasny 2016-02-12 15:23:24 UTC
(In reply to James Slagle from comment #24)
> This will need reverification, as there was an addition to the initial fix:
> on overcloud update, the iscsid and openstack-nova-compute services need to
> be restarted after the initiator name has been set, so that dependency has
> been added to the compute puppet manifest.

@James, can you please elaborate on what exactly the additional verification steps should look like?

If I deploy 7.1 and upgrade to 7.3 (last night's puddle) and then check the initiator names, will that suffice?

Comment 32 errata-xmlrpc 2016-02-18 16:46:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0264.html

