Bug 1484467

Summary: OSP11 -> OSP12 upgrade: unable to backup cinder volume after upgrade
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates
Assignee: Jon Bernard <jobernar>
Status: CLOSED ERRATA
QA Contact: Marius Cornea <mcornea>
Severity: urgent
Docs Contact:
Priority: high
Version: 12.0 (Pike)
CC: abishop, afazekas, augol, cschwede, dbecker, dprince, geguileo, jobernar, mariel, mburns, mcornea, morazi, owalsh, pgrist, rhel-osp-director-maint, scohen, tshefi
Target Milestone: rc
Keywords: AutomationBlocker, Triaged
Target Release: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-7.0.1-0.20170927205938
Doc Type: Bug Fix
Doc Text:
Running Cinder services on bare metal machines and the Iscsid service in a container caused the services to have different iSCSI Qualified Name (IQN) values. Because the IQN is used to authenticate iSCSI connections, Cinder backup operations failed with an authentication error that was caused by an IQN mismatch. With this fix, the Iscsid service now runs on bare metal, and all other services, such as containerized Nova and non-containerized Cinder, are configured to use the correct IQN.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-12-13 21:55:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
Description Flags
controller sosreport none

Description Marius Cornea 2017-08-23 16:43:33 UTC
Created attachment 1317154 [details]
controller sosreport

Description of problem:
OSP11 -> OSP12 upgrade: unable to backup cinder volume after upgrade

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.0-0.20170805163048.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy basic OSP11 with 1 controller + 1 compute
timeout 180m openstack overcloud deploy \
--templates /usr/share/openstack-tripleo-heat-templates \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
--environment-file /usr/share/openstack-tripleo-heat-templates/environments/cinder-backup.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/virt/nodes_data.yaml \
--log-file overcloud_deployment_40.log

2. Upgrade to OSP12:
timeout 180m openstack overcloud deploy \
--templates /usr/share/openstack-tripleo-heat-templates \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
--environment-file /usr/share/openstack-tripleo-heat-templates/environments/cinder-backup.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/virt/nodes_data.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-composable-steps-docker.yaml \
-e /home/stack/docker-osp12.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml \

timeout 180m openstack overcloud deploy \
--templates /usr/share/openstack-tripleo-heat-templates \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
--environment-file /usr/share/openstack-tripleo-heat-templates/environments/cinder-backup.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/virt/nodes_data.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-converge-docker.yaml \
-e /home/stack/docker-osp12.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml \


3. Create volume:

(overcloud) [stack@undercloud-0 ~]$ cinder create 1
+--------------------------------+---------------------------------------+
| Property                       | Value                                 |
+--------------------------------+---------------------------------------+
| attachments                    | []                                    |
| availability_zone              | nova                                  |
| bootable                       | false                                 |
| consistencygroup_id            | None                                  |
| created_at                     | 2017-08-23T16:31:57.000000            |
| description                    | None                                  |
| encrypted                      | False                                 |
| id                             | 44373257-c772-4e47-a636-9f9be5edcfdf  |
| metadata                       | {}                                    |
| migration_status               | None                                  |
| multiattach                    | False                                 |
| name                           | None                                  |
| os-vol-host-attr:host          | hostgroup@tripleo_iscsi#tripleo_iscsi |
| os-vol-mig-status-attr:migstat | None                                  |
| os-vol-mig-status-attr:name_id | None                                  |
| os-vol-tenant-attr:tenant_id   | 539901bc8bd64a91811e26205bd936a7      |
| replication_status             | None                                  |
| size                           | 1                                     |
| snapshot_id                    | None                                  |
| source_volid                   | None                                  |
| status                         | creating                              |
| updated_at                     | 2017-08-23T16:31:57.000000            |
| user_id                        | 3cc0710d2c6846a1b8375d99fb381211      |
| volume_type                    | None                                  |
+--------------------------------+---------------------------------------+

4. Backup the created volume: 

(overcloud) [stack@undercloud-0 ~]$ cinder backup-list
+--------------------------------------+--------------------------------------+--------+------+------+--------------+-----------+
| ID                                   | Volume ID                            | Status | Name | Size | Object Count | Container |
+--------------------------------------+--------------------------------------+--------+------+------+--------------+-----------+
| db60d3bc-4580-4d40-9801-b70802a6f6c8 | 44373257-c772-4e47-a636-9f9be5edcfdf | error  | -    | 1    | 0            | -         |
+--------------------------------------+--------------------------------------+--------+------+------+--------------+-----------+


Actual results:

(overcloud) [stack@undercloud-0 ~]$ cinder backup-show db60d3bc-4580-4d40-9801-b70802a6f6c8
+-----------------------+--------------------------------------+
| Property              | Value                                |
+-----------------------+--------------------------------------+
| availability_zone     | nova                                 |
| container             | None                                 |
| created_at            | 2017-08-23T16:32:17.000000           |
| data_timestamp        | 2017-08-23T16:32:17.000000           |
| description           | None                                 |
| fail_reason           | Volume device not found at .         |
| has_dependent_backups | False                                |
| id                    | db60d3bc-4580-4d40-9801-b70802a6f6c8 |
| is_incremental        | False                                |
| name                  | None                                 |
| object_count          | 0                                    |
| size                  | 1                                    |
| snapshot_id           | None                                 |
| status                | error                                |
| updated_at            | 2017-08-23T16:32:31.000000           |
| volume_id             | 44373257-c772-4e47-a636-9f9be5edcfdf |
+-----------------------+--------------------------------------+

Expected results:
Volume is backed up successfully.

Additional info:
Volume backup completes fine before the upgrade.

Attaching controller sosreport.

Comment 1 Marius Cornea 2017-08-23 16:45:18 UTC
Traceback:

2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server [req-3cc9fdf4-eb2b-4c0f-9732-9649e279b68d 3cc0710d2c6846a1b8375d99fb381211 539901bc8bd64a91811e26205bd936a7 - default default] Exception during message handling: VolumeDeviceNo
tFound: Volume device not found at .
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 160, in _process_incoming
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 213, in dispatch
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 183, in _do_dispatch
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/cinder/backup/manager.py", line 357, in create_backup
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     self._update_backup_error(backup, six.text_type(err))
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     self.force_reraise()
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/cinder/backup/manager.py", line 346, in create_backup
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     self._run_backup(context, backup, volume)
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/cinder/backup/manager.py", line 392, in _run_backup
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     backup_device.is_snapshot)
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/cinder/backup/manager.py", line 874, in _attach_device
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     return self._attach_volume(ctxt, backup_device, properties)
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/cinder/backup/manager.py", line 896, in _attach_volume
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     {'volume_id', volume.id})
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     self.force_reraise()
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/cinder/backup/manager.py", line 885, in _attach_volume
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     return self._connect_device(conn)
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/cinder/backup/manager.py", line 926, in _connect_device
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     vol_handle = connector.connect_volume(conn['data'])
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/os_brick/utils.py", line 150, in trace_logging_wrapper
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     result = f(*args, **kwargs)
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 271, in inner
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     return f(*args, **kwargs)
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py", line 506, in connect_volume
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     self._cleanup_connection(connection_properties, force=True)
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     self.force_reraise()
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py", line 500, in connect_volume
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     return self._connect_single_volume(connection_properties)
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/os_brick/utils.py", line 61, in _wrapper
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     return r.call(f, *args, **kwargs)
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/retrying.py", line 229, in call
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     raise attempt.get()
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/retrying.py", line 261, in get
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     six.reraise(self.value[0], self.value[1], self.value[2])
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/retrying.py", line 217, in call
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py", line 575, in _connect_single_volume
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server     raise exception.VolumeDeviceNotFound(device='')
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server VolumeDeviceNotFound: Volume device not found at .
2017-08-23 16:32:31.936 694053 ERROR oslo_messaging.rpc.server

Comment 2 Gorka Eguileor 2017-08-24 15:50:41 UTC
That traceback is not the relevant one; the one we want to look at is above it:

--------

2017-08-23 16:32:30.100 1015911 DEBUG oslo.privsep.daemon [req-3cc9fdf4-eb2b-4c0f-9732-9649e279b68d 3cc0710d2c6846a1b8375d99fb381211 539901bc8bd64a91811e26205bd936a7 - default default] privsep: Exception during request[92649264]: Unexpected error while running command.
Command: iscsiadm -m node -T iqn.2010-10.org.openstack:volume-44373257-c772-4e47-a636-9f9be5edcfdf -p 172.17.3.11:3260 --login
Exit code: 24
Stdout: u'Logging in to [iface: default, target: iqn.2010-10.org.openstack:volume-44373257-c772-4e47-a636-9f9be5edcfdf, portal: 172.17.3.11,3260] (multiple)\n'
Stderr: u'iscsiadm: Could not login to [iface: default, target: iqn.2010-10.org.openstack:volume-44373257-c772-4e47-a636-9f9be5edcfdf, portal: 172.17.3.11,3260].\niscsiadm: initiator reported error (24 - iSCSI login failed due to authorization failure)\niscsiadm: Could not log into all portals\n' loop /usr/lib/python2.7/site-packages/oslo_privsep/daemon.py:449
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 445, in loop
    reply = self._process_cmd(*msg)
  File "/usr/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 428, in _process_cmd
    ret = func(*f_args, **f_kwargs)
  File "/usr/lib/python2.7/site-packages/oslo_privsep/priv_context.py", line 207, in _wrap
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/os_brick/privileged/rootwrap.py", line 194, in execute_root
    return custom_execute(*cmd, shell=False, run_as_root=False, **kwargs)
  File "/usr/lib/python2.7/site-packages/os_brick/privileged/rootwrap.py", line 143, in custom_execute
    on_completion=on_completion, *cmd, **kwargs)
  File "/usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py", line 400, in execute
    cmd=sanitized_cmd)
ProcessExecutionError: Unexpected error while running command.
Command: iscsiadm -m node -T iqn.2010-10.org.openstack:volume-44373257-c772-4e47-a636-9f9be5edcfdf -p 172.17.3.11:3260 --login
Exit code: 24
Stdout: u'Logging in to [iface: default, target: iqn.2010-10.org.openstack:volume-44373257-c772-4e47-a636-9f9be5edcfdf, portal: 172.17.3.11,3260] (multiple)\n'
Stderr: u'iscsiadm: Could not login to [iface: default, target: iqn.2010-10.org.openstack:volume-44373257-c772-4e47-a636-9f9be5edcfdf, portal: 172.17.3.11,3260].\niscsiadm: initiator reported error (24 - iSCSI login failed due to authorization failure)\niscsiadm: Could not log into all portals\n'

----------

The problem is that we cannot log in to the iSCSI target because authorization failed, which is odd, because it's using the correct CHAP username and password.
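The failure mode can be spotted mechanically by scanning iscsiadm's stderr for the initiator error code; code 24 is the authorization failure quoted above. This is an illustrative sketch, not part of os-brick (the `initiator_error` helper and its code map are made up for this example):

```python
import re

# Initiator error codes observed in this report; 24 is the authorization
# failure quoted above. Other codes are omitted.
KNOWN_CODES = {24: "iSCSI login failed due to authorization failure"}

def initiator_error(stderr: str):
    """Return (code, description) for the first initiator error in stderr, or None."""
    m = re.search(r"initiator reported error \((\d+) - ([^)]+)\)", stderr)
    if not m:
        return None
    return int(m.group(1)), m.group(2)

stderr = ("iscsiadm: initiator reported error "
          "(24 - iSCSI login failed due to authorization failure)")
print(initiator_error(stderr))
```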

But before we even go into that, I notice that we are in the middle of the upgrade (we are doing a rolling upgrade), as indicated by the backup service being v11.0.0 (OSP12):

2017-08-23 15:47:38.408 694053 INFO cinder.service [req-70a476ca-0aaf-4d72-95da-4cc20a8d03a3 - - - - -] Starting cinder-backup node (version 11.0.0)  

But it's running in compatibility mode because it's using OSP11's RPC and Versioned Objects versions:

2017-08-23 15:47:38.221 694053 INFO cinder.rpc [req-70a476ca-0aaf-4d72-95da-4cc20a8d03a3 - - - - -] Automatically selected cinder-backup objects version 1.21 as minimum service version.
2017-08-23 15:47:38.226 694053 INFO cinder.rpc [req-70a476ca-0aaf-4d72-95da-4cc20a8d03a3 - - - - -] Automatically selected cinder-backup RPC version 2.0 as minimum service version.
2017-08-23 15:47:38.230 694053 INFO cinder.rpc [req-70a476ca-0aaf-4d72-95da-4cc20a8d03a3 - - - - -] Automatically selected cinder-volume objects version 1.21 as minimum service version.
2017-08-23 15:47:38.235 694053 INFO cinder.rpc [req-70a476ca-0aaf-4d72-95da-4cc20a8d03a3 - - - - -] Automatically selected cinder-volume RPC version 3.10 as minimum service version.

But even before that, we seem to have a bigger problem with the DB: it looks like we didn't actually run the migrations during the upgrade:

2017-08-23 15:47:38.423 694053 ERROR cinder.backup.manager ProgrammingError: (pymysql.err.ProgrammingError) (1146, u"Table 'cinder.backup_metadata' doesn't exist") 


Could this be related to this other Bug? https://bugs.launchpad.net/tripleo/+bug/1706951

Comment 3 Marius Cornea 2017-08-24 16:01:39 UTC
(In reply to Gorka Eguileor from comment #2)
> [...]
> Could this be related to this other Bug?
> https://bugs.launchpad.net/tripleo/+bug/1706951

Hm, I think so - I am applying https://review.openstack.org/#/c/493878/ which should address the issue reported in the bug. I think we might need to add the post_upgrade_config tag to cinder-backup service as well, similar to cinder-volume. Wdyt?

Comment 4 Gorka Eguileor 2017-08-24 16:13:40 UTC
Well, I don't know about the TripleO specifics, but I can tell you that you must never run new Cinder code (api, scheduler, backup, or volume) of the new version before running the DB migrations.

Comment 5 Marius Cornea 2017-08-24 19:12:38 UTC
It looks like we're running the migrations after the point where the DB error shows up:


2017-08-23 15:47:38.423 694053 ERROR cinder.backup.manager ProgrammingError: (pymysql.err.ProgrammingError) (1146, u"Table 'cinder.backup_metadata' doesn't exist") [SQL: u'SELECT backups.created_at AS backups_created_at, backups.updated_at


2017-08-23 15:49:01.598 743006 INFO migrate.versioning.api [-] 96 -> 97... 
2017-08-23 15:49:01.618 743006 INFO migrate.versioning.api [-] done
2017-08-23 15:49:01.618 743006 INFO migrate.versioning.api [-] 97 -> 98... 
2017-08-23 15:49:01.637 743006 INFO migrate.versioning.api [-] done
2017-08-23 15:49:01.638 743006 INFO migrate.versioning.api [-] 98 -> 99... 
2017-08-23 15:49:01.661 743006 INFO migrate.versioning.api [-] done
2017-08-23 15:49:01.661 743006 INFO migrate.versioning.api [-] 99 -> 100... 
2017-08-23 15:49:01.665 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding attachment_specs_attachment_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.668 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding cgsnapshots_consistencygroup_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.671 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding group_snapshots_group_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.676 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding group_type_specs_group_type_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.679 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding group_volume_type_mapping_group_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.682 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding group_volume_type_mapping_volume_type_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.685 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding quality_of_service_specs_specs_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.688 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding reservations_allocated_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.691 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding reservations_usage_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.694 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding snapshot_metadata_snapshot_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.697 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding snapshots_cgsnapshot_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.700 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding snapshots_group_snapshot_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.703 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding snapshots_volume_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.706 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding transfers_volume_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.709 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding volume_admin_metadata_volume_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.712 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding volume_attachment_volume_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.714 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding volume_glance_metadata_snapshot_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.717 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding volume_glance_metadata_volume_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.720 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding volume_metadata_volume_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.722 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding volume_type_extra_specs_volume_type_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.725 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding volume_types_qos_specs_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.728 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding volumes_consistencygroup_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.731 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding volumes_group_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.734 743006 INFO 100_add_foreign_key_indexes [-] Skipped adding workers_service_id_idx because an equivalent index already exists.
2017-08-23 15:49:01.738 743006 INFO migrate.versioning.api [-] done
2017-08-23 15:49:01.738 743006 INFO migrate.versioning.api [-] 100 -> 101... 
2017-08-23 15:49:01.756 743006 INFO migrate.versioning.api [-] done
2017-08-23 15:49:01.756 743006 INFO migrate.versioning.api [-] 101 -> 102... 
2017-08-23 15:49:01.768 743006 INFO migrate.versioning.api [-] done
2017-08-23 15:49:01.768 743006 INFO migrate.versioning.api [-] 102 -> 103... 
2017-08-23 15:49:01.785 743006 INFO migrate.versioning.api [-] done
2017-08-23 15:49:01.785 743006 INFO migrate.versioning.api [-] 103 -> 104... 
2017-08-23 15:49:01.797 743006 INFO migrate.versioning.api [-] done
2017-08-23 15:49:01.797 743006 INFO migrate.versioning.api [-] 104 -> 105... 
2017-08-23 15:49:01.813 743006 INFO migrate.versioning.api [-] done
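The ordering can be confirmed from the log timestamps: the ProgrammingError is logged at 15:47:38, well before the first migration step at 15:49:01. A minimal sketch comparing the two (timestamps copied from the excerpts above):

```python
from datetime import datetime

def ts(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")

db_error   = ts("2017-08-23 15:47:38.423 694053 ERROR cinder.backup.manager")
migrations = ts("2017-08-23 15:49:01.598 743006 INFO migrate.versioning.api [-] 96 -> 97")

# cinder-backup hit the missing table about 83 seconds before the schema
# migrations started running.
print(db_error < migrations, (migrations - db_error).seconds)
```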

Comment 6 Marius Cornea 2017-08-24 19:23:47 UTC
I can see authorization-failure messages in the iscsid container:

[root@controller-0 heat-admin]# docker logs -f icsisd
Error: No such container: icsisd
[root@controller-0 heat-admin]# docker logs -f iscsid
INFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json
INFO:__main__:Validating config file
INFO:__main__:Kolla config strategy set to: COPY_ALWAYS
INFO:__main__:Copying service configuration files
INFO:__main__:Copying /var/lib/kolla/config_files/src-iscsid/etc/iscsi/.initiator_reset to /etc/iscsi/.initiator_reset
INFO:__main__:Copying /var/lib/kolla/config_files/src-iscsid/etc/iscsi/initiatorname.iscsi to /etc/iscsi/initiatorname.iscsi
INFO:__main__:Writing out command to execute
Running command: '/usr/sbin/iscsid -f'
iscsid: Could not set session1 priority. READ/WRITE throughout and latency could be affected.
iscsid: conn 0 login rejected: initiator failed authorization with target
iscsid: Connection1:0 to [target: iqn.2010-10.org.openstack:volume-3b24312a-101a-4779-a07e-799ca2a90d6a, portal: 172.17.3.20,3260] through [iface: default] is shutdown.
iscsid: Could not set session2 priority. READ/WRITE throughout and latency could be affected.
iscsid: conn 0 login rejected: initiator failed authorization with target
iscsid: Connection2:0 to [target: iqn.2010-10.org.openstack:volume-3b24312a-101a-4779-a07e-799ca2a90d6a, portal: 172.17.3.20,3260] through [iface: default] is shutdown.
iscsid: Could not set session3 priority. READ/WRITE throughout and latency could be affected.
iscsid: conn 0 login rejected: initiator failed authorization with target
iscsid: Connection3:0 to [target: iqn.2010-10.org.openstack:volume-3b24312a-101a-4779-a07e-799ca2a90d6a, portal: 172.17.3.20,3260] through [iface: default] is shutdown.
iscsid: Could not set session4 priority. READ/WRITE throughout and latency could be affected.
iscsid: conn 0 login rejected: initiator failed authorization with target
iscsid: Connection4:0 to [target: iqn.2010-10.org.openstack:volume-3b24312a-101a-4779-a07e-799ca2a90d6a, portal: 172.17.3.20,3260] through [iface: default] is shutdown.
iscsid: Could not set session5 priority. READ/WRITE throughout and latency could be affected.
iscsid: conn 0 login rejected: initiator failed authorization with target
iscsid: Connection5:0 to [target: iqn.2010-10.org.openstack:volume-3b24312a-101a-4779-a07e-799ca2a90d6a, portal: 172.17.3.20,3260] through [iface: default] is shutdown.
iscsid: Could not set session6 priority. READ/WRITE throughout and latency could be affected.
iscsid: conn 0 login rejected: initiator failed authorization with target
iscsid: Connection6:0 to [target: iqn.2010-10.org.openstack:volume-3b24312a-101a-4779-a07e-799ca2a90d6a, portal: 172.17.3.20,3260] through [iface: default] is shutdown.
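
As an aside, the failing target and portal can be pulled out of such logs mechanically. A small sketch (hypothetical helper; it assumes the shutdown-message format shown above, which is not a documented interface):

```python
import re

# Matches the "[target: ..., portal: ...]" portion of the iscsid
# connection-shutdown messages, in the format shown in the log above.
PATTERN = re.compile(r"\[target: (?P<target>[^,]+), portal: (?P<portal>[^\]]+)\]")

def failed_connections(log):
    """Yield (target, portal) pairs from iscsid connection-shutdown lines."""
    for line in log.splitlines():
        if "is shutdown" in line:
            match = PATTERN.search(line)
            if match:
                yield match.group("target"), match.group("portal")

log = (
    "iscsid: conn 0 login rejected: initiator failed authorization with target\n"
    "iscsid: Connection1:0 to [target: iqn.2010-10.org.openstack:"
    "volume-3b24312a-101a-4779-a07e-799ca2a90d6a, portal: 172.17.3.20,3260] "
    "through [iface: default] is shutdown.\n"
)
for target, portal in failed_connections(log):
    print(target, portal)
```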

Comment 7 Gorka Eguileor 2017-08-25 12:29:28 UTC
Those would be the authorization errors we saw in comment #2:

Stdout: u'Logging in to [iface: default, target: iqn.2010-10.org.openstack:volume-44373257-c772-4e47-a636-9f9be5edcfdf, portal: 172.17.3.11,3260] (multiple)\n'
Stderr: u'iscsiadm: Could not login to [iface: default, target: iqn.2010-10.org.openstack:volume-44373257-c772-4e47-a636-9f9be5edcfdf, portal: 172.17.3.11,3260].\niscsiadm: initiator reported error (24 - iSCSI login failed due to authorization failure)\niscsiadm: Could not log into all portals\n'


But I think it would be easier to debug this issue if we first remove the clutter caused by the migrations not running when they should.

Comment 8 Attila Fazekas 2017-08-28 06:14:02 UTC
This is not an upgrade-time issue; it also happens on fresh containerized setups. The Cinder services are not containerized, and the compute node on another machine was able to reach the LVM/iSCSI service on the controllers.

The issue does not happen on a fresh setup without containers; it may be related to address usage or to a missing global setting (iptables? SELinux? permissions? ...) when the base controller is containerized.

Comment 10 Gorka Eguileor 2017-08-28 15:50:45 UTC
If this is not related to upgrades and happens on fresh installations, then it should have been reported as such. Otherwise we'll stay focused on the upgrade issues reported in the logs (like the DB migration).

We should never have the same initiator name on two nodes. Even if in this case the Nova node isn't really using it (because the iscsid container uses a different file), this is a problem:

  ./controller-0/etc/iscsi/initiatorname.iscsi:1:InitiatorName=iqn.1994-05.com.redhat:cddb1aa660
  ./compute-0/etc/iscsi/initiatorname.iscsi:1:InitiatorName=iqn.1994-05.com.redhat:cddb1aa660

As far as I can tell, the problem is that we are mixing two different approaches when deploying the controller nodes: non-containerized Cinder services and a containerized iSCSI daemon.

This means that Cinder will have a different initiator name than the actual iSCSI daemon, so authorization fails: Cinder-Volume authorizes the Cinder-Backup initiator name, while iscsid uses a different name for the connection.
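
A minimal sketch of checking the two files quoted above for this condition (a hypothetical helper, not part of any deployment tooling): every node should carry a unique InitiatorName, since that name is what the target uses to authorize connections.

```python
def parse_iqn(contents):
    """Return the InitiatorName value from an initiatorname.iscsi file body."""
    for line in contents.splitlines():
        line = line.strip()
        if line.startswith("InitiatorName="):
            return line.split("=", 1)[1]
    raise ValueError("no InitiatorName= line found")

# The two files quoted above carry the same IQN -- itself a problem, since
# every node is expected to present a unique initiator name to the target.
controller = "InitiatorName=iqn.1994-05.com.redhat:cddb1aa660"
compute = "InitiatorName=iqn.1994-05.com.redhat:cddb1aa660"
if parse_iqn(controller) == parse_iqn(compute):
    print("duplicate initiator IQN across nodes:", parse_iqn(controller))
```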

Comment 12 Gorka Eguileor 2017-09-05 13:41:11 UTC
*** Bug 1488304 has been marked as a duplicate of this bug. ***

Comment 13 Paul Grist 2017-09-13 00:35:45 UTC
Did we reach a resolution on the patch? There was concern it could affect Nova, but we need an update on where this one stands, as it will block a number of tests.

Comment 14 Gorka Eguileor 2017-09-14 16:08:18 UTC
That is correct. Since there is no fine-grained control over how iscsid is deployed on the nodes, we have to deploy it the same way on controller and compute nodes, and that would break Nova-Compute, since it is configured to use the containerized version.

The resolution was to stop deploying the containerized iscsid and to modify the nova-compute container configuration so it uses the host's iscsid instead of the containerized one. Dan said he's going to propose a patch for this.

Comment 15 Dan Prince 2017-09-21 13:58:35 UTC
I've made changes to the existing patchset here [1] so that the iSCSI config dir is bind-mounted in from the host for the Nova, Ironic, and Cinder services.

[1] https://code.engineering.redhat.com/gerrit/117020

Let me know if you need anything else. Once this lands and makes its way into the puddle, I think it could be marked as ON_QA.
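
For context, tripleo-heat-templates expresses container volumes as lists in a service's docker config. A simplified, illustrative sketch of what such a bind mount could look like (this is an assumption about the shape of the fix, not the actual patch, which is behind the gerrit link above; the image reference is a placeholder):

```yaml
# Illustrative only: bind-mount the host's iSCSI configuration into the
# nova_compute container so the container sees the host's initiator IQN.
docker_config:
  step_4:
    nova_compute:
      image: <nova-compute image>
      privileged: true
      volumes:
        - /etc/iscsi:/etc/iscsi:ro    # share the host IQN with the container
        - /dev:/dev
        - /run:/run
```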

Comment 16 Tzach Shefi 2017-10-02 14:56:14 UTC
FYI, I hit this on a freshly installed containerized OSP12.
Same problem/error: with Cinder LVM, Cinder backup to NFS or Swift both failed.

Comment 17 Mike Burns 2017-10-03 10:47:28 UTC
The patch set has not landed yet. Moving to ON_DEV until it does.

Comment 18 Alan Bishop 2017-10-04 14:40:11 UTC
I can confirm this issue is not related to upgrading from OSP-11 -> 12. As Tzach noted in comment #16, the problem can easily be reproduced on a fresh install. The patch that addresses the problem [1] has been updated, and we should be able to close on this soon.

[1] https://code.engineering.redhat.com/gerrit/117020

Comment 21 Tzach Shefi 2017-10-24 07:54:43 UTC
Amit/Jon 

On 
openstack-tripleo-heat-templates-7.0.3-0.20171019091938.el7ost.noarch

As Alan and I mentioned in comments #16 and #18, this bug doesn't only happen on upgraded systems but on clean OSP12 systems as well.

I've tested this on a clean install of OSP12. 

Cinder LVM, Cinder backup Swift. 

1) cinder create 1 - done. 

2) cinder backup-create 49c473f9-4d4b-41f9-ad90-9dc40b2bf4fb --name backup1

3) Backup is created successfully/available. 

$ cinder backup-list
+--------------------------------------+--------------------------------------+-----------+---------+------+--------------+---------------+
| ID                                   | Volume ID                            | Status    | Name    | Size | Object Count | Container     |
+--------------------------------------+--------------------------------------+-----------+---------+------+--------------+---------------+
| e3e6605a-7169-441a-a87e-a67fd58b117b | 49c473f9-4d4b-41f9-ad90-9dc40b2bf4fb | available | backup1 | 1    | 22           | volumebackups |
+--------------------------------------+--------------------------------------+-----------+---------+------+--------------+---------------+

I think this is enough to verify.
If you agree, we can close this as verified.
Otherwise, if we opt to test on an upgraded system, please share an 11->12 upgrade guide or job.

Comment 27 Tzach Shefi 2017-11-22 17:38:16 UTC
Marius, 
I noticed you're the QA contact, but this is on DFG:Storage Squad:Cinder.

I'm willing to help test; I already tested on a clean OSP12 in comment #21.
Let me know if this is enough to close as verified.

Or, if you think this must be tested on an 11->12 upgraded system,
I can also assist with that, to get this off your back.

Comment 28 Marius Cornea 2017-11-22 21:30:27 UTC
(In reply to Tzach Shefi from comment #27)
> Marius, 
> I noticed you're the QA contact, but this is on DFG:Storage Squad:Cinder.
> 
> I'm willing to help test; I already tested on a clean OSP12 in comment #21.
> Let me know if this is enough to close as verified.
> 
> Or, if you think this must be tested on an 11->12 upgraded system,
> I can also assist with that, to get this off your back.

Thanks Tzach. If this looks good on a clean OSP12 deployment, I think we're good. I can't see any cinder-backup-related failed tests during the post-upgrade tempest run, so I think we can move this to verified.

Comment 32 errata-xmlrpc 2017-12-13 21:55:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462