Description of problem:

Many tempest tests (listed below) fail with the following error on FFU RHOS-10 to RHOS-13 deployments:

2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi [req-a2b0daf0-464a-4d42-be91-92e7b0c23455 58bf6cb5e19c4194899cbb625a559dae bf8847a92964448ebf9d50fd3f7703d1 - default default] Exception encountered during portal discovery: ProcessExecutionError: Unexpected error while running command.
Command: iscsiadm -m discoverydb -o show -P 1
Exit code: 21
Stdout: u'SENDTARGETS:\nNo targets found.\niSNS:\nNo targets found.\nSTATIC:\nNo targets found.\nFIRMWARE:\nNo targets found.\n'
Stderr: u''

How reproducible:
Always

Steps to Reproduce:
1. Deploy RHOS-10 with at least 3 controllers and 2 computes; optionally add 2 nova controllers as well
2. Complete the FFU upgrade to RHOS-13
3. Run the tempest tests listed below

Actual results:

2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi [req-a2b0daf0-464a-4d42-be91-92e7b0c23455 58bf6cb5e19c4194899cbb625a559dae bf8847a92964448ebf9d50fd3f7703d1 - default default] Exception encountered during portal discovery: ProcessExecutionError: Unexpected error while running command.
Command: iscsiadm -m discoverydb -o show -P 1
Exit code: 21
Stdout: u'SENDTARGETS:\nNo targets found.\niSNS:\nNo targets found.\nSTATIC:\nNo targets found.\nFIRMWARE:\nNo targets found.\n'
Stderr: u''
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi Traceback (most recent call last):
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi   File "/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py", line 200, in _get_ips_iqns_luns
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi     ips_iqns_luns = method(connection_properties)
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi   File "/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py", line 359, in _get_discoverydb_portals
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi     '-P', 1])[0] or ""
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi   File "/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py", line 1119, in _run_iscsiadm_bare
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi     check_exit_code=check_exit_code)
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi   File "/usr/lib/python2.7/site-packages/os_brick/executor.py", line 52, in _execute
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi     result = self.__execute(*args, **kwargs)
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi   File "/usr/lib/python2.7/site-packages/os_brick/privileged/rootwrap.py", line 169, in execute
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi     return execute_root(*cmd, **kwargs)
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi   File "/usr/lib/python2.7/site-packages/oslo_privsep/priv_context.py", line 207, in _wrap
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi     return self.channel.remote_call(name, args, kwargs)
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi   File "/usr/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 202, in remote_call
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi     raise exc_type(*result[2])
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi ProcessExecutionError: Unexpected error while running command.
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi Command: iscsiadm -m discoverydb -o show -P 1
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi Exit code: 21
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi Stdout: u'SENDTARGETS:\nNo targets found.\niSNS:\nNo targets found.\nSTATIC:\nNo targets found.\nFIRMWARE:\nNo targets found.\n'
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi Stderr: u''
2021-09-09 23:37:45.056 7 ERROR os_brick.initiator.connectors.iscsi
...<output omitted>...
2021-09-09 23:37:45.059 7 ERROR nova.compute.manager [instance: ef8cc46f-cc5b-40c7-a724-b96c3a075b86] VolumeDeviceNotFound: Volume device not found at .
2021-09-09 23:37:45.059 7 ERROR nova.compute.manager [instance: ef8cc46f-cc5b-40c7-a724-b96c3a075b86]
2021-09-09 23:37:45.060 7 INFO nova.compute.manager [req-a2b0daf0-464a-4d42-be91-92e7b0c23455 58bf6cb5e19c4194899cbb625a559dae bf8847a92964448ebf9d50fd3f7703d1 - de

Expected results:
Tests pass

Additional info:

Tests failing:

DFG-compute-nova-ffu-upgrade-10-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-compute-lvm-tempest-phase3 #36

The following tests started failing:

.setUpClass (tempest.api.compute.servers.test_create_server.ServersTestBootFromVolume)
tempest.api.compute.admin.test_live_migration.LiveAutoBlockMigrationV225Test.test_iscsi_volume[id-e19c0cc6-6720-4ed8-be83-b6603ed5c812]
tempest.api.compute.admin.test_live_migration.LiveAutoBlockMigrationV225Test.test_volume_backed_live_migration[id-5071cf17-3004-4257-ae61-73a84e28badd,volume]
tempest.api.compute.admin.test_live_migration.LiveMigrationRemoteConsolesV26Test.test_iscsi_volume[id-e19c0cc6-6720-4ed8-be83-b6603ed5c812]
tempest.api.compute.admin.test_live_migration.LiveMigrationRemoteConsolesV26Test.test_volume_backed_live_migration[id-5071cf17-3004-4257-ae61-73a84e28badd,volume]
tempest.api.compute.admin.test_live_migration.LiveMigrationTest.test_iscsi_volume[id-e19c0cc6-6720-4ed8-be83-b6603ed5c812]
tempest.api.compute.admin.test_live_migration.LiveMigrationTest.test_volume_backed_live_migration[id-5071cf17-3004-4257-ae61-73a84e28badd,volume]
tempest.api.compute.admin.test_volumes_negative.VolumesAdminNegativeTest.test_update_attached_volume_with_nonexistent_volume_in_body[id-7dcac15a-b107-46d3-a5f6-cb863f4e454a,negative]
tempest.api.compute.servers.test_delete_server.DeleteServersTestJSON.test_delete_server_while_in_attached_volume[id-d0f3f0d6-d9b6-4a32-8da4-23015dcab23c,volume]
tempest.api.compute.servers.test_device_tagging.DeviceTaggingTest.test_device_tagging[id-a2e65a6c-66f1-4442-aaa8-498c31778d96,image,network,slow,volume]
tempest.api.compute.servers.test_device_tagging.DeviceTaggingTestV2_42.test_device_tagging[id-a2e65a6c-66f1-4442-aaa8-498c31778d96,image,network,slow,volume]
tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_rebuild_server_with_volume_attached[id-b68bd8d6-855d-4212-b59b-2e704044dace,slow,volume]
tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_volume_backed_server_confirm[id-e6c28180-7454-4b59-b188-0257af08a63b,volume]
tempest.api.compute.servers.test_server_rescue_negative.ServerRescueNegativeTestJSON.test_rescued_vm_detach_volume[id-f56e465b-fe10-48bf-b75d-646cda3a8bc9,negative,volume]
tempest.api.compute.volumes.test_attach_volume.AttachVolumeShelveTestJSON.test_attach_volume_shelved_or_offload_server[id-13a940b6-3474-4c3c-b03f-29b89112bfee,slow]
tempest.api.compute.volumes.test_attach_volume.AttachVolumeTestJSON.test_attach_detach_volume[id-52e9045a-e90d-4c0d-9087-79d657faffff]
tempest.api.compute.volumes.test_attach_volume.AttachVolumeTestJSON.test_list_get_volume_attachments[id-7fa563fe-f0f7-43eb-9e22-a1ece036b513]
tempest.api.compute.volumes.test_attach_volume_negative.AttachVolumeNegativeTest.test_attach_attached_volume_to_same_server[id-aab919e2-d992-4cbb-a4ed-745c2475398c,negative]
tempest.api.compute.volumes.test_attach_volume_negative.AttachVolumeNegativeTest.test_delete_attached_volume[id-a313b5cd-fbd0-49cc-94de-870e99f763c7,negative]

DFG-compute-nova-ffu-upgrade-10-13_director-rhel-virthost-3cont_2novactl_2comp-ipv4-vxlan-lvm-tempest-composable-phase3 #33

The following tests started failing:

.setUpClass (tempest.api.compute.servers.test_create_server.ServersTestBootFromVolume)
tempest.api.compute.admin.test_live_migration.LiveAutoBlockMigrationV225Test.test_iscsi_volume[id-e19c0cc6-6720-4ed8-be83-b6603ed5c812]
tempest.api.compute.admin.test_live_migration.LiveAutoBlockMigrationV225Test.test_volume_backed_live_migration[id-5071cf17-3004-4257-ae61-73a84e28badd,volume]
tempest.api.compute.admin.test_live_migration.LiveMigrationRemoteConsolesV26Test.test_iscsi_volume[id-e19c0cc6-6720-4ed8-be83-b6603ed5c812]
tempest.api.compute.admin.test_live_migration.LiveMigrationRemoteConsolesV26Test.test_volume_backed_live_migration[id-5071cf17-3004-4257-ae61-73a84e28badd,volume]
tempest.api.compute.admin.test_live_migration.LiveMigrationTest.test_iscsi_volume[id-e19c0cc6-6720-4ed8-be83-b6603ed5c812]
tempest.api.compute.admin.test_live_migration.LiveMigrationTest.test_volume_backed_live_migration[id-5071cf17-3004-4257-ae61-73a84e28badd,volume]
tempest.api.compute.admin.test_volumes_negative.VolumesAdminNegativeTest.test_update_attached_volume_with_nonexistent_volume_in_body[id-7dcac15a-b107-46d3-a5f6-cb863f4e454a,negative]
tempest.api.compute.servers.test_delete_server.DeleteServersTestJSON.test_delete_server_while_in_attached_volume[id-d0f3f0d6-d9b6-4a32-8da4-23015dcab23c,volume]
tempest.api.compute.servers.test_device_tagging.DeviceTaggingTest.test_device_tagging[id-a2e65a6c-66f1-4442-aaa8-498c31778d96,image,network,slow,volume]
tempest.api.compute.servers.test_device_tagging.DeviceTaggingTestV2_42.test_device_tagging[id-a2e65a6c-66f1-4442-aaa8-498c31778d96,image,network,slow,volume]
tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_rebuild_server_with_volume_attached[id-b68bd8d6-855d-4212-b59b-2e704044dace,slow,volume]
tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_volume_backed_server_confirm[id-e6c28180-7454-4b59-b188-0257af08a63b,volume]
tempest.api.compute.servers.test_server_rescue_negative.ServerRescueNegativeTestJSON.test_rescued_vm_detach_volume[id-f56e465b-fe10-48bf-b75d-646cda3a8bc9,negative,volume]
tempest.api.compute.volumes.test_attach_volume.AttachVolumeShelveTestJSON.test_attach_volume_shelved_or_offload_server[id-13a940b6-3474-4c3c-b03f-29b89112bfee,slow]
tempest.api.compute.volumes.test_attach_volume.AttachVolumeTestJSON.test_attach_detach_volume[id-52e9045a-e90d-4c0d-9087-79d657faffff]
tempest.api.compute.volumes.test_attach_volume.AttachVolumeTestJSON.test_list_get_volume_attachments[id-7fa563fe-f0f7-43eb-9e22-a1ece036b513]
tempest.api.compute.volumes.test_attach_volume_negative.AttachVolumeNegativeTest.test_attach_attached_volume_to_same_server[id-aab919e2-d992-4cbb-a4ed-745c2475398c,negative]
tempest.api.compute.volumes.test_attach_volume_negative.AttachVolumeNegativeTest.test_delete_attached_volume[id-a313b5cd-fbd0-49cc-94de-870e99f763c7,negative]
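
For reference, iscsiadm's exit code 21 (ISCSI_ERR_NO_OBJS_FOUND) just means the discovery database on the node is empty. A minimal sketch of how to repeat the failing check by hand on an affected compute node (nova_compute is the default OSP-13 container name; adjust if the deployment differs):

# Same command os-brick runs, executed inside the nova_compute container:
docker exec nova_compute iscsiadm -m discoverydb -o show -P 1
echo $?   # 21 when no SENDTARGETS/iSNS/STATIC/FIRMWARE records exist
# Compare active sessions on the host and in the container:
iscsiadm -m session
docker exec nova_compute iscsiadm -m session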
LVM is not exactly the right backend for such a scenario: the volumes are not replicated across controllers, and I suspect that may be the reason. Can you please reproduce with a different backend?
The fundamental concern we have is that LVM is not a viable backend in a multi-controller environment. I have no idea how/why this job passed before.

The LVM driver stores volumes on the active controller, and as soon as there's a failover to another controller, you get the VolumeDeviceNotFound failure any time something tries to access a volume created on another controller.
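
For anyone trying to confirm this on a deployed environment, a rough sketch (assuming the LVM driver's default cinder-volumes volume group and heat-admin SSH access; names may differ per deployment):

# Each tempest volume exists as an LV in the cinder-volumes VG on exactly
# one controller -- the one that was running cinder-volume at creation time:
for host in controller-0 controller-1 controller-2; do
    echo "== $host =="
    ssh heat-admin@$host sudo lvs cinder-volumes
done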
(In reply to Alan Bishop from comment #5)
> The fundamental concern we have is that LVM is not a viable backend in a
> multi-controller environment. I have no idea how/why this job passed before.
>
> The LVM driver stores volumes on the active controller, and as soon as
> there's a failover to another controller, you get the VolumeDeviceNotFound
> failure any time something tries to access a volume created on another
> controller.

I see, so what can we do here to resolve this? Reduce the controller count to 1, or something else? We have block live migration tests that only run on an LVM backend, so we'd like to retain that storage backend.
From what I can tell, this is a nova job, so they should probably take the lead on resolving what to do. I don't know the requirements that influence the scope of what the job wants to test. If the focus is nova's behavior over an FFU, then yes, reducing the controller count to 1 may be the right way to go. I'm not QE, so I can't be sure there isn't another factor that expects multiple controllers to be part of the test env.

All I can say is that if an env requires multiple controllers and a failover takes place (which will happen if you're doing an FFU), then cinder storage must use a shared backend so that all controllers have access to it. And that precludes LVM.

Maybe Tosky has further insight from QE's perspective.
Also, the job mentions a 10 to 13 FFU, but the BZ lists 16.2 as the affected version.
In my opinion the relevant issue can be found on compute node 0, in /var/log/messages:

Sep 9 22:30:53 compute-0 journal: + CMD='/usr/sbin/iscsid -f'
Sep 9 22:30:53 compute-0 journal: + echo 'Running command: '\''/usr/sbin/iscsid -f'\'''
Sep 9 22:30:53 compute-0 journal: + exec /usr/sbin/iscsid -f
Sep 9 22:30:53 compute-0 journal: Running command: '/usr/sbin/iscsid -f'
Sep 9 22:30:53 compute-0 journal: iscsid: Can not bind IPC socket
Sep 9 22:48:25 compute-0 iscsid: Could not insert module . Kmod error -38
Sep 9 22:48:29 compute-0 iscsid: Could not insert module . Kmod error -38

This means the iscsid container didn't run successfully, since iscsid could not properly bind its IPC socket. If I remember correctly, that Kmod error is related to the iscsi-initiator-utils version: if I'm not mistaken, the package must be the same version in the iscsid container and the nova container, and the OS on the host must match the one in the containers.

We cannot find those errors on any of the 3 controllers, so the upgrade process may be doing something different on the compute nodes than it does on the controller nodes.
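
If someone wants to verify that theory, a quick sketch of the version/OS comparison (the iscsid and nova_compute container names are the OSP-13 defaults; adjust to the actual deployment):

# On the compute host:
rpm -q iscsi-initiator-utils                           # host package
docker exec iscsid rpm -q iscsi-initiator-utils        # iscsid container
docker exec nova_compute rpm -q iscsi-initiator-utils  # nova_compute container
cat /etc/redhat-release                                # host OS
docker exec iscsid cat /etc/redhat-release             # container OS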
> the package must be the same in the iscsid container and the nova container,
> and the OS in the host must match the one in the containers

That looks shaky, and reminds me of the similar requirement for pacemaker packages, which we were able to get rid of by moving the CLI commands from the containers to the hosts. So, as a side question: would it be possible to remove iscsi-initiator-utils from the containers and run it from the host?
The Upgrades DFG should take a look at this.

I dug further, and the problem is not a version mismatch. The FFU isn't stopping the iscsid daemon running on the compute host, which causes errors when OSP-13 tries to run it in a container.

I'm not familiar with the details of this FFU job, but the logs show tempest creating a volume while running OSP-10 (volume ID 8a645393-7ad8-4f62-845e-711c3dd42a6a, around 2021-09-09 13:15:59), as seen in [1]:

[1] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-compute-nova-ffu-upgrade-10-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-compute-lvm-tempest-phase3/36/undercloud-0/home/stack/tempest-dir/tempest.log.gz

2021-09-09 13:15:59.668 26970 INFO tempest.lib.common.rest_client [req-8950b639-4e93-415d-90d3-82a883d4e36a ] Request (LiveAutoBlockMigrationV225TestJSON:test_iscsi_volume): 200 POST http://10.0.0.110:8776/v1/ba0fa6bcd1294d0887160d69df3c2cc6/volumes 1.338s
2021-09-09 13:15:59.669 26970 DEBUG tempest.lib.common.rest_client [req-8950b639-4e93-415d-90d3-82a883d4e36a ] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: {"volume": {"display_name": "tempest-LiveAutoBlockMigrationV225TestJSON-volume-1464230533", "size": 1}}
    Response - Headers: {'status': '200', 'content-length': '437', 'content-location': 'http://10.0.0.110:8776/v1/ba0fa6bcd1294d0887160d69df3c2cc6/volumes', 'x-compute-request-id': 'req-8950b639-4e93-415d-90d3-82a883d4e36a', 'connection': 'close', 'date': 'Thu, 09 Sep 2021 17:15:59 GMT', 'content-type': 'application/json', 'x-openstack-request-id': 'req-8950b639-4e93-415d-90d3-82a883d4e36a'}
        Body: {"volume": {"status": "creating", "display_name": "tempest-LiveAutoBlockMigrationV225TestJSON-volume-1464230533", "attachments": [], "availability_zone": "nova", "bootable": "false", "encrypted": false, "created_at": "2021-09-09T17:15:59.497597", "multiattach": "false", "display_description": null, "volume_type": null, "snapshot_id": null, "source_volid": null, "metadata": {}, "id": "8a645393-7ad8-4f62-845e-711c3dd42a6a", "size": 1}} _log_request_full /home/stack/tempest-dir/tempest/lib/common/rest_client.py:431
2021-09-09 13:15:59.669 26963 INFO tempest.lib.common.rest_client [req-67bc7600-e335-4cdf-b83a-f1222cf81621 ] Request (InstanceUsageAuditLogTestJSON:tearDownClass): 204 DELETE http://10.0.0.110:9696/v2.0/security-groups/1ce48703-9981-4b98-b5b7-896df22bc0fa 0.536s
2021-09-09 13:15:59.670 26963 DEBUG tempest.lib.common.rest_client [req-67bc7600-e335-4cdf-b83a-f1222cf81621 ] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}

This causes the compute node to start the iscsid service, as seen in its /var/log/messages [2]:

[2] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-compute-nova-ffu-upgrade-10-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-compute-lvm-tempest-phase3/36/compute-0/var/log/messages.gz

Sep 9 13:16:15 compute-0 systemd: Starting Open-iSCSI...
Sep 9 13:16:15 compute-0 iscsid: iSCSI logger with pid=54302 started!
Sep 9 13:16:15 compute-0 systemd: Started Open-iSCSI.
Sep 9 13:16:16 compute-0 kernel: iscsi: registered transport (tcp)
Sep 9 13:16:16 compute-0 kernel: scsi host2: iSCSI Initiator over TCP/IP
Sep 9 13:16:16 compute-0 iscsid: iSCSI daemon with pid=54303 started!
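
For context, iscsid on the RHEL 7 host is socket-activated, so any iscsiadm invocation (such as the one triggered by the tempest volume attach above) is enough to start the daemon even though the iscsid service itself is disabled. A rough illustration (192.0.2.10:3260 is a placeholder portal, not from these logs):

systemctl is-enabled iscsid          # disabled
systemctl is-active iscsid           # inactive
systemctl is-enabled iscsid.socket   # enabled
iscsiadm -m discovery -t sendtargets -p 192.0.2.10:3260   # any iscsiadm call
systemctl is-active iscsid           # active: the socket unit started it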
However, later the FFU fails to detect that the daemon is running, as seen in [3]:

[3] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-compute-nova-ffu-upgrade-10-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-compute-lvm-tempest-phase3/36/undercloud-0/home/stack/overcloud_upgrade_run.log.gz

TASK [FFU check if iscsid service is deployed] *********************************
Thursday 09 September 2021  15:10:52 -0400 (0:00:00.239)       0:01:40.761 ****
fatal: [compute-1]: FAILED! => {"changed": true, "cmd": ["systemctl", "is-enabled", "--quiet", "iscsid"], "delta": "0:00:00.018936", "end": "2021-09-09 19:10:52.276075", "msg": "non-zero return code", "rc": 1, "start": "2021-09-09 19:10:52.257139", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [compute-0]: FAILED! => {"changed": true, "cmd": ["systemctl", "is-enabled", "--quiet", "iscsid"], "delta": "0:00:00.018140", "end": "2021-09-09 19:10:52.332576", "msg": "non-zero return code", "rc": 1, "start": "2021-09-09 19:10:52.314436", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring

TASK [Set fact iscsid_enabled] *************************************************
Thursday 09 September 2021  15:10:52 -0400 (0:00:00.534)       0:01:41.296 ****
ok: [compute-1] => {"ansible_facts": {"iscsid_enabled": false}, "changed": false}
ok: [compute-0] => {"ansible_facts": {"iscsid_enabled": false}, "changed": false}

TASK [FFU check if iscsid.socket service is deployed] **************************
Thursday 09 September 2021  15:10:52 -0400 (0:00:00.221)       0:01:41.518 ****
changed: [compute-1] => {"changed": true, "cmd": ["systemctl", "is-enabled", "--quiet", "iscsid.socket"], "delta": "0:00:00.020690", "end": "2021-09-09 19:10:53.033834", "rc": 0, "start": "2021-09-09 19:10:53.013144", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
changed: [compute-0] => {"changed": true, "cmd": ["systemctl", "is-enabled", "--quiet", "iscsid.socket"], "delta": "0:00:00.009062", "end": "2021-09-09 19:10:53.070776", "rc": 0, "start": "2021-09-09 19:10:53.061714", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

TASK [Set fact iscsid_socket_enabled] ******************************************
Thursday 09 September 2021  15:10:53 -0400 (0:00:00.500)       0:01:42.019 ****
ok: [compute-1] => {"ansible_facts": {"iscsid_socket_enabled": true}, "changed": false}
ok: [compute-0] => {"ansible_facts": {"iscsid_socket_enabled": true}, "changed": false}

Notice it set "iscsid_socket_enabled": true but "iscsid_enabled": false. This allows the iscsid daemon to continue running on the host, as seen in [4]:

[4] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-compute-nova-ffu-upgrade-10-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-compute-lvm-tempest-phase3/36/compute-0/var/log/extra/pstree.txt.gz

I assume this FFU test used to pass, and the recent failure is a regression. What I don't know is whether this is due to a change in OSP-10's RHEL, or to something in OSP-13.
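
In other words, "systemctl is-enabled" says nothing about whether the daemon is currently running. A hypothetical sketch of the kind of host-side check the upgrade tasks would need instead (an illustration of the gap, not the actual tripleo fix):

# Treat iscsid as "deployed" if either unit is enabled OR the daemon is
# already running, then stop and disable both before the container starts:
if systemctl is-enabled --quiet iscsid ||
   systemctl is-enabled --quiet iscsid.socket ||
   systemctl is-active --quiet iscsid; then
    systemctl stop iscsid.socket iscsid
    systemctl disable iscsid.socket iscsid
fi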
OSP 13 was retired on June 27, 2023. No further work is expected to occur on this issue.