Bug 2003750
| Summary: | [FFU] Unexpected error while running command. Command: iscsiadm -m discoverydb -o show -P 1 | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Archit Modi <amodi> |
| Component: | openstack-tripleo | Assignee: | Jason Grosso <jgrosso> |
| Status: | CLOSED EOL | QA Contact: | Joe H. Rahme <jhakimra> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | ||
| Version: | 13.0 (Queens) | CC: | abishop, bdobreli, geguileo, jbadiapa, jgrosso, lbezdick, ltoscano, lyarwood, mburns, senrique |
| Target Milestone: | --- | Keywords: | Triaged, ZStream |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-07-11 20:56:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Archit Modi
2021-09-13 15:28:00 UTC
LVM is not exactly the right backend for such a scenario. The volumes are not duplicated, and I suspect that may be the reason. Can you please reproduce with a different backend?

The fundamental concern we have is that LVM is not a viable backend in a multi-controller environment. I have no idea how/why this job passed before. The LVM driver stores volumes on the active controller, and as soon as there's a failover to another controller you get the VolumeDeviceNotFound failure any time something tries to access a volume created on another controller.

(In reply to Alan Bishop from comment #5)
> The fundamental concern we have is that LVM is not a viable backend in a
> multi-controller environment. I have no idea how/why this job passed before.
>
> The LVM driver stores volumes on the active controller, and as soon as
> there's a failover to another controller then you get the
> VolumeDeviceNotFound failure any time something tries to access a volume
> created on another controller.

I see, so what can we do here to resolve this? Reduce the controller count to 1, or something else? We have block live migration tests that only run on an LVM backend, so we'd like to retain the storage backend.

From what I can tell, this is a nova job, so they should probably take the lead on deciding what to do. I don't know the requirements that influence the scope of what the job wants to test. If the focus is nova's behavior over an FFU, then yes, reducing the controller count to 1 may be the right way to go. I'm not QE, so I can't be sure there isn't another factor that expects multiple controllers to be part of the test env. All I can say is that if an env requires multiple controllers and a failover takes place (which will happen if you're doing an FFU), then cinder storage must use a shared backend so that all controllers have access to it. And that precludes LVM. Maybe Tosky has further insight from QE's perspective.
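As a side note for readers unfamiliar with the LVM backend's limitation: the failure mode described above can be illustrated with a small toy sketch. This is not cinder code; the directories and volume name are stand-ins for each controller's node-local volume group.

```shell
# Toy illustration (not cinder code): each controller's "storage" is a
# local directory standing in for its local LVM volume group. A volume
# created while controller-0 is active does not exist on controller-1,
# so any access after a failover fails, analogous to the
# VolumeDeviceNotFound error described above.
demo=$(mktemp -d)
mkdir -p "$demo/controller-0" "$demo/controller-1"

active=controller-0
touch "$demo/$active/volume-8a645393"   # volume created on the active node

active=controller-1                     # failover to another controller
if [ -e "$demo/$active/volume-8a645393" ]; then
    result="volume found"
else
    result="VolumeDeviceNotFound: volume-8a645393"
fi
echo "$result"
rm -rf "$demo"
```

A shared backend (Ceph, NFS, iSCSI to an external array) would make the volume visible from every controller, which is why it sidesteps the failover problem.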
Also, the job mentions a 10 to 13 FFU, but the BZ lists 16.2 as the affected version. In my opinion the relevant issue can be found on compute node 0, in /var/log/messages:
Sep 9 22:30:53 compute-0 journal: + CMD='/usr/sbin/iscsid -f'
Sep 9 22:30:53 compute-0 journal: + echo 'Running command: '\''/usr/sbin/iscsid -f'\'''
Sep 9 22:30:53 compute-0 journal: + exec /usr/sbin/iscsid -f
Sep 9 22:30:53 compute-0 journal: Running command: '/usr/sbin/iscsid -f'
Sep 9 22:30:53 compute-0 journal: iscsid: Can not bind IPC socket
Sep 9 22:48:25 compute-0 iscsid: Could not insert module . Kmod error -38
Sep 9 22:48:29 compute-0 iscsid: Could not insert module . Kmod error -38
That means that the iscsid container didn't run successfully, since iscsid could not bind its IPC socket.
If I remember correctly, that Kmod error is related to the iscsi-initiator-utils version.
If I'm not mistaken, the package must be the same in the iscsid container and the nova container, and the OS in the host must match the one in the containers.
We cannot find those errors on any of the 3 controllers, so the upgrade process may be doing something different on the compute nodes than it does on the controller nodes.
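For context on the "Can not bind IPC socket" error: iscsid uses a single IPC socket, so a containerized iscsid sharing the host's namespaces cannot bind it while the host daemon holds it. A minimal sketch of the kind of pre-check the upgrade could perform follows; it is hypothetical (not the actual FFU playbook), and systemctl is mocked via MOCK_ISCSID_ACTIVE so the sketch runs anywhere.

```shell
# Hypothetical pre-check (not the actual FFU playbook): refuse to start
# the containerized iscsid while the host daemon is still up, since both
# would compete for the same IPC socket. systemctl is mocked through
# MOCK_ISCSID_ACTIVE so the sketch is self-contained.
host_iscsid_active() {
    # Real check would be: systemctl is-active --quiet iscsid
    [ "${MOCK_ISCSID_ACTIVE:-yes}" = "yes" ]
}

if host_iscsid_active; then
    verdict="host iscsid still running: containerized iscsid would fail to bind its IPC socket"
else
    verdict="host iscsid stopped: safe to start the iscsid container"
fi
echo "$verdict"
```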
> the package must be the same in the iscsid container and the nova container, and the OS in the host must match the one in the containers
That looks shaky, and it reminds me of the situation with a similar requirement for pacemaker packages, which we were able to get rid of by moving CLI commands from the containers to the hosts.
So as a side question, would it be possible to remove iscsi-initiator-utils from the containers and run it from the host?
The Upgrades DFG should take a look at this.

I dug further, and the problem is not a version mismatch. The FFU isn't stopping the iscsid daemon running on the compute host, which leads to errors when OSP-13 tries to run it in a container.

I'm not familiar with the details of this FFU job, but the logs show tempest creating a volume while running OSP-10 (volume ID 8a645393-7ad8-4f62-845e-711c3dd42a6a, around 2021-09-09 13:15:59), as seen in [1]:

[1] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-compute-nova-ffu-upgrade-10-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-compute-lvm-tempest-phase3/36/undercloud-0/home/stack/tempest-dir/tempest.log.gz

2021-09-09 13:15:59.668 26970 INFO tempest.lib.common.rest_client [req-8950b639-4e93-415d-90d3-82a883d4e36a ] Request (LiveAutoBlockMigrationV225TestJSON:test_iscsi_volume): 200 POST http://10.0.0.110:8776/v1/ba0fa6bcd1294d0887160d69df3c2cc6/volumes 1.338s
2021-09-09 13:15:59.669 26970 DEBUG tempest.lib.common.rest_client [req-8950b639-4e93-415d-90d3-82a883d4e36a ] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
    Body: {"volume": {"display_name": "tempest-LiveAutoBlockMigrationV225TestJSON-volume-1464230533", "size": 1}}
    Response - Headers: {'status': '200', 'content-length': '437', 'content-location': 'http://10.0.0.110:8776/v1/ba0fa6bcd1294d0887160d69df3c2cc6/volumes', 'x-compute-request-id': 'req-8950b639-4e93-415d-90d3-82a883d4e36a', 'connection': 'close', 'date': 'Thu, 09 Sep 2021 17:15:59 GMT', 'content-type': 'application/json', 'x-openstack-request-id': 'req-8950b639-4e93-415d-90d3-82a883d4e36a'}
    Body: {"volume": {"status": "creating", "display_name": "tempest-LiveAutoBlockMigrationV225TestJSON-volume-1464230533", "attachments": [], "availability_zone": "nova", "bootable": "false", "encrypted": false, "created_at": "2021-09-09T17:15:59.497597", "multiattach": "false", "display_description": null, "volume_type": null, "snapshot_id": null, "source_volid": null, "metadata": {}, "id": "8a645393-7ad8-4f62-845e-711c3dd42a6a", "size": 1}} _log_request_full /home/stack/tempest-dir/tempest/lib/common/rest_client.py:431
2021-09-09 13:15:59.669 26963 INFO tempest.lib.common.rest_client [req-67bc7600-e335-4cdf-b83a-f1222cf81621 ] Request (InstanceUsageAuditLogTestJSON:tearDownClass): 204 DELETE http://10.0.0.110:9696/v2.0/security-groups/1ce48703-9981-4b98-b5b7-896df22bc0fa 0.536s
2021-09-09 13:15:59.670 26963 DEBUG tempest.lib.common.rest_client [req-67bc7600-e335-4cdf-b83a-f1222cf81621 ] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}

This causes the compute node to start the iscsid service, as seen in its /var/log/messages [2]:

[2] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-compute-nova-ffu-upgrade-10-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-compute-lvm-tempest-phase3/36/compute-0/var/log/messages.gz

Sep 9 13:16:15 compute-0 systemd: Starting Open-iSCSI...
Sep 9 13:16:15 compute-0 iscsid: iSCSI logger with pid=54302 started!
Sep 9 13:16:15 compute-0 systemd: Started Open-iSCSI.
Sep 9 13:16:16 compute-0 kernel: iscsi: registered transport (tcp)
Sep 9 13:16:16 compute-0 kernel: scsi host2: iSCSI Initiator over TCP/IP
Sep 9 13:16:16 compute-0 iscsid: iSCSI daemon with pid=54303 started!

However, later the FFU fails to detect that the daemon is running, as seen in [3]:

[3] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-compute-nova-ffu-upgrade-10-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-compute-lvm-tempest-phase3/36/undercloud-0/home/stack/overcloud_upgrade_run.log.gz

TASK [FFU check if iscsid service is deployed] *********************************
Thursday 09 September 2021 15:10:52 -0400 (0:00:00.239) 0:01:40.761 ****
fatal: [compute-1]: FAILED! => {"changed": true, "cmd": ["systemctl", "is-enabled", "--quiet", "iscsid"], "delta": "0:00:00.018936", "end": "2021-09-09 19:10:52.276075", "msg": "non-zero return code", "rc": 1, "start": "2021-09-09 19:10:52.257139", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [compute-0]: FAILED! => {"changed": true, "cmd": ["systemctl", "is-enabled", "--quiet", "iscsid"], "delta": "0:00:00.018140", "end": "2021-09-09 19:10:52.332576", "msg": "non-zero return code", "rc": 1, "start": "2021-09-09 19:10:52.314436", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring

TASK [Set fact iscsid_enabled] *************************************************
Thursday 09 September 2021 15:10:52 -0400 (0:00:00.534) 0:01:41.296 ****
ok: [compute-1] => {"ansible_facts": {"iscsid_enabled": false}, "changed": false}
ok: [compute-0] => {"ansible_facts": {"iscsid_enabled": false}, "changed": false}

TASK [FFU check if iscsid.socket service is deployed] **************************
Thursday 09 September 2021 15:10:52 -0400 (0:00:00.221) 0:01:41.518 ****
changed: [compute-1] => {"changed": true, "cmd": ["systemctl", "is-enabled", "--quiet", "iscsid.socket"], "delta": "0:00:00.020690", "end": "2021-09-09 19:10:53.033834", "rc": 0, "start": "2021-09-09 19:10:53.013144", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
changed: [compute-0] => {"changed": true, "cmd": ["systemctl", "is-enabled", "--quiet", "iscsid.socket"], "delta": "0:00:00.009062", "end": "2021-09-09 19:10:53.070776", "rc": 0, "start": "2021-09-09 19:10:53.061714", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

TASK [Set fact iscsid_socket_enabled] ******************************************
Thursday 09 September 2021 15:10:53 -0400 (0:00:00.500) 0:01:42.019 ****
ok: [compute-1] => {"ansible_facts": {"iscsid_socket_enabled": true}, "changed": false}
ok: [compute-0] => {"ansible_facts": {"iscsid_socket_enabled": true}, "changed": false}

Notice it set "iscsid_socket_enabled": true, but "iscsid_enabled": false. This allows the iscsid daemon to continue running on the host, as seen in [4]:

[4] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-compute-nova-ffu-upgrade-10-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-compute-lvm-tempest-phase3/36/compute-0/var/log/extra/pstree.txt.gz

I assume this FFU test used to pass and the recent failure is a regression. What I don't know is whether this is due to a change in OSP-10's RHEL, or to something in OSP-13.

OSP 13 was retired on June 27, 2023. No further work is expected to occur on this issue.
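For readers following the detection logic: `systemctl is-enabled` reports a unit's enablement state, not whether it is currently running, so a daemon started on demand can be active while `is-enabled` still returns non-zero for iscsid.service. The sketch below mocks systemctl to reproduce the states seen in the logs; the `is-active` check is an illustrative suggestion, not the playbook's actual code.

```shell
# Mocked systemctl reproducing the state seen in the logs above:
# iscsid.service not enabled, iscsid.socket enabled, and the daemon
# itself running (started on demand during the OSP-10 volume test).
systemctl_mock() {
    case "$1 $2" in
        "is-enabled iscsid")        return 1 ;;  # what the FFU task checks
        "is-enabled iscsid.socket") return 0 ;;
        "is-active iscsid")         return 0 ;;  # what it would also need
        *)                          return 1 ;;
    esac
}

systemctl_mock is-enabled iscsid        && iscsid_enabled=true        || iscsid_enabled=false
systemctl_mock is-enabled iscsid.socket && iscsid_socket_enabled=true || iscsid_socket_enabled=false
systemctl_mock is-active  iscsid        && iscsid_active=true         || iscsid_active=false
echo "iscsid_enabled=$iscsid_enabled iscsid_socket_enabled=$iscsid_socket_enabled iscsid_active=$iscsid_active"
```

This matches the facts the playbook recorded (iscsid_enabled false, iscsid_socket_enabled true) while showing the daemon is nonetheless active, which is exactly the gap described in the comment above.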