Description of problem:

The `openstack overcloud support report collect $nodes` command appears to be broken in RHOSP 16. I have tested this with both 16.1.6 and the latest 16.1.7 and see the same issue. Below is the information I collected while digging into this; the relevant error appears to be `Workflow failed due to message status. Status:FAILED Message:None`.

~~~
(undercloud) [stack@director ~]$ openstack workflow execution input show e4706892-0a63-4dfb-8ff2-96c85feb320a
{
    "container": "overcloud_support",
    "server_name": "overcloud-controller-0",
    "concurrency": 5,
    "timeout": 1800,
    "queue_name": "tripleo"
}

(undercloud) [stack@director ~]$ openstack task execution show 59869d17-d16a-4fb1-8246-6c348c6b76c1
+-----------------------+--------------------------------------------------------------------+
| Field                 | Value                                                              |
+-----------------------+--------------------------------------------------------------------+
| ID                    | 59869d17-d16a-4fb1-8246-6c348c6b76c1                               |
| Name                  | send_message                                                       |
| Workflow name         | tripleo.support.v1.collect_logs                                    |
| Workflow namespace    |                                                                    |
| Workflow Execution ID | 58b6f3d2-1289-468c-b353-21182ee7e50c                               |
| State                 | ERROR                                                              |
| State info            | Workflow failed due to message status. Status:FAILED Message:None  |
| Created at            | 2021-08-08 09:42:36                                                |
| Updated at            | 2021-08-08 09:42:38                                                |
+-----------------------+--------------------------------------------------------------------+

Failure caused by error in tasks: send_message

  send_message [task_ex_id=c05fc811-69b4-442e-a77b-3c90f194c760] -> Workflow failed due to message status. Status:FAILED Message:{'result': 'Failure caused by error in tasks: send_message\n\n  send_message [task_ex_id=59869d17-d16a-4fb1-8246-6c348c6b76c1] -> Workflow failed due to message status. Status:FAILED Message:None\n    [wf_ex_id=ce9cbdf3-aebf-4d53-974d-a2bd6f958971, idx=0]: Workflow failed due to message status. Status:FAILED Message:None\n', 'type': 'tripleo.deployment.v1.fetch_logs', 'status': 'FAILED', 'message': None}
    [wf_ex_id=8cb6ae47-620c-45ed-83c4-621d6fc9f474, idx=0]: Workflow failed due to message status. Status:FAILED Message:{'result': 'Failure caused by error in tasks: send_message\n\n  send_message [task_ex_id=59869d17-d16a-4fb1-8246-6c348c6b76c1] -> Workflow failed due to message status. Status:FAILED Message:None\n    [wf_ex_id=ce9cbdf3-aebf-4d53-974d-a2bd6f958971, idx=0]: Workflow failed due to message status. Status:FAILED Message:None\n', 'type': 'tripleo.deployment.v1.fetch_logs', 'status': 'FAILED', 'message': None}

(undercloud) [stack@director ~]$ openstack task execution result show 59869d17-d16a-4fb1-8246-6c348c6b76c1
{
    "result": "Workflow failed due to message status. Status:FAILED Message:None",
    "payload": {
        "status": "FAILED",
        "message": null,
        "root_execution_id": "e4706892-0a63-4dfb-8ff2-96c85feb320a",
        "execution_id": "58b6f3d2-1289-468c-b353-21182ee7e50c",
        "plan_name": null,
        "deployment_status": null
    },
    "swift_message": {
        "type": "tripleo.deployment.v1.fetch_logs",
        "payload": {
            "status": "FAILED",
            "message": null,
            "root_execution_id": "e4706892-0a63-4dfb-8ff2-96c85feb320a",
            "execution_id": "58b6f3d2-1289-468c-b353-21182ee7e50c",
            "plan_name": null,
            "deployment_status": null
        }
    },
    "deployment_status_message": {
        "deployment_status": null,
        "workflow_status": {
            "type": "tripleo.deployment.v1.fetch_logs",
            "payload": {
                "status": "FAILED",
                "message": null,
                "root_execution_id": "e4706892-0a63-4dfb-8ff2-96c85feb320a",
                "execution_id": "58b6f3d2-1289-468c-b353-21182ee7e50c",
                "plan_name": null,
                "deployment_status": null
            }
        }
    },
    "container": "None-messages"
}

(undercloud) [stack@director ~]$ openstack workflow execution output show e4706892-0a63-4dfb-8ff2-96c85feb320a
{
    "result": "Failure caused by error in tasks: send_message\n\n  send_message [task_ex_id=c05fc811-69b4-442e-a77b-3c90f194c760] -> Workflow failed due to message status. Status:FAILED Message:{'result': 'Failure caused by error in tasks: send_message\\n\\n  send_message [task_ex_id=59869d17-d16a-4fb1-8246-6c348c6b76c1] -> Workflow failed due to message status. Status:FAILED Message:None\\n    [wf_ex_id=ce9cbdf3-aebf-4d53-974d-a2bd6f958971, idx=0]: Workflow failed due to message status. Status:FAILED Message:None\\n', 'type': 'tripleo.deployment.v1.fetch_logs', 'status': 'FAILED', 'message': None}\n    [wf_ex_id=8cb6ae47-620c-45ed-83c4-621d6fc9f474, idx=0]: Workflow failed due to message status. Status:FAILED Message:{'result': 'Failure caused by error in tasks: send_message\\n\\n  send_message [task_ex_id=59869d17-d16a-4fb1-8246-6c348c6b76c1] -> Workflow failed due to message status. Status:FAILED Message:None\\n    [wf_ex_id=ce9cbdf3-aebf-4d53-974d-a2bd6f958971, idx=0]: Workflow failed due to message status. Status:FAILED Message:None\\n', 'type': 'tripleo.deployment.v1.fetch_logs', 'status': 'FAILED', 'message': None}\n",
    "servers_with_name": [
        {
            "id": "52bdf8fd-32bb-4993-8e6c-6f289403e777",
            "name": "overcloud-controller-0",
            "status": "ACTIVE",
            "tenant_id": "295f9ba6ab854b17bf94b48a9430a2c4",
            "user_id": "0a24795ea81746deac395ce6e346f88a",
            "metadata": {},
            "hostId": "151907f9e0993aa3494643c93ac32bb5516bc0941b2669d22b902fa0",
            "image": {"id": "6a26d179-83f2-41aa-8cba-c02a796c012f", "links": [{"rel": "bookmark", "href": "https://192.168.24.2:13774/images/6a26d179-83f2-41aa-8cba-c02a796c012f"}]},
            "flavor": {"vcpus": 1, "ram": 4096, "disk": 40, "ephemeral": 0, "swap": 0, "original_name": "control", "extra_specs": {"capabilities:profile": "control", "resources:CUSTOM_BAREMETAL": "1", "resources:DISK_GB": "0", "resources:MEMORY_MB": "0", "resources:VCPU": "0"}},
            "created": "2021-05-23T04:35:20Z",
            "updated": "2021-05-23T04:38:54Z",
            "addresses": {"ctlplane": [{"version": 4, "addr": "192.168.24.16", "OS-EXT-IPS:type": "fixed", "OS-EXT-IPS-MAC:mac_addr": "0a:4b:07:a9:17:04"}]},
            "accessIPv4": "",
            "accessIPv6": "",
            "links": [{"rel": "self", "href": "https://192.168.24.2:13774/v2.1/servers/52bdf8fd-32bb-4993-8e6c-6f289403e777"}, {"rel": "bookmark", "href": "https://192.168.24.2:13774/servers/52bdf8fd-32bb-4993-8e6c-6f289403e777"}],
            "OS-DCF:diskConfig": "MANUAL",
            "progress": 0,
            "OS-EXT-AZ:availability_zone": "nova",
            "config_drive": "True",
            "key_name": "default",
            "OS-SRV-USG:launched_at": "2021-05-23T04:38:54.000000",
            "OS-SRV-USG:terminated_at": null,
            "OS-EXT-SRV-ATTR:host": "director.localdomain",
            "OS-EXT-SRV-ATTR:instance_name": "instance-00000004",
            "OS-EXT-SRV-ATTR:hypervisor_hostname": "9d0e05d3-0c11-40ea-9239-6cbac4193a72",
            "OS-EXT-SRV-ATTR:reservation_id": "r-luc0pkst",
            "OS-EXT-SRV-ATTR:launch_index": 0,
            "OS-EXT-SRV-ATTR:hostname": "overcloud-controller-0",
            "OS-EXT-SRV-ATTR:kernel_id": "53aefa2c-110e-491c-84af-a935e7d2fcbc",
            "OS-EXT-SRV-ATTR:ramdisk_id": "17973d34-943b-4d27-91a1-9f9a97cc15ad",
            "OS-EXT-SRV-ATTR:root_device_name": "/dev/sda",
            "OS-EXT-STS:task_state": null,
            "OS-EXT-STS:vm_state": "active",
            "OS-EXT-STS:power_state": 1,
            "os-extended-volumes:volumes_attached": [],
            "locked": false,
            "locked_reason": null,
            "description": null,
            "tags": [],
            "trusted_image_certificates": null,
            "host_status": "UP",
            "security_groups": [{"name": "default"}]
        }
    ],
    "type": "tripleo.support.v1.fetch_logs.collect_logs_on_servers",
    "status": "FAILED",
    "message": {
        "result": "Failure caused by error in tasks: send_message\n\n  send_message [task_ex_id=59869d17-d16a-4fb1-8246-6c348c6b76c1] -> Workflow failed due to message status. Status:FAILED Message:None\n    [wf_ex_id=ce9cbdf3-aebf-4d53-974d-a2bd6f958971, idx=0]: Workflow failed due to message status. Status:FAILED Message:None\n",
        "type": "tripleo.deployment.v1.fetch_logs",
        "status": "FAILED",
        "message": null
    }
}

(undercloud) [stack@director ~]$ swift list overcloud_support
~~~

Things to note: running the sos report manually on the overcloud nodes works fine, and a while loop watching for the sos process shows that tripleo-client never even attempts to start it (see the sketch below).

Version-Release number of selected component (if applicable):
RHOSP 16.1.7

How reproducible:
Every time

Steps to Reproduce:
1. openstack overcloud support report collect $nodes
2. Wait for the command to time out
3.

Actual results:
Command times out
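For reference, this is a minimal sketch of the kind of check described above; the ssh user and the controller's ctlplane IP are assumptions taken from this environment, not fixed values:

~~~
# Hypothetical sketch of the while-loop check mentioned above: poll one
# overcloud node while the support command runs and report whether an
# sos/sosreport process ever appears.
while true; do
    ssh heat-admin@192.168.24.16 'pgrep -af sos || echo "no sos process running"'
    sleep 5
done
~~~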
This happens because the support workflow uses deploy_on_servers:
https://github.com/openstack/tripleo-common/blob/stable/train/workbooks/support.yaml#L23

That ultimately relies on os-collect-config running on the overcloud nodes:
https://github.com/openstack/tripleo-common/blob/stable/train/workbooks/deployment.yaml#L24
https://github.com/openstack/tripleo-common/blob/stable/train/setup.cfg#L94
https://github.com/openstack/tripleo-common/blob/stable/train/tripleo_common/actions/deployment.py#L39-L41

The options I see here:

1. Since we don't use Mistral in Ussuri at all, we could backport this feature from Ussuri and just use Ansible directly (roughly as sketched below):
https://github.com/openstack/python-tripleoclient/blob/stable/ussuri/tripleoclient/v2/overcloud_support.py
https://github.com/openstack/tripleo-ansible/blob/stable/ussuri/tripleo_ansible/playbooks/cli-support-collect-logs.yaml

2. Re-write the workbook. The easiest approach, I think, would be to use the tripleo.ansible-playbook action and backport the playbook from Ussuri:
https://github.com/openstack/tripleo-ansible/blob/stable/ussuri/tripleo_ansible/playbooks/cli-support-collect-logs.yaml
Then call it similar to how we call other playbooks, for example:
https://github.com/openstack/tripleo-common/blob/stable/train/workbooks/access.yaml#L176-L190

Both have their challenges. With option 1, the Ansible inventory and ssh key both live in /var/lib/mistral, which we don't reliably have access to as the stack user. With option 2, we would be investing effort in re-writing a Mistral workbook with no intention of continuing to use Mistral going forward.

I'll look into it and get back to you with some reviews.
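To make option 1 concrete, a hedged sketch of driving the Ussuri playbook directly with Ansible might look like the following. The playbook location, inventory path, key path, and extra variable names (server_name, sos_destination) are assumptions for illustration, not a confirmed interface, and the inventory/key access problem noted above still applies:

~~~
# Hypothetical invocation of the backported Ussuri support playbook.
# All paths and -e variable names here are assumptions.
ansible-playbook \
    -i ~/tripleo-ansible-inventory.yaml \
    --private-key ~/.ssh/id_rsa_tripleo \
    -e server_name=overcloud-controller-0 \
    -e sos_destination=/var/tmp/sos-reports \
    /usr/share/ansible/tripleo-playbooks/cli-support-collect-logs.yaml
~~~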
Yeah, I just reproduced the failure today. It is likely due to os-collect-config not running on the remote systems. Realistically, as of OSP 16, Ansible should be leveraged to handle this rather than Mistral; this mechanism was more useful for OSP 13, where we didn't have such methods in place.
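A quick way to confirm that dependency on a node (the node name and ssh user below are assumptions for this environment):

~~~
# Check whether os-collect-config is actually running on an overcloud node,
# and start/enable it if it is stopped or disabled.
ssh heat-admin@overcloud-controller-0 'sudo systemctl status os-collect-config --no-pager'
ssh heat-admin@overcloud-controller-0 'sudo systemctl enable --now os-collect-config'
~~~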
Looks like just enabling os-collect-config doesn't address it, because os-refresh-config then fails:

~~~
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: [2021-08-10 14:43:18,068] (os-refresh-config) [INFO] Starting phase pre-configure
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021 ----------------------- PROFILING -----------------------
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021 Target: pre-configure.d
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021 Script                                    Seconds
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021 ---------------------------------------  ----------
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021 --------------------- END PROFILING ---------------------
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: [2021-08-10 14:43:18,092] (os-refresh-config) [INFO] Completed phase pre-configure
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: [2021-08-10 14:43:18,092] (os-refresh-config) [INFO] Starting phase configure
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021 Running /usr/libexec/os-refresh-config/configure.d/20-os-apply-config
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: [2021/08/10 02:43:18 PM] [INFO] writing /etc/os-collect-config.conf
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: [2021/08/10 02:43:18 PM] [INFO] writing /var/run/heat-config/heat-config
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: [2021/08/10 02:43:18 PM] [INFO] success
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021 20-os-apply-config completed
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021 Running /usr/libexec/os-refresh-config/configure.d/50-heat-config-docker-cmd
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: docker runtime is deprecated in Stein and will be removed in Train.
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: Traceback (most recent call last):
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:   File "/usr/libexec/os-refresh-config/configure.d/50-heat-config-docker-cmd", line 62, in <module>
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:     sys.exit(main(sys.argv))
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:   File "/usr/libexec/os-refresh-config/configure.d/50-heat-config-docker-cmd", line 57, in main
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:     DOCKER_CMD
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:   File "/usr/lib/python3.6/site-packages/paunch/__init__.py", line 111, in cleanup
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:     r.delete_missing_configs(config_ids)
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:   File "/usr/lib/python3.6/site-packages/paunch/runner.py", line 211, in delete_missing_configs
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:     for conf_id in self.current_config_ids():
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:   File "/usr/lib/python3.6/site-packages/paunch/runner.py", line 86, in current_config_ids
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:     cmd, log=self.log, quiet=False, warn_only=True)
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:   File "/usr/lib/python3.6/site-packages/paunch/runner.py", line 49, in execute
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:     stderr=subprocess.PIPE)
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:   File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:     restore_signals, start_new_session)
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:   File "/usr/lib64/python3.6/subprocess.py", line 1364, in _execute_child
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:     raise child_exception_type(errno_num, err_msg, err_filename)
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: FileNotFoundError: [Errno 2] No such file or directory: 'docker': 'docker'
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: [2021-08-10 14:43:18,705] (os-refresh-config) [ERROR] during configure phase. [Command '['dib-run-parts', '/usr/libexec/os-refresh-config/configure.d']>
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: [2021-08-10 14:43:18,706] (os-refresh-config) [ERROR] Aborting...
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: Command failed, will not cache new data. Command 'os-refresh-config' returned non-zero exit status 1.
~~~
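The traceback shows the legacy 50-heat-config-docker-cmd hook still running under os-refresh-config while the docker binary no longer exists on an OSP 16 node (containers run under podman there). A quick way to confirm that on a node; host name and ssh user are assumptions:

~~~
# Confirm the legacy docker-cmd hook is still shipped while docker itself is gone.
ssh heat-admin@overcloud-controller-1 '
    ls /usr/libexec/os-refresh-config/configure.d/
    command -v docker || echo "docker: not installed"
    command -v podman
'
~~~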
Hey Lewis,

Give this a shot: https://review.opendev.org/c/openstack/python-tripleoclient/+/804182
You need the depends-on as well: https://review.opendev.org/c/openstack/tripleo-ansible/+/804181

As I mentioned, getting the existing inventory is difficult since the directory is owned by mistral. I'm relying on the user to have an inventory file, and if they don't then we'll return the exact command they can use to generate one. Unless Alex has a better idea for getting the inventory, I think this will get us through the Train cycle for now.

I'll update the documentation for this as well once we're happy with the solution.
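For context, generating a static inventory on the undercloud looks roughly like this; the stack name, ssh user, and output path are assumptions, and the exact command the patch prints back to the user may differ:

~~~
# Generate a static Ansible inventory for the overcloud on the undercloud.
# Stack name, ssh user, and output path are assumptions.
source ~/stackrc
tripleo-ansible-inventory --stack overcloud \
    --ansible_ssh_user tripleo-admin \
    --static-yaml-inventory ~/tripleo-ansible-inventory.yaml
~~~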
Thanks Brendan, I'll work with you on testing this. It's currently not working, but I'll provide the details offline and test the next patch set.
Figured out the inventory thing. Lewis hit an issue with the key:

~~~
TASK [Ensure sos is installed] ***********************************************************************************************************************************************************************************
fatal: [overcloud-controller-0]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: no such identity: /home/stack/.ssh/id_rsa_tripleo: No such file or directory\r\ntripleo-admin.24.16: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).", "unreachable": true}
~~~

I remembered that I had copied that key manually. So patch set 5 now gets the inventory and the key automatically:
https://review.opendev.org/c/openstack/python-tripleoclient/+/804182/5/tripleoclient/v1/overcloud_support.py

The playbook we're using relies on the key existing here:
https://github.com/openstack/tripleo-ansible/blob/stable/ussuri/tripleo_ansible/playbooks/cli-support-collect-logs.yaml#L73

So I just grab the key and write it to the filesystem in the expected location to minimise the number of changes.
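For anyone hitting this before the patch lands, a hedged manual workaround is to copy the overcloud ssh key to the location the playbook expects; the source path under /var/lib/mistral is an assumption based on a default OSP 16 undercloud:

~~~
# Put the overcloud admin key where the playbook expects it.
# The /var/lib/mistral source path is an assumption for a default deployment.
sudo cp /var/lib/mistral/overcloud/ssh_private_key /home/stack/.ssh/id_rsa_tripleo
sudo chown stack:stack /home/stack/.ssh/id_rsa_tripleo
chmod 600 /home/stack/.ssh/id_rsa_tripleo
~~~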
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.9 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8795