Bug 1991485 - openstack overcloud support report collect failing to work
Summary: openstack overcloud support report collect failing to work
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: z9
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Brendan Shephard
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-09 09:04 UTC by ldenny
Modified: 2022-12-07 20:25 UTC (History)
5 users

Fixed In Version: tripleo-ansible-0.5.1-1.20220201163747.902c3c8.el8ost tripleo-ansible-0.5.1-1.20220513083452.902c3c8.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-12-07 20:24:45 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 804181 0 None None None 2021-08-11 00:48:49 UTC
OpenStack gerrit 804182 0 None None None 2021-08-11 00:48:49 UTC
Red Hat Issue Tracker OSP-6904 0 None None None 2021-11-15 12:59:27 UTC
Red Hat Product Errata RHBA-2022:8795 0 None None None 2022-12-07 20:25:09 UTC

Description ldenny 2021-08-09 09:04:46 UTC
Description of problem:
The `openstack overcloud support report collect $nodes` command appears to be broken in RHOSP 16. I have tested this with both 16.1.6 and the latest 16.1.7 and hit the same issue.

Below is the information I collected while digging into this issue; the relevant error appears to be `Workflow failed due to message status. Status:FAILED Message:None`.

~~~
(undercloud) [stack@director ~]$ openstack workflow execution input show e4706892-0a63-4dfb-8ff2-96c85feb320a
{
    "container": "overcloud_support",
    "server_name": "overcloud-controller-0",
    "concurrency": 5,
    "timeout": 1800,
    "queue_name": "tripleo"
}

(undercloud) [stack@director ~]$ openstack task execution show 59869d17-d16a-4fb1-8246-6c348c6b76c1
+-----------------------+-------------------------------------------------------------------+
| Field                 | Value                                                             |
+-----------------------+-------------------------------------------------------------------+
| ID                    | 59869d17-d16a-4fb1-8246-6c348c6b76c1                              |
| Name                  | send_message                                                      |
| Workflow name         | tripleo.support.v1.collect_logs                                   |
| Workflow namespace    |                                                                   |
| Workflow Execution ID | 58b6f3d2-1289-468c-b353-21182ee7e50c                              |
| State                 | ERROR                                                             |
| State info            | Workflow failed due to message status. Status:FAILED Message:None |
| Created at            | 2021-08-08 09:42:36                                               |
| Updated at            | 2021-08-08 09:42:38                                               |
+-----------------------+-------------------------------------------------------------------+

Failure caused by error in tasks: send_message

  send_message [task_ex_id=c05fc811-69b4-442e-a77b-3c90f194c760] -> Workflow failed due to message status. Status:FAILED Message:{'result': 'Failure caused by error in tasks: send_message\n\n  send_message [task_ex_id=59869d17-d16a-4fb1-8246-6c348c6b76c1] -> Workflow failed due to message status. Status:FAILED Message:None\n    [wf_ex_id=ce9cbdf3-aebf-4d53-974d-a2bd6f958971, idx=0]: Workflow failed due to message status. Status:FAILED Message:None\n', 'type': 'tripleo.deployment.v1.fetch_logs', 'status': 'FAILED', 'message': None}
    [wf_ex_id=8cb6ae47-620c-45ed-83c4-621d6fc9f474, idx=0]: Workflow failed due to message status. Status:FAILED Message:{'result': 'Failure caused by error in tasks: send_message\n\n  send_message [task_ex_id=59869d17-d16a-4fb1-8246-6c348c6b76c1] -> Workflow failed due to message status. Status:FAILED Message:None\n    [wf_ex_id=ce9cbdf3-aebf-4d53-974d-a2bd6f958971, idx=0]: Workflow failed due to message status. Status:FAILED Message:None\n', 'type': 'tripleo.deployment.v1.fetch_logs', 'status': 'FAILED', 'message': None}

(undercloud) [stack@director ~]$ openstack task execution result show 59869d17-d16a-4fb1-8246-6c348c6b76c1
{
    "result": "Workflow failed due to message status. Status:FAILED Message:None",
    "payload": {
        "status": "FAILED",
        "message": null,
        "root_execution_id": "e4706892-0a63-4dfb-8ff2-96c85feb320a",
        "execution_id": "58b6f3d2-1289-468c-b353-21182ee7e50c",
        "plan_name": null,
        "deployment_status": null
    },
    "swift_message": {
        "type": "tripleo.deployment.v1.fetch_logs",
        "payload": {
            "status": "FAILED",
            "message": null,
            "root_execution_id": "e4706892-0a63-4dfb-8ff2-96c85feb320a",
            "execution_id": "58b6f3d2-1289-468c-b353-21182ee7e50c",
            "plan_name": null,
            "deployment_status": null
        }
    },
    "deployment_status_message": {
        "deployment_status": null,
        "workflow_status": {
            "type": "tripleo.deployment.v1.fetch_logs",
            "payload": {
                "status": "FAILED",
                "message": null,
                "root_execution_id": "e4706892-0a63-4dfb-8ff2-96c85feb320a",
                "execution_id": "58b6f3d2-1289-468c-b353-21182ee7e50c",
                "plan_name": null,
                "deployment_status": null
            }
        }
    },
    "container": "None-messages"
}

(undercloud) [stack@director ~]$ openstack workflow execution output show e4706892-0a63-4dfb-8ff2-96c85feb320a  
{
    "result": "Failure caused by error in tasks: send_message\n\n  send_message [task_ex_id=c05fc811-69b4-442e-a77b-3c90f194c760] -> Workflow failed due to message status. Status:FAILED Message:{'result': 'Failure caused by error in tasks: send_message\\n\\n  send_message [task_ex_id=59869d17-d16a-4fb1-8246-6c348c6b76c1] -> Workflow failed due to message status. Status:FAILED Message:None\\n    [wf_ex_id=ce9cbdf3-aebf-4d53-974d-a2bd6f958971, idx=0]: Workflow failed due to message status. Status:FAILED Message:None\\n', 'type': 'tripleo.deployment.v1.fetch_logs', 'status': 'FAILED', 'message': None}\n    [wf_ex_id=8cb6ae47-620c-45ed-83c4-621d6fc9f474, idx=0]: Workflow failed due to message status. Status:FAILED Message:{'result': 'Failure caused by error in tasks: send_message\\n\\n  send_message [task_ex_id=59869d17-d16a-4fb1-8246-6c348c6b76c1] -> Workflow failed due to message status. Status:FAILED Message:None\\n    [wf_ex_id=ce9cbdf3-aebf-4d53-974d-a2bd6f958971, idx=0]: Workflow failed due to message status. Status:FAILED Message:None\\n', 'type': 'tripleo.deployment.v1.fetch_logs', 'status': 'FAILED', 'message': None}\n",
    "servers_with_name": [
        {
            "id": "52bdf8fd-32bb-4993-8e6c-6f289403e777",
            "name": "overcloud-controller-0",
            "status": "ACTIVE",
            "tenant_id": "295f9ba6ab854b17bf94b48a9430a2c4",
            "user_id": "0a24795ea81746deac395ce6e346f88a",
            "metadata": {},
            "hostId": "151907f9e0993aa3494643c93ac32bb5516bc0941b2669d22b902fa0",
            "image": {
                "id": "6a26d179-83f2-41aa-8cba-c02a796c012f",
                "links": [
                    {
                        "rel": "bookmark",
                        "href": "https://192.168.24.2:13774/images/6a26d179-83f2-41aa-8cba-c02a796c012f"
                    }
                ]
            },
            "flavor": {
                "vcpus": 1,
                "ram": 4096,
                "disk": 40,
                "ephemeral": 0,
                "swap": 0,
                "original_name": "control",
                "extra_specs": {
                    "capabilities:profile": "control",
                    "resources:CUSTOM_BAREMETAL": "1",
                    "resources:DISK_GB": "0",
                    "resources:MEMORY_MB": "0",
                    "resources:VCPU": "0"
                }
            },
            "created": "2021-05-23T04:35:20Z",
            "updated": "2021-05-23T04:38:54Z",
            "addresses": {
                "ctlplane": [
                    {
                        "version": 4,
                        "addr": "192.168.24.16",
                        "OS-EXT-IPS:type": "fixed",
                        "OS-EXT-IPS-MAC:mac_addr": "0a:4b:07:a9:17:04"
                    }
                ]
            },
            "accessIPv4": "",
            "accessIPv6": "",
            "links": [
                {
                    "rel": "self",
                    "href": "https://192.168.24.2:13774/v2.1/servers/52bdf8fd-32bb-4993-8e6c-6f289403e777"
                },
                {
                    "rel": "bookmark",
                    "href": "https://192.168.24.2:13774/servers/52bdf8fd-32bb-4993-8e6c-6f289403e777"
                }
            ],
            "OS-DCF:diskConfig": "MANUAL",
            "progress": 0,
            "OS-EXT-AZ:availability_zone": "nova",
            "config_drive": "True",
            "key_name": "default",
            "OS-SRV-USG:launched_at": "2021-05-23T04:38:54.000000",
            "OS-SRV-USG:terminated_at": null,
            "OS-EXT-SRV-ATTR:host": "director.localdomain",
            "OS-EXT-SRV-ATTR:instance_name": "instance-00000004",
            "OS-EXT-SRV-ATTR:hypervisor_hostname": "9d0e05d3-0c11-40ea-9239-6cbac4193a72",
            "OS-EXT-SRV-ATTR:reservation_id": "r-luc0pkst",
            "OS-EXT-SRV-ATTR:launch_index": 0,
            "OS-EXT-SRV-ATTR:hostname": "overcloud-controller-0",
            "OS-EXT-SRV-ATTR:kernel_id": "53aefa2c-110e-491c-84af-a935e7d2fcbc",
            "OS-EXT-SRV-ATTR:ramdisk_id": "17973d34-943b-4d27-91a1-9f9a97cc15ad",
            "OS-EXT-SRV-ATTR:root_device_name": "/dev/sda",
            "OS-EXT-STS:task_state": null,
            "OS-EXT-STS:vm_state": "active",
            "OS-EXT-STS:power_state": 1,
            "os-extended-volumes:volumes_attached": [],
            "locked": false,
            "locked_reason": null,
            "description": null,
            "tags": [],
            "trusted_image_certificates": null,
            "host_status": "UP",
            "security_groups": [
                {
                    "name": "default"
                }
            ]
        }
    ],
    "type": "tripleo.support.v1.fetch_logs.collect_logs_on_servers",
    "status": "FAILED",
    "message": {
        "result": "Failure caused by error in tasks: send_message\n\n  send_message [task_ex_id=59869d17-d16a-4fb1-8246-6c348c6b76c1] -> Workflow failed due to message status. Status:FAILED Message:None\n    [wf_ex_id=ce9cbdf3-aebf-4d53-974d-a2bd6f958971, idx=0]: Workflow failed due to message status. Status:FAILED Message:None\n",
        "type": "tripleo.deployment.v1.fetch_logs",
        "status": "FAILED",
        "message": null
    }
}

(undercloud) [stack@director ~]$ swift list overcloud_support
~~~

Things to note: running the sos report manually on the overcloud nodes works fine, and running a while loop to check whether the sos process is ever started by tripleo-client shows that it does not even attempt to run.
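
The exact loop isn't shown here; a minimal sketch of the same check, run on an overcloud node while the collect command is in progress (assuming the process to watch for is sosreport), would be:

~~~
# Watch for a sos process that never appears while the undercloud command runs.
while true; do
    pgrep -fa sosreport && break   # prints the process and exits the loop if sos ever starts
    sleep 2
done
~~~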

Version-Release number of selected component (if applicable):
RHOSP16.1.7

How reproducible:
Every time

Steps to Reproduce:
1. openstack overcloud support report collect $nodes
2. Wait for the command to time out
3.

Actual results:
Command times out

Comment 1 Brendan Shephard 2021-08-09 10:45:46 UTC
This happens because the support workflow is using deploy_on_servers:
https://github.com/openstack/tripleo-common/blob/stable/train/workbooks/support.yaml#L23

This ultimately relies on os-collect-config running on the overcloud nodes:
https://github.com/openstack/tripleo-common/blob/stable/train/workbooks/deployment.yaml#L24
https://github.com/openstack/tripleo-common/blob/stable/train/setup.cfg#L94
https://github.com/openstack/tripleo-common/blob/stable/train/tripleo_common/actions/deployment.py#L39-L41

So the options I see here:

1. We don't use Mistral at all in Ussuri, so we could potentially backport this feature from Ussuri and just use Ansible directly:
https://github.com/openstack/python-tripleoclient/blob/stable/ussuri/tripleoclient/v2/overcloud_support.py
https://github.com/openstack/tripleo-ansible/blob/stable/ussuri/tripleo_ansible/playbooks/cli-support-collect-logs.yaml

2. Re-write the workbook for this. The easiest option, I think, would be to use the tripleo.ansible-playbook action and backport the playbook for this from Ussuri:
https://github.com/openstack/tripleo-ansible/blob/stable/ussuri/tripleo_ansible/playbooks/cli-support-collect-logs.yaml
Then call it similarly to how we call other playbooks. For example:
https://github.com/openstack/tripleo-common/blob/stable/train/workbooks/access.yaml#L176-L190


Both have their challenges. With Option 1, the Ansible inventory and SSH key are both in /var/lib/mistral, which we don't reliably have access to as the stack user.
With Option 2, we need to invest effort into re-writing the Mistral workbook even though there is no intention of continuing to use Mistral moving forward.
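
For illustration, Option 1 would boil down to something like the following on the undercloud, assuming the Ussuri playbook has been backported into /usr/share/ansible/tripleo-playbooks/; the inventory path, key path, and the server_name variable are placeholders rather than values confirmed from the playbook:

~~~
# Sketch of Option 1: drive the backported Ussuri playbook directly with Ansible.
# The inventory and key normally live under /var/lib/mistral (the challenge noted above);
# the paths and the extra-var name here are illustrative only.
ansible-playbook -i ~/tripleo-ansible-inventory.yaml \
    --private-key ~/.ssh/id_rsa_tripleo \
    -e server_name=overcloud-controller-0 \
    /usr/share/ansible/tripleo-playbooks/cli-support-collect-logs.yaml
~~~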


I'll look into it and get back to you with some reviews.

Comment 2 Alex Schultz 2021-08-09 22:09:40 UTC
Yeah, I just reproduced the failure today. It is likely due to os-collect-config not running on the remote systems. Realistically, as of OSP 16, Ansible should be leveraged to handle this rather than Mistral. This was more useful in OSP 13, where we didn't have such methods in place.
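
A quick way to confirm that condition on an overcloud node, assuming os-collect-config is managed as a systemd service (which the journal output in the next comment suggests):

~~~
# On an overcloud node: check whether os-collect-config is actually running.
sudo systemctl status os-collect-config
# Enabling it (what the next comment tries) would be:
sudo systemctl enable --now os-collect-config
~~~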

Comment 3 Alex Schultz 2021-08-10 14:45:42 UTC
Looks like just enabling os-collect-config doesn't address it, because there are errors when running os-refresh-config:

Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: [2021-08-10 14:43:18,068] (os-refresh-config) [INFO] Starting phase pre-configure
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021 ----------------------- PROFILING -----------------------
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021 Target: pre-configure.d
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021 Script                                     Seconds
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021 ---------------------------------------  ----------
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021 --------------------- END PROFILING ---------------------
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: [2021-08-10 14:43:18,092] (os-refresh-config) [INFO] Completed phase pre-configure
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: [2021-08-10 14:43:18,092] (os-refresh-config) [INFO] Starting phase configure
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021 Running /usr/libexec/os-refresh-config/configure.d/20-os-apply-config
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: [2021/08/10 02:43:18 PM] [INFO] writing /etc/os-collect-config.conf
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: [2021/08/10 02:43:18 PM] [INFO] writing /var/run/heat-config/heat-config
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: [2021/08/10 02:43:18 PM] [INFO] success
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021 20-os-apply-config completed
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: dib-run-parts Tue Aug 10 14:43:18 UTC 2021 Running /usr/libexec/os-refresh-config/configure.d/50-heat-config-docker-cmd
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: docker runtime is deprecated in Stein and will be removed in Train.
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: Traceback (most recent call last):
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:   File "/usr/libexec/os-refresh-config/configure.d/50-heat-config-docker-cmd", line 62, in <module>
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:     sys.exit(main(sys.argv))
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:   File "/usr/libexec/os-refresh-config/configure.d/50-heat-config-docker-cmd", line 57, in main
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:     DOCKER_CMD
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:   File "/usr/lib/python3.6/site-packages/paunch/__init__.py", line 111, in cleanup
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:     r.delete_missing_configs(config_ids)
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:   File "/usr/lib/python3.6/site-packages/paunch/runner.py", line 211, in delete_missing_configs
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:     for conf_id in self.current_config_ids():
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:   File "/usr/lib/python3.6/site-packages/paunch/runner.py", line 86, in current_config_ids
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:     cmd, log=self.log, quiet=False, warn_only=True)
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:   File "/usr/lib/python3.6/site-packages/paunch/runner.py", line 49, in execute
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:     stderr=subprocess.PIPE)
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:   File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:     restore_signals, start_new_session)
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:   File "/usr/lib64/python3.6/subprocess.py", line 1364, in _execute_child
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]:     raise child_exception_type(errno_num, err_msg, err_filename)
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: FileNotFoundError: [Errno 2] No such file or directory: 'docker': 'docker'
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: [2021-08-10 14:43:18,705] (os-refresh-config) [ERROR] during configure phase. [Command '['dib-run-parts', '/usr/libexec/os-refresh-config/configure.d']>
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: [2021-08-10 14:43:18,706] (os-refresh-config) [ERROR] Aborting...
Aug 10 14:43:18 overcloud-controller-1 os-collect-config[141682]: Command failed, will not cache new data. Command 'os-refresh-config' returned non-zero exit status 1.

Comment 4 Brendan Shephard 2021-08-11 01:09:04 UTC
Hey Lewis,

Give this a shot:
https://review.opendev.org/c/openstack/python-tripleoclient/+/804182

You need the depends-on as well:
https://review.opendev.org/c/openstack/tripleo-ansible/+/804181


As I mentioned, getting the existing inventory is difficult since the directory is owned by mistral. I'm relying on the user having an inventory file; if they don't, we return the exact command they can use to generate one.

Unless Alex has a better idea for getting the inventory, I think this will get us through the Train cycle for now. I'll update the documentation for this as well once we're happy with the solution.
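
For reference, generating a static inventory on the undercloud might look roughly like this; the exact flags vary between releases, so treat it as a sketch rather than the command the patch prints:

~~~
# Generate a static Ansible inventory for the overcloud plan (flags may differ per release).
source ~/stackrc
tripleo-ansible-inventory --plan overcloud \
    --static-yaml-inventory ~/tripleo-ansible-inventory.yaml
~~~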

Comment 5 ldenny 2021-08-11 02:07:14 UTC
Thanks Brendan, 

I'll work with you on testing this.

It's currently not working, but I'll provide the details offline and test the next patch set.

Comment 6 Brendan Shephard 2021-08-11 02:39:52 UTC
Figured out the inventory thing. 

Lewis hit an issue with the key:
TASK [Ensure sos is installed] ***********************************************************************************************************************************************************************************
fatal: [overcloud-controller-0]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: no such identity: /home/stack/.ssh/id_rsa_tripleo: No such file or directory\r\ntripleo-admin.24.16: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).", "unreachable": true}


I remembered that I had copied that key manually. Patch set 5 now gets the inventory and the key automatically:
https://review.opendev.org/c/openstack/python-tripleoclient/+/804182/5/tripleoclient/v1/overcloud_support.py

The playbook we're using relies on the key existing here:
https://github.com/openstack/tripleo-ansible/blob/stable/ussuri/tripleo_ansible/playbooks/cli-support-collect-logs.yaml#L73

So I just grab the key and write it to the filesystem in the expected location to minimise the number of changes.
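
The manual equivalent of what patch set 5 automates would be roughly the following, assuming the tripleo-admin key generated by the undercloud lives under /var/lib/mistral (the exact filename there is an assumption and may differ per release):

~~~
# Copy the undercloud-generated key to the location the playbook expects (path per the error above).
# The source filename under /var/lib/mistral is assumed; adjust to what exists on your undercloud.
sudo cp /var/lib/mistral/.ssh/tripleo-admin-rsa /home/stack/.ssh/id_rsa_tripleo
sudo chown stack:stack /home/stack/.ssh/id_rsa_tripleo
chmod 600 /home/stack/.ssh/id_rsa_tripleo
~~~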

Comment 22 errata-xmlrpc 2022-12-07 20:24:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.9 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8795

