Bug 2155917 - RHOSP 17.0 EDGE: Overcloud Multi-stack Spine Leaf deployment failed with FileNotFoundError: [Errno 2] No such file or directory: '/root/overcloud-deploy/central/central-passwords.yaml'
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Brendan Shephard
QA Contact: Sree
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-12-22 21:55 UTC by Sree
Modified: 2023-08-06 22:08 UTC
CC: 2 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-17 22:40:37 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-21027 0 None None None 2022-12-22 21:58:06 UTC

Description Sree 2022-12-22 21:55:31 UTC
Description of problem:

The Edge deployment failed at the Overcloud Multi-stack Spine Leaf deployment step; the export command failed with the error below:

FileNotFoundError: [Errno 2] No such file or directory: '/root/overcloud-deploy/central/central-passwords.yaml'

Command executed: sudo --preserve-env openstack overcloud export --force-overwrite --stack central --output-file /home/stack/central-export.yaml
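The path in the error hints at the cause: tripleoclient derives its default working directory from the invoking user's home, and under sudo (depending on sudoers settings such as env_reset/always_set_home) HOME can resolve to /root rather than /home/stack. A minimal illustration, assuming the derivation sketched below (the exact tripleoclient logic may differ):

```python
import os

def default_working_dir(stack: str) -> str:
    # Assumed for illustration: the default working dir is derived from
    # the current user's home; the real tripleoclient logic may differ.
    return os.path.join(os.path.expanduser("~"), "overcloud-deploy", stack)

# If sudo resets HOME to /root, the passwords file is looked up there:
os.environ["HOME"] = "/root"
assert default_working_dir("central") == "/root/overcloud-deploy/central"

# Run as the stack user, the same lookup lands under /home/stack:
os.environ["HOME"] = "/home/stack"
assert default_working_dir("central") == "/home/stack/overcloud-deploy/central"
```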

Version-Release number of selected component (if applicable):
17.0

How reproducible:
100% reproducible

Steps to Reproduce:
1. Deploy an RHOSP 17.0 multistack deployment with Controller: 3, Compute: 2, FreeIPA: 1, TLS everywhere, extending to 2 DCN nodes.
2. Network protocol IPv4, no external storage.

Actual results:

hypervisor | FAILED | rc=1 >>
Exception occured while running the command
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/tripleoclient/command.py", line 32, in run
    super(Command, self).run(parsed_args)
  File "/usr/lib/python3.9/site-packages/osc_lib/command/command.py", line 39, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python3.9/site-packages/cliff/command.py", line 186, in run
    return_code = self.take_action(parsed_args) or 0
  File "/usr/lib/python3.9/site-packages/tripleoclient/v1/overcloud_export.py", line 105, in take_action
    data = export.export_overcloud(
  File "/usr/lib/python3.9/site-packages/tripleoclient/export.py", line 254, in export_overcloud
    data = export_passwords(working_dir, stack, excludes)
  File "/usr/lib/python3.9/site-packages/tripleoclient/export.py", line 54, in export_passwords
    with open(passwords_file) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/root/overcloud-deploy/central/central-passwords.yaml'
[Errno 2] No such file or directory: '/root/overcloud-deploy/central/central-passwords.yaml'non-zero return code

Expected results:

Overcloud multistack deployment successful

Additional info:

Comment 3 Brendan Shephard 2022-12-23 00:19:42 UTC
Hey Sree,

Can we re-run that command without sudo? I don't think anything there needs privilege escalation now in OSP 17. So just:
openstack overcloud export --force-overwrite --stack central --output-file /home/stack/central-export.yaml

As the stack user. Does that work?

If so, we'll just need to adjust those DCN jobs to remove the sudo --preserve-env.

Comment 6 Brendan Shephard 2023-01-17 21:12:50 UTC
Hey Sree,

This is an entirely different error now, unrelated to the initial error reported on this BZ. It would be best to raise a new BZ for this new problem.

So the new error is:
2023-01-17 02:31:49.543837 | 525400d8-78eb-88de-f506-000000000100 |       TASK | Nova: Manage aggregate and availability zone and add hosts to the zone
2023-01-17 02:31:52.353463 | 525400d8-78eb-88de-f506-000000000100 |      FATAL | Nova: Manage aggregate and availability zone and add hosts to the zone | undercloud | error={"changed": false, "extra_data": {"data": null, "details": "Compute host dcn1-compute-1.redhat.local could not be found.", "response": "{\"itemNotFound\": {\"code\": 404, \"message\": \"Compute host dcn1-compute-1.redhat.local could not be found.\"}}"}, "msg": "ResourceNotFound: 404: Client Error for url: https://overcloud.redhat.local:13774/v2.1/os-aggregates/5/action, Compute host dcn1-compute-1.redhat.local could not be found."}
2023-01-17 02:31:52.357508 | 525400d8-78eb-88de-f506-000000000100 |     TIMING | Nova: Manage aggregate and availability zone and add hosts to the zone | undercloud | 0:21:07.150875 | 2.81s

So we can set the new BZ component to tripleo-heat-templates:
https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/deployment/nova/nova-az-config.yaml#L71-L87

Comment 7 Brendan Shephard 2023-01-17 22:40:37 UTC
I don't see any dcn1-compute-1 in the environment, by the way.

These are the two nodes it collected logs from:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/staging/DFG-edge-deployment-17.0-rhel-virthost-ipv4-3cont-2comp-2leafs-x-2comp-tls_everywhere-routed_provider_nets-ovn-naz/29/site-compute-0/etc/hostname.gz

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/staging/DFG-edge-deployment-17.0-rhel-virthost-ipv4-3cont-2comp-2leafs-x-2comp-tls_everywhere-routed_provider_nets-ovn-naz/29/site-compute-1/etc/hostname.gz


But on the hypervisor, I can see that other nodes exist. I'm not sure why that job didn't collect logs from those nodes:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/staging/DFG-edge-deployment-17.0-rhel-virthost-ipv4-3cont-2comp-2leafs-x-2comp-tls_everywhere-routed_provider_nets-ovn-naz/29/hypervisor/var/log/extra/virsh-list.txt.gz

Central site seems to deploy fine:

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/staging/DFG-edge-deployment-17.0-rhel-virthost-ipv4-3cont-2comp-2leafs-x-2comp-tls_everywhere-routed_provider_nets-ovn-naz/29/site-undercloud-0/home/stack/overcloud_install.log.gz
PLAY RECAP *********************************************************************
central-compute0-0         : ok=476  changed=191  unreachable=0    failed=0    skipped=213  rescued=0    ignored=1   
central-compute0-1         : ok=473  changed=191  unreachable=0    failed=0    skipped=213  rescued=0    ignored=1   
central-controller0-0      : ok=656  changed=262  unreachable=0    failed=0    skipped=227  rescued=0    ignored=1   
central-controller0-1      : ok=655  changed=255  unreachable=0    failed=0    skipped=228  rescued=0    ignored=1   
central-controller0-2      : ok=655  changed=255  unreachable=0    failed=0    skipped=228  rescued=0    ignored=1   
localhost                  : ok=1    changed=0    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0   
undercloud                 : ok=938  changed=338  unreachable=0    failed=0    skipped=223  rescued=64   ignored=1   


So we really need to collect the logs from those other nodes, since that's where the failure occurs:
2023-01-17 02:31:49,544 p=200645 u=stack n=ansible | 2023-01-17 02:31:49.543837 | 525400d8-78eb-88de-f506-000000000100 |       TASK | Nova: Manage aggregate and availability zone and add hosts to the zone
2023-01-17 02:31:52,356 p=200645 u=stack n=ansible | 2023-01-17 02:31:52.353463 | 525400d8-78eb-88de-f506-000000000100 |      FATAL | Nova: Manage aggregate and availability zone and add hosts to the zone | undercloud | error={"changed": false, "extra_data": {"data": null, "details": "Compute host dcn1-compute-1.redhat.local could not be found.", "response": "{\"itemNotFound\": {\"code\": 404, \"message\": \"Compute host dcn1-compute-1.redhat.local could not be found.\"}}"}, "msg": "ResourceNotFound: 404: Client Error for url: https://overcloud.redhat.local:13774/v2.1/os-aggregates/5/action, Compute host dcn1-compute-1.redhat.local could not be found."}
2023-01-17 02:31:52,365 p=200645 u=stack n=ansible | NO MORE HOSTS LEFT *************************************************************
2023-01-17 02:31:52,367 p=200645 u=stack n=ansible | PLAY RECAP *********************************************************************
2023-01-17 02:31:52,368 p=200645 u=stack n=ansible | dcn1-compute-0             : ok=476  changed=191  unreachable=0    failed=0    skipped=213  rescued=0    ignored=1   
2023-01-17 02:31:52,368 p=200645 u=stack n=ansible | dcn1-compute-1             : ok=473  changed=191  unreachable=0    failed=0    skipped=213  rescued=0    ignored=1   
2023-01-17 02:31:52,369 p=200645 u=stack n=ansible | dcn1-network-0             : ok=399  changed=156  unreachable=0    failed=0    skipped=190  rescued=0    ignored=1   
2023-01-17 02:31:52,369 p=200645 u=stack n=ansible | dcn1-network-1             : ok=399  changed=156  unreachable=0    failed=0    skipped=190  rescued=0    ignored=1   
2023-01-17 02:31:52,370 p=200645 u=stack n=ansible | localhost                  : ok=0    changed=0    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0   
2023-01-17 02:31:52,370 p=200645 u=stack n=ansible | undercloud                 : ok=690  changed=138  unreachable=0    failed=1    skipped=279  rescued=36   ignored=1   


The error means that dcn1-compute-1.redhat.local isn't registered with Nova, so we can't assign it to the aggregate and availability zone. To understand why, we need logs from that node to see whether nova_compute is running and working fine.
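As a rough sketch of what the failing task runs into (hypothetical names; Nova's real API is HTTP-based and returns the 404 seen in the play output): a host can only be added to an aggregate once Nova already has a compute service record for it.

```python
# Hypothetical sketch: adding a host to an aggregate requires that the
# host is already registered as a Nova compute service.
def add_host_to_aggregate(registered_hosts, aggregate, host):
    if host not in registered_hosts:
        # Nova answers HTTP 404: "Compute host ... could not be found."
        raise LookupError(f"Compute host {host} could not be found.")
    aggregate["hosts"].append(host)
    return aggregate

registered = {"dcn1-compute-0.redhat.local"}  # dcn1-compute-1 never checked in
agg = {"name": "dcn1", "hosts": []}
add_host_to_aggregate(registered, agg, "dcn1-compute-0.redhat.local")

try:
    add_host_to_aggregate(registered, agg, "dcn1-compute-1.redhat.local")
except LookupError as err:
    print(err)  # mirrors the 404 message from the play output
```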


Let's collect that info and put it all on a new BZ to avoid creating confusion on this one.

Comment 9 Brendan Shephard 2023-01-20 01:40:08 UTC
Hey, I'm not sure where that file lives:
jobs/DFG/edge/stages/overcloud_deploy_spine_leaf_multistack.groovy.inc

But something is clearly still trying to use sudo (or maybe Ansible using become: true, or --become), since it's trying to access a file under /root.
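One way to hunt down the lingering escalation, sketched against a stand-in file since the real CI repo layout is only partially known here:

```shell
# Illustration only: create a stand-in for the groovy include named above,
# then grep for anything in the job sources that still escalates privileges.
mkdir -p /tmp/jobs/DFG/edge/stages
cat > /tmp/jobs/DFG/edge/stages/overcloud_deploy_spine_leaf_multistack.groovy.inc <<'EOF'
sh "sudo --preserve-env openstack overcloud export --stack central"
EOF
grep -rnE 'sudo|become: ?true|--become' /tmp/jobs/DFG/edge/stages/
```

Running the same grep over the real jobs/ checkout should point at whichever stage still wraps the export in sudo or an Ansible become.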

