Bug 1904220

Summary: openstack overcloud deploy fails during config-download phase
Product: Red Hat OpenStack Reporter: Itai Levy <itailev>
Component: openstack-tripleo-common Assignee: Rabi Mishra <ramishra>
Status: CLOSED ERRATA QA Contact: David Rosenfeld <drosenfe>
Severity: high Docs Contact:
Priority: high    
Version: 16.1 (Train) CC: ahleihel, cjeanner, drosenfe, hakhande, jhajyahy, jjoyce, jschluet, mburns, ramishra, slinaber, tvignaud
Target Milestone: z6 Keywords: Reopened, Triaged
Target Release: 16.1 (Train on RHEL 8.2)   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: openstack-tripleo-common-11.4.1-1.20210310124600.75bd92a.el8ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-05-26 13:49:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Itai Levy 2020-12-03 20:38:23 UTC
Description of problem:
Deploying a newly installed OSP 16.1 cloud fails during the config_download_deploy workflow.

Deploy command:
openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates \
--libvirt-type kvm \
-n /home/stack/templates/network_data.yaml \
-r /home/stack/templates/roles_data.yaml \
--validation-warnings-fatal \
-e /home/stack/templates/node-info.yaml \
-e /home/stack/containers-prepare-parameter.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovn-ha.yaml \
-e /home/stack/templates/network-environment.yaml \
-e /home/stack/templates/env-ovn.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml

Error:

Waiting for messages on queue 'tripleo' with no timeout.
The action raised an exception [action_ex_id=6ede5ba3-e87b-4ad0-b11f-c855d4cecaff, msg='ERROR: Software config with id None not found
Traceback (most recent call last):

  File "/usr/lib/python3.6/site-packages/heat/common/context.py", line 423, in wrapped
    return func(self, ctx, *args, **kwargs)

  File "/usr/lib/python3.6/site-packages/heat/engine/service.py", line 2187, in show_software_config
    return self.software_config.show_software_config(cnxt, config_id)

  File "/usr/lib/python3.6/site-packages/heat/engine/service_software_config.py", line 42, in show_software_config
    sc = software_config_object.SoftwareConfig.get_by_id(cnxt, config_id)

  File "/usr/lib/python3.6/site-packages/heat/objects/software_config.py", line 62, in get_by_id
    context, cls(), db_api.software_config_get(context, config_id))

  File "/usr/lib/python3.6/site-packages/heat/db/sqlalchemy/api.py", line 1166, in software_config_get
    config_id)

heat.common.exception.NotFound: Software config with id None not found
', action_cls='<class 'mistral.actions.action_factory.GetOvercloudConfig'>', attributes='{}', params='{'container': 'overcloud', 'container_config': 'overcloud-config'}']
Overcloud Endpoint: http://X.X.X.X:5000
Overcloud Horizon Dashboard URL: http://X.X.X.X:80/dashboard
Overcloud rc file: /home/stack/overcloudrc
Overcloud Deployed with error
Overcloud configuration failed.


 openstack task execution show 6219f988-a4c0-47a2-8dfc-fbbb8c606947
+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field                 | Value                                                                                                                                                                      |
+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ID                    | 6219f988-a4c0-47a2-8dfc-fbbb8c606947                                                                                                                                       |
| Name                  | get_config                                                                                                                                                                 |
| Workflow name         | tripleo.deployment.v1.config_download_deploy                                                                                                                               |
| Workflow namespace    |                                                                                                                                                                            |
| Workflow Execution ID | cbb6b29f-f27c-4aee-9a88-739087f0252c                                                                                                                                       |
| State                 | ERROR                                                                                                                                                                      |
| State info            | The action raised an exception [action_ex_id=da244cc3-4474-4e6f-805b-ff8eef8dd4c7, msg='ERROR: Software config with id None not found                                      |
|                       | Traceback (most recent call last):                                                                                                                                         |
|                       |                                                                                                                                                                            |
|                       |   File "/usr/lib/python3.6/site-packages/heat/common/context.py", line 423, in wrapped                                                                                     |
|                       |     return func(self, ctx, *args, **kwargs)                                                                                                                                |
|                       |                                                                                                                                                                            |
|                       |   File "/usr/lib/python3.6/site-packages/heat/engine/service.py", line 2187, in show_software_config                                                                       |
|                       |     return self.software_config.show_software_config(cnxt, config_id)                                                                                                      |
|                       |                                                                                                                                                                            |
|                       |   File "/usr/lib/python3.6/site-packages/heat/engine/service_software_config.py", line 42, in show_software_config                                                         |
|                       |     sc = software_config_object.SoftwareConfig.get_by_id(cnxt, config_id)                                                                                                  |
|                       |                                                                                                                                                                            |
|                       |   File "/usr/lib/python3.6/site-packages/heat/objects/software_config.py", line 62, in get_by_id                                                                           |
|                       |     context, cls(), db_api.software_config_get(context, config_id))                                                                                                        |
|                       |                                                                                                                                                                            |
|                       |   File "/usr/lib/python3.6/site-packages/heat/db/sqlalchemy/api.py", line 1166, in software_config_get                                                                     |
|                       |     config_id)                                                                                                                                                             |
|                       |                                                                                                                                                                            |
|                       | heat.common.exception.NotFound: Software config with id None not found                                                                                                     |
|                       | ', action_cls='<class 'mistral.actions.action_factory.GetOvercloudConfig'>', attributes='{}', params='{'container': 'overcloud', 'container_config': 'overcloud-config'}'] |
| Created at            | 2020-12-03 15:39:03                                                                                                                                                        |
| Updated at            | 2020-12-03 18:36:49                                                                                                                                                        |
+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


Version-Release number of selected component (if applicable):


RH-OSP 16.1
RHEL 8.2

$ rpm -qa | grep -i tripleo
ansible-tripleo-ipsec-9.2.1-0.20200311073016.0c8693c.el8ost.noarch
tripleo-ansible-0.5.1-1.20200914163925.el8ost.noarch
python3-tripleoclient-12.3.2-1.20200914164928.el8ost.noarch
openstack-tripleo-common-containers-11.4.1-1.20200914165651.el8ost.noarch
puppet-tripleo-11.5.0-1.20200914161840.f716ef5.el8ost.noarch
openstack-tripleo-common-11.4.1-1.20200914165651.el8ost.noarch
openstack-tripleo-puppet-elements-11.2.2-0.20200701163410.432518a.el8ost.noarch
openstack-tripleo-validations-11.3.2-1.20200914170825.el8ost.noarch
ansible-tripleo-ipa-0.2.1-1.20200813093411.3bb3c53.el8ost.noarch
ansible-role-tripleo-modify-image-1.2.1-0.20200804085623.1dffa21.el8ost.noarch
python3-tripleoclient-heat-installer-12.3.2-1.20200914164928.el8ost.noarch
python3-tripleo-common-11.4.1-1.20200914165651.el8ost.noarch
openstack-tripleo-heat-templates-11.3.2-1.20200914170156.el8ost.noarch
openstack-tripleo-image-elements-10.6.2-0.20200528043425.7dc0fa1.el8ost.noarch


How reproducible:


Steps to Reproduce:
1. Install undercloud 16.1
2. Prepare configuration files
3. Deploy

Actual results:
Overcloud Deployed with error
Overcloud configuration failed.

Expected results:


Additional info:
It seems the tripleo-admin user is created on the overcloud nodes.

Comment 1 Itai Levy 2020-12-06 10:04:20 UTC
Please let me know if a remote troubleshooting session is required to speed up investigation and resolution.

Comment 4 Itai Levy 2020-12-08 11:23:39 UTC
cat /etc/rhosp-release
Red Hat OpenStack Platform release 16.1.2 GA (Train)

Comment 5 Itai Levy 2020-12-09 16:05:26 UTC
It seems that the config-download working directory "overcloud" under /var/lib/mistral/, which should hold the Ansible configs, is missing. I'm not sure why it's not created by the director...

Comment 6 Adriano Petrich 2020-12-09 17:10:30 UTC
Good point! If you create it and run again, does it work?
Could you check whether the file deployment/mistral/mistral-engine-container-puppet.yaml under tht (tripleo-heat-templates) is there, and whether that directory is in the file (around line 128)?

Comment 7 Itai Levy 2020-12-09 17:26:34 UTC
Just to clarify, /var/lib/mistral/ is there and it includes the "ansible_fact_cache" directory; however, the "overcloud" directory is missing.
I already tried creating the "overcloud" directory under /var/lib/mistral/, changing its owner to the mistral user, and repeating overcloud deploy --stack-only followed by overcloud deploy --config-download-only, but it didn't help...
I will try deleting the overcloud stack and recreating it...


The file deployment/mistral/mistral-engine-container-puppet.yaml is there and includes the mistral_engine container volume, but it is mounted read-only:

      docker_config:
        step_4:
          mistral_engine:
            image: {get_param: ContainerMistralEngineImage}
            net: host
            privileged: false
            restart: always
            healthcheck: {get_attr: [ContainersCommon, healthcheck_rpc_port]}
            volumes:
              list_concat:
                - {get_attr: [ContainersCommon, volumes]}
                -
                  - /run:/run
                  - /var/lib/kolla/config_files/mistral_engine.json:/var/lib/kolla/config_files/config.json:ro
                  - /var/lib/config-data/puppet-generated/mistral:/var/lib/kolla/config_files/src:ro
                  - /var/log/containers/mistral:/var/log/mistral:z
                  - /var/lib/mistral:/var/lib/mistral:ro
                  - /usr/share/ansible/:/usr/share/ansible/:ro
                  - /usr/share/openstack-tripleo-validations:/usr/share/openstack-tripleo-validations:ro

Who should create the "overcloud" directory under /var/lib/mistral and place the Ansible config files there?

Comment 8 Itai Levy 2020-12-09 18:03:18 UTC
Adriano, creating the directory didn't help.
The only directory that is populated with Ansible files per node is /var/lib/mistral/ansible_fact_cache.

Who should create the "overcloud" directory under /var/lib/mistral and place the Ansible config files there? Is it the mistral engine container?

How should we proceed?
Is the error I get related to the OS::TripleO::SoftwareDeployment resource?

Comment 9 Rabi Mishra 2020-12-10 02:52:20 UTC
This is failing when downloading playbooks, either from [1] or [2], as it's looking for a config with id None. So you won't have the config-download playbooks yet. Is your heat stack in a good state?

The details in this BZ are not enough to troubleshoot. Something is surely messed up. Is this a customer-reported issue? I don't see a support case linked. Can you provide the undercloud heat DB dump and sosreport to investigate?

[1] https://github.com/openstack/tripleo-common/blob/stable/train/tripleo_common/utils/config.py#L231
[2] https://github.com/openstack/tripleo-common/blob/stable/train/tripleo_common/utils/config.py#L94

Comment 10 Itai Levy 2020-12-10 13:08:57 UTC
Hi Rabi, 

This is not a customer-reported issue; we see it in our lab.
As you can see, the stack is in the CREATE_COMPLETE state and there are no stack failures:

(undercloud) [stack@rhosp-director ~]$ openstack stack list
+--------------------------------------+------------+----------------------------------+-----------------+----------------------+--------------+
| ID                                   | Stack Name | Project                          | Stack Status    | Creation Time        | Updated Time |
+--------------------------------------+------------+----------------------------------+-----------------+----------------------+--------------+
| b41bdfb3-a893-46b5-becf-462af4b89098 | overcloud  | 109aa1ef23ec4e8091da354d6c465e24 | CREATE_COMPLETE | 2020-12-10T11:10:10Z | None         |
+--------------------------------------+------------+----------------------------------+-----------------+----------------------+--------------+

(undercloud) [stack@rhosp-director ~]$ openstack stack failures list overcloud
(undercloud) [stack@rhosp-director ~]$

See the attached file, which includes:
- deployment yaml files 
- sosreport 
- db dumps 

https://drive.google.com/file/d/1hl8fKDedN9_sT1MO2A9nSaHjsVWIz1Yz/view?usp=sharing

Your assistance is appreciated. 

thanks
Itai

Comment 11 Rabi Mishra 2020-12-10 13:50:33 UTC
I don't see any issue with heat. I suspect it's probably some issue with swift (I remember something similar being reported earlier), given what I see below in the mistral logs. Maybe you can check whether swift is working on the undercloud? Try to clean up the overcloud-config container and run the deploy again.


2020-12-10 11:32:22.947 7 INFO workflow_trace [req-f23ba456-c129-441a-aeb0-f72f9c1ffa45 80ac31972a594c3aa353cf98e599b0e8 109aa1ef23ec4e8091da354d6c465e24 - default default] Workflow 'tripleo.swift.v1.container_exists' [RUNNING -> ERROR, msg=None] (execution_id=9dadcee0-be3b-42d3-acc3-2ca853c8832c)
2020-12-10 11:32:22.994 7 INFO mistral.engine.engine_server [req-f23ba456-c129-441a-aeb0-f72f9c1ffa45 80ac31972a594c3aa353cf98e599b0e8 109aa1ef23ec4e8091da354d6c465e24 - default default] Received RPC request 'on_action_complete'[action_ex_id=9dadcee0-be3b-42d3-acc3-2ca853c8832c, result=Result [data=None, error=Failed subworkflow [execution_id=9dadcee0-be3b-42d3-acc3-2ca853c8832c], cancel=False]]
2020-12-10 11:32:23.010 7 INFO workflow_trace [req-f23ba456-c129-441a-aeb0-f72f9c1ffa45 80ac31972a594c3aa353cf98e599b0e8 109aa1ef23ec4e8091da354d6c465e24 - default default] Task 'verify_container_doesnt_exist' (0b0746cb-c289-4db5-8888-f159b88828f2) [RUNNING -> ERROR, msg=None] (execution_id=79c0525c-6bd9-4c94-acf8-fae8045f300a)
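
For reference, lines like these can be filtered out of a larger mistral engine log with a short script. The following is only a sketch: the regular expression is derived from the three log lines quoted here, not from any documented mistral log format, and the names TRANSITION and failed_transitions are made up for this example.

```python
import re

# Matches workflow_trace state transitions as they appear in the lines above:
#   Workflow 'name' [RUNNING -> ERROR, msg=None] (execution_id=...)
#   Task 'name' (task-id) [RUNNING -> ERROR, msg=None] (execution_id=...)
# The pattern is an assumption based on those samples only.
TRANSITION = re.compile(
    r"(?P<kind>Workflow|Task) '(?P<name>[^']+)'"
    r"(?: \((?P<task_id>[0-9a-f-]+)\))?"
    r" \[(?P<old>\w+) -> (?P<new>\w+), msg=(?P<msg>.*?)\]"
    r" \(execution_id=(?P<execution_id>[0-9a-f-]+)\)"
)


def failed_transitions(lines):
    """Yield (kind, name, execution_id) for each transition into ERROR."""
    for line in lines:
        match = TRANSITION.search(line)
        if match and match.group("new") == "ERROR":
            yield (
                match.group("kind"),
                match.group("name"),
                match.group("execution_id"),
            )
```

Calling list(failed_transitions(open(path))) on an engine log (the path depends on the installation) would list every workflow and task that moved into ERROR, which makes it easier to spot the first failure in a long run.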

Comment 12 Itai Levy 2020-12-10 14:17:22 UTC
Here is an updated link for the file download:
https://drive.google.com/file/d/19mCMW7UJB-FxnhkNwg7ys_S3ZMLFNLl6/view?usp=sharing

Comment 13 Itai Levy 2020-12-10 14:22:37 UTC
Swift containers seem to be up:

 podman ps -a | grep -i swift
60441dea88e4  rhosp-director.ctlplane.localdomain:8787/rhosp-rhel8/openstack-ironic-conductor:16.1           /usr/bin/bootstra...  5 hours ago  Exited (0) 5 hours ago         create_swift_temp_url_key
756a4b93d47b  rhosp-director.ctlplane.localdomain:8787/rhosp-rhel8/openstack-swift-proxy-server:16.1         kolla_start           5 hours ago  Up 5 hours ago                 swift_proxy
24faceb843c5  rhosp-director.ctlplane.localdomain:8787/rhosp-rhel8/openstack-swift-object:16.1               kolla_start           5 hours ago  Up 5 hours ago                 swift_rsync
ec6a7ce45493  rhosp-director.ctlplane.localdomain:8787/rhosp-rhel8/openstack-swift-object:16.1               kolla_start           5 hours ago  Up 5 hours ago                 swift_object_updater
b1bbe51517cb  rhosp-director.ctlplane.localdomain:8787/rhosp-rhel8/openstack-swift-object:16.1               kolla_start           5 hours ago  Up 5 hours ago                 swift_object_server
927fae3111ee  rhosp-director.ctlplane.localdomain:8787/rhosp-rhel8/openstack-swift-proxy-server:16.1         kolla_start           5 hours ago  Up 5 hours ago                 swift_object_expirer
4403b0f33c29  rhosp-director.ctlplane.localdomain:8787/rhosp-rhel8/openstack-swift-container:16.1            kolla_start           5 hours ago  Up 5 hours ago                 swift_container_updater
614329e519e4  rhosp-director.ctlplane.localdomain:8787/rhosp-rhel8/openstack-swift-container:16.1            kolla_start           5 hours ago  Up 5 hours ago                 swift_container_server
d5391ca594ad  rhosp-director.ctlplane.localdomain:8787/rhosp-rhel8/openstack-swift-account:16.1              kolla_start           5 hours ago  Up 5 hours ago                 swift_account_server
d444054a9d33  rhosp-director.ctlplane.localdomain:8787/rhosp-rhel8/openstack-swift-account:16.1              kolla_start           5 hours ago  Up 5 hours ago                 swift_account_reaper
4edcdabb95be  rhosp-director.ctlplane.localdomain:8787/rhosp-rhel8/openstack-swift-account:16.1              chown -R swift: /...  5 hours ago  Exited (0) 5 hours ago         swift_setup_srv
721fc1463a02  rhosp-director.ctlplane.localdomain:8787/rhosp-rhel8/openstack-swift-object:16.1               /bin/bash -c sed ...  5 hours ago  Exited (0) 5 hours ago         swift_rsync_fix
96f40d1f1fbe  rhosp-director.ctlplane.localdomain:8787/rhosp-rhel8/openstack-swift-proxy-server:16.1         /bin/bash -c cp -...  5 hours ago  Exited (0) 5 hours ago         swift_copy_rings


I already tried:
- deleting the undercloud stack and re-deploying
- deleting the undercloud packages as advised in https://access.redhat.com/solutions/2210421m and reinstalling from scratch
- reinstalling 16.0 undercloud + overcloud instead of 16.1 

Nothing helped; same error.

I have a feeling that I am missing something basic here...
Any idea how to proceed?

Itai

Comment 15 Itai Levy 2020-12-10 15:06:26 UTC
Because I used the RHEL 8.2 DVD ISO to install the undercloud OS, the initial bare-metal node introspection was failing, and I had to figure out that I needed to stop and disable the libvirtd service, which was occupying port 67 and preventing the ironic_inspector_dnsmasq container from coming up...
Maybe another RHEL 8.2 inbox service is interfering with the undercloud containers and preventing them from functioning properly?

[root@rhosp-director stack]# cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.2 (Ootpa)

[root@rhosp-director stack]# iptables -L -n | grep swift
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            multiport dports 8080 state NEW /* 100 swift_proxy_server_haproxy ipv4 */
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            multiport dports 13808 state NEW /* 100 swift_proxy_server_haproxy_ssl ipv4 */
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            multiport dports 8080,13808 state NEW /* 122 swift proxy ipv4 */
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            multiport dports 873,6000,6001,6002 state NEW /* 123 swift storage ipv4 */

[root@rhosp-director stack]# netstat -pna | grep LISTEN | grep "8080\|13808\|873\|6000\|6001\|6002"
tcp        0      0 192.168.24.1:8080       0.0.0.0:*               LISTEN      730873/python3      
tcp        0      0 192.168.24.1:6000       0.0.0.0:*               LISTEN      728194/python3      
tcp        0      0 192.168.24.3:8080       0.0.0.0:*               LISTEN      704947/haproxy      
tcp        0      0 192.168.24.2:13808      0.0.0.0:*               LISTEN      704947/haproxy      
tcp        0      0 192.168.24.1:6001       0.0.0.0:*               LISTEN      727568/python3      
tcp        0      0 192.168.24.1:6002       0.0.0.0:*               LISTEN      727285/python3      
tcp        0      0 192.168.24.1:873        0.0.0.0:*               LISTEN      728651/rsync

Comment 16 Rabi Mishra 2020-12-10 16:32:50 UTC
Ah, I should have checked your templates earlier. The network templates are wrong.

[ramishra@ramishra-laptop deploy_yamls]$ cat controller.yaml 

....

outputs:
  OS::stack_id:
    description: The OsNetConfigImpl resource.
    value:

[ramishra@ramishra-laptop deploy_yamls]$ cat computesriov.yaml

....

outputs:
  OS::stack_id:
    description: The OsNetConfigImpl resource.
    value:


So they are missing the last line:


outputs:
  OS::stack_id:
    description: The OsNetConfigImpl resource.
    value:
      get_resource: OsNetConfigImpl << Missing


So the OS::stack_id values are empty and are passed on as None.
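
A check along these lines could catch the mistake before deploying. It is a minimal sketch, not part of any TripleO tool: the function name empty_stack_id_outputs is hypothetical, and it assumes the template has already been parsed into a dict by a YAML parser (for example yaml.safe_load from PyYAML), which loads an empty "value:" as None.

```python
def empty_stack_id_outputs(template):
    """Return a list of problems found in the OS::stack_id output.

    `template` is assumed to be the dict obtained by parsing a network
    config template with a YAML parser; an empty `value:` key arrives
    here as None.
    """
    outputs = template.get("outputs") or {}
    if "OS::stack_id" not in outputs:
        return ["outputs has no OS::stack_id key"]
    stack_id = outputs["OS::stack_id"] or {}
    if stack_id.get("value") is None:
        return ["OS::stack_id value is empty (parsed as None)"]
    return []


# The broken and corrected outputs sections shown above, as parsed dicts.
broken = {
    "outputs": {
        "OS::stack_id": {
            "description": "The OsNetConfigImpl resource.",
            "value": None,
        }
    }
}
fixed = {
    "outputs": {
        "OS::stack_id": {
            "description": "The OsNetConfigImpl resource.",
            "value": {"get_resource": "OsNetConfigImpl"},
        }
    }
}

for name, template in (("broken", broken), ("fixed", fixed)):
    print(name, empty_stack_id_outputs(template))
```

Running this over every role's network config template (controller.yaml, computesriov.yaml, and so on) would flag the empty value without needing a full deploy.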

Comment 17 Alaa Hleihel (NVIDIA Mellanox) 2020-12-10 16:38:39 UTC
It would be nice if the tool could say which key is missing rather than just crashing :)

Comment 18 Rabi Mishra 2020-12-10 16:50:53 UTC
> It would be nice if the tool could say which key is missing rather than just crashing :)

The key is there, but the value is empty, which is kind of valid for a template. Though it's very difficult to add checks for these mistakes in custom config templates, we'll put in a fix to check that the network config id is not None.

Comment 19 Itai Levy 2020-12-10 17:32:13 UTC
Thanks!

Comment 28 Jad Haj Yahya 2021-04-21 16:06:03 UTC
Removed get_resource: OsNetConfigImpl from compute.yaml, deployed, and got the error below:

Invalid network config for role Compute. Please check the network config templates used
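
This matches the check described in comment 18. As an illustration only, a guard of that kind might look like the sketch below; it is not the actual tripleo-common patch, and the names InvalidNetworkConfig and require_network_config_id are hypothetical. Only the message text is taken from the output above.

```python
class InvalidNetworkConfig(Exception):
    """Hypothetical exception type used only in this sketch."""


def require_network_config_id(role_name, config_id):
    """Return config_id unchanged, or fail early if it is None.

    Failing here gives the operator the role name instead of the
    later "Software config with id None not found" error from heat.
    """
    if config_id is None:
        raise InvalidNetworkConfig(
            "Invalid network config for role %s. Please check the "
            "network config templates used" % role_name
        )
    return config_id
```

The point of validating early is that the error names the role whose template is wrong, rather than surfacing as a failed lookup several layers down.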

Comment 34 errata-xmlrpc 2021-05-26 13:49:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.6 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2097