Bug 1818866 - [RHOSP16][Ml2-OVN] overcloud deploy command got stuck
Summary: [RHOSP16][Ml2-OVN] overcloud deploy command got stuck
Keywords:
Status: CLOSED DUPLICATE of bug 1792500
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 16.0 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Assignee: Adriano Petrich
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On:
Blocks: 1823324 1823334 1823352
 
Reported: 2020-03-30 14:55 UTC by Pradipta Kumar Sahoo
Modified: 2020-04-24 04:02 UTC
10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-14 20:34:12 UTC
Target Upstream Version:
Embargoed:


Attachments
ovn_migration.sh (10.45 KB, application/x-shellscript)
2020-03-30 14:55 UTC, Pradipta Kumar Sahoo

Description Pradipta Kumar Sahoo 2020-03-30 14:55:21 UTC
Created attachment 1674758 [details]
ovn_migration.sh

Description of problem:
In an OSP16 scale environment, after completing the ML2 OVS-to-OVN migration, we noticed that the qrouter and qdhcp namespaces and the neutron-ovs-agent containers were not cleaned up; we suspect this is because neutron-ovs-agent containers are still present on all the overcloud nodes.

The Mistral deployment log shows no exception for this, and all tasks completed without failure.
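For quick triage on a node, the leftover-container symptom can be spot-checked with a small filter over `podman ps` output. This is a minimal sketch (the helper name is mine, and the exact container name patterns are assumptions based on the namespaces mentioned above):

```shell
# Filter a list of container names (e.g. from `podman ps --format '{{.Names}}'`
# on stdin) down to the OVS-era Neutron pieces that should be gone after the
# migration. Empty output means the node looks clean.
check_leftovers() {
  grep -E 'neutron_ovs_agent|qrouter|qdhcp' || true
}
```

Usage on an overcloud node would be along the lines of `sudo podman ps --format '{{.Names}}' | check_leftovers`.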

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.0.1 (Train)
Red Hat Enterprise Linux release 8.1 (Ootpa)

$ grep -B1 -A3 neutron_driver containers-prepare-parameter.yaml
      namespace: registry-proxy.engineering.redhat.com/rh-osbs
      neutron_driver: ovn
      rhel_containers: false
      tag: 20200226.1

$ sudo rpm -qa| grep -E "migration|tripleo"
ansible-role-tripleo-modify-image-1.1.1-0.20200122200932.58d7a5b.el8ost.noarch
puppet-tripleo-11.4.1-0.20200205150840.71ff36d.el8ost.noarch
openstack-tripleo-puppet-elements-11.2.2-0.20200128210949.d668f88.el8ost.noarch
python3-tripleoclient-12.3.2-0.20200130192329.78ac810.el8ost.noarch
openstack-tripleo-common-containers-11.3.3-0.20200206225551.18d4fbc.el8ost.noarch
python3-tripleo-common-11.3.3-0.20200206225551.18d4fbc.el8ost.noarch
openstack-tripleo-common-11.3.3-0.20200206225551.18d4fbc.el8ost.noarch
ansible-tripleo-ipsec-9.2.0-0.20191022054642.ffe104c.el8ost.noarch
openstack-tripleo-validations-11.3.2-0.20200206065551.1a9b92a.el8ost.noarch
openstack-tripleo-heat-templates-11.3.2-0.20200211065546.d3d6dc3.el8ost.noarch
tripleo-ansible-0.4.2-0.20200207140443.b750574.el8ost.noarch
python3-tripleoclient-heat-installer-12.3.2-0.20200130192329.78ac810.el8ost.noarch
openstack-tripleo-image-elements-10.6.1-0.20191022065313.7338463.el8ost.noarch
python3-networking-ovn-migration-tool-7.1.0-0.20200204065607.57ac389.el8ost.noarch


How reproducible:
100% reproducible in the scale lab environment:
1x Undercloud
3x Controller
50x Compute

Steps to Reproduce:
1. Updated "ovn_migration.sh" (attached) as per the official recommendation.
2. Successfully completed the four steps below:
	 Step 1 -> ovn_migration.sh generate-inventory

		   Generates the inventory file

	 Step 2 -> ovn_migration.sh setup-mtu-t1

		   Sets the DHCP renewal T1 to 30 seconds. After this step you will
		   need to wait at least 24h for the change to be propagated to all
		   VMs. This step is only necessary for VXLAN or GRE based tenant
		   networking.

	 Step 3 -> You need to wait at least 24h based on the default configuration
		   of neutron for the DHCP T1 parameter to be propagated, please
		   refer to documentation. WARNING: this is very important if you
		   are using VXLAN or GRE tenant networks.

	 Step 4 -> ovn_migration.sh reduce-mtu

		   Reduces the MTU of the neutron tenant networks. This
		   step is only necessary for VXLAN or GRE based tenant networking.

3. Reference deployment script with the OVN service template:

	$ cat overcloud-deploy-ovn.sh
	#!/bin/bash

	time openstack overcloud deploy \
	--timeout 1200 \
	--templates /usr/share/openstack-tripleo-heat-templates \
	  --environment-file /home/stack/firstboot.yaml \
	--stack overcloud \
	--libvirt-type kvm \
	--ntp-server clock1.rdu2.redhat.com \
	-e /home/stack/virt/config_lvm.yaml \
	-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
	-e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovn-ha.yaml \
	-e $HOME/ovn-extras.yaml \
	-e /home/stack/ovn-migration.yaml \
	-e /home/stack/virt/network/network-environment.yaml \
	-e /home/stack/virt/inject-trust-anchor.yaml \
	-e /home/stack/virt/hostnames.yml \
	-e /home/stack/virt/debug.yaml \
	-e /home/stack/virt/nodes_data.yaml \
	-e ~/containers-prepare-parameter.yaml 

	$ cat /home/stack/ovn-migration.yaml
	parameter_defaults:
	  ForceNeutronDriverUpdate: true

4. Successfully completed step 5:
	 Step 5 -> ovn_migration.sh start-migration
		   Starts the migration to OVN.

5. After completion, the TripleO Mistral Ansible log finished, but the overcloud deploy workflow did not sync up properly.
The tasks can be compared using the logs below:
	* /var/lib/mistral/overcloud/ansible.log (mistral_ovs_to_ovn_2020-03-27_191006.log)
	* overcloud-deploy-ovn.sh.log

	Ansible tasks executed:
	Compute: 110
	Controller: 137
	Undercloud: 7
	Mistral migration duration: ~3hr 15min
	Start time: 2020-03-27 19:10:06 UTC
	End time: 2020-03-27 22:04:18 UTC


Debugging Steps:

6. Neutron config and ML2 Configuration validation:
	http://paste.openstack.org/show/0r1wrFkc7Lkiwyv7OcMR/
	http://paste.openstack.org/show/4DgziytQIMs7WZIiQVft/
7. OVS-DB external ID mapping:
	http://paste.openstack.org/show/4TOmCmLsWIwDoOz4eGg4/
	http://paste.openstack.org/show/tJKnKyqByvDekJhisFiU/
8. Pacemaker OVN DB status
	$ ansible -i hosts controller-1 --become -m shell -a "pcs status | grep -A3 openstack-ovn-northd"
	controller-1 | CHANGED | rc=0 >>
	 Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]
	   ovn-dbs-bundle-0     (ocf::ovn:ovndb-servers):       Master controller-1
	   ovn-dbs-bundle-1     (ocf::ovn:ovndb-servers):       Slave controller-2
	   ovn-dbs-bundle-2     (ocf::ovn:ovndb-servers):       Slave controller-0
9. It looks like the OVN migration task did not update the OVN northbound database with the neutron tenant networks and routers, except for the ovn-migration network. http://paste.openstack.org/show/qFMAbaVW49ACEm8EOrKk/
	# ovn-nbctl show| grep neutron
	Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
	switch 5faf5947-fa43-4bd4-b45f-67000633323f (neutron-386fbaac-5475-4a04-bf0a-627688ee0443) (aka ovn-migration-net-pre)
	router c76f7562-be4f-4211-aacd-5f896e826a77 (neutron-bc9513b6-a7ab-4972-b226-f2ab7c5677e9) (aka ovn-migration-router-pre)
10. The southbound database is updated with only the chassis hostnames, but not with the associated VMs' VIF port binding details.
	http://paste.openstack.org/show/791327/

11. It looks like the qrouter and qdhcp namespace containers were not cleaned up after the migration.
	podman ps| grep -E 'qrouter|qdhcp'
	http://paste.openstack.org/show/791337/
	http://paste.openstack.org/show/791338/
	http://paste.openstack.org/show/791339/
12. I suspect the migration script did not clean up the neutron-ovs-agent running on the overcloud nodes; because of this, the qrouter and qdhcp namespaces still exist.
	http://paste.openstack.org/show/791340/

13. Hence, after the migration, the FIPs of 1000+ instances are not accessible.

14. Also, rerunning step 4 did not help, as the pre-migration VM had lost its FIP.

Actual results:
The neutron-ovs-agent containers and the qrouter/qdhcp namespaces are not cleaned up after the OVN migration.


Expected results:
The neutron-ovs-agent containers should be cleaned up after step 5 of the OVN migration, and FIPs should remain accessible for all instances, regardless of whether the computes use the iptables_hybrid or openvswitch firewall driver.

Comment 2 Jakub Libosvar 2020-03-30 15:17:23 UTC
I'm changing the component to TripleO, as the migration script doesn't clean up the resources.
Here are the resource_registry entries used:

# Disabling Neutron services that overlap with OVN
  OS::TripleO::Services::NeutronOvsAgent: OS::Heat::None
  OS::TripleO::Services::ComputeNeutronOvsAgent: OS::Heat::None
  OS::TripleO::Services::NeutronL3Agent: OS::Heat::None
  OS::TripleO::Services::NeutronMetadataAgent: OS::Heat::None
  OS::TripleO::Services::NeutronDhcpAgent: OS::Heat::None
  OS::TripleO::Services::ComputeNeutronCorePlugin: OS::Heat::None

It would be good to have a TripleO expert look at why the agents were not cleaned up.

Comment 4 Pradipta Kumar Sahoo 2020-03-31 15:51:51 UTC
It looks like the OVN migration failed because the migration Ansible task depends on the exit status of the TripleO overcloud deploy command.

~~~
TASK [tripleo-update : Updating the overcloud stack with OVN services]
task path: /home/stack/ovn_migration/playbooks/roles/tripleo-update/tasks/main.yml:20
fatal: [localhost]: FAILED! => {"changed": true, "cmd": "set -o pipefail && /home/stack/overcloud-deploy-ovn.sh 2>&1 > /home/stack/overcloud-deploy-ovn.sh.log\n"
~~~
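The wrapper command above uses `set -o pipefail`, so the migration task fails whenever the deploy script exits non-zero, even though its output is piped elsewhere. A minimal demonstration of that shell behavior (not the actual migration code):

```shell
# With pipefail, a pipeline's exit status is that of the rightmost command
# that failed, so the deploy script's failure is not masked by the pipe.
set -o pipefail
false | tee /dev/null
echo "exit: $?"   # prints "exit: 1"
```

Without `pipefail`, the same pipeline would report the exit status of `tee` (0) and the task would wrongly appear successful.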

Ideally, the overcloud deploy command collects the event log from the Mistral Ansible log "/var/lib/mistral/overcloud/ansible.log", but in a large-scale deployment the TripleO workflow sometimes fails to sync while executing Ansible tasks such as "Wait for puppet host configuration to finish", and the deploy command breaks if it exceeds the timeout value set in the deployment script.

In our case, with 53 nodes [3x Controller + 50x Compute], we used a timeout value of 1200 in the overcloud deployment script, since we had previously deployed the environment successfully with ML2/OVS using the same value. In large-scale deployments, we usually check the status in the Mistral Ansible log (/var/lib/mistral/overcloud/ansible.log) when the TripleO overcloud deploy workflow log breaks due to a timeout.

In our case, the migration completed successfully according to the Mistral log below, but it was not synced with "overcloud-deploy-ovn.sh.log", and the overcloud deploy command failed after the 1200 timeout.
~~~
TASK [Wait for puppet host configuration to finish] ****************************
Friday 27 March 2020  21:54:33 +0000 (0:00:21.629)       2:44:26.968 ********** 
FAILED - RETRYING: Wait for puppet host configuration to finish (1200 retries left).
~~~

Due to this issue, the remaining ovn_migration.sh tasks were not executed; thus the neutron-ovs-agent containers and the qrouter and qdhcp namespaces were not cleaned up, and ovn-db-sync did not update the neutron tenant networks in the northbound database.

If we could add a condition for this to the migration task, the deployment would pass.

As discussed with Kuba and Daniel, there is currently no revert plan for the OVN migration. Given the current situation, a revert plan would be necessary for customer scenarios, as the tenant environment was completely down.
Without a revert plan, we cannot proceed with the OVN migration in a large-scale environment.

$ grep failed=0 /var/lib/mistral/overcloud/ansible.log; echo $?
2020-03-27 22:04:18,953 p=278441 u=mistral |  compute-0                  : ok=277  changed=108  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,953 p=278441 u=mistral |  compute-1                  : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,953 p=278441 u=mistral |  compute-10                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,953 p=278441 u=mistral |  compute-11                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,954 p=278441 u=mistral |  compute-12                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,954 p=278441 u=mistral |  compute-13                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,954 p=278441 u=mistral |  compute-14                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,954 p=278441 u=mistral |  compute-15                 : ok=273  changed=111  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,954 p=278441 u=mistral |  compute-16                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,954 p=278441 u=mistral |  compute-17                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,954 p=278441 u=mistral |  compute-18                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,954 p=278441 u=mistral |  compute-19                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,954 p=278441 u=mistral |  compute-2                  : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,954 p=278441 u=mistral |  compute-20                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,954 p=278441 u=mistral |  compute-21                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,954 p=278441 u=mistral |  compute-22                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,955 p=278441 u=mistral |  compute-23                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,955 p=278441 u=mistral |  compute-24                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,955 p=278441 u=mistral |  compute-25                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,955 p=278441 u=mistral |  compute-26                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,955 p=278441 u=mistral |  compute-27                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,955 p=278441 u=mistral |  compute-28                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,955 p=278441 u=mistral |  compute-29                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,955 p=278441 u=mistral |  compute-3                  : ok=273  changed=111  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,955 p=278441 u=mistral |  compute-30                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,955 p=278441 u=mistral |  compute-31                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,955 p=278441 u=mistral |  compute-32                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,955 p=278441 u=mistral |  compute-33                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,955 p=278441 u=mistral |  compute-34                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,956 p=278441 u=mistral |  compute-35                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,956 p=278441 u=mistral |  compute-36                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,956 p=278441 u=mistral |  compute-37                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,956 p=278441 u=mistral |  compute-38                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,956 p=278441 u=mistral |  compute-39                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,956 p=278441 u=mistral |  compute-4                  : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,956 p=278441 u=mistral |  compute-40                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,956 p=278441 u=mistral |  compute-41                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,956 p=278441 u=mistral |  compute-42                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,956 p=278441 u=mistral |  compute-43                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,956 p=278441 u=mistral |  compute-44                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,956 p=278441 u=mistral |  compute-45                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,957 p=278441 u=mistral |  compute-46                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,957 p=278441 u=mistral |  compute-47                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,957 p=278441 u=mistral |  compute-48                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,957 p=278441 u=mistral |  compute-49                 : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,957 p=278441 u=mistral |  compute-5                  : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,957 p=278441 u=mistral |  compute-6                  : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,957 p=278441 u=mistral |  compute-7                  : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,957 p=278441 u=mistral |  compute-8                  : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,957 p=278441 u=mistral |  compute-9                  : ok=273  changed=110  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,957 p=278441 u=mistral |  controller-0               : ok=376  changed=150  unreachable=0    failed=0    skipped=167  rescued=0    ignored=1   
2020-03-27 22:04:18,957 p=278441 u=mistral |  controller-1               : ok=316  changed=137  unreachable=0    failed=0    skipped=181  rescued=0    ignored=1   
2020-03-27 22:04:18,957 p=278441 u=mistral |  controller-2               : ok=316  changed=137  unreachable=0    failed=0    skipped=181  rescued=0    ignored=1   
2020-03-27 22:04:18,958 p=278441 u=mistral |  undercloud                 : ok=20   changed=7    unreachable=0    failed=0    skipped=21   rescued=0    ignored=0   
0
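The recap above can be checked mechanically rather than by eye. A small sketch (the helper name is mine; the pattern assumes the standard Ansible recap format shown in the log):

```shell
# Print any recap line that reports an unreachable host or a failed task;
# empty output means the run itself finished cleanly on every host.
check_recap() {
  grep 'ok=' | grep -Ev 'unreachable=0[[:space:]]+failed=0' || true
}
```

Usage would be along the lines of `check_recap < /var/lib/mistral/overcloud/ansible.log`.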


/var/lib/mistral/overcloud/ansible.log
2020-03-27 21:54:33,671 p=278441 u=mistral |  TASK [Wait for puppet host configuration to finish] ****************************
2020-03-27 21:55:08,546 p=278441 u=mistral |  TASK [Debug output for task: Run puppet host configuration for step 5] *********
2020-03-27 21:55:38,394 p=278441 u=mistral |  TASK [Create puppet caching structures] ****************************************
2020-03-27 21:55:58,108 p=278441 u=mistral |  TASK [Check for facter.conf] ***************************************************
2020-03-27 21:56:18,714 p=278441 u=mistral |  TASK [Remove facter.conf if directory] *****************************************
2020-03-27 21:56:38,316 p=278441 u=mistral |  TASK [Write facter cache config] ***********************************************
2020-03-27 21:56:58,799 p=278441 u=mistral |  TASK [Cleanup facter cache if exists] ******************************************
2020-03-27 21:57:19,202 p=278441 u=mistral |  TASK [Pre-cache facts] *********************************************************
2020-03-27 21:57:40,879 p=278441 u=mistral |  TASK [Facter error output when failed] *****************************************
2020-03-27 21:58:01,054 p=278441 u=mistral |  TASK [Sync cached facts] *******************************************************
2020-03-27 21:58:39,243 p=278441 u=mistral |  TASK [Run container-puppet tasks (generate config) during step 5] **************
2020-03-27 21:58:59,352 p=278441 u=mistral |  TASK [Wait for container-puppet tasks (generate config) to finish] *************
2020-03-27 21:59:18,660 p=278441 u=mistral |  TASK [Debug output for task: Run container-puppet tasks (generate config) during step 5] ***
2020-03-27 21:59:38,940 p=278441 u=mistral |  TASK [Diff container-puppet.py puppet-generated changes for check mode] ********
2020-03-27 21:59:59,777 p=278441 u=mistral |  TASK [Diff container-puppet.py puppet-generated changes for check mode] ********
2020-03-27 22:00:19,857 p=278441 u=mistral |  TASK [Start containers for step 5 using paunch] ********************************
2020-03-27 22:00:40,623 p=278441 u=mistral |  TASK [Wait for containers to start for step 5 using paunch] ********************
2020-03-27 22:01:14,800 p=278441 u=mistral |  TASK [Debug output for task: Start containers for step 5] **********************
2020-03-27 22:01:36,019 p=278441 u=mistral |  TASK [Manage containers for step 5 with tripleo-ansible] ***********************
2020-03-27 22:01:55,608 p=278441 u=mistral |  TASK [Clean container_puppet_tasks for controller-0 step 5] ********************
2020-03-27 22:02:16,058 p=278441 u=mistral |  TASK [Calculate container_puppet_tasks for controller-0 step 5] ****************
2020-03-27 22:02:35,617 p=278441 u=mistral |  TASK [Write container-puppet-tasks json file for controller-0 step 5] **********
2020-03-27 22:02:55,011 p=278441 u=mistral |  TASK [Run container-puppet tasks (bootstrap tasks) for step 5] *****************
2020-03-27 22:03:15,009 p=278441 u=mistral |  TASK [Wait for container-puppet tasks (bootstrap tasks) for step 5 to finish] ***
2020-03-27 22:03:34,447 p=278441 u=mistral |  TASK [Debug output for task: Run container-puppet tasks (bootstrap tasks) for step 5] ***
2020-03-27 22:03:54,014 p=278441 u=mistral |  TASK [Server Post Deployments] *************************************************
2020-03-27 22:03:54,400 p=278441 u=mistral |  TASK [include_tasks] ***********************************************************
2020-03-27 22:04:14,831 p=278441 u=mistral |  TASK [External deployment Post Deploy tasks] ***********************************
2020-03-27 22:04:14,995 p=278441 u=mistral |  TASK [is additonal Cell?] ******************************************************
2020-03-27 22:04:15,156 p=278441 u=mistral |  TASK [discover via nova_compute?] **********************************************
2020-03-27 22:04:15,317 p=278441 u=mistral |  TASK [discover via nova_ironic?] ***********************************************
2020-03-27 22:04:15,793 p=278441 u=mistral |  TASK [Discovering nova hosts] **************************************************
2020-03-27 22:04:18,880 p=278441 u=mistral |  TASK [set_fact] ****************************************************************


BR,
Pradipta

Comment 5 Jakub Libosvar 2020-04-01 11:14:45 UTC
It turned out the "overcloud deploy" command got stuck; I'm updating the summary. It's unclear what happened and why it got stuck. We need somebody with broad TripleO knowledge to help us out.

Comment 8 Pradipta Kumar Sahoo 2020-04-08 09:32:02 UTC
Luke,

Thanks for highlighting this. Let me increase the worker and process counts to 12 and redeploy. We will keep you updated with the latest details.

$ sudo grep --color api_workers /var/lib/config-data/puppet-generated/mistral/etc/mistral/mistral.conf                                                                                                      
api_workers=1

$ sudo grep --color processes /var/lib/config-data/puppet-generated/mistral/etc/httpd/conf.d/10-mistral_wsgi.conf
  WSGIDaemonProcess mistral display-name=mistral_wsgi group=mistral processes=1 threads=1 user=mistral

$ sudo grep --color processes /var/lib/config-data/puppet-generated/zaqar/etc/httpd/conf.d/10-zaqar_wsgi.conf
  WSGIDaemonProcess zaqar-server display-name=zaqar_wsgi group=zaqar processes=12 threads=1 user=zaqar
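For reference, one way to apply such a change is a small in-place edit of the puppet-generated config shown above (a hedged sketch; the helper name is mine, and the need to restart the container afterwards is an assumption):

```shell
# Rewrite the api_workers setting in a Mistral-style INI file in place.
# usage: bump_workers <conf-file> <count>
bump_workers() {
  sed -i "s/^api_workers=.*/api_workers=$2/" "$1"
}
```

Usage would be along the lines of `sudo bump_workers /var/lib/config-data/puppet-generated/mistral/etc/mistral/mistral.conf 12`, followed by restarting the Mistral containers so the new value takes effect.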

BR,
Pradipta

Comment 10 Sai Sindhur Malleni 2020-04-09 13:26:35 UTC
This bug is very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1792500

Comment 18 Luke Short 2020-04-14 20:34:12 UTC

*** This bug has been marked as a duplicate of bug 1792500 ***

