Bug 1818866
| Summary: | [RHOSP16][Ml2-OVN] overcloud deploy command got stuck | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Pradipta Kumar Sahoo <psahoo> |
| Component: | openstack-tripleo-common | Assignee: | Adriano Petrich <apetrich> |
| Status: | CLOSED DUPLICATE | QA Contact: | David Rosenfeld <drosenfe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 16.0 (Train) | CC: | apevec, dalvarez, jlibosva, lhh, lshort, majopela, mburns, scohen, slinaber, smalleni |
| Target Milestone: | --- | Keywords: | Regression, Triaged |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-04-14 20:34:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1823324, 1823334, 1823352 | | |
| Attachments: | | | |
Description
Pradipta Kumar Sahoo
2020-03-30 14:55:21 UTC
I'm changing the component to TripleO, as the migration script does not clean up the resources. Here are the resource_registry entries that were used:

~~~
# Disabling Neutron services that overlap with OVN
OS::TripleO::Services::NeutronOvsAgent: OS::Heat::None
OS::TripleO::Services::ComputeNeutronOvsAgent: OS::Heat::None
OS::TripleO::Services::NeutronL3Agent: OS::Heat::None
OS::TripleO::Services::NeutronMetadataAgent: OS::Heat::None
OS::TripleO::Services::NeutronDhcpAgent: OS::Heat::None
OS::TripleO::Services::ComputeNeutronCorePlugin: OS::Heat::None
~~~

It would be good to have a TripleO expert look at why the agents were not cleaned up.

It looks like the OVN migration task failed because the migration Ansible task depends on the exit status of the `tripleo overcloud deploy` command:

~~~
TASK [tripleo-update : Updating the overcloud stack with OVN services]
task path: /home/stack/ovn_migration/playbooks/roles/tripleo-update/tasks/main.yml:20
fatal: [localhost]: FAILED! => {"changed": true, "cmd": "set -o pipefail && /home/stack/overcloud-deploy-ovn.sh 2>&1 > /home/stack/overcloud-deploy-ovn.sh.log\n"
~~~

Ideally the overcloud deploy command collects the event log from the Mistral Ansible log (/var/lib/mistral/overcloud/ansible.log), but in large-scale deployments the TripleO workflow sometimes loses sync while executing Ansible tasks such as "Wait for puppet host configuration to finish", and the deploy command breaks once it exceeds the timeout value set in the deployment script. In our case we have 53 nodes (3 x Controller + 50 x Compute) and used a timeout value of 1200 in the overcloud deployment script; with the same value we had previously deployed this environment successfully with ml2-ovs. So in large-scale deployments we usually check the status in the Mistral Ansible log (/var/lib/mistral/overcloud/ansible.log) when the overcloud deploy workflow log breaks due to the timeout.
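The fallback described above — trusting the per-host recap in /var/lib/mistral/overcloud/ansible.log rather than only the deploy command's exit status — could be sketched as a small shell helper. This is an editorial illustration, not part of the original report; only the log path and the recap line format are taken from it, and the helper name is hypothetical:

```shell
#!/bin/sh
# Editorial sketch: decide whether an overcloud deploy really failed by
# inspecting the Ansible recap in the Mistral log, instead of trusting
# only the deploy command's exit status.
ansible_recap_ok() {
    # $1: path to ansible.log
    log="$1"
    # At least one host must report failed=0 ...
    grep -q 'failed=0' "$log" || return 1
    # ... and no host may report a non-zero failed= or unreachable= count.
    if grep -Eq 'failed=[1-9]|unreachable=[1-9]' "$log"; then
        return 1
    fi
    return 0
}

# Usage (path from the report):
#   ansible_recap_ok /var/lib/mistral/overcloud/ansible.log \
#       && echo "all hosts finished with failed=0"
```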
In our case the migration completed successfully according to the Mistral log below, but this was not reflected in "overcloud-deploy-ovn.sh.log", and the overcloud deploy command failed after the 1200 timeout:

~~~
TASK [Wait for puppet host configuration to finish] ****************************
Friday 27 March 2020 21:54:33 +0000 (0:00:21.629) 2:44:26.968 **********
FAILED - RETRYING: Wait for puppet host configuration to finish (1200 retries left).
~~~

Because of this, the remaining ovn_migration.sh tasks were not executed: the neutron-ovs-agent, qrouter, and qdhcp containers were not cleaned up, and ovn-db-sync did not sync the Neutron tenant networks to the northbound database. If we can add a condition for this to the migration task, the deployment will pass.

As discussed with Kuba and Daniel, there is currently no revert plan for the OVN migration activity. Looking at the current situation, though, a revert plan would be necessary for customer scenarios, as the tenant environment was completely down. Without a revert plan we cannot proceed with OVN migration in a large-scale environment.

~~~
$ grep failed=0 /var/lib/mistral/overcloud/ansible.log; echo $?
2020-03-27 22:04:18,953 p=278441 u=mistral | compute-0 : ok=277 changed=108 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,953 p=278441 u=mistral | compute-1 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,953 p=278441 u=mistral | compute-10 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,953 p=278441 u=mistral | compute-11 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,954 p=278441 u=mistral | compute-12 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,954 p=278441 u=mistral | compute-13 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,954 p=278441 u=mistral | compute-14 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,954 p=278441 u=mistral | compute-15 : ok=273 changed=111 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,954 p=278441 u=mistral | compute-16 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,954 p=278441 u=mistral | compute-17 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,954 p=278441 u=mistral | compute-18 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,954 p=278441 u=mistral | compute-19 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,954 p=278441 u=mistral | compute-2 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,954 p=278441 u=mistral | compute-20 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,954 p=278441 u=mistral | compute-21 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,954 p=278441 u=mistral | compute-22 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,955 p=278441 u=mistral | compute-23 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,955 p=278441 u=mistral | compute-24 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,955 p=278441 u=mistral | compute-25 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,955 p=278441 u=mistral | compute-26 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,955 p=278441 u=mistral | compute-27 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,955 p=278441 u=mistral | compute-28 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,955 p=278441 u=mistral | compute-29 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,955 p=278441 u=mistral | compute-3 : ok=273 changed=111 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,955 p=278441 u=mistral | compute-30 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,955 p=278441 u=mistral | compute-31 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,955 p=278441 u=mistral | compute-32 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,955 p=278441 u=mistral | compute-33 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,955 p=278441 u=mistral | compute-34 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,956 p=278441 u=mistral | compute-35 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,956 p=278441 u=mistral | compute-36 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,956 p=278441 u=mistral | compute-37 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,956 p=278441 u=mistral | compute-38 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,956 p=278441 u=mistral | compute-39 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,956 p=278441 u=mistral | compute-4 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,956 p=278441 u=mistral | compute-40 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,956 p=278441 u=mistral | compute-41 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,956 p=278441 u=mistral | compute-42 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,956 p=278441 u=mistral | compute-43 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,956 p=278441 u=mistral | compute-44 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,956 p=278441 u=mistral | compute-45 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,957 p=278441 u=mistral | compute-46 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,957 p=278441 u=mistral | compute-47 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,957 p=278441 u=mistral | compute-48 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,957 p=278441 u=mistral | compute-49 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,957 p=278441 u=mistral | compute-5 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,957 p=278441 u=mistral | compute-6 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,957 p=278441 u=mistral | compute-7 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,957 p=278441 u=mistral | compute-8 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,957 p=278441 u=mistral | compute-9 : ok=273 changed=110 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,957 p=278441 u=mistral | controller-0 : ok=376 changed=150 unreachable=0 failed=0 skipped=167 rescued=0 ignored=1
2020-03-27 22:04:18,957 p=278441 u=mistral | controller-1 : ok=316 changed=137 unreachable=0 failed=0 skipped=181 rescued=0 ignored=1
2020-03-27 22:04:18,957 p=278441 u=mistral | controller-2 : ok=316 changed=137 unreachable=0 failed=0 skipped=181 rescued=0 ignored=1
2020-03-27 22:04:18,958 p=278441 u=mistral | undercloud : ok=20 changed=7 unreachable=0 failed=0 skipped=21 rescued=0 ignored=0
0
~~~

~~~
/var/lib/mistral/overcloud/ansible.log
2020-03-27 21:54:33,671 p=278441 u=mistral | TASK [Wait for puppet host configuration to finish] ****************************
2020-03-27 21:55:08,546 p=278441 u=mistral | TASK [Debug output for task: Run puppet host configuration for step 5] *********
2020-03-27 21:55:38,394 p=278441 u=mistral | TASK [Create puppet caching structures] ****************************************
2020-03-27 21:55:58,108 p=278441 u=mistral | TASK [Check for facter.conf] ***************************************************
2020-03-27 21:56:18,714 p=278441 u=mistral | TASK [Remove facter.conf if directory] *****************************************
2020-03-27 21:56:38,316 p=278441 u=mistral | TASK [Write facter cache config] ***********************************************
2020-03-27 21:56:58,799 p=278441 u=mistral | TASK [Cleanup facter cache if exists] ******************************************
2020-03-27 21:57:19,202 p=278441 u=mistral | TASK [Pre-cache facts] *********************************************************
2020-03-27 21:57:40,879 p=278441 u=mistral | TASK [Facter error output when failed] *****************************************
2020-03-27 21:58:01,054 p=278441 u=mistral | TASK [Sync cached facts] *******************************************************
2020-03-27 21:58:39,243 p=278441 u=mistral | TASK [Run container-puppet tasks (generate config) during step 5] **************
2020-03-27 21:58:59,352 p=278441 u=mistral | TASK [Wait for container-puppet tasks (generate config) to finish] *************
2020-03-27 21:59:18,660 p=278441 u=mistral | TASK [Debug output for task: Run container-puppet tasks (generate config) during step 5] ***
2020-03-27 21:59:38,940 p=278441 u=mistral | TASK [Diff container-puppet.py puppet-generated changes for check mode] ********
2020-03-27 21:59:59,777 p=278441 u=mistral | TASK [Diff container-puppet.py puppet-generated changes for check mode] ********
2020-03-27 22:00:19,857 p=278441 u=mistral | TASK [Start containers for step 5 using paunch] ********************************
2020-03-27 22:00:40,623 p=278441 u=mistral | TASK [Wait for containers to start for step 5 using paunch] ********************
2020-03-27 22:01:14,800 p=278441 u=mistral | TASK [Debug output for task: Start containers for step 5] **********************
2020-03-27 22:01:36,019 p=278441 u=mistral | TASK [Manage containers for step 5 with tripleo-ansible] ***********************
2020-03-27 22:01:55,608 p=278441 u=mistral | TASK [Clean container_puppet_tasks for controller-0 step 5] ********************
2020-03-27 22:02:16,058 p=278441 u=mistral | TASK [Calculate container_puppet_tasks for controller-0 step 5] ****************
2020-03-27 22:02:35,617 p=278441 u=mistral | TASK [Write container-puppet-tasks json file for controller-0 step 5] **********
2020-03-27 22:02:55,011 p=278441 u=mistral | TASK [Run container-puppet tasks (bootstrap tasks) for step 5] *****************
2020-03-27 22:03:15,009 p=278441 u=mistral | TASK [Wait for container-puppet tasks (bootstrap tasks) for step 5 to finish] ***
2020-03-27 22:03:34,447 p=278441 u=mistral | TASK [Debug output for task: Run container-puppet tasks (bootstrap tasks) for step 5] ***
2020-03-27 22:03:54,014 p=278441 u=mistral | TASK [Server Post Deployments] *************************************************
2020-03-27 22:03:54,400 p=278441 u=mistral | TASK [include_tasks] ***********************************************************
2020-03-27 22:04:14,831 p=278441 u=mistral | TASK [External deployment Post Deploy tasks] ***********************************
2020-03-27 22:04:14,995 p=278441 u=mistral | TASK [is additonal Cell?] ******************************************************
2020-03-27 22:04:15,156 p=278441 u=mistral | TASK [discover via nova_compute?] **********************************************
2020-03-27 22:04:15,317 p=278441 u=mistral | TASK [discover via nova_ironic?] ***********************************************
2020-03-27 22:04:15,793 p=278441 u=mistral | TASK [Discovering nova hosts] **************************************************
2020-03-27 22:04:18,880 p=278441 u=mistral | TASK [set_fact] ****************************************************************
~~~

BR,
Pradipta

It turned out the "overcloud deploy" command got stuck; I'm updating the summary. It's unclear what happened and why it got stuck. We need somebody with broad TripleO knowledge to help us out.

Luke, thanks for the highlight. Let me increase the worker and process counts to 12 and re-deploy. We will keep you updated with the latest details.
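The worker and process increase mentioned above could be illustrated with a small helper. This is an editorial sketch, not a procedure from the report: the `bump_workers` function is hypothetical, and on a real undercloud the edits would target the puppet-generated config paths quoted below and be followed by a restart of the Mistral containers (for example `sudo podman restart mistral_api`, an assumed container name).

```shell
#!/bin/sh
# Editorial sketch: raise Mistral's API worker count and WSGI process
# count by editing copies of the config files. The line formats match
# the grep output in the report; the function itself is an assumption.
bump_workers() {
    # $1: path to a mistral.conf copy
    # $2: path to a 10-mistral_wsgi.conf copy
    # $3: new worker/process count
    sed -i "s/^api_workers=.*/api_workers=$3/" "$1"
    sed -i "s/processes=[0-9]*/processes=$3/" "$2"
}
```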
~~~
$ sudo grep --color api_workers /var/lib/config-data/puppet-generated/mistral/etc/mistral/mistral.conf
api_workers=1

$ sudo grep --color processes /var/lib/config-data/puppet-generated/mistral/etc/httpd/conf.d/10-mistral_wsgi.conf
WSGIDaemonProcess mistral display-name=mistral_wsgi group=mistral processes=1 threads=1 user=mistral

$ sudo grep --color processes /var/lib/config-data/puppet-generated/zaqar/etc/httpd/conf.d/10-zaqar_wsgi.conf
WSGIDaemonProcess zaqar-server display-name=zaqar_wsgi group=zaqar processes=12 threads=1 user=zaqar
~~~

BR,
Pradipta

This bug is very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1792500

*** This bug has been marked as a duplicate of bug 1792500 ***