Created attachment 1734752 [details]
Mistral and oc-c0-upgrade-run.log logs

Description of problem:
I'm not sure whether this is a Nova bug, an FFU bug, or something else.

On a manual FFU from 13 to 16.1 (source: 13 -p 2020-11-13.1, target: RHOS-16.1-RHEL-8-20201124.n.0), while running the post-leapp pre-upgrade workarounds and upgrading controller-0:

openstack overcloud upgrade run --stack overcloud --limit controller-0 \
  | tee oc-c0-upgrade-run.log

the command fails to complete.

Version-Release number of selected component (if applicable):
(undercloud) [stack@undercloud-0 ~]$ rpm -qa | grep heat
openstack-heat-common-13.0.3-1.20200914171254.48b730a.el8ost.noarch
python3-tripleoclient-heat-installer-12.3.2-1.20200914164930.el8ost.noarch
python3-heat-agent-json-file-1.10.1-0.20200311091123.96b819c.el8ost.noarch
openstack-heat-agents-1.10.1-0.20200311091123.96b819c.el8ost.noarch
python3-heat-agent-ansible-1.10.1-0.20200311091123.96b819c.el8ost.noarch
openstack-heat-monolith-13.0.3-1.20200914171254.48b730a.el8ost.noarch
python3-heat-agent-apply-config-1.10.1-0.20200311091123.96b819c.el8ost.noarch
openstack-tripleo-heat-templates-11.3.2-1.20200914170175.el8ost.noarch
puppet-heat-15.4.1-1.20200821233740.d02f3a4.el8ost.noarch
openstack-heat-api-13.0.3-1.20200914171254.48b730a.el8ost.noarch
python3-heatclient-1.18.0-0.20200310192511.eca1637.el8ost.noarch
python3-heat-agent-1.10.1-0.20200311091123.96b819c.el8ost.noarch
openstack-heat-engine-13.0.3-1.20200914171254.48b730a.el8ost.noarch
heat-cfntools-1.4.2-6.el8ost.noarch
python3-heat-agent-puppet-1.10.1-0.20200311091123.96b819c.el8ost.noarch
python3-heat-agent-hiera-1.10.1-0.20200311091123.96b819c.el8ost.noarch
python3-heat-agent-docker-cmd-1.10.1-0.20200311091123.96b819c.el8ost.noarch

How reproducible:
Happened twice on two separate systems.

Steps to Reproduce:
1. Deploy OSP 13 (no Ceph)
2. Start the FFU and upgrade the undercloud
3. Start upgrading the overcloud:

# openstack overcloud upgrade run --stack overcloud --limit controller-0 \
  | tee oc-c0-upgrade-run.log

The command fails to complete, see below.

Actual results:
"Running upgrade for neutron ...",
"OK",
"Running upgrade for networking-l2gw ...",
"OK",
"Running upgrade for networking-sfc ...",
"OK",
"Running upgrade for neutron-dynamic-routing ...",
"OK",
"Running upgrade for vmware-nsx ...",
"OK",
"",
"2ae8ea451f7bd3fcc69075291d0b2130dcc743d3b75999e37a730dd0b2745880",
"",
"Error during database migration: \"Database schema file with version 118 doesn't exist.\"",
"",
"1cbab63cbc6a258e7dfa5fb54f180d86b3b7d1fc8c71097042529e83e9951835",
"",
"0426de1906ee42d51a381f08a0e7d14581a1e09559a5cff75160db7852d52256",
"",
"2020-11-29 14:42:59.544 12 WARNING oslo_db.sqlalchemy.engines [-] MySQL SQL mode is 'STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION', consider enabling TRADITIONAL or STRICT_ALL_TABLES\u001b[00m",
"2020-11-29 14:42:59.548 12 INFO alembic.runtime.migration [-] Context impl MySQLImpl.\u001b[00m",
"2020-11-29 14:42:59.548 12 INFO alembic.runtime.migration [-] Will assume non-transactional DDL.\u001b[00m",
"",
"Nova database contains data, placement database does not. Okay to proceed with migration",
"Dumping from NOVA_API to migrate-db.gdtSz3QZ/from-nova.sql",
"Loading to PLACEMENT from migrate-db.gdtSz3QZ/from-nova.sql",
"",
"2020-11-29 14:43:15.877 13 INFO barbican.model.sync [-] Syncing the secret_stores table with barbican.conf\u001b[00m",
"",
"Cell0 is already setup",
"",
"b4a05835ce5cae88c79b12507f3bbfcf9e3516c3857cfea3aa0a4857a51658d4",
"",
"(cellv2) Updating default cell_v2 cell d59ddd21-1c84-4e10-8183-ca6d111afefc",
"",
"37ff99fff380bf3b68284810580b318a61961e4bed8811e61538d973e588dda0",
"",
"3322a41a6bc165aafb5f72145bf068ed22664f1b8bfa53cf0efab06c508367c0"]}

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
controller-0 : ok=293 changed=165 unreachable=0 failed=1 skipped=268 rescued=0 ignored=0

Sunday 29 November 2020 09:43:44 -0500 (0:01:39.015) 0:22:11.052 *******
===============================================================================
Wait for containers to start for step 2 using paunch ------------------ 559.31s
Wait for container-puppet tasks (generate config) to finish ----------- 125.84s
Pre-fetch all the containers ------------------------------------------ 122.69s
Wait for containers to start for step 3 using paunch ------------------- 99.02s
Wait for puppet host configuration to finish --------------------------- 59.80s
Wait for puppet host configuration to finish --------------------------- 17.62s
Wait for puppet host configuration to finish --------------------------- 17.56s
Run puppet on the host to apply IPtables rules ------------------------- 15.69s
tripleo-network-config : Run NetworkConfig script ---------------------- 14.24s
Debug output for task: Start containers for step 2 --------------------- 10.86s
tripleo-hieradata : Render hieradata from template ---------------------- 7.67s
Wait for containers to start for step 1 using paunch -------------------- 7.15s
tripleo-kernel : Set extra sysctl options ------------------------------- 6.25s
Render all_nodes data as group_vars for overcloud ----------------------- 5.02s
tripleo-bootstrap : Deploy required packages to bootstrap TripleO ------- 4.52s
Wait for container-puppet tasks (bootstrap tasks) for step 2 to finish --- 3.86s
Wait for container-puppet tasks (bootstrap tasks) for step 1 to finish --- 3.83s
tripleo-podman : ensure podman and deps are installed ------------------- 3.00s
tripleo_lvmfilter : gather package facts -------------------------------- 2.96s
tripleo-bootstrap : Deploy network-scripts required for deprecated network service --- 2.94s
Ansible failed, check log at /var/log/containers/mistral/package_update.log.
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun [-] Exception occured while running the command: RuntimeError: Update failed with: Ansible failed, check log at /var/log/containers/mistral/package_update.log.
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun Traceback (most recent call last):
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/tripleoclient/command.py", line 32, in run
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     super(Command, self).run(parsed_args)
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/osc_lib/command/command.py", line 41, in run
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     return super(Command, self).run(parsed_args)
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/cliff/command.py", line 185, in run
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     return_code = self.take_action(parsed_args) or 0
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/tripleoclient/v1/overcloud_upgrade.py", line 271, in take_action
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     priv_key=key)
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/tripleoclient/utils.py", line 1369, in run_update_ansible_action
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     verbosity=verbosity, extra_vars=extra_vars)
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/package_update.py", line 127, in update_ansible
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     raise RuntimeError('Update failed with: {}'.format(payload['message']))
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun RuntimeError: Update failed with: Ansible failed, check log at /var/log/containers/mistral/package_update.log.
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun ESC[00m
2020-11-29 09:43:45.303 195668 ERROR openstack [-] Update failed with: Ansible failed, check log at /var/log/containers/mistral/package_update.log.: RuntimeError: Update failed with: Ansible failed, check log at /var/log/containers/mistral/package_update.log.ESC[00m
2020-11-29 09:43:45.303 195668 INFO osc_lib.shell [-] END return value: 1ESC[00m

Expected results:
The command should complete without issue, as it has in my past FFU attempts.

Additional info:
Some of the logs were rolled, so check log.1 as well.
Added two grep logs from controller-0:
grep-error-c0-var-log-containers.txt
grep-error-var-log-paunch.log.txt
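For anyone hitting the same failure, here is a minimal sketch of how to locate the failing db-sync error. The grep strings are taken from the output above; the undercloud log path is the one reported by the client, while the controller-side search path is an assumption about the usual 16.1 container log layout:

# On the undercloud: find the failing task in the Mistral-driven Ansible log
# (the rolled copy, package_update.log.1, may hold the earlier run)
sudo grep -n "Database schema file" \
    /var/log/containers/mistral/package_update.log \
    /var/log/containers/mistral/package_update.log.1

# On controller-0: search the per-service container logs for the same migration error
sudo grep -rn "Error during database migration" /var/log/containers/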
For what it's worth, I started a third FFU attempt on Titan50, this time without Barbican deployed, and it just passed the step that failed here; now moving on to upgrade controller-1. I'm unsure whether Barbican alone, or Barbican together with Cinder backup, causes this bug. If/once my FFU completes, I can verify the Cinder bug. After that, if it helps, I can try a fourth FFU, this time with Barbican but without Cinder backup.
Updating per the comment above: the third FFU attempt completed without errors. The Cinder bz was verified, so that urgency has now dropped. Re-confirming Michele's suspicion, Barbican is the prime suspect here. Bumping the urgency of this bz to high, as customers using Barbican on OSP 13 might hit this during FFU upgrades. I'll let dev add input first, but I suggest we update the title to reflect the issue better, maybe "FFU from 13 to 16.1z3 fails when Barbican is deployed".
Created a KCS for this case, where the transitional images are skipped: https://access.redhat.com/node/5625391
https://access.redhat.com/solutions/5625391
Nice work, thanks Jeremy. Note that this kind of error may happen for any service where the Stein image was left out. If any other similar bugs are reported, it'd be great if we could update the same KCS with the errors from those.
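As a rough, hedged check before re-running the upgrade, one could confirm whether the Stein transitional image entries are actually present in the prepared container parameters. The containers-prepare-parameter.yaml file name, the ~/templates directory, and the 'stein'/'rhosp15' patterns below are assumptions for illustration only, not the documented procedure from the KCS:

# Look for Stein/15 transitional image references in the prepared container parameters
grep -inE 'stein|rhosp15' ~/containers-prepare-parameter.yaml

# And in any extra environment files passed to 'openstack overcloud upgrade prepare'
grep -rinE 'stein|rhosp15' ~/templates/ 2>/dev/null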
Given that this appears to have been a misconfiguration, I'm closing this as WORKSFORME. If anything else is expected, please reopen the bug.