Bug 1902628 - FFU from 13-16.1, fails on: post-leapp pre-upgrade workarounds and upgrade controller-0:
Summary: FFU from 13-16.1, fails on: post-leapp pre-upgrade workarounds and upgrade controller-0
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Jesse Pretorius
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-30 08:22 UTC by Tzach Shefi
Modified: 2024-03-25 17:17 UTC (History)
CC: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-05 18:13:37 UTC
Target Upstream Version:
Embargoed:


Attachments
Mistral and oc-c0-upgrade-run.log logs (10.67 MB, application/gzip)
2020-11-30 08:22 UTC, Tzach Shefi


Links
Red Hat Issue Tracker OSP-14152 (last updated 2022-03-22 13:24:07 UTC)

Description Tzach Shefi 2020-11-30 08:22:01 UTC
Created attachment 1734752 [details]
Mistral and oc-c0-upgrade-run.log logs

Description of problem: I'm not sure this is a Nova bug; it may be an FFU issue, or something else entirely.

On a manual FFU from 13 to 16.1:
13  -p 2020-11-13.1
RHOS-16.1-RHEL-8-20201124.n.0

While running the post-Leapp pre-upgrade workarounds and upgrading controller-0:

    openstack overcloud upgrade run --stack overcloud --limit controller-0 \
        | tee oc-c0-upgrade-run.log

The command fails to complete.

Version-Release number of selected component (if applicable):
(undercloud) [stack@undercloud-0 ~]$ rpm -qa | grep heat
openstack-heat-common-13.0.3-1.20200914171254.48b730a.el8ost.noarch
python3-tripleoclient-heat-installer-12.3.2-1.20200914164930.el8ost.noarch
python3-heat-agent-json-file-1.10.1-0.20200311091123.96b819c.el8ost.noarch
openstack-heat-agents-1.10.1-0.20200311091123.96b819c.el8ost.noarch
python3-heat-agent-ansible-1.10.1-0.20200311091123.96b819c.el8ost.noarch
openstack-heat-monolith-13.0.3-1.20200914171254.48b730a.el8ost.noarch
python3-heat-agent-apply-config-1.10.1-0.20200311091123.96b819c.el8ost.noarch
openstack-tripleo-heat-templates-11.3.2-1.20200914170175.el8ost.noarch
puppet-heat-15.4.1-1.20200821233740.d02f3a4.el8ost.noarch
openstack-heat-api-13.0.3-1.20200914171254.48b730a.el8ost.noarch
python3-heatclient-1.18.0-0.20200310192511.eca1637.el8ost.noarch
python3-heat-agent-1.10.1-0.20200311091123.96b819c.el8ost.noarch
openstack-heat-engine-13.0.3-1.20200914171254.48b730a.el8ost.noarch
heat-cfntools-1.4.2-6.el8ost.noarch
python3-heat-agent-puppet-1.10.1-0.20200311091123.96b819c.el8ost.noarch
python3-heat-agent-hiera-1.10.1-0.20200311091123.96b819c.el8ost.noarch
python3-heat-agent-docker-cmd-1.10.1-0.20200311091123.96b819c.el8ost.noarch


How reproducible:
Happened twice on two separate systems

Steps to Reproduce:
1. Deploy OSP 13 (non-Ceph)
2. Start the FFU; upgrade the undercloud
3. Start upgrading the overcloud; the issue hits here:
#  openstack overcloud upgrade run --stack overcloud --limit controller-0 \
        | tee oc-c0-upgrade-run.log

The command fails to complete; see below.

Actual results:


 "Running upgrade for neutron ...", "OK", "Running upgrade for networking-l2gw ...", "OK", "Running upgrade for networking-sfc ...", "OK", "Running upgrade for neutron-dynamic-routing ...", "OK", "Running upgrade for vmware-nsx ...", "OK", "", "2ae8ea451f7bd3fcc69075291d0b2130dcc743d3b75999e37a730dd0b2745880", "", "Error during database migration: \"Database schema file with version 118 doesn't exist.\"", "", "1cbab63cbc6a258e7dfa5fb54f180d86b3b7d1fc8c71097042529e83e9951835", "", "0426de1906ee42d51a381f08a0e7d14581a1e09559a5cff75160db7852d52256", "", "2020-11-29 14:42:59.544 12 WARNING oslo_db.sqlalchemy.engines [-] MySQL SQL mode is 'STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION', consider enabling TRADITIONAL or STRICT_ALL_TABLES\u001b[00m", "2020-11-29 14:42:59.548 12 INFO alembic.runtime.migration [-] Context impl MySQLImpl.\u001b[00m", "2020-11-29 14:42:59.548 12 INFO alembic.runtime.migration [-] Will assume non-transactional DDL.\u001b[00m", "", "Nova database contains data, placement database does not. Okay to proceed with migration", "Dumping from NOVA_API to migrate-db.gdtSz3QZ/from-nova.sql", "Loading to PLACEMENT from migrate-db.gdtSz3QZ/from-nova.sql", "", "2020-11-29 14:43:15.877 13 INFO barbican.model.sync [-] Syncing the secret_stores table with barbican.conf\u001b[00m", "", "Cell0 is already setup", "", "b4a05835ce5cae88c79b12507f3bbfcf9e3516c3857cfea3aa0a4857a51658d4", "", "(cellv2) Updating default cell_v2 cell d59ddd21-1c84-4e10-8183-ca6d111afefc", "", "37ff99fff380bf3b68284810580b318a61961e4bed8811e61538d973e588dda0", "", "3322a41a6bc165aafb5f72145bf068ed22664f1b8bfa53cf0efab06c508367c0"]}
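The key failure buried in the output above is the sqlalchemy-migrate-style error: "Database schema file with version 118 doesn't exist." A quick way to pull the missing version number out of a captured line like that, for comparison against the migration scripts shipped in the container image, is sketched below. The log line is copied from this report; the extraction itself is just an illustrative triage aid, not part of any supported tooling:

```shell
# Illustrative sketch: extract the missing schema version from the error line.
# The line is copied from this bug's output; the sed parsing is an assumption
# about how one might triage it, not a documented procedure.
line='Error during database migration: "Database schema file with version 118 doesn'\''t exist."'
ver=$(printf '%s\n' "$line" | sed -n 's/.*version \([0-9]\{1,\}\).*/\1/p')
echo "missing schema version: $ver"
# → missing schema version: 118
```

The extracted version can then be checked against the service's database migration repository inside the running container; a mismatch suggests the database was stamped by a newer (or skipped) release than the image being run.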

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
controller-0               : ok=293  changed=165  unreachable=0    failed=1    skipped=268  rescued=0    ignored=0   

Sunday 29 November 2020  09:43:44 -0500 (0:01:39.015)       0:22:11.052 ******* 
=============================================================================== 
Wait for containers to start for step 2 using paunch ------------------ 559.31s
Wait for container-puppet tasks (generate config) to finish ----------- 125.84s
Pre-fetch all the containers ------------------------------------------ 122.69s
Wait for containers to start for step 3 using paunch ------------------- 99.02s
Wait for puppet host configuration to finish --------------------------- 59.80s
Wait for puppet host configuration to finish --------------------------- 17.62s
Wait for puppet host configuration to finish --------------------------- 17.56s
Run puppet on the host to apply IPtables rules ------------------------- 15.69s
tripleo-network-config : Run NetworkConfig script ---------------------- 14.24s
Debug output for task: Start containers for step 2 --------------------- 10.86s
tripleo-hieradata : Render hieradata from template ---------------------- 7.67s
Wait for containers to start for step 1 using paunch -------------------- 7.15s
tripleo-kernel : Set extra sysctl options ------------------------------- 6.25s
Render all_nodes data as group_vars for overcloud ----------------------- 5.02s
tripleo-bootstrap : Deploy required packages to bootstrap TripleO ------- 4.52s
Wait for container-puppet tasks (bootstrap tasks) for step 2 to finish --- 3.86s
Wait for container-puppet tasks (bootstrap tasks) for step 1 to finish --- 3.83s
tripleo-podman : ensure podman and deps are installed ------------------- 3.00s
tripleo_lvmfilter : gather package facts -------------------------------- 2.96s
tripleo-bootstrap : Deploy network-scripts required for deprecated network service --- 2.94s

Ansible failed, check log at /var/log/containers/mistral/package_update.log.
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun [-] Exception occured while running the command: RuntimeError: Update failed with: Ansible failed, check log at /var/log/containers/mistral/package_update.log.
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun Traceback (most recent call last):
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/tripleoclient/command.py", line 32, in run
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     super(Command, self).run(parsed_args)
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/osc_lib/command/command.py", line 41, in run
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     return super(Command, self).run(parsed_args)
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/cliff/command.py", line 185, in run
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     return_code = self.take_action(parsed_args) or 0
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/tripleoclient/v1/overcloud_upgrade.py", line 271, in take_action
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     priv_key=key)
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/tripleoclient/utils.py", line 1369, in run_update_ansible_action
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     verbosity=verbosity, extra_vars=extra_vars)
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/package_update.py", line 127, in update_ansible
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     raise RuntimeError('Update failed with: {}'.format(payload['message']))
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun RuntimeError: Update failed with: Ansible failed, check log at /var/log/containers/mistral/package_update.log.
2020-11-29 09:43:45.301 195668 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun ESC[00m
2020-11-29 09:43:45.303 195668 ERROR openstack [-] Update failed with: Ansible failed, check log at /var/log/containers/mistral/package_update.log.: RuntimeError: Update failed with: Ansible failed, check log at /var/log/containers/mistral/package_update.log.ESC[00m
2020-11-29 09:43:45.303 195668 INFO osc_lib.shell [-] END return value: 1ESC[00m



Expected results:
The command should complete without issue,
as it has in my past FFU attempts.

Additional info:

Some of the logs were rotated, so check log.1 as well.
Attached two grep logs from controller-0:
grep-error-c0-var-log-containers.txt
grep-error-var-log-paunch.log.txt

Comment 6 Tzach Shefi 2020-12-02 15:12:40 UTC
For what it's worth, I started a third FFU attempt on Titan50,
this time without Barbican deployed. It just passed the step that failed here;
now moving on to upgrade controller-1.

Unsure whether Barbican alone, or Barbican together with Cinder backup, causes this bug.
If/once my FFU completes, I could verify the Cinder bug.
After that, if it helps, I can try a fourth FFU with Barbican but without Cinder backup.

Comment 7 Tzach Shefi 2020-12-03 07:50:47 UTC
Updating per the comment above:
the third FFU attempt completed without errors.
The Cinder BZ was verified, so that urgency has now dropped.

Re-confirming Michele's suspicion, Barbican is the prime suspect here.

Bumping the urgency of this BZ to high,
as customers using Barbican on OSP 13 might hit this during FFU upgrades.


I'll let dev add input first, but I
suggest we update the title to reflect the issue better,
maybe "FFU from 13 to 16.1z3 fails when Barbican is deployed".

Comment 8 Jeremy 2020-12-03 19:45:04 UTC
Created a KCS for this case, where the transitional images are skipped.

https://access.redhat.com/node/5625391

Comment 10 Jesse Pretorius 2020-12-03 21:05:20 UTC
Nice work, thanks Jeremy. Note that this kind of error may happen for any service where the Stein image was left out. If any other similar bugs are reported, it'd be great if we could update the same KCS with the errors from those.
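Following the suggestion above, a hedged way to sweep a controller's container logs for the same class of error (so that any other affected services can be added to the KCS) is sketched below. On a real node the path to search would be /var/log/containers/; here a temporary directory and an illustrative file name stand in for it so the sketch is self-contained:

```shell
# Hedged triage sketch: list every service log containing the migration error.
# A temp directory stands in for /var/log/containers/ on the controller;
# the file name and contents are illustrative only.
logdir=$(mktemp -d)
printf '%s\n' 'Error during database migration: "Database schema file with version 118 doesn'\''t exist."' \
  > "$logdir/barbican-sample.log"
grep -rl "Database schema file with version" "$logdir"
rm -rf "$logdir"
```

Against the real log tree, each file `grep -rl` prints identifies a service whose Stein transitional image may have been skipped, matching the failure mode the KCS describes.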

Comment 11 Jesse Pretorius 2021-03-05 18:13:37 UTC
Given that this appears to have been a misconfiguration, I'm closing this as WORKSFORME. If anything else is expected, please reopen the bug.

