Description of problem:

OSP11 -> OSP12 upgrade: major-upgrade-composable-steps-docker.yaml fails while running cinder-manage db_sync when an incorrect location of the Docker images is provided. On a 1 controller + 1 compute deployment it took approximately 30 minutes to fail (between 2017-11-22 10:59:56 and 2017-11-22 11:28:52). In addition, the error message doesn't point to the root cause of the issue.

This is the failure message:

Stack overcloud UPDATE_FAILED

overcloud.AllNodesDeploySteps.AllNodesPostUpgradeSteps.ControllerDeployment_Step3.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 439855a2-1831-43e4-95ce-08ec9d707c67
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
        "Debug: Received report to process from controller-0.localdomain",
        "Debug: Processing report from controller-0.localdomain with processor Puppet::Reports::Store"
    ],
    "failed_when_result": true
}
	to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/d94ba0f2-9398-4aef-a486-c54cda6b23d4_playbook.retry

PLAY RECAP *********************************************************************
localhost                  : ok=4    changed=1    unreachable=0    failed=1

    (truncated, view all with --long)
  deploy_stderr: |

Heat Stack update failed.
Heat Stack update failed.

If we check the os-collect-config journal on the controllers, we can see that the last error is:

"Error: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]: Failed to call refresh: Command exceeded timeout",
"Error: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]: Command exceeded timeout",

But cinder-manage db_sync cannot succeed because there aren't any galera containers running at this point:

[root@controller-0 heat-admin]# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

Looking for more errors in os-collect-config, we can spot that the docker images were not found:

"ERROR: 9291 -- Failed to pull image: rhos-qe-mirror-brq.usersys.redhat.com:5000/rhosp12/openstack-cron-docker:inexistent",
"2017-11-22 11:21:06,897 ERROR: 9291 -- Failed running docker-puppet.py for crond",
"2017-11-22 11:21:06,897 ERROR: 9291 -- Unable to find image 'rhos-qe-mirror-brq.usersys.redhat.com:5000/rhosp12/openstack-cron-docker:inexistent' locally",
"Trying to pull repository rhos-qe-mirror-brq.usersys.redhat.com:5000/rhosp12/openstack-cron-docker ... ",
"Pulling repository rhos-qe-mirror-brq.usersys.redhat.com:5000/rhosp12/openstack-cron-docker",
"/usr/bin/docker-current: Error: image rhosp12/openstack-cron-docker:inexistent not found.",
"2017-11-22 11:21:06,898 INFO: 9291 -- Finished processing puppet configs"

Note: the 'inexistent' tag was passed on purpose to reproduce this bug.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.3-10.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP11
2. Upgrade to OSP12, passing inexistent image locations in the Docker image parameters, e.g.:

parameter_defaults:
  DockerAodhApiImage: $url/rhosp12/openstack-aodh-api-docker:inexistent

Actual results:
Upgrade fails while running cinder-manage db_sync, after 30 minutes.

Expected results:
Upgrade should fail fast and point to the root cause of the failure (inaccessible image location).

Additional info:
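As a stop-gap, operators can fail fast themselves by pulling every image reference from the environment file before starting the upgrade. Below is a minimal sketch of such a pre-flight check (a hypothetical helper, not part of TripleO; it assumes the docker CLI is usable on the host, PyYAML is installed, and that all image parameters follow the Docker*Image naming convention -- the file name overcloud_images.yaml is illustrative):

#!/usr/bin/env python
# prepull_check.py -- hypothetical pre-flight check, not part of TripleO.
# Pulls every Docker*Image reference found in a heat environment file and
# exits non-zero as soon as a registry/tag problem is visible.
import subprocess
import sys

import yaml  # PyYAML


def images_from_env(path):
    """Collect the values of Docker*Image parameters from an environment file."""
    with open(path) as f:
        env = yaml.safe_load(f) or {}
    params = env.get('parameter_defaults') or {}
    return [v for k, v in params.items()
            if k.startswith('Docker') and k.endswith('Image')]


def main(path):
    failed = []
    for image in images_from_env(path):
        # 'docker pull' exits non-zero when the repository or tag does not
        # exist, or when the registry is unreachable.
        if subprocess.call(['docker', 'pull', image]) != 0:
            failed.append(image)
    if failed:
        print('Failed to pull: %s' % ', '.join(failed))
        sys.exit(1)


if __name__ == '__main__':
    main(sys.argv[1])

Running "python prepull_check.py overcloud_images.yaml" before the upgrade turns the 30-minute db_sync timeout into an immediate, readable failure.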
We discussed this on the upgrades call today... reaching out to the Containers and Deployment DFGs to see if they have any thoughts about how we might catch this earlier. The upgrade_tasks and the upgrade workflow itself currently don't check images or do anything with the containers; they mainly stop/disable systemd services.
Please see comment #1. Thanks.
There is an enhancement to paunch which would make this failure a lot less obscure. Currently, detached containers are launched with a "docker run" and paunch immediately continues with the next task. If the image can't be pulled (wrong image reference, network issue), the container eventually fails to start. If paunch first checked whether the image exists locally and, if not, did an explicit "docker pull", it could fail early with a clear message. This won't catch cases where the container fails to start for some other reason, because paunch is not a service manager. For that we would need dedicated validator resources in tripleo-heat-templates which, for example, assert that mariadb is running and responding just before the first db_sync runs.
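For illustration only, the check-then-pull behaviour described above could look roughly like this (a sketch of the idea, not the actual paunch change; the function names are made up):

# Sketch of check-then-pull; ensure_image/run_detached are hypothetical names.
import subprocess


def ensure_image(image):
    """Fail early, with a clear error, if the image cannot be obtained."""
    # Already present locally? 'docker inspect --type=image' exits 0 if so.
    rc = subprocess.call(['docker', 'inspect', '--type=image', image],
                         stdout=subprocess.DEVNULL,
                         stderr=subprocess.DEVNULL)
    if rc == 0:
        return
    # Not local: pull explicitly so a bad reference fails right here with a
    # readable message, instead of inside a detached 'docker run' later.
    if subprocess.call(['docker', 'pull', image]) != 0:
        raise RuntimeError('Could not pull image %s' % image)


def run_detached(image, command):
    ensure_image(image)
    subprocess.check_call(['docker', 'run', '--detach', image] + command)

The point of the design is that a bad image reference surfaces at pull time, attributed to the image, rather than as a downstream timeout in an unrelated service.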
The upstream fix has landed. I'd like to know whether this should reach downstream via a stable/pike backport or a direct downstream backport.
There is no downstream git/gerrit for paunch [1], but there is an upstream stable backport.

[1] http://git.app.eng.bos.redhat.com/git/?q=python-paunch
According to our records, this should be resolved by python-paunch-1.5.3-1.el7ost. This build is available now.
Verified with python-paunch-1.5.5-1.el7ost.noarch.

The upgrade step now fails fast with a clear error:

overcloud.AllNodesDeploySteps.AllNodesPostUpgradeSteps.CephStorageDeployment_Step1.1:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 2c45805b-e217-4bb1-a446-7bd738652292
  status: CREATE_FAILED
  status_reason: |
    Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
        "See '/usr/bin/docker-current run --help'.",
        "2018-07-02 12:17:23,043 INFO: 62691 -- Finished processing puppet configs",
        "2018-07-02 12:17:23,043 ERROR: 62690 -- ERROR configuring crond"
    ]
}
	to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/fbad6ef5-eae2-4898-a39c-1df245294d85_playbook.retry

PLAY RECAP *********************************************************************
localhost                  : ok=6    changed=2    unreachable=0    failed=1

  deploy_stderr: |

Heat Stack update failed.
Heat Stack update failed.

And in the os-collect-config logs:

...
"2018-07-02 12:17:28,334 ERROR: 62966 -- Failed running docker-puppet.py for crond",
"2018-07-02 12:17:28,335 ERROR: 62966 -- Unable to find image '192.168.24.1:8787/rhosp12/openstack-cron:inexistent' locally",
"Trying to pull repository 192.168.24.1:8787/rhosp12/openstack-cron ... ",
"Pulling repository 192.168.24.1:8787/rhosp12/openstack-cron",
"/usr/bin/docker-current: Error: image rhosp12/openstack-cron:inexistent not found.",
"See '/usr/bin/docker-current run --help'.",
"2018-07-02 12:17:28,335 INFO: 62966 -- Finished processing puppet configs",
"2018-07-02 12:17:28,335 ERROR: 62965 -- ERROR configuring crond"
...
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2521