Description of problem:

OSP11 -> OSP12 upgrade: major-upgrade-composable-steps-docker.yaml fails while running cinder-manage db_sync when an incorrect location of the Docker images is provided. On a 1 controller + 1 compute deployment it took approximately 30 minutes to fail (between 2017-11-22 10:59:56 and 2017-11-22 11:28:52). In addition, the error message doesn't point to the root cause of the issue.

This is the failure message:

Stack overcloud UPDATE_FAILED

overcloud.AllNodesDeploySteps.AllNodesPostUpgradeSteps.ControllerDeployment_Step3.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 439855a2-1831-43e4-95ce-08ec9d707c67
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
        "Debug: Received report to process from controller-0.localdomain",
        "Debug: Processing report from controller-0.localdomain with processor Puppet::Reports::Store"
    ],
    "failed_when_result": true
}
	to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/d94ba0f2-9398-4aef-a486-c54cda6b23d4_playbook.retry

PLAY RECAP *********************************************************************
localhost                  : ok=4    changed=1    unreachable=0    failed=1

    (truncated, view all with --long)
  deploy_stderr: |

Heat Stack update failed.
Heat Stack update failed.

If we check the os-collect-config journal on the controllers, we can see that the last error is:

"Error: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]: Failed to call refresh: Command exceeded timeout",
"Error: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]: Command exceeded timeout",

But cinder-manage db_sync cannot succeed because there aren't any galera containers running at this point:

[root@controller-0 heat-admin]# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

Looking for more errors in os-collect-config, we can spot that the docker images were not found:

"ERROR: 9291 -- Failed to pull image: rhos-qe-mirror-brq.usersys.redhat.com:5000/rhosp12/openstack-cron-docker:inexistent",
"2017-11-22 11:21:06,897 ERROR: 9291 -- Failed running docker-puppet.py for crond",
"2017-11-22 11:21:06,897 ERROR: 9291 -- Unable to find image 'rhos-qe-mirror-brq.usersys.redhat.com:5000/rhosp12/openstack-cron-docker:inexistent' locally",
"Trying to pull repository rhos-qe-mirror-brq.usersys.redhat.com:5000/rhosp12/openstack-cron-docker ... ",
"Pulling repository rhos-qe-mirror-brq.usersys.redhat.com:5000/rhosp12/openstack-cron-docker",
"/usr/bin/docker-current: Error: image rhosp12/openstack-cron-docker:inexistent not found.",
"2017-11-22 11:21:06,898 INFO: 9291 -- Finished processing puppet configs"

Note: the 'inexistent' tag was passed on purpose to reproduce this bug.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.3-10.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP11
2. Upgrade to OSP12, passing inexistent image locations in the Docker image parameters, e.g.:

parameter_defaults:
  DockerAodhApiImage: $url/rhosp12/openstack-aodh-api-docker:inexistent

Actual results:
Upgrade fails while running cinder-manage db_sync, after 30 minutes.

Expected results:
Upgrade should fail fast and point to the root cause of the failure (inaccessible image location).

Additional info:
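As a stop-gap, operators can fail fast themselves by pulling every image reference from the environment file before starting the upgrade. Below is a minimal sketch of such a pre-flight check (a hypothetical helper, not part of TripleO; it assumes the docker CLI is usable on the host, PyYAML is installed, and that all image parameters follow the Docker*Image naming convention -- the file name overcloud_images.yaml is illustrative):

#!/usr/bin/env python
# prepull_check.py -- hypothetical pre-flight check, not part of TripleO.
# Pulls every Docker*Image reference found in a heat environment file and
# exits non-zero as soon as a registry/tag problem is visible.
import subprocess
import sys

import yaml  # PyYAML


def images_from_env(path):
    """Collect the values of Docker*Image parameters from an environment file."""
    with open(path) as f:
        env = yaml.safe_load(f) or {}
    params = env.get('parameter_defaults') or {}
    return [v for k, v in params.items()
            if k.startswith('Docker') and k.endswith('Image')]


def main(path):
    failed = []
    for image in images_from_env(path):
        # 'docker pull' exits non-zero when the repository or tag does not
        # exist, or when the registry is unreachable.
        if subprocess.call(['docker', 'pull', image]) != 0:
            failed.append(image)
    if failed:
        print('Failed to pull: %s' % ', '.join(failed))
        sys.exit(1)


if __name__ == '__main__':
    main(sys.argv[1])

Running "python prepull_check.py overcloud_images.yaml" before the upgrade turns the 30-minute db_sync timeout into an immediate, readable failure.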
We discussed this on the upgrades call today... reaching out to the Containers and Deployment DFGs to see if they have any thoughts about how we might catch this earlier. The upgrade_tasks and the upgrade workflow itself currently don't check images or do anything with the containers; they mainly stop/disable systemd services.
Please see comment #1. Thanks.
There is an enhancement to paunch which would make this failure a lot less obscure. Currently, detached containers are launched with a "docker run" and paunch immediately continues with the next task. If the image can't be pulled (wrong image reference, network issue), the container eventually fails to start. If paunch first checked whether the image exists locally and, if not, did an explicit "docker pull", it could fail early with a clear message. This won't catch cases where the container fails to start for some other reason, because paunch is not a service manager. For that we would need dedicated validator resources in tripleo-heat-templates which, for example, assert that mariadb is running and responding just before the first db_sync runs.
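For illustration only, the check-then-pull behaviour described above could look roughly like this (a sketch of the idea, not the actual paunch change; the function names are made up):

# Sketch of check-then-pull; ensure_image/run_detached are hypothetical names.
import subprocess


def ensure_image(image):
    """Fail early, with a clear error, if the image cannot be obtained."""
    # Already present locally? 'docker inspect --type=image' exits 0 if so.
    rc = subprocess.call(['docker', 'inspect', '--type=image', image],
                         stdout=subprocess.DEVNULL,
                         stderr=subprocess.DEVNULL)
    if rc == 0:
        return
    # Not local: pull explicitly so a bad reference fails right here with a
    # readable message, instead of inside a detached 'docker run' later.
    if subprocess.call(['docker', 'pull', image]) != 0:
        raise RuntimeError('Could not pull image %s' % image)


def run_detached(image, command):
    ensure_image(image)
    subprocess.check_call(['docker', 'run', '--detach', image] + command)

The point of the design is that a bad image reference surfaces at pull time, attributed to the image, rather than as a downstream timeout in an unrelated service.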
The upstream fix has landed. I'd like to know whether this should reach downstream via a stable/pike backport or a direct downstream backport.
There is no downstream git/gerrit for paunch [1], but there is an upstream stable backport.

[1] http://git.app.eng.bos.redhat.com/git/?q=python-paunch
According to our records, this should be resolved by python-paunch-1.5.3-1.el7ost. This build is available now.
Verified with python-paunch-1.5.5-1.el7ost.noarch.

The upgrade step now fails fast with a clear error:

overcloud.AllNodesDeploySteps.AllNodesPostUpgradeSteps.CephStorageDeployment_Step1.1:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 2c45805b-e217-4bb1-a446-7bd738652292
  status: CREATE_FAILED
  status_reason: |
    Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
        "See '/usr/bin/docker-current run --help'.",
        "2018-07-02 12:17:23,043 INFO: 62691 -- Finished processing puppet configs",
        "2018-07-02 12:17:23,043 ERROR: 62690 -- ERROR configuring crond"
    ]
}
	to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/fbad6ef5-eae2-4898-a39c-1df245294d85_playbook.retry

PLAY RECAP *********************************************************************
localhost                  : ok=6    changed=2    unreachable=0    failed=1

  deploy_stderr: |

Heat Stack update failed.
Heat Stack update failed.

And in the os-collect-config logs:

...
"2018-07-02 12:17:28,334 ERROR: 62966 -- Failed running docker-puppet.py for crond",
"2018-07-02 12:17:28,335 ERROR: 62966 -- Unable to find image '192.168.24.1:8787/rhosp12/openstack-cron:inexistent' locally",
"Trying to pull repository 192.168.24.1:8787/rhosp12/openstack-cron ... ",
"Pulling repository 192.168.24.1:8787/rhosp12/openstack-cron",
"/usr/bin/docker-current: Error: image rhosp12/openstack-cron:inexistent not found.",
"See '/usr/bin/docker-current run --help'.",
"2018-07-02 12:17:28,335 INFO: 62966 -- Finished processing puppet configs",
"2018-07-02 12:17:28,335 ERROR: 62965 -- ERROR configuring crond"
...
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2521