Description of problem:
OSP11 -> OSP12 upgrade: rerunning major-upgrade-composable-steps-docker.yaml for a second time fails with:

ERROR: The specified reference "WorkflowTasks_Step1_Execution" (in NetworkerDeployment_Step1) is incorrect.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.0-0.20170913050523.0rc2.el7ost.noarch

How reproducible:
1/1

Steps to Reproduce:
1. Deploy OSP11 with standalone nodes, including Ceph storage nodes
2. Set the DockerCephDaemonImage parameter to a nonexistent location
3. Run major-upgrade-composable-steps-docker.yaml
4. Wait for the deployment to fail because of the missing image:

[root@undercloud-0 stack]# tail /var/log/mistral/ceph-install-workflow.log
2017-09-17 19:59:39,292 p=11305 u=mistral | TASK [ceph-docker-common : pull ceph/rhceph-2-rhel7 image] *********************
2017-09-17 19:59:39,675 p=11305 u=mistral | fatal: [192.168.24.19]: FAILED! => {"changed": false, "cmd": ["docker", "pull", "192.168.24.1:8787/ceph/rhceph-2-rhel7:latest"], "delta": "0:00:00.031890", "end": "2017-09-17 23:59:40.694261", "failed": true, "rc": 1, "start": "2017-09-17 23:59:40.662371", "stderr": "Error: image ceph/rhceph-2-rhel7:latest not found", "stderr_lines": ["Error: image ceph/rhceph-2-rhel7:latest not found"], "stdout": "Trying to pull repository 192.168.24.1:8787/ceph/rhceph-2-rhel7 ... \nPulling repository 192.168.24.1:8787/ceph/rhceph-2-rhel7", "stdout_lines": ["Trying to pull repository 192.168.24.1:8787/ceph/rhceph-2-rhel7 ... ", "Pulling repository 192.168.24.1:8787/ceph/rhceph-2-rhel7"]}

5. Fix the issue by uploading the image to the location specified in DockerCephDaemonImage
6. Rerun major-upgrade-composable-steps-docker.yaml

Actual results:
Fails right away with:
ERROR: The specified reference "WorkflowTasks_Step1_Execution" (in NetworkerDeployment_Step1) is incorrect.

Expected results:
Rerunning major-upgrade-composable-steps-docker.yaml is possible.

Additional info: Attaching sosreport and deploy script/environment files used.
Created attachment 1327281
stack home
Spent some more time looking here to try and triage it, as we discussed on scrum yesterday. AFAICS it is indeed related to ceph-ansible. The workflow tasks are here:
https://github.com/openstack/tripleo-heat-templates/blob/ab682ed638a63b435037d5b2a34df7770e2c4d5a/common/deploy-steps.j2#L98-L151

Those steps are included after the upgrade_tasks, here:
https://github.com/openstack/tripleo-heat-templates/blob/ab682ed638a63b435037d5b2a34df7770e2c4d5a/common/major_upgrade_steps.j2.yaml#L179
and here:
https://github.com/openstack/tripleo-heat-templates/blob/ab682ed638a63b435037d5b2a34df7770e2c4d5a/common/post-upgrade.j2.yaml

There may be some issue with the way the workflow tasks are defined, or some recent change in the deploy-steps may have broken it.

From the attached stack-home (https://bugzilla.redhat.com/attachment.cgi?id=1327281) and the overcloud_composable_upgrade.log, the trace looks like this:

2017-09-17 23:57:38Z [overcloud-AllNodesDeploySteps-gr5fyevs3224.AllNodesPostUpgradeSteps.WorkflowTasks_Step2]: CREATE_COMPLETE state changed
2017-09-17 23:57:38Z [overcloud-AllNodesDeploySteps-gr5fyevs3224.AllNodesPostUpgradeSteps.WorkflowTasks_Step2_Execution]: CREATE_IN_PROGRESS state changed
Heat Stack update failed.
Heat Stack update failed.
2017-09-17 23:59:43Z [overcloud-AllNodesDeploySteps-gr5fyevs3224.AllNodesPostUpgradeSteps.WorkflowTasks_Step2_Execution]: CREATE_FAILED resources.WorkflowTasks_Step2_Execution: ERROR
2017-09-17 23:59:44Z [overcloud-AllNodesDeploySteps-gr5fyevs3224.AllNodesPostUpgradeSteps]: CREATE_FAILED Resource CREATE failed: resources.WorkflowTasks_Step2_Execution: ERROR
2017-09-17 23:59:45Z [overcloud-AllNodesDeploySteps-gr5fyevs3224.AllNodesPostUpgradeSteps]: CREATE_FAILED resources.AllNodesPostUpgradeSteps: Resource CREATE failed: resources.WorkflowTasks_Step2_Execution: ERROR
2017-09-17 23:59:45Z [overcloud-AllNodesDeploySteps-gr5fyevs3224]: UPDATE_FAILED resources.AllNodesPostUpgradeSteps: Resource CREATE failed: resources.WorkflowTasks_Step2_Execution: ERROR
2017-09-17 23:59:45Z [AllNodesDeploySteps]: UPDATE_FAILED resources.AllNodesDeploySteps: resources.AllNodesPostUpgradeSteps: Resource CREATE failed: resources.WorkflowTasks_Step2_Execution: ERROR
2017-09-17 23:59:46Z [overcloud]: UPDATE_FAILED resources.AllNodesDeploySteps: resources.AllNodesPostUpgradeSteps: Resource CREATE failed: resources.WorkflowTasks_Step2_Execution: ERROR

Stack overcloud UPDATE_FAILED

overcloud.AllNodesDeploySteps.AllNodesPostUpgradeSteps.WorkflowTasks_Step2_Execution:
  resource_type: OS::Mistral::ExternalResource
  physical_resource_id: a9ef9aed-ec71-4abe-b762-888373d49a3e
  status: CREATE_FAILED
  status_reason: |
    resources.WorkflowTasks_Step2_Execution: ERROR

I am holding off on marking this triaged for now. I think we should reach out to the Ceph DFG for help, since the workflow tasks in question are ceph-ansible related. I'll try to ping Jeff on IRC now.

DFG:Ceph, can we please get some help triaging this ceph-ansible related issue?
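For anyone triaging a similar log, here is a rough Python sketch for pulling the earliest failed resource out of a Heat event trace like the one above. The function names and the regular expression are my own, written against the line format visible in this log only; this is not a tool shipped with TripleO or Heat.

```python
import re

# One Heat event line as it appears in the trace above:
#   <timestamp>Z [<resource path>]: <STATUS> <free-form reason>
# The pattern is an assumption based on this log only.
EVENT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}Z) "
    r"\[(?P<path>[^\]]+)\]: (?P<status>[A-Z_]+)\s*(?P<reason>.*)$"
)


def failed_events(lines):
    """Return (timestamp, resource path, status) for every *_FAILED event."""
    events = []
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match and match.group("status").endswith("_FAILED"):
            events.append(
                (match.group("ts"), match.group("path"), match.group("status"))
            )
    return events


def first_failure(lines):
    """Return the earliest failed event, or None if nothing failed."""
    failures = failed_events(lines)
    # Timestamps in this format sort correctly as plain strings.
    return min(failures, key=lambda event: event[0]) if failures else None
```

Fed the trace above, the earliest failure is the WorkflowTasks_Step2_Execution resource at 23:59:43Z; the later CREATE_FAILED and UPDATE_FAILED lines are the same error propagating up through the parent stacks.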
The error from ceph-ansible in comment #0 looks like a legitimate failure of the initial run, caused by the image URL pointing at a nonexistent location. The real blocker seems to be that NetworkerDeployment_Step1 has a dependency on a resource which doesn't exist.
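To illustrate the class of error, here is a minimal Python sketch that scans a template's resources for depends_on entries naming a resource that is not defined. The template fragment is hypothetical and heavily trimmed (only the two resource names come from the error message in comment #0), and this is not how Heat itself performs the check.

```python
def dangling_references(resources):
    """Return (resource, missing dependency) pairs for depends_on entries
    that do not name a resource defined in the same template."""
    missing = []
    for name, body in resources.items():
        deps = body.get("depends_on", [])
        # depends_on may be a single name or a list of names.
        if isinstance(deps, str):
            deps = [deps]
        for dep in deps:
            if dep not in resources:
                missing.append((name, dep))
    return missing


# Hypothetical fragment: the resource type is illustrative, and the
# WorkflowTasks_Step1_Execution resource is deliberately absent.
resources = {
    "NetworkerDeployment_Step1": {
        "type": "OS::Heat::StructuredDeploymentGroup",
        "depends_on": ["WorkflowTasks_Step1_Execution"],
    },
}

print(dangling_references(resources))
# [('NetworkerDeployment_Step1', 'WorkflowTasks_Step1_Execution')]
```

Any non-empty result corresponds to the kind of "specified reference ... is incorrect" validation failure seen on the rerun.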
This sounds a lot like:
https://bugs.launchpad.net/heat/+bug/1701677

The patch for that appears to have merged on master just after Pike branched, so it's not present in OSP12. I've proposed a backport upstream.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:3462