Bug 1492590

Summary: OSP11 -> OSP12 upgrade: rerunning major-upgrade-composable-steps-docker.yaml for a second time fails with: ERROR: The specified reference "WorkflowTasks_Step1_Execution" (in NetworkerDeployment_Step1) is incorrect.
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openstack-heat
Assignee: Zane Bitter <zbitter>
Status: CLOSED ERRATA
QA Contact: Ronnie Rasouli <rrasouli>
Severity: urgent
Priority: urgent
Version: 12.0 (Pike)
CC: apannu, aschultz, dbecker, gfidente, jomurphy, jschluet, mandreou, mburns, morazi, rhel-osp-director-maint, sbaker, sclewis, shardy, srevivo, therve, zbitter
Target Milestone: rc
Keywords: Triaged
Target Release: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-heat-9.0.1-0.20171004002955.633da7f.el7ost
Last Closed: 2017-12-13 22:08:57 UTC
Type: Bug
Attachments: stack home (flags: none)

Description Marius Cornea 2017-09-18 09:26:34 UTC
Description of problem:
OSP11 -> OSP12 upgrade: rerunning major-upgrade-composable-steps-docker.yaml for a second time fails with: ERROR: The specified reference "WorkflowTasks_Step1_Execution" (in NetworkerDeployment_Step1) is incorrect.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.0-0.20170913050523.0rc2.el7ost.noarch

How reproducible:
1/1

Steps to Reproduce:
1. Deploy OSP11 with standalone nodes including Ceph storage nodes

2. Set the DockerCephDaemonImage parameter to a nonexistent location

3. Run major-upgrade-composable-steps-docker.yaml

4. Wait for the deployment to fail because of the missing image:

[root@undercloud-0 stack]# tail /var/log/mistral/ceph-install-workflow.log 
2017-09-17 19:59:39,292 p=11305 u=mistral |  TASK [ceph-docker-common : pull ceph/rhceph-2-rhel7 image] *********************
2017-09-17 19:59:39,675 p=11305 u=mistral |  fatal: [192.168.24.19]: FAILED! => {"changed": false, "cmd": ["docker", "pull", "192.168.24.1:8787/ceph/rhceph-2-rhel7:latest"], "delta": "0:00:00.031890", "end": "2017-09-17 23:59:40.694261", "failed": true, "rc": 1, "start": "2017-09-17 23:59:40.662371", "stderr": "Error: image ceph/rhceph-2-rhel7:latest not found", "stderr_lines": ["Error: image ceph/rhceph-2-rhel7:latest not found"], "stdout": "Trying to pull repository 192.168.24.1:8787/ceph/rhceph-2-rhel7 ... \nPulling repository 192.168.24.1:8787/ceph/rhceph-2-rhel7", "stdout_lines": ["Trying to pull repository 192.168.24.1:8787/ceph/rhceph-2-rhel7 ... ", "Pulling repository 192.168.24.1:8787/ceph/rhceph-2-rhel7"]}

5. Fix the issue by uploading the image to the location specified in DockerCephDaemonImage

6. Rerun major-upgrade-composable-steps-docker.yaml
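
The failure in step 4 is reported as a JSON payload after "FAILED! =>". As a minimal sketch (not part of the original report), the payload can be extracted with the Python standard library to confirm which image pull failed before rerunning. The helper name parse_fatal and the shortened sample line are illustrative assumptions; the sample keeps only some keys from the real log line.

```python
import json
import re

# Matches the 'fatal: [host]: FAILED! => {json}' shape seen in the log above.
FATAL_RE = re.compile(
    r"fatal: \[(?P<host>[^\]]+)\]: FAILED! => (?P<payload>\{.*\})\s*$"
)


def parse_fatal(line):
    """Return (host, payload dict) for an Ansible fatal line, or None."""
    match = FATAL_RE.search(line)
    if match is None:
        return None
    return match.group("host"), json.loads(match.group("payload"))


# Shortened copy of the log line from step 4 (several keys omitted).
sample = (
    "2017-09-17 19:59:39,675 p=11305 u=mistral |  "
    "fatal: [192.168.24.19]: FAILED! => "
    '{"changed": false, "cmd": ["docker", "pull", '
    '"192.168.24.1:8787/ceph/rhceph-2-rhel7:latest"], "failed": true, '
    '"rc": 1, "stderr": "Error: image ceph/rhceph-2-rhel7:latest not found"}'
)

host, payload = parse_fatal(sample)
print(host, payload["rc"], payload["stderr"])
```

The "cmd" and "stderr" fields together identify the registry location that has to be populated in step 5.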

Actual results:

Fails right away with:
ERROR: The specified reference "WorkflowTasks_Step1_Execution" (in NetworkerDeployment_Step1) is incorrect.

Expected results:
Rerunning major-upgrade-composable-steps-docker.yaml is possible.

Additional info:
Attaching sosreport and deploy script/environment files used.

Comment 1 Marius Cornea 2017-09-18 09:33:19 UTC
Created attachment 1327281 [details]
stack home

Comment 3 Marios Andreou 2017-09-19 14:46:34 UTC
Spent some more time on this to try to triage it, as we discussed on scrum yesterday. AFAICS it is indeed related to ceph-ansible. The workflow tasks are defined here: https://github.com/openstack/tripleo-heat-templates/blob/ab682ed638a63b435037d5b2a34df7770e2c4d5a/common/deploy-steps.j2#L98-L151

Those steps are included after the upgrade_tasks, here https://github.com/openstack/tripleo-heat-templates/blob/ab682ed638a63b435037d5b2a34df7770e2c4d5a/common/major_upgrade_steps.j2.yaml#L179 and here https://github.com/openstack/tripleo-heat-templates/blob/ab682ed638a63b435037d5b2a34df7770e2c4d5a/common/post-upgrade.j2.yaml

There may be an issue with the way the workflow tasks are defined, or a recent change in deploy-steps may have broken them. From the attached stack-home (https://bugzilla.redhat.com/attachment.cgi?id=1327281), the trace in overcloud_composable_upgrade.log looks like this:

        2017-09-17 23:57:38Z [overcloud-AllNodesDeploySteps-gr5fyevs3224.AllNodesPostUpgradeSteps.WorkflowTasks_Step2]: CREATE_COMPLETE  state changed
        2017-09-17 23:57:38Z [overcloud-AllNodesDeploySteps-gr5fyevs3224.AllNodesPostUpgradeSteps.WorkflowTasks_Step2_Execution]: CREATE_IN_PROGRESS  state changed
        Heat Stack update failed.
        Heat Stack update failed.
        2017-09-17 23:59:43Z [overcloud-AllNodesDeploySteps-gr5fyevs3224.AllNodesPostUpgradeSteps.WorkflowTasks_Step2_Execution]: CREATE_FAILED  resources.WorkflowTasks_Step2_Execution: ERROR
        2017-09-17 23:59:44Z [overcloud-AllNodesDeploySteps-gr5fyevs3224.AllNodesPostUpgradeSteps]: CREATE_FAILED  Resource CREATE failed: resources.WorkflowTasks_Step2_Execution: ERROR
        2017-09-17 23:59:45Z [overcloud-AllNodesDeploySteps-gr5fyevs3224.AllNodesPostUpgradeSteps]: CREATE_FAILED  resources.AllNodesPostUpgradeSteps: Resource CREATE failed: resources.WorkflowTasks_Step2_Execution: ERROR
        2017-09-17 23:59:45Z [overcloud-AllNodesDeploySteps-gr5fyevs3224]: UPDATE_FAILED  resources.AllNodesPostUpgradeSteps: Resource CREATE failed: resources.WorkflowTasks_Step2_Execution: ERROR
        2017-09-17 23:59:45Z [AllNodesDeploySteps]: UPDATE_FAILED  resources.AllNodesDeploySteps: resources.AllNodesPostUpgradeSteps: Resource CREATE failed: resources.WorkflowTasks_Step2_Execution: ERROR
        2017-09-17 23:59:46Z [overcloud]: UPDATE_FAILED  resources.AllNodesDeploySteps: resources.AllNodesPostUpgradeSteps: Resource CREATE failed: resources.WorkflowTasks_Step2_Execution: ERROR

         Stack overcloud UPDATE_FAILED 

        overcloud.AllNodesDeploySteps.AllNodesPostUpgradeSteps.WorkflowTasks_Step2_Execution:
          resource_type: OS::Mistral::ExternalResource
          physical_resource_id: a9ef9aed-ec71-4abe-b762-888373d49a3e
          status: CREATE_FAILED
          status_reason: |
            resources.WorkflowTasks_Step2_Execution: ERROR
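
The event trace above can be scanned mechanically for the first resource that entered a failed state. The following is a minimal sketch, assuming the "timestamp [resource]: STATUS  reason" line shape shown in the trace. The regular expression and the helper name first_failure are illustrative and are not part of any Heat or TripleO tooling.

```python
import re

# One Heat event per line: "<timestamp> [<resource path>]: <STATUS>  <reason>"
EVENT_RE = re.compile(
    r"^\s*(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}Z) "
    r"\[(?P<resource>[^\]]+)\]: (?P<status>[A-Z_]+)\s+(?P<reason>.*)$"
)


def first_failure(lines):
    """Return the first event whose status ends in _FAILED, or None."""
    for line in lines:
        match = EVENT_RE.match(line)
        if match and match.group("status").endswith("_FAILED"):
            return match.groupdict()
    return None


prefix = "overcloud-AllNodesDeploySteps-gr5fyevs3224.AllNodesPostUpgradeSteps"
trace = [
    "2017-09-17 23:57:38Z [%s.WorkflowTasks_Step2]: "
    "CREATE_COMPLETE  state changed" % prefix,
    "Heat Stack update failed.",  # non-event lines are skipped
    "2017-09-17 23:59:43Z [%s.WorkflowTasks_Step2_Execution]: "
    "CREATE_FAILED  resources.WorkflowTasks_Step2_Execution: ERROR" % prefix,
    "2017-09-17 23:59:44Z [%s]: CREATE_FAILED  Resource CREATE failed: "
    "resources.WorkflowTasks_Step2_Execution: ERROR" % prefix,
]

failure = first_failure(trace)
print(failure["resource"], failure["status"])
```

The earliest failed event names the innermost resource; the later UPDATE_FAILED lines only propagate that failure up through the parent stacks.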

I am holding off on marking this triaged for now. I think we should reach out to the Ceph DFG for help, since the workflow tasks in question are ceph-ansible related. I'll try to ping Jeff on IRC now. DFG:Ceph, can we please get some help triaging this ceph-ansible related issue?

Comment 4 Giulio Fidente 2017-09-19 14:51:24 UTC
The ceph-ansible error in comment #0 looks like a legitimate failure from the initial run, caused by the image URL being unset.

The real blocker instead seems to be that NetworkerDeployment_Step1 has a dependency on a resource which doesn't exist.
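
To illustrate the class of error only, here is a minimal sketch of a check for depends_on entries that name a resource absent from the same template. It is not Heat's validation code and makes no claim about the root cause. The resources map is a hypothetical fragment: the resource names come from the error message, while the resource types and the helper name find_dangling_dependencies are assumptions.

```python
def find_dangling_dependencies(resources):
    """Return (resource, missing_dependency) pairs for a Heat-style map."""
    dangling = []
    for name, definition in resources.items():
        deps = definition.get("depends_on", [])
        if isinstance(deps, str):
            # depends_on may be a single name or a list of names
            deps = [deps]
        for dep in deps:
            if dep not in resources:
                dangling.append((name, dep))
    return dangling


# Hypothetical fragment: the _Execution resource is referenced but not defined.
resources = {
    "WorkflowTasks_Step1": {"type": "OS::Mistral::Workflow"},
    "NetworkerDeployment_Step1": {
        "type": "OS::Heat::StructuredDeploymentGroup",
        "depends_on": ["WorkflowTasks_Step1_Execution"],
    },
}

print(find_dangling_dependencies(resources))
```

A reference that resolves on the first run but not on the rerun points at a difference in which resources were rendered or retained between the two stack updates.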

Comment 9 Zane Bitter 2017-09-20 14:40:45 UTC
This sounds a lot like https://bugs.launchpad.net/heat/+bug/1701677

The patch for that appears to have merged on master just after Pike branched, so it's not present in OSP12. I've proposed a backport upstream.

Comment 16 errata-xmlrpc 2017-12-13 22:08:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462