Bug 1288220 - Scaling out an OpenStack Infrastructure provider with an additional compute node doesn't work for an updated overcloud
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 10.0 (Newton)
Assignee: Tzu-Mainn Chen
QA Contact: Marius Cornea
URL:
Whiteboard:
Depends On:
Blocks: 1298196
 
Reported: 2015-12-03 21:38 UTC by Marius Cornea
Modified: 2016-09-12 23:51 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1298196
Environment:
Last Closed: 2016-04-18 16:16:48 UTC
Target Upstream Version:
Embargoed:


Attachments
os-collect-config (1.05 MB, text/x-vhdl)
2015-12-07 13:14 UTC, Marius Cornea
os-collect-config log (296.48 KB, text/plain)
2016-02-23 21:55 UTC, Tzu-Mainn Chen
heat update failures (1.32 MB, text/plain)
2016-02-24 14:44 UTC, Ronnie Rasouli

Description Marius Cornea 2015-12-03 21:38:50 UTC
Description of problem:
Scaling out an OpenStack Infrastructure provider with an additional compute node for an overcloud updated from 7.1 to 7.2 doesn't work.

Version-Release number of selected component (if applicable):
5.5.0.13

How reproducible:
100%

Steps to Reproduce:
1. Deploy RHEL OSP 7.1 (3 ctrls, 1 compute and 3 ceph nodes in my test)
2. Update both undercloud and overcloud to version 7.2 by following the update procedure
3. Add the undercloud to CloudForms as an OpenStack Infra provider
4. Scale out the infra provider with an additional compute node

Actual results:
The stack update seems to get stuck.

Expected results:
The stack update finishes.

Additional info:

From Openstack side:

Initial deploy command:
openstack overcloud deploy \
    --templates ~/templates/my-overcloud \
    --control-scale 3 --compute-scale 1 --ceph-storage-scale 3 \
    --ntp-server clock.redhat.com \
    --libvirt-type qemu \
    -e ~/templates/my-overcloud/environments/network-isolation.yaml \
    -e ~/templates/network-environment.yaml \
    -e ~/templates/firstboot-environment.yaml \
    -e ~/templates/ceph.yaml 

Update command:
/usr/bin/yes '' | openstack overcloud update stack overcloud -i \
         --templates ~/templates/my-overcloud \
         -e ~/templates/my-overcloud/overcloud-resource-registry-puppet.yaml \
         -e ~/templates/my-overcloud/environments/network-isolation.yaml \
         -e ~/templates/network-environment.yaml \
         -e ~/templates/firstboot-environment.yaml \
         -e ~/templates/ceph.yaml \
         -e ~/templates/my-overcloud/environments/updates/update-from-vip.yaml \
         -e ~/templates/ctrlport.yaml

Comment 2 Ladislav Smola 2015-12-04 16:00:02 UTC
This is the call in ManageIQ:

https://bugzilla.redhat.com/show_bug.cgi?id=1288220

We call PATCH with the master template, which we get from the API, and only one Count parameter, used for scaling.
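The PATCH body described above can be sketched roughly as follows (Python; the helper name and exact field layout are illustrative assumptions, not the actual ManageIQ code):

```python
# Sketch of the scale-out request body described above: a Heat PATCH
# stack update carrying the master template fetched from the API plus
# a single count parameter. Names are illustrative, not the real
# ManageIQ implementation.

def build_patch_update_body(master_template, compute_count):
    """Body for PATCH /v1/{tenant_id}/stacks/{stack_name}/{stack_id}."""
    return {
        # Master template, as downloaded from the Heat API:
        "template": master_template,
        # The only parameter the client changes for a scale out:
        "parameters": {"ComputeCount": compute_count},
    }

body = build_patch_update_body("heat_template_version: 2015-04-30\n", 2)
```

Note that nothing in this body tells Heat to skip the package-update deployments, which is where the later comments pick up.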

Comment 3 Marius Cornea 2015-12-07 13:14:16 UTC
Created attachment 1103221 [details]
os-collect-config

Moving this to the OpenStack side. I tested the scale out on a non-updated 7.2 deployment and it worked.

When I try to scale out on an updated deployment I can see the new compute node provisioned and a yum update triggered for it. In the end the overcloud stack update times out after 240 minutes.

Attaching the os-collect-config log from the added compute node.

Comment 5 Marius Cornea 2015-12-07 16:48:17 UTC
From what I can tell from the os-collect-config log, the last action on the newly added compute node was:

++ curl -s -w '%{http_code}' -X POST -H Content-Type: -o /tmp/tmp.UOjYLGlR5o --data-binary '{"deploy_stdout": "os-apply-config deployment 0dcd5cb4-3eb7-41ae-88ac-66decc9bfd75 completed", "deploy_status_code": "0"}' 'http://192.0.2.1:8000/v1/signal/arn%3Aopenstack%3Aheat%3A%3A9edce9ad17474a2093b89749ea8ae99c%3Astacks%2Fovercloud-Compute-vwqrtha3heky-1-727wpn3ted3l%2F024b0343-e168-4ab8-be40-464a96187655%2Fresources%2FNovaComputeDeployment?Timestamp=2015-12-07T12%3A44%3A15Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=5e5bf7331dcc4bfe8a2603ff8b8efb2e&SignatureVersion=2&Signature=FfuLQZUy%2FjUABdLZUQGfEHEOiMJOYcUXj9bH3OJEA4w%3D'
+ status=200

while the undercloud heat-api.log shows 

{
    "code": 404,
    "error": {
        "message": "Not found",
        "traceback": "Traceback (most recent call last):\n\n  File \"/usr/lib/python2.7/site-packages/heat/common/context.py\", line 300, in wrapped\n    return func(self, ctx, *args, **kwargs)\n\n  File \"/usr/lib/python2.7/site-packages/heat/engine/service.py\", line 1504, in show_software_deployment\n    cnxt, deployment_id)\n\n  File \"/usr/lib/python2.7/site-packages/heat/engine/service_software_config.py\", line 148, in show_software_deployment\n    cnxt, deployment_id)\n\n  File \"/usr/lib/python2.7/site-packages/heat/objects/software_deployment.py\", line 72, in get_by_id\n    db_api.software_deployment_get(context, deployment_id))\n\n  File \"/usr/lib/python2.7/site-packages/heat/db/api.py\", line 292, in software_deployment_get\n    return IMPL.software_deployment_get(context, deployment_id)\n\n  File \"/usr/lib/python2.7/site-packages/heat/db/sqlalchemy/api.py\", line 820, in software_deployment_get\n    deployment_id)\n\nNotFound: Deployment with id 0dcd5cb4-3eb7-41ae-88ac-66decc9bfd75 not found\n",
        "type": "NotFound"
    },
    "explanation": "The resource could not be found.",
    "title": "Not Found"
}

2015-12-07 11:33:14.715 1241 INFO eventlet.wsgi.server [req-e57b9aa6-d5c2-4892-9790-f203e3b5d9f5 fb1641bfac4d45c2a9af85bf39c78cb3 9edce9ad17474a2093b89749ea8ae99c] 192.0.2.1 - - [07/Dec/2015 11:33:14] "GET /v1/9edce9ad17474a2093b89749ea8ae99c/software_deployments/0dcd5cb4-3eb7-41ae-88ac-66decc9bfd75 HTTP/1.1" 404 1382 0.165222

Comment 6 Angus Thomas 2015-12-17 16:23:12 UTC
This is believed to be resolved by the fix posted on https://bugzilla.redhat.com/show_bug.cgi?id=1290796

Can it be re-tested, with that fix in place?

Comment 7 Marius Cornea 2015-12-17 17:46:16 UTC
I re-tested with the latest build but I am getting the same results. The stack update hasn't failed yet, but I expect it to time out after 240 minutes.

I'm going to leave the environment available for debugging. Thank you.

Comment 9 Udi Kalifon 2015-12-22 09:06:16 UTC
I am hitting the same issue, scale out is impossible after update from 7.1 to 7.2. There is another bug related to this: https://bugzilla.redhat.com/show_bug.cgi?id=1257950.

Comment 10 Marius Cornea 2016-01-03 18:26:17 UTC
Got the same result with the last scale out attempt:

| stack_status        | UPDATE_FAILED |
| stack_status_reason | Timed out     |

Comment 11 Jaromir Coufal 2016-01-26 23:29:58 UTC
Non-working scale-out is a regression, requesting a blocker flag.

Comment 12 James Slagle 2016-01-28 15:09:20 UTC
In the os-collect-config log, it's clear the compute node is still running yum update on scale out, which is not intended and is very likely the reason for the timeout.

This indicates that CloudForms is not using the same fix as illustrated in https://bugzilla.redhat.com/show_bug.cgi?id=1290796 on scale out attempts.

Comment 13 Ladislav Smola 2016-01-29 07:53:07 UTC
@James not sure what fix there should be.

For scale out, CloudForms is doing Heat PATCH, changing only ComputeCount parameter and sending master template it downloads from the API, nothing else. I think that should always succeed, shouldn't it?

Comment 14 James Slagle 2016-02-03 10:54:41 UTC
(In reply to Ladislav Smola from comment #13)
> @James not sure what fix there should be.
> 
> For scale out, CloudForms is doing Heat PATCH, changing only ComputeCount
> parameter and sending master template it downloads from the API, nothing
> else. I think that should always succeed, shouldn't it?

have a look at the way the client is passing UpdateIdentifier in clear_parameters on stack update to keep nodes from yum updating on scale out
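As a rough illustration of what that fix looks like on the client side (Python; a hedged sketch of a PATCH update body using the `existing` and `clear_parameters` fields of Heat's stack-update API, not the actual tripleoclient or CloudForms code):

```python
# Sketch of a scale-out update that clears the update/deploy
# identifier parameters so Heat falls back to their defaults and the
# nodes do not re-run yum update during scale out. Parameter names
# follow the discussion above; this is not the real client code.

def build_scale_out_body(compute_count):
    """Body for a Heat PATCH stack update that avoids re-triggering updates."""
    return {
        # PATCH semantics: reuse the stack's existing template and
        # previously supplied parameters:
        "existing": True,
        # The only parameter actually being changed:
        "parameters": {"ComputeCount": compute_count},
        # Resetting these to their defaults keeps the package-update
        # deployment from firing again on the scaled-out nodes:
        "clear_parameters": ["UpdateIdentifier", "DeployIdentifier"],
    }
```

The key point is that a client which only sends the new count, without clearing the identifiers, leaves the update machinery armed.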

Comment 15 Tzu-Mainn Chen 2016-02-15 21:00:16 UTC
(In reply to James Slagle from comment #14)
> (In reply to Ladislav Smola from comment #13)
> > @James not sure what fix there should be.
> > 
> > For scale out, CloudForms is doing Heat PATCH, changing only ComputeCount
> > parameter and sending master template it downloads from the API, nothing
> > else. I think that should always succeed, shouldn't it?
> 
> have a look at the way the client is passing UpdateIdentifier in
> clear_parameters on stack update to keep nodes from yum updating on scale out

I can't find a reference to UpdateIdentifier upstream or in any of the pending patches; do you mean DeployIdentifier in _update_parameters?  It seems to match what you're talking about, but I just want to be sure.

Comment 17 Tzu-Mainn Chen 2016-02-23 21:55:00 UTC
Created attachment 1129928 [details]
os-collect-config log

Comment 18 Tzu-Mainn Chen 2016-02-23 21:57:44 UTC
I updated the CloudForms PATCH call to use clear_parameters with DeployIdentifier and UpdateIdentifier.  The resulting stack scale-up successfully bypassed the yum update; however, the update seems to hang.

I've attached a log of os-collect-config; are there other items that might help us understand what's going on?

Comment 20 Ronnie Rasouli 2016-02-24 14:44:14 UTC
Created attachment 1130236 [details]
heat update failures

Comment 21 Tzu-Mainn Chen 2016-03-02 21:06:39 UTC
Some additional info: it turns out that the scale out works if you run it a *second* time. It looks like this is because there may be Heat hooks left over after the overcloud update that are not cleared. The second scale out attempt works because it happens after the user deletes the failed Heat resource following the timeout, which clears the hooks. If the hooks are indeed left over from the update, then the cleanest solution would be to make sure that the update cleans them up.
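Assuming leftover hooks are in fact the cause, Heat's hook mechanism clears a pending hook by signalling the resource with an `unset_hook` payload; a minimal sketch (the helper function is hypothetical, the payload convention is Heat's documented one):

```python
# Sketch: payload used to clear a leftover Heat hook on a resource by
# signalling it. Heat's breakpoint/hook mechanism accepts a resource
# signal whose details name the hook to unset (e.g. pre-update). This
# is a hedged illustration, not code from this bug.

def hook_clear_signal(hook_type="pre-update"):
    """Signal data that asks Heat to unset the named hook."""
    return {"unset_hook": hook_type}

# With python-heatclient this would be sent roughly as:
#   client.resources.signal(stack_id, resource_name,
#                           data=hook_clear_signal())
```

Clearing the hooks as part of the update itself, as suggested above, would make this manual step unnecessary.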

Comment 23 Mike Burns 2016-04-07 21:00:12 UTC
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

