Bug 1298196 - Scaling out an Openstack Infrastructure provider with an additional compute node doesn't work for an updated overcloud
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ga
Target Release: 8.0 (Liberty)
Assignee: Angus Thomas
QA Contact: Dan Yasny
URL:
Whiteboard:
Depends On: 1288220
Blocks:
 
Reported: 2016-01-13 12:58 UTC by Mike Burns
Modified: 2016-04-20 11:22 UTC (History)
12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 1288220
Environment:
Last Closed: 2016-04-20 11:22:14 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Mike Burns 2016-01-13 12:58:29 UTC
Cloning to 8 for tracking there.

+++ This bug was initially created as a clone of Bug #1288220 +++

Description of problem:
Scaling out an Openstack Infrastructure provider with an additional compute node for an overcloud updated from 7.1 to 7.2 doesn't work.

Version-Release number of selected component (if applicable):
5.5.0.13

How reproducible:
100%

Steps to Reproduce:
1. Deploy RHEL OSP 7.1 (3 ctrls, 1 compute and 3 ceph nodes in my test)
2. Update both undercloud and overcloud to version 7.2 by following the update procedure
3. Add the undercloud to Cloudforms as Openstack Infra provider
4. Scale out the infra provider with an additional compute node
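
For reference, the CloudForms scale-out in step 4 amounts to a Heat stack update on the undercloud. A manual CLI equivalent would be re-running the original deploy command (shown under Additional info below) with the compute count raised from 1 to 2. This is only a sketch; all other options must match the initial deployment exactly:

```shell
# Sketch of a manual scale-out: same command as the initial deployment,
# with --compute-scale bumped from 1 to 2. Every other option must be
# identical or the stack update will change more than the node count.
openstack overcloud deploy \
    --templates ~/templates/my-overcloud \
    --control-scale 3 --compute-scale 2 --ceph-storage-scale 3 \
    --ntp-server clock.redhat.com \
    --libvirt-type qemu \
    -e ~/templates/my-overcloud/environments/network-isolation.yaml \
    -e ~/templates/network-environment.yaml \
    -e ~/templates/firstboot-environment.yaml \
    -e ~/templates/ceph.yaml
```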

Actual results:
The stack update seems to get stuck.

Expected results:
The stack update finishes.

Additional info:

From Openstack side:

Initial deploy command:
openstack overcloud deploy \
    --templates ~/templates/my-overcloud \
    --control-scale 3 --compute-scale 1 --ceph-storage-scale 3 \
    --ntp-server clock.redhat.com \
    --libvirt-type qemu \
    -e ~/templates/my-overcloud/environments/network-isolation.yaml \
    -e ~/templates/network-environment.yaml \
    -e ~/templates/firstboot-environment.yaml \
    -e ~/templates/ceph.yaml 

Update command:
/usr/bin/yes '' | openstack overcloud update stack overcloud -i \
         --templates ~/templates/my-overcloud \
         -e ~/templates/my-overcloud/overcloud-resource-registry-puppet.yaml \
         -e ~/templates/my-overcloud/environments/network-isolation.yaml \
         -e ~/templates/network-environment.yaml \
         -e ~/templates/firstboot-environment.yaml \
         -e ~/templates/ceph.yaml \
         -e ~/templates/my-overcloud/environments/updates/update-from-vip.yaml \
         -e ~/templates/ctrlport.yaml

--- Additional comment from Ladislav Smola on 2015-12-04 11:00:02 EST ---

This is the call in ManageIQ:

https://bugzilla.redhat.com/show_bug.cgi?id=1288220

We call PATCH with the master template, which we get from the API, and only one Count parameter, used for scaling.
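
A minimal sketch of what such a request body could look like, assuming a Heat PATCH stack-update (where parameters not supplied keep their current values). The parameter name ComputeCount and the helper below are illustrative, not the actual ManageIQ code:

```python
import json

def build_scale_out_patch(master_template, compute_count):
    """Body for a Heat PATCH stack-update that changes only the compute
    node count. With PATCH, Heat merges these parameters into the
    existing stack parameters instead of replacing them. 'ComputeCount'
    is illustrative; the real name depends on the overcloud templates.
    """
    return {
        "template": master_template,
        "parameters": {"ComputeCount": compute_count},
    }

# The master template itself is fetched from the Heat API beforehand.
body = build_scale_out_patch(master_template="<master template from API>",
                             compute_count=2)
print(json.dumps(body["parameters"]))  # {"ComputeCount": 2}
```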

--- Additional comment from Marius Cornea on 2015-12-07 08:14 EST ---

Moving this to the Openstack side. I tested the scale out on a non-updated 7.2 deployment and it worked.

When I try to scale out on an updated deployment I can see the new compute node provisioned and yum update triggered for it. In the end the overcloud stack update times out after 240 minutes.

Attaching the os-collect-config log from the added compute node.

--- Additional comment from Marius Cornea on 2015-12-07 11:48:17 EST ---

From what I can tell from the os-collect-config log the last action on the newly added compute was: 

++ curl -s -w '%{http_code}' -X POST -H Content-Type: -o /tmp/tmp.UOjYLGlR5o --data-binary '{"deploy_stdout": "os-apply-config deployment 0dcd5cb4-3eb7-41ae-88ac-66decc9bfd75 completed", "deploy_status_code": "0"}' 'http://192.0.2.1:8000/v1/signal/arn%3Aopenstack%3Aheat%3A%3A9edce9ad17474a2093b89749ea8ae99c%3Astacks%2Fovercloud-Compute-vwqrtha3heky-1-727wpn3ted3l%2F024b0343-e168-4ab8-be40-464a96187655%2Fresources%2FNovaComputeDeployment?Timestamp=2015-12-07T12%3A44%3A15Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=5e5bf7331dcc4bfe8a2603ff8b8efb2e&SignatureVersion=2&Signature=FfuLQZUy%2FjUABdLZUQGfEHEOiMJOYcUXj9bH3OJEA4w%3D'
+ status=200

while the undercloud heat-api.log shows 

{
    "code": 404,
    "error": {
        "message": "Not found",
        "traceback": "Traceback (most recent call last):\n\n  File \"/usr/lib/python2.7/site-packages/heat/common/context.py\", line 300, in wrapped\n    return func(self, ctx, *args, **kwargs)\n\n  File \"/usr/lib/python2.7/site-packages/heat/engine/service.py\", line 1504, in show_software_deployment\n    cnxt, deployment_id)\n\n  File \"/usr/lib/python2.7/site-packages/heat/engine/service_software_config.py\", line 148, in show_software_deployment\n    cnxt, deployment_id)\n\n  File \"/usr/lib/python2.7/site-packages/heat/objects/software_deployment.py\", line 72, in get_by_id\n    db_api.software_deployment_get(context, deployment_id))\n\n  File \"/usr/lib/python2.7/site-packages/heat/db/api.py\", line 292, in software_deployment_get\n    return IMPL.software_deployment_get(context, deployment_id)\n\n  File \"/usr/lib/python2.7/site-packages/heat/db/sqlalchemy/api.py\", line 820, in software_deployment_get\n    deployment_id)\n\nNotFound: Deployment with id 0dcd5cb4-3eb7-41ae-88ac-66decc9bfd75 not found\n",
        "type": "NotFound"
    },
    "explanation": "The resource could not be found.",
    "title": "Not Found"
}

2015-12-07 11:33:14.715 1241 INFO eventlet.wsgi.server [req-e57b9aa6-d5c2-4892-9790-f203e3b5d9f5 fb1641bfac4d45c2a9af85bf39c78cb3 9edce9ad17474a2093b89749ea8ae99c] 192.0.2.1 - - [07/Dec/2015 11:33:14] "GET /v1/9edce9ad17474a2093b89749ea8ae99c/software_deployments/0dcd5cb4-3eb7-41ae-88ac-66decc9bfd75 HTTP/1.1" 404 1382 0.165222
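
Note that the id Heat reports as not found (0dcd5cb4-3eb7-41ae-88ac-66decc9bfd75) is the same deployment id the node reported as completed in its deploy_stdout. For reference, the stack and resource the compute node was signalling can be read out of the curl URL above by percent-decoding the ARN path; a standard-library sketch using only the values from the log:

```python
from urllib.parse import unquote

# Signal path from the os-collect-config log above (query string dropped).
signal_path = ("/v1/signal/arn%3Aopenstack%3Aheat%3A%3A"
               "9edce9ad17474a2093b89749ea8ae99c%3Astacks"
               "%2Fovercloud-Compute-vwqrtha3heky-1-727wpn3ted3l"
               "%2F024b0343-e168-4ab8-be40-464a96187655"
               "%2Fresources%2FNovaComputeDeployment")

arn = unquote(signal_path.split("/v1/signal/")[1])
# arn:openstack:heat::<tenant>:stacks/<stack name>/<stack id>/resources/<resource>
tenant = arn.split(":")[4]
stack_name, stack_id, _, resource = arn.split(":")[5].split("/")[1:]

print(tenant)    # 9edce9ad17474a2093b89749ea8ae99c
print(resource)  # NovaComputeDeployment
```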

--- Additional comment from Angus Thomas on 2015-12-17 11:23:12 EST ---

This is believed to be resolved by the fix posted on https://bugzilla.redhat.com/show_bug.cgi?id=1290796

Can it be re-tested, with that fix in place?

--- Additional comment from Marius Cornea on 2015-12-17 12:46:16 EST ---

I re-tested with the latest build but I am getting the same results. The stack update hasn't failed yet but I am expecting it to time out after 240 minutes. 

I'm going to leave the environment available for debugging. Thank you.

--- Additional comment from Udi on 2015-12-22 04:06:16 EST ---

I am hitting the same issue, scale out is impossible after update from 7.1 to 7.2. There is another bug related to this: https://bugzilla.redhat.com/show_bug.cgi?id=1257950.

--- Additional comment from Marius Cornea on 2016-01-03 13:26:17 EST ---

Got the same result with the last scale out attempt:

| stack_status          | UPDATE_FAILED |
| stack_status_reason   | Timed out     |

Comment 2 Hugh Brock 2016-02-07 14:19:44 UTC
I believe this is fixed by the patch described in this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1290796

Mike Burns, can you verify the patch in the referenced bug made it into 8.0, and if so move this to on_qa testonly?

Thanks,
--Hugh

Comment 3 Mike Burns 2016-02-08 10:42:07 UTC
The patch for that particular bug was a downstream-only patch that shouldn't be needed on Liberty. Basically, this means that using parameter_defaults everywhere should be all that we need, and that's been done.

Comment 7 Dan Yasny 2016-04-20 09:59:50 UTC
Retested
1. without SSL on both UC and OC
2. installed 7.3GA, upgraded UC to 8puddle
Workarounds:
- updated the rabbitmq password
- added extra swap to the instack VM
- fixed the (broken by UC upgrade) networking (missing vlan10)
3. populated the overcloud with 5 tenants/vms/volumes/etc
4. made a local kilo THT copy
5. ran the deploy command pointing at the kilo THT dir, and added 1 compute
6. added 5 more instances
7. verified some of the new instances got started on the new compute node

Setting BZ to VERIFIED, though it would be beneficial to repeat the test with SSL on UC and OC, as well as with IPv6.

