Bug 1250088 - rhel-osp-director: Failing to scale the overcloud with one additional compute: Nova show for the node reports: {"message": "Build of instance [UUID] aborted: Could not clean up failed build, not rescheduling"
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: async
Target Release: 7.0 (Kilo)
Assigned To: Dmitry Tantsur
QA Contact: Alexander Chuzhoy
Keywords: Triaged, ZStream
Reported: 2015-08-04 09:38 EDT by Alexander Chuzhoy
Modified: 2015-08-17 15:32 EDT

Fixed In Version: openstack-nova-2015.1.0-17.el7ost
Doc Type: Bug Fix
Doc Text:
Cause: A race condition between Nova and Ironic when retrying certain deployment operations while scaling up the overcloud could cause a previously deployed host to be selected.
Consequence: Deployment of the new host(s) failed.
Fix: The race condition is addressed by relying on Ironic's retry mechanism.
Result: Deployment of new host(s) succeeds.
Last Closed: 2015-08-17 15:32:49 EDT
Type: Bug


Attachments
nova and ironic logs from the undercloud (15.94 MB, application/x-gzip) - 2015-08-04 15:19 EDT, Alexander Chuzhoy
Patch to limit number of retries in nova (6.74 KB, patch) - 2015-08-07 08:41 EDT, Dmitry Tantsur
Patch to stop nova retrying on conflict (1.16 KB, patch) - 2015-08-07 10:49 EDT, Dmitry Tantsur


External Trackers:
OpenStack gerrit 210369
OpenStack gerrit 211097

Description Alexander Chuzhoy 2015-08-04 09:38:11 EDT
rhel-osp-director: Failing to update the overcloud with one additional compute: Nova show for the node reports: {"message": "Build of instance [UUID] aborted: Could not clean up failed build, not rescheduling"


Environment:
instack-undercloud-2.1.2-22.el7ost.noarch
openstack-ironic-conductor-2015.1.0-9.el7ost.noarch
python-ironicclient-0.5.1-9.el7ost.noarch
openstack-ironic-api-2015.1.0-9.el7ost.noarch
python-ironic-discoverd-1.1.0-5.el7ost.noarch
openstack-ironic-common-2015.1.0-9.el7ost.noarch
openstack-ironic-discoverd-1.1.0-5.el7ost.noarch



Steps to reproduce:
1. Install undercloud.
2. Deploy overcloud with 1 controller, 1 compute, 1 ceph node.
   Command: openstack overcloud deploy --plan overcloud --control-scale 1  --compute-scale 1  --ceph-storage-scale 1 --block-storage-scale 0 --swift-storage-scale 0 -e [path to yaml file] --ntp-server [IP] --timeout 90

3. Attempt to update the overcloud with an additional controller.
   Command: openstack overcloud deploy --plan overcloud --control-scale 1  --compute-scale 2  --ceph-storage-scale 1 --block-storage-scale 0 --swift-storage-scale 0 -e [path to yaml file] --ntp-server [IP] --timeout 90

Result:
The update process gets stuck. Running "nova list" shows "BUILD | spawning | NOSTATE" in the Status, Task State and Power State fields for the node that should be added to the overcloud.
Running "ironic node-list" shows "| [UUID] | None | None | power off | available | False |" in the UUID, Name, Instance UUID, Power State, Provision State and Maintenance fields.


Expected result:
The node being added to the overcloud should boot.
Comment 4 Alexander Chuzhoy 2015-08-04 15:19:17 EDT
Created attachment 1059206 [details]
nova and ironic logs from the undercloud
Comment 5 Alexander Chuzhoy 2015-08-04 15:20:46 EDT
Reproducing after reinstalling from scratch.
Comment 6 Alexander Chuzhoy 2015-08-04 15:21:40 EDT
Note: prior to attempting the overcloud update, I configured the networks on the overcloud and launched 30 instances.
Comment 7 Alexander Chuzhoy 2015-08-05 16:02:49 EDT
Reproduced on a freshly installed setup without the step in comment #6.
Comment 8 Dmitry Tantsur 2015-08-06 09:45:06 EDT
In logs:

2015-08-04 15:17:06.796 6363 WARNING ironicclient.common.http [-] Request returned failure status.
2015-08-04 15:17:06.796 6363 WARNING ironicclient.common.http [-] Error contacting Ironic server: Node 18da0c75-027f-46b8-bfb1-1a1cb54d4c82 is associated with instance bde93567-f4b7-45bb-8d8a-c74d1666195b.

Could you try the workaround in https://bugzilla.redhat.com/show_bug.cgi?id=1251117 to see if it fixes the problem for you?
Comment 9 Alexander Chuzhoy 2015-08-06 10:52:37 EDT
The W/A in https://bugzilla.redhat.com/show_bug.cgi?id=1251117 doesn't fix the issue.
Comment 10 Rhys Oxenham 2015-08-06 10:54:19 EDT
I had a look into this system... unfortunately we cannot fix it with the workaround I suggested in that bug. The problem seems to be a disconnect between Nova and Ironic. Even though the Ironic node '18da0c75-027f-46b8-bfb1-1a1cb54d4c82' is associated with another instance, Nova has chosen it as its deployment host and continually tries to use it before giving up.

"path": "/instance_uuid", "value": "3ddbc728-4ab3-4c66-9f60-792b6accb664", "op": "add"}]' http://192.0.2.1:6385/v1/nodes/4a281551-dbc1-44d4-b1a3-4649a48b1f61

Error contacting Ironic server: Node 4a281551-dbc1-44d4-b1a3-4649a48b1f61 is associated with instance 076f7c9b-755c-4df0-ad6a-d074e75e9c63.

It seems as though it thinks that 'active' is an available state!
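
To illustrate the failure mode (a hypothetical sketch only, with made-up names; this is not the actual nova Ironic driver code): once a node's instance_uuid is set, Ironic rejects any further association attempt with HTTP 409, so retrying the same request against that node can never succeed; such a node simply should not be a candidate in the first place.

# Hypothetical sketch of the failure mode described above; names and
# structure are illustrative and do not match the real nova Ironic driver.

class Conflict(Exception):
    """Stand-in for the HTTP 409 (NodeAssociated) error returned by Ironic."""

def associate_instance(node, instance_uuid):
    # Ironic refuses to overwrite an existing association - this is the
    # "Node ... is associated with instance ..." error seen in the logs.
    if node.get("instance_uuid"):
        raise Conflict("Node %s is associated with instance %s"
                       % (node["uuid"], node["instance_uuid"]))
    node["instance_uuid"] = instance_uuid

def pick_free_node(nodes):
    # A node in 'active' provision state, or with instance_uuid already set,
    # is not available for a new deploy, regardless of what the scheduler's
    # resource view says.
    for node in nodes:
        if node["provision_state"] == "available" and not node.get("instance_uuid"):
            return node
    raise LookupError("no free node to deploy on")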
Comment 11 Alexander Chuzhoy 2015-08-06 11:45:52 EDT
Reproduced on another setup while attempting to scale from 1 to 3 computes.

Note: one node was powered on by Ironic while the second remained off.


The following entries were found in the log:

2015-08-06 11:41:35.710 6480 WARNING ironicclient.common.http [-] Error contacting Ironic server: Node 4e215f24-3279-4507-8cae-6be133a004ad is associated with instance 42fb97be-06cc-4403-9b5b-05df3d4ce7ab.
Traceback (most recent call last):

  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
    executor_callback))

  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
    executor_callback)

  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 130, in _do_dispatch
    result = func(ctxt, **new_args)

  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 142, in inner
    return func(*args, **kwargs)

  File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 405, in update_node
    node_obj.save()

  File "/usr/lib/python2.7/site-packages/ironic/objects/base.py", line 143, in wrapper
    return fn(self, ctxt, *args, **kwargs)

  File "/usr/lib/python2.7/site-packages/ironic/objects/node.py", line 265, in save
    self.dbapi.update_node(self.uuid, updates)

  File "/usr/lib/python2.7/site-packages/ironic/db/sqlalchemy/api.py", line 338, in update_node
    return self._do_update_node(node_id, values)

  File "/usr/lib/python2.7/site-packages/ironic/db/sqlalchemy/api.py", line 364, in _do_update_node
    instance=ref.instance_uuid)

NodeAssociated: Node 4e215f24-3279-4507-8cae-6be133a004ad is associated with instance 42fb97be-06cc-4403-9b5b-05df3d4ce7ab.
 (HTTP 409). Attempt 24 of 61
Comment 12 Alexander Chuzhoy 2015-08-06 14:40:00 EDT
The issue reproduces on an attempt to scale the overcloud with 2 additional ceph nodes.
One additional node was powered on and the other was not.
Comment 13 marios 2015-08-07 05:13:58 EDT
I just tried this and it works for me. The original report attempts to scale the controller (which AFAIK is currently unsupported), but subsequent comments lead me to believe it is compute/ceph.

The main difference is the scaling procedure (I don't see any mention of it in the original report). I'm not sure what procedure the OP is following, but it looks like the original reporter didn't follow the documented procedure for scaling (whether using templates or not) at https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux_OpenStack_Platform/7/html/Director_Installation_and_Usage/sect-Scaling_the_Overcloud.html

I did the following:

DEPLOY:
openstack overcloud deploy --plan overcloud --control-scale 1  --compute-scale 1  --ceph-storage-scale 1

(Overcloud Deployed)
SCALE:
openstack management plan set 59777086-d07e-454c-9598-a989465349b4 -S Compute-1=2
openstack overcloud deploy --plan overcloud
New compute node comes up, goes to Active
Overcloud Deployed

TEST:
launched an overcloud instance and it got scheduled to the new compute like:

[stack@instack ~]$ nova service-list
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                               | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| 1  | nova-scheduler   | overcloud-controller-0.localdomain | internal | enabled | up    | 2015-08-07T09:04:21.000000 | -               |
| 2  | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2015-08-07T09:04:22.000000 | -               |
| 3  | nova-conductor   | overcloud-controller-0.localdomain | internal | enabled | up    | 2015-08-07T09:04:22.000000 | -               |
| 4  | nova-compute     | overcloud-compute-0.localdomain    | nova     | enabled | up    | 2015-08-07T09:04:20.000000 | -               |
| 5  | nova-compute     | overcloud-compute-1.localdomain    | nova     | enabled | up    | 2015-08-07T09:04:17.000000 | -               |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
[stack@instack ~]$ nova host-describe overcloud-compute-1.localdomain
+---------------------------------+----------------------------------+-----+-----------+---------+
| HOST                            | PROJECT                          | cpu | memory_mb | disk_gb |
+---------------------------------+----------------------------------+-----+-----------+---------+
| overcloud-compute-1.localdomain | (total)                          | 1   | 3791      | 40      |
| overcloud-compute-1.localdomain | (used_now)                       | 1   | 1024      | 10      |
| overcloud-compute-1.localdomain | (used_max)                       | 1   | 512       | 10      |
| overcloud-compute-1.localdomain | a96a2792791248c7a01de85d9a461429 | 1   | 512       | 10      |
+---------------------------------+----------------------------------+-----+-----------+---------+
[stack@instack ~]$
Comment 14 marios 2015-08-07 05:52:35 EDT
Sorry, the original reporter didn't try to scale the controller - see his command there; it was just a typo in the sentence. In any case the point still stands. I am just running through a deploy now and am going to try scaling without the plan set part to see if I can reproduce exactly the error seen.
Comment 15 marios 2015-08-07 06:34:26 EDT
So, I couldn't reproduce the failure by just doing:

deploy like:
openstack overcloud deploy --plan overcloud --control-scale 1  --compute-scale 1  --ceph-storage-scale 1

scale compute like:
openstack overcloud deploy --plan overcloud --control-scale 1  --compute-scale 2  --ceph-storage-scale 1

It comes up, and I launched an instance etc. (in fact the new compute gets the instance).

My env isn't the latest puddle (it's the one before that), but the packages seem to match what is reported:

[stack@instack ~]$ rpm -qa | grep "instack-undercloud\|ironic-conductor\|ironicclient\|ironic-api\|ironic-*" 
openstack-ironic-common-2015.1.0-9.el7ost.noarch
openstack-ironic-api-2015.1.0-9.el7ost.noarch
instack-undercloud-2.1.2-22.el7ost.noarch
openstack-ironic-discoverd-1.1.0-5.el7ost.noarch
python-ironic-discoverd-1.1.0-5.el7ost.noarch
python-ironicclient-0.5.1-9.el7ost.noarch
openstack-ironic-conductor-2015.1.0-9.el7ost.noarch


In any case, can you please try with the additional plan set step and see if it reproduces for you? Also, are you using network isolation (I see the -e but it isn't specified what it is for)? Just so I can reproduce exactly.
Comment 17 Dmitry Tantsur 2015-08-07 08:41:42 EDT
Created attachment 1060353 [details]
Patch to limit number of retries in nova

Sasha, could you please test the attached patch?

Use something like that to apply:

 cd /usr/lib/python2.7/site-packages
 patch -p1 < /path/to/native-retries.patch
Comment 18 Dmitry Tantsur 2015-08-07 09:24:35 EDT
Important note: the patch may lead to failures after reboot. I'll work on a better patch if you confirm this one solves the problem for you.
Comment 19 Dmitry Tantsur 2015-08-07 10:49:44 EDT
Created attachment 1060382 [details]
Patch to stop nova retrying on conflict

I've prepared a simplified version of the patch. Please revert your nova installation to original RPM and retry with this patch.
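
For context, here is a minimal sketch of the behaviour this second patch is described as changing (an illustration under assumed names, not the contents of the attached patch): when a call to Ironic fails with HTTP 409 Conflict, give up immediately instead of retrying, so nova fails the build quickly rather than cycling through dozens of attempts ("Attempt 24 of 61" in the log above) against a node that will never become free.

# Illustrative only - not the attached patch. Models the general
# "stop retrying on conflict" idea with assumed names.
import time

class Conflict(Exception):
    """Stand-in for an HTTP 409 response from the Ironic API."""

def call_ironic_with_retries(func, max_retries=60, retry_interval=2,
                             retry_on_conflict=True):
    attempt = 0
    while True:
        attempt += 1
        try:
            return func()
        except Conflict:
            # With retry_on_conflict=False a 409 is terminal: the busy node
            # fails immediately and nova can give up on it, instead of
            # looping against a node owned by another instance.
            if not retry_on_conflict or attempt >= max_retries:
                raise
            time.sleep(retry_interval)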
Comment 20 Omri Hochman 2015-08-07 11:00:55 EDT
> Created attachment 1060353 [details]
> Patch to limit number of retries in nova
> 
> Sasha, could you please test the attached patch?
> 
> Use something like that to apply:
> 
>  cd /usr/lib/python2.7/site-packages
>  patch -p1 < /path/to/native-retries.patch

Verified that the patch from comment #17 works.
Comment 21 Omri Hochman 2015-08-07 13:03:04 EDT
(In reply to Dmitry Tantsur from comment #19)
> Created attachment 1060382 [details]
> Patch to stop nova retrying on conflict
> 
> I've prepared a simplified version of the patch. Please revert your nova
> installation to original RPM and retry with this patch.

Verified that the patch from comment #19 works.
Comment 22 Alexander Chuzhoy 2015-08-07 17:40:31 EDT
(In reply to Dmitry Tantsur from comment #17)

I tested it on a VM environment and the reported issue didn't reproduce.
Comment 23 Dmitry Tantsur 2015-08-10 07:26:14 EDT
Proper patch proposed to upstream master as https://review.openstack.org/#/c/210369
Comment 25 Mark McLoughlin 2015-08-12 08:18:48 EDT
(In reply to Dmitry Tantsur from comment #23)
> Proper patch proposed to upstream master as
> https://review.openstack.org/#/c/210369

This is now merged upstream
Comment 27 Omri Hochman 2015-08-14 15:21:42 EDT
Verified with openstack-nova-2015.1.0-17.el7ost.
Comment 29 errata-xmlrpc 2015-08-17 15:32:49 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:1638
