Bug 1593909

Summary: Overcloud Nodes listed as "ERROR" after Upgrade to OSP13
Product: Red Hat OpenStack
Component: python-tripleoclient
Version: 13.0 (Queens)
Status: CLOSED DUPLICATE
Reporter: Darin Sorrentino <dsorrent>
Assignee: Jiri Stransky <jstransk>
QA Contact: Gurenko Alex <agurenko>
CC: hbrock, jpichon, jslagle, jstransk, mburns
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Last Closed: 2018-06-25 15:16:15 UTC
Type: Bug
Attachments:
sosreport from the server showing 3 overcloud nodes in ERROR state

Description Darin Sorrentino 2018-06-21 20:00:16 UTC
Created attachment 1453600 [details]
sosreport from the server showing 3 overcloud nodes in ERROR state

Description of problem:
Chris J (cjanisze) and I both hit this issue. At the completion of the upgrade to OSP13 on the Director node, some or all of the Overcloud nodes show in an ERROR state:

(undercloud) [stack@ds-hf-ca-undercloud ~]$ openstack server list
+--------------------------------------+------------------------+--------+-----------------------+--------------------------------+---------+
| ID                                   | Name                   | Status | Networks              | Image                          | Flavor  |
+--------------------------------------+------------------------+--------+-----------------------+--------------------------------+---------+
| 3cd682e6-b2c0-4505-af7a-a01786a5cfe4 | overcloud-controller-2 | ACTIVE | ctlplane=172.16.0.105 | overcloud-full_20180619T142126 | control |
| afb6d2a8-0937-488b-85dd-157ac38ad6bf | overcloud-controller-0 | ACTIVE | ctlplane=172.16.0.101 | overcloud-full_20180619T142126 | control |
| 1f57af8d-bdc5-41b9-a58c-b561a7cfe927 | overcloud-compute-0    | ERROR  | ctlplane=172.16.0.112 | overcloud-full_20180619T142126 | compute |
| 2b6f3e6c-83d0-4fe1-856e-a001be10287e | overcloud-compute-1    | ERROR  | ctlplane=172.16.0.103 | overcloud-full_20180619T142126 | compute |
| d3b7b0be-3a55-4a0e-a1fd-15c401b392bb | overcloud-controller-1 | ERROR  | ctlplane=172.16.0.108 | overcloud-full_20180619T142126 | control |
+--------------------------------------+------------------------+--------+-----------------------+--------------------------------+---------+
(undercloud) [stack@ds-hf-ca-undercloud ~]$ 


In my environment (above), 3 nodes are in the ERROR state while 2 remain ACTIVE. Chris had all of his nodes in the ERROR state.

The Overcloud appears to be functional, so we are going to use nova to reset the state to active. I am attaching an sosreport from my environment taken before I force the state change to active.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Jiri Stransky 2018-06-25 15:16:15 UTC
Thanks for the report Darin. We've hit this recently in other environments too; it's a race condition between nova-compute and ironic-conductor starting up. If nova-compute comes up before ironic-conductor is able to reply to requests, the instances backed by ironic go to ERROR. The workaround is `openstack server set --state active <server-id>`.
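For reference, the workaround can be applied to every affected node in one pass. A minimal sketch, assuming the undercloud credentials have already been sourced (e.g. `source ~/stackrc`):

```shell
#!/bin/sh
# Sketch: reset all overcloud nodes stuck in ERROR back to ACTIVE.
# Assumes undercloud credentials are sourced so `openstack` can reach Nova.

# Collect the IDs of servers currently in ERROR state (empty if none,
# or if the CLI is unavailable).
error_ids=$(openstack server list --status ERROR -f value -c ID 2>/dev/null)

count=0
for id in $error_ids; do
    # Force the Nova state of each affected instance back to ACTIVE.
    openstack server set --state active "$id" && count=$((count + 1))
done
echo "reset $count server(s) to ACTIVE"
```

The state change only updates Nova's view of the instance; it does not touch the node itself, which matches the observation above that the Overcloud remains functional.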

Being tracked as bug 1590297, so I'll mark this one as a duplicate.

*** This bug has been marked as a duplicate of bug 1590297 ***