Bug 1303246

Summary: openstack commands fail on overcloud after failed overcloud scaling deploy
Product: Red Hat OpenStack
Component: openstack-tripleo
Version: 7.0 (Kilo)
Target Release: 10.0 (Newton)
Reporter: Jeremy <jmelvin>
Assignee: James Slagle <jslagle>
QA Contact: Shai Revivo <srevivo>
Status: CLOSED WONTFIX
Severity: high
Priority: high
Keywords: Unconfirmed
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-10-05 19:34:05 UTC
CC: dblack, ebarrera, hbrock, jcoufal, jdennis, jmelvin, jraju, jslagle, jthomas, mburns, nkinder, rhel-osp-director-maint, srevivo

Description Jeremy 2016-01-29 22:27:09 UTC
Description of problem: The customer attempted to scale the overcloud by adding one additional compute node (--compute-scale=$whatever). After that, he cannot run any openstack commands after sourcing the overcloudrc. (His stack is named "va" instead of "overcloud", so the rc file is "varc".)
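For reference, a scale-out on the OSP 7 director is driven by re-running the deploy command with a higher node count. A minimal sketch, assuming the usual workflow (the count and environment file path are illustrative, not the customer's actual command):

openstack overcloud deploy --templates \
  --compute-scale 16 \
  -e /path/to/same/environment/files/as/original/deploy.yaml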

[heat-admin@va-controller-0 ~]$ source varc 
[heat-admin@va-controller-0 ~]$ nova list
ERROR (Unauthorized): Invalid user / password (Disable debug mode to suppress these details.) (HTTP 401) (Request-ID: req-4a62b261-ef5a-4e51-afb7-d24c02db20c5)

Attempted to re-run the deploy using the original compute scale (--compute-scale=$original_number) in hopes of restoring the stack; this gets stuck on:

+---------------+--------------------------------------+----------------------+--------------------+----------------------+
| resource_name | physical_resource_id                 | resource_type        | resource_status    | updated_time         |
+---------------+--------------------------------------+----------------------+--------------------+----------------------+
| 10            | 8e52ae96-84ac-4421-b027-46e6a5a422ab | OS::TripleO::Compute | UPDATE_COMPLETE    | 2016-01-29T20:00:57Z |
| 11            | fcab3113-222b-4edd-bd99-0b7a3f1bf2b3 | OS::TripleO::Compute | UPDATE_IN_PROGRESS | 2016-01-29T20:01:36Z |
| 5             | 1d09fd98-32c3-46b8-b9af-d767e43c636b | OS::TripleO::Compute | UPDATE_IN_PROGRESS | 2016-01-29T20:02:45Z |
| 2             | 574d17ff-3ba8-4d03-9420-8bb4693bf993 | OS::TripleO::Compute | UPDATE_IN_PROGRESS | 2016-01-29T20:04:12Z |
| 8             | 463351f3-31c4-482d-8871-447f3fb8bfb1 | OS::TripleO::Compute | UPDATE_IN_PROGRESS | 2016-01-29T20:05:18Z |
| 13            | 51c5f1d3-4b5e-4fff-b596-642f720280f4 | OS::TripleO::Compute | UPDATE_IN_PROGRESS | 2016-01-29T20:06:27Z |
| 1             | 7ab7184c-0727-4ab6-a742-0aba188e1d5f | OS::TripleO::Compute | UPDATE_IN_PROGRESS | 2016-01-29T20:07:30Z |
| 7             | 77ac74f4-9a19-4204-9594-5f723a828786 | OS::TripleO::Compute | UPDATE_IN_PROGRESS | 2016-01-29T20:08:27Z |
| 15            | 230ad103-ab0a-4e52-934e-325ddd43ab32 | OS::TripleO::Compute | UPDATE_IN_PROGRESS | 2016-01-29T20:09:07Z |
| 6             | 0628019e-3e05-4628-9497-284f640334e5 | OS::TripleO::Compute | UPDATE_IN_PROGRESS | 2016-01-29T20:09:46Z |
| 12            | d81beffe-2190-43e0-8a6f-ef6526aefee7 | OS::TripleO::Compute | UPDATE_IN_PROGRESS | 2016-01-29T20:10:34Z |
| 14            | 712d9855-8ec8-464c-a050-5d0c2e4be2ef | OS::TripleO::Compute | UPDATE_IN_PROGRESS | 2016-01-29T20:11:11Z |
| 3             | dbc54ef4-6c2e-405e-8fa7-b379e2d53796 | OS::TripleO::Compute | UPDATE_IN_PROGRESS | 2016-01-29T20:11:57Z |
| 0             | 911d9151-25be-4b87-a3f2-a6461c7abff2 | OS::TripleO::Compute | UPDATE_IN_PROGRESS | 2016-01-29T20:12:57Z |
| 9             | 497cf259-f81c-454f-95cc-cff9be6a2996 | OS::TripleO::Compute | UPDATE_IN_PROGRESS | 2016-01-29T20:13:43Z |
| 4             | d4ad705f-0f58-4cd0-85e0-e9bc28b802e4 | OS::TripleO::Compute | UPDATE_IN_PROGRESS | 2016-01-29T20:14:27Z |
+---------------+--------------------------------------+----------------------+--------------------+----------------------+
I assume the stack got rid of the failed node that was being scaled in; now the original nodes are stuck in progress. So...

At this point I'm investigating keystone.

I see this in /var/log/keystone:

2016-01-29 17:12:54.241 7631 DEBUG keystone.middleware.core [-] Auth token not in the request header. Will not build auth context. process_request /usr/lib/python2.7/site-packages/keystone/middleware/core.py:229
2016-01-29 17:12:54.241 7631 INFO keystone.common.wsgi [-] POST /tokens?
2016-01-29 17:12:54.291 7631 DEBUG oslo_messaging._drivers.amqp [-] UNIQUE_ID is f5d985d30d5648bfa3402de251d89a14. _add_unique_id /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqp.py:258


Version-Release number of selected component (if applicable):
keystone-2015.1.1-1.el7ost.noarch


Comment 2 Hugh Brock 2016-01-30 14:00:55 UTC
Without knowing for sure, I would guess we're going to need a lot more logs to get an idea of what's going on here. 

What was the result of the token replacement above?

Comment 3 Jon Thomas 2016-01-30 14:05:20 UTC
Token replacement did not work. My guess is the DB has a new admin password and the rc files didn't get rewritten with the new password.
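One way to test that theory, assuming the Kilo-era director layout (keystone's configured admin password in hieradata on the controllers, the generated passwords in ~/tripleo-overcloud-passwords on the undercloud):

# on a controller: the admin password keystone was actually configured with
sudo hiera admin_password

# on the undercloud: the password the deploy generated
grep OVERCLOUD_ADMIN_PASSWORD ~/tripleo-overcloud-passwords

# the password the rc file sends
grep OS_PASSWORD varc

If the hiera value and OS_PASSWORD disagree, the rc file is stale.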

Comment 5 Jon Thomas 2016-01-30 20:18:01 UTC
Another test: tried updating the password, with no luck.

  /etc/keystone/keystone.conf 
  admin_token = 2c7d0ddb1aed4579172cbba4105e28e6a7f69587 
  #admin_token = 3f038ac3099362ac23b53cfc4b38ecae39e78825 

Tried both tokens.

Log in as root on the controller:
export OS_SERVICE_TOKEN=2c7d0ddb1aed4579172cbba4105e28e6a7f69587
export OS_SERVICE_ENDPOINT=http://10.133.165.20:35357/v2.0/
keystone user-password-update --pass a96d995c317ffe65d7d29157774c1d67f90924e0 admin

This didn't work, and we also didn't get the following warning. I tested the same steps on my machine(s) and they work:
 
WARNING: Bypassing authentication using a token & endpoint (authentication credentials are being ignored).
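The keystone CLI prints that warning whenever it actually bypasses authentication with a token and endpoint, so its absence suggests the token path may never have been taken (for example, if credentials from varc were still exported). A retry worth attempting, as a sketch using the same token and endpoint as above:

# clear any password-based credentials left over from sourcing varc
unset OS_USERNAME OS_PASSWORD OS_TENANT_NAME OS_AUTH_URL
export OS_SERVICE_TOKEN=2c7d0ddb1aed4579172cbba4105e28e6a7f69587
export OS_SERVICE_ENDPOINT=http://10.133.165.20:35357/v2.0/
keystone user-password-update --pass <new password> admin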

Comment 6 Jon Thomas 2016-01-31 01:04:54 UTC
We tried:

1) changing the tokens to what was listed in tripleo-overcloud-passwords
2) using token only auth to change the passwords
3) Rerunning the deploy command with only the original computes
4) using heat stack-cancel-update
5) redefining the new node and rerunning the scale up deploy command

All of these failed.

Current state is:

1) still cannot use the admin account to access the controllers. Other user accounts evidently work fine.
2) the stack is in UPDATE_FAILED state with compute and controller resources marked as failed.

There is newer code available. However, I'd rather get engineering feedback before attempting an upgrade, since this is a running system integrated with OpenShift and CloudForms.
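For anyone picking this up, a common way to narrow down which nested resources actually failed, assuming a heatclient new enough to support --nested-depth:

# list every failed resource across nested stacks (the stack name here is "va")
heat resource-list --nested-depth 5 va | grep -i FAILED

# then inspect an individual failure in its nested stack
heat resource-show <nested_stack_id> <failed_resource_name>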

Comment 7 Eduard Barrera 2016-01-31 13:44:31 UTC
I noticed that admin_tenant_name=service in the customer's nova.conf, while in my nova.conf it is "services".

If I change mine to "service" instead of "services", I get very similar errors, but not the same ones:

# nova list
ERROR (Unauthorized): Unauthorized (HTTP 401) (Request-ID: req-87938873-b5d0-41e9-ab2c-12f506e4c040)



2016-01-31 08:30:37.865 5757 DEBUG keystoneclient.auth.identity.v2 [-] Making authentication request to http://192.168.101.196:35357/v2.0/tokens get_auth_ref /usr/lib/python2.7/site-packages/keystoneclient/auth/identity/v2.py:76
2016-01-31 08:30:37.920 5757 DEBUG keystoneclient.session [-] Request returned failure status: 401 request /usr/lib/python2.7/site-packages/keystoneclient/session.py:396
2016-01-31 08:30:37.921 5757 WARNING keystonemiddleware.auth_token [-] Identity response: {"error": {"message": "Could not find project: service (Disable debug mode to suppress these details.)", "code": 401, "title": "Unauthorized"}}
2016-01-31 08:30:37.921 5757 DEBUG keystoneclient.auth.identity.v2 [-] Making authentication request to http://192.168.101.196:35357/v2.0/tokens get_auth_ref /usr/lib/python2.7/site-packages/keystoneclient/auth/identity/v2.py:76
2016-01-31 08:30:37.972 5757 DEBUG keystoneclient.session [-] Request returned failure status: 401 request /usr/lib/python2.7/site-packages/keystoneclient/session.py:396
2016-01-31 08:30:37.973 5757 WARNING keystonemiddleware.auth_token [-] Identity response: {"error": {"message": "Could not find project: service (Disable debug mode to suppress these details.)", "code": 401, "title": "Unauthorized"}}  <=================
2016-01-31 08:30:37.973 5757 WARNING keystonemiddleware.auth_token [-] Authorization failed for token
2016-01-31 08:30:37.973 5757 INFO nova.osapi_compute.wsgi.server [-] 192.168.101.196 "GET /v2/fb2c259140e4405aad4a3867afc6f95f/servers/detail HTTP/1.1" status: 401 len: 290 time: 0.1090090
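For reference, the setting in question lives under [keystone_authtoken] in nova.conf; a minimal sketch of the relevant block (values illustrative):

[keystone_authtoken]
admin_user = nova
admin_password = <nova service password>
admin_tenant_name = services

The "Could not find project: service" 401 above is keystone reporting that no project by that name exists, which is what you'd expect if the cloud was deployed with "services".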

Let me compare with the customer's logs:


/site-packages/keystoneclient/auth/identity/v2.py:76
2016-01-29 12:41:28.504 35696 DEBUG keystoneclient.session [-] Request returned failure status: 401 request /usr/lib/python2.7/site-packages/keystoneclient/session.py:396
2016-01-29 12:41:28.504 35696 WARNING keystonemiddleware.auth_token [-] Identity response: {"error": {"message": "Invalid user / password (Disable debug mode to suppress these details.)", "code": 401, "title": "Unauthorized"}}
2016-01-29 12:41:28.505 35696 WARNING keystonemiddleware.auth_token [-] Authorization failed for token
2016-01-29 12:41:28.505 35696 INFO nova.osapi_compute.wsgi.server [-] 10.133.162.9 "GET /v2/ef9184f8a90544c587036d90e9f7c362/servers/detail?limit=21&project_id=ef9184f8a90544c587036d90e9f7c362 HTTP/1.1" status: 401 len: 288 time: 0.1541209
2016-01-29 12:41:28.587 35649 WARNING keystonemiddleware.auth_token [-] Unable to find authentication token in headers
2016-01-29 12:41:28.588 35649 INFO nova.osapi_compute.wsgi.server [-] 10.133.162.9 "GET /v2/ef9184f8a90544c587036d90e9f7c362 HTTP/1.1" status: 401 len: 288 time: 0.0022380

Comment 10 Jon Thomas 2016-01-31 17:39:03 UTC
No dice on the services change. 

BTW, my controllers use "service" instead of "services" and it works fine.
# rpm -qa | grep keystone
python-keystoneclient-1.3.0-2.el7ost.noarch
openstack-keystone-2015.1.0-4.el7ost.noarch
python-keystonemiddleware-1.5.1-1.el7ost.noarch
python-keystone-2015.1.0-4.el7ost.noarch

Comment 13 Mike Burns 2016-04-07 21:07:13 UTC
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 15 Jaromir Coufal 2016-10-05 19:34:05 UTC
Seems outdated; please re-open if the concern is still valid.