Bug 1386719

Summary: OSP9 to OSP10 upgrade pingtest fails.
Product: Red Hat OpenStack
Reporter: Marios Andreou <mandreou>
Component: openstack-tripleo-heat-templates
Assignee: Marios Andreou <mandreou>
Status: CLOSED ERRATA
QA Contact: Omri Hochman <ohochman>
Severity: high
Docs Contact:
Priority: high
Version: 10.0 (Newton)
CC: dbecker, jcoufal, jjoyce, jschluet, mburns, morazi, rhel-osp-director-maint
Target Milestone: rc
Keywords: Triaged
Target Release: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-5.0.0-1.3.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-12-14 16:22:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1337794
Attachments:
Description | Flags
pingtest output | none
pingtest_output after controllers upgraded | none
relevant journal messages from controller0 | none
pingtest after fixing swift, see comment #5 | none
sanity check credentials are fixed with https://review.openstack.org/#/c/392593/ | none
quite a bit of heat-engine.log scroll to end for domain admin auth failure | none

Description Marios Andreou 2016-10-19 13:55:14 UTC
Description of problem:


After a successful OSP9 to OSP10 upgrade on a dev environment with 3 controllers and 1 compute node, the post-upgrade pingtest fails with:

        ResourceInError: resources.volume1: Went to status error due to "Unknown"                         | CREATE_FAILED


Tracing that volume event on the controller turns up a possibly related error, although the messages below also suggest that the volume was created:

        Oct 19 11:41:26 overcloud-controller-0 cinder-volume: 2016-10-19 11:41:26.638 27562 ERROR cinder.service [-] Manager for service cinder-volume hostgroup@tripleo_iscsi is reporting problems, not sending heartbeat. Service will appear "down".
        Oct 19 11:41:26 overcloud-controller-0.localdomain cinder-api[2000]: 2016-10-19 11:41:26.983 26969 INFO cinder.api.v3.volumes [req-0282589c-8db5-488f-9f03-eb0baecdd7ba 7e2dd3ae2fb945818a7c3a26d1936ac0 859254da36fd430e9cdbd5c0b4209eb4 - default default] Create volume of 1 GB
        Oct 19 11:41:27 overcloud-controller-0.localdomain cinder-api[2000]: 2016-10-19 11:41:27.116 26969 INFO cinder.volume.api [req-0282589c-8db5-488f-9f03-eb0baecdd7ba 7e2dd3ae2fb945818a7c3a26d1936ac0 859254da36fd430e9cdbd5c0b4209eb4 - default default] Availability Zones retrieved successfully.
        Oct 19 11:41:27 overcloud-controller-0.localdomain cinder-api[2000]: 2016-10-19 11:41:27.938 26969 INFO cinder.volume.api [req-0282589c-8db5-488f-9f03-eb0baecdd7ba 7e2dd3ae2fb945818a7c3a26d1936ac0 859254da36fd430e9cdbd5c0b4209eb4 - default default] Volume created successfully.
        Oct 19 11:41:27 overcloud-controller-0.localdomain cinder-api[2000]: 2016-10-19 11:41:27.939 26969 INFO cinder.api.openstack.wsgi [req-0282589c-8db5-488f-9f03-eb0baecdd7ba 7e2dd3ae2fb945818a7c3a26d1936ac0 859254da36fd430e9cdbd5c0b4209eb4 - default default] http://10.0.0.4:8776/v3/859254da36fd430e9cdbd5c0b4209eb4/volumes returned with HTTP 202
        Oct 19 11:41:27 overcloud-controller-0.localdomain cinder-api[2000]: 2016-10-19 11:41:27.939 26969 INFO eventlet.wsgi.server [req-0282589c-8db5-488f-9f03-eb0baecdd7ba 7e2dd3ae2fb945818a7c3a26d1936ac0 859254da36fd430e9cdbd5c0b4209eb4 - default default] 172.16.2.6 "POST /v3/859254da36fd430e9cdbd5c0b4209eb4/volumes HTTP/1.1" status: 202  len: 1103 time: 0.9616349

        Oct 19 11:50:07 overcloud-controller-0.localdomain cinder-volume[27398]: 2016-10-19 11:50:07.098 27562 ERROR cinder.service [-] Manager for service cinder-volume hostgroup@tripleo_iscsi is reporting problems, not sending heartbeat. Service will appear "down".
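
One way to pull these heartbeat errors out of a larger journal dump is sketched below. It is a minimal sketch, not part of the original report: the regular expression is derived only from the two cinder.service ERROR lines quoted above, and the names HEARTBEAT_RE and find_heartbeat_failures are hypothetical.

```python
import re

# Pattern derived from the two journal lines quoted in this bug only;
# it is an assumption about the message layout, not taken from cinder's source.
HEARTBEAT_RE = re.compile(
    r"^(?P<stamp>\w{3} +\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) .*"
    r"Manager for service (?P<service>\S+) (?P<backend>\S+) "
    r"is reporting problems, not sending heartbeat"
)


def find_heartbeat_failures(lines):
    """Return (timestamp, host, service, backend) for each matching journal line."""
    hits = []
    for line in lines:
        match = HEARTBEAT_RE.search(line.strip())
        if match:
            hits.append(
                (match["stamp"], match["host"], match["service"], match["backend"])
            )
    return hits
```

Feeding it the lines of the attached journal output (for example `find_heartbeat_failures(open(path))`, with `path` pointing at a saved copy) would list when and for which backend the service stopped sending heartbeats.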

I'll attach the pingtest output rather than paste it here. I'm creating the bugzilla now and will update it once we have more info.

To be clear, I upgraded the environment (including the aodh post-upgrade migration) and then rebooted all nodes (because 7.3) before running the pingtest.



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Deploy OSP9
2. Upgrade to OSP10 following https://gitlab.cee.redhat.com/sathlang/ospd-9-to-10-upgrade/blob/master/README.md
3. Reboot nodes because 7.3
4. Run the pingtest
 
Actual results:
 fail as above

Expected results:
 not fail :/

Additional info:

Comment 1 Marios Andreou 2016-10-19 13:56:19 UTC
Created attachment 1212155 [details]
pingtest output

Comment 2 Marios Andreou 2016-11-02 13:58:22 UTC
Created attachment 1216574 [details]
pingtest_output after controllers upgraded

Comment 3 Marios Andreou 2016-11-02 13:59:55 UTC
Update: the description is slightly inaccurate because I filed the BZ some time after the problem actually occurred. The description says that after converge I rebooted and then ran the pingtest. That is accurate; however, the pingtest failure first appears once the controllers have been upgraded successfully. So after UPDATE_COMPLETE, run the pingtest and it fails as in the attachment.

Comment 4 Marios Andreou 2016-11-02 14:02:25 UTC
Created attachment 1216577 [details]
relevant journal messages from controller0

Comment 5 Marios Andreou 2016-11-02 14:26:58 UTC
Some more poking today: I discovered that the swift services were down after the controllers were upgraded. There is a change in the hiera data that we use to determine which swift services to bring up.

I opened a review at https://review.openstack.org/#/c/392680/ but it doesn't fully fix the problem (the pingtest gets further but still fails; attaching a new log for this run).
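
A quick check for swift services that are down on a controller could look like the sketch below. It is only an illustration under stated assumptions: the unit names in SWIFT_UNITS are assumed (typical RDO/OSP packaging names, not taken from this bug), and the helper names are hypothetical. `systemctl is-active --quiet` exits with status 0 only when the unit is active.

```python
import subprocess

# Assumed unit names; adjust to the swift units actually present on the node.
SWIFT_UNITS = [
    "openstack-swift-proxy",
    "openstack-swift-account",
    "openstack-swift-container",
    "openstack-swift-object",
]


def systemctl_is_active(unit):
    """Return True if `systemctl is-active --quiet UNIT` exits with status 0."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", unit])
    return result.returncode == 0


def inactive_units(units, is_active=systemctl_is_active):
    """Return the subset of units that the is_active check reports as not running."""
    return [unit for unit in units if not is_active(unit)]
```

Running `inactive_units(SWIFT_UNITS)` on a controller after the upgrade step would return the units that did not come back up. The `is_active` parameter is injectable so the logic can be exercised without systemd.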

Comment 6 Marios Andreou 2016-11-02 14:28:09 UTC
Created attachment 1216586 [details]
pingtest after fixing swift, see comment #5

Comment 7 Marios Andreou 2016-11-03 11:21:13 UTC
After some more poking today I suspect this may be related to overcloud password issues. Once the swift services are back up, the overcloud heat stack create fails on authorization, as you can see in the attachment from comment #6. Looking at the controller0 logs, I see this in the heat-engine log:


2016-11-03 10:34:44.513 7598 ERROR heat.engine.clients.keystoneclient [req-b4b24448-26c0-4618-9def-1edcc23eeb76 a3d2c3c619db4433a2da763bf966d7a3 f692f5e0499545028b7a0235d7480139 - - -] Domain admin client authentication failed

and from keystone.log:

2016-11-03 10:34:44.510 11829 WARNING keystone.auth.plugins.core [req-dd2b7ce4-56d1-48e7-ad3d-99b86f2dda5a - - - - -] Could not find domain: Default
2016-11-03 10:34:44.511 11829 WARNING keystone.common.wsgi [req-dd2b7ce4-56d1-48e7-ad3d-99b86f2dda5a - - - - -] Authorization failed. The request you have made requires authentication. from 192.0.2.14
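
The heat error and the keystone warnings above were matched up by their timestamps. A minimal sketch of doing that mechanically follows; it assumes every line starts with a 23-character 'YYYY-MM-DD HH:MM:SS.mmm' timestamp as in the excerpts, and the function names and the 50 ms default window are hypothetical choices, not anything heat or keystone provide.

```python
from datetime import datetime, timedelta


def parse_stamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")


def nearby(anchor_line, candidate_lines, window_ms=50):
    """Return the candidate lines logged within window_ms of anchor_line."""
    anchor = parse_stamp(anchor_line)
    window = timedelta(milliseconds=window_ms)
    return [
        line for line in candidate_lines
        if abs(parse_stamp(line) - anchor) <= window
    ]
```

With the heat-engine ERROR line as the anchor and keystone.log as the candidates, this returns the keystone entries logged around the same moment, such as the "Could not find domain: Default" warning a few milliseconds earlier.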


I am going to reset the environment and include the fix from BZ 1388930 (which is about the overcloud password changing) as well as the fix for the swift services and see if it reproduces then.

Comment 8 Marios Andreou 2016-11-03 17:36:55 UTC
Today I included the fixup for the overcloudrc issue (BZ 1388930) but have the same result. After controller upgrade (and with swift services now running) the pingtest fails exactly as attached from comment #6.


I'll also attach some more logs, but this seems to be an issue between heat and keystone involving the admin domain.

Comment 9 Marios Andreou 2016-11-03 17:53:49 UTC
Created attachment 1217111 [details]
sanity check credentials are fixed with https://review.openstack.org/#/c/392593/

Comment 10 Marios Andreou 2016-11-03 17:55:26 UTC
Created attachment 1217112 [details]
quite a bit of heat-engine.log scroll to end for domain admin auth failure

Comment 11 Marios Andreou 2016-11-04 09:00:29 UTC
Thanks to shardy I have a possible lead on the heat domain auth failure described in the previous comments: it may be related to BZ 1388474.

Comment 12 Marios Andreou 2016-11-04 15:00:25 UTC
So I am going to use this bug to fix up the swift services not being started, as described in comment #5 and the gerrit review linked there. The fix has merged to both master and newton at https://review.openstack.org/#/c/393760/ so moving to POST.

I did *not* get a chance to continue debugging the heat domain issue from comment #7; we should file a new BZ for that.

Comment 13 Marios Andreou 2016-11-04 17:39:59 UTC
Just pointing at the newton review rather than master.

Comment 15 Omri Hochman 2016-11-21 22:27:29 UTC
Verified with openstack-tripleo-heat-templates-5.1.0-3.el7ost.noarch 

As part of QE verification we added a ping test against the overcloud workload to our automation and verified that the workload is reachable between each of the upgrade steps.
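
The idea of checking reachability between upgrade steps can be sketched as below. This is not the QE automation itself, only an illustration: ping_once and run_steps_with_check are hypothetical names, the steps are placeholder callables, and the ping flags assume Linux iputils (`-c` for count, `-W` for the reply timeout in seconds).

```python
import subprocess


def ping_once(address, timeout_s=2):
    """Return True if a single ICMP echo to address succeeds (Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def run_steps_with_check(steps, address, check=ping_once):
    """Run each (name, callable) step in order and check reachability after it.

    Returns the names of the steps after which the check failed.
    """
    unreachable_after = []
    for name, step in steps:
        step()
        if not check(address):
            unreachable_after.append(name)
    return unreachable_after
```

An empty return value means the workload address answered after every step; any names returned identify the upgrade step after which it stopped responding.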

Comment 18 errata-xmlrpc 2016-12-14 16:22:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html