1386719 – OSP9 to OSP10 upgrade pingtest fails.

Summary: OSP9 to OSP10 upgrade pingtest fails.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-heat-templates
Sub Component:
Version:	10.0 (Newton)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	10.0 (Newton)
Assignee:	Marios Andreou
QA Contact:	Omri Hochman
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1337794

Reported:	2016-10-19 13:55 UTC by Marios Andreou
Modified:	2016-12-29 16:56 UTC
CC List:	7 users
Fixed In Version:	openstack-tripleo-heat-templates-5.0.0-1.3.el7ost
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-12-14 16:22:45 UTC
Target Upstream Version:
Embargoed:

Attachments
pingtest output (39.10 KB, text/plain) 2016-10-19 13:56 UTC, Marios Andreou	no flags
pingtest_output after controllers upgraded (1.25 KB, text/plain) 2016-11-02 13:58 UTC, Marios Andreou	no flags
relevant journal messages from controller0 (6.91 KB, text/plain) 2016-11-02 14:02 UTC, Marios Andreou	no flags
pingtest after fixing swift, see comment #5 (8.41 KB, text/plain) 2016-11-02 14:28 UTC, Marios Andreou	no flags
sanity check credentials are fixed with https://review.openstack.org/#/c/392593/ (12.12 KB, text/plain) 2016-11-03 17:53 UTC, Marios Andreou	no flags
quite a bit of heat-engine.log scroll to end for domain admin auth failure (53.23 KB, text/plain) 2016-11-03 17:55 UTC, Marios Andreou	no flags

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1638821	None	None	None	2016-11-04 15:00:25 UTC
OpenStack gerrit	393760	None	MERGED	Fixup the start of swift services	2020-02-20 20:04:15 UTC
Red Hat Product Errata	RHEA-2016:2948	normal	SHIPPED_LIVE	Red Hat OpenStack Platform 10 enhancement update	2016-12-14 19:55:27 UTC

Description Marios Andreou 2016-10-19 13:55:14 UTC

Description of problem:


After a successful OSP9 to OSP10 upgrade on a dev environment with 3 controllers and 1 compute node, the post-upgrade pingtest fails with:

        ResourceInError: resources.volume1: Went to status error due to "Unknown"                         | CREATE_FAILED


Tracing that volume event on the controller turns up a possibly related error, although the messages below also suggest that the volume was created:

        Oct 19 11:41:26 overcloud-controller-0 cinder-volume: 2016-10-19 11:41:26.638 27562 ERROR cinder.service [-] Manager for service cinder-volume hostgroup@tripleo_iscsi is reporting problems, not sending heartbeat. Service will appear "down".
        Oct 19 11:41:26 overcloud-controller-0.localdomain cinder-api[2000]: 2016-10-19 11:41:26.983 26969 INFO cinder.api.v3.volumes [req-0282589c-8db5-488f-9f03-eb0baecdd7ba 7e2dd3ae2fb945818a7c3a26d1936ac0 859254da36fd430e9cdbd5c0b4209eb4 - default default] Create volume of 1 GB
        Oct 19 11:41:27 overcloud-controller-0.localdomain cinder-api[2000]: 2016-10-19 11:41:27.116 26969 INFO cinder.volume.api [req-0282589c-8db5-488f-9f03-eb0baecdd7ba 7e2dd3ae2fb945818a7c3a26d1936ac0 859254da36fd430e9cdbd5c0b4209eb4 - default default] Availability Zones retrieved successfully.
        Oct 19 11:41:27 overcloud-controller-0.localdomain cinder-api[2000]: 2016-10-19 11:41:27.938 26969 INFO cinder.volume.api [req-0282589c-8db5-488f-9f03-eb0baecdd7ba 7e2dd3ae2fb945818a7c3a26d1936ac0 859254da36fd430e9cdbd5c0b4209eb4 - default default] Volume created successfully.
        Oct 19 11:41:27 overcloud-controller-0.localdomain cinder-api[2000]: 2016-10-19 11:41:27.939 26969 INFO cinder.api.openstack.wsgi [req-0282589c-8db5-488f-9f03-eb0baecdd7ba 7e2dd3ae2fb945818a7c3a26d1936ac0 859254da36fd430e9cdbd5c0b4209eb4 - default default] http://10.0.0.4:8776/v3/859254da36fd430e9cdbd5c0b4209eb4/volumes returned with HTTP 202
        Oct 19 11:41:27 overcloud-controller-0.localdomain cinder-api[2000]: 2016-10-19 11:41:27.939 26969 INFO eventlet.wsgi.server [req-0282589c-8db5-488f-9f03-eb0baecdd7ba 7e2dd3ae2fb945818a7c3a26d1936ac0 859254da36fd430e9cdbd5c0b4209eb4 - default default] 172.16.2.6 "POST /v3/859254da36fd430e9cdbd5c0b4209eb4/volumes HTTP/1.1" status: 202  len: 1103 time: 0.9616349

        Oct 19 11:50:07 overcloud-controller-0.localdomain cinder-volume[27398]: 2016-10-19 11:50:07.098 27562 ERROR cinder.service [-] Manager for service cinder-volume hostgroup@tripleo_iscsi is reporting problems, not sending heartbeat. Service will appear "down".
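
To check whether the volume failure lines up with the cinder-volume heartbeat problem, the journal excerpt above can be scanned mechanically. The following is a minimal sketch and not part of the original report: the helper name `summarize_cinder_journal` is hypothetical, and it matches only on the two message fragments quoted above.

```python
import re

# Message fragments copied from the journal excerpt in this report.
HEARTBEAT_ERROR = "is reporting problems, not sending heartbeat"
VOLUME_CREATED = "Volume created successfully"

# Service timestamp embedded in each journal line, e.g. 2016-10-19 11:41:26.638
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+")


def summarize_cinder_journal(lines):
    """Return timestamps of heartbeat errors and volume-created messages.

    Hypothetical helper for illustration; `lines` is any iterable of
    journal lines such as the ones pasted above.
    """
    summary = {"heartbeat_errors": [], "volumes_created": []}
    for line in lines:
        match = TIMESTAMP_RE.search(line)
        stamp = match.group(0) if match else None
        if HEARTBEAT_ERROR in line:
            summary["heartbeat_errors"].append(stamp)
        elif VOLUME_CREATED in line:
            summary["volumes_created"].append(stamp)
    return summary
```

Feeding it the lines above would show heartbeat errors on both sides of the "Volume created successfully" message, which is consistent with cinder-api accepting the request while cinder-volume appears down. It does not prove that this is the cause of the CREATE_FAILED.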

I'll attach the output of the pingtest rather than paste it here. I am creating the bugzilla now and will update it once we have more info.

To be clear, I upgraded the environment (including the aodh post-upgrade migration) and then rebooted all nodes (because 7.3) before running the pingtest.



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Deploy OSP9
2. Upgrade to OSP10 following https://gitlab.cee.redhat.com/sathlang/ospd-9-to-10-upgrade/blob/master/README.md
3. Reboot nodes because 7.3
4. Run the pingtest
 
Actual results:
 The pingtest fails as above.

Expected results:
 The pingtest does not fail.

Additional info:

Comment 1 Marios Andreou 2016-10-19 13:56:19 UTC

Created attachment 1212155
pingtest output

Comment 2 Marios Andreou 2016-11-02 13:58:22 UTC

Created attachment 1216574
pingtest_output after controllers upgraded

Comment 3 Marios Andreou 2016-11-02 13:59:55 UTC

Update: the description is slightly inaccurate because I filed the BZ some time after the issue first occurred. The description says that I rebooted after converge and then ran the pingtest. That is accurate; however, the pingtest failure first appears right after the controllers are upgraded successfully. So after UPDATE_COMPLETE, run the pingtest and it fails as in the attachment.

Comment 4 Marios Andreou 2016-11-02 14:02:25 UTC

Created attachment 1216577
relevant journal messages from controller0

Comment 5 Marios Andreou 2016-11-02 14:26:58 UTC

Some more poking today: I discovered that the swift services were down after the controllers were upgraded. There is a change in the hiera data that we use to determine which swift services to bring up.

I opened a review at https://review.openstack.org/#/c/392680/ but it doesn't fully fix the problem (it gets further, but still fails; attaching a new log for this run).
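
For anyone reproducing this, one way to confirm which swift services are down on a controller is to ask systemd for each unit's state. The sketch below is an illustration only: the unit names are an assumption (the usual RDO packaging names, not taken from this report) and the helper names are hypothetical. Adjust the list to whatever the hiera data says should run on the node.

```python
import subprocess

# ASSUMPTION: typical RDO/OSP swift unit names; not taken from this bug report.
SWIFT_UNITS = [
    "openstack-swift-proxy",
    "openstack-swift-account",
    "openstack-swift-container",
    "openstack-swift-object",
]


def unit_state(unit):
    """Return the state string printed by `systemctl is-active <unit>`."""
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    return result.stdout.strip()


def units_not_active(states):
    """Given {unit: state}, return the sorted units whose state is not 'active'."""
    return sorted(unit for unit, state in states.items() if state != "active")


def report_down_swift_units(units=SWIFT_UNITS):
    """Print every swift unit in `units` that systemd does not report as active."""
    states = {unit: unit_state(unit) for unit in units}
    for unit in units_not_active(states):
        print("{} is {}".format(unit, states[unit]))
```

Calling `report_down_swift_units()` on a controller right after UPDATE_COMPLETE should print nothing once the swift services are started correctly.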

Comment 6 Marios Andreou 2016-11-02 14:28:09 UTC

Created attachment 1216586
pingtest after fixing swift, see comment #5

Comment 7 Marios Andreou 2016-11-03 11:21:13 UTC

After some more poking today, I suspect this may be related to overcloud password issues. Once the swift services are back up, the overcloud heat stack create fails on authorization, as you can see in the attachment from comment #6. Looking at the controller0 logs, I see this in the heat-engine log:


2016-11-03 10:34:44.513 7598 ERROR heat.engine.clients.keystoneclient [req-b4b24448-26c0-4618-9def-1edcc23eeb76 a3d2c3c619db4433a2da763bf966d7a3 f692f5e0499545028b7a0235d7480139 - - -] Domain admin client authentication failed

and from keystone.log:

2016-11-03 10:34:44.510 11829 WARNING keystone.auth.plugins.core [req-dd2b7ce4-56d1-48e7-ad3d-99b86f2dda5a - - - - -] Could not find domain: Default
2016-11-03 10:34:44.511 11829 WARNING keystone.common.wsgi [req-dd2b7ce4-56d1-48e7-ad3d-99b86f2dda5a - - - - -] Authorization failed. The request you have made requires authentication. from 192.0.2.14
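
The heat-engine error and the keystone warnings above are a few milliseconds apart, which is what suggests they belong to the same request. A minimal sketch of that correlation follows, assuming the line layout seen in the excerpts above (timestamp, pid, level, logger, bracketed request context, message). The helper names and the 50 ms window are hypothetical choices for illustration, and a timestamp match is only a hint, not proof that the entries are related.

```python
import re
from datetime import datetime, timedelta

# Layout of the log lines quoted above:
# timestamp, pid, level, logger, [request context], message.
LOG_LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+) (?P<logger>\S+) "
    r"\[(?P<context>[^\]]*)\] (?P<message>.*)$"
)


def parse_log_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y-%m-%d %H:%M:%S.%f")
    return entry


def keystone_entries_near(heat_entry, keystone_entries, window_ms=50):
    """Return keystone entries logged within window_ms of the heat entry."""
    window = timedelta(milliseconds=window_ms)
    return [
        entry
        for entry in keystone_entries
        if abs(entry["ts"] - heat_entry["ts"]) <= window
    ]
```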


I am going to reset the environment and include the fix from BZ 1388930 (which is about the overcloud password changing) as well as the fix for the swift services and see if it reproduces then.

Comment 8 Marios Andreou 2016-11-03 17:36:55 UTC

Today I included the fixup for the overcloudrc issue (BZ 1388930) but got the same result. After the controller upgrade (and with swift services now running) the pingtest fails exactly as in the attachment from comment #6.


I'll also attach some more logs, but this seems to be an issue between heat and keystone involving the admin domain.

Comment 9 Marios Andreou 2016-11-03 17:53:49 UTC

Created attachment 1217111
sanity check credentials are fixed with https://review.openstack.org/#/c/392593/

Comment 10 Marios Andreou 2016-11-03 17:55:26 UTC

Created attachment 1217112
quite a bit of heat-engine.log scroll to end for domain admin auth failure

Comment 11 Marios Andreou 2016-11-04 09:00:29 UTC

Thanks to shardy, I got a possible lead on the heat domain auth failure described in the previous comments. It may be related to BZ 1388474.

Comment 12 Marios Andreou 2016-11-04 15:00:25 UTC

So I am going to use this bug to track the fix for the swift services not being started, as in comment #5 and the gerrit review linked above. The fix has merged to both master and newton at https://review.openstack.org/#/c/393760/ so moving to POST.

I did *not* get a chance to continue debugging the heat domain issue from comment #7, but we should file a new BZ for that.

Comment 13 Marios Andreou 2016-11-04 17:39:59 UTC

just pointing at the newton review rather than master

Comment 15 Omri Hochman 2016-11-21 22:27:29 UTC

Verified with openstack-tripleo-heat-templates-5.1.0-3.el7ost.noarch 

As part of QE verification, we added to our automation a ping test against the overcloud workload and verified that it is reachable between each of the upgrade steps.
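
A reachability check between upgrade steps could look roughly like the following. This is a sketch under stated assumptions, not the actual QE automation: the helper names are hypothetical, the steps are passed in as plain callables, and the probe uses the Linux iputils `ping` options `-c` (probe count) and `-W` (per-reply timeout in seconds).

```python
import subprocess


def build_ping_command(address, count=3, timeout_s=2):
    """Build a Linux iputils ping command: `count` probes, `timeout_s` wait per reply."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), address]


def is_reachable(address):
    """Return True if ping exits with status 0, i.e. a reply was received."""
    result = subprocess.run(
        build_ping_command(address),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def run_steps_with_ping_check(steps, address, probe=is_reachable):
    """Run each (name, callable) step in order and probe the workload after it.

    Returns the names of the steps after which the workload was unreachable.
    """
    unreachable_after = []
    for name, step in steps:
        step()
        if not probe(address):
            unreachable_after.append(name)
    return unreachable_after
```

Passing the probe as a parameter keeps the loop testable without a live overcloud; in a real run the default `is_reachable` would be used with the address of the workload instance.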

Comment 18 errata-xmlrpc 2016-12-14 16:22:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html
