Description of problem:
After successful OSP9 to OSP10 upgrade on a 3 control 1 compute dev environment, the post upgrade pingtest fails with

ResourceInError: resources.volume1: Went to status error due to "Unknown" | CREATE_FAILED

Trying to trace that volume event on the controller - a possibly related error (but also looks like the volume is created from the messages below):

Oct 19 11:41:26 overcloud-controller-0 cinder-volume: 2016-10-19 11:41:26.638 27562 ERROR cinder.service [-] Manager for service cinder-volume hostgroup@tripleo_iscsi is reporting problems, not sending heartbeat. Service will appear "down".
Oct 19 11:41:26 overcloud-controller-0.localdomain cinder-api[2000]: 2016-10-19 11:41:26.983 26969 INFO cinder.api.v3.volumes [req-0282589c-8db5-488f-9f03-eb0baecdd7ba 7e2dd3ae2fb945818a7c3a26d1936ac0 859254da36fd430e9cdbd5c0b4209eb4 - default default] Create volume of 1 GB
Oct 19 11:41:27 overcloud-controller-0.localdomain cinder-api[2000]: 2016-10-19 11:41:27.116 26969 INFO cinder.volume.api [req-0282589c-8db5-488f-9f03-eb0baecdd7ba 7e2dd3ae2fb945818a7c3a26d1936ac0 859254da36fd430e9cdbd5c0b4209eb4 - default default] Availability Zones retrieved successfully.
Oct 19 11:41:27 overcloud-controller-0.localdomain cinder-api[2000]: 2016-10-19 11:41:27.938 26969 INFO cinder.volume.api [req-0282589c-8db5-488f-9f03-eb0baecdd7ba 7e2dd3ae2fb945818a7c3a26d1936ac0 859254da36fd430e9cdbd5c0b4209eb4 - default default] Volume created successfully.
Oct 19 11:41:27 overcloud-controller-0.localdomain cinder-api[2000]: 2016-10-19 11:41:27.939 26969 INFO cinder.api.openstack.wsgi [req-0282589c-8db5-488f-9f03-eb0baecdd7ba 7e2dd3ae2fb945818a7c3a26d1936ac0 859254da36fd430e9cdbd5c0b4209eb4 - default default] http://10.0.0.4:8776/v3/859254da36fd430e9cdbd5c0b4209eb4/volumes returned with HTTP 202
Oct 19 11:41:27 overcloud-controller-0.localdomain cinder-api[2000]: 2016-10-19 11:41:27.939 26969 INFO eventlet.wsgi.server [req-0282589c-8db5-488f-9f03-eb0baecdd7ba 7e2dd3ae2fb945818a7c3a26d1936ac0 859254da36fd430e9cdbd5c0b4209eb4 - default default] 172.16.2.6 "POST /v3/859254da36fd430e9cdbd5c0b4209eb4/volumes HTTP/1.1" status: 202 len: 1103 time: 0.9616349
Oct 19 11:50:07 overcloud-controller-0.localdomain cinder-volume[27398]: 2016-10-19 11:50:07.098 27562 ERROR cinder.service [-] Manager for service cinder-volume hostgroup@tripleo_iscsi is reporting problems, not sending heartbeat. Service will appear "down".

I'll attach the output of the pingtest rather than paste here but creating the bugzilla for now and will update once we have more info.

To be clear, I upgraded the environment (including the aodh post-upgrade migration) and then rebooted all nodes (because 7.3) before running the pingtest.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Deploy OSP9
2. Upgrade to OSP10 like at https://gitlab.cee.redhat.com/sathlang/ospd-9-to-10-upgrade/blob/master/README.md
3. Reboot nodes because 7.3
4. pingtest

Actual results:
fail as above

Expected results:
not fail :/

Additional info:
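For anyone repeating the trace in the description, the following is a minimal Python sketch of one way to pull the cinder-volume heartbeat errors out of a journal excerpt and group the remaining messages by request ID. It is not part of the upgrade tooling: the names `LOG_RE`, `parse_line` and `summarize` are hypothetical, and the regular expression assumes the "timestamp pid LEVEL logger [context] message" layout seen in the lines quoted above.

```python
import re

# Assumed layout, taken from the lines quoted in the description:
# "2016-10-19 11:41:26.638 27562 ERROR cinder.service [-] Manager for ..."
LOG_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+) (?P<logger>\S+) "
    r"\[(?P<ctx>[^\]]*)\] (?P<msg>.*)"
)


def parse_line(line):
    """Return a dict for one log line, or None if it does not match."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    fields = record.pop("ctx").split()
    # The request ID, when present, is the first field of the context.
    first = fields[0] if fields else "-"
    record["request_id"] = first if first.startswith("req-") else None
    return record


def summarize(lines):
    """Collect heartbeat error timestamps and messages per request ID."""
    heartbeat_errors = []
    by_request = {}
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        if record["level"] == "ERROR" and "not sending heartbeat" in record["msg"]:
            heartbeat_errors.append(record["ts"])
        if record["request_id"]:
            by_request.setdefault(record["request_id"], []).append(record["msg"])
    return heartbeat_errors, by_request
```

Fed the excerpt above, this would list the two heartbeat errors separately from the req-0282589c-8db5-488f-9f03-eb0baecdd7ba messages, which makes it easier to see that the API accepted the create request while cinder-volume was reporting problems.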
Created attachment 1212155 [details] pingtest output
Created attachment 1216574 [details] pingtest_output after controllers upgraded
Update: the description is slightly inaccurate because I filed the BZ some time after the failure first occurred. The description says that after converge I rebooted and then ran the pingtest. That is accurate; however, the pingtest issue first appears right after the controllers are upgraded successfully. So after UPDATE_COMPLETE, run the pingtest and it fails as in the attachment.
Created attachment 1216577 [details] relevant journal messages from controller0
Some more poking today. I discovered that the swift services were down after the controllers were upgraded. We have a change in the hiera data that we use to determine which swift services to bring up. I opened a review at https://review.openstack.org/#/c/392680/ but it doesn't fully fix the problem (it gets further, but still fails; attaching a new log for this run).
Created attachment 1216586 [details] pingtest after fixing swift, see comment #5
After some more poking today I suspect this may be related to overcloud password issues. Once the swift services are back up, and as you can see in the attachment from comment #6, the overcloud heat stack create fails on authorization. Looking at the controller0 logs, I see this in the heat-engine log:

2016-11-03 10:34:44.513 7598 ERROR heat.engine.clients.keystoneclient [req-b4b24448-26c0-4618-9def-1edcc23eeb76 a3d2c3c619db4433a2da763bf966d7a3 f692f5e0499545028b7a0235d7480139 - - -] Domain admin client authentication failed

and this in keystone.log:

2016-11-03 10:34:44.510 11829 WARNING keystone.auth.plugins.core [req-dd2b7ce4-56d1-48e7-ad3d-99b86f2dda5a - - - - -] Could not find domain: Default
2016-11-03 10:34:44.511 11829 WARNING keystone.common.wsgi [req-dd2b7ce4-56d1-48e7-ad3d-99b86f2dda5a - - - - -] Authorization failed. The request you have made requires authentication. from 192.0.2.14

I am going to reset the environment and include the fix from BZ 1388930 (which is about the overcloud password changing) as well as the fix for the swift services, and see if it still reproduces.
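The heat-engine error and the keystone warnings quoted in this comment were matched up by timestamp (they are only a few milliseconds apart). A minimal Python sketch of that kind of time-window correlation follows; the helper names `timestamp` and `correlate` and the one second default window are assumptions for illustration, not anything shipped with heat or keystone.

```python
import re
from datetime import datetime, timedelta

# Timestamp layout as it appears in the quoted log lines.
TS_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+")
TS_FMT = "%Y-%m-%d %H:%M:%S.%f"


def timestamp(line):
    """Return the first timestamp in a log line, or None if there is none."""
    match = TS_RE.search(line)
    return datetime.strptime(match.group(0), TS_FMT) if match else None


def correlate(anchor_line, other_lines, window=timedelta(seconds=1)):
    """Return the lines from other_lines logged within `window` of anchor_line."""
    anchor = timestamp(anchor_line)
    if anchor is None:
        return []
    hits = []
    for line in other_lines:
        ts = timestamp(line)
        if ts is not None and abs(ts - anchor) <= window:
            hits.append(line)
    return hits
```

Using the heat-engine line as the anchor and keystone.log as the other input would return the two keystone warnings above; a wider window may be needed if the clocks on the nodes are not in sync.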
Today I included the fixup for the overcloudrc issue (BZ 1388930) but got the same result. After the controller upgrade (and with the swift services now running) the pingtest fails exactly as in the attachment from comment #6. I'll also attach some more logs, but it seems to be an issue with heat<-->keystone and the admin domain.
Created attachment 1217111 [details] sanity check credentials are fixed with https://review.openstack.org/#/c/392593/
Created attachment 1217112 [details] quite a bit of heat-engine.log scroll to end for domain admin auth failure
Thanks to shardy, I got a possible lead on the heat domain auth failure described in previous comments. It may be related to BZ 1388474.
So I'm going to use this bug to fix up the swift services not being started, as in comment #5 and the gerrit review linked above. It has merged to both master and newton at https://review.openstack.org/#/c/393760/ so moving to POST. I did *not* get a chance to continue debugging the heat domain issue from comment #7, but we should file a new BZ for that.
just pointing at the newton review rather than master
Verified with openstack-tripleo-heat-templates-5.1.0-3.el7ost.noarch. As part of QE verification, we've added a ping test of the overcloud workload to our automation and verified that it's reachable in between each of the upgrade steps.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html