Bug 1473800
| Summary: | osp 9 -> 10 upgrade failure, HTTP/500 from CFN signal, Expected exception during message handling from heat-engine | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Matt Flusche <mflusche> |
| Component: | instack-undercloud | Assignee: | Sofer Athlan-Guyot <sathlang> |
| Status: | CLOSED NOTABUG | QA Contact: | Arik Chernetsky <achernet> |
| Severity: | high | Priority: | high |
| Version: | 10.0 (Newton) | Keywords: | Triaged |
| Target Milestone: | --- | Target Release: | --- |
| Hardware: | x86_64 | OS: | Linux |
| CC: | aschultz, augol, jack.ottofaro, mburns, mflusche, ramishra, rhel-osp-director-maint, sathlang, sbaker, shardy, srevivo, therve, zbitter | | |
| Last Closed: | 2017-08-07 13:33:21 UTC | Type: | Bug |
Description (Matt Flusche, 2017-07-21 17:18:14 UTC)
I see a couple of errors like this in the engine logs:

```
2017-07-18 22:29:17.350 6980 INFO heat.engine.stack [req-b5f96e47-b167-49ad-886c-b4d5db918a02 65dd7dce219249f48a4fc92dc6173d52 502217386c5c4e60b03a6564e47d0e68 - - -] Stack UPDATE IN_PROGRESS (ocd00-Controller-thij7w32l6ay-2-dxtk5lywaziu-NodeTLSCAData-les7ykumlh56): Stack UPDATE started
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource [req-88e076f0-7ba5-457f-a5bb-ac647aa83bf4 - 502217386c5c4e60b03a6564e47d0e68 - - -] Resource type OS::Nova::Server unavailable
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource Traceback (most recent call last):
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 184, in _validate_service_availability
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     (svc_available, reason) = cls.is_service_available(context)
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 704, in is_service_available
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     service_name=cls.default_client_name)
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/clients/client_plugin.py", line 175, in does_endpoint_exist
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     endpoint_type=endpoint_type)
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/clients/client_plugin.py", line 99, in url_for
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     keystone_session = self.context.keystone_session
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/common/context.py", line 149, in keystone_session
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     if self.auth_needs_refresh():
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/common/context.py", line 162, in auth_needs_refresh
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     auth_ref = self.auth_plugin.get_auth_ref(self._keystone_session)
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/keystoneauth1/identity/generic/base.py", line 181, in get_auth_ref
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     return self._plugin.get_auth_ref(session, **kwargs)
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/keystoneauth1/identity/v3/base.py", line 167, in get_auth_ref
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     authenticated=False, log=False, **rkwargs)
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 681, in post
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     return self.request(url, 'POST', **kwargs)
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/positional/__init__.py", line 101, in inner
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     return wrapped(*args, **kwargs)
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 570, in request
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     raise exceptions.from_response(resp, method, url)
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource Forbidden: You are not authorized to perform the requested action. (HTTP 403) (Request-ID: req-624f399c-a171-4567-a648-7ef39126a227)
```

Those correspond to the first two 500 errors in the heat-api-cfn log, one in each of the two heat-engine processes that are running.

Here's my guess: the keystone token is timing out, and Heat is unable to refresh the token. (This will be fixed in OSP 12 by bug 1438015.) It fails once per engine with the traceback and then silently from then on.

If this is even remotely close to the mark, then a workaround is to extend the time-to-live of keystone tokens beyond the maximum time you expect an overcloud create/update to take.
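For reference, a minimal sketch of that token-TTL workaround on the undercloud. The 24-hour value is an arbitrary example, and the keystone service unit varies by release (standalone `openstack-keystone`, or `httpd` when keystone runs as a WSGI app):

```sh
# Raise the keystone token TTL (seconds) above the worst-case time an
# overcloud create/update is expected to take; 86400 = 24 hours.
# openstack-config is a crudini wrapper shipped in openstack-utils.
sudo openstack-config --set /etc/keystone/keystone.conf token expiration 86400

# Restart keystone so the new TTL takes effect; pick the unit that
# actually runs keystone on this undercloud.
sudo systemctl restart openstack-keystone 2>/dev/null || sudo systemctl restart httpd
```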
---

Hi Zane,

Thanks for looking at the issue and finding this token expiration. I'm trying to understand how the token timeout could have been exceeded in this instance. The keystone token TTL is 4 hours:

```
/etc/keystone/keystone.conf
expiration = 14400
```

heat-engine was restarted about an hour before the issue:

```
2017-07-18 21:29:06.278 6116 INFO oslo_service.service [-] Starting 2 workers
2017-07-18 21:29:06.282 6116 DEBUG oslo_service.service [-] Started child 6980 _start_child /usr/lib/python2.7/site-packages/oslo_service/service.py:500
2017-07-18 21:29:06.288 6116 DEBUG oslo_service.service [-] Started child 6981 _start_child /usr/lib/python2.7/site-packages/oslo_service/service.py:500
```

And the time between the heat stack update and the failure was about 4 minutes:

```
2017-07-18 22:25:36.361 6980 DEBUG heat.engine.scheduler [req-b5f96e47-b167-49ad-886c-b4d5db918a02 65dd7dce219249f48a4fc92dc6173d52 502217386c5c4e60b03a6564e47d0e68 - - -] Task Stack "ocd00" [99777d35-4fcb-4abd-94cc-89aa0631bb8d] Update starting start /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:184
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource [req-88e076f0-7ba5-457f-a5bb-ac647aa83bf4 - 502217386c5c4e60b03a6564e47d0e68 - - -] Resource type OS::Nova::Server unavailable
```

Is this exception raised while heat is refreshing its keystone token?

---

Can you check the roles of the user? That reminds me of the issue we had when _member_ was removed but was still present in the trust.

---

Thomas,

Would that be the user executing the stack update?

Matt
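The _member_-in-trust theory can be checked from the undercloud, since the trusts heat creates are visible through python-openstackclient. A minimal sketch, assuming admin credentials are loaded; the trust ID shown is a placeholder:

```sh
# List the trusts heat has created (one per stack when heat runs with
# deferred_auth_method=trusts, the default in these releases).
source ~/stackrc
openstack trust list

# Inspect one trust (placeholder ID). The roles it delegates must still
# be assigned to the trustor, or trust-scoped auth starts returning 403.
openstack trust show 1b5a0ec5b0ca44a68ee0b13010b4a742
```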
---

OK, so much for that theory.

It's crazy that we're calling _validate_service_availability() in the first place; I raised a bug upstream to stop doing that. Apparently Heat is trying to fetch the Keystone catalog to do that check and is being denied. We're not sure why, because there are no Keystone logs available. But the credentials it's using to handle the signal (which IIRC should be the trust stored in the database when creating the stack) are not being accepted. It's possible this could be due to a misconfiguration of Heat, but the config looks fairly normal to me. Rabi, any ideas?

---

Yes, admin in this case.

From the logs it seems some requests are successfully authenticated and some are failing randomly. Some signals fail with 500 (the error from heat-engine that Zane mentioned above) or simply 403, as below:

```
2017-07-18 13:03:35.776 1310 INFO eventlet.wsgi.server [-] 10.224.0.106 - - [18/Jul/2017 13:03:35] "POST /v1/signal/arn%3Aopenstack%3Aheat%3A%3A502217386c5c4e60b03a6564e47d0e68%3Astacks%2Focd00-ComputeNodesPostDeployment-wfnqgujqigck-ComputePuppetDeployment-n6umawhcz52z%2F850dbe92-831f-4b0c-bfa6-ca2407e584ed%2Fresources%2F2?Timestamp=2017-07-18T16%3A44%3A39Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=392a87c8d74442b5a9b916ca3f1596cc&SignatureVersion=2&Signature=MxFXyeYD6win%2BSNInRqYuuke75oGF2OoNYkBmEOl4kE%3D HTTP/1.1" 403 306 0.028369
```

There are a number of successful signals (200) in between, too. This does not look like a heat issue to me; most probably a keystone issue. We may need the keystone logs from the time of the failures.

---

sosreport has keystone logs from 2017-07-20. Here are the full keystone logs: http://collab-shell.usersys.redhat.com/01819909/keystone-logs-director.tar.gz/

And the associated keystone auth failure:

```
2017-07-18 22:29:16.793 3193 DEBUG keystone.middleware.auth [req-624f399c-a171-4567-a648-7ef39126a227 - - - - -] There is either no auth token in the request or the certificate issuer is not trusted. No auth context will be set. fill_context /usr/lib/python2.7/site-packages/keystone/middleware/auth.py:210
2017-07-18 22:29:16.795 3193 INFO keystone.common.wsgi [req-624f399c-a171-4567-a648-7ef39126a227 - - - - -] POST http://10.224.0.5:35357/v3/auth/tokens
2017-07-18 22:29:17.054 3193 DEBUG oslo_messaging._drivers.amqpdriver [req-624f399c-a171-4567-a648-7ef39126a227 - - - - -] CAST unique_id: aa2a2f36002046538d1e79e3915e9c58 NOTIFY exchange 'keystone' topic 'notifications.info' _send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:432
2017-07-18 22:29:17.409 3193 WARNING keystone.common.wsgi [req-624f399c-a171-4567-a648-7ef39126a227 - - - - -] You are not authorized to perform the requested action.
2017-07-18 22:29:25.323 3193 DEBUG keystone.middleware.auth [req-624f399c-a171-4567-a648-7ef39126a227 - - - - -] Authenticating user token process_request /usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py:346
```

---

From the logs it seems there are some 403s for ec2 credentials (used for signaling) where the credentials could not be found in the db around 2017-07-18 13:03:35:

```
2017-07-18 13:03:35.770 1535 WARNING keystone.common.wsgi [req-c8acb534-e9c2-465b-a0d1-2106209298f6 - - - - -] Could not find credential: 9ac0a18332453169749a4b9a95463d0ea75cc60dd477e295e331762b3cbe6520
```

Not sure why that started happening, though; it seems to affect only a few ec2 credentials. But then it looks like all services (heat, keystone, mariadb) were shut down and brought back up around 2017-07-18 17:11. I can also see some migration scripts running for keystone (was there an undercloud keystone upgrade in between?):

```
2017-07-18 17:12:08.227 16170 DEBUG migrate.versioning.repository [-] Repository /usr/lib/python2.7/site-packages/keystone/common/sql/migrate_repo loaded successfully __init__ /usr/lib/python2.7/site-packages/migrate/versioning/repository.py:82
2017-07-18 17:12:08.227 16170 DEBUG migrate.versioning.repository [-] Config: OrderedDict([('db_settings', OrderedDict([('__name__', 'db_settings'), ('repository_id', 'keystone'), ('version_table', 'migrate_version'), ('required_dbs', '[]'), ('use_timestamp_numbering', 'False')]))]) __init__ /usr/lib/python2.7/site-packages/migrate/versioning/repository.py:83
2017-07-18 17:12:08.232 16170 INFO migrate.versioning.api [-] 96 -> 97...
2017-07-18 17:12:08.245 16170 INFO migrate.versioning.api [-] done
2017-07-18 17:12:08.245 16170 INFO migrate.versioning.api [-] 97 -> 98...
2017-07-18 17:12:08.251 16170 INFO migrate.versioning.api [-] done
2017-07-18 17:12:08.251 16170 INFO migrate.versioning.api [-] 98 -> 99...
2017-07-18 17:12:08.256 16170 INFO migrate.versioning.api [-] done
```

Is it allowed to do an undercloud upgrade while an overcloud upgrade is going on? I guess that undercloud keystone upgrade did not go well: there are plenty of errors in the keystone logs after that, and it seems trust-based auth is not working.

---

I haven't got a response yet regarding the roles of the admin user. This looks like the known issue of roles being deleted during upgrade and breaking the trust created by Heat, but we need the roles to confirm.

---

Is this the role information needed? This was provided by the customer:

```
[stack@ospd00 log]$ openstack role assignment list --project admin --user admin --names
+-------+-------+---------+
| Role  | User  | Project |
+-------+-------+---------+
| admin | admin | admin   |
+-------+-------+---------+
```

---

Right, so I believe _member_ is still necessary. See https://bugzilla.redhat.com/show_bug.cgi?id=1325475#c25 for the workaround.
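Concretely, that workaround boils down to a single role assignment; a sketch, with the user and project names taken from the customer output above:

```sh
# Re-add the _member_ role to the admin user on the admin project, then
# re-run the failed stack update. Trust-scoped auth should again find a
# role set that keystone can match against the trustor's assignments.
source ~/stackrc
openstack role add --project admin --user admin _member_
```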
---

Thanks Thomas, Rabi and Zane for the continued support.

The undercloud was upgraded prior to this overcloud upgrade, and as far as I know it was successful; the keystone logs look pretty normal to me for an upgrade. Is there a way to test the trust functionality with a simpler heat stack? (A sketch follows at the end of this thread.) I've asked the customer to add the _member_ role to admin and re-run the heat deployment.

---

Adding the _member_ role to the admin user resolved this issue. Thank you Thomas!

For OSP 9 (upgrading from 8) this step is listed in the docs: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/9/html/upgrading_red_hat_openstack_platform/chap-upgrading_the_environment#sect-Major-Updating_Director_Packages

However, for OSP 10 (upgrading from 9) it is not: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/upgrading_red_hat_openstack_platform/chap-upgrading_the_environment#undercloud_upgrade

I'll open a doc BZ to get this included. Thanks again, Matt

---

Doc BZ: 1475572

---

Supposedly this should have been fixed already by OSP 10, so reassigning to instack-undercloud to investigate why it's still happening.

---

Hi Matt, just wondering: was that env upgraded from OSP 8 to 9 to 10?

---

Sofer/Matt,

The customer told me that this installation began with an OSP 9 installation. I asked about this because I knew the customer had previously done these installation steps and was wondering why they had not hit this issue before. Previously, when they did not hit this issue, they had done an OSP 8 to 9 upgrade and then the OSP 9 to 10.

---

Jack, I believe that was for the overcloud environment. Can you verify whether the undercloud was an 8 -> 9 -> 10 upgrade?

---

Customer confirmed that the undercloud was an 8 -> 9 -> 10 upgrade. The overcloud was an OSP 9 deploy.

---

So we can assume that the _member_ addition was overlooked during the OSP 8 to OSP 9 undercloud upgrade. I think we should close 1475572 as not a bug, and this one as well by the same occasion. Matt, what do you think?

---

Yes, closing. Thank you.
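On the earlier question of testing trust functionality with a simpler heat stack: a minimal sketch using a wait condition, which makes heat-engine process a signal against the stack's stored credentials. The template contents and file name are illustrative, and it is an assumption (not verified in this thread) that this surfaces a broken trust the same way the overcloud's CFN signals did:

```sh
# Minimal stack exercising heat's signal handling. With a broken trust,
# the signal should fail and the stack should not reach CREATE_COMPLETE.
cat > trust-test.yaml <<'EOF'
heat_template_version: 2016-04-08
resources:
  handle:
    type: OS::Heat::WaitConditionHandle
  wait:
    type: OS::Heat::WaitCondition
    properties:
      handle: {get_resource: handle}
      count: 1
      timeout: 600
outputs:
  curl_cli:
    value: {get_attr: [handle, curl_cli]}
EOF

source ~/stackrc
openstack stack create -t trust-test.yaml trust-test

# Once the stack is CREATE_IN_PROGRESS, send the signal using the curl
# command exposed as a stack output; with working trusts the stack then
# goes CREATE_COMPLETE.
eval "$(openstack stack output show trust-test curl_cli -f value -c output_value)"
```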