Bug 1473800
| Summary: | osp 9 -> 10 upgrade failure, HTTP/500 from CFN signal, Expected exception during message handling from heat-engine | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Matt Flusche <mflusche> |
| Component: | instack-undercloud | Assignee: | Sofer Athlan-Guyot <sathlang> |
| Status: | CLOSED NOTABUG | QA Contact: | Arik Chernetsky <achernet> |
| Severity: | high | Priority: | high |
| Version: | 10.0 (Newton) | Keywords: | Triaged |
| Target Milestone: | --- | Target Release: | --- |
| Hardware: | x86_64 | OS: | Linux |
| CC: | aschultz, augol, jack.ottofaro, mburns, mflusche, ramishra, rhel-osp-director-maint, sathlang, sbaker, shardy, srevivo, therve, zbitter | | |
| Last Closed: | 2017-08-07 13:33:21 UTC | Type: | Bug |
Description (Matt Flusche, 2017-07-21 17:18:14 UTC)
I see a couple of errors like this in the engine logs:

```
2017-07-18 22:29:17.350 6980 INFO heat.engine.stack [req-b5f96e47-b167-49ad-886c-b4d5db918a02 65dd7dce219249f48a4fc92dc6173d52 502217386c5c4e60b03a6564e47d0e68 - - -] Stack UPDATE IN_PROGRESS (ocd00-Controller-thij7w32l6ay-2-dxtk5lywaziu-NodeTLSCAData-les7ykumlh56): Stack UPDATE started
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource [req-88e076f0-7ba5-457f-a5bb-ac647aa83bf4 - 502217386c5c4e60b03a6564e47d0e68 - - -] Resource type OS::Nova::Server unavailable
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource Traceback (most recent call last):
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 184, in _validate_service_availability
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     (svc_available, reason) = cls.is_service_available(context)
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 704, in is_service_available
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     service_name=cls.default_client_name)
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/clients/client_plugin.py", line 175, in does_endpoint_exist
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     endpoint_type=endpoint_type)
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/clients/client_plugin.py", line 99, in url_for
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     keystone_session = self.context.keystone_session
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/common/context.py", line 149, in keystone_session
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     if self.auth_needs_refresh():
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/common/context.py", line 162, in auth_needs_refresh
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     auth_ref = self.auth_plugin.get_auth_ref(self._keystone_session)
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/keystoneauth1/identity/generic/base.py", line 181, in get_auth_ref
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     return self._plugin.get_auth_ref(session, **kwargs)
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/keystoneauth1/identity/v3/base.py", line 167, in get_auth_ref
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     authenticated=False, log=False, **rkwargs)
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 681, in post
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     return self.request(url, 'POST', **kwargs)
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/positional/__init__.py", line 101, in inner
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     return wrapped(*args, **kwargs)
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 570, in request
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource     raise exceptions.from_response(resp, method, url)
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource Forbidden: You are not authorized to perform the requested action. (HTTP 403) (Request-ID: req-624f399c-a171-4567-a648-7ef39126a227)
```

Those correspond to the first two 500 errors in the heat-api-cfn log, one in each of the two heat-engine processes that are running.

Here's my guess: the keystone token is timing out, and Heat is unable to refresh the token. (This will be fixed in OSP 12 by bug 1438015.) It fails once per engine with the traceback and then silently from then on.

If this is even remotely close to the mark, then a workaround is to extend the time-to-live of keystone tokens beyond the maximum time you expect an overcloud create/update to take.
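For reference, a minimal sketch of that token-TTL workaround on the undercloud. The 24-hour value is an arbitrary example, and the keystone service unit varies by release (standalone `openstack-keystone`, or `httpd` when keystone runs as a WSGI app):

```sh
# Raise the keystone token TTL (seconds) above the worst-case time an
# overcloud create/update is expected to take; 86400 = 24 hours.
# openstack-config is a crudini wrapper shipped in openstack-utils.
sudo openstack-config --set /etc/keystone/keystone.conf token expiration 86400

# Restart keystone so the new TTL takes effect; pick the unit that
# actually runs keystone on this undercloud.
sudo systemctl restart openstack-keystone 2>/dev/null || sudo systemctl restart httpd
```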
---

Hi Zane,

Thanks for looking at the issue and finding this token expiration. I'm trying to understand how the token timeout could have been exceeded in this instance. The keystone token TTL is 4 hours:

```
/etc/keystone/keystone.conf
expiration = 14400
```

heat-engine was restarted about an hour before the issue:

```
2017-07-18 21:29:06.278 6116 INFO oslo_service.service [-] Starting 2 workers
2017-07-18 21:29:06.282 6116 DEBUG oslo_service.service [-] Started child 6980 _start_child /usr/lib/python2.7/site-packages/oslo_service/service.py:500
2017-07-18 21:29:06.288 6116 DEBUG oslo_service.service [-] Started child 6981 _start_child /usr/lib/python2.7/site-packages/oslo_service/service.py:500
```

And the time between the heat stack update and the failure was about 4 minutes:

```
2017-07-18 22:25:36.361 6980 DEBUG heat.engine.scheduler [req-b5f96e47-b167-49ad-886c-b4d5db918a02 65dd7dce219249f48a4fc92dc6173d52 502217386c5c4e60b03a6564e47d0e68 - - -] Task Stack "ocd00" [99777d35-4fcb-4abd-94cc-89aa0631bb8d] Update starting start /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:184
2017-07-18 22:29:17.417 6980 ERROR heat.engine.resource [req-88e076f0-7ba5-457f-a5bb-ac647aa83bf4 - 502217386c5c4e60b03a6564e47d0e68 - - -] Resource type OS::Nova::Server unavailable
```

Is this exception raised while heat is refreshing its keystone token?

---

Can you check the roles of the user? That reminds me of the issue we had when _member_ was removed but was still present in the trust.

---

Thomas,

Would that be the user executing the stack update?

Matt
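The _member_-in-trust theory can be checked from the undercloud, since the trusts heat creates are visible through python-openstackclient. A minimal sketch, assuming admin credentials are loaded; the trust ID shown is a placeholder:

```sh
# List the trusts heat has created (one per stack when heat runs with
# deferred_auth_method=trusts, the default in these releases).
source ~/stackrc
openstack trust list

# Inspect one trust (placeholder ID). The roles it delegates must still
# be assigned to the trustor, or trust-scoped auth starts returning 403.
openstack trust show 1b5a0ec5b0ca44a68ee0b13010b4a742
```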
---

OK, so much for that theory.

It's crazy that we're calling _validate_service_availability() in the first place; I raised a bug upstream to stop doing that. Apparently Heat is trying to fetch the Keystone catalog to do that check and is being denied. We're not sure why, because there are no Keystone logs available. But the credentials it's using to handle the signal (which IIRC should be the trust stored in the database when creating the stack) are not being accepted. It's possible this could be due to a misconfiguration of Heat, but the config looks fairly normal to me. Rabi, any ideas?

---

Yes, admin in this case.

From the logs it seems some requests are successfully authenticated and some are failing randomly. Some signals fail with 500 (the error from heat-engine that Zane mentioned above) or simply 403, as below:

```
2017-07-18 13:03:35.776 1310 INFO eventlet.wsgi.server [-] 10.224.0.106 - - [18/Jul/2017 13:03:35] "POST /v1/signal/arn%3Aopenstack%3Aheat%3A%3A502217386c5c4e60b03a6564e47d0e68%3Astacks%2Focd00-ComputeNodesPostDeployment-wfnqgujqigck-ComputePuppetDeployment-n6umawhcz52z%2F850dbe92-831f-4b0c-bfa6-ca2407e584ed%2Fresources%2F2?Timestamp=2017-07-18T16%3A44%3A39Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=392a87c8d74442b5a9b916ca3f1596cc&SignatureVersion=2&Signature=MxFXyeYD6win%2BSNInRqYuuke75oGF2OoNYkBmEOl4kE%3D HTTP/1.1" 403 306 0.028369
```

There are a number of successful signals (200) in between, too. This does not look like a heat issue to me; most probably a keystone issue. We may need the keystone logs from the time of the failures.

---

sosreport has keystone logs from 2017-07-20. Here are the full keystone logs: http://collab-shell.usersys.redhat.com/01819909/keystone-logs-director.tar.gz/

And the associated keystone auth failure:

```
2017-07-18 22:29:16.793 3193 DEBUG keystone.middleware.auth [req-624f399c-a171-4567-a648-7ef39126a227 - - - - -] There is either no auth token in the request or the certificate issuer is not trusted. No auth context will be set. fill_context /usr/lib/python2.7/site-packages/keystone/middleware/auth.py:210
2017-07-18 22:29:16.795 3193 INFO keystone.common.wsgi [req-624f399c-a171-4567-a648-7ef39126a227 - - - - -] POST http://10.224.0.5:35357/v3/auth/tokens
2017-07-18 22:29:17.054 3193 DEBUG oslo_messaging._drivers.amqpdriver [req-624f399c-a171-4567-a648-7ef39126a227 - - - - -] CAST unique_id: aa2a2f36002046538d1e79e3915e9c58 NOTIFY exchange 'keystone' topic 'notifications.info' _send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:432
2017-07-18 22:29:17.409 3193 WARNING keystone.common.wsgi [req-624f399c-a171-4567-a648-7ef39126a227 - - - - -] You are not authorized to perform the requested action.
2017-07-18 22:29:25.323 3193 DEBUG keystone.middleware.auth [req-624f399c-a171-4567-a648-7ef39126a227 - - - - -] Authenticating user token process_request /usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py:346
```

---

From the logs it seems there are some 403s for ec2 credentials (used for signaling) where the credentials could not be found in the db around 2017-07-18 13:03:35:

```
2017-07-18 13:03:35.770 1535 WARNING keystone.common.wsgi [req-c8acb534-e9c2-465b-a0d1-2106209298f6 - - - - -] Could not find credential: 9ac0a18332453169749a4b9a95463d0ea75cc60dd477e295e331762b3cbe6520
```

Not sure why that started happening, though; it seems to affect only a few ec2 credentials. But then it looks like all services (heat, keystone, mariadb) were shut down and brought back up around 2017-07-18 17:11. I can also see some migration scripts running for keystone (was there an undercloud keystone upgrade in between?):

```
2017-07-18 17:12:08.227 16170 DEBUG migrate.versioning.repository [-] Repository /usr/lib/python2.7/site-packages/keystone/common/sql/migrate_repo loaded successfully __init__ /usr/lib/python2.7/site-packages/migrate/versioning/repository.py:82
2017-07-18 17:12:08.227 16170 DEBUG migrate.versioning.repository [-] Config: OrderedDict([('db_settings', OrderedDict([('__name__', 'db_settings'), ('repository_id', 'keystone'), ('version_table', 'migrate_version'), ('required_dbs', '[]'), ('use_timestamp_numbering', 'False')]))]) __init__ /usr/lib/python2.7/site-packages/migrate/versioning/repository.py:83
2017-07-18 17:12:08.232 16170 INFO migrate.versioning.api [-] 96 -> 97...
2017-07-18 17:12:08.245 16170 INFO migrate.versioning.api [-] done
2017-07-18 17:12:08.245 16170 INFO migrate.versioning.api [-] 97 -> 98...
2017-07-18 17:12:08.251 16170 INFO migrate.versioning.api [-] done
2017-07-18 17:12:08.251 16170 INFO migrate.versioning.api [-] 98 -> 99...
2017-07-18 17:12:08.256 16170 INFO migrate.versioning.api [-] done
```

Is it allowed to do an undercloud upgrade while an overcloud upgrade is going on? I guess that undercloud keystone upgrade did not go well: there are plenty of errors in the keystone logs after that, and it seems trust-based auth is not working.

---

I haven't got a response yet regarding the roles of the admin user. This looks like the known issue of roles being deleted during upgrade and breaking the trust created by Heat, but we need the roles to confirm.

---

Is this the role information needed? This was provided by the customer:

```
[stack@ospd00 log]$ openstack role assignment list --project admin --user admin --names
+-------+-------+---------+
| Role  | User  | Project |
+-------+-------+---------+
| admin | admin | admin   |
+-------+-------+---------+
```

---

Right, so I believe _member_ is still necessary. See https://bugzilla.redhat.com/show_bug.cgi?id=1325475#c25 for the workaround.
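Concretely, that workaround boils down to a single role assignment; a sketch, with the user and project names taken from the customer output above:

```sh
# Re-add the _member_ role to the admin user on the admin project, then
# re-run the failed stack update. Trust-scoped auth should again find a
# role set that keystone can match against the trustor's assignments.
source ~/stackrc
openstack role add --project admin --user admin _member_
```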
---

Thanks Thomas, Rabi and Zane for the continued support.

The undercloud was upgraded prior to this overcloud upgrade, and as far as I know it was successful; the keystone logs look pretty normal to me for an upgrade. Is there a way to test the trust functionality with a simpler heat stack? (A sketch follows at the end of this thread.) I've asked the customer to add the _member_ role to admin and re-run the heat deployment.

---

Adding the _member_ role to the admin user resolved this issue. Thank you Thomas!

For OSP 9 (upgrading from 8) this step is listed in the docs: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/9/html/upgrading_red_hat_openstack_platform/chap-upgrading_the_environment#sect-Major-Updating_Director_Packages

However, for OSP 10 (upgrading from 9) it is not: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/upgrading_red_hat_openstack_platform/chap-upgrading_the_environment#undercloud_upgrade

I'll open a doc BZ to get this included. Thanks again, Matt

---

Doc BZ: 1475572

---

Supposedly this should have been fixed already by OSP 10, so reassigning to instack-undercloud to investigate why it's still happening.

---

Hi Matt, just wondering: was that env upgraded from OSP 8 to 9 to 10?

---

Sofer/Matt,

The customer told me that this installation began with an OSP 9 installation. I asked about this because I knew the customer had previously done these installation steps and was wondering why they had not hit this issue before. Previously, when they did not hit this issue, they had done an OSP 8 to 9 upgrade and then the OSP 9 to 10.

---

Jack, I believe that was for the overcloud environment. Can you verify whether the undercloud was an 8 -> 9 -> 10 upgrade?

---

Customer confirmed that the undercloud was an 8 -> 9 -> 10 upgrade. The overcloud was an OSP 9 deploy.

---

So we can assume that the _member_ addition was overlooked during the OSP 8 to OSP 9 undercloud upgrade. I think we should close 1475572 as not a bug, and this one as well by the same occasion. Matt, what do you think?

---

Yes, closing. Thank you.
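On the earlier question of testing trust functionality with a simpler heat stack: a minimal sketch using a wait condition, which makes heat-engine process a signal against the stack's stored credentials. The template contents and file name are illustrative, and it is an assumption (not verified in this thread) that this surfaces a broken trust the same way the overcloud's CFN signals did:

```sh
# Minimal stack exercising heat's signal handling. With a broken trust,
# the signal should fail and the stack should not reach CREATE_COMPLETE.
cat > trust-test.yaml <<'EOF'
heat_template_version: 2016-04-08
resources:
  handle:
    type: OS::Heat::WaitConditionHandle
  wait:
    type: OS::Heat::WaitCondition
    properties:
      handle: {get_resource: handle}
      count: 1
      timeout: 600
outputs:
  curl_cli:
    value: {get_attr: [handle, curl_cli]}
EOF

source ~/stackrc
openstack stack create -t trust-test.yaml trust-test

# Once the stack is CREATE_IN_PROGRESS, send the signal using the curl
# command exposed as a stack output; with working trusts the stack then
# goes CREATE_COMPLETE.
eval "$(openstack stack output show trust-test curl_cli -f value -c output_value)"
```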