Red Hat Bugzilla – Bug 1467849
Possible race condition - deployments fail because keystone is not up and running on time
Last modified: 2018-03-07 06:44:46 EST
Description of problem:
Deployments randomly fail in overcloud.AllNodesDeploySteps.ControllerDeployment_Step5.0 with this error (visible in "openstack stack failures list"):

Error: ceilometer-upgrade --skip-metering-database returned 1 instead of one of [0]

If you re-run the deployment command, it resumes from where it failed and finishes successfully.

It appears that Keystone is not running when it is needed by ceilometer. ceilometer-upgrade tries to authenticate with keystone in order to create resource types in gnocchi, but keystone throws a 503:

10.35.191.20 - - [02/Jul/2017:14:48:06 +0000] "GET /v1/resource_type/instance_disk HTTP/1.1" 503 170 "-" "ceilometer-upgrade keystoneauth1/2.18.0 python-requests/2.11.1 CPython/2.7.5"
2017-07-02 14:48:11.800 116449 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 503
2017-07-02 14:48:11.800 116449 WARNING keystonemiddleware.auth_token [-] Identity response: <html><body><h1>503 Service Unavailable</h1> No server is available to handle this request.

and ceilometer-upgrade fails with:

2017-07-02 14:48:11.803 123807 CRITICAL ceilometer [-] ClientException: {"message": "The server is currently unavailable. Please try again at a later time.<br /><br />\n\n\n", "code": "503 Service Unavailable", "title": "Service Unavailable"} (HTTP 503)

Version-Release number of selected component (if applicable):
OSP11 GA 2017-06-30.3

How reproducible:
Randomly

Steps to Reproduce:
1. Deploy the default plan. I deployed a topology of 3 controllers + 1 compute + 3 ceph with this command:
   openstack overcloud deploy --templates --ntp-server clock.redhat.com -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e counters.yaml
2. If you hit this error, re-run the command to resume from where it failed; it passes the 2nd time.

Expected results:
Deployment should pass the first time. You shouldn't hit errors due to keystone not being up and running on time.
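For reference, this is roughly how the failure above can be pulled out of Heat after a failed run; the stack name "overcloud" is the default and is only an assumption here:

    # Show the failed deployments and their deploy_stderr for the stack
    openstack stack failures list overcloud --long

    # Optionally walk the nested resources to locate the failed step
    openstack stack resource list overcloud -n 5 | grep -i failed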
Created attachment 1295342 [details] puppet logs
Created attachment 1295833 [details] logs: haproxy messages and httpd
There are 89 errors like this one:

Server aodh/overcloud-controller-0.internalapi.cwdg720xd01.org is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

Timeline:
23:58:40: haproxy startup
23:58:40: first message about Layer4 not responding; a lot of connections are refused with the error above
23:59:58: first connection accepted
00:00:27: haproxy adds the localhost
00:00:49: haproxy starts to add the other hosts; everything is back to normal
00:23:23: controller0 is down
00:23:37: controller2 is down
00:23:27: controller0 is up
00:26:00: controller2 is up
00:26:04: controller1 is down
00:26:50: controller2 is down
00:28:47: controller0 is down
00:28:51: controller0 is up
Everything is back to normal.

In gnocchi_wsgi_error.log-20170709:

[Sat Jul 08 00:20:27.679270 2017] [:error] [pid 50912] Option "metric_processing_delay" from group "storage" is deprecated. Use option "metric_processing_delay" from group "metricd".
[Sat Jul 08 00:20:27.772207 2017] [:error] [pid 50912] [remote 192.168.140.30:120] mod_wsgi (pid=50912): Target WSGI script '/var/www/cgi-bin/gnocchi/app' cannot be loaded as Python module.
[Sat Jul 08 00:20:27.772226 2017] [:error] [pid 50912] [remote 192.168.140.30:120] mod_wsgi (pid=50912): Exception occurred processing WSGI script '/var/www/cgi-bin/gnocchi/app'.
[Sat Jul 08 00:20:27.772248 2017] [:error] [pid 50912] [remote 192.168.140.30:120] Traceback (most recent call last):
[Sat Jul 08 00:20:27.772262 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "/var/www/cgi-bin/gnocchi/app", line 22, in <module>
[Sat Jul 08 00:20:27.772280 2017] [:error] [pid 50912] [remote 192.168.140.30:120]     application = app.build_wsgi_app()
[Sat Jul 08 00:20:27.772286 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "/usr/lib/python2.7/site-packages/gnocchi/rest/app.py", line 136, in build_wsgi_app
[Sat Jul 08 00:20:27.772296 2017] [:error] [pid 50912] [remote 192.168.140.30:120]     return load_app(service.prepare_service())
[Sat Jul 08 00:20:27.772301 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "/usr/lib/python2.7/site-packages/gnocchi/rest/app.py", line 91, in load_app
[Sat Jul 08 00:20:27.772308 2017] [:error] [pid 50912] [remote 192.168.140.30:120]     storage = gnocchi_storage.get_driver(conf)
[Sat Jul 08 00:20:27.772312 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "/usr/lib/python2.7/site-packages/gnocchi/storage/__init__.py", line 144, in get_driver
[Sat Jul 08 00:20:27.772634 2017] [:error] [pid 50912] [remote 192.168.140.30:120]     conf.incoming)
[Sat Jul 08 00:20:27.772648 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "/usr/lib/python2.7/site-packages/gnocchi/storage/incoming/ceph.py", line 32, in __init__
[Sat Jul 08 00:20:27.772894 2017] [:error] [pid 50912] [remote 192.168.140.30:120]     self.rados, self.ioctx = ceph.create_rados_connection(conf)
[Sat Jul 08 00:20:27.772906 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "/usr/lib/python2.7/site-packages/gnocchi/storage/common/ceph.py", line 59, in create_rados_connection
[Sat Jul 08 00:20:27.773113 2017] [:error] [pid 50912] [remote 192.168.140.30:120]     ioctx = conn.open_ioctx(conf.ceph_pool)
[Sat Jul 08 00:20:27.773122 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "cradox.pyx", line 413, in cradox.requires.wrapper.validate_func (cradox.c:4188)
[Sat Jul 08 00:20:27.773207 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "cradox.pyx", line 1047, in cradox.Rados.open_ioctx (cradox.c:12325)
[Sat Jul 08 00:20:27.773261 2017] [:error] [pid 50912] [remote 192.168.140.30:120] ObjectNotFound: error opening pool 'metrics'
[Sat Jul 08 00:20:38.619911 2017] [:error] [pid 50915] Option "metric_processing_delay" from group "storage" is deprecated. Use option "metric_processing_delay" from group "metricd".
[Sat Jul 08 00:20:38.701378 2017] [:error] [pid 50915] [remote 192.168.140.30:112] mod_wsgi (pid=50915): Target WSGI script '/var/www/cgi-bin/gnocchi/app' cannot be loaded as Python module.
[Sat Jul 08 00:20:38.701405 2017] [:error] [pid 50915] [remote 192.168.140.30:112] mod_wsgi (pid=50915): Exception occurred processing WSGI script '/var/www/cgi-bin/gnocchi/app'.
[Sat Jul 08 00:20:38.701428 2017] [:error] [pid 50915] [remote 192.168.140.30:112] Traceback (most recent call last):
[Sat Jul 08 00:20:38.701442 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "/var/www/cgi-bin/gnocchi/app", line 22, in <module>
[Sat Jul 08 00:20:38.701463 2017] [:error] [pid 50915] [remote 192.168.140.30:112]     application = app.build_wsgi_app()
[Sat Jul 08 00:20:38.701470 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "/usr/lib/python2.7/site-packages/gnocchi/rest/app.py", line 136, in build_wsgi_app
[Sat Jul 08 00:20:38.701481 2017] [:error] [pid 50915] [remote 192.168.140.30:112]     return load_app(service.prepare_service())
[Sat Jul 08 00:20:38.701492 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "/usr/lib/python2.7/site-packages/gnocchi/rest/app.py", line 91, in load_app
[Sat Jul 08 00:20:38.701501 2017] [:error] [pid 50915] [remote 192.168.140.30:112]     storage = gnocchi_storage.get_driver(conf)
[Sat Jul 08 00:20:38.701506 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "/usr/lib/python2.7/site-packages/gnocchi/storage/__init__.py", line 144, in get_driver
[Sat Jul 08 00:20:38.701615 2017] [:error] [pid 50915] [remote 192.168.140.30:112]     conf.incoming)
[Sat Jul 08 00:20:38.701626 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "/usr/lib/python2.7/site-packages/gnocchi/storage/incoming/ceph.py", line 32, in __init__
[Sat Jul 08 00:20:38.701695 2017] [:error] [pid 50915] [remote 192.168.140.30:112]     self.rados, self.ioctx = ceph.create_rados_connection(conf)
[Sat Jul 08 00:20:38.701703 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "/usr/lib/python2.7/site-packages/gnocchi/storage/common/ceph.py", line 59, in create_rados_connection
[Sat Jul 08 00:20:38.701739 2017] [:error] [pid 50915] [remote 192.168.140.30:112]     ioctx = conn.open_ioctx(conf.ceph_pool)
[Sat Jul 08 00:20:38.701747 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "cradox.pyx", line 413, in cradox.requires.wrapper.validate_func (cradox.c:4188)
[Sat Jul 08 00:20:38.701814 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "cradox.pyx", line 1047, in cradox.Rados.open_ioctx (cradox.c:12325)
[Sat Jul 08 00:20:38.701870 2017] [:error] [pid 50915] [remote 192.168.140.30:112] ObjectNotFound: error opening pool 'metrics'
Created attachment 1295843 [details] overcloud deploy output
The deployment finished with a CREATE_FAILED at 2017-07-08 00:29:01Z.
Goneri: This is not the same race condition Udi saw. In your case, the metrics pool has not been created in Ceph. Gnocchi won't create the metrics pool in Ceph; it just uses it. In the case of external Ceph, you need to create the pool yourself. See the related doc bug: https://bugzilla.redhat.com/show_bug.cgi?id=1412295
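For reference, creating the missing pool on an external Ceph cluster is roughly the following; the pool name "metrics" matches the error above, while the placement-group counts are only example values and should be sized for the actual cluster:

    # Create the pool the gnocchi ceph driver expects; PG counts are assumptions
    ceph osd pool create metrics 128 128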
The gnocchi pool may be a misconfiguration, but I disagree with your conclusion:
- the deployment was working fine until the last puddle
- if I restart the deployment, it recovers and finishes fine
- if I run ceilometer-upgrade --skip-metering-database manually, it succeeds.
OK, then let's keep this bug focused on the race condition. You still have issues with the Ceph pools; don't ignore that.
The reason your deployment fails is a 503 from keystone:

192.168.140.30 - - [08/Jul/2017:00:25:57 +0000] "GET /v1/resource_type/swift_account HTTP/1.1" 503 170 "-" "ceilometer-upgrade keystoneauth1/2.18.0 python-requests/2.11.1 CPython/2.7.5"

ceilometer-upgrade tries to make a request to gnocchi to create resource types, but when authenticating, keystone throws a "server unavailable" and the request gets a 503. I don't see ceilometer-upgrade.log in your logs, but I would expect a service unavailable traceback there.

In the keystone logs I see ceilometer-upgrade making the request:

192.168.140.30 - - [08/Jul/2017:00:24:50 +0000] "GET / HTTP/1.1" 300 602 "-" "ceilometer-upgrade keystoneauth1/2.18.0 python-requests/2.11.1 CPython/2.7.5"
192.168.140.30 - - [08/Jul/2017:00:24:55 +0000] "POST /v3/auth/tokens HTTP/1.1" 201 9439 "-" "gnocchi/3.1.2 keystonemiddleware.auth_token/4.14.0 keystoneauth1/2.18.0 python-requests/2.11.1 CPython/2.7.5"

I do see a bunch of these in the haproxy logs, but I don't know if they are related:

Jul 08 00:26:50 overcloud-controller-0.cwdg720xd01.org haproxy[35773]: Server horizon/overcloud-controller-2.internalapi.cwdg720xd01.org is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
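A quick way to see whether keystone is actually reachable through haproxy at a given moment is to hit the endpoint directly from a controller; the address below is taken from the log lines above, and the keystone port is an assumption about this environment:

    # A 300 from the keystone version root is normal; a 503 here means haproxy
    # currently has no live keystone backend to forward the request to
    curl -s -o /dev/null -w '%{http_code}\n' http://192.168.140.30:5000/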
I reproduced the problem with the last puddle (2017-06-30.3). sosreport: http://file.rdu.redhat.com/~gleboude/BZ1467849/
Thanks Goneri. The issue turned out to be gnocchi statsd being configured in the same step and triggering a notify on httpd, which was bringing keystone down while ceilometer-upgrade was trying to authenticate at the same time. The following patch should ensure we run ceilometer-upgrade before the gnocchi configuration kicks in, so we don't end up in this race condition: https://review.openstack.org/#/c/482712/
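For anyone who wants to try the proposed change locally before it merges, one way to fetch it is with git-review; this is only a sketch and assumes git-review is installed on the machine doing the testing:

    # Download the proposed puppet-tripleo change for local testing
    git clone https://git.openstack.org/openstack/puppet-tripleo
    cd puppet-tripleo
    git review -d 482712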
I just tried https://review.openstack.org/#/c/482712/1/manifests/profile/base/ceilometer/collector.pp at Pradeep's request and I still hit the same problem.
I keep hitting this problem repeatedly in downstream HA automation - around 2/3 of the jobs fail on it. It doesn't happen on every run, but it happens very often, which makes our testing very slow and unreliable.
Fixing the DFG; this has nothing to do with security.
I was having this issue; I just tried https://review.openstack.org/#/c/482712/3/manifests/profile/base/ceilometer/collector.pp and it works. To test/apply the patch, I modified the overcloud-full.qcow2 image, editing /usr/share/openstack-puppet/modules/tripleo/manifests/profile/base/ceilometer/collector.pp inside the image with virt-edit (see the sketch below).
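Roughly, the steps on the undercloud look like this; the image path and the need to re-upload the image are assumptions about the particular environment:

    # Edit the manifest inside the overcloud image (libguestfs virt-edit)
    virt-edit -a overcloud-full.qcow2 \
        /usr/share/openstack-puppet/modules/tripleo/manifests/profile/base/ceilometer/collector.pp

    # Push the modified image back to Glance so new deployments pick it up
    openstack overcloud image upload --update-existing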
Is there some indication in the log that the fix is there? If it fails randomly we can't just run it and verify based on that deployment...
(In reply to Sasha Smolyak from comment #20)
> Is there some indication in the log that the fix is there? If it fails
> randomly we can't just run it and verify based on that deployment...

In my case it was always failing, using templates from the HCI Reference Architecture at https://github.com/RHsyseng/hci, and now it's always working.
Tested various deployment topologies and didn't hit this issue (it used to happen to me 99% of the time). Verified in puppet-tripleo-6.5.0-5.el7ost.noarch.
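To answer the earlier question about confirming the fix is in place before verifying: since the fix landed in puppet-tripleo, checking the installed package on the nodes is one straightforward indication. The fixed build number below is taken from the comment above:

    # Confirm the installed puppet-tripleo is at least the fixed build
    rpm -q puppet-tripleo
    # expected: puppet-tripleo-6.5.0-5.el7ost.noarch (or newer)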
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2406