Bug 1467849 - Possible race condition - deployments fail because keystone is not up and running on time [NEEDINFO]
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 11.0 (Ocata)
Hardware: Unspecified  OS: Unspecified
Priority: urgent  Severity: urgent
Target Milestone: async
Target Release: 11.0 (Ocata)
Assigned To: RHOS Maint
QA Contact: Sasha Smolyak
Keywords: AutomationBlocker, Triaged, ZStream
Reported: 2017-07-05 06:35 EDT by Udi
Modified: 2017-08-01 20:50 EDT
CC: 15 users

Fixed In Version: puppet-tripleo-6.5.0-3.1.el7ost
Doc Type: Known Issue
Doc Text:
Cause: During an OSP11 overcloud deployment, there is a race condition where, on occasion, ceilometer-upgrade runs at the same time that Apache is restarted because other services are being configured in the same step. Consequence: The ceilometer upgrade can fail to authenticate with keystone while Apache is not yet active, which aborts the deployment as a failure. Workaround: When this happens, re-run the overcloud deploy; it resumes from where it failed, gets past the race condition, and proceeds normally. Result: The deployment succeeds instead of failing with an error.
Last Closed: 2017-08-01 20:50:11 EDT
Type: Bug
ssmolyak: needinfo? (pkilambi)


Attachments
puppet logs (26.52 KB, application/octet-stream)
2017-07-07 12:33 EDT, Gonéri Le Bouder
logs: haproxy messages and httpd (1.16 MB, application/x-gzip)
2017-07-10 09:32 EDT, Gonéri Le Bouder
overcloud deploy output (273.33 KB, application/x-gzip)
2017-07-10 10:05 EDT, Gonéri Le Bouder


External Trackers:
OpenStack gerrit 482712 (last updated 2017-07-11 17:37 EDT)

Description Udi 2017-07-05 06:35:01 EDT
Description of problem:

Deployments randomly fail in overcloud.AllNodesDeploySteps.ControllerDeployment_Step5.0 with this error (visible in "openstack stack failures list"):

Error: ceilometer-upgrade --skip-metering-database returned 1 instead of one of [0]

If you re-run the deployment command, it resumes from where it failed and finishes successfully. It appears that keystone is not running when ceilometer needs it: ceilometer-upgrade tries to authenticate with keystone in order to create resource types in gnocchi, but keystone returns a 503:

10.35.191.20 - - [02/Jul/2017:14:48:06 +0000] "GET
/v1/resource_type/instance_disk HTTP/1.1" 503 170 "-"
"ceilometer-upgrade keystoneauth1/2.18.0 python-requests/2.11.1
CPython/2.7.5"

2017-07-02 14:48:11.800 116449 ERROR keystonemiddleware.auth_token [-]
Bad response code while validating token: 503
2017-07-02 14:48:11.800 116449 WARNING keystonemiddleware.auth_token
[-] Identity response: <html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.

and ceilometer-upgrade fails with:

2017-07-02 14:48:11.803 123807 CRITICAL ceilometer [-]
ClientException: {"message": "The server is currently unavailable.
Please try again at a later time.<br /><br />\n\n\n", "code": "503
Service Unavailable", "title": "Service Unavailable"} (HTTP 503)


Version-Release number of selected component (if applicable):
OSP11 GA 2017-06-30.3


How reproducible:
randomly


Steps to Reproduce:
1. Deploy the default plan. I deployed a topology of 3 controllers + 1 compute + 3 ceph with this command: openstack overcloud deploy --templates --ntp-server clock.redhat.com -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e counters.yaml
2. If you hit this error, re-run the same command to resume from where it failed; it passes the second time (see the sketch below).
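
A minimal sketch of the workaround (the stack name "overcloud" and the deploy command from step 1 are assumed):

    # Show which resource failed and why (the source of the error above)
    openstack stack failures list overcloud

    # Re-run the exact same deploy command; Heat resumes from the failed step
    openstack overcloud deploy --templates \
      --ntp-server clock.redhat.com \
      -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
      -e counters.yaml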


Expected results:
Deployment should pass the first time. You shouldn't hit errors due to keystone not being up and running on time.
Comment 1 Gonéri Le Bouder 2017-07-07 12:33 EDT
Created attachment 1295342 [details]
puppet logs
Comment 2 Gonéri Le Bouder 2017-07-10 09:32 EDT
Created attachment 1295833 [details]
logs: haproxy messages and httpd
Comment 3 Gonéri Le Bouder 2017-07-10 10:00:42 EDT
There are 89 errors like this one:
Server aodh/overcloud-controller-0.internalapi.cwdg720xd01.org is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

23:58:40: haproxy startup
23:58:40: first message about Layer 4 not responding
a lot of connections are refused with the error above
23:59:58: first connection accepted
00:00:27: haproxy adds the localhost
00:00:49: haproxy starts to add the other hosts
everything is back to normal
00:23:23: controller0 is down
00:23:37: controller2 is down
00:23:27: controller0 is up
00:26:00: controller2 is up
00:26:04: controller1 is down
00:26:50: controller2 is down
00:28:47: controller0 is down
00:28:51: controller0 is up
everything is back to normal
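
For anyone triaging a similar run, a hedged way to pull these events from the logs (this assumes haproxy logs to /var/log/messages via rsyslog, as suggested by the "haproxy messages" attachment):

    # Count the Layer4 refusals quoted above
    grep -c 'Layer4 connection problem' /var/log/messages

    # Rebuild the DOWN/UP timeline for the backends
    grep -E 'Server .* is (DOWN|UP)' /var/log/messages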



in gnocchi_wsgi_error.log-20170709:
[Sat Jul 08 00:20:27.679270 2017] [:error] [pid 50912] Option "metric_processing_delay" from group "storage" is deprecated. Use option "metric_processing_delay" from group "metricd".
[Sat Jul 08 00:20:27.772207 2017] [:error] [pid 50912] [remote 192.168.140.30:120] mod_wsgi (pid=50912): Target WSGI script '/var/www/cgi-bin/gnocchi/app' cannot be loaded as Python module.
[Sat Jul 08 00:20:27.772226 2017] [:error] [pid 50912] [remote 192.168.140.30:120] mod_wsgi (pid=50912): Exception occurred processing WSGI script '/var/www/cgi-bin/gnocchi/app'.
[Sat Jul 08 00:20:27.772248 2017] [:error] [pid 50912] [remote 192.168.140.30:120] Traceback (most recent call last):
[Sat Jul 08 00:20:27.772262 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "/var/www/cgi-bin/gnocchi/app", line 22, in <module>
[Sat Jul 08 00:20:27.772280 2017] [:error] [pid 50912] [remote 192.168.140.30:120]     application = app.build_wsgi_app()
[Sat Jul 08 00:20:27.772286 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "/usr/lib/python2.7/site-packages/gnocchi/rest/app.py", line 136, in build_wsgi_app
[Sat Jul 08 00:20:27.772296 2017] [:error] [pid 50912] [remote 192.168.140.30:120]     return load_app(service.prepare_service())
[Sat Jul 08 00:20:27.772301 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "/usr/lib/python2.7/site-packages/gnocchi/rest/app.py", line 91, in load_app
[Sat Jul 08 00:20:27.772308 2017] [:error] [pid 50912] [remote 192.168.140.30:120]     storage = gnocchi_storage.get_driver(conf)
[Sat Jul 08 00:20:27.772312 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "/usr/lib/python2.7/site-packages/gnocchi/storage/__init__.py", line 144, in get_driver
[Sat Jul 08 00:20:27.772634 2017] [:error] [pid 50912] [remote 192.168.140.30:120]     conf.incoming)
[Sat Jul 08 00:20:27.772648 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "/usr/lib/python2.7/site-packages/gnocchi/storage/incoming/ceph.py", line 32, in __init__
[Sat Jul 08 00:20:27.772894 2017] [:error] [pid 50912] [remote 192.168.140.30:120]     self.rados, self.ioctx = ceph.create_rados_connection(conf)
[Sat Jul 08 00:20:27.772906 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "/usr/lib/python2.7/site-packages/gnocchi/storage/common/ceph.py", line 59, in create_rados_connection
[Sat Jul 08 00:20:27.773113 2017] [:error] [pid 50912] [remote 192.168.140.30:120]     ioctx = conn.open_ioctx(conf.ceph_pool)
[Sat Jul 08 00:20:27.773122 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "cradox.pyx", line 413, in cradox.requires.wrapper.validate_func (cradox.c:4188)
[Sat Jul 08 00:20:27.773207 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "cradox.pyx", line 1047, in cradox.Rados.open_ioctx (cradox.c:12325)
[Sat Jul 08 00:20:27.773261 2017] [:error] [pid 50912] [remote 192.168.140.30:120] ObjectNotFound: error opening pool 'metrics'
[Sat Jul 08 00:20:38.619911 2017] [:error] [pid 50915] Option "metric_processing_delay" from group "storage" is deprecated. Use option "metric_processing_delay" from group "metricd".
[Sat Jul 08 00:20:38.701378 2017] [:error] [pid 50915] [remote 192.168.140.30:112] mod_wsgi (pid=50915): Target WSGI script '/var/www/cgi-bin/gnocchi/app' cannot be loaded as Python module.
[Sat Jul 08 00:20:38.701405 2017] [:error] [pid 50915] [remote 192.168.140.30:112] mod_wsgi (pid=50915): Exception occurred processing WSGI script '/var/www/cgi-bin/gnocchi/app'.
[Sat Jul 08 00:20:38.701428 2017] [:error] [pid 50915] [remote 192.168.140.30:112] Traceback (most recent call last):
[Sat Jul 08 00:20:38.701442 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "/var/www/cgi-bin/gnocchi/app", line 22, in <module>
[Sat Jul 08 00:20:38.701463 2017] [:error] [pid 50915] [remote 192.168.140.30:112]     application = app.build_wsgi_app()
[Sat Jul 08 00:20:38.701470 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "/usr/lib/python2.7/site-packages/gnocchi/rest/app.py", line 136, in build_wsgi_app
[Sat Jul 08 00:20:38.701481 2017] [:error] [pid 50915] [remote 192.168.140.30:112]     return load_app(service.prepare_service())
[Sat Jul 08 00:20:38.701492 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "/usr/lib/python2.7/site-packages/gnocchi/rest/app.py", line 91, in load_app
[Sat Jul 08 00:20:38.701501 2017] [:error] [pid 50915] [remote 192.168.140.30:112]     storage = gnocchi_storage.get_driver(conf)
[Sat Jul 08 00:20:38.701506 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "/usr/lib/python2.7/site-packages/gnocchi/storage/__init__.py", line 144, in get_driver
[Sat Jul 08 00:20:38.701615 2017] [:error] [pid 50915] [remote 192.168.140.30:112]     conf.incoming)
[Sat Jul 08 00:20:38.701626 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "/usr/lib/python2.7/site-packages/gnocchi/storage/incoming/ceph.py", line 32, in __init__
[Sat Jul 08 00:20:38.701695 2017] [:error] [pid 50915] [remote 192.168.140.30:112]     self.rados, self.ioctx = ceph.create_rados_connection(conf)
[Sat Jul 08 00:20:38.701703 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "/usr/lib/python2.7/site-packages/gnocchi/storage/common/ceph.py", line 59, in create_rados_connection
[Sat Jul 08 00:20:38.701739 2017] [:error] [pid 50915] [remote 192.168.140.30:112]     ioctx = conn.open_ioctx(conf.ceph_pool)
[Sat Jul 08 00:20:38.701747 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "cradox.pyx", line 413, in cradox.requires.wrapper.validate_func (cradox.c:4188)
[Sat Jul 08 00:20:38.701814 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "cradox.pyx", line 1047, in cradox.Rados.open_ioctx (cradox.c:12325)
[Sat Jul 08 00:20:38.701870 2017] [:error] [pid 50915] [remote 192.168.140.30:112] ObjectNotFound: error opening pool 'metrics'
Comment 4 Gonéri Le Bouder 2017-07-10 10:05 EDT
Created attachment 1295843 [details]
overcloud deploy output
Comment 5 Gonéri Le Bouder 2017-07-10 10:12:32 EDT
The deployment finished with CREATE_FAILED at 2017-07-08 00:29:01Z.
Comment 6 Pradeep Kilambi 2017-07-10 11:56:42 EDT
Goneri:

This is not the same race condition Udi saw. In your case, you don't have a metrics pool created in Ceph. Gnocchi won't create the metrics pool in Ceph; it only uses it. In the case of external Ceph, you need to create the pool yourself.

See related doc bug: https://bugzilla.redhat.com/show_bug.cgi?id=1412295
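
For external Ceph, a hedged sketch of creating the pool gnocchi expects (the pool name "metrics" comes from the traceback in comment 3; the placement-group count and cephx client name are assumptions, so adjust them for your cluster):

    # Create the metrics pool gnocchi will use (a PG count of 64 is only an example)
    ceph osd pool create metrics 64

    # Verify the pool is visible to the client gnocchi authenticates as
    rados --id openstack lspools | grep metrics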
Comment 7 Gonéri Le Bouder 2017-07-10 14:06:45 EDT
The gnocchi pool may be a misconfiguration, but I disagree with your conclusion:

- the deployment was working fine until the last puddle
- if I restart the deployment, it recovers and finishes fine
- if I run ceilometer-upgrade --skip-metering-database manually, it succeeds.
Comment 8 Pradeep Kilambi 2017-07-10 14:20:12 EDT
OK, then let's keep this bug focused on the race condition. You still have issues with the Ceph pools; don't ignore that.
Comment 9 Pradeep Kilambi 2017-07-10 15:07:02 EDT
The reason your deployment fails is the 503 coming back from keystone:

192.168.140.30 - - [08/Jul/2017:00:25:57 +0000] "GET /v1/resource_type/swift_account HTTP/1.1" 503 170 "-" "ceilometer-upgrade keystoneauth1/2.18.0 python-requests/2.11.1 CPython/2.7.5"

So ceilometer-upgrade tries to make a request to gnocchi to create resource types, but during authentication keystone reports the server as unavailable and the request gets a 503.

I don't see ceilometer-upgrade.log among your logs, but I would expect a service-unavailable traceback in it.

In the keystone logs I see ceilometer-upgrade making the request:

192.168.140.30 - - [08/Jul/2017:00:24:50 +0000] "GET / HTTP/1.1" 300 602 "-" "ceilometer-upgrade keystoneauth1/2.18.0 python-requests/2.11.1 CPython/2.7.5"
192.168.140.30 - - [08/Jul/2017:00:24:55 +0000] "POST /v3/auth/tokens HTTP/1.1" 201 9439 "-" "gnocchi/3.1.2 keystonemiddleware.auth_token/4.14.0 keystoneauth1/2.18.0 python-requests/2.11.1 CPython/2.7.5"


I do see a bunch of these in the haproxy logs, but I don't know whether they are related:

Jul 08 00:26:50 overcloud-controller-0.cwdg720xd01.org haproxy[35773]: Server horizon/overcloud-controller-2.internalapi.cwdg720xd01.org is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
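
A hedged sketch for spotting this pattern on a controller (the log file names are assumptions based on the usual OSP11 locations; ceilometer-upgrade.log was not in the attached set):

    # keystonemiddleware failing to validate tokens, as in the description above
    grep 'Bad response code while validating token' /var/log/httpd/gnocchi_wsgi_error.log

    # The resulting failure on the ceilometer side, if the log exists
    grep 'CRITICAL ceilometer' /var/log/ceilometer/ceilometer-upgrade.log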
Comment 10 Gonéri Le Bouder 2017-07-11 17:03:53 EDT
I reproduced the problem with the last puddle (2017-06-30.3). sosreport: http://file.rdu.redhat.com/~gleboude/BZ1467849/
Comment 11 Pradeep Kilambi 2017-07-11 17:37:17 EDT
Thanks, Goneri. The issue turned out to be gnocchi-statsd being configured in the same step and triggering a notify on httpd. This was bringing keystone down while ceilometer-upgrade was running at the same time.

The following patch should ensure we run ceilometer-upgrade before the gnocchi configuration kicks in, so we don't end up in this race condition.

https://review.openstack.org/#/c/482712/
Comment 12 Gonéri Le Bouder 2017-07-11 18:56:41 EDT
I just tried https://review.openstack.org/#/c/482712/1/manifests/profile/base/ceilometer/collector.pp at Pradeep's request, and I have the same problem.
Comment 13 Marian Krcmarik 2017-07-17 08:11:03 EDT
I keep hitting this problem repeatedly in downstream HA automation - around 2/3 of the jobs fail on it.
I am not hitting it constantly, but often enough that it makes our testing very slow and unreliable.
Comment 15 Pradeep Kilambi 2017-07-17 11:20:25 EDT
Fixing the DFG; this has nothing to do with security.
Comment 18 J.Luis 2017-07-26 04:35:44 EDT
I was having this issue; I just tried https://review.openstack.org/#/c/482712/3/manifests/profile/base/ceilometer/collector.pp and it works.

In order to test/apply the patch, I modified the overcloud-full.qcow2 image with virt-edit, editing /usr/share/openstack-puppet/modules/tripleo/manifests/profile/base/ceilometer/collector.pp (see the sketch below).
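
A hedged sketch of that image edit on the undercloud (the libguestfs tools must be installed; the image path is an assumption):

    # Open the manifest inside the overcloud image in $EDITOR and save the patched version
    virt-edit -a ~/images/overcloud-full.qcow2 \
      /usr/share/openstack-puppet/modules/tripleo/manifests/profile/base/ceilometer/collector.pp

    # Re-upload the modified image so new deployments pick it up
    openstack overcloud image upload --update-existing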
Comment 20 Sasha Smolyak 2017-07-30 05:52:35 EDT
Is there some indication in the log that the fix is there? If it fails randomly we can't just run it and verify based on that deployment...
Comment 21 J.Luis 2017-07-30 06:34:41 EDT
(In reply to Sasha Smolyak from comment #20)
> Is there some indication in the log that the fix is there? If it fails
> randomly we can't just run it and verify based on that deployment...

In my case it was always failing with the templates from the HCI Reference Architecture at https://github.com/RHsyseng/hci, and now it always works.
Comment 22 Udi 2017-07-31 06:20:17 EDT
Tested various deployment topologies and didn't hit this issue (it used to happen to me 99% of the time).

Verified in puppet-tripleo-6.5.0-5.el7ost.noarch.
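
For anyone re-verifying, a minimal check that the fixed package actually landed (package versions from the Fixed In Version field and this comment; run on each overcloud node):

    # Expect puppet-tripleo-6.5.0-3.1.el7ost or later
    rpm -q puppet-tripleo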
Comment 24 errata-xmlrpc 2017-08-01 20:50:11 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2406
