Bug 1467849 - Possible race condition - deployments fail because keystone is not up and running on time [NEEDINFO]
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 11.0 (Ocata)
Hardware: Unspecified  OS: Unspecified
Priority: urgent  Severity: urgent
Target Milestone: async
Target Release: 11.0 (Ocata)
Assigned To: RHOS Maint
QA Contact: Sasha Smolyak
Keywords: AutomationBlocker, Triaged, ZStream
Reported: 2017-07-05 06:35 EDT by Udi
Modified: 2017-08-01 20:50 EDT
CC: 15 users

Fixed In Version: puppet-tripleo-6.5.0-3.1.el7ost
Doc Type: Known Issue
Doc Text:
Cause: During an OSP11 overcloud deployment, there is a race condition where, on occasion, ceilometer-upgrade runs at the same time that Apache is restarted because other services are being configured in the same step. Consequence: The ceilometer upgrade can fail to authenticate with keystone while Apache is not yet active, which aborts the deployment as a failure. Workaround: When this happens, re-run the overcloud deploy; it resumes from where it failed, gets past the race condition, and proceeds normally. Result: The deployment succeeds instead of failing with an error.
Last Closed: 2017-08-01 20:50:11 EDT
Type: Bug
ssmolyak: needinfo? (pkilambi)


Attachments
puppet logs (26.52 KB, application/octet-stream)
2017-07-07 12:33 EDT, Gonéri Le Bouder
logs: haproxy messages and httpd (1.16 MB, application/x-gzip)
2017-07-10 09:32 EDT, Gonéri Le Bouder
overcloud deploy output (273.33 KB, application/x-gzip)
2017-07-10 10:05 EDT, Gonéri Le Bouder


External Trackers:
OpenStack gerrit 482712 (last updated 2017-07-11 17:37 EDT)

Description Udi 2017-07-05 06:35:01 EDT
Description of problem:

Deployments randomly fail in overcloud.AllNodesDeploySteps.ControllerDeployment_Step5.0 with this error (visible in "openstack stack failures list"):

Error: ceilometer-upgrade --skip-metering-database returned 1 instead of one of [0]

If you re-run the deployment command, it resumes from where it failed and finishes successfully. It appears that keystone is not running when ceilometer needs it: ceilometer-upgrade tries to authenticate with keystone in order to create resource types in gnocchi, but keystone returns a 503:

10.35.191.20 - - [02/Jul/2017:14:48:06 +0000] "GET
/v1/resource_type/instance_disk HTTP/1.1" 503 170 "-"
"ceilometer-upgrade keystoneauth1/2.18.0 python-requests/2.11.1
CPython/2.7.5"

2017-07-02 14:48:11.800 116449 ERROR keystonemiddleware.auth_token [-]
Bad response code while validating token: 503
2017-07-02 14:48:11.800 116449 WARNING keystonemiddleware.auth_token
[-] Identity response: <html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.

and ceilometer-upgrade fails with:

2017-07-02 14:48:11.803 123807 CRITICAL ceilometer [-]
ClientException: {"message": "The server is currently unavailable.
Please try again at a later time.<br /><br />\n\n\n", "code": "503
Service Unavailable", "title": "Service Unavailable"} (HTTP 503)


Version-Release number of selected component (if applicable):
OSP11 GA 2017-06-30.3


How reproducible:
randomly


Steps to Reproduce:
1. Deploy the default plan. I deployed a topology of 3 controllers + 1 compute + 3 ceph with this command: openstack overcloud deploy --templates --ntp-server clock.redhat.com -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e counters.yaml
2. If you hit this error, re-run the same command to resume from where it failed; it passes the second time (see the sketch below).
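
A minimal sketch of the workaround (the stack name "overcloud" and the deploy command from step 1 are assumed):

    # Show which resource failed and why (the source of the error above)
    openstack stack failures list overcloud

    # Re-run the exact same deploy command; Heat resumes from the failed step
    openstack overcloud deploy --templates \
      --ntp-server clock.redhat.com \
      -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
      -e counters.yaml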


Expected results:
Deployment should pass the first time. You shouldn't hit errors due to keystone not being up and running on time.
Comment 1 Gonéri Le Bouder 2017-07-07 12:33 EDT
Created attachment 1295342 [details]
puppet logs
Comment 2 Gonéri Le Bouder 2017-07-10 09:32 EDT
Created attachment 1295833 [details]
logs: haproxy messages and httpd
Comment 3 Gonéri Le Bouder 2017-07-10 10:00:42 EDT
There are 89 errors like this one:
Server aodh/overcloud-controller-0.internalapi.cwdg720xd01.org is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

23:58:40: haproxy startup
23:58:40: first message about Layer 4 not responding
a lot of connections are refused with the error above
23:59:58: first connection accepted
00:00:27: haproxy adds the localhost
00:00:49: haproxy starts to add the other hosts
everything is back to normal
00:23:23: controller0 is down
00:23:37: controller2 is down
00:23:27: controller0 is up
00:26:00: controller2 is up
00:26:04: controller1 is down
00:26:50: controller2 is down
00:28:47: controller0 is down
00:28:51: controller0 is up
everything is back to normal
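
For anyone triaging a similar run, a hedged way to pull these events from the logs (this assumes haproxy logs to /var/log/messages via rsyslog, as suggested by the "haproxy messages" attachment):

    # Count the Layer4 refusals quoted above
    grep -c 'Layer4 connection problem' /var/log/messages

    # Rebuild the DOWN/UP timeline for the backends
    grep -E 'Server .* is (DOWN|UP)' /var/log/messages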



in gnocchi_wsgi_error.log-20170709:
[Sat Jul 08 00:20:27.679270 2017] [:error] [pid 50912] Option "metric_processing_delay" from group "storage" is deprecated. Use option "metric_processing_delay" from group "metricd".
[Sat Jul 08 00:20:27.772207 2017] [:error] [pid 50912] [remote 192.168.140.30:120] mod_wsgi (pid=50912): Target WSGI script '/var/www/cgi-bin/gnocchi/app' cannot be loaded as Python module.
[Sat Jul 08 00:20:27.772226 2017] [:error] [pid 50912] [remote 192.168.140.30:120] mod_wsgi (pid=50912): Exception occurred processing WSGI script '/var/www/cgi-bin/gnocchi/app'.
[Sat Jul 08 00:20:27.772248 2017] [:error] [pid 50912] [remote 192.168.140.30:120] Traceback (most recent call last):
[Sat Jul 08 00:20:27.772262 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "/var/www/cgi-bin/gnocchi/app", line 22, in <module>
[Sat Jul 08 00:20:27.772280 2017] [:error] [pid 50912] [remote 192.168.140.30:120]     application = app.build_wsgi_app()
[Sat Jul 08 00:20:27.772286 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "/usr/lib/python2.7/site-packages/gnocchi/rest/app.py", line 136, in build_wsgi_app
[Sat Jul 08 00:20:27.772296 2017] [:error] [pid 50912] [remote 192.168.140.30:120]     return load_app(service.prepare_service())
[Sat Jul 08 00:20:27.772301 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "/usr/lib/python2.7/site-packages/gnocchi/rest/app.py", line 91, in load_app
[Sat Jul 08 00:20:27.772308 2017] [:error] [pid 50912] [remote 192.168.140.30:120]     storage = gnocchi_storage.get_driver(conf)
[Sat Jul 08 00:20:27.772312 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "/usr/lib/python2.7/site-packages/gnocchi/storage/__init__.py", line 144, in get_driver
[Sat Jul 08 00:20:27.772634 2017] [:error] [pid 50912] [remote 192.168.140.30:120]     conf.incoming)
[Sat Jul 08 00:20:27.772648 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "/usr/lib/python2.7/site-packages/gnocchi/storage/incoming/ceph.py", line 32, in __init__
[Sat Jul 08 00:20:27.772894 2017] [:error] [pid 50912] [remote 192.168.140.30:120]     self.rados, self.ioctx = ceph.create_rados_connection(conf)
[Sat Jul 08 00:20:27.772906 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "/usr/lib/python2.7/site-packages/gnocchi/storage/common/ceph.py", line 59, in create_rados_connection
[Sat Jul 08 00:20:27.773113 2017] [:error] [pid 50912] [remote 192.168.140.30:120]     ioctx = conn.open_ioctx(conf.ceph_pool)
[Sat Jul 08 00:20:27.773122 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "cradox.pyx", line 413, in cradox.requires.wrapper.validate_func (cradox.c:4188)
[Sat Jul 08 00:20:27.773207 2017] [:error] [pid 50912] [remote 192.168.140.30:120]   File "cradox.pyx", line 1047, in cradox.Rados.open_ioctx (cradox.c:12325)
[Sat Jul 08 00:20:27.773261 2017] [:error] [pid 50912] [remote 192.168.140.30:120] ObjectNotFound: error opening pool 'metrics'
[Sat Jul 08 00:20:38.619911 2017] [:error] [pid 50915] Option "metric_processing_delay" from group "storage" is deprecated. Use option "metric_processing_delay" from group "metricd".
[Sat Jul 08 00:20:38.701378 2017] [:error] [pid 50915] [remote 192.168.140.30:112] mod_wsgi (pid=50915): Target WSGI script '/var/www/cgi-bin/gnocchi/app' cannot be loaded as Python module.
[Sat Jul 08 00:20:38.701405 2017] [:error] [pid 50915] [remote 192.168.140.30:112] mod_wsgi (pid=50915): Exception occurred processing WSGI script '/var/www/cgi-bin/gnocchi/app'.
[Sat Jul 08 00:20:38.701428 2017] [:error] [pid 50915] [remote 192.168.140.30:112] Traceback (most recent call last):
[Sat Jul 08 00:20:38.701442 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "/var/www/cgi-bin/gnocchi/app", line 22, in <module>
[Sat Jul 08 00:20:38.701463 2017] [:error] [pid 50915] [remote 192.168.140.30:112]     application = app.build_wsgi_app()
[Sat Jul 08 00:20:38.701470 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "/usr/lib/python2.7/site-packages/gnocchi/rest/app.py", line 136, in build_wsgi_app
[Sat Jul 08 00:20:38.701481 2017] [:error] [pid 50915] [remote 192.168.140.30:112]     return load_app(service.prepare_service())
[Sat Jul 08 00:20:38.701492 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "/usr/lib/python2.7/site-packages/gnocchi/rest/app.py", line 91, in load_app
[Sat Jul 08 00:20:38.701501 2017] [:error] [pid 50915] [remote 192.168.140.30:112]     storage = gnocchi_storage.get_driver(conf)
[Sat Jul 08 00:20:38.701506 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "/usr/lib/python2.7/site-packages/gnocchi/storage/__init__.py", line 144, in get_driver
[Sat Jul 08 00:20:38.701615 2017] [:error] [pid 50915] [remote 192.168.140.30:112]     conf.incoming)
[Sat Jul 08 00:20:38.701626 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "/usr/lib/python2.7/site-packages/gnocchi/storage/incoming/ceph.py", line 32, in __init__
[Sat Jul 08 00:20:38.701695 2017] [:error] [pid 50915] [remote 192.168.140.30:112]     self.rados, self.ioctx = ceph.create_rados_connection(conf)
[Sat Jul 08 00:20:38.701703 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "/usr/lib/python2.7/site-packages/gnocchi/storage/common/ceph.py", line 59, in create_rados_connection
[Sat Jul 08 00:20:38.701739 2017] [:error] [pid 50915] [remote 192.168.140.30:112]     ioctx = conn.open_ioctx(conf.ceph_pool)
[Sat Jul 08 00:20:38.701747 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "cradox.pyx", line 413, in cradox.requires.wrapper.validate_func (cradox.c:4188)
[Sat Jul 08 00:20:38.701814 2017] [:error] [pid 50915] [remote 192.168.140.30:112]   File "cradox.pyx", line 1047, in cradox.Rados.open_ioctx (cradox.c:12325)
[Sat Jul 08 00:20:38.701870 2017] [:error] [pid 50915] [remote 192.168.140.30:112] ObjectNotFound: error opening pool 'metrics'
Comment 4 Gonéri Le Bouder 2017-07-10 10:05 EDT
Created attachment 1295843 [details]
overcloud deploy output
Comment 5 Gonéri Le Bouder 2017-07-10 10:12:32 EDT
The deployment finished with CREATE_FAILED at 2017-07-08 00:29:01Z.
Comment 6 Pradeep Kilambi 2017-07-10 11:56:42 EDT
Goneri:

This is not the same race condition Udi saw. In your case, you don't have a metrics pool created in Ceph. Gnocchi won't create the metrics pool in Ceph; it only uses it. In the case of external Ceph, you need to create the pool yourself.

See related doc bug: https://bugzilla.redhat.com/show_bug.cgi?id=1412295
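
For external Ceph, a hedged sketch of creating the pool gnocchi expects (the pool name "metrics" comes from the traceback in comment 3; the placement-group count and cephx client name are assumptions, so adjust them for your cluster):

    # Create the metrics pool gnocchi will use (a PG count of 64 is only an example)
    ceph osd pool create metrics 64

    # Verify the pool is visible to the client gnocchi authenticates as
    rados --id openstack lspools | grep metrics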
Comment 7 Gonéri Le Bouder 2017-07-10 14:06:45 EDT
The gnocchi pool may be a misconfiguration, but I disagree with your conclusion:

- the deployment was working fine until the last puddle
- if I restart the deployment, it recovers and finishes fine
- if I run ceilometer-upgrade --skip-metering-database manually, it succeeds.
Comment 8 Pradeep Kilambi 2017-07-10 14:20:12 EDT
OK, then let's keep this bug focused on the race condition. You still have issues with the Ceph pools; don't ignore that.
Comment 9 Pradeep Kilambi 2017-07-10 15:07:02 EDT
The reason your deployment fails is the 503 coming back from keystone:

192.168.140.30 - - [08/Jul/2017:00:25:57 +0000] "GET /v1/resource_type/swift_account HTTP/1.1" 503 170 "-" "ceilometer-upgrade keystoneauth1/2.18.0 python-requests/2.11.1 CPython/2.7.5"

So ceilometer-upgrade tries to make a request to gnocchi to create resource types, but during authentication keystone reports the server as unavailable and the request gets a 503.

I don't see ceilometer-upgrade.log among your logs, but I would expect a service-unavailable traceback in it.

In the keystone logs I see ceilometer-upgrade making the request:

192.168.140.30 - - [08/Jul/2017:00:24:50 +0000] "GET / HTTP/1.1" 300 602 "-" "ceilometer-upgrade keystoneauth1/2.18.0 python-requests/2.11.1 CPython/2.7.5"
192.168.140.30 - - [08/Jul/2017:00:24:55 +0000] "POST /v3/auth/tokens HTTP/1.1" 201 9439 "-" "gnocchi/3.1.2 keystonemiddleware.auth_token/4.14.0 keystoneauth1/2.18.0 python-requests/2.11.1 CPython/2.7.5"


I do see a bunch of these in the haproxy logs, but I don't know whether they are related:

Jul 08 00:26:50 overcloud-controller-0.cwdg720xd01.org haproxy[35773]: Server horizon/overcloud-controller-2.internalapi.cwdg720xd01.org is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
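
A hedged sketch for spotting this pattern on a controller (the log file names are assumptions based on the usual OSP11 locations; ceilometer-upgrade.log was not in the attached set):

    # keystonemiddleware failing to validate tokens, as in the description above
    grep 'Bad response code while validating token' /var/log/httpd/gnocchi_wsgi_error.log

    # The resulting failure on the ceilometer side, if the log exists
    grep 'CRITICAL ceilometer' /var/log/ceilometer/ceilometer-upgrade.log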
Comment 10 Gonéri Le Bouder 2017-07-11 17:03:53 EDT
I reproduced the problem with the last puddle (2017-06-30.3). sosreport: http://file.rdu.redhat.com/~gleboude/BZ1467849/
Comment 11 Pradeep Kilambi 2017-07-11 17:37:17 EDT
Thanks, Goneri. The issue turned out to be gnocchi-statsd being configured in the same step and triggering a notify on httpd. This was bringing keystone down while ceilometer-upgrade was running at the same time.

The following patch should ensure we run ceilometer-upgrade before the gnocchi configuration kicks in, so we don't end up in this race condition.

https://review.openstack.org/#/c/482712/
Comment 12 Gonéri Le Bouder 2017-07-11 18:56:41 EDT
I just tried https://review.openstack.org/#/c/482712/1/manifests/profile/base/ceilometer/collector.pp at Pradeep's request, and I have the same problem.
Comment 13 Marian Krcmarik 2017-07-17 08:11:03 EDT
I keep hitting this problem repeatedly in downstream HA automation - around 2/3 of the jobs fail on it.
I am not hitting it constantly, but often enough that it makes our testing very slow and unreliable.
Comment 15 Pradeep Kilambi 2017-07-17 11:20:25 EDT
Fixing the DFG; this has nothing to do with security.
Comment 18 J.Luis 2017-07-26 04:35:44 EDT
I was having this issue; I just tried https://review.openstack.org/#/c/482712/3/manifests/profile/base/ceilometer/collector.pp and it works.

In order to test/apply the patch, I modified the overcloud-full.qcow2 image with virt-edit, editing /usr/share/openstack-puppet/modules/tripleo/manifests/profile/base/ceilometer/collector.pp (see the sketch below).
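
A hedged sketch of that image edit on the undercloud (the libguestfs tools must be installed; the image path is an assumption):

    # Open the manifest inside the overcloud image in $EDITOR and save the patched version
    virt-edit -a ~/images/overcloud-full.qcow2 \
      /usr/share/openstack-puppet/modules/tripleo/manifests/profile/base/ceilometer/collector.pp

    # Re-upload the modified image so new deployments pick it up
    openstack overcloud image upload --update-existing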
Comment 20 Sasha Smolyak 2017-07-30 05:52:35 EDT
Is there some indication in the log that the fix is there? If it fails randomly we can't just run it and verify based on that deployment...
Comment 21 J.Luis 2017-07-30 06:34:41 EDT
(In reply to Sasha Smolyak from comment #20)
> Is there some indication in the log that the fix is there? If it fails
> randomly we can't just run it and verify based on that deployment...

In my case it was always failing with the templates from the HCI Reference Architecture at https://github.com/RHsyseng/hci, and now it always works.
Comment 22 Udi 2017-07-31 06:20:17 EDT
Tested various deployment topologies and didn't hit this issue (it used to happen to me 99% of the time).

Verified in puppet-tripleo-6.5.0-5.el7ost.noarch.
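
For anyone re-verifying, a minimal check that the fixed package actually landed (package versions from the Fixed In Version field and this comment; run on each overcloud node):

    # Expect puppet-tripleo-6.5.0-3.1.el7ost or later
    rpm -q puppet-tripleo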
Comment 24 errata-xmlrpc 2017-08-01 20:50:11 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2406
