Bug 1236057

Summary:

Overcloud Ceilometer stops working when turning off one controller in HA setup

Product:

Red Hat OpenStack

Reporter:

Marius Cornea <mcornea>

Component:

openstack-puppet-modules

Assignee:

Jiri Stransky <jstransk>

Status:

CLOSED ERRATA

QA Contact:

Omri Hochman <ohochman>

Severity:

high

Docs Contact:

Priority:

high

Version:

7.0 (Kilo)

CC:

aortega, ddomingo, eglynn, gchamoul, gfidente, hbrock, jschluet, lbezdick, mburns, mcornea, morazi, ohochman, rhel-osp-director-maint, rrosa, yeylon

Target Milestone:

Target Release:

7.0 (Kilo)

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

openstack-puppet-modules-2015.1.8-8.el7ost

Doc Type:

Bug Fix

Doc Text:

Previously, the HAProxy configuration of the Telemetry service used incorrect checks, which caused the Telemetry service to fail in an HA deployment. Specifically, the HAProxy configuration did not have availability checks, and incorrectly used SSL checks instead of TCP. This release fixes the checks, ensuring that the Telemetry service is correctly balanced and can launch in an HA deployment.

Story Points:

---

Clone Of:

Environment:

Last Closed:

2015-08-05 13:28:11 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1228433

Attachments:

Description	Flags
ceilometer logs	none
ceilometer logs	none

Description Marius Cornea 2015-06-26 12:44:14 UTC

Created attachment 1043499 [details]
ceilometer logs

Description of problem:
Overcloud Ceilometer stops working when turning off one controller in HA setup. I'm using a virt env with 3 x controllers, 2 x computes, 3 x ceph nodes. When turnning off first controller ceilometer stops working and ceilometer related resources stop working. 

Version-Release number of selected component (if applicable):
openstack-tripleo-image-elements-0.9.6-4.el7ost.noarch
openstack-tripleo-0.0.7-0.1.1664e566.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1-2.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-18.el7ost.noarch
openstack-tripleo-common-0.0.1.dev6-0.git49b57eb.el7ost.noarch
openstack-puppet-modules-2015.1.7-5.el7ost.noarch


How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud with 3 controllers
2. Turn off one of the controllers
3. Check if overcloud ceilometer is working

Actual results:
Ceilometer stops working and cluster resources related to ceilometer fail

Expected results:
Ceilometer will still work after bringing one of the controllers down.

Additional info:
Attaching ceilometer logs and pcs status.

Comment 3 Giulio Fidente 2015-06-26 13:00:07 UTC

I haven't collected yet enough infos on the root cause, but I have seen this happening on a live env

Comment 4 Mike Burns 2015-07-07 01:04:14 UTC

Dev can't seem to reproduce this issue.  Can you reproduce consistently?  Is there an environment we can look at where this reproduces?

Comment 5 Giulio Fidente 2015-07-08 15:21:44 UTC

The change at https://code.engineering.redhat.com/gerrit/#/c/52379/ which was merged to fix BZ 1236374 removes the constraint which was restarting Ceilometer when Redis VIP relocated. This bug is most probably affected by that change and we need to see if it is reproduced using openstack-tripleo-heat-templates-0.8.6-26.el7ost

Comment 6 Marius Cornea 2015-07-13 21:48:14 UTC

I'm still able to reproduce it. This is what I get after turning off one of the first controller:

[stack@instack ~]$ ceilometer meter-list
('Connection aborted.', BadStatusLine("''",))

I'm seeing the ceilometer backends in haproxy don't have checks:

listen ceilometer
  bind 172.16.20.10:8777 
  bind 172.16.23.10:8777 
  balance roundrobin
  option tcplog
  option ssl-hello-chk
  server overcloud-controller-0 172.16.20.16:8777 
  server overcloud-controller-1 172.16.20.14:8777 
  server overcloud-controller-2 172.16.20.15:8777

Comment 7 Marius Cornea 2015-07-13 21:49:40 UTC

Created attachment 1051530 [details]
ceilometer logs

Attaching ceilometer logs from one of the 2 remaining running controller nodes.

Comment 8 Giulio Fidente 2015-07-14 15:51:55 UTC

Marius, the initial query will fail until ceilometer recognizes the first mongodb node is down, logging something like:

  Unable to reconnect to the primary mongodb: [Errno 110] Connection timed out. Trying again in 10 seconds.

at which points it will attempt a connection to the second mongodb instance and if you the same request it will work as expected.

Comment 9 Marius Cornea 2015-07-14 16:20:33 UTC

The haproxy misses the check option for ceilometer backends. We removed the node that is down from haproxy config and timeouts didn't occur anymore. 

Upstream patch:
https://review.openstack.org/#/c/199507/2

Comment 11 Giulio Fidente 2015-07-14 17:01:53 UTC

Marius, ack, some requests were indeed failing on the "check" keyword missing

Comment 14 Marius Cornea 2015-07-21 08:13:56 UTC

With current puddle ceilometer api doesn't work right after deployment. Removing the ssl-hello-chk option makes it work again.    

[stack@instack ~]$ . overcloudrc 
[stack@instack ~]$ ceilometer meter-list
('Connection aborted.', BadStatusLine("''",))

[root@overcloud-controller-0 heat-admin]# grep -A8 ceilometer /etc/haproxy/haproxy.cfg 
listen ceilometer
  bind 172.16.20.11:8777 
  bind 172.16.23.10:8777 
  balance roundrobin
  option tcplog
  option ssl-hello-chk
  server overcloud-controller-0 172.16.20.14:8777 check fall 5 inter 2000 rise 2
  server overcloud-controller-1 172.16.20.15:8777 check fall 5 inter 2000 rise 2
  server overcloud-controller-2 172.16.20.13:8777 check fall 5 inter 2000 rise 2

[root@overcloud-controller-0 ~]# tail -5 /var/log/ceilometer/api.log 
2015-07-21 04:02:24.415 17058 ERROR werkzeug [-] 172.16.20.15 - - [21/Jul/2015 04:02:24] code 400, message Bad request syntax ('\x16\x03\x00\x00y\x01\x00\x00u\x03\x00U\xad\xfc\x90HAPROXYSSLCHK')
2015-07-21 04:02:25.624 17058 ERROR werkzeug [-] 172.16.20.14 - - [21/Jul/2015 04:02:25] code 400, message Bad request syntax ('\x16\x03\x00\x00y\x01\x00\x00u\x03\x00U\xad\xfc\x91HAPROXYSSLCHK')
2015-07-21 04:02:26.419 17058 ERROR werkzeug [-] 172.16.20.13 - - [21/Jul/2015 04:02:26] code 400, message Bad request syntax ('\x16\x03\x00\x00y\x01\x00\x00u\x03\x00U\xad\xfc\x92HAPROXYSSLCHK')
2015-07-21 04:02:26.420 17058 ERROR werkzeug [-] 172.16.20.15 - - [21/Jul/2015 04:02:26] code 400, message Bad request syntax ('\x16\x03\x00\x00y\x01\x00\x00u\x03\x00U\xad\xfc\x92HAPROXYSSLCHK')
2015-07-21 04:02:27.626 17058 ERROR werkzeug [-] 172.16.20.14 - - [21/Jul/2015 04:02:27] code 400, message Bad request syntax ('\x16\x03\x00\x00y\x01\x00\x00u\x03\x00U\xad\xfc\x93HAPROXYSSLCHK')

Comment 15 Hugh Brock 2015-07-21 09:00:19 UTC

Blarrgh -- this is caused by the haproxy.cfg change that we considered backing out of core. This would be a good reason to actually back it out, given we're respinning anyway.

Comment 16 Jiri Stransky 2015-07-21 09:30:41 UTC

I think James Slagle's patch should fix this:

https://review.openstack.org/203298/

Comment 17 Jiri Stransky 2015-07-21 11:25:56 UTC

Tested upstream:

listen ceilometer
  bind 192.0.2.17:8777 
  bind 192.0.2.18:8777 
  server overcloud-controller-0 192.0.2.5:8777 check fall 5 inter 2000 rise 2
  server overcloud-controller-1 192.0.2.4:8777 check fall 5 inter 2000 rise 2
  server overcloud-controller-2 192.0.2.3:8777 check fall 5 inter 2000 rise 2

listen glance_api
  bind 192.0.2.17:9292 
  bind 192.0.2.18:9292 
  option httpchk GET /
  server overcloud-controller-0 192.0.2.5:9292 check fall 5 inter 2000 rise 2
  server overcloud-controller-1 192.0.2.4:9292 check fall 5 inter 2000 rise 2
  server overcloud-controller-2 192.0.2.3:9292 check fall 5 inter 2000 rise 2

We're waiting for CI results before merge.

Comment 18 Jiri Stransky 2015-07-21 12:25:34 UTC

Actually i should have pasted the glance_registry bit:

listen glance_registry
  bind 192.0.2.18:9191 
  server overcloud-controller-0 192.0.2.5:9191 check fall 5 inter 2000 rise 2
  server overcloud-controller-1 192.0.2.4:9191 check fall 5 inter 2000 rise 2
  server overcloud-controller-2 192.0.2.3:9191 check fall 5 inter 2000 rise 2

Patch merged upstream.

Comment 20 Omri Hochman 2015-07-29 22:31:41 UTC

Note : In order to see this test pass, it require to configure fence-agent on all of the HA controllers. 

Verified : 
openstack-tripleo-puppet-elements-0.0.1-4.el7ost.noarch
openstack-puppet-modules-2015.1.8-8.el7ost.noarch
puppet-3.6.2-2.el7.noarch
instack-undercloud-2.1.2-22.el7ost.noarch
instack-0.0.7-1.el7ost.noarch

Comment 22 errata-xmlrpc 2015-08-05 13:28:11 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1548