Bugzilla will be upgraded to version 5.0. The upgrade date is tentatively scheduled for 2 December 2018, pending final testing and feedback.
Bug 1236057 - Overcloud Ceilometer stops working when turning off one controller in HA setup
Overcloud Ceilometer stops working when turning off one controller in HA setup
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-puppet-modules (Show other bugs)
7.0 (Kilo)
Unspecified Unspecified
high Severity high
: ga
: 7.0 (Kilo)
Assigned To: Jiri Stransky
Omri Hochman
:
Depends On:
Blocks: 1228433
  Show dependency treegraph
 
Reported: 2015-06-26 08:44 EDT by Marius Cornea
Modified: 2015-08-05 09:28 EDT (History)
16 users (show)

See Also:
Fixed In Version: openstack-puppet-modules-2015.1.8-8.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, the HAProxy configuration of the Telemetry service used incorrect checks, which caused the Telemetry service to fail in an HA deployment. Specifically, the HAProxy configuration did not have availability checks, and incorrectly used SSL checks instead of TCP. This release fixes the checks, ensuring that the Telemetry service is correctly balanced and can launch in an HA deployment.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-08-05 09:28:11 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
ceilometer logs (6.70 KB, application/x-gzip)
2015-06-26 08:44 EDT, Marius Cornea
no flags Details
ceilometer logs (4.90 KB, application/x-gzip)
2015-07-13 17:49 EDT, Marius Cornea
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
OpenStack gerrit 199507 None None None Never
OpenStack gerrit 203298 None None None Never
Red Hat Product Errata RHEA-2015:1548 normal SHIPPED_LIVE Red Hat Enterprise Linux OpenStack Platform Enhancement Advisory 2015-08-05 13:07:06 EDT

  None (edit)
Description Marius Cornea 2015-06-26 08:44:14 EDT
Created attachment 1043499 [details]
ceilometer logs

Description of problem:
Overcloud Ceilometer stops working when turning off one controller in HA setup. I'm using a virt env with 3 x controllers, 2 x computes, 3 x ceph nodes. When turnning off first controller ceilometer stops working and ceilometer related resources stop working. 

Version-Release number of selected component (if applicable):
openstack-tripleo-image-elements-0.9.6-4.el7ost.noarch
openstack-tripleo-0.0.7-0.1.1664e566.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1-2.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-18.el7ost.noarch
openstack-tripleo-common-0.0.1.dev6-0.git49b57eb.el7ost.noarch
openstack-puppet-modules-2015.1.7-5.el7ost.noarch


How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud with 3 controllers
2. Turn off one of the controllers
3. Check if overcloud ceilometer is working

Actual results:
Ceilometer stops working and cluster resources related to ceilometer fail

Expected results:
Ceilometer will still work after bringing one of the controllers down.

Additional info:
Attaching ceilometer logs and pcs status.
Comment 3 Giulio Fidente 2015-06-26 09:00:07 EDT
I haven't collected yet enough infos on the root cause, but I have seen this happening on a live env
Comment 4 Mike Burns 2015-07-06 21:04:14 EDT
Dev can't seem to reproduce this issue.  Can you reproduce consistently?  Is there an environment we can look at where this reproduces?
Comment 5 Giulio Fidente 2015-07-08 11:21:44 EDT
The change at https://code.engineering.redhat.com/gerrit/#/c/52379/ which was merged to fix BZ 1236374 removes the constraint which was restarting Ceilometer when Redis VIP relocated. This bug is most probably affected by that change and we need to see if it is reproduced using openstack-tripleo-heat-templates-0.8.6-26.el7ost
Comment 6 Marius Cornea 2015-07-13 17:48:14 EDT
I'm still able to reproduce it. This is what I get after turning off one of the first controller:

[stack@instack ~]$ ceilometer meter-list
('Connection aborted.', BadStatusLine("''",))

I'm seeing the ceilometer backends in haproxy don't have checks:

listen ceilometer
  bind 172.16.20.10:8777 
  bind 172.16.23.10:8777 
  balance roundrobin
  option tcplog
  option ssl-hello-chk
  server overcloud-controller-0 172.16.20.16:8777 
  server overcloud-controller-1 172.16.20.14:8777 
  server overcloud-controller-2 172.16.20.15:8777
Comment 7 Marius Cornea 2015-07-13 17:49:40 EDT
Created attachment 1051530 [details]
ceilometer logs

Attaching ceilometer logs from one of the 2 remaining running controller nodes.
Comment 8 Giulio Fidente 2015-07-14 11:51:55 EDT
Marius, the initial query will fail until ceilometer recognizes the first mongodb node is down, logging something like:

  Unable to reconnect to the primary mongodb: [Errno 110] Connection timed out. Trying again in 10 seconds.

at which points it will attempt a connection to the second mongodb instance and if you the same request it will work as expected.
Comment 9 Marius Cornea 2015-07-14 12:20:33 EDT
The haproxy misses the check option for ceilometer backends. We removed the node that is down from haproxy config and timeouts didn't occur anymore. 

Upstream patch:
https://review.openstack.org/#/c/199507/2
Comment 11 Giulio Fidente 2015-07-14 13:01:53 EDT
Marius, ack, some requests were indeed failing on the "check" keyword missing
Comment 14 Marius Cornea 2015-07-21 04:13:56 EDT
With current puddle ceilometer api doesn't work right after deployment. Removing the ssl-hello-chk option makes it work again.    

[stack@instack ~]$ . overcloudrc 
[stack@instack ~]$ ceilometer meter-list
('Connection aborted.', BadStatusLine("''",))

[root@overcloud-controller-0 heat-admin]# grep -A8 ceilometer /etc/haproxy/haproxy.cfg 
listen ceilometer
  bind 172.16.20.11:8777 
  bind 172.16.23.10:8777 
  balance roundrobin
  option tcplog
  option ssl-hello-chk
  server overcloud-controller-0 172.16.20.14:8777 check fall 5 inter 2000 rise 2
  server overcloud-controller-1 172.16.20.15:8777 check fall 5 inter 2000 rise 2
  server overcloud-controller-2 172.16.20.13:8777 check fall 5 inter 2000 rise 2

[root@overcloud-controller-0 ~]# tail -5 /var/log/ceilometer/api.log 
2015-07-21 04:02:24.415 17058 ERROR werkzeug [-] 172.16.20.15 - - [21/Jul/2015 04:02:24] code 400, message Bad request syntax ('\x16\x03\x00\x00y\x01\x00\x00u\x03\x00U\xad\xfc\x90HAPROXYSSLCHK')
2015-07-21 04:02:25.624 17058 ERROR werkzeug [-] 172.16.20.14 - - [21/Jul/2015 04:02:25] code 400, message Bad request syntax ('\x16\x03\x00\x00y\x01\x00\x00u\x03\x00U\xad\xfc\x91HAPROXYSSLCHK')
2015-07-21 04:02:26.419 17058 ERROR werkzeug [-] 172.16.20.13 - - [21/Jul/2015 04:02:26] code 400, message Bad request syntax ('\x16\x03\x00\x00y\x01\x00\x00u\x03\x00U\xad\xfc\x92HAPROXYSSLCHK')
2015-07-21 04:02:26.420 17058 ERROR werkzeug [-] 172.16.20.15 - - [21/Jul/2015 04:02:26] code 400, message Bad request syntax ('\x16\x03\x00\x00y\x01\x00\x00u\x03\x00U\xad\xfc\x92HAPROXYSSLCHK')
2015-07-21 04:02:27.626 17058 ERROR werkzeug [-] 172.16.20.14 - - [21/Jul/2015 04:02:27] code 400, message Bad request syntax ('\x16\x03\x00\x00y\x01\x00\x00u\x03\x00U\xad\xfc\x93HAPROXYSSLCHK')
Comment 15 Hugh Brock 2015-07-21 05:00:19 EDT
Blarrgh -- this is caused by the haproxy.cfg change that we considered backing out of core. This would be a good reason to actually back it out, given we're respinning anyway.
Comment 16 Jiri Stransky 2015-07-21 05:30:41 EDT
I think James Slagle's patch should fix this:

https://review.openstack.org/203298/
Comment 17 Jiri Stransky 2015-07-21 07:25:56 EDT
Tested upstream:

listen ceilometer
  bind 192.0.2.17:8777 
  bind 192.0.2.18:8777 
  server overcloud-controller-0 192.0.2.5:8777 check fall 5 inter 2000 rise 2
  server overcloud-controller-1 192.0.2.4:8777 check fall 5 inter 2000 rise 2
  server overcloud-controller-2 192.0.2.3:8777 check fall 5 inter 2000 rise 2

listen glance_api
  bind 192.0.2.17:9292 
  bind 192.0.2.18:9292 
  option httpchk GET /
  server overcloud-controller-0 192.0.2.5:9292 check fall 5 inter 2000 rise 2
  server overcloud-controller-1 192.0.2.4:9292 check fall 5 inter 2000 rise 2
  server overcloud-controller-2 192.0.2.3:9292 check fall 5 inter 2000 rise 2

We're waiting for CI results before merge.
Comment 18 Jiri Stransky 2015-07-21 08:25:34 EDT
Actually i should have pasted the glance_registry bit:

listen glance_registry
  bind 192.0.2.18:9191 
  server overcloud-controller-0 192.0.2.5:9191 check fall 5 inter 2000 rise 2
  server overcloud-controller-1 192.0.2.4:9191 check fall 5 inter 2000 rise 2
  server overcloud-controller-2 192.0.2.3:9191 check fall 5 inter 2000 rise 2

Patch merged upstream.
Comment 20 Omri Hochman 2015-07-29 18:31:41 EDT
Note : In order to see this test pass, it require to configure fence-agent on all of the HA controllers. 

Verified : 
openstack-tripleo-puppet-elements-0.0.1-4.el7ost.noarch
openstack-puppet-modules-2015.1.8-8.el7ost.noarch
puppet-3.6.2-2.el7.noarch
instack-undercloud-2.1.2-22.el7ost.noarch
instack-0.0.7-1.el7ost.noarch
Comment 22 errata-xmlrpc 2015-08-05 09:28:11 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1548

Note You need to log in before you can comment on or make changes to this bug.