Created attachment 1043499 [details] ceilometer logs Description of problem: Overcloud Ceilometer stops working when turning off one controller in HA setup. I'm using a virt env with 3 x controllers, 2 x computes, 3 x ceph nodes. When turnning off first controller ceilometer stops working and ceilometer related resources stop working. Version-Release number of selected component (if applicable): openstack-tripleo-image-elements-0.9.6-4.el7ost.noarch openstack-tripleo-0.0.7-0.1.1664e566.el7ost.noarch openstack-tripleo-puppet-elements-0.0.1-2.el7ost.noarch openstack-tripleo-heat-templates-0.8.6-18.el7ost.noarch openstack-tripleo-common-0.0.1.dev6-0.git49b57eb.el7ost.noarch openstack-puppet-modules-2015.1.7-5.el7ost.noarch How reproducible: 100% Steps to Reproduce: 1. Deploy overcloud with 3 controllers 2. Turn off one of the controllers 3. Check if overcloud ceilometer is working Actual results: Ceilometer stops working and cluster resources related to ceilometer fail Expected results: Ceilometer will still work after bringing one of the controllers down. Additional info: Attaching ceilometer logs and pcs status.
I haven't collected yet enough infos on the root cause, but I have seen this happening on a live env
Dev can't seem to reproduce this issue. Can you reproduce consistently? Is there an environment we can look at where this reproduces?
The change at https://code.engineering.redhat.com/gerrit/#/c/52379/ which was merged to fix BZ 1236374 removes the constraint which was restarting Ceilometer when Redis VIP relocated. This bug is most probably affected by that change and we need to see if it is reproduced using openstack-tripleo-heat-templates-0.8.6-26.el7ost
I'm still able to reproduce it. This is what I get after turning off one of the first controller: [stack@instack ~]$ ceilometer meter-list ('Connection aborted.', BadStatusLine("''",)) I'm seeing the ceilometer backends in haproxy don't have checks: listen ceilometer bind 172.16.20.10:8777 bind 172.16.23.10:8777 balance roundrobin option tcplog option ssl-hello-chk server overcloud-controller-0 172.16.20.16:8777 server overcloud-controller-1 172.16.20.14:8777 server overcloud-controller-2 172.16.20.15:8777
Created attachment 1051530 [details] ceilometer logs Attaching ceilometer logs from one of the 2 remaining running controller nodes.
Marius, the initial query will fail until ceilometer recognizes the first mongodb node is down, logging something like: Unable to reconnect to the primary mongodb: [Errno 110] Connection timed out. Trying again in 10 seconds. at which points it will attempt a connection to the second mongodb instance and if you the same request it will work as expected.
The haproxy misses the check option for ceilometer backends. We removed the node that is down from haproxy config and timeouts didn't occur anymore. Upstream patch: https://review.openstack.org/#/c/199507/2
Marius, ack, some requests were indeed failing on the "check" keyword missing
With current puddle ceilometer api doesn't work right after deployment. Removing the ssl-hello-chk option makes it work again. [stack@instack ~]$ . overcloudrc [stack@instack ~]$ ceilometer meter-list ('Connection aborted.', BadStatusLine("''",)) [root@overcloud-controller-0 heat-admin]# grep -A8 ceilometer /etc/haproxy/haproxy.cfg listen ceilometer bind 172.16.20.11:8777 bind 172.16.23.10:8777 balance roundrobin option tcplog option ssl-hello-chk server overcloud-controller-0 172.16.20.14:8777 check fall 5 inter 2000 rise 2 server overcloud-controller-1 172.16.20.15:8777 check fall 5 inter 2000 rise 2 server overcloud-controller-2 172.16.20.13:8777 check fall 5 inter 2000 rise 2 [root@overcloud-controller-0 ~]# tail -5 /var/log/ceilometer/api.log 2015-07-21 04:02:24.415 17058 ERROR werkzeug [-] 172.16.20.15 - - [21/Jul/2015 04:02:24] code 400, message Bad request syntax ('\x16\x03\x00\x00y\x01\x00\x00u\x03\x00U\xad\xfc\x90HAPROXYSSLCHK') 2015-07-21 04:02:25.624 17058 ERROR werkzeug [-] 172.16.20.14 - - [21/Jul/2015 04:02:25] code 400, message Bad request syntax ('\x16\x03\x00\x00y\x01\x00\x00u\x03\x00U\xad\xfc\x91HAPROXYSSLCHK') 2015-07-21 04:02:26.419 17058 ERROR werkzeug [-] 172.16.20.13 - - [21/Jul/2015 04:02:26] code 400, message Bad request syntax ('\x16\x03\x00\x00y\x01\x00\x00u\x03\x00U\xad\xfc\x92HAPROXYSSLCHK') 2015-07-21 04:02:26.420 17058 ERROR werkzeug [-] 172.16.20.15 - - [21/Jul/2015 04:02:26] code 400, message Bad request syntax ('\x16\x03\x00\x00y\x01\x00\x00u\x03\x00U\xad\xfc\x92HAPROXYSSLCHK') 2015-07-21 04:02:27.626 17058 ERROR werkzeug [-] 172.16.20.14 - - [21/Jul/2015 04:02:27] code 400, message Bad request syntax ('\x16\x03\x00\x00y\x01\x00\x00u\x03\x00U\xad\xfc\x93HAPROXYSSLCHK')
Blarrgh -- this is caused by the haproxy.cfg change that we considered backing out of core. This would be a good reason to actually back it out, given we're respinning anyway.
I think James Slagle's patch should fix this: https://review.openstack.org/203298/
Tested upstream: listen ceilometer bind 192.0.2.17:8777 bind 192.0.2.18:8777 server overcloud-controller-0 192.0.2.5:8777 check fall 5 inter 2000 rise 2 server overcloud-controller-1 192.0.2.4:8777 check fall 5 inter 2000 rise 2 server overcloud-controller-2 192.0.2.3:8777 check fall 5 inter 2000 rise 2 listen glance_api bind 192.0.2.17:9292 bind 192.0.2.18:9292 option httpchk GET / server overcloud-controller-0 192.0.2.5:9292 check fall 5 inter 2000 rise 2 server overcloud-controller-1 192.0.2.4:9292 check fall 5 inter 2000 rise 2 server overcloud-controller-2 192.0.2.3:9292 check fall 5 inter 2000 rise 2 We're waiting for CI results before merge.
Actually i should have pasted the glance_registry bit: listen glance_registry bind 192.0.2.18:9191 server overcloud-controller-0 192.0.2.5:9191 check fall 5 inter 2000 rise 2 server overcloud-controller-1 192.0.2.4:9191 check fall 5 inter 2000 rise 2 server overcloud-controller-2 192.0.2.3:9191 check fall 5 inter 2000 rise 2 Patch merged upstream.
Note : In order to see this test pass, it require to configure fence-agent on all of the HA controllers. Verified : openstack-tripleo-puppet-elements-0.0.1-4.el7ost.noarch openstack-puppet-modules-2015.1.8-8.el7ost.noarch puppet-3.6.2-2.el7.noarch instack-undercloud-2.1.2-22.el7ost.noarch instack-0.0.7-1.el7ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2015:1548