Bug 1468859 - OpenStack control plane down due to gnocchi issues with OSP 10 ceilometer and zenoss
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-gnocchi
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Julien Danjou
QA Contact: Sasha Smolyak
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-07-09 01:35 UTC by Andreas Karis
Modified: 2020-09-10 10:52 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-02 09:18:03 UTC
Target Upstream Version:
Embargoed:
tshefi: automate_bug-



Description Andreas Karis 2017-07-09 01:35:59 UTC
Description of problem:
OpenStack control plane down due to gnocchi issues with OSP 10 ceilometer and zenoss

Customer is not using Ceph, but is running into https://access.redhat.com/solutions/3066751
~~~
[root@overcloud-controller-0 ~]# cat gnocchi.txt
[root@overcloud-controller-0 ~]# netstat -planex | grep httpd | grep ACC
unix  2      [ ACC ]     STREAM     LISTENING     1705910442 4351/httpd           /var/run/wsgi.1030974.5.1.sock
unix  2      [ ACC ]     STREAM     LISTENING     1705910444 4352/httpd           /var/run/wsgi.1030974.5.2.sock
unix  249    [ ACC ]     STREAM     LISTENING     1705910446 4353/httpd           /var/run/wsgi.1030974.5.3.sock
unix  2      [ ACC ]     STREAM     LISTENING     1705910449 4363/httpd           /var/run/wsgi.1030974.5.4.sock
[root@overcloud-controller-0 ~]#  ps aux | grep http | wc -l
273
[root@overcloud-controller-0 ~]#  rpm -qa | grep gnocchi
python-gnocchiclient-2.6.0-1.el7ost.noarch
openstack-gnocchi-indexer-sqlalchemy-3.0.2-1.el7ost.noarch
openstack-gnocchi-metricd-3.0.2-1.el7ost.noarch
puppet-gnocchi-9.4.1-1.el7ost.noarch
openstack-gnocchi-common-3.0.2-1.el7ost.noarch
openstack-gnocchi-statsd-3.0.2-1.el7ost.noarch
python-gnocchi-3.0.2-1.el7ost.noarch
openstack-gnocchi-api-3.0.2-1.el7ost.noarch
openstack-gnocchi-carbonara-3.0.2-1.el7ost.noarch
[root@overcloud-controller-0 ~]#  lsof -nn /var/run/wsgi.1030974.5.3.sock
COMMAND     PID    USER   FD   TYPE             DEVICE SIZE/OFF       NODE NAME
httpd      4353 gnocchi    6u  unix 0xffff8855b9fdc800      0t0 1840093777 /var/run/wsgi.1030974.5.3.sock
httpd      4353 gnocchi   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4354 gnocchi    6u  unix 0xffff8812aebb0000      0t0 1815065148 /var/run/wsgi.1030974.5.3.sock
httpd      4354 gnocchi   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4355 gnocchi    6u  unix 0xffff881424898000      0t0 1814911701 /var/run/wsgi.1030974.5.3.sock
httpd      4355 gnocchi   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4356 gnocchi    6u  unix 0xffff88434674a400      0t0 1815379719 /var/run/wsgi.1030974.5.3.sock
httpd      4356 gnocchi   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4357 gnocchi    6u  unix 0xffff884f01c58c00      0t0 1814351285 /var/run/wsgi.1030974.5.3.sock
httpd      4357 gnocchi   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4358 gnocchi    6u  unix 0xffff881298e18400      0t0 1814677648 /var/run/wsgi.1030974.5.3.sock
httpd      4358 gnocchi   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4359 gnocchi    6u  unix 0xffff885fb75c5800      0t0 1814222114 /var/run/wsgi.1030974.5.3.sock
httpd      4359 gnocchi   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4360 gnocchi    6u  unix 0xffff881f43f57400      0t0 1815546612 /var/run/wsgi.1030974.5.3.sock
httpd      4360 gnocchi   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4361 gnocchi    6u  unix 0xffff88139ebb5c00      0t0 1814467211 /var/run/wsgi.1030974.5.3.sock
httpd      4361 gnocchi   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4362 gnocchi    6u  unix 0xffff883d94a7b000      0t0 1815260447 /var/run/wsgi.1030974.5.3.sock
httpd      4362 gnocchi   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd   1030974    root   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
~~~
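The socket that stands out in the netstat output above can be picked out by its RefCnt, the second column netstat prints for unix sockets: wsgi.1030974.5.3.sock shows 249 while the other listeners show 2. A minimal sketch of that check, assuming the column layout shown above; the function name and the threshold of 100 are made up for illustration and are not part of any tool:

```python
import re

# Matches listening unix-socket lines as printed by `netstat -planex`, e.g.
# unix  249  [ ACC ]  STREAM  LISTENING  1705910446 4353/httpd  /var/run/wsgi...sock
LINE_RE = re.compile(
    r"^unix\s+(?P<refcnt>\d+)\s+\[\s*ACC\s*\]\s+\S+\s+LISTENING\s+"
    r"(?P<inode>\d+)\s+(?P<pid>\d+)/(?P<prog>\S+)\s+(?P<path>\S+)\s*$"
)


def busy_listeners(netstat_output, threshold=100):
    """Return (path, pid, refcnt) for listening unix sockets with RefCnt >= threshold.

    `threshold` is an arbitrary illustrative cut-off, not a documented limit.
    """
    hits = []
    for line in netstat_output.splitlines():
        m = LINE_RE.match(line.strip())
        if m and int(m.group("refcnt")) >= threshold:
            hits.append((m.group("path"), int(m.group("pid")), int(m.group("refcnt"))))
    return hits


# Two lines copied from the output above.
SAMPLE = """\
unix  2      [ ACC ]     STREAM     LISTENING     1705910442 4351/httpd           /var/run/wsgi.1030974.5.1.sock
unix  249    [ ACC ]     STREAM     LISTENING     1705910446 4353/httpd           /var/run/wsgi.1030974.5.3.sock
"""

if __name__ == "__main__":
    for path, pid, refcnt in busy_listeners(SAMPLE):
        print(f"{path} (pid {pid}) RefCnt={refcnt}")
```

A high RefCnt only says that many peers hold the socket; it does not say why they are stuck.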

Comment 3 Julien Danjou 2017-07-10 11:26:14 UTC
Hi Andreas,

So for now this does not make any sense.

1. You say "customer run into https://access.redhat.com/solutions/3066751", except that this is, as you noticed, a Ceph issue where the pool is not created. So I don't understand how the customer ran into this. Do you mean that the symptom where httpd is stuck is the same?

2. The ceilometer.conf you pasted does not reference gnocchi at all, but zenoss. So I also don't understand how Gnocchi is in this picture. I don't think we support zenoss anyway.

3. The ceilometer.conf in the sosreport does indeed point to Gnocchi.

4. The logs in the sosreports do indeed show other issues. collector.log from sosreport-20170616-193207/overcloud-controller-1.localdomain shows that RabbitMQ was unreachable until 8 June and that the service got shut down on 12 June. The logs are from 16 June, so I don't know why the collector was shut down for 4 days.

5. The sosreport does not include the Gnocchi log :(

So while I'd love to discover the root cause of the problem you encountered, I wonder if it's not too late now that the system has changed a lot.

Andreas, would you be able to build a full procedure to reproduce the problem?

Comment 4 Andreas Karis 2017-07-10 16:05:57 UTC
Hi,

The customer was not using Ceph, but we could observe the same symptoms that happen when the metrics pool is missing:
~~~
[root@overcloud-controller-0 ~]# cat gnocchi.txt
[root@overcloud-controller-0 ~]# netstat -planex | grep httpd | grep ACC
unix  2      [ ACC ]     STREAM     LISTENING     1705910442 4351/httpd           /var/run/wsgi.1030974.5.1.sock
unix  2      [ ACC ]     STREAM     LISTENING     1705910444 4352/httpd           /var/run/wsgi.1030974.5.2.sock
unix  249    [ ACC ]     STREAM     LISTENING     1705910446 4353/httpd           /var/run/wsgi.1030974.5.3.sock
unix  2      [ ACC ]     STREAM     LISTENING     1705910449 4363/httpd           /var/run/wsgi.1030974.5.4.sock
[root@overcloud-controller-0 ~]#  ps aux | grep http | wc -l
273
[root@overcloud-controller-0 ~]#  rpm -qa | grep gnocchi
python-gnocchiclient-2.6.0-1.el7ost.noarch
openstack-gnocchi-indexer-sqlalchemy-3.0.2-1.el7ost.noarch
openstack-gnocchi-metricd-3.0.2-1.el7ost.noarch
puppet-gnocchi-9.4.1-1.el7ost.noarch
openstack-gnocchi-common-3.0.2-1.el7ost.noarch
openstack-gnocchi-statsd-3.0.2-1.el7ost.noarch
python-gnocchi-3.0.2-1.el7ost.noarch
openstack-gnocchi-api-3.0.2-1.el7ost.noarch
openstack-gnocchi-carbonara-3.0.2-1.el7ost.noarch
[root@overcloud-controller-0 ~]#  lsof -nn /var/run/wsgi.1030974.5.3.sock
COMMAND     PID    USER   FD   TYPE             DEVICE SIZE/OFF       NODE NAME
httpd      4353 gnocchi    6u  unix 0xffff8855b9fdc800      0t0 1840093777 /var/run/wsgi.1030974.5.3.sock
httpd      4353 gnocchi   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4354 gnocchi    6u  unix 0xffff8812aebb0000      0t0 1815065148 /var/run/wsgi.1030974.5.3.sock
httpd      4354 gnocchi   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4355 gnocchi    6u  unix 0xffff881424898000      0t0 1814911701 /var/run/wsgi.1030974.5.3.sock
httpd      4355 gnocchi   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4356 gnocchi    6u  unix 0xffff88434674a400      0t0 1815379719 /var/run/wsgi.1030974.5.3.sock
httpd      4356 gnocchi   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4357 gnocchi    6u  unix 0xffff884f01c58c00      0t0 1814351285 /var/run/wsgi.1030974.5.3.sock
httpd      4357 gnocchi   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4358 gnocchi    6u  unix 0xffff881298e18400      0t0 1814677648 /var/run/wsgi.1030974.5.3.sock
httpd      4358 gnocchi   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4359 gnocchi    6u  unix 0xffff885fb75c5800      0t0 1814222114 /var/run/wsgi.1030974.5.3.sock
httpd      4359 gnocchi   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4360 gnocchi    6u  unix 0xffff881f43f57400      0t0 1815546612 /var/run/wsgi.1030974.5.3.sock
httpd      4360 gnocchi   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4361 gnocchi    6u  unix 0xffff88139ebb5c00      0t0 1814467211 /var/run/wsgi.1030974.5.3.sock
httpd      4361 gnocchi   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4362 gnocchi    6u  unix 0xffff883d94a7b000      0t0 1815260447 /var/run/wsgi.1030974.5.3.sock
httpd      4362 gnocchi   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd   1030974    root   27u  unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
~~~

Just to clarify: I think the customer switched to zenoss, then back to gnocchi, then hit this issue. 

Let's do the following: we have since upgraded their environment to the latest OSP 10z3. We can close this BZ as NOT A BUG, and if the customer runs into the same issues again, I will either reopen this one or open a new one. This one can still serve as a reminder that there may be issues with ceilometer + gnocchi + swift that have symptoms similar to the Ceph case (see above output). I agree that it's late now and that the data is largely insufficient, so I didn't really expect you to find anything significant. I'll forward your points to the customer though, as next steps in case this happens again. We re-enabled ceilometer / gnocchi by the way, and we are going to watch this environment closely.

Thanks!!!

Comment 5 Julien Danjou 2017-07-10 16:29:52 UTC
Great Andreas, thanks for the follow-up. I'm closing this but feel free to re-open if you have any interesting new data coming in.

Comment 15 Julien Danjou 2017-07-19 09:26:20 UTC
Hi Andreas,

I'm not sure updating a CLOSED bug with a new set of problems is the best practice or the best way to solve your issue, but well. I'm reopening for the sake of it, even if it might not be the same issue.

Reading the oldest Gnocchi log on controller-0, from 14 July, Swift was not functioning:
2017-07-14 03:39:05.929 581527 ERROR swiftclient     raise ClientException.from_response(resp, 'Object GET failed', body)

I have no idea why Swift has been down for several days. The swift.log file only has data from 17 July.
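To check which dates a log actually covers, and when the swiftclient errors start and stop, the file can be scanned for ERROR lines. A rough sketch, assuming the `date time pid LEVEL logger message` layout of the line quoted above; the function name is hypothetical:

```python
import re
from datetime import datetime

# Assumed layout: "2017-07-14 03:39:05.929 581527 ERROR swiftclient  message..."
LOG_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \d+ "
    r"(?P<level>[A-Z]+) (?P<logger>\S+)"
)


def error_span(lines, logger="swiftclient"):
    """Return (first, last, count) for ERROR lines from `logger`, or None if none match."""
    stamps = []
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group("level") == "ERROR" and m.group("logger") == logger:
            stamps.append(datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f"))
    if not stamps:
        return None
    return min(stamps), max(stamps), len(stamps)


if __name__ == "__main__":
    # The single line quoted in this comment.
    sample = [
        "2017-07-14 03:39:05.929 581527 ERROR swiftclient     "
        "raise ClientException.from_response(resp, 'Object GET failed', body)",
    ]
    print(error_span(sample))
```

Run against the full log, this shows how long the errors went on; a traceback logs one ERROR line per frame, so the count overstates the number of failures.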

So yeah if Swift is not functioning, no metric can be processed and all is queued for days until the problem is solved.

Isn't there any monitoring on Swift?
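For what it's worth, a very basic probe is possible if the Swift proxy has the healthcheck middleware in its pipeline, which answers GET /healthcheck with 200 and the body "OK". A minimal sketch under that assumption; the function name is made up and the URL in the usage line is a placeholder, not taken from this environment:

```python
import urllib.error
import urllib.request


def swift_is_healthy(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if GET `url` answers HTTP 200 with a body of "OK".

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200 and resp.read().strip() == b"OK"
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error status, and so on.
        return False


if __name__ == "__main__":
    # Placeholder URL: replace with the real proxy host and port.
    print(swift_is_healthy("http://swift-proxy.example:8080/healthcheck"))
```

A passing healthcheck only shows that the proxy is answering; it does not prove that object GETs succeed, so it complements log monitoring and does not replace it.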

Comment 16 Andreas Karis 2017-07-19 14:47:34 UTC
Julien, this is the same environment for which I opened the bug initially. And as this is an issue with swift / gnocchi and anything in the telemetry chain, I added the data here, because I wanted to add additional data to this ticket (as discussed).

I unfortunately arrived too late to verify if the env showed the same symptoms as in comment 1.

It's pure speculation, but I thought that gnocchi might have crashed it, as we had the same issues before. I wouldn't see a reason why it crashed, otherwise. 

We will likely disable telemetry or switch back to mongo and see whether the environment remains stable from there on. Keeping this one open for a while, in case we find something else related to this.

Comment 17 Julien Danjou 2017-07-19 15:46:13 UTC
If using Swift makes it crash, Gnocchi or not, you have a Swift problem. Do you have Swift logs?

Disabling Telemetry just sweeps the problem under the carpet if the problem is in Swift.

Comment 18 Julien Danjou 2017-10-02 09:18:03 UTC
No update in a while so I'm gonna assume this is not a problem anymore. Closing.

