Description of problem:
OpenStack control plane down due to gnocchi issues with OSP 10 ceilometer and zenoss.

Customer is not using Ceph, but is running into https://access.redhat.com/solutions/3066751:

~~~
[root@overcloud-controller-0 ~]# cat gnocchi.txt
[root@overcloud-controller-0 ~]# netstat -planex | grep httpd | grep ACC
unix  2    [ ACC ] STREAM LISTENING 1705910442 4351/httpd /var/run/wsgi.1030974.5.1.sock
unix  2    [ ACC ] STREAM LISTENING 1705910444 4352/httpd /var/run/wsgi.1030974.5.2.sock
unix  249  [ ACC ] STREAM LISTENING 1705910446 4353/httpd /var/run/wsgi.1030974.5.3.sock
unix  2    [ ACC ] STREAM LISTENING 1705910449 4363/httpd /var/run/wsgi.1030974.5.4.sock
[root@overcloud-controller-0 ~]# ps aux | grep http | wc -l
273
[root@overcloud-controller-0 ~]# rpm -qa | grep gnocchi
python-gnocchiclient-2.6.0-1.el7ost.noarch
openstack-gnocchi-indexer-sqlalchemy-3.0.2-1.el7ost.noarch
openstack-gnocchi-metricd-3.0.2-1.el7ost.noarch
puppet-gnocchi-9.4.1-1.el7ost.noarch
openstack-gnocchi-common-3.0.2-1.el7ost.noarch
openstack-gnocchi-statsd-3.0.2-1.el7ost.noarch
python-gnocchi-3.0.2-1.el7ost.noarch
openstack-gnocchi-api-3.0.2-1.el7ost.noarch
openstack-gnocchi-carbonara-3.0.2-1.el7ost.noarch
[root@overcloud-controller-0 ~]# lsof -nn /var/run/wsgi.1030974.5.3.sock
COMMAND     PID    USER  FD  TYPE             DEVICE SIZE/OFF       NODE NAME
httpd      4353 gnocchi   6u unix 0xffff8855b9fdc800      0t0 1840093777 /var/run/wsgi.1030974.5.3.sock
httpd      4353 gnocchi  27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4354 gnocchi   6u unix 0xffff8812aebb0000      0t0 1815065148 /var/run/wsgi.1030974.5.3.sock
httpd      4354 gnocchi  27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4355 gnocchi   6u unix 0xffff881424898000      0t0 1814911701 /var/run/wsgi.1030974.5.3.sock
httpd      4355 gnocchi  27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4356 gnocchi   6u unix 0xffff88434674a400      0t0 1815379719 /var/run/wsgi.1030974.5.3.sock
httpd      4356 gnocchi  27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4357 gnocchi   6u unix 0xffff884f01c58c00      0t0 1814351285 /var/run/wsgi.1030974.5.3.sock
httpd      4357 gnocchi  27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4358 gnocchi   6u unix 0xffff881298e18400      0t0 1814677648 /var/run/wsgi.1030974.5.3.sock
httpd      4358 gnocchi  27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4359 gnocchi   6u unix 0xffff885fb75c5800      0t0 1814222114 /var/run/wsgi.1030974.5.3.sock
httpd      4359 gnocchi  27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4360 gnocchi   6u unix 0xffff881f43f57400      0t0 1815546612 /var/run/wsgi.1030974.5.3.sock
httpd      4360 gnocchi  27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4361 gnocchi   6u unix 0xffff88139ebb5c00      0t0 1814467211 /var/run/wsgi.1030974.5.3.sock
httpd      4361 gnocchi  27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4362 gnocchi   6u unix 0xffff883d94a7b000      0t0 1815260447 /var/run/wsgi.1030974.5.3.sock
httpd      4362 gnocchi  27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd   1030974 root     27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
~~~
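The wedged-listener symptom above (one WSGI socket with 249 attached connections while its siblings sit at 2) can be spotted mechanically. A minimal sketch, assuming the second `netstat -planex` column for unix sockets (the socket reference count) behaves as in the output above; the threshold of 10 is an arbitrary assumption, not a tuned value:

```shell
# check_wsgi_backlog: reads `netstat -planex` output on stdin and flags
# httpd unix listeners whose second column (socket reference count, i.e.
# attached connections) exceeds a threshold. Healthy listeners above
# showed 2, the stuck one 249. Threshold of 10 is an assumption.
check_wsgi_backlog() {
    threshold="${1:-10}"
    grep httpd | grep ACC | awk -v t="$threshold" \
        '$2 > t { print "stuck:", $NF, "refcnt:", $2 }'
}

# Typical use on a controller:
#   netstat -planex | check_wsgi_backlog
```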
Hi Andreas,

So for now this does not make any sense to me.

1. You say "the customer ran into https://access.redhat.com/solutions/3066751", except that this is, as you noticed, a Ceph issue where the pool is not created. So I don't understand how the customer ran into this. Do you mean that the symptom where httpd is stuck is the same?
2. The ceilometer.conf you pasted does not reference gnocchi at all, but zenoss. So I also don't understand how Gnocchi fits into this picture. I don't think we support zenoss anyway.
3. The ceilometer.conf in the sosreport does indeed point to Gnocchi.
4. The logs in the sosreports do indeed show other issues. collector.log from sosreport-20170616-193207/overcloud-controller-1.localdomain shows that RabbitMQ was unreachable until 8 June and that the service got shut down on 12 June. The logs are from 16 June, so why the collector stayed down for 4 days, I don't know.
5. The sosreport does not include the gnocchi logs :(

So while I'd love to discover the root cause of the problem you encountered, I wonder if it's not too late now that the system has changed a lot. Andreas, would you be able to build a full procedure to reproduce the problem?
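On point 4, the window during which the collector lost RabbitMQ can be bracketed quickly from the sosreport log. A sketch, assuming the usual `date time` prefix on each collector.log line; the grep patterns are guesses at typical oslo.messaging/shutdown wording, not strings verified against this exact log:

```shell
# collector_timeline: print the first and last timestamps of
# connectivity-error / shutdown messages in a ceilometer collector.log.
# The matched phrases are assumptions about typical log wording.
collector_timeline() {
    log="${1:?usage: collector_timeline <collector.log>}"
    grep -Ei 'unreachable|connection (refused|reset)|shut ?down' "$log" \
        | awk '{ print $1, $2 }' | sort -u | sed -n '1p;$p'
}
```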
Hi,

The customer was not using Ceph, but we could observe the same symptoms that happen when the metrics pool is missing:

~~~
[root@overcloud-controller-0 ~]# cat gnocchi.txt
[root@overcloud-controller-0 ~]# netstat -planex | grep httpd | grep ACC
unix  2    [ ACC ] STREAM LISTENING 1705910442 4351/httpd /var/run/wsgi.1030974.5.1.sock
unix  2    [ ACC ] STREAM LISTENING 1705910444 4352/httpd /var/run/wsgi.1030974.5.2.sock
unix  249  [ ACC ] STREAM LISTENING 1705910446 4353/httpd /var/run/wsgi.1030974.5.3.sock
unix  2    [ ACC ] STREAM LISTENING 1705910449 4363/httpd /var/run/wsgi.1030974.5.4.sock
[root@overcloud-controller-0 ~]# ps aux | grep http | wc -l
273
[root@overcloud-controller-0 ~]# rpm -qa | grep gnocchi
python-gnocchiclient-2.6.0-1.el7ost.noarch
openstack-gnocchi-indexer-sqlalchemy-3.0.2-1.el7ost.noarch
openstack-gnocchi-metricd-3.0.2-1.el7ost.noarch
puppet-gnocchi-9.4.1-1.el7ost.noarch
openstack-gnocchi-common-3.0.2-1.el7ost.noarch
openstack-gnocchi-statsd-3.0.2-1.el7ost.noarch
python-gnocchi-3.0.2-1.el7ost.noarch
openstack-gnocchi-api-3.0.2-1.el7ost.noarch
openstack-gnocchi-carbonara-3.0.2-1.el7ost.noarch
[root@overcloud-controller-0 ~]# lsof -nn /var/run/wsgi.1030974.5.3.sock
COMMAND     PID    USER  FD  TYPE             DEVICE SIZE/OFF       NODE NAME
httpd      4353 gnocchi   6u unix 0xffff8855b9fdc800      0t0 1840093777 /var/run/wsgi.1030974.5.3.sock
httpd      4353 gnocchi  27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4354 gnocchi   6u unix 0xffff8812aebb0000      0t0 1815065148 /var/run/wsgi.1030974.5.3.sock
httpd      4354 gnocchi  27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4355 gnocchi   6u unix 0xffff881424898000      0t0 1814911701 /var/run/wsgi.1030974.5.3.sock
httpd      4355 gnocchi  27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4356 gnocchi   6u unix 0xffff88434674a400      0t0 1815379719 /var/run/wsgi.1030974.5.3.sock
httpd      4356 gnocchi  27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4357 gnocchi   6u unix 0xffff884f01c58c00      0t0 1814351285 /var/run/wsgi.1030974.5.3.sock
httpd      4357 gnocchi  27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4358 gnocchi   6u unix 0xffff881298e18400      0t0 1814677648 /var/run/wsgi.1030974.5.3.sock
httpd      4358 gnocchi  27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4359 gnocchi   6u unix 0xffff885fb75c5800      0t0 1814222114 /var/run/wsgi.1030974.5.3.sock
httpd      4359 gnocchi  27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4360 gnocchi   6u unix 0xffff881f43f57400      0t0 1815546612 /var/run/wsgi.1030974.5.3.sock
httpd      4360 gnocchi  27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4361 gnocchi   6u unix 0xffff88139ebb5c00      0t0 1814467211 /var/run/wsgi.1030974.5.3.sock
httpd      4361 gnocchi  27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd      4362 gnocchi   6u unix 0xffff883d94a7b000      0t0 1815260447 /var/run/wsgi.1030974.5.3.sock
httpd      4362 gnocchi  27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
httpd   1030974 root     27u unix 0xffff882fa8a4f400      0t0 1705910446 /var/run/wsgi.1030974.5.3.sock
~~~

Just to clarify: I think the customer switched to zenoss, then back to gnocchi, then hit this issue.

Let's do the following: we have since upgraded their environment to the latest OSP 10z3. We can close this BZ as NOT A BUG, and in case the customer runs into the same issues again, I will either reopen this one or open a new one. This one can still serve as a reminder that there may be issues with ceilometer + gnocchi + swift that show the same symptoms as the Ceph case (see the output above).

I agree that it's late now and that the data is largely insufficient, so I didn't really expect you to find anything significant. I'll forward your points to the customer though, as next steps in case this happens again.
We re-enabled ceilometer / gnocchi by the way, and we are going to watch this environment closely. Thanks!!!
Great Andreas, thanks for the follow-up. I'm closing this, but feel free to re-open if you have any interesting new data coming in.
Hi Andreas,

I'm not sure updating a CLOSED bug with a new set of problems is the best practice, nor the way to solve your issue, but well. I'm reopening for the sake of it, even if it might not be the same issue.

Reading the oldest Gnocchi log on controller-0, from 14 July, Swift was not functioning:

~~~
2017-07-14 03:39:05.929 581527 ERROR swiftclient raise ClientException.from_response(resp, 'Object GET failed', body)
~~~

I have no idea why Swift has been down for several days. The swift.log file only has data from 17 July. So yes, if Swift is not functioning, no metric can be processed and everything is queued for days until the problem is solved. Isn't there any monitoring on Swift?
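On the monitoring question: the Swift proxy exposes an unauthenticated healthcheck endpoint that simply returns `OK`, which is enough for a basic liveness probe. A minimal sketch, not a supported tool; the default URL below is a placeholder assumption to be replaced with the environment's actual proxy VIP and port:

```shell
# swift_healthcheck: probe the Swift proxy /healthcheck endpoint.
# Prints "swift OK" and returns 0 when the proxy answers "OK";
# returns non-zero otherwise. Default URL is a placeholder assumption.
swift_healthcheck() {
    url="${1:-http://127.0.0.1:8080/healthcheck}"
    body=$(curl -sf --max-time 5 "$url") || { echo "swift DOWN"; return 1; }
    case "$body" in
        OK*) echo "swift OK" ;;
        *)   echo "swift UNEXPECTED: $body"; return 1 ;;
    esac
}
```

Wired into cron or any existing monitoring, this would have flagged the outage well before metrics queued up for days.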
Julien, this is the same environment for which I opened the bug initially. And as this is an issue with swift / gnocchi and anything in the telemetry chain, I added the data here, because I wanted to add additional data to this ticket (as discussed).

I unfortunately arrived too late to verify whether the environment showed the same symptoms as in comment 1. It's pure speculation, but I thought that gnocchi might have crashed it, as we had the same issues before. Otherwise I don't see a reason why it crashed.

We're likely disabling telemetry or switching back to mongo, and will see if the environment remains stable from there on. Keeping this one open for a while, in case we find something else related to this.
If using Swift makes it crash, Gnocchi or not, you have a Swift problem. Do you have Swift logs? Disabling Telemetry is just sweeping the problem under the carpet if the problem is in Swift.
No update in a while so I'm gonna assume this is not a problem anymore. Closing.