Description of problem:

Hello,

Please make sure that the 'metrics' pool is created on a major upgrade from OSP 9 to 10 on integrated Ceph environments.

I recently had a case where a customer upgraded 2 environments to OSP 10. What happened:

The absence of the 'metrics' pool in RHOSP after the upgrade from OSP 9 to OSP 10 caused gnocchi to create too many connections to redis. This exhausted the number of sockets on the system, because too many sockets were in TIME_WAIT. As a result, rabbitmq failed because it couldn't open new ports, and httpd hung.

Should I create a new bugzilla, or can we assign this ticket to tripleo or documentation (or should we keep it on gnocchi)? The goal is that we create the 'metrics' pool if this is internal ceph, or that the customer knows they have to create the 'metrics' pool if this is external ceph.

The exact number of PGs depends on the environment, but in this specific case we ran this command against ceph to create the metrics pool:
~~~
ceph osd pool create metrics 64 64
~~~

Then the error messages in gnocchi went away (tail -f /var/log/gnocchi/metricd.log on the controllers).

As a reminder, we saw this error message:

++++++++++++++++++++++++
From /var/log/gnocchi/metricd.log
~~~
2017-09-13 15:17:45.479 603330 INFO gnocchi.storage.ceph [-] Ceph storage backend use 'cradox' python library
2017-09-13 15:17:45.487 603330 ERROR cotyledon [-] Unhandled exception
2017-09-13 15:17:45.487 603330 ERROR cotyledon Traceback (most recent call last):
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/cotyledon/__init__.py", line 52, in _exit_on_exception
2017-09-13 15:17:45.487 603330 ERROR cotyledon     yield
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/cotyledon/__init__.py", line 130, in _run
2017-09-13 15:17:45.487 603330 ERROR cotyledon     self.run()
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/gnocchi/cli.py", line 92, in run
2017-09-13 15:17:45.487 603330 ERROR cotyledon     self._configure()
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/gnocchi/cli.py", line 87, in _configure
2017-09-13 15:17:45.487 603330 ERROR cotyledon     self.store = storage.get_driver(self.conf)
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/gnocchi/storage/__init__.py", line 159, in get_driver
2017-09-13 15:17:45.487 603330 ERROR cotyledon     return get_driver_class(conf)(conf.storage)
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/gnocchi/storage/ceph.py", line 99, in __init__
2017-09-13 15:17:45.487 603330 ERROR cotyledon     self.ioctx = self.rados.open_ioctx(self.pool)
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "cradox.pyx", line 413, in cradox.requires.wrapper.validate_func (cradox.c:4188)
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "cradox.pyx", line 1047, in cradox.Rados.open_ioctx (cradox.c:12325)
2017-09-13 15:17:45.487 603330 ERROR cotyledon ObjectNotFound: error opening pool 'metrics'
2017-09-13 15:17:45.487 603330 ERROR cotyledon
~~~
++++++++++++++++++++++++
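
For anyone who wants to check for this condition before or after the upgrade, here is a minimal sketch of a pre-flight check. It is not part of any shipped tooling. It assumes the `ceph` CLI and a usable keyring are available on the node, and that `ceph osd pool ls` prints one pool name per line. The helper names (`list_pools`, `missing_pools`, `report`) are hypothetical, and the PG count is deliberately left as a placeholder because it depends on the environment.

```python
import subprocess


def list_pools():
    """Return pool names from `ceph osd pool ls` (assumed: one name per line)."""
    out = subprocess.check_output(
        ["ceph", "osd", "pool", "ls"], universal_newlines=True
    )
    return [line.strip() for line in out.splitlines() if line.strip()]


def missing_pools(required, existing):
    """Return the required pool names that are absent, keeping the given order."""
    present = set(existing)
    return [name for name in required if name not in present]


def report(required, existing):
    """Build a short message; PG counts are left as placeholders on purpose."""
    missing = missing_pools(required, existing)
    if not missing:
        return "all required pools present"
    hints = [
        "ceph osd pool create {0} <pg_num> <pgp_num>".format(name)
        for name in missing
    ]
    return "missing pools; create them with:\n" + "\n".join(hints)
```

On a controller this could be run as `print(report(["metrics"], list_pools()))`. Keeping the comparison separate from the `ceph` call means the logic can be exercised without a cluster.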
Hi Andreas - we discussed this during the upgrades bug triage call today. mcornea identified https://bugzilla.redhat.com/show_bug.cgi?id=1412295 (and also https://bugzilla.redhat.com/show_bug.cgi?id=1461951) as a duplicate of this. It looks like the fix was to document the requirement for the operator.

I am adding the TC for Telemetry (needinfo pradk). Please can someone check this? Both to confirm that this is a duplicate of BZ 1412295, and whether we already landed something in the docs or that is still pending.

thanks
(In reply to marios from comment #1)
> Hi Andreas - we discussed this during the upgrades bug triage call today.
> mcornea identified https://bugzilla.redhat.com/show_bug.cgi?id=1412295 (and
> also https://bugzilla.redhat.com/show_bug.cgi?id=1461951) as a duplicate for
> this. It looks like the fix was to document the requirement for the
> operator.
>
> I am adding the TC for Telemetry needinfo pradk please can someone check
> this? Both to confirm the duplicate to BZ 1412295 and then did we land
> something to docs already or is that still pending?
>
> thanks

Yes, this is indeed a duplicate of 1412295. I don't see the other bug fixed yet, so I put a needinfo on it. Hope it gets wrapped up soon. We can close this as a dup, I think.
Hi,

Feel free to close this one as a duplicate if the docs bug is being tracked elsewhere!

Thanks!

- Andreas
*** This bug has been marked as a duplicate of bug 1412295 ***