Bug 1492142 - Make sure that the 'metrics' pool is created on major upgrade from OSP 9 to 10
Summary: Make sure that the 'metrics' pool is created on major upgrade from OSP 9 to 10
Keywords:
Status: CLOSED DUPLICATE of bug 1412295
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
: ---
Assignee: Marios Andreou
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-09-15 15:08 UTC by Andreas Karis
Modified: 2020-12-14 10:08 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-09-25 22:00:03 UTC
Target Upstream Version:



Description Andreas Karis 2017-09-15 15:08:33 UTC
Description of problem:
Hello,

Please make sure that the 'metrics' pool is created on major upgrade from OSP 9 to 10 on integrated Ceph environments.

I recently had a case where a customer upgraded 2 environments to OSP 10. 

What happened is that:

The absence of the "metrics" pool in RHOSP after an upgrade from OSP 9 to OSP 10 caused gnocchi to open an excessive number of connections to redis. The resulting sockets in TIME_WAIT exhausted the sockets available on the system. rabbitmq then failed because it couldn't open new ports, and httpd hung as well.
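
To check whether a controller is in this state, one option is to count the sockets in TIME_WAIT. This is only a minimal sketch: the helper name count_time_wait is made up for illustration, and it assumes the iproute2 `ss` tool is installed and prints the socket state in the first column as TIME-WAIT.

```shell
#!/bin/sh
# Count TCP sockets in TIME_WAIT from `ss -tan`-style output read on stdin.
# Assumption: the first column of each line is the socket state, and
# TIME_WAIT sockets are shown as "TIME-WAIT" (as iproute2 `ss` prints them).
count_time_wait() {
    awk '$1 == "TIME-WAIT" { n++ } END { print n + 0 }'
}

# Typical use on a controller:
#   ss -tan | count_time_wait
```

A count that keeps growing while gnocchi-metricd is restarting would be consistent with the behaviour described above, but what counts as "too many" depends on the system's port range and socket limits.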


Should I create a new bugzilla, or can we assign this ticket to tripleo or documentation (or should we keep it on gnocchi)? The goal is that we either create a 'metrics' pool ourselves when Ceph is internal, or make sure the customer knows that they have to create a 'metrics' pool when Ceph is external.

The exact number of PGs depends on the environment, but in this specific case, we ran this command against ceph to create the metrics pool:
~~~
ceph osd pool create metrics 64 64
~~~
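
For anyone scripting this, a guarded variant might look like the sketch below, which creates the pool only when it is missing. The function name ensure_pool is hypothetical, the 64/64 PG values are just the ones used in this case, and it assumes the installed Ceph release supports `ceph osd pool ls`.

```shell
#!/bin/sh
# Create a Ceph pool only if it does not already exist.
# Arguments: pool name, pg_num, pgp_num.
# Assumption: `ceph osd pool ls` prints one pool name per line.
ensure_pool() {
    pool="$1"
    pg_num="$2"
    pgp_num="$3"
    # -x matches the whole line, -F treats the pool name as a fixed string.
    if ceph osd pool ls | grep -qxF -- "$pool"; then
        echo "pool '$pool' already exists"
    else
        ceph osd pool create "$pool" "$pg_num" "$pgp_num"
    fi
}

# Example (run where the ceph CLI and an admin keyring are available):
#   ensure_pool metrics 64 64
```

Checking first makes the script safe to re-run and makes it explicit in the output whether the pool was already there.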

After that, the error messages in gnocchi stopped (checked with tail -f /var/log/gnocchi/metricd.log on the controllers).

As a reminder, we saw this error message:
++++++++++++++++++++++++
From /var/log/gnocchi/metricd.log
~~~
2017-09-13 15:17:45.479 603330 INFO gnocchi.storage.ceph [-] Ceph storage backend use 'cradox' python library
2017-09-13 15:17:45.487 603330 ERROR cotyledon [-] Unhandled exception
2017-09-13 15:17:45.487 603330 ERROR cotyledon Traceback (most recent call last):
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/cotyledon/__init__.py", line 52, in _exit_on_exception
2017-09-13 15:17:45.487 603330 ERROR cotyledon     yield
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/cotyledon/__init__.py", line 130, in _run
2017-09-13 15:17:45.487 603330 ERROR cotyledon     self.run()
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/gnocchi/cli.py", line 92, in run
2017-09-13 15:17:45.487 603330 ERROR cotyledon     self._configure()
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/gnocchi/cli.py", line 87, in _configure
2017-09-13 15:17:45.487 603330 ERROR cotyledon     self.store = storage.get_driver(self.conf)
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/gnocchi/storage/__init__.py", line 159, in get_driver
2017-09-13 15:17:45.487 603330 ERROR cotyledon     return get_driver_class(conf)(conf.storage)
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/gnocchi/storage/ceph.py", line 99, in __init__
2017-09-13 15:17:45.487 603330 ERROR cotyledon     self.ioctx = self.rados.open_ioctx(self.pool)
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "cradox.pyx", line 413, in cradox.requires.wrapper.validate_func (cradox.c:4188)
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "cradox.pyx", line 1047, in cradox.Rados.open_ioctx (cradox.c:12325)
2017-09-13 15:17:45.487 603330 ERROR cotyledon ObjectNotFound: error opening pool 'metrics'
2017-09-13 15:17:45.487 603330 ERROR cotyledon
~~~
++++++++++++++++++++++++
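
To spot this condition on other controllers, the final line of the traceback can be searched for directly. A minimal sketch, with a made-up helper name (has_missing_pool_error) and assuming the log path and message text match the ones shown above:

```shell
#!/bin/sh
# Succeed (exit 0) if the given gnocchi metricd log contains the
# missing-pool error from the traceback above, fail otherwise.
# Argument: path to the log file.
has_missing_pool_error() {
    grep -qF "ObjectNotFound: error opening pool 'metrics'" "$1"
}

# Example:
#   has_missing_pool_error /var/log/gnocchi/metricd.log && echo "metrics pool missing"
```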

Comment 1 Marios Andreou 2017-09-25 16:24:33 UTC
Hi Andreas - we discussed this during the upgrades bug triage call today. mcornea identified https://bugzilla.redhat.com/show_bug.cgi?id=1412295 (and also https://bugzilla.redhat.com/show_bug.cgi?id=1461951) as a duplicate for this. It looks like the fix was to document the requirement for the operator. 

I am adding a needinfo for pradk, the TC for Telemetry. Can someone please check this? We need to confirm that this is a duplicate of BZ 1412295, and to find out whether something has already landed in the docs or is still pending.

thanks

Comment 2 Pradeep Kilambi 2017-09-25 21:32:26 UTC
(In reply to marios from comment #1)
> Hi Andreas - we discussed this during the upgrades bug triage call today.
> mcornea identified https://bugzilla.redhat.com/show_bug.cgi?id=1412295 (and
> also https://bugzilla.redhat.com/show_bug.cgi?id=1461951) as a duplicate for
> this. It looks like the fix was to document the requirement for the
> operator. 
> 
> I am adding the TC for Telemetry needinfo pradk please can someone check
> this? Both to confirm the duplicate to BZ 1412295 and then did we land
> something to docs already or is that still pending?
> 
> thanks

Yes, this is indeed a duplicate of 1412295. I don't see the other bug fixed yet, so I set a needinfo on it. Hope it gets wrapped up soon. We can close this as a dup, I think.

Comment 3 Andreas Karis 2017-09-25 21:43:25 UTC
Hi, 

Feel free to close this one as duplicate if the docs bug is being tracked elsewhere!

Thanks!

- Andreas

Comment 4 Pradeep Kilambi 2017-09-25 22:00:03 UTC

*** This bug has been marked as a duplicate of bug 1412295 ***

