Bug 1492142

Summary: Make sure that the 'metrics' pool is created on major upgrade from OSP 9 to 10
Product: Red Hat OpenStack
Reporter: Andreas Karis <akaris>
Component: openstack-tripleo
Assignee: Marios Andreou <mandreou>
Status: CLOSED DUPLICATE
QA Contact: Arik Chernetsky <achernet>
Severity: high
Priority: unspecified
Version: 10.0 (Newton)
CC: aschultz, mandreou, mburns, pkilambi, rhel-osp-director-maint, ssmolyak
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2017-09-25 22:00:03 UTC
Type: Bug

Description Andreas Karis 2017-09-15 15:08:33 UTC
Description of problem:
Hello,

Please make sure that the 'metrics' pool is created on major upgrade from OSP 9 to 10 on integrated Ceph environments.

I recently had a case where a customer upgraded 2 environments to OSP 10. 

What happened is that:

The absence of the "metrics" pool in RHOSP after an upgrade from OSP 9 to OSP 10 caused gnocchi to malfunction and create too many connections to redis. This exhausted the sockets available on the system, because too many of them were in TIME_WAIT. As a result, rabbitmq failed, because it couldn't open new ports. It also made httpd hang.
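
One way to check whether TIME_WAIT sockets are piling up on a controller is to count them in /proc/net/tcp. The following is only a minimal sketch: the function name `count_time_wait` is hypothetical, and it assumes the Linux /proc/net/tcp layout, in which the fourth column is the socket state and `06` stands for TIME_WAIT.

```python
TIME_WAIT = "06"  # state code for TIME_WAIT in /proc/net/tcp (Linux)


def count_time_wait(proc_net_tcp_text):
    """Count the sockets in TIME_WAIT in the text of /proc/net/tcp."""
    count = 0
    # The first line is a column header; every other line is one socket.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        # Columns: sl, local_address, rem_address, st, ...
        if len(fields) > 3 and fields[3] == TIME_WAIT:
            count += 1
    return count
```

To use it, pass in the contents of the file, for example `count_time_wait(open("/proc/net/tcp").read())`. This covers IPv4 sockets only; IPv6 sockets are listed in /proc/net/tcp6, which uses the same state column.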


Should I create a new bugzilla, or can we assign this ticket to tripleo or documentation (or should we keep it on gnocchi)? The goal is that either we create the 'metrics' pool ourselves if this is internal ceph, or the customer knows that they have to create a 'metrics' pool if this is external ceph.

The exact number of PGs depends on the environment, but in this specific case, we ran this command against ceph to create the metrics pool:
~~~
ceph osd pool create metrics 64 64
~~~
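
For repeated use, the command above could be wrapped so that the pool is only created when it is missing. This is a sketch under assumptions, not what tripleo does: `run_ceph` and `ensure_pool` are hypothetical helper names, it assumes the `ceph` CLI and an admin keyring are available on the node, and it assumes `ceph osd pool ls` prints one pool name per line.

```python
import subprocess


def run_ceph(args):
    """Run a ceph CLI command and return its stdout as text."""
    return subprocess.check_output(["ceph"] + list(args)).decode()


def ensure_pool(name, pg_num, run=run_ceph):
    """Create the pool if `ceph osd pool ls` does not list it.

    Returns True if the pool was created, False if it already existed.
    """
    existing = run(["osd", "pool", "ls"]).split()
    if name in existing:
        return False
    # Same form as the manual command: pg_num and pgp_num get the same value.
    run(["osd", "pool", "create", name, str(pg_num), str(pg_num)])
    return True
```

Calling `ensure_pool("metrics", 64)` would reproduce what we did in this case. The PG count still has to be chosen for the environment. The `run` parameter exists only so that the ceph call can be replaced when trying the function out away from a cluster.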

After that, the error messages in gnocchi went away (we checked with tail -f /var/log/gnocchi/metricd.log on the controllers).

As a reminder, we saw this error message:
++++++++++++++++++++++++
From /var/log/gnocchi/metricd.log
~~~
2017-09-13 15:17:45.479 603330 INFO gnocchi.storage.ceph [-] Ceph storage backend use 'cradox' python library
2017-09-13 15:17:45.487 603330 ERROR cotyledon [-] Unhandled exception
2017-09-13 15:17:45.487 603330 ERROR cotyledon Traceback (most recent call last):
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/cotyledon/__init__.py", line 52, in _exit_on_exception
2017-09-13 15:17:45.487 603330 ERROR cotyledon     yield
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/cotyledon/__init__.py", line 130, in _run
2017-09-13 15:17:45.487 603330 ERROR cotyledon     self.run()
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/gnocchi/cli.py", line 92, in run
2017-09-13 15:17:45.487 603330 ERROR cotyledon     self._configure()
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/gnocchi/cli.py", line 87, in _configure
2017-09-13 15:17:45.487 603330 ERROR cotyledon     self.store = storage.get_driver(self.conf)
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/gnocchi/storage/__init__.py", line 159, in get_driver
2017-09-13 15:17:45.487 603330 ERROR cotyledon     return get_driver_class(conf)(conf.storage)
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/gnocchi/storage/ceph.py", line 99, in __init__
2017-09-13 15:17:45.487 603330 ERROR cotyledon     self.ioctx = self.rados.open_ioctx(self.pool)
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "cradox.pyx", line 413, in cradox.requires.wrapper.validate_func (cradox.c:4188)
2017-09-13 15:17:45.487 603330 ERROR cotyledon   File "cradox.pyx", line 1047, in cradox.Rados.open_ioctx (cradox.c:12325)
2017-09-13 15:17:45.487 603330 ERROR cotyledon ObjectNotFound: error opening pool 'metrics'
2017-09-13 15:17:45.487 603330 ERROR cotyledon
~~~
++++++++++++++++++++++++
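
To check quickly whether other controllers or environments are affected, the log can be scanned for the last line of the traceback above. This is an illustrative sketch only: `missing_pools` is a hypothetical helper, and the pattern is derived solely from the error line quoted in this report.

```python
import re

# Matches the final error line of the traceback quoted above.
POOL_ERROR = re.compile(r"ObjectNotFound: error opening pool '([^']+)'")


def missing_pools(log_lines):
    """Return the set of pool names reported as missing in metricd.log lines."""
    pools = set()
    for line in log_lines:
        match = POOL_ERROR.search(line)
        if match:
            pools.add(match.group(1))
    return pools
```

For example, `missing_pools(open("/var/log/gnocchi/metricd.log"))` returns the pool names that metricd failed to open. An empty set means the error does not appear in that log.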

Comment 1 Marios Andreou 2017-09-25 16:24:33 UTC
Hi Andreas - we discussed this during the upgrades bug triage call today. mcornea identified https://bugzilla.redhat.com/show_bug.cgi?id=1412295 (and also https://bugzilla.redhat.com/show_bug.cgi?id=1461951) as a duplicate for this. It looks like the fix was to document the requirement for the operator. 

I am adding a needinfo for pradk, the TC for Telemetry. Can someone please check this, both to confirm that this is a duplicate of BZ 1412295 and to tell us whether something has already landed in the docs or is still pending?

thanks

Comment 2 Pradeep Kilambi 2017-09-25 21:32:26 UTC
(In reply to marios from comment #1)
> Hi Andreas - we discussed this during the upgrades bug triage call today.
> mcornea identified https://bugzilla.redhat.com/show_bug.cgi?id=1412295 (and
> also https://bugzilla.redhat.com/show_bug.cgi?id=1461951) as a duplicate for
> this. It looks like the fix was to document the requirement for the
> operator. 
> 
> I am adding the TC for Telemetry needinfo pradk please can someone check
> this? Both to confirm the duplicate to BZ 1412295 and then did we land
> something to docs already or is that still pending?
> 
> thanks

Yes, this is indeed a duplicate of 1412295. I don't see the other bug fixed yet, so I set a needinfo on it. Hopefully it gets wrapped up soon. I think we can close this as a duplicate.

Comment 3 Andreas Karis 2017-09-25 21:43:25 UTC
Hi, 

Feel free to close this one as duplicate if the docs bug is being tracked elsewhere!

Thanks!

- Andreas

Comment 4 Pradeep Kilambi 2017-09-25 22:00:03 UTC

*** This bug has been marked as a duplicate of bug 1412295 ***