
Bug 1443057

Summary: Metricd can lose coordination group and lose capacity
Product: Red Hat OpenStack
Reporter: Alex Krzos <akrzos>
Component: openstack-gnocchi
Assignee: Mehdi ABAAKOUK <mabaakou>
Status: CLOSED ERRATA
QA Contact: Sasha Smolyak <ssmolyak>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 10.0 (Newton)
CC: apevec, jdanjou, jschluet, lhh, mabaakou, nshetty, pkilambi, vaggarwa
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Whiteboard: scale_lab
Fixed In Version: openstack-gnocchi-3.0.8-1.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1454842
Environment:
Last Closed: 2017-07-12 14:07:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1454842    

Description Alex Krzos 2017-04-18 12:01:38 UTC
Description of problem:
Gnocchi Metricd can "lose"(perhaps it attempts to  its coordination group and when it does so, two controllers out of three will have Gnocchi schedulers sharing the same block.

Version-Release number of selected component (if applicable):
Ocata - RHOSP11 (Build 2017-04-06.4)
openstack-gnocchi-api-3.1.2-3.el7ost.noarch
rubygem-sensu-redis-2.1.0-2.el7ost.noarch
python-redis-2.10.3-3.el7ost.noarch
puppet-redis-1.2.4-1.a2d6395git.el7ost.noarch
openstack-gnocchi-indexer-sqlalchemy-3.1.2-3.el7ost.noarch
python-gnocchiclient-3.1.0-1.el7ost.noarch
openstack-gnocchi-common-3.1.2-3.el7ost.noarch
openstack-gnocchi-metricd-3.1.2-3.el7ost.noarch
puppet-gnocchi-10.3.0-2.el7ost.noarch
python-gnocchi-3.1.2-3.el7ost.noarch
python-tooz-1.48.0-1.el7ost.noarch
openstack-gnocchi-statsd-3.1.2-3.el7ost.noarch
redis-3.2.8-1.el7ost.x86_64


How reproducible:
Unsure how reproducible this is, but it occurred on a cloud after ~9 hours of running.

Steps to Reproduce:
1.
2.
3.

Actual results:
You lose ~1/3 of your capacity to process metrics/measures when this occurs, until you restart the metricd process on each controller. I suspect restarting just the one that logged the error message would be enough.

Expected results:


Additional info:
Logs:
Controller-0

2017-04-18 09:53:53.431 339952 INFO gnocchi.cli [-] 0 measurements bundles across 0 metrics wait to be processed.
2017-04-18 09:56:01.594 339952 INFO gnocchi.cli [-] 13534 measurements bundles across 13534 metrics wait to be processed.
2017-04-18 09:57:54.646 339952 INFO gnocchi.cli [-] 3996 measurements bundles across 3996 metrics wait to be processed.
2017-04-18 09:59:23.000 336896 INFO gnocchi.cli [-] New set of agents detected. Now working on block: 1, with up to 2048 metrics
2017-04-18 09:59:53.432 339952 INFO gnocchi.cli [-] 0 measurements bundles across 0 metrics wait to be processed.


Controller-1

2017-04-18 09:53:53.687 296602 INFO gnocchi.cli [-] 0 measurements bundles across 0 metrics wait to be processed.
2017-04-18 09:56:01.609 296602 INFO gnocchi.cli [-] 13520 measurements bundles across 13520 metrics wait to be processed.
2017-04-18 09:57:55.023 296602 INFO gnocchi.cli [-] 3826 measurements bundles across 3826 metrics wait to be processed.
2017-04-18 09:59:24.040 293559 WARNING gnocchi.cli [-] Error getting block to work on, defaulting to first
2017-04-18 09:59:53.689 296602 INFO gnocchi.cli [-] 0 measurements bundles across 0 metrics wait to be processed.


Controller-2

2017-04-18 09:53:53.461 297473 INFO gnocchi.cli [-] 0 measurements bundles across 0 metrics wait to be processed.
2017-04-18 09:56:01.607 297473 INFO gnocchi.cli [-] 13531 measurements bundles across 13531 metrics wait to be processed.
2017-04-18 09:57:54.667 297473 INFO gnocchi.cli [-] 3994 measurements bundles across 3994 metrics wait to be processed.
2017-04-18 09:59:22.911 294213 INFO gnocchi.cli [-] New set of agents detected. Now working on block: 0, with up to 2048 metrics
2017-04-18 09:59:53.463 297473 INFO gnocchi.cli [-] 0 measurements bundles across 0 metrics wait to be processed.


* Note how at 09:59 Controller-1 cannot get a block to work on and defaults to block 0, which Controller-2 is also working on.
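
To make the capacity loss concrete, here is a minimal sketch of the failure mode, assuming each worker's block is simply its position in an agreed member list. The names (`get_block`, `get_block_or_first`, `CoordinationError`) and the member ordering are hypothetical and chosen only to mirror the logs above; this is not Gnocchi's actual code.

```python
# Hypothetical model of metricd block assignment; not Gnocchi's real API.
# The member order is an assumption picked to match the logs above
# (Controller-2 -> block 0, Controller-0 -> block 1).
MEMBERS = ["controller-2", "controller-0", "controller-1"]


class CoordinationError(Exception):
    """Stand-in for whatever error the coordination backend raised."""


def get_block(worker, members, failing=()):
    """Return the worker's block: its position in the member list."""
    if worker in failing:
        raise CoordinationError("could not read group membership")
    return members.index(worker)


def get_block_or_first(worker, members, failing=()):
    """Mimic the logged behaviour: on error, default to the first block."""
    try:
        return get_block(worker, members, failing)
    except CoordinationError:
        return 0


def coverage(members, failing=()):
    """Return the per-worker assignment and the set of unprocessed blocks."""
    assigned = {w: get_block_or_first(w, members, failing) for w in members}
    missing = set(range(len(members))) - set(assigned.values())
    return assigned, missing


assigned, missing = coverage(MEMBERS, failing={"controller-1"})
lost_fraction = len(missing) / len(MEMBERS)
print(assigned)
print(missing, lost_fraction)
```

In this model Controller-1 falls back to block 0, so block 2 has no worker and one third of the blocks go unprocessed, which is consistent with the ~1/3 capacity loss reported above.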


I am unsure why the metricd daemons picked new blocks to work on to begin with.

Comment 2 Alex Krzos 2017-04-18 12:05:51 UTC
(In reply to Alex Krzos from comment #0)
> Description of problem:
> Gnocchi Metricd can "lose"(perhaps it attempts to  its coordination group


*Meant*

Gnocchi Metricd can "lose" (perhaps it attempts to "re-coordinate" regularly?) its coordination group, and when it does so, two controllers out of three will have Gnocchi schedulers sharing the same block.

Comment 3 Mehdi ABAAKOUK 2017-04-19 05:31:49 UTC
Unfortunately, the error message "Error getting block to work on, defaulting to first" swallows the root cause. Also, in such a case we didn't retry.

I have proposed https://review.openstack.org/457702 to surface the root error and to retry later to get the correct blocks in such a case.
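
For readers who want the general shape of that approach, here is a rough sketch of "log the root error and retry later", under stated assumptions. It is not the patch from the review above: `FlakyCoordinator`, `CoordinationError` and `get_block_with_retry` are hypothetical names, and the retry count and delay are arbitrary.

```python
import logging
import time

LOG = logging.getLogger("metricd_sketch")


class CoordinationError(Exception):
    """Stand-in for an error raised by the coordination backend."""


class FlakyCoordinator:
    """Hypothetical coordinator that fails a fixed number of times, then recovers."""

    def __init__(self, members, failures):
        self.members = members
        self.failures = failures

    def get_members(self):
        if self.failures > 0:
            self.failures -= 1
            raise CoordinationError("connection to backend lost")
        return list(self.members)


def get_block_with_retry(coord, worker, attempts=5, delay=0.0):
    """Return the worker's block, or None if every attempt failed.

    The traceback is logged on each failure so the root cause stays visible,
    and no block is guessed while the membership is unknown.
    """
    for attempt in range(1, attempts + 1):
        try:
            return coord.get_members().index(worker)
        except CoordinationError:
            LOG.exception(
                "Error getting block to work on (attempt %d/%d), retrying",
                attempt, attempts,
            )
            time.sleep(delay)
    return None


coord = FlakyCoordinator(["controller-2", "controller-0", "controller-1"], failures=2)
block = get_block_with_retry(coord, "controller-1")
print(block)
```

Returning `None` instead of defaulting to block 0 is a design choice of this sketch: the caller can skip a cycle and try again, rather than duplicate another worker's block.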

Comment 4 Julien Danjou 2017-05-23 17:06:10 UTC
This has been merged and released as part of Gnocchi 3.0.8.

Comment 8 Sasha Smolyak 2017-07-12 13:18:07 UTC
tooz is fixed. We get an exception from redis when it is killed, as expected, and then it reconnects. Verifying it for now.

Comment 10 errata-xmlrpc 2017-07-12 14:07:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1748