Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2219596

Summary:	Ceilometer agent must try to keep reconnecting to metrics storage when it fails
Product:	Red Hat OpenStack	Reporter:	Juan Larriba <jlarriba>
Component:	openstack-ceilometer	Assignee:	Jaromír Wysoglad <jwysogla>
Status:	CLOSED MIGRATED	QA Contact:	Leonid Natapov <lnatapov>
Severity:	high	Docs Contact:	mgeary <mgeary>
Priority:	high
Version:	18.0 (Zed)	CC:	apevec, mrunge
Target Milestone:	beta	Keywords:	Triaged
Target Release:	18.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openstack-ceilometer-20.0.1-18.0.20230810154809.e04777a	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2024-01-05 10:12:42 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Juan Larriba 2023-07-04 13:41:58 UTC

Currently, Ceilometer only tries to connect to the external metrics/events storage services (like gnocchi) once, when it starts. If the service is not available at that moment, the Ceilometer service just sits there in an error state and does nothing.

In RHOSP18, the Kubernetes pattern will start all the containers at the same time.

Sometimes the ceilometer agent will start before the external service, sometimes it will not. In the first case, the agent never connects and does nothing.

Ceilometer should retry the connection every 10 seconds or so, to give the external storage systems time to spawn.

Comment 2 Leif Madsen 2023-07-05 14:51:52 UTC

It might be worth implementing a scaling reconnect timeout pattern to avoid overwhelming the services? Something like try, wait 2s, 4s, 8s, 15s, 30s, 60s, then every 60s indefinitely. Or maybe just make it a single configurable value, as suggested, with a default of 10s. Not sure if there is a strong cost to the reconnect attempts here though, so it might not be worth the logic.
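
For illustration only, here is a minimal Python sketch of the two options mentioned in this comment: the scaling schedule (2s, 4s, 8s, 15s, 30s, 60s, then 60s indefinitely) and a single fixed interval. The names backoff_delays and connect_with_retry are hypothetical and are not part of ceilometer. The connect argument stands in for whatever call opens the connection and is assumed to raise OSError on failure.

```python
import itertools
import time


def backoff_delays(steps=(2, 4, 8, 15, 30, 60)):
    """Yield the waits from the schedule, then repeat the last one forever."""
    yield from steps
    # Once the schedule is exhausted, keep retrying at the largest interval.
    yield from itertools.repeat(steps[-1])


def connect_with_retry(connect, delays, sleep=time.sleep):
    """Call connect() until it succeeds and return the number of attempts.

    `connect` is a hypothetical hook assumed to raise OSError on failure.
    `delays` is an iterator of wait times in seconds.
    """
    attempts = 1
    while True:
        try:
            connect()
            return attempts
        except OSError:
            # Wait for the next interval in the schedule before retrying.
            sleep(next(delays))
            attempts += 1
```

Passing backoff_delays() gives the scaling pattern, while itertools.repeat(10) gives the fixed 10s interval from the description. Both retry indefinitely, so a caller that needs a limit would have to add one.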

Comment 3 Jaromír Wysoglad 2023-07-10 07:41:07 UTC

Currently the ceilometer TCP publisher is meant to try to connect when ceilometer is first started. If that fails, or if it disconnects sometime later, then it should try to reconnect before it tries to send each new metric. So every time there is a new metric to send, it should try to connect to sg-core again.

This mechanism currently seems to not work, because there is a bug in the code. It looks to me like in Juan's situation it failed to connect when ceilometer first started. Then it failed again when trying to reconnect before sending the first metric and after that it threw an exception and it didn't try to reconnect again.

I can fix the bug and make it try to reconnect with each metric it tries to send. Do we actually need the mechanism with the reconnect timeout on top of that, Leif?

Comment 4 Juan Larriba 2023-07-10 08:23:44 UTC

I don't think the reconnection mechanism with the scaling timeout pattern is needed if the mechanism is reconnecting before sending each metric.

However, this inevitably raises the question: is trying to reconnect to sg-core when it is already connected an expensive operation? Take into account that there can be a number of metrics being sent to sg-core. Maybe it would be worth including a cheap check of whether ceilometer is already connected, instead of just blindly trying to establish an already established socket?

Comment 5 Jaromír Wysoglad 2023-07-10 08:50:44 UTC

I think you didn't understand my explanation of how the tcp publisher works.

1. When ceilometer is started, it always performs a single attempt at connecting to sg-core.

2. When there is data to send, it tries to send the data to sg-core (no reconnection attempt here).

3. Only if the transmission of data in the previous step failed, ceilometer tries to reconnect. If the reconnection succeeds, the data are sent again. If the reconnection fails, the data are discarded. From here, it goes back to step 2.


So if everything works, it'll keep sending the data (without reconnecting). If there is some failure and the connection breaks, it tries to reconnect before sending each metric until it succeeds. After that it'll keep sending without reconnecting again.

The 3rd step is what doesn't work at the moment.
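
The intended flow in steps 1 to 3 above can be sketched roughly as follows. This is a hypothetical illustration written for this comment thread, not the actual ceilometer TCP publisher code. The class name, the method names and the timeout parameter are all assumptions.

```python
import socket


class ReconnectingTcpPublisher:
    """Hypothetical sketch of the intended flow, not the real publisher."""

    def __init__(self, host, port, timeout=5.0):
        self.address = (host, port)
        self.timeout = timeout
        self.sock = None
        # Step 1: a single connection attempt at startup.
        self._connect()

    def _connect(self):
        """Drop any old socket and try once to open a new one."""
        if self.sock is not None:
            self.sock.close()
            self.sock = None
        try:
            self.sock = socket.create_connection(self.address,
                                                 timeout=self.timeout)
        except OSError:
            self.sock = None
        return self.sock is not None

    def _send(self, payload):
        """Try to send on the current socket and report success."""
        if self.sock is None:
            return False
        try:
            self.sock.sendall(payload)
            return True
        except OSError:
            return False

    def publish(self, payload):
        # Step 2: send on the existing connection, no reconnect here.
        if self._send(payload):
            return True
        # Step 3: only after a failed send, reconnect once and resend.
        if self._connect() and self._send(payload):
            return True
        # Reconnect failed: the data are discarded. No exception escapes,
        # so the next publish() call goes through steps 2 and 3 again.
        return False
```

The point of the sketch is that publish() swallows the connection error and returns, so a failed reconnect costs one discarded metric rather than stopping all later attempts.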

Comment 6 Jaromír Wysoglad 2023-08-23 07:42:00 UTC

Just an update. The bz is waiting for a patch to be merged to Zed, which is blocked by upstream CI at the moment. However, the stable/2023.1 backport is merged, which *should* be enough to fix Juan's issue.

Comment 7 Matthias Runge 2024-01-05 10:12:42 UTC

Closing this in favor of https://issues.redhat.com/browse/OSP-26303