Bug 2219596 - Ceilometer agent must try to keep reconnecting to metrics storage when it fails
Summary: Ceilometer agent must try to keep reconnecting to metrics storage when it fails
Keywords:
Status: ON_DEV
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ceilometer
Version: 18.0 (Zed)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: beta
Target Release: 18.0
Assignee: Jaromír Wysoglad
QA Contact: Leonid Natapov
Docs Contact: mgeary
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-07-04 13:41 UTC by Juan Larriba
Modified: 2023-08-17 08:42 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 880066 0 None MERGED Make TCP publisher log warning instead of failing. 2023-07-12 14:26:10 UTC
OpenStack gerrit 888166 0 None NEW Make TCP publisher log warning instead of failing. 2023-07-12 14:27:34 UTC
OpenStack gerrit 888167 0 None NEW Make TCP publisher log warning instead of failing. 2023-07-12 14:27:34 UTC
Red Hat Issue Tracker OSP-26303 0 None None None 2023-07-04 14:22:00 UTC

Description Juan Larriba 2023-07-04 13:41:58 UTC
Currently, Ceilometer tries to connect to the external metrics/events storage services (like gnocchi) only once, when it starts. If the service is not available at that moment, the Ceilometer service just sits there in an error state and does nothing.

In RHOSP18, the Kubernetes pattern will start all the containers at the same time.

Sometimes the ceilometer agent will start before the external service, sometimes it will not. In the first case, the agent does nothing.

There must be a number of reconnection attempts, every 10 seconds or so, to give the external storage systems time to spawn.

Comment 2 Leif Madsen 2023-07-05 14:51:52 UTC
It might be worth implementing a scaling reconnect timeout pattern to avoid overwhelming the services. Something like: try, then wait 2s, 4s, 8s, 15s, 30s, 60s, then every 60s indefinitely. Or maybe just make it a single configurable value, as suggested, with a default of 10s. Not sure if there is a significant cost to the reconnect attempts here though, so it might not be worth the logic.
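
For illustration, the scaling pattern suggested here could be sketched in Python roughly as follows. The names `reconnect_delays` and `connect_with_backoff` are hypothetical and not part of Ceilometer. The `connect` callable stands in for whatever opens the connection to the storage service and is assumed to raise `OSError` on failure.

```python
import itertools
import time
from typing import Callable, Iterator, Optional, Sequence


def reconnect_delays(
    steps: Sequence[float] = (2, 4, 8, 15, 30, 60),
) -> Iterator[float]:
    """Yield each delay in `steps` once, then repeat the last one forever."""
    yield from steps
    yield from itertools.repeat(steps[-1])


def connect_with_backoff(
    connect: Callable[[], object],
    sleep: Callable[[float], None] = time.sleep,
    max_attempts: Optional[int] = None,
) -> object:
    """Call `connect` until it succeeds, waiting longer after each failure.

    `connect` is a hypothetical callable assumed to raise OSError when the
    remote service is not reachable. With max_attempts=None it retries
    indefinitely; otherwise the last OSError is re-raised.
    """
    delays = reconnect_delays()
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except OSError:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(next(delays))
```

The single configurable value mentioned as the alternative would correspond to calling `reconnect_delays(steps=(10,))`, which yields 10 seconds on every attempt.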

Comment 3 Jaromír Wysoglad 2023-07-10 07:41:07 UTC
Currently the ceilometer TCP publisher is meant to try to connect when ceilometer is first started. If that fails, or if it disconnects sometime later, then it should try to reconnect before it tries to send each new metric. So every time there is a new metric to send, it should try to connect to sg-core again.

This mechanism currently does not seem to work, because there is a bug in the code. It looks to me like in Juan's situation it failed to connect when ceilometer first started. Then it failed again when trying to reconnect before sending the first metric, and after that it threw an exception and didn't try to reconnect again.

I can fix the bug and make it try to reconnect with each metric it tries to send. Do we actually need the mechanism with the reconnect timeout on top of that, Leif?

Comment 4 Juan Larriba 2023-07-10 08:23:44 UTC
I don't think the reconnection mechanism with the scaling timeout pattern is needed if the mechanism is reconnecting before sending each metric.

However, this inevitably raises the question: is trying to reconnect to sg-core when it is already connected an expensive operation? Take into account that there can be a number of metrics being sent to sg-core. Maybe it would be worth including a cheap check of whether ceilometer is already connected instead of just blindly trying to establish an already established socket?

Comment 5 Jaromír Wysoglad 2023-07-10 08:50:44 UTC
I think you didn't understand my explanation of how the tcp publisher works.

1. When ceilometer is started, it always performs a single attempt at connecting to sg-core.

2. When there is data to send, it tries to send the data to sg-core (no reconnection attempt here).

3. Ceilometer tries to reconnect only if the transmission of data in the previous step failed. If the reconnection succeeds, the data are sent again. If the reconnection fails, the data are discarded. From here, it goes back to step 2.


So if everything works, it'll keep sending the data (without reconnecting). If there is some failure and the connection breaks, it tries to reconnect before sending each metric until it succeeds; after that, it'll keep sending without reconnecting again.

The 3rd step is what doesn't work at the moment.
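
The three steps above can be sketched as a small Python class. This is only a simplified model of the intended behaviour, not Ceilometer's actual TCP publisher code: the class name `TcpPublisherSketch`, its methods, and the `timeout` parameter are all made up for illustration.

```python
import logging
import socket
from typing import Optional

LOG = logging.getLogger(__name__)


class TcpPublisherSketch:
    """Hypothetical model of the three steps; not Ceilometer's real code."""

    def __init__(self, host: str, port: int, timeout: float = 5.0) -> None:
        self.addr = (host, port)
        self.timeout = timeout
        self.sock: Optional[socket.socket] = None
        # Step 1: a single connection attempt at startup.
        self.connect()

    def connect(self) -> bool:
        """Try to connect once; log a warning instead of raising."""
        try:
            self.sock = socket.create_connection(self.addr, timeout=self.timeout)
            return True
        except OSError as err:
            LOG.warning("Unable to connect to %s:%s: %s", *self.addr, err)
            self.sock = None
            return False

    def _send(self, payload: bytes) -> bool:
        """Send on the current socket; drop the socket if sending fails."""
        if self.sock is None:
            return False
        try:
            self.sock.sendall(payload)
            return True
        except OSError as err:
            LOG.warning("Sending failed: %s", err)
            self.sock.close()
            self.sock = None
            return False

    def publish(self, payload: bytes) -> bool:
        # Step 2: try to send without reconnecting first.
        if self._send(payload):
            return True
        # Step 3: only after a failed send, reconnect once and resend.
        if self.connect() and self._send(payload):
            return True
        # Reconnection failed: the data are discarded, and the next
        # publish() call starts again from step 2.
        LOG.warning("Dropping %d bytes: no connection", len(payload))
        return False
```

In this sketch, `publish` never raises on connection errors, so a failed reconnect does not stop later calls from trying again. When the connection is healthy, the only per-metric overhead is the `self.sock is None` check, which is the kind of cheap check asked about in comment 4.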

