Bug 2219596
| Summary: | Ceilometer agent must try to keep reconnecting to metrics storage when it fails | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Juan Larriba <jlarriba> |
| Component: | openstack-ceilometer | Assignee: | Jaromír Wysoglad <jwysogla> |
| Status: | ON_DEV --- | QA Contact: | Leonid Natapov <lnatapov> |
| Severity: | high | Docs Contact: | mgeary <mgeary> |
| Priority: | high | | |
| Version: | 18.0 (Zed) | CC: | apevec, mrunge |
| Target Milestone: | beta | Keywords: | Triaged |
| Target Release: | 18.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Juan Larriba
2023-07-04 13:41:58 UTC
It might be worth implementing a scaling reconnect timeout pattern to avoid overwhelming services. Something like: try, then wait 2s, 4s, 8s, 15s, 30s, 60s, then every 60s indefinitely. Or maybe just make it a single configurable value, as suggested, with a default of 10s. I'm not sure there is a significant cost to the reconnect attempts here though, so it might not be worth the logic.

Currently the ceilometer TCP publisher is meant to try to connect when ceilometer is first started. If that fails, or if it disconnects sometime later, then it should try to reconnect before it tries to send each new metric. So every time there is a new metric to send, it should try to connect to sg-core again. This mechanism currently does not seem to work, because there is a bug in the code. It looks to me like in Juan's situation it failed to connect when ceilometer first started. Then it failed again when trying to reconnect before sending the first metric, and after that it threw an exception and didn't try to reconnect again. I can fix the bug and make it try to reconnect with each metric it tries to send. Do we actually need the mechanism with the reconnect timeout on top of that, Leif?

I don't think the reconnection mechanism with the scaling timeout pattern is needed if the mechanism reconnects before sending each metric. However, this inevitably raises the question: is trying to reconnect to sg-core when it is already connected an expensive operation? Take into account that there can be a number of metrics being sent to sg-core. Maybe it would be worth including a cheap check of whether ceilometer is already connected, instead of just blindly trying to establish an already established socket?

I think you didn't understand my explanation of how the TCP publisher works.

1. When ceilometer is started, it always performs a single attempt at connecting to sg-core.
2. When there is data to send, it tries to send the data to sg-core (no reconnection attempt here).
3. Only if the transmission of data in the previous step failed, ceilometer tries to reconnect. If the reconnection succeeds, the data is sent again. If the reconnection fails, the data is discarded.

From here, it goes back to step 2. So if everything works, it'll keep sending the data (without reconnecting). If there is some failure and the connection breaks, it tries to reconnect before sending each metric until it succeeds; then it'll keep sending without reconnecting again. The 3rd step is what doesn't work at the moment.
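
For readers following the thread, here is a minimal Python sketch of the intended three-step behaviour and of the wait schedule proposed in the first comment. It is an illustration only and not the actual ceilometer publisher code: the names `TcpPublisherSketch`, `publish`, and `backoff_delays` are hypothetical, the 1-second socket timeout is an assumption, and the backoff helper is deliberately not wired into the publisher because the discussion leaves open whether it is needed.

```python
import socket
from typing import Iterator, Optional, Sequence


def backoff_delays(schedule: Sequence[int] = (2, 4, 8, 15, 30, 60)) -> Iterator[int]:
    """Yield the waits proposed in the thread, then repeat the last one forever."""
    for delay in schedule:
        yield delay
    while True:
        yield schedule[-1]


class TcpPublisherSketch:
    """Hypothetical stand-in for a TCP publisher; not the ceilometer class."""

    def __init__(self, host: str, port: int, timeout: float = 1.0) -> None:
        self.address = (host, port)
        self.timeout = timeout  # assumed value, not taken from the bug report
        self.sock: Optional[socket.socket] = None
        # Step 1: a single connection attempt at startup; failure is tolerated.
        self._connect()

    def _connect(self) -> bool:
        try:
            self.sock = socket.create_connection(self.address, timeout=self.timeout)
            return True
        except OSError:
            self.sock = None
            return False

    def _send(self, payload: bytes) -> bool:
        if self.sock is None:
            return False
        try:
            self.sock.sendall(payload)
            return True
        except OSError:
            # Drop the broken socket so the next attempt starts clean.
            self.sock.close()
            self.sock = None
            return False

    def publish(self, payload: bytes) -> bool:
        # Step 2: try to send on the existing connection, without reconnecting.
        if self._send(payload):
            return True
        # Step 3: only after a failed send, reconnect once and resend.
        if self._connect() and self._send(payload):
            return True
        # Reconnection failed: discard the data and never raise, so the
        # next call to publish() gets another chance to reconnect.
        return False
```

The important property for this bug is that `publish` returns instead of raising when the reconnect fails, so every later metric triggers a fresh reconnection attempt. While the connection is healthy, step 2 succeeds and no extra connection work is done.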