Bug 2016460
| Summary: | [STF 1.3] sg-core fails handling some messages due to some invalid escape char | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Service Telemetry Framework | Reporter: | ghurel | ||||
| Component: | sg-core-container | Assignee: | Leif Madsen <lmadsen> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Leonid Natapov <lnatapov> | ||||
| Severity: | high | Docs Contact: | Joanne O'Flynn <joflynn> | ||||
| Priority: | high | ||||||
| Version: | 1.3 | CC: | csibbitt, lmadsen, mmagr | ||||
| Target Milestone: | z4 | Keywords: | Triaged, ZStream | ||||
| Target Release: | 1.3 (STF) | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | sg-core-container-4.0.4-1 | Doc Type: | Bug Fix | ||||
| Doc Text: |
In some cases, Ceilometer metrics were not handled properly by sg-core. This resulted in some Ceilometer metrics not being stored in Prometheus.
In this release, the processing of metrics has been enhanced to be more robust.
While sg-core has been enhanced to support larger messages from Ceilometer, an additional change is required to pass the larger messages through the sg-bridge ring buffer. The changes required to fully support this functionality are being tracked in RHBZ#2053683.
|
Story Points: | --- | ||||
| Clone Of: | |||||||
| : | 2051615 (view as bug list) | Environment: | |||||
| Last Closed: | 2022-02-21 16:30:17 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 2051615, 2053683 | ||||||
Description
ghurel
2021-10-21 15:10:02 UTC
I was asked to verify this and found that it was not fixed in the listed package.
$ oc describe po default-cops04-ceil-meter-smartgateway-59b8dd6b64-m892q | grep -i image
Image: image-registry.openshift-image-registry.svc:5000/service-telemetry/stf-sg-core:4.0.4-1
Image ID: image-registry.openshift-image-registry.svc:5000/service-telemetry/stf-sg-core@sha256:f1587eb3ef058462e39cff35f5dc8b81e741b087de00ce2f04ff3d6ee2672355
$ oc logs default-cops04-ceil-meter-smartgateway-59b8dd6b64-m892q -c sg-core | strings | grep unexpected
2022-02-10 17:58:41 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|
2022-02-10 17:58:41 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|x-opt-qd.i|..., bigger context ...|/default-interconnect-675dd97bc4-pltbs
2022-02-10 17:58:41 [DEBUG] failed handling message [handler: ceilometer-metrics[socket], error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|-pltbs
2022-02-10 17:58:41 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|-opt-qd.in|..., bigger context ...|default-interconnect-675dd97bc4-pltbs
2022-02-10 18:01:41 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|x-opt-qd.i|..., bigger context ...|/default-interconnect-675dd97bc4-pltbs
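To see how often the parse failure recurs, the grep output above can be grouped by minute. The following is a minimal sketch, not part of sg-core; the function name and the regular expression are illustrative assumptions based only on the log lines quoted in this report.

```python
import re
from collections import Counter

# Assumed line shape, taken from the log excerpt in this report:
# "YYYY-MM-DD HH:MM:SS [DEBUG] failed handling message [...]"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[DEBUG\] failed handling message"
)


def count_failures_per_minute(log_text: str) -> Counter:
    """Count 'unexpected end of input' failures, grouped by minute."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match and "unexpected end of input" in line:
            counts[match.group("ts")[:16]] += 1  # keep "YYYY-MM-DD HH:MM"
    return counts


# Two lines copied from the log output in this report.
SAMPLE = "\n".join([
    "2022-02-10 17:58:41 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|",
    "2022-02-10 18:01:41 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|x-opt-qd.i|..., bigger context ...|/default-interconnect-675dd97bc4-pltbs",
])

if __name__ == "__main__":
    for minute, count in sorted(count_failures_per_minute(SAMPLE).items()):
        print(minute, count)
```

In practice the input would be the text of `oc logs ... -c sg-core` saved to a file; a count that stays at zero over several minutes after the change suggests the failure no longer occurs.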
Moving this back to ON_QA. It seems some information was missing: increasing the ringBufferSize on the sg-bridge side was already a required dependency. I'm taking this and will attempt to verify it again.

A dependency has already been filed against Service Telemetry Operator to expose ringBufferSize and increase its default value; it is tracked in the linked dependency. It should still be possible to verify the sg-core side as part of the upcoming release.

Verified this by increasing the ringBufferSize on the bridge container in the Ceilometer metrics Smart Gateway deployment. I did this by scaling down the Service Telemetry Operator to 0 pods so that it wouldn't revert changes to the SmartGateway manifest for default-cops04-ceil-meter. The manifest changes look like the following:
apiVersion: smartgateway.infra.watch/v2
kind: SmartGateway
metadata:
  ...
spec:
  applications:
  - config: |
      host: 0.0.0.0
      port: 8081
      withTimeStamp: true
    name: prometheus
  bridge:
    amqpUrl: amqp://default-interconnect.service-telemetry.svc.cluster.local:5673/anycast/ceilometer/cops04-metering.sample
    ringBufferSize: 16384
  handleErrors: true
  logLevel: debug
  ...
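The verification steps described above (scale the operator to 0, then change ringBufferSize on the SmartGateway) could be scripted roughly as follows. This is a sketch under assumptions: the operator Deployment name `service-telemetry-operator` is a guess, the namespace is inferred from the hostnames in this report, and `oc patch` is used here in place of whatever method was actually used to edit the manifest. The script only prints the commands; it does not run them.

```python
import json
import shlex
from typing import List

NAMESPACE = "service-telemetry"  # inferred from the registry path and amqpUrl in this report
OPERATOR_DEPLOYMENT = "service-telemetry-operator"  # assumption: actual Deployment name may differ
SMART_GATEWAY = "default-cops04-ceil-meter"  # SmartGateway named in the comment above


def build_commands(ring_buffer_size: int) -> List[List[str]]:
    """Build the oc invocations for the manual verification steps."""
    patch = json.dumps({"spec": {"bridge": {"ringBufferSize": ring_buffer_size}}})
    return [
        # Stop the operator first so it does not revert the manifest change.
        ["oc", "scale", f"deployment/{OPERATOR_DEPLOYMENT}",
         "--replicas=0", "-n", NAMESPACE],
        # Merge-patch only the bridge ring buffer size.
        ["oc", "patch", "smartgateway", SMART_GATEWAY,
         "--type=merge", "-p", patch, "-n", NAMESPACE],
    ]


if __name__ == "__main__":
    for command in build_commands(16384):
        print(shlex.join(command))  # printed for review, not executed
```

Printing rather than executing keeps the sketch safe to run anywhere; the output can be reviewed and pasted into a shell logged in to the cluster.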
Prior to making this change I deployed with the pre-release artifacts, reproduced the issue, made the changes above to enable the new ring buffer size, and then monitored the output for several minutes (the corrupted message would typically show up within 30-60 seconds).
The full solution requires the changes tracked in the dependency linked to this issue, but the code as implemented does resolve the issue once the bridge container is updated to reflect the larger buffer capacity.
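One plausible reading of why the buffer size matters: if a message is larger than the space available to it, the JSON that reaches the parser is cut short, which would match the "unexpected end of input" errors. The sketch below models that idea only. It is not sg-bridge code; the truncating behaviour, the 2048-byte "small" size, and the payload are all hypothetical, and only 16384 comes from the manifest in this report.

```python
import json


def pass_through_slot(message: bytes, slot_size: int) -> bytes:
    """Model a fixed-size slot: bytes beyond slot_size are dropped (assumed behaviour)."""
    return message[:slot_size]


def parses_cleanly(raw: bytes) -> bool:
    """Return True if raw is complete, valid JSON."""
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical payload, padded so it exceeds the small slot but fits the large one.
payload = json.dumps({"oslo.message": "x" * 3000}).encode()

small = pass_through_slot(payload, 2048)   # hypothetical smaller size
large = pass_through_slot(payload, 16384)  # value from the manifest in this report

if __name__ == "__main__":
    print("small slot parses:", parses_cleanly(small))
    print("large slot parses:", parses_cleanly(large))
```

The truncated copy ends in the middle of a string, so the parser fails; the full copy parses. Whether sg-bridge truncates in exactly this way is not stated in this report.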
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Service Telemetry Framework 1.3 (sg-core-container) security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0587