Created attachment 1835704
Full debug logs of the ceil-meter SG

Description of problem:
STF 1.3 is configured to monitor multiple OSP 16 clouds with the out-of-the-box configuration (i.e. by following the official documentation [1]). The sg-core container of the ceil-meter Smart Gateway regularly fails on incoming messages with the following errors:

> $ oc logs -f default-tst-ceil-meter-smartgateway-5698bb44dc-4z4vs
> [...]
> 2021-10-21 08:45:20 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readEscapedChar: invalid escape char after \, error found in #10 byte of ...|ephemeral\|..., bigger context ...|us\": 1, \"ram\": 1024, \"disk\": 40, \"ephemeral\|..., handler: ceilometer-metrics[socket]]
> 2021-10-21 08:45:20 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|"vcpus\": |..., bigger context ...|": \"11\", \"name\": \"std.cpu1ram1\", \"vcpus\": |..., handler: ceilometer-metrics[socket]]
> [...]

Full log output is attached, with "dumpMessages" enabled in the SG configuration for increased verbosity.

Actual results:
Not exhaustive, but what has been observed so far:
- Some metrics (e.g. cpu_ceilometer) are missing for some overcloud compute nodes in Prometheus/Grafana, so some dashboards (e.g. the Virtual Machine dashboard) only work partially (incomplete lists of projects and VMs).

Expected results:
All the metrics/events of all the overcloud compute nodes can be seen in Prometheus/Grafana.
I was asked to verify this and found that it was not fixed in the listed package.

$ oc describe po default-cops04-ceil-meter-smartgateway-59b8dd6b64-m892q | grep -i image
Image: image-registry.openshift-image-registry.svc:5000/service-telemetry/stf-sg-core:4.0.4-1
Image ID: image-registry.openshift-image-registry.svc:5000/service-telemetry/stf-sg-core@sha256:f1587eb3ef058462e39cff35f5dc8b81e741b087de00ce2f04ff3d6ee2672355

$ oc logs default-cops04-ceil-meter-smartgateway-59b8dd6b64-m892q -c sg-core | strings | grep unexpected
2022-02-10 17:58:41 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|
2022-02-10 17:58:41 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|x-opt-qd.i|..., bigger context ...|/default-interconnect-675dd97bc4-pltbs
2022-02-10 17:58:41 [DEBUG] failed handling message [handler: ceilometer-metrics[socket], error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|-pltbs
2022-02-10 17:58:41 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|-opt-qd.in|..., bigger context ...|default-interconnect-675dd97bc4-pltbs
2022-02-10 18:01:41 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|x-opt-qd.i|..., bigger context ...|/default-interconnect-675dd97bc4-pltbs
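For anyone repeating this check, grepping for "unexpected" only catches one of the failure types seen in this report. Below is a minimal sketch (not part of sg-core or STF) that tallies "failed handling message" lines by the parser failure name following "OsloMessage:". The function name tally_failures and the regular expression are assumptions based only on the log lines quoted in this bug.

```python
import re
from collections import Counter

# Captures the parser failure name that follows "OsloMessage: " in the sg-core
# debug lines quoted in this report (e.g. "readStringSlowPath").
# The pattern is an assumption derived from those quoted lines only.
FAILURE_RE = re.compile(r"failed handling message \[.*?OsloMessage: (\w+):")


def tally_failures(lines):
    """Count 'failed handling message' log lines by parser failure name."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

To use it on a live pod, the output of the "oc logs ... -c sg-core | strings" pipeline above could be passed in line by line, for example by calling tally_failures(sys.stdin) from a small wrapper script. An empty tally after a few minutes of traffic would suggest the failures have stopped.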
Moving this back to ON_QA. It seems some information was missing: increasing the ringBufferSize on the sg-bridge side was already a required dependency of this fix. I'm taking this and will attempt to verify it again. A dependent bug against Service Telemetry Operator has already been filed to expose this setting and increase its default value; it is tracked in the linked dependency. It should still be possible to verify the sg-core side as part of the upcoming release.
Verified this by increasing the ringBufferSize on the bridge container in the Ceilometer metrics Smart Gateway deployment. To do this I scaled the Service Telemetry Operator down to 0 pods so it would not revert changes to the SmartGateway manifest for default-cops04-ceil-meter. The manifest changes look like the following:

apiVersion: smartgateway.infra.watch/v2
kind: SmartGateway
metadata:
  ...
spec:
  applications:
  - config: |
      host: 0.0.0.0
      port: 8081
      withTimeStamp: true
    name: prometheus
  bridge:
    amqpUrl: amqp://default-interconnect.service-telemetry.svc.cluster.local:5673/anycast/ceilometer/cops04-metering.sample
    ringBufferSize: 16384
  handleErrors: true
  logLevel: debug
  ...

Before making this change I deployed with the pre-release artifacts and reproduced the issue. I then applied the changes above to enable the new ring buffer size and monitored the output for several minutes (before the change, the corrupted message would typically show up within 30-60 seconds).

The full solution requires the changes linked as a dependency of this issue, but the code as implemented does resolve the issue once the bridge container is updated to reflect the larger buffer capacity.
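To illustrate why a larger ringBufferSize could make the "unexpected end of input" errors go away, here is a minimal sketch. It assumes the failures come from messages being cut to the size of a bridge buffer slot before sg-core parses them; this report does not confirm that mechanism in detail. The message is synthetic, the key names are illustrative, and the 2048-byte slot is a hypothetical smaller size. Only 16384 comes from the manifest above.

```python
import json


def deliver(payload: bytes, slot_size: int) -> bytes:
    """Model a fixed-size buffer slot: bytes beyond slot_size are dropped."""
    return payload[:slot_size]


def try_decode(data: bytes):
    """Attempt a JSON decode and return (ok, detail)."""
    try:
        json.loads(data.decode("utf-8"))
        return True, "decoded"
    except json.JSONDecodeError as exc:
        return False, exc.msg


# Synthetic stand-in for a metering message (NOT a real Ceilometer sample).
# The inner document is embedded as an escaped JSON string, which is why the
# quoted errors show sequences like \"vcpus\" in their context.
flavor = {"id": "11", "name": "std.cpu1ram1", "vcpus": 1, "ram": 1024,
          "disk": 40, "ephemeral": 0}
inner = json.dumps({"payload": [{"flavor": flavor, "padding": "x" * 4000}]})
message = json.dumps({"request": {"oslo.message": inner}}).encode("utf-8")

for slot_size in (2048, 16384):  # 2048 is hypothetical; 16384 is from the manifest
    ok, detail = try_decode(deliver(message, slot_size))
    print(f"slot={slot_size} message={len(message)} bytes ok={ok} ({detail})")
```

With the smaller slot the decoder sees a document that stops in the middle of a string, which is the same class of failure as the readStringSlowPath and readEscapedChar errors reported here. With the larger slot the whole message fits and decodes.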
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Service Telemetry Framework 1.3 (sg-core-container) security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0587