Bug 2016460 - [STF 1.3] sg-core fails handling some messages due to some invalid escape char
Summary: [STF 1.3] sg-core fails handling some messages due to some invalid escape char
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Service Telemetry Framework
Classification: Red Hat
Component: sg-core-container
Version: 1.3
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: z4
Target Release: 1.3 (STF)
Assignee: Leif Madsen
QA Contact: Leonid Natapov
Docs Contact: Joanne O'Flynn
URL:
Whiteboard:
Depends On:
Blocks: 2051615 2053683
 
Reported: 2021-10-21 15:10 UTC by ghurel
Modified: 2022-02-21 16:30 UTC

Fixed In Version: sg-core-container-4.0.4-1
Doc Type: Bug Fix
Doc Text:
In some cases, sg-core did not handle Ceilometer metrics properly. As a result, some Ceilometer metrics were not stored in Prometheus. In this release, the processing of metrics has been made more robust. While sg-core has been enhanced to support larger messages from Ceilometer, an additional change is required to pass the larger messages through the sg-bridge ring buffer. The changes required to fully support this functionality are being tracked in RHBZ#2053683.
Clone Of:
: 2051615 (view as bug list)
Environment:
Last Closed: 2022-02-21 16:30:17 UTC
Target Upstream Version:
Embargoed:


Attachments
Full debug logs of the ceil-meter SG (11.83 MB, text/plain)
2021-10-21 15:10 UTC, ghurel


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | infrawatch sg-core pull 74 | 0 | None | open | Increase reading buffer size on large messages | 2022-01-16 15:36:21 UTC
Github | infrawatch sg-core pull 84 | 0 | None | Merged | Increase reading buffer size on large messages (#74) | 2022-02-07 15:54:15 UTC
Red Hat Issue Tracker | STF-659 | 0 | None | None | None | 2021-11-17 16:30:30 UTC
Red Hat Product Errata | RHSA-2022:0587 | 0 | None | None | None | 2022-02-21 16:30:35 UTC

Description ghurel 2021-10-21 15:10:02 UTC
Created attachment 1835704
Full debug logs of the ceil-meter SG

Description of problem:
STF 1.3 is configured to monitor multiple OSP 16 clouds with the out-of-the-box configuration (i.e. by following the official documentation [1]).

The sg-core container of the ceil-meter Smart Gateway regularly fails on incoming messages with the following errors:

> $ oc logs -f default-tst-ceil-meter-smartgateway-5698bb44dc-4z4vs
> [...]
> 2021-10-21 08:45:20 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readEscapedChar: invalid escape char after \, error found in #10 byte of ...|ephemeral\|..., bigger context ...|us\": 1, \"ram\": 1024, \"disk\": 40, \"ephemeral\|..., handler: ceilometer-metrics[socket]]
> 2021-10-21 08:45:20 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|"vcpus\": |..., bigger context ...|": \"11\", \"name\": \"std.cpu1ram1\", \"vcpus\": |..., handler: ceilometer-metrics[socket]]
> [...]

Full log output is attached, with "dumpMessages" enabled in the SG configuration for increased verbosity.
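
Both errors above are consistent with a JSON document being cut off partway through, which also matches the title of the linked pull request ("Increase reading buffer size on large messages"). The following is a minimal Python sketch of that failure mode, not the sg-core code itself. The envelope layout is a simplified assumption, and `read_with_buffer` and the cut positions are hypothetical, chosen only to mimic the "ephemeral\" and "vcpus" contexts seen in the log.

```python
import json


def oslo_envelope(payload: dict) -> bytes:
    # Simplified, assumed envelope: the inner message is itself a JSON
    # string, so every inner quote is escaped as \" in the outer document.
    return json.dumps(
        {"oslo.version": "2.0", "oslo.message": json.dumps(payload)}
    ).encode("utf-8")


def read_with_buffer(message: bytes, buffer_size: int) -> bytes:
    # Hypothetical receiver that keeps only the first buffer_size bytes
    # and silently drops the rest.
    return message[:buffer_size]


def try_parse(buf: bytes) -> str:
    try:
        json.loads(buf)
        return "ok"
    except ValueError as exc:  # json.JSONDecodeError is a ValueError
        return f"error: {exc}"


payload = {"flavor": {"id": "11", "name": "std.cpu1ram1", "vcpus": 1,
                      "ram": 1024, "disk": 40, "ephemeral": 0}}
message = oslo_envelope(payload)

# Cut right after the backslash that escapes the quote following "ephemeral".
cut_after_backslash = message.index(b"ephemeral\\") + len(b"ephemeral\\")
# Cut in the middle of the escaped inner document, after the "vcpus" key.
cut_mid_string = message.index(b"vcpus") + len(b"vcpus")

for size in (len(message), cut_after_backslash, cut_mid_string):
    print(size, try_parse(read_with_buffer(message, size)))
```

Only the full-length read parses; both truncated reads fail inside a string, one of them directly after a backslash. The exact error wording differs between Python's `json` module and the parser used by sg-core.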


Actual results:
Not exhaustive, but what has been observed so far:
- some metrics (e.g. cpu_ceilometer) are missing in Prometheus/Grafana for some overcloud compute nodes, so some dashboards (e.g. the Virtual Machine dashboard) only partially work (incomplete lists of projects and VMs).

Expected results:
All the metrics/events of all the overcloud compute nodes can be seen in Prometheus/Grafana.

Comment 8 Chris Sibbitt 2022-02-10 18:54:18 UTC
I was asked to verify this and found that it was not fixed in the listed package.

$ oc describe po default-cops04-ceil-meter-smartgateway-59b8dd6b64-m892q | grep -i image
    Image:         image-registry.openshift-image-registry.svc:5000/service-telemetry/stf-sg-core:4.0.4-1
    Image ID:      image-registry.openshift-image-registry.svc:5000/service-telemetry/stf-sg-core@sha256:f1587eb3ef058462e39cff35f5dc8b81e741b087de00ce2f04ff3d6ee2672355


$ oc logs default-cops04-ceil-meter-smartgateway-59b8dd6b64-m892q -c sg-core | strings | grep unexpected
2022-02-10 17:58:41 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|
2022-02-10 17:58:41 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|x-opt-qd.i|..., bigger context ...|/default-interconnect-675dd97bc4-pltbs
2022-02-10 17:58:41 [DEBUG] failed handling message [handler: ceilometer-metrics[socket], error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|-pltbs
2022-02-10 17:58:41 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|-opt-qd.in|..., bigger context ...|default-interconnect-675dd97bc4-pltbs
2022-02-10 18:01:41 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|x-opt-qd.i|..., bigger context ...|/default-interconnect-675dd97bc4-pltbs
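
When comparing runs before and after a change, it can help to tally the failures by parser error rather than reading the grep output by eye. The following is a rough Python sketch under the assumption that the log lines keep the shape shown above. The `count_failures` helper and its regular expression are hypothetical, and the sample lines are copied from this report (the last one shortened).

```python
import re
from collections import Counter

# Matches sg-core debug lines like the ones above and captures the parser
# function named after "OsloMessage:" (e.g. readStringSlowPath).
FAILED = re.compile(r"\[DEBUG\] failed handling message \[.*?OsloMessage: (\w+):")


def count_failures(lines):
    # Count failed messages per parser error; non-matching lines are ignored.
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "2022-02-10 17:58:41 [DEBUG] failed handling message [error: "
    "ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: "
    "unexpected end of input, error found in #10 byte of ...|",
    "2022-02-10 17:58:41 [DEBUG] failed handling message [handler: "
    "ceilometer-metrics[socket], error: ceilometer.OsloSchema.Request: "
    "OsloMessage: readStringSlowPath: unexpected end of input, error found "
    "in #10 byte of ...|-pltbs",
    "2021-10-21 08:45:20 [DEBUG] failed handling message [error: "
    "ceilometer.OsloSchema.Request: OsloMessage: readEscapedChar: "
    "invalid escape char after \\,",
]

print(count_failures(sample))
```

In practice the lines could be read from the saved output of `oc logs ... -c sg-core` instead of the inline sample.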

Comment 10 Leif Madsen 2022-02-11 18:36:51 UTC
Moving this back to ON_QA because some information was missing: increasing the ringBufferSize on the sg-bridge side is also required. I'm taking this and will attempt to verify it again. A bug has already been filed against Service Telemetry Operator to expose this setting and increase its default value; it is tracked in the linked dependency. It should still be possible to verify the sg-core side as part of the upcoming release.

Comment 11 Leif Madsen 2022-02-11 20:54:15 UTC
Verified this by increasing the ringBufferSize on the bridge container in the Ceilometer metrics Smart Gateway deployment. I did this by scaling the Service Telemetry Operator down to 0 pods so that it wouldn't revert changes to the SmartGateway manifest for default-cops04-ceil-meter. The manifest changes look like the following:

apiVersion: smartgateway.infra.watch/v2
kind: SmartGateway
metadata:
...
spec:
  applications:
  - config: |
      host: 0.0.0.0
      port: 8081
      withTimeStamp: true
    name: prometheus
  bridge:
    amqpUrl: amqp://default-interconnect.service-telemetry.svc.cluster.local:5673/anycast/ceilometer/cops04-metering.sample
    ringBufferSize: 16384
  handleErrors: true
  logLevel: debug
...

Prior to making this change I deployed with the pre-release artifacts, reproduced the issue, made the changes above to enable the new ring buffer size, and then monitored the output for several minutes (the corrupted message would typically show up within 30-60 seconds).

The full solution requires the changes tracked in the bug linked as a dependency of this issue, but the code as implemented does resolve the issue (once the bridge container is updated to reflect the larger buffer capacity).

Comment 16 errata-xmlrpc 2022-02-21 16:30:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Service Telemetry Framework 1.3 (sg-core-container) security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0587

