
Bug 2016460

Summary: [STF 1.3] sg-core fails handling some messages due to some invalid escape char
Product: Service Telemetry Framework
Reporter: ghurel
Component: sg-core-container
Assignee: Leif Madsen <lmadsen>
Status: CLOSED ERRATA
QA Contact: Leonid Natapov <lnatapov>
Severity: high
Docs Contact: Joanne O'Flynn <joflynn>
Priority: high
Version: 1.3
CC: csibbitt, lmadsen, mmagr
Target Milestone: z4
Keywords: Triaged, ZStream
Target Release: 1.3 (STF)
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: sg-core-container-4.0.4-1
Doc Type: Bug Fix
Doc Text:
In some cases, Ceilometer metrics were not handled properly by sg-core. This resulted in some Ceilometer metrics not being stored in Prometheus. In this release, the processing of metrics has been enhanced to be more robust. While sg-core has been enhanced to support larger messages from Ceilometer, an additional change is required to support passing the larger messages through the sg-bridge ring buffer. The changes required to fully support this functionality are being tracked in RHBZ#2053683.
Story Points: ---
Clone Of:
: 2051615
Environment:
Last Closed: 2022-02-21 16:30:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2051615, 2053683
Attachments:
  Full debug logs of the ceil-meter SG (flags: none)

Description ghurel 2021-10-21 15:10:02 UTC
Created attachment 1835704 [details]
Full debug logs of the ceil-meter SG

Description of problem:
STF 1.3 is configured to monitor multiple OSP 16 clouds with the out-of-the-box configuration (i.e. by following the official documentation [1]).

The sg-core container of the ceil-meter Smart Gateway regularly fails to handle incoming messages, with the following errors:

> $ oc logs -f default-tst-ceil-meter-smartgateway-5698bb44dc-4z4vs
> [...]
> 2021-10-21 08:45:20 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readEscapedChar: invalid escape char after \, error found in #10 byte of ...|ephemeral\|..., bigger context ...|us\": 1, \"ram\": 1024, \"disk\": 40, \"ephemeral\|..., handler: ceilometer-metrics[socket]]
> 2021-10-21 08:45:20 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|"vcpus\": |..., bigger context ...|": \"11\", \"name\": \"std.cpu1ram1\", \"vcpus\": |..., handler: ceilometer-metrics[socket]]
> [...]

Full log output is attached, with "dumpMessages" enabled in the SG configuration for increased verbosity.


Actual results:
Not exhaustive, but what has been observed so far:
- some metrics (e.g. cpu_ceilometer) are missing for some overcloud compute nodes in Prometheus/Grafana, causing some dashboards (e.g. the Virtual Machine dashboard) to work only partially (incomplete lists of projects and VMs).

Expected results:
All the metrics/events of all the overcloud compute nodes can be seen in Prometheus/Grafana.

Comment 8 Chris Sibbitt 2022-02-10 18:54:18 UTC
I was asked to verify this and found that it was not fixed in the listed package.

$ oc describe po default-cops04-ceil-meter-smartgateway-59b8dd6b64-m892q | grep -i image
    Image:         image-registry.openshift-image-registry.svc:5000/service-telemetry/stf-sg-core:4.0.4-1
    Image ID:      image-registry.openshift-image-registry.svc:5000/service-telemetry/stf-sg-core@sha256:f1587eb3ef058462e39cff35f5dc8b81e741b087de00ce2f04ff3d6ee2672355


$ oc logs default-cops04-ceil-meter-smartgateway-59b8dd6b64-m892q -c sg-core | strings | grep unexpected
2022-02-10 17:58:41 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|
2022-02-10 17:58:41 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|x-opt-qd.i|..., bigger context ...|/default-interconnect-675dd97bc4-pltbs
2022-02-10 17:58:41 [DEBUG] failed handling message [handler: ceilometer-metrics[socket], error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|-pltbs
2022-02-10 17:58:41 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|-opt-qd.in|..., bigger context ...|default-interconnect-675dd97bc4-pltbs
2022-02-10 18:01:41 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|x-opt-qd.i|..., bigger context ...|/default-interconnect-675dd97bc4-pltbs
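
To see which parser errors dominate rather than only grepping for "unexpected", the same log output can be tallied per error type. The following is a minimal sketch, not part of sg-core: the regular expression and the function name count_parse_failures are assumptions made for illustration, and the two sample lines are copied from the output above.

```python
import re
from collections import Counter

# Assumed pattern (not from sg-core): captures the parser function and the
# error text that follow "OsloMessage:" in a "failed handling message" line.
PATTERN = re.compile(
    r"failed handling message \[.*?OsloMessage: (\w+): ([^,]+),"
)


def count_parse_failures(lines):
    """Count failed-handling log lines per (parser function, error text)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Two lines copied from the log output above, plus one unrelated line.
SAMPLE = [
    r"2022-02-10 17:58:41 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|",
    r"2022-02-10 17:58:41 [DEBUG] failed handling message [handler: ceilometer-metrics[socket], error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|-pltbs",
    "an unrelated log line",
]

if __name__ == "__main__":
    for (func, error), n in count_parse_failures(SAMPLE).items():
        print(f"{n:4d}  {func}: {error}")
```

The pattern tolerates both field orders seen in the logs (error first, or handler first), since it only anchors on "OsloMessage:".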

Comment 10 Leif Madsen 2022-02-11 18:36:51 UTC
Moving this back to ON_QA. It seems some information was missing: increasing the ringBufferSize on the sg-bridge side is also required. I'm taking this and will attempt to verify it again. A dependency in Service Telemetry Operator has already been filed to expose the parameter and increase its default value; it is tracked in the linked dependency. It should still be possible to verify the sg-core side as part of the upcoming release.

Comment 11 Leif Madsen 2022-02-11 20:54:15 UTC
Verified this by increasing the ringBufferSize on the bridge container in the Ceilometer metrics Smart Gateway deployment. I did this after scaling the Service Telemetry Operator down to 0 pods so that it would not revert changes to the SmartGateway manifest for default-cops04-ceil-meter. The manifest changes look like the following:

apiVersion: smartgateway.infra.watch/v2
kind: SmartGateway
metadata:
...
spec:
  applications:
  - config: |
      host: 0.0.0.0
      port: 8081
      withTimeStamp: true
    name: prometheus
  bridge:
    amqpUrl: amqp://default-interconnect.service-telemetry.svc.cluster.local:5673/anycast/ceilometer/cops04-metering.sample
    ringBufferSize: 16384
  handleErrors: true
  logLevel: debug
...

To verify, I deployed the pre-release artifacts, reproduced the issue, applied the changes above to enable the new ring buffer size, and then monitored the output for several minutes (the corrupted message would typically show up within 30-60 seconds).

The full solution requires the changes tracked in the linked dependency of this issue, but the code as implemented does resolve the issue (once the bridge container is updated to reflect the larger buffer capacity).
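
The failure mode is consistent with a message being cut off when it is larger than the buffer it passes through. The sketch below illustrates this under stated assumptions; it is not the sg-bridge or sg-core implementation. The helper names, the 64-byte slot size and the message layout are hypothetical, loosely modelled on the fields visible in the logged context (name, vcpus, ram, disk, ephemeral).

```python
import json


def pass_through_slot(payload: bytes, slot_size: int) -> bytes:
    """Model a fixed-size buffer slot: bytes beyond slot_size are dropped."""
    return payload[:slot_size]


def build_message(flavor_name: str) -> bytes:
    """Build an illustrative Oslo-style envelope (layout is an assumption).

    The inner payload is JSON encoded as a string inside the outer JSON,
    which is why the logged context is full of escaped quotes.
    """
    flavor = {"name": flavor_name, "vcpus": 1, "ram": 1024,
              "disk": 40, "ephemeral": 0}
    inner = json.dumps({"payload": [{"resource_metadata": {"flavor": flavor}}]})
    return json.dumps({"oslo.version": "2.0", "oslo.message": inner}).encode()


def try_parse(raw: bytes) -> bool:
    """Return True if both the envelope and the inner message parse."""
    try:
        envelope = json.loads(raw)
        json.loads(envelope["oslo.message"])
        return True
    except json.JSONDecodeError:
        return False


if __name__ == "__main__":
    msg = build_message("std.cpu1ram1")
    print(try_parse(pass_through_slot(msg, 64)))     # False: message cut short
    print(try_parse(pass_through_slot(msg, 16384)))  # True: message fits
```

With the small slot the parser reaches the end of its input in the middle of a string, which corresponds to the "unexpected end of input" class of error; with a slot larger than the message, parsing succeeds.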

Comment 16 errata-xmlrpc 2022-02-21 16:30:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Service Telemetry Framework 1.3 (sg-core-container) security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0587