Bug 2051615

Summary: [STF 1.4] sg-core fails handling some messages due to some invalid escape char
Product: Service Telemetry Framework Reporter: Leif Madsen <lmadsen>
Component: sg-core-container Assignee: Martin Magr <mmagr>
Status: CLOSED ERRATA QA Contact: Leonid Natapov <lnatapov>
Severity: high Docs Contact: Joanne O'Flynn <joflynn>
Priority: high    
Version: 1.4 CC: ghurel, joflynn, lmadsen, lnatapov, mmagr
Target Milestone: z1 Keywords: Triaged, ZStream
Target Release: 1.4 (STF)   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: sg-core-container-4.1.1-1 Doc Type: Bug Fix
Doc Text:
In some cases, Ceilometer metrics were not handled properly by sg-core. As a result, some Ceilometer metrics were not stored in Prometheus. In this release, the processing of metrics has been made more robust. Although sg-core now supports larger messages from Ceilometer, an additional change is required to pass the larger messages through the sg-bridge ring buffer. The changes required to fully support this functionality are being tracked in RHBZ#2053681.
Story Points: ---
Clone Of: 2016460 Environment:
Last Closed: 2022-02-21 13:50:38 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2016460    
Bug Blocks: 2053681    

Description Leif Madsen 2022-02-07 15:51:05 UTC
+++ This bug was initially created as a clone of Bug #2016460 +++

Description of problem:
STF 1.3 is configured to monitor multiple OSP 16 clouds with the out-of-the-box configuration (i.e., by following the official documentation [1]).

The sg-core container of the ceil-meter Smart Gateway regularly fails on incoming messages with the following errors:

> $ oc logs -f default-tst-ceil-meter-smartgateway-5698bb44dc-4z4vs
> [...]
> 2021-10-21 08:45:20 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readEscapedChar: invalid escape char after \, error found in #10 byte of ...|ephemeral\|..., bigger context ...|us\": 1, \"ram\": 1024, \"disk\": 40, \"ephemeral\|..., handler: ceilometer-metrics[socket]]
> 2021-10-21 08:45:20 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|"vcpus\": |..., bigger context ...|": \"11\", \"name\": \"std.cpu1ram1\", \"vcpus\": |..., handler: ceilometer-metrics[socket]]
> [...]

Full log output is attached, with "dumpMessages" enabled in the SG configuration for increased verbosity.
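
Both errors above end in the middle of an escaped string, which is what a parser reports when a message is cut off before it is complete. The following Go program is a minimal sketch of that failure mode, not the sg-core code: the `osloEnvelope` type, the helper names, and the sample payload are hypothetical, and it uses the standard `encoding/json` package, so the error wording differs from the log lines above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// osloEnvelope is a hypothetical, simplified stand-in for the message
// envelope: the metric payload is a JSON document serialized into a
// string field, so every inner quote is escaped as \".
type osloEnvelope struct {
	Message string `json:"oslo.message"`
}

// buildEnvelope serializes inner and wraps it as an escaped JSON string.
func buildEnvelope(inner map[string]interface{}) ([]byte, error) {
	raw, err := json.Marshal(inner)
	if err != nil {
		return nil, err
	}
	return json.Marshal(osloEnvelope{Message: string(raw)})
}

// decodeEnvelope parses the outer envelope, then the inner payload.
func decodeEnvelope(data []byte) (map[string]interface{}, error) {
	var env osloEnvelope
	if err := json.Unmarshal(data, &env); err != nil {
		return nil, fmt.Errorf("outer envelope: %w", err)
	}
	var inner map[string]interface{}
	if err := json.Unmarshal([]byte(env.Message), &inner); err != nil {
		return nil, fmt.Errorf("inner payload: %w", err)
	}
	return inner, nil
}

// truncate simulates a fixed-size buffer (an assumption for this sketch)
// that silently drops every byte past limit.
func truncate(data []byte, limit int) []byte {
	if len(data) <= limit {
		return data
	}
	return data[:limit]
}

func main() {
	// Sample values chosen for illustration only.
	full, err := buildEnvelope(map[string]interface{}{
		"vcpus": 1, "ram": 1024, "disk": 40, "ephemeral": 0,
	})
	if err != nil {
		panic(err)
	}

	if _, err := decodeEnvelope(full); err != nil {
		panic(err)
	}
	fmt.Println("complete message decoded")

	// Drop the last 10 bytes, as an undersized buffer would.
	if _, err := decodeEnvelope(truncate(full, len(full)-10)); err != nil {
		fmt.Println("truncated message rejected:", err)
	}
}
```

In this sketch the complete message decodes, while the truncated copy is rejected by the parser and never reaches the inner payload, so the metric it carried would be lost.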


Actual results:
Not exhaustive, but what has been observed so far:
- Some metrics (e.g. cpu_ceilometer) are missing in Prometheus/Grafana for some overcloud compute nodes, so some dashboards (e.g. the Virtual Machine dashboard) work only partially (incomplete lists of projects and VMs).

Expected results:
All the metrics/events of all the overcloud compute nodes can be seen in Prometheus/Grafana.

Comment 8 Leif Madsen 2022-02-11 18:41:26 UTC
Verified this is working. Depends on changes tracked in RHBZ#2053681.

Comment 12 errata-xmlrpc 2022-02-21 13:50:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Service Telemetry Framework 1.4 (sg-core-container) security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0585