Bug 2080895 - Metrics with label values that change during runtime should not expose the old values
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10-rc3
Hardware: All
OS: All
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.10.z
Assignee: Martin Kennelly
QA Contact: Weibin Liang
URL:
Whiteboard:
Depends On: 2080894
Blocks:
Reported: 2022-05-02 09:47 UTC by Martin Kennelly
Modified: 2022-05-18 11:51 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2080894
Environment:
Last Closed: 2022-05-18 11:51:02 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 1070 0 None open Bug 2080895: Ensure stale information is not present in metrics with labels when they change 2022-05-02 09:48:40 UTC
Red Hat Product Errata RHBA-2022:2178 0 None None None 2022-05-18 11:51:13 UTC

Description Martin Kennelly 2022-05-02 09:47:29 UTC
+++ This bug was initially created as a clone of Bug #2080894 +++

Description of problem:
We use labelled (vector) metrics to export
info. If the label values change, we still
export the old label values, which confuses users
because they have no way of determining which labelled
metric is the latest.

We must reset each vector metric before setting it, so
that we export only the current state and not old states,
which should already have been scraped. This is needed
because we use labels whose values can change dynamically
during runtime.
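
The reset-before-set idea can be sketched as follows. This is a minimal illustration only, not the ovn-kubernetes code: `LabelledGauge` and `publish_roles` are hypothetical names, and the class is a toy stand-in for a labelled gauge rather than a real metrics library.

```python
class LabelledGauge:
    """Toy stand-in for a labelled (vector) gauge; hypothetical, not a real library."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._series = {}  # tuple of label values -> float

    def set(self, value, **labels):
        # Each distinct combination of label values is its own series.
        key = tuple(labels[n] for n in self.label_names)
        self._series[key] = float(value)

    def reset(self):
        # Drop every series so that only values set afterwards are exposed.
        self._series.clear()

    def expose(self):
        # Render the series in a Prometheus-like text form.
        lines = []
        for key, value in self._series.items():
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value:g}")
        return lines


def publish_roles(gauge, roles, reset_first):
    """Publish the current role of every database (roles: db_name -> role).

    With reset_first=False, series for old label values are never removed.
    With reset_first=True, the vector is cleared and then rebuilt from the
    current state, so every current series must be set again after the reset.
    """
    if reset_first:
        gauge.reset()
    for db_name, role in roles.items():
        gauge.set(1, db_name=db_name, server_role=role)
```

Calling `publish_roles` twice with `reset_first=False`, first with the role `leader` and then with `follower`, leaves two series for the same database. With `reset_first=True` only the latest one remains.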

This is occurring in the OVN metrics exported by
ovnkube-node.

Follow-on work is needed to fix the OVS metrics that are
used by ovs-exporter.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Get OVN Northbound database leader (use cluster/status command from ovn-nbctl)
2. Log in to the ovnkube-node container on that node.
3. Curl the metrics endpoint and grep for ovn_db_cluster_server_role. Note the server_role label is leader.
4. On the same pod, repeatedly kill only the nbdb container process until the ovn_db_cluster_server_role metric changes its 'server_role' label from leader to 'cluster follower'. Do not kill the whole pod. Exec into the container repeatedly and send SIGTERM to the nbdb process.
5. The metric ovn_db_cluster_server_role has two values - it still exposes 'server_role' as leader but also as 'cluster follower'.

Note: it may be very awkward to get the northbound database to switch from leader to 'cluster follower'. If you can freeze the process instead of killing it, you may have a better chance of it changing.

Actual results:
There are two values for ovn_db_cluster_server_role

Expected results:
There should only be one value or entry for ovn_db_cluster_server_role
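
The expected result can be checked mechanically against the text returned by the metrics endpoint. The sketch below assumes the output follows the Prometheus text format and that each database is identified by its `db_name` label; the function names `roles_by_database` and `stale_databases` are hypothetical.

```python
import re
from collections import defaultdict

# One sample line: metric_name{label="value",...} value
SERIES_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
# One label pair inside the braces: key="value"
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def roles_by_database(metrics_text, metric="ovn_db_cluster_server_role"):
    """Map each db_name to the set of server_role values exposed for it."""
    roles = defaultdict(set)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SERIES_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        roles[labels.get("db_name", "")].add(labels.get("server_role", ""))
    return dict(roles)


def stale_databases(metrics_text):
    """Return the databases that expose more than one server_role at once."""
    return sorted(
        db for db, seen in roles_by_database(metrics_text).items() if len(seen) > 1
    )
```

An empty list from `stale_databases` means every database exposes exactly one role. A database name in the list means that an old label value is still being exported alongside the current one.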

Additional info:
Contact Nadia Pinaeva for further information if you cannot reproduce. I may be on PTO.

Comment 1 Martin Kennelly 2022-05-02 09:53:51 UTC
Correction: I mentioned 'cluster follower' above - I meant just 'follower'.

Comment 4 zhaozhanqi 2022-05-06 08:46:16 UTC
@weliang I remember you have some tests on metrics. Could you help take a look at this bug?

Comment 5 Weibin Liang 2022-05-10 16:29:20 UTC
Tested and verified in 4.10.0-0.nightly-2022-05-10-060208

sh-4.4# curl 127.0.0.1:29105/metrics | grep ovn_db_cluster_server_role
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
# HELP ovn_db_cluster_server_role A metric with a constant '1' value labeled by database name, cluster uuid, server uuid and server role
# TYPE ovn_db_cluster_server_role gauge
ovn_db_cluster_server_role{cluster_id="82e5d6af-4625-4e09-8c0c-8937feb1e7e7",db_name="OVN_Southbound",server_id="b334bba1-8517-4286-a204-74d50d4e6ec8",server_role="leader"} 1
ovn_db_cluster_server_role{cluster_id="b1ab83ff-455f-4b5f-9b43-16dbe2dfe412",db_name="OVN_Northbound",server_id="06f6521b-9b6a-49c3-892d-866aa5b87736",server_role="leader"} 1
100 36022    0 36022    0     0   703k      0 --:--:-- --:--:-- --:--:--  717k
sh-4.4# curl 127.0.0.1:29105/metrics | grep ovn_db_cluster_server_role
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
# HELP ovn_db_cluster_server_role A metric with a constant '1' value labeled by database name, cluster uuid, server uuid and server role
# TYPE ovn_db_cluster_server_role gauge
ovn_db_cluster_server_role{cluster_id="82e5d6af-4625-4e09-8c0c-8937feb1e7e7",db_name="OVN_Southbound",server_id="b334bba1-8517-4286-a204-74d50d4e6ec8",server_role="leader"} 1
ovn_db_cluster_server_role{cluster_id="b1ab83ff-455f-4b5f-9b43-16dbe2dfe412",db_name="OVN_Northbound",server_id="06f6521b-9b6a-49c3-892d-866aa5b87736",server_role="follower"} 1
100 36021    0 36021    0     0   495k      0 --:--:-- --:--:-- --:--:--  495k
sh-4.4# curl 127.0.0.1:29105/metrics | grep ovn_db_cluster_server_role
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
# HELP ovn_db_cluster_server_role A metric with a constant '1' value labeled by database name, cluster uuid, server uuid and server role
# TYPE ovn_db_cluster_server_role gauge
ovn_db_cluster_server_role{cluster_id="82e5d6af-4625-4e09-8c0c-8937feb1e7e7",db_name="OVN_Southbound",server_id="b334bba1-8517-4286-a204-74d50d4e6ec8",server_role="leader"} 1
ovn_db_cluster_server_role{cluster_id="b1ab83ff-455f-4b5f-9b43-16dbe2dfe412",db_name="OVN_Northbound",server_id="06f6521b-9b6a-49c3-892d-866aa5b87736",server_role="follower"} 1
100 35968    0 35968    0     0   468k      0 --:--:-- --:--:-- --:--:--  474k
sh-4.4#

Comment 8 errata-xmlrpc 2022-05-18 11:51:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.14 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2178
