Bug 2010841 - container_runtime_crio_operations_latency_microseconds does not have quantile anymore
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.6.z
Assignee: Sascha Grunert
QA Contact: Weinan Liu
URL:
Whiteboard:
Depends On: 1997926
Blocks: 2010831
Reported: 2021-10-05 14:01 UTC by Sascha Grunert
Modified: 2022-08-24 01:18 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 1997926
Environment:
Last Closed: 2022-08-24 01:18:07 UTC
Target Upstream Version:
Embargoed:


Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	cri-o cri-o pull 5264	0	None	Merged	[release-1.19] BZ#2010841 Fix missing quantile in `latency_microseconds_total` metrics	2022-02-17 08:20:21 UTC

Description Sascha Grunert 2021-10-05 14:01:02 UTC
+++ This bug was initially created as a clone of Bug #1997926 +++

Description of problem:
-----------------------

Starting with 4.6 (and also in 4.7 and 4.8), the container_runtime_crio_operations_latency_microseconds metric no longer exposes quantile values.

Only container_runtime_crio_operations_latency_microseconds_sum and container_runtime_crio_operations_latency_microseconds_count are available.

Some customers monitor these quantile metrics on 4.5. With only _sum and _count, quantiles cannot be calculated; only the mean latency can be derived.
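
As an illustration of why _sum and _count are not enough, here is a minimal sketch that is not part of the original report. The two sample lists and the helper names are made up for the example. Both lists have the same sum and count, so they would produce identical _sum and _count series, yet their tail quantiles differ widely.

```python
import math

def mean_from_summary(total_sum, count):
    """The only statistic recoverable from _sum and _count alone."""
    return total_sum / count

def nearest_rank_quantile(samples, q):
    """Nearest-rank quantile over raw samples (needs the samples themselves)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

# Hypothetical latencies in microseconds; both lists sum to 1000 over 10 samples.
steady = [100] * 10
spiky = [10] * 9 + [910]

for name, samples in (("steady", steady), ("spiky", spiky)):
    print(
        name,
        "mean:", mean_from_summary(sum(samples), len(samples)),
        "p50:", nearest_rank_quantile(samples, 0.5),
        "p99:", nearest_rank_quantile(samples, 0.99),
    )
```

Both lists report a mean of 100, while the p99 is 100 for the steady list and 910 for the spiky one. That difference is what the quantile series on 4.5 exposed and what is lost without them.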


Version-Release number of selected component (if applicable):
-------------------------------------------------------------

OCP 4.6 (cri-o 1.19), 4.7 (cri-o 1.20), 4.8 (cri-o 1.21)


How reproducible:
-----------------

Always


On 4.5:
-------

The quantile information is still present:

[quicklab@upi-0 ~]$ oc version
Client Version: 4.5.41
Server Version: 4.5.41
Kubernetes Version: v1.18.3+d8ef5ad

[quicklab@upi-0 ~]$ NODEIP=$(oc get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
[quicklab@upi-0 ~]$ oc get --raw /metrics --server http://${NODEIP}:9537 | grep "container_runtime_crio_operations_latency"
# HELP container_runtime_crio_operations_latency_microseconds Latency in microseconds of CRI-O operations. Broken down by operation type.
# TYPE container_runtime_crio_operations_latency_microseconds summary
container_runtime_crio_operations_latency_microseconds{operation_type="Attach",quantile="0.5"} NaN
container_runtime_crio_operations_latency_microseconds{operation_type="Attach",quantile="0.9"} NaN
container_runtime_crio_operations_latency_microseconds{operation_type="Attach",quantile="0.99"} NaN
container_runtime_crio_operations_latency_microseconds_sum{operation_type="Attach"} 19510
container_runtime_crio_operations_latency_microseconds_count{operation_type="Attach"} 17
container_runtime_crio_operations_latency_microseconds{operation_type="ContainerStatus",quantile="0.5"} 32
container_runtime_crio_operations_latency_microseconds{operation_type="ContainerStatus",quantile="0.9"} 54
container_runtime_crio_operations_latency_microseconds{operation_type="ContainerStatus",quantile="0.99"} 92
.
.
.


On 4.8 (same for 4.6, 4.7):
---------------------------

There is no quantile information, only the _total_sum and _total_count metrics:

[quicklab@upi-0 ~]$ oc version
Client Version: 4.8.5
Server Version: 4.8.5
Kubernetes Version: v1.21.1+9807387

[quicklab@upi-0 ~]$ NODEIP=$(oc get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
[quicklab@upi-0 ~]$ oc get --raw /metrics --server http://${NODEIP}:9537 | grep "container_runtime_crio_operations_latency" | grep -c quantile
0
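
The `grep -c quantile` check above only gives a total. As a hedged sketch (not from the original report), the same check can be done per operation type by parsing the metrics text with the Python standard library. The function name `quantiles_by_operation` is hypothetical, the metric name is the one shown in the 4.5 output, and the regular expressions are simplified: they assume label values contain no escaped quotes or closing braces.

```python
import re
from collections import defaultdict

# Metric name as seen in the 4.5 output; passed as a parameter because
# other CRI-O versions may expose a differently suffixed name.
METRIC = "container_runtime_crio_operations_latency_microseconds"

SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def quantiles_by_operation(text, metric=METRIC):
    """Map operation_type -> {quantile: value} for samples carrying a quantile label."""
    found = defaultdict(dict)
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP and TYPE lines
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != metric:
            continue  # _sum and _count have a different sample name
        labels = dict(LABEL_RE.findall(match.group("labels")))
        if "quantile" in labels:
            operation = labels.get("operation_type", "")
            found[operation][labels["quantile"]] = float(match.group("value"))
    return dict(found)

# Excerpt copied from the 4.5 output earlier in this report.
SAMPLE = """\
# TYPE container_runtime_crio_operations_latency_microseconds summary
container_runtime_crio_operations_latency_microseconds{operation_type="Attach",quantile="0.5"} NaN
container_runtime_crio_operations_latency_microseconds{operation_type="Attach",quantile="0.9"} NaN
container_runtime_crio_operations_latency_microseconds{operation_type="Attach",quantile="0.99"} NaN
container_runtime_crio_operations_latency_microseconds_sum{operation_type="Attach"} 19510
container_runtime_crio_operations_latency_microseconds_count{operation_type="Attach"} 17
container_runtime_crio_operations_latency_microseconds{operation_type="ContainerStatus",quantile="0.5"} 32
container_runtime_crio_operations_latency_microseconds{operation_type="ContainerStatus",quantile="0.9"} 54
container_runtime_crio_operations_latency_microseconds{operation_type="ContainerStatus",quantile="0.99"} 92
"""

result = quantiles_by_operation(SAMPLE)
for operation, quantiles in sorted(result.items()):
    print(operation, quantiles)
```

An empty result for a node's metrics output would correspond to the `0` count shown above. A non-empty result lists which operation types report which quantiles.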


Expected results:
-----------------

Quantile values should be provided, as on 4.5.

--- Additional comment from Peter Hunt on 2021-08-26 14:27:44 UTC ---

Sascha can you PTAL

--- Additional comment from Sascha Grunert on 2021-08-27 09:32:43 UTC ---

Confirmed, fix is incoming in https://github.com/cri-o/cri-o/pull/5258. I'll backport the fix if merged.

--- Additional comment from Sascha Grunert on 2021-08-30 07:49:00 UTC ---

Upstream PR merged, cherry-picking now.

Comment 1 Tom Sweeney 2022-01-06 16:24:14 UTC
Not completed this sprint.

Comment 2 Sascha Grunert 2022-02-17 08:20:05 UTC
Upstream PR got merged.

Comment 4 Weinan Liu 2022-02-24 09:02:39 UTC
4.6.0-0.nightly-2022-02-24-010707 	Failed
Last build 4.6.0-0.nightly-2022-02-18-104035 does not have the fix 
Waiting for the next build to test

Comment 5 Weinan Liu 2022-02-25 06:49:51 UTC
Name 	Phase 	Started 	Failures 	Upgrades
4.6.0-0.nightly-2022-02-24-183332 	Failed 	11 hours ago 			
4.6.0-0.nightly-2022-02-24-142816 	Failed 	15 hours ago 			
4.6.0-0.nightly-2022-02-24-010707 	Failed 	28 hours ago

Still no Accepted build

Comment 6 Weinan Liu 2022-02-28 01:14:59 UTC
Name 	Phase 	Started 	Failures 	Upgrades
4.6.0-0.nightly-2022-02-24-183332 	Failed 	3 days ago 			
4.6.0-0.nightly-2022-02-24-142816 	Failed 	3 days ago 			
4.6.0-0.nightly-2022-02-24-010707 	Failed 	4 days ago 			
Still no Accepted build

Comment 7 Weinan Liu 2022-03-02 06:08:45 UTC
Fixed.
sh-4.4# NODEIP=$(oc get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
sh-4.4# oc get --raw /metrics --server http://${NODEIP}:9537 | grep "container_runtime_crio_operations_latency" |grep -c quantile
63
sh-4.4#
version   4.6.0-0.nightly-2022-03-02-005523   True        False         15m     Cluster version is 4.6.0-0.nightly-2022-03-02-005523

Comment 11 Scott Dodson 2022-08-24 01:18:07 UTC
This was fixed in 4.6.56; however, the errata automation failed to close it at the time.

