Bug 1866814 - etcd_object_counts displays incorrect count of Machine objects
Summary: etcd_object_counts displays incorrect count of Machine objects
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.6.0
Assignee: Abu Kashem
QA Contact: pmali
URL:
Whiteboard: LifecycleReset
Depends On:
Blocks:
 
Reported: 2020-08-06 13:07 UTC by Michael McCune
Modified: 2020-10-27 16:25 UTC
14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:25:25 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift kubernetes pull 357 0 None closed Bug 1866814: UPSTREAM: 94773: count of etcd object should be limited to the specified resource 2021-01-21 06:51:04 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:25:42 UTC

Description Michael McCune 2020-08-06 13:07:58 UTC
Description of problem:

While examining the metrics tab inside the administrator console interface, i noticed that the count of machine resources from the query `etcd_object_counts{resource="machines.machine.openshift.io"}` returned a different value than the query `mapi_machine_items`. The latter metric is exposed directly from the machine-api-operator.

The etcd_object_counts consistently showed 3 more machine resources than mapi_machine_items, and this discrepancy persisted for the life of the cluster.

Version-Release number of selected component (if applicable):

I have seen this behavior on 4.6 and 4.5.


How reproducible:

Very.

Steps to Reproduce:
1. Launch a new cluster
2. Navigate to the Monitoring->Metrics page of the web console
3. Create a query for `etcd_object_counts{resource="machines.machine.openshift.io"}`, observe the count
4. Create a query for `mapi_machine_items`, observe the count
5. Verify the number of machines by running this command from a terminal: `oc get machines -n openshift-machine-api`

Actual results:

On a new cluster, etcd_object_counts shows 9, while mapi_machine_items shows 6 (the correct value).

Expected results:

The etcd_object_counts should display the proper number of machines.

Additional info:

Comment 1 Michael McCune 2020-08-06 13:31:12 UTC
changing component to openshift-apiserver, it was suggested to me in slack that this metric comes from the apiserver.

Comment 2 Michal Fojtik 2020-09-05 14:18:11 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 3 Michael McCune 2020-09-08 13:44:43 UTC
i believe this bug is still valid for 4.6

Comment 4 Michal Fojtik 2020-09-08 14:10:35 UTC
The LifecycleStale keyword was removed because the needinfo? flag was reset and the bug got commented on recently.
The bug assignee was notified.

Comment 5 Stefan Schimanski 2020-09-08 15:08:52 UTC
> changing component to openshift-apiserver, it was suggested to me in slack that this metric comes from the apiserver.

It doesn't :)

This is a CRD; kube-apiserver provides the metric. It may well be that an API server counts wrong because it counts requests, or because the apiserver terminates before Prometheus can pull the metrics from the /metrics endpoint.

In general these metrics are not 100% correct all the time. Prometheus uses a pull model, which offers no guarantee of correctness at any given instant.

Comment 6 Michael McCune 2020-09-08 17:45:31 UTC
thanks for the reply Stefan. is there a better component that this should be assigned to?

part of the problem is that i consistently observe the behavior in this bug, and it will affect our ability to properly count these items in telemetry. i want to make sure that if we are consistently seeing incorrect values in this metric, we prescribe a different method for counting these objects.

i was advised to use etcd_object_counts instead of implementing individual component metrics for each CRD-based resource we want to count, but if these values are consistently wrong then i think we have a deeper problem with regards to our telemetry.

Comment 7 Abu Kashem 2020-09-10 23:00:11 UTC
Hi mimccune,
So I checked "etcd_object_counts" with a different CR and observed that it returns the correct result.

> avg(etcd_object_counts{resource="catalogsources.operators.coreos.com"})
The above query returned 4

> $ oc get catalogsources --all-namespaces

NAMESPACE               NAME                  DISPLAY               TYPE   PUBLISHER   AGE
openshift-marketplace   certified-operators   Certified Operators   grpc   Red Hat     14d
openshift-marketplace   community-operators   Community Operators   grpc   Red Hat     14d
openshift-marketplace   redhat-marketplace    Red Hat Marketplace   grpc   Red Hat     14d
openshift-marketplace   redhat-operators      Red Hat Operators     grpc   Red Hat     14d


"etcd_object_counts" is returning the correct count for catalogsources. 

This may be how the machine CRD is defined or could be an issue with the apiserver/etcd. We need to spend more time to understand what's going on.

Comment 8 Michael McCune 2020-09-11 12:55:30 UTC
(In reply to Abu Kashem from comment #7)
> This may be how the machine CRD is defined or could be an issue with the
> apiserver/etcd. We need to spend more time to understand what's going on.

thanks for the update Abu!

given those results it does sound like this might be localized to the machine resource. is there anything i can do from the machine-api side to help investigate this?

Comment 9 Abu Kashem 2020-09-11 17:07:59 UTC
> avg(etcd_object_counts{resource="machinesets.machine.openshift.io"})

If I do a query for "machinesets", I see "4", which is correct.

Somehow the count of "machinesets" is being added to the count of "machines".

I checked the apiserver log to see the underlying etcd keys.

> I0911 16:28:04.752416      18 store.go:1378] Monitoring machines.machine.openshift.io count at <storage-prefix>//machine.openshift.io/machines
> I0911 16:28:04.840828      18 store.go:1378] Monitoring machinesets.machine.openshift.io count at <storage-prefix>//machine.openshift.io/machinesets

My guess is - when kube-apiserver queries for the count, it specifies the key as a prefix. So the count for "machines" results in a query for a prefix of "machine.openshift.io/machines" and it includes the count for "machine.openshift.io/machinesets" as well, not the other way around.

This will require an upstream fix, and I am going to start working on it. In the meantime, you can verify this by adding a "machinesets" CR and seeing if the count for "machines" goes up and down accordingly.

 
Short term (while we wait for the upstream fix) you can use the following query:
> avg(etcd_object_counts{resource="machines.machine.openshift.io"}) - avg(etcd_object_counts{resource="machinesets.machine.openshift.io"})

I will update you once the upstream fix is made and it is available to OpenShift.
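The prefix-collision guess above can be demonstrated in isolation. The sketch below is a minimal model, not the actual apiserver code: the key layout is assumed from the two log lines quoted above (real keys also carry a storage prefix and differ in detail), and the upstream fix is modeled here as terminating the prefix with "/".

```python
# Assumed key layout, modeled on the apiserver "Monitoring ... count at"
# log lines above (hypothetical names; real etcd keys carry a storage prefix).
MACHINE_KEYS = [
    "/machine.openshift.io/machines/openshift-machine-api/master-0",
    "/machine.openshift.io/machines/openshift-machine-api/master-1",
    "/machine.openshift.io/machines/openshift-machine-api/master-2",
    "/machine.openshift.io/machines/openshift-machine-api/worker-a",
    "/machine.openshift.io/machines/openshift-machine-api/worker-b",
    "/machine.openshift.io/machines/openshift-machine-api/worker-c",
]
MACHINESET_KEYS = [
    "/machine.openshift.io/machinesets/openshift-machine-api/worker-a",
    "/machine.openshift.io/machinesets/openshift-machine-api/worker-b",
    "/machine.openshift.io/machinesets/openshift-machine-api/worker-c",
]
ALL_KEYS = MACHINE_KEYS + MACHINESET_KEYS

def count_by_prefix(keys, prefix):
    """Count keys starting with prefix, as an etcd prefix range query would."""
    return sum(1 for k in keys if k.startswith(prefix))

# Buggy: "machinesets" begins with the string "machines", so the bare
# prefix over-counts by the number of machinesets (9 instead of 6).
buggy = count_by_prefix(ALL_KEYS, "/machine.openshift.io/machines")

# Fix, as modeled here: terminate the prefix with "/" so it can only
# match keys belonging to the "machines" resource.
fixed = count_by_prefix(ALL_KEYS, "/machine.openshift.io/machines/")

# The short-term workaround works for the same reason: the buggy count is
# machines + machinesets, so subtracting machinesets recovers the truth.
machinesets = count_by_prefix(ALL_KEYS, "/machine.openshift.io/machinesets/")
print(buggy, fixed, buggy - machinesets)  # 9 6 6
```

This also matches the consistent off-by-3 seen on a fresh cluster: 6 machines plus 3 machinesets yields the observed count of 9.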

Comment 10 Michael McCune 2020-09-11 17:15:36 UTC
@Abu, thank you so much for the detailed explanation. it makes perfect sense to me, as the counts i was seeing were always off by 3, which means it counted the initial machines (6) and the initial machinesets (3).

(In reply to Abu Kashem from comment #9)
> Short term (while we wait for the upstream fix ) you can use the following
> query
> > avg(etcd_object_counts{resource="machines.machine.openshift.io"}) - avg(etcd_object_counts{resource="machinesets.machine.openshift.io"})
> 

awesome, this will unblock us in the short term.

> I will update you once the upstream fix is made and it is available to
> OpenShift.

perfect, thanks again =)

Comment 11 Abu Kashem 2020-09-15 15:00:14 UTC
upstream fix available, setting release to 4.6

Comment 12 Abu Kashem 2020-09-15 15:06:30 UTC
Upstream fix: https://github.com/kubernetes/kubernetes/pull/94773
We carried it to OpenShift 4.6 - https://github.com/openshift/kubernetes/pull/357

Hi mimccune,
once https://github.com/openshift/kubernetes/pull/357 merges into master and is available in a 4.6 CI or nightly build, you can give it a try.

> avg(etcd_object_counts{resource="machines.machine.openshift.io"})
The above query should return the correct count with the fix.

Comment 13 Michael McCune 2020-09-15 16:40:33 UTC
(In reply to Abu Kashem from comment #12)
> Upstream fix: https://github.com/kubernetes/kubernetes/pull/94773
> We carried it to OpenShift 4.6 -
> https://github.com/openshift/kubernetes/pull/357
> 
> Hi mimccune,
> once https://github.com/openshift/kubernetes/pull/357 merges into master and
> it is available into 4.6 ci or nightly build you can give it a try.
> 
> > avg(etcd_object_counts{resource="machines.machine.openshift.io"})
> The above query should return correct count with the fix.


awesome, thanks Abu!

Comment 20 errata-xmlrpc 2020-10-27 16:25:25 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

