Bug 1798450 - kube-aggregator: unavailableGauge is wrong
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.4.0
Assignee: Lukasz Szaszkiewicz
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks: 1798461
Reported: 2020-02-05 11:52 UTC by Lukasz Szaszkiewicz
Modified: 2020-05-04 11:34 UTC

Fixed In Version:
Doc Type: Release Note
Doc Text:
Fixed aggregator_unavailable_apiservice metric to have correct value.
Clone Of:
Clones: 1798461
Environment:
Last Closed: 2020-05-04 11:33:43 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | kubernetes-monitoring kubernetes-mixin pull 358 | 0 | None | closed | alerts when an aggregation API is down or it reports errors | 2020-08-07 05:35:48 UTC
Github | openshift origin pull 24496 | 0 | None | closed | Bug 1798450: makes unavailableGauge metric to always reflect the current state of a service | 2020-08-07 05:35:47 UTC
Red Hat Product Errata | RHBA-2020:0581 | 0 | None | None | None | 2020-05-04 11:34:25 UTC

Description Lukasz Szaszkiewicz 2020-02-05 11:52:56 UTC
The unavailableGauge metric is not always set and might report incorrect values, especially in an HA setup. There, one instance could observe a service as unavailable and mark it so, whereas another instance might observe it as available. That would prevent the first instance from reflecting the updated state, since it would not observe any change itself.
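
To illustrate the failure mode described above, here is a minimal Go sketch. It is not the kube-aggregator code: gaugeVec, syncOnChange, syncAlways and the apiservice name v1.example.io are hypothetical stand-ins, and the gauge is modelled as a plain map rather than a Prometheus client type. It assumes each instance compares what it observes against the shared, stored availability condition.

```go
package main

import "sync"

// gaugeVec is a hypothetical stand-in for a per-instance labelled gauge.
// Each aggregator instance would own one; values are keyed by apiservice name.
type gaugeVec struct {
	mu     sync.Mutex
	values map[string]float64
}

func newGaugeVec() *gaugeVec {
	return &gaugeVec{values: map[string]float64{}}
}

func (g *gaugeVec) set(name string, v float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values[name] = v
}

func (g *gaugeVec) get(name string) float64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.values[name]
}

// unavailableValue maps availability to the gauge value: 1 means unavailable.
func unavailableValue(available bool) float64 {
	if available {
		return 0
	}
	return 1
}

// syncOnChange models the problematic pattern (an assumption about the bug):
// the gauge is touched only when this instance's observation differs from the
// shared, stored condition. If another instance already updated that
// condition, this instance sees no change and its gauge keeps the old value.
func syncOnChange(g *gaugeVec, name string, storedAvailable, observedAvailable bool) {
	if storedAvailable == observedAvailable {
		return // no transition seen by this instance; gauge is left as-is
	}
	g.set(name, unavailableValue(observedAvailable))
}

// syncAlways models the intended behaviour: the gauge is set on every sync,
// so it always reflects what this instance currently observes.
func syncAlways(g *gaugeVec, name string, observedAvailable bool) {
	g.set(name, unavailableValue(observedAvailable))
}
```

With syncOnChange, an instance that set the gauge to 1 and later finds the shared condition already flipped back to available by another instance leaves its gauge at 1. With syncAlways, the next sync resets it to 0.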

Comment 4 Ke Wang 2020-02-21 15:46:56 UTC
Verified with OCP build:
$ oc version
Client Version: v4.4.0
Server Version: 4.4.0-0.nightly-2020-02-20-203407
Kubernetes Version: v1.17.1

Verification steps:
1. Make some apiservice fail, e.g. remove openshift-apiserver by:
 $ oc patch openshiftapiserver cluster --type=json -p '[{"op": "replace", "path": "/spec/managementState", "value": "Removed"}]'
 
2. Open the Prometheus UI from the OCP web console, enter the keyword 'aggregator_unavailable_apiservice_count', and click 'Execute'. On the Console tab, the unavailable apiservices are displayed with their name and count.

Alternatively, reboot one master node from a terminal with the commands below:
$ master=$(oc get node | grep master | awk '{print $1}' | head -1)
$ oc debug no/$master -- chroot /host shutdown -r now

While the node is restarting, click 'Execute' again in the Prometheus UI as in step 2; the result changes.

The feature works well with an OCP build that includes the merged PR.

Comment 6 errata-xmlrpc 2020-05-04 11:33:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

