Bug 1798450

Summary: kube-aggregator: unavailableGauge is wrong
Product: OpenShift Container Platform Reporter: Lukasz Szaszkiewicz <lszaszki>
Component: kube-apiserver Assignee: Lukasz Szaszkiewicz <lszaszki>
Status: CLOSED ERRATA QA Contact: Ke Wang <kewang>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 4.4 CC: aos-bugs, mfojtik, sttts, xxia
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Release Note
Doc Text:
Fixed aggregator_unavailable_apiservice metric to have correct value.
Story Points: ---
Clone Of:
: 1798461 (view as bug list) Environment:
Last Closed: 2020-05-04 11:33:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1798461    

Description Lukasz Szaszkiewicz 2020-02-05 11:52:56 UTC
The unavailableGauge metric is not always set and can report incorrect values, especially in an HA setup: one instance may observe a service as unavailable and mark it so, while another instance observes the same service as available. The first instance would then never reflect the available state, since it would not observe any further changes.

Comment 4 Ke Wang 2020-02-21 15:46:56 UTC
Verified with OCP build:
$ oc version
Client Version: v4.4.0
Server Version: 4.4.0-0.nightly-2020-02-20-203407
Kubernetes Version: v1.17.1

Verification steps:
1. Make some apiservice fail, e.g. remove openshift-apiserver by:
 $ oc patch openshiftapiserver cluster --type=json -p '[{"op": "replace", "path": "/spec/managementState", "value": "Removed"}]'
 
2. Open the Prometheus UI from the OCP web console, enter the keyword 'aggregator_unavailable_apiservice_count', click 'Execute', and navigate to the Console tab. The unavailable apiservices are displayed with their name and count.
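
As a cross-check for step 2, the unavailable apiservices can also be listed from the CLI and compared with what Prometheus shows. This is only a minimal sketch: the helper name list_unavailable is hypothetical, and it assumes the default table layout of `oc get apiservices` (columns NAME, SERVICE, AVAILABLE, AGE).

```shell
# Hypothetical helper (not part of oc): reads `oc get apiservices` table
# output on stdin and prints the NAME of every row whose AVAILABLE column
# is not "True". Assumes the default column order NAME, SERVICE, AVAILABLE, AGE.
list_unavailable() {
  # NR > 1 skips the header row; $3 is the first word of the AVAILABLE column,
  # so values such as "False (ServiceNotFound)" are matched on "False".
  awk 'NR > 1 && $3 != "True" { print $1 }'
}

# Usage against a live cluster:
#   oc get apiservices | list_unavailable
```

The names printed this way should correspond to the apiservices reported by the metric; an empty result means every apiservice is currently reported as available.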

Alternatively, reboot one master node from a terminal with the commands below:
$ master=$(oc get node | grep master | awk '{print $1}' | head -1)
$ oc debug no/$master -- chroot /host shutdown -r now

While the node is restarting, click 'Execute' again in the Prometheus UI as in step 2; the result changes.

The feature works well in the OCP build with the PR merged.

Comment 6 errata-xmlrpc 2020-05-04 11:33:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581