Bug 1885320

Summary: HPA fires KubeHpaReplicasMismatch alert after OCS installation
Product: OpenShift Container Platform
Component: Node
Sub component: Autoscaler (HPA, VPA)
Version: 4.6
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED NOTABUG
Severity: high
Priority: high
Reporter: Filip Balák <fbalak>
Assignee: Joel Smith <joelsmith>
QA Contact: Sunil Choudhary <schoudha>
CC: alegrand, anpicker, aos-bugs, erooth, jokerman, kakkoyun, lcosic, mloibl, nagrawal, nberry, omitrani, pkrupa, rphillips, surbania
Flags: joelsmith: needinfo-
Type: Bug
Last Closed: 2020-10-15 12:12:39 UTC
Bug Blocks: 1885313

Description Filip Balák 2020-10-05 15:27:25 UTC
Description of problem:
There is a Kubernetes issue https://github.com/kubernetes/kubernetes/issues/79365 that affects HPAs in OpenShift. It seems that, because of this issue, the KubeHpaReplicasMismatch alert is triggered after OCS is installed.

Version-Release number of selected component (if applicable):
OCP: 4.6.0-0.nightly-2020-10-03-051134
OCS: ocs-operator.v4.6.0-108.ci

How reproducible:
100%

Steps to Reproduce:
1. Install OCP.
2. Install OCS.
3. Navigate to Monitoring -> Alerting in the OCP UI.

Actual results:
There is alert KubeHpaReplicasMismatch:
HPA openshift-storage/noobaa-endpoint has not matched the desired number of replicas for longer than 15 minutes.

Expected results:
There should be no KubeHpaReplicasMismatch alert.

Additional info:
OCS BZ 1885313
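
A CLI alternative to the UI check above, to confirm the alert is firing and capture the HPA state (a sketch; the route name, service account, and Alertmanager API path are assumed to be the OCP 4.6 defaults):

# HPA state that the alert complains about
oc -n openshift-storage get hpa noobaa-endpoint

# Query Alertmanager for the firing alert
TOKEN=$(oc -n openshift-monitoring sa get-token prometheus-k8s)
HOST=$(oc -n openshift-monitoring get route alertmanager-main -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v2/alerts" | grep -o KubeHpaReplicasMismatch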

Comment 4 Sergiusz Urbaniak 2020-10-06 06:31:08 UTC
Reassigning to the Node component, which handles HPA.

Comment 5 Ohad 2020-10-08 12:21:26 UTC
Hi,

I investigated the original issue with @Filip.
I would like to add the following information:

1. The KubeHpaReplicasMismatch event on the HPA is persistent and keeps firing.
2. From my observation, the cause of the event is that the HPA decided the desiredReplicas should be 0 (we can see it in the status section of the HPA resource) while the actual replica count is 1.
3. The desired replica count should never be 0 because the spec specifies minReplicas of 1 (and maxReplicas of 2).
4. The current CPU metric observed by the HPA controller is "Unknown/80%"; this can be seen when using the describe command on the HPA resource (see the sketch after this list).
5. We are aware there is a known issue regarding pods with initContainers (https://bugzilla.redhat.com/show_bug.cgi?id=1867477), but our pods' spec does not define any initContainers or sidecar containers.
6. This was tested on top of OCP 4.6; we had a different issue when we tested on top of OCP 4.5.
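
A minimal sketch of how to inspect the fields mentioned above (the namespace and HPA name are assumed from the alert; adjust if they differ):

oc -n openshift-storage get hpa noobaa-endpoint \
  -o jsonpath='{.spec.minReplicas} {.spec.maxReplicas} {.status.desiredReplicas} {.status.currentReplicas}{"\n"}'
# Per the report this prints "1 2 0 1": minReplicas=1, maxReplicas=2, yet desiredReplicas=0 while currentReplicas=1.

# The "Unknown/80%" CPU target shows up in the Metrics line of:
oc -n openshift-storage describe hpa noobaa-endpoint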

I hope this information is helpful

Comment 6 Ryan Phillips 2020-10-08 14:36:06 UTC
Can you oc describe the hpa resource? Does the spec for openshift-storage/noobaa-endpoint contain a zero for the Replicas count? Setting it to zero will disable the autoscaling for that resource.

Can you add the logs from the HPA as well?
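
A sketch of commands that would gather what is requested here (on OpenShift the HPA controller runs inside kube-controller-manager, so its logs are assumed to live there; the pod label and container name below are assumptions):

oc -n openshift-storage describe hpa noobaa-endpoint
oc -n openshift-storage get hpa noobaa-endpoint -o yaml

# HPA controller logs, filtered to the HPA in question
oc -n openshift-kube-controller-manager logs -l app=kube-controller-manager \
  -c kube-controller-manager --tail=-1 | grep -i noobaa-endpoint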

Comment 7 Ohad 2020-10-08 14:58:33 UTC
> Can you oc describe the hpa resource?
> Can you add the logs from the HPA as well?

@Filip Can you please provide it? Maybe from the OCS bug?

@Ryan
The noobaa-endpoint deployment is created with a replica count of 1 and we do not change it at all; we leave this responsibility to the HPA.

Comment 8 Ryan Phillips 2020-10-08 17:28:05 UTC
Are you ok if we target a fix for this bug in 4.6.z?

Comment 9 Ryan Phillips 2020-10-08 17:57:46 UTC
One more thing to keep in mind: after step 1 (Install OCP), monitoring is usually still installing even when the OCP cluster reports as 'up'. Does the ocs-operator make sure that all its dependent components are running?
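
A sketch of checks for that dependency, assuming the stock monitoring stack and the standard resource metrics APIService name:

# Is the monitoring ClusterOperator fully rolled out?
oc get clusteroperator monitoring

# Is the resource metrics API (which the HPA needs for CPU metrics) being served?
oc get apiservice v1beta1.metrics.k8s.io
oc adm top pods -n openshift-storage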

Comment 10 Ohad 2020-10-08 18:43:35 UTC
> Does the ocs-operator make sure that all it's dependent components are running?
This is a very good point, and I am not sure what the answer is.
I will have to check and come back with an answer.

But even if this is correct, why does it not resolve itself later in the lifetime of the cluster?