Bug 1885320

Summary: HPA fires KubeHpaReplicasMismatch alert after OCS installation
Product: OpenShift Container Platform
Component: Node
Sub component: Autoscaler (HPA, VPA)
Version: 4.6
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED NOTABUG
Severity: high
Priority: high
Reporter: Filip Balák <fbalak>
Assignee: Joel Smith <joelsmith>
QA Contact: Sunil Choudhary <schoudha>
CC: alegrand, anpicker, aos-bugs, erooth, jokerman, kakkoyun, lcosic, mloibl, nagrawal, nberry, omitrani, pkrupa, rphillips, surbania
Flags: joelsmith: needinfo-
Type: Bug
Last Closed: 2020-10-15 12:12:39 UTC
Bug Blocks: 1885313

Description Filip Balák 2020-10-05 15:27:25 UTC
Description of problem:
There is a Kubernetes issue https://github.com/kubernetes/kubernetes/issues/79365 that affects HPAs in OpenShift. It seems that, because of this issue, the KubeHpaReplicasMismatch alert is triggered after OCS is installed.

Version-Release number of selected component (if applicable):
OCP: 4.6.0-0.nightly-2020-10-03-051134
OCS: ocs-operator.v4.6.0-108.ci

How reproducible:
100%

Steps to Reproduce:
1. Install OCP.
2. Install OCS.
3. Navigate to Monitoring -> Alerting in the OCP UI.

Actual results:
There is alert KubeHpaReplicasMismatch:
HPA openshift-storage/noobaa-endpoint has not matched the desired number of replicas for longer than 15 minutes.

Expected results:
There should be no KubeHpaReplicasMismatch alert.

Additional info:
OCS BZ 1885313
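
A CLI alternative to the UI check above, to confirm the alert is firing and capture the HPA state (a sketch; the route name, service account, and Alertmanager API path are assumed to be the OCP 4.6 defaults):

# HPA state that the alert complains about
oc -n openshift-storage get hpa noobaa-endpoint

# Query Alertmanager for the firing alert
TOKEN=$(oc -n openshift-monitoring sa get-token prometheus-k8s)
HOST=$(oc -n openshift-monitoring get route alertmanager-main -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v2/alerts" | grep -o KubeHpaReplicasMismatch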

Comment 4 Sergiusz Urbaniak 2020-10-06 06:31:08 UTC
Reassigning to the Node component, which handles HPA.

Comment 5 Ohad 2020-10-08 12:21:26 UTC
Hi,

I investigated the original issue with @Filip.
I would like to add the following information:

1. The KubeHpaReplicasMismatch event on the HPA is persistent and keeps firing.
2. From my observation, the cause of the event is that the HPA decided the desiredReplicas should be 0 (we can see it in the status section of the HPA resource) while the actual replica count is 1.
3. The desired replica count should never be 0 because the spec specifies minReplicas of 1 (and maxReplicas of 2).
4. The current CPU metric observed by the HPA controller is "Unknown/80%"; this can be seen when using the describe command on the HPA resource (see the sketch after this list).
5. We are aware there is a known issue regarding pods with initContainers (https://bugzilla.redhat.com/show_bug.cgi?id=1867477), but our pods' spec does not define any initContainers or sidecar containers.
6. This was tested on top of OCP 4.6; we had a different issue when we tested on top of OCP 4.5.
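
A minimal sketch of how to inspect the fields mentioned above (the namespace and HPA name are assumed from the alert; adjust if they differ):

oc -n openshift-storage get hpa noobaa-endpoint \
  -o jsonpath='{.spec.minReplicas} {.spec.maxReplicas} {.status.desiredReplicas} {.status.currentReplicas}{"\n"}'
# Per the report this prints "1 2 0 1": minReplicas=1, maxReplicas=2, yet desiredReplicas=0 while currentReplicas=1.

# The "Unknown/80%" CPU target shows up in the Metrics line of:
oc -n openshift-storage describe hpa noobaa-endpoint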

I hope this information is helpful

Comment 6 Ryan Phillips 2020-10-08 14:36:06 UTC
Can you oc describe the hpa resource? Does the spec for openshift-storage/noobaa-endpoint contain a zero for the Replicas count? Setting it to zero will disable the autoscaling for that resource.

Can you add the logs from the HPA as well?
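
A sketch of commands that would gather what is requested here (on OpenShift the HPA controller runs inside kube-controller-manager, so its logs are assumed to live there; the pod label and container name below are assumptions):

oc -n openshift-storage describe hpa noobaa-endpoint
oc -n openshift-storage get hpa noobaa-endpoint -o yaml

# HPA controller logs, filtered to the HPA in question
oc -n openshift-kube-controller-manager logs -l app=kube-controller-manager \
  -c kube-controller-manager --tail=-1 | grep -i noobaa-endpoint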

Comment 7 Ohad 2020-10-08 14:58:33 UTC
> Can you oc describe the hpa resource?
> Can you add the logs from the HPA as well?

@Filip Can you please provide it? Maybe from the OCS bug?

@Ryan
The noobaa-endpoint deployment is created with a replica count of 1 and we do not change it at all; we leave this responsibility to the HPA.

Comment 8 Ryan Phillips 2020-10-08 17:28:05 UTC
Are you ok if we target a fix for this bug in 4.6.z?

Comment 9 Ryan Phillips 2020-10-08 17:57:46 UTC
One more thing to keep in mind: after step 1 (Install OCP), monitoring is usually still installing even when the OCP cluster reports as 'up'. Does the ocs-operator make sure that all its dependent components are running?
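
A sketch of checks for that dependency, assuming the stock monitoring stack and the standard resource metrics APIService name:

# Is the monitoring ClusterOperator fully rolled out?
oc get clusteroperator monitoring

# Is the resource metrics API (which the HPA needs for CPU metrics) being served?
oc get apiservice v1beta1.metrics.k8s.io
oc adm top pods -n openshift-storage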

Comment 10 Ohad 2020-10-08 18:43:35 UTC
> Does the ocs-operator make sure that all it's dependent components are running?
This is a very good point, and I am not sure what the answer is.
I will have to check and come back with an answer.

But even if this is correct, why does it not resolve itself later in the lifetime of the cluster?