Bug 1885320 - HPA fires KubeHpaReplicasMismatch alert after OCS installation
Summary: HPA fires KubeHpaReplicasMismatch alert after OCS installation
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Joel Smith
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks: 1885313
 
Reported: 2020-10-05 15:27 UTC by Filip Balák
Modified: 2021-12-05 11:15 UTC (History)
14 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-15 12:12:39 UTC
Target Upstream Version:
Embargoed:
joelsmith: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Github kubernetes kubernetes issues 79365 0 None open Warning FailedGetResourceMetric horizontal-pod-autoscaler missing request for cpu 2020-11-23 10:22:34 UTC
Red Hat Bugzilla 1885313 0 unspecified CLOSED noobaa-endpoint HPA fires KubeHpaReplicasMismatch alert after installation 2021-02-22 00:41:40 UTC

Description Filip Balák 2020-10-05 15:27:25 UTC
Description of problem:
There is a Kubernetes issue, https://github.com/kubernetes/kubernetes/issues/79365, that affects HPAs in OpenShift. It appears that this issue causes the KubeHpaReplicasMismatch alert to be triggered after OCS is installed.
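
For context, the upstream issue surfaces as "missing request for cpu": the HPA controller can only compute CPU utilization as a percentage of request, so every container in the pods matched by the HPA needs a CPU request. A minimal illustrative snippet of the relevant container fields (names and values are placeholders, not taken from the noobaa-endpoint deployment):

  containers:
  - name: endpoint              # placeholder container name
    resources:
      requests:
        cpu: 100m               # without a CPU request on every matched container, the HPA
                                # reports FailedGetResourceMetric / missing request for cpu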

Version-Release number of selected component (if applicable):
OCP: 4.6.0-0.nightly-2020-10-03-051134
OCS: ocs-operator.v4.6.0-108.ci

How reproducible:
100%

Steps to Reproduce:
1. Install OCP.
2. Install OCS.
3. Navigate to Monitoring -> Alerting in OCP UI.

Actual results:
There is alert KubeHpaReplicasMismatch:
HPA openshift-storage/noobaa-endpoint has not matched the desired number of replicas for longer than 15 minutes.
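
For reference, this alert comes from the KubeHpaReplicasMismatch rule shipped with the cluster monitoring stack (kubernetes-mixin). It is roughly of the following form; this is paraphrased, and the exact metric names and labels vary between releases (newer mixins use kube_horizontalpodautoscaler_* metrics):

  - alert: KubeHpaReplicasMismatch
    expr: |
      (kube_hpa_status_desired_replicas{job="kube-state-metrics"}
        !=
      kube_hpa_status_current_replicas{job="kube-state-metrics"})
      and
      changes(kube_hpa_status_current_replicas{job="kube-state-metrics"}[15m]) == 0
    for: 15m
    labels:
      severity: warning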

Expected results:
There should be no KubeHpaReplicasMismatch alert.

Additional info:
OCS BZ 1885313

Comment 4 Sergiusz Urbaniak 2020-10-06 06:31:08 UTC
Reassigning to the Node component, which handles HPA.

Comment 5 Ohad 2020-10-08 12:21:26 UTC
Hi,

I investigated the original issue with @Filip.
I would like to add the following information:

1. The KubeHpaReplicasMismatch event on the HPA is persistent and keeps firing.
2. From my observation, the cause of the event is that the HPA decided that desiredReplicas should be 0 (this can be seen in the status section of the HPA resource) while the actual replica count is 1.
3. The desired replica count should never be 0 because the spec specifies minReplicas of 1 (and maxReplicas of 2).
4. The current CPU metric observed by the HPA controller is "Unknown/80%"; this can be seen when using the describe command on the HPA resource (see the illustrative output below).
5. We are aware there is a known issue regarding pods with initContainers (https://bugzilla.redhat.com/show_bug.cgi?id=1867477), but our pods' spec does not define any initContainers or sidecar containers.
6. This was tested on top of OCP 4.6; we had a different issue when we tested on top of OCP 4.5.
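
An illustrative reconstruction of the describe output in this state (pieced together from the points above, not a verbatim capture from the cluster):

  $ oc describe hpa noobaa-endpoint -n openshift-storage
  Reference:                                            Deployment/noobaa-endpoint
  Metrics:                                              ( current / target )
    resource cpu on pods (as a percentage of request):  <unknown> / 80%
  Min replicas:                                         1
  Max replicas:                                         2
  Deployment pods:                                      1 current / 0 desired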

I hope this information is helpful

Comment 6 Ryan Phillips 2020-10-08 14:36:06 UTC
Can you oc describe the hpa resource? Does the spec for openshift-storage/noobaa-endpoint contain a zero for the Replicas count? Setting it to zero will disable the autoscaling for that resource.

Can you add the logs from the HPA as well?
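
Something along these lines should capture it (the pod name is a placeholder; the HPA controller runs inside the kube-controller-manager pods, which is where its log lines end up):

  $ oc describe hpa noobaa-endpoint -n openshift-storage
  $ oc get hpa noobaa-endpoint -n openshift-storage -o yaml
  $ oc get pods -n openshift-kube-controller-manager
  $ oc logs -n openshift-kube-controller-manager <kube-controller-manager-pod> | grep -i horizontal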

Comment 7 Ohad 2020-10-08 14:58:33 UTC
> Can you oc describe the hpa resource?
> Can you add the logs from the HPA as well?

@Filip Can you please provide these, maybe from the OCS bug?

@Ryan
The noobaa-endpoint deployment is created with a replica count of 1 and we do not change it at all; we leave this responsibility to the HPA.

Comment 8 Ryan Phillips 2020-10-08 17:28:05 UTC
Are you ok if we target a fix for this bug in 4.6.z?

Comment 9 Ryan Phillips 2020-10-08 17:57:46 UTC
One more thing to keep in mind: after step 1 (Install OCP), monitoring is usually still installing after an OCP cluster is 'up'. Does the ocs-operator make sure that all of its dependent components are running?
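
If it does not, one way to make the test deterministic is to wait for the monitoring cluster operator to report Available before starting the OCS install, for example (the timeout value is arbitrary):

  $ oc get clusteroperators
  $ oc wait --for=condition=Available clusteroperator/monitoring --timeout=15m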

Comment 10 Ohad 2020-10-08 18:43:35 UTC
> Does the ocs-operator make sure that all it's dependent components are running?
This is a very good point, and I am not sure what the answer is.
I will have to check and come back with an answer.

But even if this is the case, why does it not resolve itself later in the lifetime of the cluster?

