Bug 1885313 - noobaa-endpoint HPA fires KubeHpaReplicasMismatch alert after installation
Summary: noobaa-endpoint HPA fires KubeHpaReplicasMismatch alert after installation
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: Multi-Cloud Object Gateway
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.6.0
Assignee: Nimrod Becker
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On: 1885320
Blocks:
 
Reported: 2020-10-05 15:09 UTC by Filip Balák
Modified: 2020-10-20 10:33 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-20 09:38:55 UTC
Embargoed:




Links:
  Github kubernetes/kubernetes issue 79365 (open): Warning FailedGetResourceMetric horizontal-pod-autoscaler missing request for cpu (last updated 2020-11-23 10:22:32 UTC)
  Red Hat Bugzilla 1836299 (CLOSED): NooBaa Operator deploys with HPA that fires maxreplicas alerts by default (last updated 2024-03-25 15:56:18 UTC)

Internal Links: 1836299 1885320

Description Filip Balák 2020-10-05 15:09:31 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
After installation, the following alert is fired:
  KubeHpaReplicasMismatch:
  HPA openshift-storage/noobaa-endpoint has not matched the desired number of replicas for longer than 15 minutes.

Version of all relevant components (if applicable):
ocs-operator.v4.6.0-108.ci

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
yes

Steps to Reproduce:
1. Install OCS
2. Navigate to Monitoring -> Alerting

Actual results:
There is alert KubeHpaReplicasMismatch.

Expected results:
There should not be any alerts related to noobaa-endpoint HPA.

Additional info:
$ oc get deployment noobaa-endpoint -n openshift-storage -o yaml
kind: Deployment
apiVersion: apps/v1
(...)
spec:
  replicas: 1
(...)
status:
  observedGeneration: 1
  replicas: 1
  updatedReplicas: 1
  readyReplicas: 1
  availableReplicas: 1

$ oc get HorizontalPodAutoscaler noobaa-endpoint -n openshift-storage -o yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
(...)
spec:
  maxReplicas: 2
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: noobaa-endpoint
  targetCPUUtilizationPercentage: 80
status:
  currentReplicas: 1
  desiredReplicas: 0

$ oc get hpa -A
NAMESPACE           NAME              REFERENCE                    TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
openshift-storage   noobaa-endpoint   Deployment/noobaa-endpoint   <unknown>/80%   1         2         1          6h44m
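As a diagnostic sketch (not part of the pasted output above): the reason for the <unknown> target is normally visible in the HPA events, which in this situation would be the FailedGetResourceMetric ("missing request for cpu") warning tracked in the linked upstream kubernetes issue 79365:

$ oc describe hpa noobaa-endpoint -n openshift-storage
# look in the Events section for Warning FailedGetResourceMetric entries
# such as "missing request for cpu" on the target pods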

Comment 3 Elad 2020-10-07 09:51:43 UTC
Filip, did we see this alert with previous OCS versions? If not, it's a regression IMO

Comment 7 Ohad 2020-10-15 12:22:37 UTC
We found the cause of the issue:
ocs-ci deploys the cluster with the following resources configuration in the storagecluster CR (under spec.resources):

  resources:
    mds: {}
    mgr: {}
    mon: {}
    noobaa-core: {}
    noobaa-db: {}
    noobaa-endpoint: {}
    rgw: {}

This causes all pods belonging to these Deployments/StatefulSets to have their pod templates configured with a resources value of {}.
Per its spec, the HPA cannot work without explicit resource requests set on the pods it observes, which is what causes the issue described here.
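As a hedged sketch of the kind of values the HPA needs, assuming the same spec.resources layout as above (the CPU/memory numbers are illustrative placeholders, not recommended defaults):

  resources:
    noobaa-endpoint:
      requests:
        cpu: 100m        # a CPU request is what lets the HPA compute
        memory: 500Mi    # utilization against targetCPUUtilizationPercentage: 80
      limits:
        cpu: "1"
        memory: 2Gi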
As is, this is not a bug in the OCS product, or at least not in the official/default/supported deployment of the product.

I am not sure whether this configuration (applied via the ocs-ci deployment scripts) is intended or is itself a bug.
@Filip (or any other QE representative), can you please check and explain the reasoning behind this setup, and what we can do to mitigate the problem?

Comment 8 Ohad 2020-10-15 13:52:14 UTC
After discussing the issue with Elad and Petr, it seems this is done in order to run OCS on clusters with low resources.

@Filip, can you please verify that deploying the cluster with specific values (maybe via the UI) solves the issue?
That way we can confirm that this is not a bug in the product and close the BZ.

Comment 9 Filip Balák 2020-10-20 09:38:55 UTC
I confirm that a cluster installed manually with the supported configuration doesn't trigger this alert. -> NOTABUG

However, I don't see resources defined in the storagecluster CR as shown in comment 7.
Excerpt from the ocs-storagecluster instance of the StorageCluster CR:
(...)
spec:
  encryption: {}
  externalStorage: {}
  managedResources:
    cephBlockPools: {}
    cephFilesystems: {}
    cephObjectStoreUsers: {}
    cephObjectStores: {}
    snapshotClasses: {}
    storageClasses: {}
  storageDeviceSets:
    - config: {}
      count: 1
      dataPVCTemplate:
        metadata:
          creationTimestamp: null
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 2Ti
          storageClassName: gp2
          volumeMode: Block
        status: {}
      name: ocs-deviceset-gp2
      placement: {}
      portable: true
      replica: 3
      resources: {}
  version: 4.6.0
(...)

Comment 10 Ohad 2020-10-20 10:33:05 UTC
Filip, it is OK not to set anything. The problem arises when you set them to empty objects or nulls.
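A minimal illustration of that distinction, assuming the same spec.resources section as in comment 7 (a sketch, not the operator's documented behavior):

# Fine: spec.resources omitted entirely; presumably the operator's own
# defaults, including CPU requests, are applied to the pods.

# Problematic: explicit empty objects override those defaults and propagate
# resources: {} into the pod templates, so the HPA has no CPU request to
# measure against and the target stays <unknown>.
resources:
  noobaa-endpoint: {}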

