Bug 1885313

Summary: noobaa-endpoint HPA fires KubeHpaReplicasMismatch alert after installation
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Component: Multi-Cloud Object Gateway
Version: 4.6
Target Release: OCS 4.6.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
Target Milestone: ---
Reporter: Filip Balák <fbalak>
Assignee: Nimrod Becker <nbecker>
QA Contact: Raz Tamir <ratamir>
Docs Contact:
CC: ebenahar, etamir, nberry, ocs-bugs, omitrani, tunguyen
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-20 09:38:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
Bug Depends On: 1885320    
Bug Blocks:    

Description Filip Balák 2020-10-05 15:09:31 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
After installation, the following alert fires:
  KubeHpaReplicasMismatch:
  HPA openshift-storage/noobaa-endpoint has not matched the desired number of replicas for longer than 15 minutes.

Version of all relevant components (if applicable):
ocs-operator.v4.6.0-108.ci

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
yes

Steps to Reproduce:
1. Install OCS
2. Navigate to Monitoring -> Alerting

Actual results:
The KubeHpaReplicasMismatch alert is present.

Expected results:
There should not be any alerts related to the noobaa-endpoint HPA.

Additional info:
$ oc get deployment noobaa-endpoint -n openshift-storage -o yaml
kind: Deployment
apiVersion: apps/v1
(...)
spec:
  replicas: 1
(...)
status:
  observedGeneration: 1
  replicas: 1
  updatedReplicas: 1
  readyReplicas: 1
  availableReplicas: 1

$ oc get HorizontalPodAutoscaler noobaa-endpoint -n openshift-storage -o yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
(...)
spec:
  maxReplicas: 2
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: noobaa-endpoint
  targetCPUUtilizationPercentage: 80
status:
  currentReplicas: 1
  desiredReplicas: 0

$ oc get hpa -A
NAMESPACE           NAME              REFERENCE                    TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
openshift-storage   noobaa-endpoint   Deployment/noobaa-endpoint   <unknown>/80%   1         2         1          6h44m
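
The <unknown> value in the TARGETS column indicates that the autoscaler could not compute CPU utilization for the target pods. As a hedged pointer (not part of the original report), the HPA's conditions and events, which typically explain why the metric cannot be computed, can be inspected with:

$ oc describe hpa noobaa-endpoint -n openshift-storage

When the target pods declare no CPU request, there is nothing for per-pod utilization to be measured against, so the autoscaler cannot derive a desired replica count.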

Comment 3 Elad 2020-10-07 09:51:43 UTC
Filip, did we get this alert with previous OCS versions? If not, it's a regression IMO.

Comment 7 Ohad 2020-10-15 12:22:37 UTC
We found the cause of the issue:
ocs-ci deploys the cluster with the following resources configuration on the StorageCluster CR (under spec.resources):

 resources:
      mds: {}
      mgr: {}
      mon: {}
      noobaa-core: {}
      noobaa-db: {}
      noobaa-endpoint: {}
      rgw: {}

This causes all pods related to these Deployments/StatefulSets to end up with a resources value of {} in their pod templates.
By spec, an HPA cannot work without explicit resource values set on the resources section of the pods it is observing, which is what causes the issue described here.
As is, this is not a bug in the OCS product, or at least not in the official/default/supported deployment of the product.

I am not sure whether this configuration (set via the ocs-ci deployment scripts) is intended or is itself a bug.
@Filip (or any other QE representative), can you please check and explain the reasoning behind this setup and what we can do to mitigate the problem?
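
For reference, a minimal sketch of what explicit resource values on the observed pods would look like, so the HPA's targetCPUUtilizationPercentage (80% here) has a CPU request to measure against. The container name and the numbers below are illustrative assumptions, not the product defaults:

# Illustrative pod template excerpt for the noobaa-endpoint Deployment;
# the HPA computes utilization as (current CPU usage / requests.cpu).
spec:
  template:
    spec:
      containers:
        - name: endpoint          # container name assumed for illustration
          resources:
            requests:
              cpu: 100m
              memory: 500Mi
            limits:
              cpu: "1"
              memory: 1Gi

With requests.cpu set, desiredReplicas can be calculated and the KubeHpaReplicasMismatch alert should clear once current and desired replicas agree.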

Comment 8 Ohad 2020-10-15 13:52:14 UTC
After discussing the issue with Elad and Petr, it seems this is done in order to run OCS on clusters with low resources.

@Filip, can we please verify that deploying the cluster with specific values (maybe via the UI) solves the issue?
That way we can confirm that this is not a bug in the product and close the BZ.

Comment 9 Filip Balák 2020-10-20 09:38:55 UTC
I confirm that a cluster with the supported configuration, installed manually, doesn't trigger this alert. -> NOTABUG

However, I don't see the resources section defined in the StorageCluster CR as shown in comment 7.
Excerpt from the ocs-storagecluster instance of the StorageCluster CR:
(...)
spec:
  encryption: {}
  externalStorage: {}
  managedResources:
    cephBlockPools: {}
    cephFilesystems: {}
    cephObjectStoreUsers: {}
    cephObjectStores: {}
    snapshotClasses: {}
    storageClasses: {}
  storageDeviceSets:
    - config: {}
      count: 1
      dataPVCTemplate:
        metadata:
          creationTimestamp: null
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 2Ti
          storageClassName: gp2
          volumeMode: Block
        status: {}
      name: ocs-deviceset-gp2
      placement: {}
      portable: true
      replica: 3
      resources: {}
  version: 4.6.0
(...)

Comment 10 Ohad 2020-10-20 10:33:05 UTC
Filip, it is OK to not set anything. The problem arises when you set the resources to empty objects or nulls.
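
To make the distinction concrete, a hedged sketch of the two variants discussed above; the request/limit numbers are placeholders, not recommended or default values:

# Problematic (what ocs-ci sets): empty objects propagate an empty
# resources section to the pod templates, so the HPA has no CPU request
# to compute utilization against.
spec:
  resources:
    noobaa-endpoint: {}

# Workable alternatives: omit spec.resources entirely (as in the manual
# deployment from comment 9), or set explicit values, e.g.:
spec:
  resources:
    noobaa-endpoint:
      requests:
        cpu: 100m
        memory: 500Mi
      limits:
        cpu: "1"
        memory: 1Gi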