Description of problem (please be as detailed as possible and provide log snippets):
After installation the following alert is fired:
KubeHpaReplicasMismatch: HPA openshift-storage/noobaa-endpoint has not matched the desired number of replicas for longer than 15 minutes.

Version of all relevant components (if applicable):
ocs-operator.v4.6.0-108.ci

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
yes

Steps to Reproduce:
1. Install OCS
2. Navigate to Monitoring -> Alerting

Actual results:
The KubeHpaReplicasMismatch alert is firing.

Expected results:
There should not be any alerts related to the noobaa-endpoint HPA.

Additional info:

$ oc get deployment noobaa-endpoint -n openshift-storage -o yaml
kind: Deployment
apiVersion: apps/v1
(...)
spec:
  replicas: 1
(...)
status:
  observedGeneration: 1
  replicas: 1
  updatedReplicas: 1
  readyReplicas: 1
  availableReplicas: 1

$ oc get HorizontalPodAutoscaler noobaa-endpoint -n openshift-storage -o yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
(...)
spec:
  maxReplicas: 2
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: noobaa-endpoint
  targetCPUUtilizationPercentage: 80
status:
  currentReplicas: 1
  desiredReplicas: 0

$ oc get hpa -A
NAMESPACE           NAME              REFERENCE                    TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
openshift-storage   noobaa-endpoint   Deployment/noobaa-endpoint   <unknown>/80%   1         2         1          6h44m
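The HPA status above (desiredReplicas: 0, target <unknown>/80%) suggests the autoscaler cannot obtain a CPU utilization figure for the target pods. One way to confirm this (a sketch; the exact condition text may vary by Kubernetes version) is to describe the HPA and inspect its conditions and events:

$ oc describe hpa noobaa-endpoint -n openshift-storage

When the target pods have no CPU request set, the ScalingActive condition typically reports FailedGetResourceMetric with a message about a missing CPU request.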
Filip, did we see this alert with previous OCS versions? If not, it's a regression IMO.
We found the cause of the issue:
ocs-ci deploys the cluster with the following resources configuration on the storagecluster CR (under spec.resources):

resources:
  mds: {}
  mgr: {}
  mon: {}
  noobaa-core: {}
  noobaa-db: {}
  noobaa-endpoint: {}
  rgw: {}

This causes all pods belonging to these deployments/statefulsets to have their pod templates configured with a resources value of {}. The HPA, per its spec, cannot work without specific values set in the resources section of the pods it is observing, which is what causes the issue described here.

As is, this is not a bug in the OCS product, or at least not in the official/default/supported deployment of the product. I am not sure whether this configuration (applied via the ocs-ci deployment scripts) is intended or is a bug.

@Filip (or any other QE representative), can you please check and explain the reasoning behind this setup, and what we can do in order to mitigate the problem?
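For reference, a CPU-utilization HPA computes desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization), and utilization is measured relative to the pods' CPU requests, so an empty resources block leaves it with nothing to compute against. A minimal sketch of a non-empty configuration on the storagecluster CR (the values below are illustrative placeholders, not recommended sizes):

spec:
  resources:
    noobaa-endpoint:
      requests:
        cpu: 100m      # gives the HPA a CPU request to measure utilization against
        memory: 500Mi
      limits:
        cpu: "1"
        memory: 1Gi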
After discussing the issue with Elad and Petr, it seems this is done in order to run OCS on clusters with low resources.

@Filip, can we please verify that deploying the cluster with specific values (maybe via the UI) solves the issue? That way we can confirm this is not a bug in the product and close the BZ.
I confirm that a cluster with the supported configuration, installed manually, doesn't trigger this alert. -> NOTABUG

However, I don't see resources defined in the storagecluster CR as provided in comment 7.

Excerpt from the ocs-storagecluster instance of the StorageCluster CR:

(...)
spec:
  encryption: {}
  externalStorage: {}
  managedResources:
    cephBlockPools: {}
    cephFilesystems: {}
    cephObjectStoreUsers: {}
    cephObjectStores: {}
    snapshotClasses: {}
    storageClasses: {}
  storageDeviceSets:
  - config: {}
    count: 1
    dataPVCTemplate:
      metadata:
        creationTimestamp: null
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 2Ti
        storageClassName: gp2
        volumeMode: Block
      status: {}
    name: ocs-deviceset-gp2
    placement: {}
    portable: true
    replica: 3
    resources: {}
  version: 4.6.0
(...)
Filip, it is OK to not set anything. The problem arises when you set them to empty objects or nulls.
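To make the distinction concrete, here is a small sketch (illustrative only, not taken from the actual ocs-ci templates) of the two cases on the storagecluster CR:

# OK: the resources section is omitted entirely (as in the manually installed cluster above)
spec:
  version: 4.6.0

# Problematic: per-component keys exist but are empty objects or nulls, so the pod
# templates inherit resources: {} and the HPA has no CPU request to evaluate
spec:
  version: 4.6.0
  resources:
    noobaa-core: {}
    noobaa-db: {}
    noobaa-endpoint: null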