Bug 1885524
| Summary: | [Tracker] Unable to get metrics for resource cpu events reported after OCS installation (OCP bug 2029144) | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Martin Bukatovic <mbukatov> |
| Component: | Multi-Cloud Object Gateway | Assignee: | Naveen Paul <napaul> |
| Status: | ON_QA --- | QA Contact: | Filip Balák <fbalak> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.6 | CC: | aindenba, bpratt, csharpe, dzaken, ebenahar, jolmomar, kjosy, muagarwa, nbecker, nberry, odf-bz-bot, tunguyen |
| Target Milestone: | --- | Keywords: | Tracking |
| Target Release: | ODF 4.14.0 | Flags: | sheggodu: needinfo? (aindenba) |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | 4.14.0-28 | Doc Type: | No Doc Update |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | Type: | Bug | |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 2029144 | ||
| Bug Blocks: | |||
| Attachments: | |||
|
Description
Martin Bukatovic
2020-10-06 09:37:53 UTC
Created attachment 1719314 [details]
screenshot #1: storage dashboard with warning events
Created attachment 1719316 [details]
screenshot #2: Events page in OCP Console
I saw the same warning message after OCS installation on 4.7.

Build versions:
* ocs-operator.v4.7.0-256.ci
* 4.7.0-0.nightly-2021-02-09-024347

```
$ oc get hpa -n openshift-storage
NAME              REFERENCE                    TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
noobaa-endpoint   Deployment/noobaa-endpoint   0%/80%    1         2         1          11m

$ oc describe hpa noobaa-endpoint -n openshift-storage
Name:                                                  noobaa-endpoint
Namespace:                                             openshift-storage
Labels:                                                app=noobaa
Annotations:                                           <none>
CreationTimestamp:                                     Tue, 09 Feb 2021 10:14:14 -0800
Reference:                                             Deployment/noobaa-endpoint
Metrics:                                               ( current / target )
  resource cpu on pods (as a percentage of request):   0% (4m) / 80%
Min replicas:                                          1
Max replicas:                                          2
Deployment pods:                                       1 current / 1 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
Events:
  Type     Reason                        Age                   From                       Message
  ----     ------                        ----                  ----                       -------
  Warning  FailedGetResourceMetric       10m (x3 over 11m)     horizontal-pod-autoscaler  failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Warning  FailedComputeMetricsReplicas  10m (x3 over 11m)     horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Warning  FailedComputeMetricsReplicas  8m37s (x9 over 10m)   horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: did not receive metrics for any ready pods
  Warning  FailedGetResourceMetric       8m22s (x10 over 10m)  horizontal-pod-autoscaler  failed to get cpu utilization: did not receive metrics for any ready pods
```
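For context, the "as a percentage of request" figure in the HPA status follows the documented autoscaling rule: desired replicas = ceil(currentReplicas * currentUtilization / targetUtilization). A minimal sketch of that arithmetic (the 1000m CPU request is an assumed value for illustration; the real request comes from the noobaa-endpoint deployment spec):

```python
import math

def desired_replicas(current_replicas: int, usage_millicores: int,
                     request_millicores: int, target_percent: int) -> int:
    """Desired replica count per the documented HPA scaling rule:
    desired = ceil(currentReplicas * currentUtilization / targetUtilization)."""
    utilization = 100 * usage_millicores / request_millicores
    return math.ceil(current_replicas * utilization / target_percent)

# With the values from the HPA status above (4m usage, 80% target) and an
# assumed 1000m request, the computed count is below the 1-replica minimum,
# which matches the ScalingLimited/TooFewReplicas condition:
# desired_replicas(1, 4, 1000, 80) -> 1
```

When the metrics API returns nothing at all, the HPA cannot even run this computation, which is what the FailedGetResourceMetric/FailedComputeMetricsReplicas events report.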
Created attachment 1756015 [details]
must-gather logs
Thank you for the verbose details!

Resource usage metrics, such as container CPU and memory usage, are available in Kubernetes through the Metrics API. Note: the API requires the metrics server to be deployed in the cluster; otherwise it will not be available. See https://kubernetes.io/docs/tasks/debug-application-cluster/resource-metrics-pipeline/

To determine the active metrics server on the provided OC cluster:

```
$ kubectl describe apiservice v1beta1.metrics.k8s.io
Name:         v1beta1.metrics.k8s.io
Namespace:
Labels:       app.kubernetes.io/component=metrics-adapter
              app.kubernetes.io/name=prometheus-adapter
              app.kubernetes.io/part-of=openshift-monitoring
              app.kubernetes.io/version=0.8.4
Annotations:  service.alpha.openshift.io/inject-cabundle: true
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2021-05-24T07:03:22Z
  ....
Spec:
  Ca Bundle:               ...
  Group:                   metrics.k8s.io
  Group Priority Minimum:  100
  Service:
    Name:            prometheus-adapter
    Namespace:       openshift-monitoring
    Port:            443
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2021-05-27T09:59:42Z
    Message:               all checks passed
    Reason:                Passed
    Status:                True
    Type:                  Available
Events:  <none>
```

To test the availability of pod metrics, use:

```
$ kubectl top pod
NAME                     CPU(cores)   MEMORY(bytes)
csi-cephfsplugin-7bkkr   0m           73Mi
csi-cephfsplugin-dzhkt   0m           73Mi
csi-cephfsplugin-kx7p8   0m           72Mi
...
```

According to the describe apiservice output above, the metrics service is provided by the prometheus-adapter in the openshift-monitoring namespace.

Based on the info available so far, the issue is the noobaa-endpoint HPA's inability to obtain CPU utilization metrics:
* It starts after OCS installation.
* It stops about 15 minutes after OCS installation.
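The manual check above can also be scripted. A minimal sketch, assuming `kubectl` is on PATH and the cluster is reachable; it queries the same resource metrics API endpoint that `kubectl top pod` reads:

```python
import json
import subprocess

def pod_metrics_available(namespace: str) -> bool:
    """Return True if the resource metrics API serves pod metrics for `namespace`.

    Queries /apis/metrics.k8s.io/v1beta1 directly, the endpoint `kubectl top pod`
    consumes; any failure (no kubectl, unreachable cluster, APIService not
    registered yet) is reported as "not available" rather than raised.
    """
    try:
        out = subprocess.check_output(
            ["kubectl", "get", "--raw",
             f"/apis/metrics.k8s.io/v1beta1/namespaces/{namespace}/pods"],
            stderr=subprocess.DEVNULL,
        )
    except (OSError, subprocess.CalledProcessError):
        return False
    return len(json.loads(out).get("items", [])) > 0
```

An empty `items` list for a namespace with ready pods corresponds to the "did not receive metrics for any ready pods" HPA error seen in this bug.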
In order to troubleshoot this issue better, the following steps are recommended *before* OCS installation:

* Provide more info with `kubectl describe apiservice v1beta1.metrics.k8s.io`
* Check the status of the metrics server with `kubectl get -n openshift-monitoring pods | grep prometheus-adapter`
* Finally, ensure the metrics service is available with `kubectl top pod`

Once the metrics server availability is ensured, please try to reproduce the HPA noobaa-endpoint issue by running the OCS installation and examining events.

Thank you!

(In reply to aindenba from comment #13)
> In order to troubleshoot this issue better, the following steps are
> recommended *before* OCS installation:

So you are basically saying that it's possible that something is wrong with a cluster prior to OCS installation, which could cause this bug? This is a bit unlikely, but let's see. We can ask the OCP team to help us audit the related noobaa code if necessary.

Hello Martin,

Just wanted to mention that, according to the available logs, the issue is between the HPA (https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/), which is a standard K8s resource, and the Metrics Server, served by prometheus-adapter in the openshift-monitoring namespace, which is installed as part of OCP. The error events do not really originate in the noobaa code per se.

I am not sure what the root cause is at this stage. I would like to get a better idea of the installation procedure. Could you describe how you roll out OCS? One possible explanation is that during installation, OCS is installed just about 15 seconds before the prometheus-adapter in the openshift-monitoring namespace becomes ready. After 15 seconds prometheus-adapter is up and the issue is gone. So the platform is good; there might be a timing issue during the OCP/OCS bootstrap. That explanation would match the existing evidence; on the other hand, it could be totally off.

Nimrod suggested adding the debug steps suggested above (i.e. `kubectl describe apiservice v1beta1.metrics.k8s.io`, `kubectl top pod`) to the cluster installation automation scripts, just before OCS is installed. This way the debug info would be included in the installation logs once the issue happens.

Hope it helps. Thank you!

Hi Alexander,

(In reply to Alexander Indenbaum from comment #15)
> Just wanted to mention that according to the available logs the issue is
> between HPA
> (https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
> which is a standard K8S resource and the Metrics Server, served by
> prometheus-adapter in openshift-monitoring namespace, which is installed as
> a part of OCP. The error events do not really originate in the noobaa code
> per se.
>
> I am not sure what is the root cause at this stage. I would like to get a
> better idea of the installation procedure. Could you describe how do you
> guys roll out OCS? One possible explanation is that during installation, OCS
> is being installed just about 15 seconds before the prometheus-adapter in
> the openshift-monitoring namespace becomes ready. After 15 seconds
> prometheus-adapter is up and the issue is gone. So the platform is good,
> there might be a timing issue during OCP/OCS bootstrap. That explanation
> would match the existing evidence, from another hand, it could be totally
> off.

The installation steps for OCS vary according to the platform being installed on; however, they can be found here: https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.7/

> Nimrod suggested adding the debug steps suggested above (i.e.
> "kubectl describe apiservice v1beta1.metrics.k8s.io", "kubectl top pod") to
> the cluster installation automation scripts, just before OCS is installed.
> This way the debug info would be included in the installation logs once the
> issue happens.
>
> Hope it helps. Thank you!

Can the information requested by Alex be gathered after the OCP deployment has succeeded and before the OCS deployment is started?

Created attachment 1788814 [details]
oc describe apiservice v1beta1.metrics.k8s.io > v1beta1.metrics.k8s.io.describe
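The timing hypothesis above suggests gating the OCS installation step on metrics availability. A minimal, hypothetical polling helper for the automation scripts (names and attempt budget are illustrative, not taken from the noobaa code):

```python
import time

def wait_for(probe, attempts: int = 90, delay: float = 10.0,
             sleep=time.sleep) -> bool:
    """Poll `probe()` until it returns True or the attempt budget runs out.

    `sleep` is injectable so the loop can be exercised without real waiting.
    """
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay)
    return False

# Usage sketch: run wait_for(metrics_ready) just before kicking off the OCS
# install, where metrics_ready is whatever check the automation adopts
# (e.g. a `kubectl top pod` or raw metrics API query returning data).
```

If the prometheus-adapter really does come up a few seconds after OCS starts installing, a gate like this would make the warning events disappear from fresh installations; if the events persist anyway (as later comments report), the timing theory is ruled out.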
(In reply to Alexander Indenbaum from comment #13)
> In order to troubleshoot this issue better, the following steps are
> recommended *before* OCS installation:

I have installed OCP 4.8 (4.8.0-0.nightly-2021-06-03-055145) on the vSphere UPI platform and provided the information you requested before installing OCS; see details below.

> * Provide more info with "kubectl describe apiservice v1beta1.metrics.k8s.io"

```
$ oc describe apiservice v1beta1.metrics.k8s.io > v1beta1.metrics.k8s.io.describe
```

See attachment 1788814 [details] from comment 18.

> * Check the status of the metrics server with "kubectl get -n
> openshift-monitoring pods | grep prometheus-adapter"

```
$ oc get -n openshift-monitoring pods | grep prometheus-adapter
prometheus-adapter-5d9cbfdc5d-hlsm7   1/1   Running   0   102m
prometheus-adapter-5d9cbfdc5d-wzfsv   1/1   Running   0   104m
```

> * Finally ensure the metrics service is available by "kubectl top pod"

Are you interested in a particular namespace? The default namespace is obviously empty in my case:

```
$ oc adm top pod
W0603 19:25:03.656858   14005 top_pod.go:140] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
No resources found in default namespace.
```

If I try another existing OCP namespace, I see values as expected:

```
$ oc adm top pod -n openshift-etcd
W0603 19:25:20.345433   14010 top_pod.go:140] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
NAME                                 CPU(cores)   MEMORY(bytes)
etcd-control-plane-0                 102m         1038Mi
etcd-control-plane-1                 94m          1027Mi
etcd-control-plane-2                 81m          1028Mi
etcd-quorum-guard-77b8f5d85b-8wsrb   3m           1Mi
etcd-quorum-guard-77b8f5d85b-nsfjr   3m           1Mi
etcd-quorum-guard-77b8f5d85b-vzzxh   5m           1Mi
```

Besides that, I also fetched must-gather data:

```
b26aa7119bad9525274a37c55beffd4851aa48ae  mbukatov-0603b-local.must-gather.2021-06-03T19:26+02:00.tar.gz
```

See http://file.emea.redhat.com/~mbukatov/bz-1885524/

> Once the metrics server availability is ensured, please try to reproduce the
> HPA noobaa-endpoint issue by running the OCS installation and examining
> events.

Then I installed the following operators from OperatorHub:
- LSO 4.8.0-202106021817
- OCS 4.8.0-407.ci

And created the OCS StorageCluster via the OCP Console. I can confirm that the events in question are still there:

```
$ oc get events -n openshift-storage | grep FailedGetResourceMetric
5m3s    Warning   FailedGetResourceMetric   horizontalpodautoscaler/noobaa-endpoint   failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
2m48s   Warning   FailedGetResourceMetric   horizontalpodautoscaler/noobaa-endpoint   failed to get cpu utilization: did not receive metrics for any ready pods

$ oc get events -n openshift-storage | grep FailedComputeMetricsReplicas
8m53s   Warning   FailedComputeMetricsReplicas   horizontalpodautoscaler/noobaa-endpoint   invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
6m53s   Warning   FailedComputeMetricsReplicas   horizontalpodautoscaler/noobaa-endpoint   invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: did not receive metrics for any ready pods
```

This means the problem is reproducible with OCP/OCS 4.8.
There seem to be additional issues with the cluster to debug though.

> I am not sure what is the root cause at this stage. I would like to get a
> better idea of the installation procedure. Could you describe how do you
> guys roll out OCS?

The steps are:
- deploy the OCP cluster
- install the LSO and OCS operators from OperatorHub (via the OCP Console)
- via the OCP Console, locate the OCS operator and start the Create Storage Cluster procedure

> One possible explanation is that during installation, OCS is being installed
> just about 15 seconds before the prometheus-adapter in the
> openshift-monitoring namespace becomes ready.

Because of my current testing scope, I install OCS manually on an automatically deployed OCP cluster, which means that OCS is installed on a cluster which has been running for a few minutes and is fully operational.

After installation, I still see the events complaining about getting cpu utilization:

```
$ oc get events -n openshift-storage | grep FailedGetResourceMetric
30m   Warning   FailedGetResourceMetric   horizontalpodautoscaler/noobaa-endpoint   failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
28m   Warning   FailedGetResourceMetric   horizontalpodautoscaler/noobaa-endpoint   failed to get cpu utilization: did not receive metrics for any ready pods

$ oc get events -n openshift-storage | grep FailedComputeMetricsReplicas
39m   Warning   FailedComputeMetricsReplicas   horizontalpodautoscaler/noobaa-endpoint   invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
36m   Warning   FailedComputeMetricsReplicas   horizontalpodautoscaler/noobaa-endpoint   invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: did not receive metrics for any ready pods
```

Tested with:
- OCP 4.9.0-0.nightly-2021-10-05-004711
- LSO 4.9.0-202109210853
- ODF 4.9.0-164.ci

And I can see it both on a vSphere LSO on-premise cluster and in an AWS UPI cloud deployment.

Hello,

The issue is not reproducible with docker-desktop Kubernetes and the Metrics Server (https://github.com/kubernetes-sigs/metrics-server); however, the issue is still present in the OCP environment.

To troubleshoot this issue better, I tried a custom NooBaa build where the NooBaa operator first fetches the endpoint pod metrics, verifying the availability of the endpoint deployment pod metrics before creating an HPA instance; see https://github.com/noobaa/noobaa-operator/pull/750. In the OCP environment, using the build based on the PR #750 codebase above, there are still FailedGetResourceMetric/FailedComputeMetricsReplicas warnings emitted by the HPA.

@ebenahar I've found an OCP bug (https://bugzilla.redhat.com/show_bug.cgi?id=1993985) which seems similar to this BZ. It is closed as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2011815, which AFAIU is fixed in OCP 4.9.0. Can we verify this issue is not reproduced on OCP 4.9?

Retrying with a vSphere LSO on-premise cluster with:
- OCP 4.9.0-0.nightly-2021-11-24-090558
- LSO 4.9.0-202111151318
- OCS 4.9.0-249.ci

And unfortunately, I still see the same behaviour as before; events related to cpu metrics issues are still present right after installation:

```
$ oc get events -n openshift-storage | grep FailedGetResourceMetric
12m     Warning   FailedGetResourceMetric   horizontalpodautoscaler/noobaa-endpoint   failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
9m40s   Warning   FailedGetResourceMetric   horizontalpodautoscaler/noobaa-endpoint   failed to get cpu utilization: did not receive metrics for any ready pods

$ oc get events -n openshift-storage | grep FailedComputeMetricsReplicas
12m   Warning   FailedComputeMetricsReplicas   horizontalpodautoscaler/noobaa-endpoint   invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
10m   Warning   FailedComputeMetricsReplicas   horizontalpodautoscaler/noobaa-endpoint   invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: did not receive metrics for any ready pods
```

For whatever it is worth, I am also seeing this on OCP 4.10.11 with ODF 4.10 over LSO:

```
$ oc get events -n openshift-storage | grep -v Normal
LAST SEEN   TYPE      REASON                         OBJECT                                           MESSAGE
11m         Warning   FailedGetResourceMetric        horizontalpodautoscaler/noobaa-endpoint          failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
11m         Warning   FailedComputeMetricsReplicas   horizontalpodautoscaler/noobaa-endpoint          invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
8m23s       Warning   FailedGetResourceMetric        horizontalpodautoscaler/noobaa-endpoint          failed to get cpu utilization: did not receive metrics for any ready pods
8m38s       Warning   FailedComputeMetricsReplicas   horizontalpodautoscaler/noobaa-endpoint          invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: did not receive metrics for any ready pods
17m         Warning   ReconcileFailed                storagesystem/ocs-storagecluster-storagesystem   Operation cannot be fulfilled on storageclusters.ocs.openshift.io "ocs-storagecluster": the object has been modified; please apply your changes to the latest version and try again
```