Bug 1980123 - 4.7 to 4.8 oVirt update CI failing: monitoring UpdatingkubeStateMetricsFailed with a Prometheus pod stuck in ContainerCreating due to unmounted volume
Summary: 4.7 to 4.8 oVirt update CI failing: monitoring UpdatingkubeStateMetricsFailed...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Assignee: Gal Zaidman
QA Contact: Lucie Leistnerova
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-07 20:35 UTC by W. Trevor King
Modified: 2021-10-17 07:43 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-17 07:43:20 UTC
Target Upstream Version:
Embargoed:



Description W. Trevor King 2021-07-07 20:35:16 UTC
oVirt CI wedged mid-update [1]:

  : [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]	2h31m25s
    fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:153]: during upgrade to registry.build01.ci.openshift.org/ci-op-1slt30sg/release@sha256:92fa389aa952b8ed952a3ffe624be6353e4be9ec2bec1a811f4172b324a7f84d
    Unexpected error:
        <*errors.errorString | 0xc000763c00>: {
            s: "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator monitoring is not available",
        }
        Cluster did not complete upgrade: timed out waiting for the condition: 
    Cluster operator monitoring is not available
occurred

with the monitoring operator stuck:

  : operator conditions monitoring	0s
    Operator unavailable (UpdatingkubeStateMetricsFailed): Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.

because an incoming prometheus-k8s-0 pod got stuck in ContainerCreating:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1412396319145725952/artifacts/e2e-ovirt-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select((.metadata.name | startswith("prometheus-k8s-")) and .status.phase != "Running") | {name: .metadata.name, status}'
  {
    "name": "prometheus-k8s-0",
    "status": {
      "conditions": [
        {
          "lastProbeTime": null,
          "lastTransitionTime": "2021-07-06T14:10:19Z",
          "status": "True",
          "type": "Initialized"
        },
        {
          "lastProbeTime": null,
          "lastTransitionTime": "2021-07-06T14:10:19Z",
          "message": "containers with unready status: [prometheus config-reloader thanos-sidecar prometheus-proxy kube-rbac-proxy prom-label-proxy kube-rbac-proxy-thanos]",
          "reason": "ContainersNotReady",
          "status": "False",
          "type": "Ready"
        },
        {
          "lastProbeTime": null,
          "lastTransitionTime": "2021-07-06T14:10:19Z",
          "message": "containers with unready status: [prometheus config-reloader thanos-sidecar prometheus-proxy kube-rbac-proxy prom-label-proxy kube-rbac-proxy-thanos]",
          "reason": "ContainersNotReady",
          "status": "False",
          "type": "ContainersReady"
        },
        {
          "lastProbeTime": null,
          "lastTransitionTime": "2021-07-06T14:10:29Z",
          "status": "True",
          "type": "PodScheduled"
        }
      ],
      "containerStatuses": [
        {
          "image": "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:65b8e580a1c1af474a99722c1bd8f5c7cc01a20cd397fd15228805de4c4d64ee",
          "imageID": "",
          "lastState": {},
          "name": "config-reloader",
          "ready": false,
          "restartCount": 0,
          "started": false,
          "state": {
            "waiting": {
              "reason": "ContainerCreating"
            }
          }
        },
        {
          "image": "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:3869910c1e208b125bdecd4ac2d8b2cae42efe221c704491b86aa9b18ce95a65",
          "imageID": "",
          "lastState": {},
          "name": "kube-rbac-proxy",
          "ready": false,
          "restartCount": 0,
          "started": false,
          "state": {
            "waiting": {
              "reason": "ContainerCreating"
            }
          }
        },
        {
          "image": "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:3869910c1e208b125bdecd4ac2d8b2cae42efe221c704491b86aa9b18ce95a65",
          "imageID": "",
          "lastState": {},
          "name": "kube-rbac-proxy-thanos",
          "ready": false,
          "restartCount": 0,
          "started": false,
          "state": {
            "waiting": {
              "reason": "ContainerCreating"
            }
          }
        },
        {
          "image": "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:30c97178960cc004aaae10337e9ccbe7a1e74367b5ca639da029c1f72a38d821",
          "imageID": "",
          "lastState": {},
          "name": "prom-label-proxy",
          "ready": false,
          "restartCount": 0,
          "started": false,
          "state": {
            "waiting": {
              "reason": "ContainerCreating"
            }
          }
        },
        {
          "image": "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:12032ae530944cc7bbfc6ec9be25a256dd62585941e4bd2de12c51737e9f0086",
          "imageID": "",
          "lastState": {},
          "name": "prometheus",
          "ready": false,
          "restartCount": 0,
          "started": false,
          "state": {
            "waiting": {
              "reason": "ContainerCreating"
            }
          }
        },
        {
          "image": "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:5dec72260a6969bc8890c570a17e2033973b4fa520a562108b46bdf086c5ab59",
          "imageID": "",
          "lastState": {},
          "name": "prometheus-proxy",
          "ready": false,
          "restartCount": 0,
          "started": false,
          "state": {
            "waiting": {
              "reason": "ContainerCreating"
            }
          }
        },
        {
          "image": "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:8f381896534e396f610904536cedba5cbdbdefe752529ef5d54e198356b92851",
          "imageID": "",
          "lastState": {},
          "name": "thanos-sidecar",
          "ready": false,
          "restartCount": 0,
          "started": false,
          "state": {
            "waiting": {
              "reason": "ContainerCreating"
            }
          }
        }
      ],
      "hostIP": "192.168.213.140",
      "phase": "Pending",
      "qosClass": "Burstable",
      "startTime": "2021-07-06T14:10:19Z"
    }
  }
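
The pod's FailedAttachVolume/FailedMount events usually say directly why it is wedged; a minimal sketch, assuming the gather-extra step also captured an events.json alongside pods.json (the path below is extrapolated from the pods.json URL, not verified):

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1412396319145725952/artifacts/e2e-ovirt-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.involvedObject.namespace == "openshift-monitoring" and .involvedObject.name == "prometheus-k8s-0" and (.reason | test("FailedMount|FailedAttachVolume"))) | "\(.lastTimestamp) \(.reason) (x\(.count // 1)): \(.message)"'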

The stuck pod led to a number of alerts firing:

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success	2h31m24s
    Jul  6 16:08:37.798: Unexpected alerts fired or pending during the upgrade:

    alert ClusterMonitoringOperatorReconciliationErrors fired for 6945 seconds with labels: {severity="warning"}
    alert ClusterOperatorDegraded fired for 4958 seconds with labels: {condition="Degraded", endpoint="metrics", instance="192.168.213.132:9099", job="cluster-version-operator", name="monitoring", namespace="openshift-cluster-version", pod="cluster-version-operator-66f9c6c64-7dgf8", reason="UpdatingPrometheusK8SFailed", service="cluster-version-operator", severity="warning"}
    alert ClusterOperatorDown fired for 6338 seconds with labels: {endpoint="metrics", instance="192.168.213.132:9099", job="cluster-version-operator", name="monitoring", namespace="openshift-cluster-version", pod="cluster-version-operator-66f9c6c64-7dgf8", service="cluster-version-operator", severity="critical", version="4.7.19"}
    alert KubeContainerWaiting fired for 3381 seconds with labels: {container="config-reloader", namespace="openshift-monitoring", pod="prometheus-k8s-0", severity="warning"}
    alert KubeContainerWaiting fired for 3381 seconds with labels: {container="kube-rbac-proxy", namespace="openshift-monitoring", pod="prometheus-k8s-0", severity="warning"}
    alert KubeContainerWaiting fired for 3381 seconds with labels: {container="kube-rbac-proxy-thanos", namespace="openshift-monitoring", pod="prometheus-k8s-0", severity="warning"}
    alert KubeContainerWaiting fired for 3381 seconds with labels: {container="prom-label-proxy", namespace="openshift-monitoring", pod="prometheus-k8s-0", severity="warning"}
    alert KubeContainerWaiting fired for 3381 seconds with labels: {container="prometheus", namespace="openshift-monitoring", pod="prometheus-k8s-0", severity="warning"}
    alert KubeContainerWaiting fired for 3381 seconds with labels: {container="prometheus-proxy", namespace="openshift-monitoring", pod="prometheus-k8s-0", severity="warning"}
    alert KubeContainerWaiting fired for 3381 seconds with labels: {container="thanos-sidecar", namespace="openshift-monitoring", pod="prometheus-k8s-0", severity="warning"}
    alert KubePodNotReady fired for 6081 seconds with labels: {namespace="openshift-monitoring", pod="prometheus-k8s-0", severity="warning"}
    alert KubeStatefulSetReplicasMismatch fired for 6081 seconds with labels: {container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", namespace="openshift-monitoring", service="kube-state-metrics", severity="warning", statefulset="prometheus-k8s"}
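
The conditions behind the ClusterOperatorDegraded/ClusterOperatorDown alerts above are spelled out on the monitoring ClusterOperator itself; a minimal sketch, assuming gather-extra also captured a clusteroperators.json alongside pods.json (path extrapolated, not verified):

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1412396319145725952/artifacts/e2e-ovirt-upgrade/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "monitoring").status.conditions[] | "\(.type)=\(.status): \(.message)"'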

Checking the node's journal, there are some volume-mount issues:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1412396319145725952/artifacts/e2e-ovirt-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "prometheus-k8s-0").spec.nodeName'
  ovirt13-4qm42-worker-0-q62t6
  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1412396319145725952/artifacts/e2e-ovirt-upgrade/gather-extra/artifacts/nodes/ovirt13-4qm42-worker-0-q62t6/journal | gunzip | grep prometheus-k8s-0 | tail -n5
  Jul 06 16:08:08.648190 ovirt13-4qm42-worker-0-q62t6 hyperkube[1963]: E0706 16:08:08.648171    1963 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi.ovirt.org^17bf0834-a115-4bf2-8d1e-b60e3cc36bf6 podName: nodeName:}" failed. No retries permitted until 2021-07-06 16:10:10.648150475 +0000 UTC m=+10059.377419896 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"pvc-ce22d879-024d-47f5-b7a8-fd487f70c73f\" (UniqueName: \"kubernetes.io/csi/csi.ovirt.org^17bf0834-a115-4bf2-8d1e-b60e3cc36bf6\") pod \"prometheus-k8s-0\" (UID: \"c5c69b35-c217-46cf-9d85-ff57fa486ca5\") "
  Jul 06 16:10:10.740838 ovirt13-4qm42-worker-0-q62t6 hyperkube[1963]: I0706 16:10:10.740693    1963 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-ce22d879-024d-47f5-b7a8-fd487f70c73f" (UniqueName: "kubernetes.io/csi/csi.ovirt.org^17bf0834-a115-4bf2-8d1e-b60e3cc36bf6") pod "prometheus-k8s-0" (UID: "c5c69b35-c217-46cf-9d85-ff57fa486ca5")
  Jul 06 16:10:10.744826 ovirt13-4qm42-worker-0-q62t6 hyperkube[1963]: E0706 16:10:10.744808    1963 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi.ovirt.org^17bf0834-a115-4bf2-8d1e-b60e3cc36bf6 podName: nodeName:}" failed. No retries permitted until 2021-07-06 16:12:12.744788428 +0000 UTC m=+10181.474057843 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"pvc-ce22d879-024d-47f5-b7a8-fd487f70c73f\" (UniqueName: \"kubernetes.io/csi/csi.ovirt.org^17bf0834-a115-4bf2-8d1e-b60e3cc36bf6\") pod \"prometheus-k8s-0\" (UID: \"c5c69b35-c217-46cf-9d85-ff57fa486ca5\") "
  Jul 06 16:10:21.330557 ovirt13-4qm42-worker-0-q62t6 hyperkube[1963]: E0706 16:10:21.330443    1963 kubelet.go:1684] Unable to attach or mount volumes for pod "prometheus-k8s-0_openshift-monitoring(c5c69b35-c217-46cf-9d85-ff57fa486ca5)": unmounted volumes=[prometheus-data], unattached volumes=[secret-grpc-tls prometheus-data prometheus-k8s-rulefiles-0 secret-prometheus-k8s-tls configmap-kubelet-serving-ca-bundle config kube-api-access-7zvzf secret-prometheus-k8s-htpasswd secret-kube-rbac-proxy configmap-serving-certs-ca-bundle config-out secret-prometheus-k8s-proxy secret-kube-etcd-client-certs prometheus-trusted-ca-bundle tls-assets secret-prometheus-k8s-thanos-sidecar-tls]: timed out waiting for the condition; skipping pod
  Jul 06 16:10:21.330557 ovirt13-4qm42-worker-0-q62t6 hyperkube[1963]: E0706 16:10:21.330484    1963 pod_workers.go:191] Error syncing pod c5c69b35-c217-46cf-9d85-ff57fa486ca5 ("prometheus-k8s-0_openshift-monitoring(c5c69b35-c217-46cf-9d85-ff57fa486ca5)"), skipping: unmounted volumes=[prometheus-data], unattached volumes=[secret-grpc-tls prometheus-data prometheus-k8s-rulefiles-0 secret-prometheus-k8s-tls configmap-kubelet-serving-ca-bundle config kube-api-access-7zvzf secret-prometheus-k8s-htpasswd secret-kube-rbac-proxy configmap-serving-certs-ca-bundle config-out secret-prometheus-k8s-proxy secret-kube-etcd-client-certs prometheus-trusted-ca-bundle tls-assets secret-prometheus-k8s-thanos-sidecar-tls]: timed out waiting for the condition
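
The kubelet's "Volume not attached according to node status" errors point at the attach side rather than the mount itself.  On a live cluster (these objects don't obviously appear in the gathered artifacts), the CSI attacher's view of the PVC could be checked with something like:

  $ oc get volumeattachments.storage.k8s.io -o json | jq -r '.items[] | select(.spec.source.persistentVolumeName == "pvc-ce22d879-024d-47f5-b7a8-fd487f70c73f") | {attacher: .spec.attacher, node: .spec.nodeName, attached: .status.attached, attachError: .status.attachError.message}'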

I'm not quite sure what to search for to gauge how frequent this is.  Do we alert on volume-attachment issues specifically?  [2] looks similar.

Guessing the Storage component because of the volume-attachment failure, but feel free to redirect if this belongs with a different component.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1412396319145725952
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1412169425309667328

Comment 1 Benny Zlotnik 2021-07-12 07:33:16 UTC
I see a bunch of:
2021-07-06T14:10:42.632644361Z I0706 14:10:42.632571       1 controller.go:134] Attaching Disk fdccc92c-894a-4a10-a87b-fde38c4f0740 to VM 8ad754d4-ba57-4b82-8c22-947dce11e8f5
2021-07-06T14:10:44.632986238Z E0706 14:10:44.632826       1 server.go:125] /csi.v1.Controller/ControllerPublishVolume returned with error: Fault reason is "Operation Failed"
. Fault detail is "[Failed to hot-plug disk]". HTTP response code is "400". HTTP response message is "400 Bad Request".
2021-07-06T14:10:44.639532465Z I0706 14:10:44.639492       1 controller.go:134] Attaching Disk 17bf0834-a115-4bf2-8d1e-b60e3cc36bf6 to VM 26f77521-b00a-4749-a305-4e7ffb4c13cf
2021-07-06T14:10:45.142879835Z I0706 14:10:45.142840       1 controller.go:170] Attached Disk fdccc92c-894a-4a10-a87b-fde38c4f0740 to VM 8ad754d4-ba57-4b82-8c22-947dce11e8f5
2021-07-06T14:10:59.380210933Z E0706 14:10:59.380160       1 server.go:125] /csi.v1.Controller/ControllerPublishVolume returned with error: Fault reason is "Operation Failed"
. Fault detail is "[Failed to hot-plug disk]". HTTP response code is "400". HTTP response message is "400 Bad Request".
2021-07-06T14:10:59.389755567Z I0706 14:10:59.389711       1 controller.go:134] Attaching Disk 17bf0834-a115-4bf2-8d1e-b60e3cc36bf6 to VM 26f77521-b00a-4749-a305-4e7ffb4c13cf
2021-07-06T14:11:14.030995886Z E0706 14:11:14.030950       1 server.go:125] /csi.v1.Controller/ControllerPublishVolume returned with error: Fault reason is "Operation Failed"
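
For a rough sense of how often the hot-plug failed for the disk backing prometheus-k8s-0's PVC, the attach attempts and publish failures in the controller log can be counted; the filename below is just a placeholder for a locally saved copy of that log, not a confirmed artifact path:

  $ grep -c 'ControllerPublishVolume returned with error' ovirt-csi-driver-controller.log
  $ grep -c 'Attaching Disk 17bf0834-a115-4bf2-8d1e-b60e3cc36bf6' ovirt-csi-driver-controller.log

Comparing the two counts shows whether the 17bf0834 attach ever succeeded or kept failing until the job timed out.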


Gal, is there any way we can check the engine logs?

Comment 2 Gal Zaidman 2021-08-31 11:54:17 UTC
Looking at recent CI jobs, this rarely happens anymore [1]. We also did some storage-related infra work on CI lately that might have solved the issue, and 4.10 has a number of bug fixes related to retries on storage provisioning.
I'm decreasing the severity to Low since this issue has a very limited impact on CI and is probably infra-related.
I don't want to close it yet because I want to keep tracking this issue for 4.10.

[1] https://search.ci.openshift.org/?search=operator+conditions+monitoring&maxAge=168h&context=1&type=junit&name=ovirt&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 3 Gal Zaidman 2021-10-17 07:43:20 UTC
It looks like this issue rarely happens, and only on 4.7->4.8, with a 7% failure match in the past 14 days:

https://search.ci.openshift.org/?search=Cluster+did+not+complete+upgrade%3A+timed+out+waiting+for+the+condition%3A+Cluster+operator+monitoring+is+not+available&maxAge=336h&context=1&type=junit&name=ovirt&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Closing this issue since it is probably an infra issue; we can see that when it happens it persists for a while and then disappears.

