oVirt CI wedged mid-update [1]:

: [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]  2h31m25s
fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:153]: during upgrade to registry.build01.ci.openshift.org/ci-op-1slt30sg/release@sha256:92fa389aa952b8ed952a3ffe624be6353e4be9ec2bec1a811f4172b324a7f84d
Unexpected error:
    <*errors.errorString | 0xc000763c00>: {
        s: "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator monitoring is not available",
    }
    Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator monitoring is not available
occurred

with monitoring sticking:

: operator conditions monitoring  0s
Operator unavailable (UpdatingkubeStateMetricsFailed): Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.

because an incoming prometheus-k8s-0 pod got stuck in ContainerCreating:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1412396319145725952/artifacts/e2e-ovirt-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select((.metadata.name | startswith("prometheus-k8s-")) and .status.phase != "Running") | {name: .metadata.name, status}'
{
  "name": "prometheus-k8s-0",
  "status": {
    "conditions": [
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2021-07-06T14:10:19Z",
        "status": "True",
        "type": "Initialized"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2021-07-06T14:10:19Z",
        "message": "containers with unready status: [prometheus config-reloader thanos-sidecar prometheus-proxy kube-rbac-proxy prom-label-proxy kube-rbac-proxy-thanos]",
        "reason": "ContainersNotReady",
        "status": "False",
        "type": "Ready"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2021-07-06T14:10:19Z",
        "message": "containers with unready status: [prometheus config-reloader thanos-sidecar prometheus-proxy kube-rbac-proxy prom-label-proxy kube-rbac-proxy-thanos]",
        "reason": "ContainersNotReady",
        "status": "False",
        "type": "ContainersReady"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2021-07-06T14:10:29Z",
        "status": "True",
        "type": "PodScheduled"
      }
    ],
    "containerStatuses": [
      {
        "image": "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:65b8e580a1c1af474a99722c1bd8f5c7cc01a20cd397fd15228805de4c4d64ee",
        "imageID": "",
        "lastState": {},
        "name": "config-reloader",
        "ready": false,
        "restartCount": 0,
        "started": false,
        "state": { "waiting": { "reason": "ContainerCreating" } }
      },
      {
        "image": "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:3869910c1e208b125bdecd4ac2d8b2cae42efe221c704491b86aa9b18ce95a65",
        "imageID": "",
        "lastState": {},
        "name": "kube-rbac-proxy",
        "ready": false,
        "restartCount": 0,
        "started": false,
        "state": { "waiting": { "reason": "ContainerCreating" } }
      },
      {
        "image": "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:3869910c1e208b125bdecd4ac2d8b2cae42efe221c704491b86aa9b18ce95a65",
        "imageID": "",
        "lastState": {},
        "name": "kube-rbac-proxy-thanos",
        "ready": false,
        "restartCount": 0,
        "started": false,
        "state": { "waiting": { "reason": "ContainerCreating" } }
      },
      {
        "image": "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:30c97178960cc004aaae10337e9ccbe7a1e74367b5ca639da029c1f72a38d821",
        "imageID": "",
        "lastState": {},
        "name": "prom-label-proxy",
        "ready": false,
        "restartCount": 0,
        "started": false,
        "state": { "waiting": { "reason": "ContainerCreating" } }
      },
      {
        "image": "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:12032ae530944cc7bbfc6ec9be25a256dd62585941e4bd2de12c51737e9f0086",
        "imageID": "",
        "lastState": {},
        "name": "prometheus",
        "ready": false,
        "restartCount": 0,
        "started": false,
        "state": { "waiting": { "reason": "ContainerCreating" } }
      },
      {
        "image": "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:5dec72260a6969bc8890c570a17e2033973b4fa520a562108b46bdf086c5ab59",
        "imageID": "",
        "lastState": {},
        "name": "prometheus-proxy",
        "ready": false,
        "restartCount": 0,
        "started": false,
        "state": { "waiting": { "reason": "ContainerCreating" } }
      },
      {
        "image": "registry.ci.openshift.org/ocp/4.8-2021-06-14-090014@sha256:8f381896534e396f610904536cedba5cbdbdefe752529ef5d54e198356b92851",
        "imageID": "",
        "lastState": {},
        "name": "thanos-sidecar",
        "ready": false,
        "restartCount": 0,
        "started": false,
        "state": { "waiting": { "reason": "ContainerCreating" } }
      }
    ],
    "hostIP": "192.168.213.140",
    "phase": "Pending",
    "qosClass": "Burstable",
    "startTime": "2021-07-06T14:10:19Z"
  }
}

which led to a number of alerts going off:

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success  2h31m24s
Jul 6 16:08:37.798: Unexpected alerts fired or pending during the upgrade:

alert ClusterMonitoringOperatorReconciliationErrors fired for 6945 seconds with labels: {severity="warning"}
alert ClusterOperatorDegraded fired for 4958 seconds with labels: {condition="Degraded", endpoint="metrics", instance="192.168.213.132:9099", job="cluster-version-operator", name="monitoring", namespace="openshift-cluster-version", pod="cluster-version-operator-66f9c6c64-7dgf8", reason="UpdatingPrometheusK8SFailed", service="cluster-version-operator", severity="warning"}
alert ClusterOperatorDown fired for 6338 seconds with labels: {endpoint="metrics", instance="192.168.213.132:9099", job="cluster-version-operator", name="monitoring", namespace="openshift-cluster-version", pod="cluster-version-operator-66f9c6c64-7dgf8", service="cluster-version-operator", severity="critical", version="4.7.19"}
alert KubeContainerWaiting fired for 3381 seconds with labels: {container="config-reloader", namespace="openshift-monitoring", pod="prometheus-k8s-0", severity="warning"}
alert KubeContainerWaiting fired for 3381 seconds with labels: {container="kube-rbac-proxy", namespace="openshift-monitoring", pod="prometheus-k8s-0", severity="warning"}
alert KubeContainerWaiting fired for 3381 seconds with labels: {container="kube-rbac-proxy-thanos", namespace="openshift-monitoring", pod="prometheus-k8s-0", severity="warning"}
alert KubeContainerWaiting fired for 3381 seconds with labels: {container="prom-label-proxy", namespace="openshift-monitoring", pod="prometheus-k8s-0", severity="warning"}
alert KubeContainerWaiting fired for 3381 seconds with labels: {container="prometheus", namespace="openshift-monitoring", pod="prometheus-k8s-0", severity="warning"}
alert KubeContainerWaiting fired for 3381 seconds with labels: {container="prometheus-proxy", namespace="openshift-monitoring", pod="prometheus-k8s-0", severity="warning"}
alert KubeContainerWaiting fired for 3381 seconds with labels: {container="thanos-sidecar", namespace="openshift-monitoring", pod="prometheus-k8s-0", severity="warning"}
alert KubePodNotReady fired for 6081 seconds with labels: {namespace="openshift-monitoring", pod="prometheus-k8s-0", severity="warning"}
alert KubeStatefulSetReplicasMismatch fired for 6081 seconds with labels: {container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", namespace="openshift-monitoring", service="kube-state-metrics", severity="warning", statefulset="prometheus-k8s"}

Checking node logs, there are some volume mount issues:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1412396319145725952/artifacts/e2e-ovirt-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "prometheus-k8s-0").spec.nodeName'
ovirt13-4qm42-worker-0-q62t6
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1412396319145725952/artifacts/e2e-ovirt-upgrade/gather-extra/artifacts/nodes/ovirt13-4qm42-worker-0-q62t6/journal | gunzip | grep prometheus-k8s-0 | tail -n5
Jul 06 16:08:08.648190 ovirt13-4qm42-worker-0-q62t6 hyperkube[1963]: E0706 16:08:08.648171 1963 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi.ovirt.org^17bf0834-a115-4bf2-8d1e-b60e3cc36bf6 podName: nodeName:}" failed. No retries permitted until 2021-07-06 16:10:10.648150475 +0000 UTC m=+10059.377419896 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"pvc-ce22d879-024d-47f5-b7a8-fd487f70c73f\" (UniqueName: \"kubernetes.io/csi/csi.ovirt.org^17bf0834-a115-4bf2-8d1e-b60e3cc36bf6\") pod \"prometheus-k8s-0\" (UID: \"c5c69b35-c217-46cf-9d85-ff57fa486ca5\") "
Jul 06 16:10:10.740838 ovirt13-4qm42-worker-0-q62t6 hyperkube[1963]: I0706 16:10:10.740693 1963 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-ce22d879-024d-47f5-b7a8-fd487f70c73f" (UniqueName: "kubernetes.io/csi/csi.ovirt.org^17bf0834-a115-4bf2-8d1e-b60e3cc36bf6") pod "prometheus-k8s-0" (UID: "c5c69b35-c217-46cf-9d85-ff57fa486ca5")
Jul 06 16:10:10.744826 ovirt13-4qm42-worker-0-q62t6 hyperkube[1963]: E0706 16:10:10.744808 1963 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi.ovirt.org^17bf0834-a115-4bf2-8d1e-b60e3cc36bf6 podName: nodeName:}" failed. No retries permitted until 2021-07-06 16:12:12.744788428 +0000 UTC m=+10181.474057843 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"pvc-ce22d879-024d-47f5-b7a8-fd487f70c73f\" (UniqueName: \"kubernetes.io/csi/csi.ovirt.org^17bf0834-a115-4bf2-8d1e-b60e3cc36bf6\") pod \"prometheus-k8s-0\" (UID: \"c5c69b35-c217-46cf-9d85-ff57fa486ca5\") "
Jul 06 16:10:21.330557 ovirt13-4qm42-worker-0-q62t6 hyperkube[1963]: E0706 16:10:21.330443 1963 kubelet.go:1684] Unable to attach or mount volumes for pod "prometheus-k8s-0_openshift-monitoring(c5c69b35-c217-46cf-9d85-ff57fa486ca5)": unmounted volumes=[prometheus-data], unattached volumes=[secret-grpc-tls prometheus-data prometheus-k8s-rulefiles-0 secret-prometheus-k8s-tls configmap-kubelet-serving-ca-bundle config kube-api-access-7zvzf secret-prometheus-k8s-htpasswd secret-kube-rbac-proxy configmap-serving-certs-ca-bundle config-out secret-prometheus-k8s-proxy secret-kube-etcd-client-certs prometheus-trusted-ca-bundle tls-assets secret-prometheus-k8s-thanos-sidecar-tls]: timed out waiting for the condition; skipping pod
Jul 06 16:10:21.330557 ovirt13-4qm42-worker-0-q62t6 hyperkube[1963]: E0706 16:10:21.330484 1963 pod_workers.go:191] Error syncing pod c5c69b35-c217-46cf-9d85-ff57fa486ca5 ("prometheus-k8s-0_openshift-monitoring(c5c69b35-c217-46cf-9d85-ff57fa486ca5)"), skipping: unmounted volumes=[prometheus-data], unattached volumes=[secret-grpc-tls prometheus-data prometheus-k8s-rulefiles-0 secret-prometheus-k8s-tls configmap-kubelet-serving-ca-bundle config kube-api-access-7zvzf secret-prometheus-k8s-htpasswd secret-kube-rbac-proxy configmap-serving-certs-ca-bundle config-out secret-prometheus-k8s-proxy secret-kube-etcd-client-certs prometheus-trusted-ca-bundle tls-assets secret-prometheus-k8s-thanos-sidecar-tls]: timed out waiting for the condition

I'm not quite sure what to search for to gauge frequency. Do we alert on volume-attachment issues specifically? [2] looks similar. Guessing the storage component for the volume-attachment issue, but feel free to redirect if this seems like a different component.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1412396319145725952
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1412169425309667328
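For gauging how bad it was within this run, one rough option (a sketch reusing only the gather-extra artifacts and the kubelet error string already quoted above; the JOB_ARTIFACTS variable is just a convenience for the URL) is to count the attach failures in the node journal:

$ JOB_ARTIFACTS=https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1412396319145725952/artifacts/e2e-ovirt-upgrade/gather-extra/artifacts
$ # count "Volume not attached" kubelet errors on the node that hosted prometheus-k8s-0
$ curl -s "${JOB_ARTIFACTS}/nodes/ovirt13-4qm42-worker-0-q62t6/journal" | gunzip | grep -c 'Volume not attached according to node status'

The same grep against the other node journals in the run would show whether the attach failures were limited to this one worker.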
I see a bunch of these:

2021-07-06T14:10:42.632644361Z I0706 14:10:42.632571 1 controller.go:134] Attaching Disk fdccc92c-894a-4a10-a87b-fde38c4f0740 to VM 8ad754d4-ba57-4b82-8c22-947dce11e8f5
2021-07-06T14:10:44.632986238Z E0706 14:10:44.632826 1 server.go:125] /csi.v1.Controller/ControllerPublishVolume returned with error: Fault reason is "Operation Failed". Fault detail is "[Failed to hot-plug disk]". HTTP response code is "400". HTTP response message is "400 Bad Request".
2021-07-06T14:10:44.639532465Z I0706 14:10:44.639492 1 controller.go:134] Attaching Disk 17bf0834-a115-4bf2-8d1e-b60e3cc36bf6 to VM 26f77521-b00a-4749-a305-4e7ffb4c13cf
2021-07-06T14:10:45.142879835Z I0706 14:10:45.142840 1 controller.go:170] Attached Disk fdccc92c-894a-4a10-a87b-fde38c4f0740 to VM 8ad754d4-ba57-4b82-8c22-947dce11e8f5
2021-07-06T14:10:59.380210933Z E0706 14:10:59.380160 1 server.go:125] /csi.v1.Controller/ControllerPublishVolume returned with error: Fault reason is "Operation Failed". Fault detail is "[Failed to hot-plug disk]". HTTP response code is "400". HTTP response message is "400 Bad Request".
2021-07-06T14:10:59.389755567Z I0706 14:10:59.389711 1 controller.go:134] Attaching Disk 17bf0834-a115-4bf2-8d1e-b60e3cc36bf6 to VM 26f77521-b00a-4749-a305-4e7ffb4c13cf
2021-07-06T14:11:14.030995886Z E0706 14:11:14.030950 1 server.go:125] /csi.v1.Controller/ControllerPublishVolume returned with error: Fault reason is "Operation Failed"

Gal, is there any way we can check the engine logs?
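If it helps narrow things down on the engine side, a rough sketch (assumptions: the engine log is at the usual default /var/log/ovirt-engine/engine.log on the engine host, and the disk ID from the failing ControllerPublishVolume calls above shows up in the hot-plug command entries):

$ # run on the oVirt engine host; default engine.log path assumed
$ grep 17bf0834-a115-4bf2-8d1e-b60e3cc36bf6 /var/log/ovirt-engine/engine.log | tail -n 20
$ # or look for the hot-plug failures directly
$ grep -iE 'hot ?plug.*(fail|error)' /var/log/ovirt-engine/engine.log | tail -n 20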
Looking at recent CI jobs, this rarely happens anymore [1]. We also did some storage-related infra work on CI lately that might have solved the issue, and in 4.10 we have a number of bug fixes related to retries on storage provisioning. I'm decreasing the severity to Low since this issue has a very limited impact on CI and is probably infra-related. I don't want to close it yet because I want to keep tracking it for 4.10.

[1] https://search.ci.openshift.org/?search=operator+conditions+monitoring&maxAge=168h&context=1&type=junit&name=ovirt&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
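For re-checking this later from a terminal instead of the UI, something along these lines might work (assumption: search.ci.openshift.org exposes a /search endpoint that accepts the same query parameters as the link above and returns a JSON object keyed by the matching job/result URI; adjust if the API differs):

$ curl -s 'https://search.ci.openshift.org/search?search=operator+conditions+monitoring&maxAge=168h&type=junit&name=ovirt' | jq -r 'keys[]'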
Looks like this issue rarely happens, and only on 4.7->4.8 upgrades, with a 7% failure match in the past 14 days:

https://search.ci.openshift.org/?search=Cluster+did+not+complete+upgrade%3A+timed+out+waiting+for+the+condition%3A+Cluster+operator+monitoring+is+not+available&maxAge=336h&context=1&type=junit&name=ovirt&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Closing this issue since it is probably an infra issue; we can see that when it happens it persists for a bit and then disappears.