Seen in 4.8 CI (a presubmit job, but one that failed before the code in the PR was exercised) [1]:

error: timed out waiting for the condition on clusteroperators/storage

That's post-install, while the test step waits for all ClusterOperators to go Progressing=False. Checking the ClusterOperator [2]:

{
  "lastTransitionTime": "2021-11-22T20:45:07Z",
  "message": "AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods",
  "reason": "AWSEBSCSIDriverOperatorCR_AWSEBSDriverControllerServiceController_Deploying",
  "status": "True",
  "type": "Progressing"
},

and that seems to be because:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26635/pull-ci-openshift-origin-release-4.8-e2e-aws/1462876690232381440/artifacts/e2e-aws/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name | startswith("aws-ebs-csi-driver-controller-")) | .metadata.name as $n | .status.conditions[] | select(.type == "Ready" and .status != "True") | .lastTransitionTime + " " + $n + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2021-11-22T20:43:25Z aws-ebs-csi-driver-controller-5b854fcf75-sgsfp Ready=False ContainersNotReady: containers with unready status: [csi-liveness-probe]

And that, in turn, seems to be because [3]:

F1122 21:26:35.761405 1 main.go:186] failed to start http server with error: listen tcp 0.0.0.0:10301: bind: address already in use

Not a one-off, based on the previous logs [4]:

F1122 21:26:35.761405 1 main.go:186] failed to start http server with error: listen tcp 0.0.0.0:10301: bind: address already in use
goroutine

Hrm, actually that's the same timestamp, so maybe those aren't actually the previous logs? In any case, the pod details show 14 restarts:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26635/pull-ci-openshift-origin-release-4.8-e2e-aws/1462876690232381440/artifacts/e2e-aws/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "aws-ebs-csi-driver-controller-5b854fcf75-sgsfp").status.containerStatuses[] | select(.name == "csi-liveness-probe")'
{
  "containerID": "cri-o://9ea3fa15d13cf26a13356a36f01cdda473292d8a394943c3169e11fd1341ca96",
  "image": "registry.build03.ci.openshift.org/ci-op-6sgmz4g0/stable@sha256:90e484073db0a10ba792e6e910f47acbd5526c61ca7edbd35c454397de8b825b",
  "imageID": "registry.build03.ci.openshift.org/ci-op-6sgmz4g0/stable@sha256:90e484073db0a10ba792e6e910f47acbd5526c61ca7edbd35c454397de8b825b",
  "lastState": {
    "terminated": {
      "containerID": "cri-o://9ea3fa15d13cf26a13356a36f01cdda473292d8a394943c3169e11fd1341ca96",
      "exitCode": 255,
      "finishedAt": "2021-11-22T21:26:35Z",
      "reason": "Error",
      "startedAt": "2021-11-22T21:26:35Z"
    }
  },
  "name": "csi-liveness-probe",
  "ready": false,
  "restartCount": 14,
  "started": false,
  "state": {
    "waiting": {
      "message": "back-off 5m0s restarting failed container=csi-liveness-probe pod=aws-ebs-csi-driver-controller-5b854fcf75-sgsfp_openshift-cluster-csi-drivers(51d772cf-cde3-4d78-a1d3-06538f4a8f53)",
      "reason": "CrashLoopBackOff"
    }
  }
}

This does not seem to be particularly common:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=96h&type=build-log&search=timed+out+waiting+for+the+condition+on+clusteroperators/storage' | grep 'failures match' | sort
pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade (all) - 15 runs, 87% failed, 8% of failures match = 7% impact
pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade (all) - 27 runs, 70% failed, 5% of failures match = 4% impact
pull-ci-openshift-origin-master-e2e-gcp-builds (all) - 23 runs, 43% failed, 10% of failures match = 4% impact
pull-ci-openshift-origin-release-4.8-e2e-aws (all) - 5 runs, 80% failed, 25% of failures match = 20% impact
pull-ci-openshift-vmware-vsphere-csi-driver-master-e2e-vsphere (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
pull-ci-openshift-vmware-vsphere-csi-driver-operator-master-e2e-vsphere-csi (all) - 3 runs, 33% failed, 100% of failures match = 33% impact

But it's also not great to have something that doesn't auto-recover. Kubelet/kernel bug for not releasing the port? Storage bug for not using SO_REUSEADDR? Something else? (One way to check this on a live node is sketched after the footnote links below.)

Looking up the kubelet logs:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26635/pull-ci-openshift-origin-release-4.8-e2e-aws/1462876690232381440/artifacts/e2e-aws/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "aws-ebs-csi-driver-controller-5b854fcf75-sgsfp").spec.nodeName'
ip-10-0-154-151.us-west-1.compute.internal
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26635/pull-ci-openshift-origin-release-4.8-e2e-aws/1462876690232381440/artifacts/e2e-aws/gather-extra/artifacts/nodes/ip-10-0-154-151.us-west-1.compute.internal/journal | zgrep -i 'aws-ebs-csi-driver-controller-5b854fcf75-sgsfp\|csi-liveness-probe\|9ea3fa15d13cf26a13356a36f01cdda473292d8a394943c3169e11fd1341ca96' | tail
Nov 22 21:26:35.781053 ip-10-0-154-151 systemd[1]: crio-conmon-9ea3fa15d13cf26a13356a36f01cdda473292d8a394943c3169e11fd1341ca96.scope: Succeeded.
Nov 22 21:26:35.781599 ip-10-0-154-151 systemd[1]: crio-conmon-9ea3fa15d13cf26a13356a36f01cdda473292d8a394943c3169e11fd1341ca96.scope: Consumed 34ms CPU time
Nov 22 21:26:35.925547 ip-10-0-154-151 hyperkube[1377]: I1122 21:26:35.925431 1377 kubelet.go:1982] "SyncLoop (PLEG): event for pod" pod="openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5b854fcf75-sgsfp" event=&{ID:51d772cf-cde3-4d78-a1d3-06538f4a8f53 Type:ContainerDied Data:9ea3fa15d13cf26a13356a36f01cdda473292d8a394943c3169e11fd1341ca96}
Nov 22 21:26:35.970295 ip-10-0-154-151 crio[1340]: time="2021-11-22 21:26:35.970260922Z" level=info msg="Removed container fd8ad717aad9ab0fa0c696ac01e9516a82034b4086ff06701a14a5044dbcad18: openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5b854fcf75-sgsfp/csi-liveness-probe" id=f938e7ab-49d0-4747-af26-daf228cf5a6c name=/runtime.v1alpha2.RuntimeService/RemoveContainer
Nov 22 21:26:36.936541 ip-10-0-154-151 hyperkube[1377]: I1122 21:26:36.936514 1377 scope.go:111] "RemoveContainer" containerID="9ea3fa15d13cf26a13356a36f01cdda473292d8a394943c3169e11fd1341ca96"
Nov 22 21:26:36.937210 ip-10-0-154-151 hyperkube[1377]: E1122 21:26:36.936889 1377 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"csi-liveness-probe\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=csi-liveness-probe pod=aws-ebs-csi-driver-controller-5b854fcf75-sgsfp_openshift-cluster-csi-drivers(51d772cf-cde3-4d78-a1d3-06538f4a8f53)\"" pod="openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5b854fcf75-sgsfp" podUID=51d772cf-cde3-4d78-a1d3-06538f4a8f53
Nov 22 21:26:52.589643 ip-10-0-154-151 hyperkube[1377]: I1122 21:26:52.589612 1377 scope.go:111] "RemoveContainer" containerID="9ea3fa15d13cf26a13356a36f01cdda473292d8a394943c3169e11fd1341ca96"
Nov 22 21:26:52.590188 ip-10-0-154-151 hyperkube[1377]: E1122 21:26:52.589890 1377 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"csi-liveness-probe\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=csi-liveness-probe pod=aws-ebs-csi-driver-controller-5b854fcf75-sgsfp_openshift-cluster-csi-drivers(51d772cf-cde3-4d78-a1d3-06538f4a8f53)\"" pod="openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5b854fcf75-sgsfp" podUID=51d772cf-cde3-4d78-a1d3-06538f4a8f53
Nov 22 21:27:05.590240 ip-10-0-154-151 hyperkube[1377]: I1122 21:27:05.590205 1377 scope.go:111] "RemoveContainer" containerID="9ea3fa15d13cf26a13356a36f01cdda473292d8a394943c3169e11fd1341ca96"
Nov 22 21:27:05.590804 ip-10-0-154-151 hyperkube[1377]: E1122 21:27:05.590612 1377 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"csi-liveness-probe\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=csi-liveness-probe pod=aws-ebs-csi-driver-controller-5b854fcf75-sgsfp_openshift-cluster-csi-drivers(51d772cf-cde3-4d78-a1d3-06538f4a8f53)\"" pod="openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5b854fcf75-sgsfp" podUID=51d772cf-cde3-4d78-a1d3-06538f4a8f53

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26635/pull-ci-openshift-origin-release-4.8-e2e-aws/1462876690232381440
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26635/pull-ci-openshift-origin-release-4.8-e2e-aws/1462876690232381440/artifacts/e2e-aws/gather-extra/artifacts/clusteroperators.json
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26635/pull-ci-openshift-origin-release-4.8-e2e-aws/1462876690232381440/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-csi-drivers_aws-ebs-csi-driver-controller-5b854fcf75-sgsfp_csi-liveness-probe.log
[4]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26635/pull-ci-openshift-origin-release-4.8-e2e-aws/1462876690232381440/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-csi-drivers_aws-ebs-csi-driver-controller-5b854fcf75-sgsfp_csi-liveness-probe_previous.log
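
For what it's worth, a rough sketch of how one might tell those theories apart on a live node while the container is crash-looping. This is only a sketch: it assumes node debug access, that the conflict is still reproducible at inspection time, and that the check needs to run inside the controller pod's network namespace (reached here via the sandbox pid from crictl inspectp); <sandbox-id> and <pid> are placeholders, and the crictl/jq field names are what CRI-O normally reports rather than something verified against this exact job:

$ oc debug node/ip-10-0-154-151.us-west-1.compute.internal
# chroot /host
# crictl pods --name aws-ebs-csi-driver-controller -q          # pod sandbox ID
# crictl inspectp <sandbox-id> | jq .info.pid                   # pid of the sandbox process
# nsenter -t <pid> -n ss -tanp 'sport = :10301'                 # sockets on 10301 in that pod's network namespace

A socket lingering in TIME_WAIT with no owning process would point at the "port not released yet" side (and SO_REUSEADDR on the probe's listener would be the usual mitigation), while a LISTEN entry with a pid would name the conflicting process directly.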
This is a known bug. We would welcome any ideas that would help us find what process is using port 10301 (or any other port; it's not just the livenessprobe).

*** This bug has been marked as a duplicate of bug 1990662 ***
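
In case it helps with the "which process" question: once ss (run in the pod's network namespace, as sketched in the previous comment) reports a pid for the listener, that pid can be mapped back to a container from the node's debug shell. A hypothetical sketch, with <pid> and <container-id> as placeholders and the label names being the usual CRI-O/Kubernetes ones:

# cat /proc/<pid>/cgroup                        # the crio-<container-id>.scope segment carries the CRI-O container ID
# crictl inspect <container-id> | jq .status.labels   # io.kubernetes.pod.name / io.kubernetes.container.name identify the owner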