Bug 2025742 - AWSEBSDriverControllerServiceControllerProgressing with csi-liveness-probe in CrashLoopBackOff on 10301: bind: address already in use
Status: CLOSED DUPLICATE of bug 1990662
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: aos-storage-staff@redhat.com
QA Contact: Wei Duan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-11-22 22:27 UTC by W. Trevor King
Modified: 2021-11-23 11:32 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-23 11:32:12 UTC
Target Upstream Version:
Embargoed:



Description W. Trevor King 2021-11-22 22:27:47 UTC
Seen in 4.8 CI (a presubmit job, but one that failed before the code in the PR was exercised) [1]:

  error: timed out waiting for the condition on clusteroperators/storage

That's post-install, while the test step waits for all ClusterOperators to go Progressing=False.  Checking the storage ClusterOperator [2]:

  {
    "lastTransitionTime": "2021-11-22T20:45:07Z",
    "message": "AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods",
    "reason": "AWSEBSCSIDriverOperatorCR_AWSEBSDriverControllerServiceController_Deploying",
    "status": "True",
    "type": "Progressing"
  },
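
(For reference, that condition can be pulled straight out of [2] with a jq filter in the same style as the pod queries below; this is just a convenience sketch and assumes the usual gather-extra List layout:)

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26635/pull-ci-openshift-origin-release-4.8-e2e-aws/1462876690232381440/artifacts/e2e-aws/gather-extra/artifacts/clusteroperators.json | jq '.items[] | select(.metadata.name == "storage").status.conditions[] | select(.type == "Progressing")'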

and that seems to be because:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26635/pull-ci-openshift-origin-release-4.8-e2e-aws/1462876690232381440/artifacts/e2e-aws/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name | startswith("aws-ebs-csi-driver-controller-")) | .metadata.name as $n | .status.conditions[] | select(.type == "Ready" and .status != "True") | .lastTransitionTime + " " + $n + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2021-11-22T20:43:25Z aws-ebs-csi-driver-controller-5b854fcf75-sgsfp Ready=False ContainersNotReady: containers with unready status: [csi-liveness-probe]

And that seems to be because [3]:

  F1122 21:26:35.761405       1 main.go:186] failed to start http server with error: listen tcp 0.0.0.0:10301: bind: address already in use

Not a one-off, based on the previous logs [4]:

  F1122 21:26:35.761405       1 main.go:186] failed to start http server with error: listen tcp 0.0.0.0:10301: bind: address already in use
  goroutine

Hrm, actually that's the same timestamp, so maybe [4] isn't really a distinct previous log.  Pod details show 14 restarts:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26635/pull-ci-openshift-origin-release-4.8-e2e-aws/1462876690232381440/artifacts/e2e-aws/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "aws-ebs-csi-driver-controller-5b854fcf75-sgsfp").status.containerStatuses[] | select(.name == "csi-liveness-probe")'
  {
    "containerID": "cri-o://9ea3fa15d13cf26a13356a36f01cdda473292d8a394943c3169e11fd1341ca96",
    "image": "registry.build03.ci.openshift.org/ci-op-6sgmz4g0/stable@sha256:90e484073db0a10ba792e6e910f47acbd5526c61ca7edbd35c454397de8b825b",
    "imageID": "registry.build03.ci.openshift.org/ci-op-6sgmz4g0/stable@sha256:90e484073db0a10ba792e6e910f47acbd5526c61ca7edbd35c454397de8b825b",
    "lastState": {
      "terminated": {
        "containerID": "cri-o://9ea3fa15d13cf26a13356a36f01cdda473292d8a394943c3169e11fd1341ca96",
        "exitCode": 255,
        "finishedAt": "2021-11-22T21:26:35Z",
        "reason": "Error",
        "startedAt": "2021-11-22T21:26:35Z"
      }
    },
    "name": "csi-liveness-probe",
    "ready": false,
    "restartCount": 14,
    "started": false,
    "state": {
      "waiting": {
        "message": "back-off 5m0s restarting failed container=csi-liveness-probe pod=aws-ebs-csi-driver-controller-5b854fcf75-sgsfp_openshift-cluster-csi-drivers(51d772cf-cde3-4d78-a1d3-06538f4a8f53)",
        "reason": "CrashLoopBackOff"
      }
    }
  }
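
(A quick way to settle the same-timestamp question above would be to diff the current [3] and previous [4] logs directly; if this prints "same bytes", the "previous" log is just another copy of the same crash output rather than a genuinely earlier run:)

  $ diff <(curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26635/pull-ci-openshift-origin-release-4.8-e2e-aws/1462876690232381440/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-csi-drivers_aws-ebs-csi-driver-controller-5b854fcf75-sgsfp_csi-liveness-probe.log) <(curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26635/pull-ci-openshift-origin-release-4.8-e2e-aws/1462876690232381440/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-csi-drivers_aws-ebs-csi-driver-controller-5b854fcf75-sgsfp_csi-liveness-probe_previous.log) && echo 'same bytes'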

Does not seem to be particularly common:

  $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=96h&type=build-log&search=timed+out+waiting+for+the+condition+on+clusteroperators/storage' | grep 'failures match' | sort
  pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade (all) - 15 runs, 87% failed, 8% of failures match = 7% impact
  pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade (all) - 27 runs, 70% failed, 5% of failures match = 4% impact
  pull-ci-openshift-origin-master-e2e-gcp-builds (all) - 23 runs, 43% failed, 10% of failures match = 4% impact
  pull-ci-openshift-origin-release-4.8-e2e-aws (all) - 5 runs, 80% failed, 25% of failures match = 20% impact
  pull-ci-openshift-vmware-vsphere-csi-driver-master-e2e-vsphere (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
  pull-ci-openshift-vmware-vsphere-csi-driver-operator-master-e2e-vsphere-csi (all) - 3 runs, 33% failed, 100% of failures match = 33% impact

But it's also not great to have something that doesn't auto-recover.

Kubelet/kernel bug for not releasing the port?  Storage bug for not using SO_REUSEADDR?  Something else?  Looking up the kubelet logs:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26635/pull-ci-openshift-origin-release-4.8-e2e-aws/1462876690232381440/artifacts/e2e-aws/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "aws-ebs-csi-driver-controller-5b854fcf75-sgsfp").spec.nodeName'
  ip-10-0-154-151.us-west-1.compute.internal
  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26635/pull-ci-openshift-origin-release-4.8-e2e-aws/1462876690232381440/artifacts/e2e-aws/gather-extra/artifacts/nodes/ip-10-0-154-151.us-west-1.compute.internal/journal | zgrep -i 'aws-ebs-csi-driver-controller-5b854fcf75-sgsfp\|csi-liveness-probe\|9ea3fa15d13cf26a13356a36f01cdda473292d8a394943c3169e11fd1341ca96' | tail
  Nov 22 21:26:35.781053 ip-10-0-154-151 systemd[1]: crio-conmon-9ea3fa15d13cf26a13356a36f01cdda473292d8a394943c3169e11fd1341ca96.scope: Succeeded.
  Nov 22 21:26:35.781599 ip-10-0-154-151 systemd[1]: crio-conmon-9ea3fa15d13cf26a13356a36f01cdda473292d8a394943c3169e11fd1341ca96.scope: Consumed 34ms CPU time
  Nov 22 21:26:35.925547 ip-10-0-154-151 hyperkube[1377]: I1122 21:26:35.925431    1377 kubelet.go:1982] "SyncLoop (PLEG): event for pod" pod="openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5b854fcf75-sgsfp" event=&{ID:51d772cf-cde3-4d78-a1d3-06538f4a8f53 Type:ContainerDied Data:9ea3fa15d13cf26a13356a36f01cdda473292d8a394943c3169e11fd1341ca96}
  Nov 22 21:26:35.970295 ip-10-0-154-151 crio[1340]: time="2021-11-22 21:26:35.970260922Z" level=info msg="Removed container fd8ad717aad9ab0fa0c696ac01e9516a82034b4086ff06701a14a5044dbcad18: openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5b854fcf75-sgsfp/csi-liveness-probe" id=f938e7ab-49d0-4747-af26-daf228cf5a6c name=/runtime.v1alpha2.RuntimeService/RemoveContainer
  Nov 22 21:26:36.936541 ip-10-0-154-151 hyperkube[1377]: I1122 21:26:36.936514    1377 scope.go:111] "RemoveContainer" containerID="9ea3fa15d13cf26a13356a36f01cdda473292d8a394943c3169e11fd1341ca96"
  Nov 22 21:26:36.937210 ip-10-0-154-151 hyperkube[1377]: E1122 21:26:36.936889    1377 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"csi-liveness-probe\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=csi-liveness-probe pod=aws-ebs-csi-driver-controller-5b854fcf75-sgsfp_openshift-cluster-csi-drivers(51d772cf-cde3-4d78-a1d3-06538f4a8f53)\"" pod="openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5b854fcf75-sgsfp" podUID=51d772cf-cde3-4d78-a1d3-06538f4a8f53
  Nov 22 21:26:52.589643 ip-10-0-154-151 hyperkube[1377]: I1122 21:26:52.589612    1377 scope.go:111] "RemoveContainer" containerID="9ea3fa15d13cf26a13356a36f01cdda473292d8a394943c3169e11fd1341ca96"
  Nov 22 21:26:52.590188 ip-10-0-154-151 hyperkube[1377]: E1122 21:26:52.589890    1377 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"csi-liveness-probe\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=csi-liveness-probe pod=aws-ebs-csi-driver-controller-5b854fcf75-sgsfp_openshift-cluster-csi-drivers(51d772cf-cde3-4d78-a1d3-06538f4a8f53)\"" pod="openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5b854fcf75-sgsfp" podUID=51d772cf-cde3-4d78-a1d3-06538f4a8f53
  Nov 22 21:27:05.590240 ip-10-0-154-151 hyperkube[1377]: I1122 21:27:05.590205    1377 scope.go:111] "RemoveContainer" containerID="9ea3fa15d13cf26a13356a36f01cdda473292d8a394943c3169e11fd1341ca96"
  Nov 22 21:27:05.590804 ip-10-0-154-151 hyperkube[1377]: E1122 21:27:05.590612    1377 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"csi-liveness-probe\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=csi-liveness-probe pod=aws-ebs-csi-driver-controller-5b854fcf75-sgsfp_openshift-cluster-csi-drivers(51d772cf-cde3-4d78-a1d3-06538f4a8f53)\"" pod="openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5b854fcf75-sgsfp" podUID=51d772cf-cde3-4d78-a1d3-06538f4a8f53
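
(If this reproduces on a live cluster, something along these lines should show what is actually holding 10301 on the node; this is only a sketch and assumes the controller pod is on the host network.  If the pod has its own network namespace instead, the same ss call would need to run inside that namespace, e.g. via nsenter against the sandbox PID from crictl inspectp:)

  $ oc debug node/ip-10-0-154-151.us-west-1.compute.internal
  # chroot /host
  # ss -tlnp | grep ':10301'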

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26635/pull-ci-openshift-origin-release-4.8-e2e-aws/1462876690232381440
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26635/pull-ci-openshift-origin-release-4.8-e2e-aws/1462876690232381440/artifacts/e2e-aws/gather-extra/artifacts/clusteroperators.json
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26635/pull-ci-openshift-origin-release-4.8-e2e-aws/1462876690232381440/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-csi-drivers_aws-ebs-csi-driver-controller-5b854fcf75-sgsfp_csi-liveness-probe.log
[4]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26635/pull-ci-openshift-origin-release-4.8-e2e-aws/1462876690232381440/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-csi-drivers_aws-ebs-csi-driver-controller-5b854fcf75-sgsfp_csi-liveness-probe_previous.log

Comment 1 Jan Safranek 2021-11-23 11:32:12 UTC
This is a known bug. We would welcome any idea that would help us find which process is using port 10301 (or any other port; it's not just the livenessprobe).

*** This bug has been marked as a duplicate of bug 1990662 ***
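
(For whoever picks this up in bug 1990662: one low-tech idea, using only the already-gathered pods.json, is to look for any other container that mentions 10301 in its command, args, or declared ports.  This is just a sketch; a conflicting host process, or a port picked at runtime rather than in the pod spec, would not show up this way:)

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26635/pull-ci-openshift-origin-release-4.8-e2e-aws/1462876690232381440/artifacts/e2e-aws/gather-extra/artifacts/pods.json | jq -r '.items[] | (.metadata.namespace + "/" + .metadata.name) as $pod | .spec.containers[] | select([.command, .args, .ports] | tostring | test("10301")) | $pod + " " + .name'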

