[build cop] Observing OCP test failures because the kube-controller-manager static pod cannot bind to 0.0.0.0:10252 (address already in use). Seems similar to:
https://bugzilla.redhat.com/show_bug.cgi?id=1678847
https://bugzilla.redhat.com/show_bug.cgi?id=1670727

Snippet:

I0320 14:43:27.406257 1 flags.go:33] FLAG: --use-service-account-credentials="false"
I0320 14:43:27.406263 1 flags.go:33] FLAG: --v="2"
I0320 14:43:27.406270 1 flags.go:33] FLAG: --version="false"
I0320 14:43:27.406280 1 flags.go:33] FLAG: --vmodule=""
I0320 14:43:27.880763 1 serving.go:296] Generated self-signed cert (/var/run/kubernetes/kube-controller-manager.crt, /var/run/kubernetes/kube-controller-manager.key)
failed to create listener: failed to listen on 0.0.0.0:10252: listen tcp 0.0.0.0:10252: bind: address already in use

Mar 20 14:43:55.876 E clusteroperator/kube-apiserver changed Failing to True: RevisionControllerFailing: RevisionControllerFailing: Post https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/configmaps: read tcp 10.130.0.6:37742->172.30.0.1:443: read: connection reset by peer

Test build ref: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5925
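For anyone triaging a similar failure, the crash message can be pulled straight from the pod status instead of scrolling through the full YAML. A minimal sketch (the pod name is a placeholder; the jsonpath fields are standard pod status fields):

$ # Print the last terminated message of the crashing static pod container
$ oc get pod <kube-controller-manager-pod> -n openshift-kube-controller-manager \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.message}'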
Created attachment 1546781 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 53 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours. Generated with [1]:

$ deck-build-log-plot 'listen tcp 0\\.0\\.0\\.0:10251: bind: address already in use'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
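For a rough cross-check without the plotting script, the same count can be approximated with plain grep, assuming the jobs' build-log.txt files have already been mirrored into a local directory (the ./logs path is hypothetical):

$ # Count how many downloaded CI build logs contain the bind error
$ grep -rl 'bind: address already in use' ./logs | wc -l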
*** Bug 1690168 has been marked as a duplicate of this bug. ***
*** Bug 1691089 has been marked as a duplicate of this bug. ***
Fixes are in https://github.com/openshift/origin/pull/22398 and https://github.com/openshift/cluster-kube-controller-manager-operator/pull/197
In a fresh install, I see a restart count:

kube-controller-manager-ip-172-31-138-177.ap-southeast-1.compute.internal   1/1   Running   3   68m

oc edit on each static pod shows:

...
containerStatuses:
...
    message: |
      s.go:33] FLAG: --service-cluster-ip-range="172.30.0.0/16"
      ...
      I0401 08:39:02.584416 1 flags.go:33] FLAG: --resource-quota-sync-period="5m0s"
      failed to create listener: failed to listen on 0.0.0.0:10252: listen tcp 0.0.0.0:10252: bind: address already in use
    reason: Error
...

Per Slack confirmation with Maciej and Stefan's https://jira.coreos.com/browse/MSTR-357, assigning back.
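Restart counts across all the static pods can also be listed in one shot rather than editing each pod. A sketch (the field paths are standard pod status fields):

$ # Show the restart count of the first container in every kube-controller-manager pod
$ oc get pods -n openshift-kube-controller-manager \
    -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount'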
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/801
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/805
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/806
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/808
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/809
all note pods/kube-controller-manager crashlooping on "failed to create listener: failed to listen on 0.0.0.0:10252: listen tcp 0.0.0.0:10252: bind: address already in use"
Per the build-cop report, this is failing almost all upgrades; raising severity.
The openshift-kube-scheduler pods also have the same error: failed to create listener: failed to listen on 0.0.0.0:10251: listen tcp 0.0.0.0:10251: bind: address already in use
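To see which process is actually holding one of these ports on the node, a host-level check along these lines should work (a sketch; the node name is a placeholder, and it assumes ss is reachable via chroot /host from the debug pod):

$ # List listeners on the scheduler and controller-manager ports from the affected master
$ oc debug node/<master-node> -- chroot /host ss -tlnp | grep ':1025[127] '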
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/207
https://bugzilla.redhat.com/show_bug.cgi?id=1695179#c2
*** Bug 1695179 has been marked as a duplicate of this bug. ***
The above PR only landed in payload 4.0.0-0.nightly-2019-04-08-225815, the latest one as of now, which failed during environment installation (bug 1697814). Will check again once that is fixed.
I'm still seeing this in CI, e.g. [1]:

Apr 09 16:56:44.767 E clusteroperator/kube-controller-manager changed Failing to True: StaticPodsFailingError: StaticPodsFailing: nodes/ip-10-0-151-165.ec2.internal pods/kube-controller-manager-ip-10-0-151-165.ec2.internal container="kube-controller-manager-3" is not ready
StaticPodsFailing: nodes/ip-10-0-151-165.ec2.internal pods/kube-controller-manager-ip-10-0-151-165.ec2.internal container="kube-controller-manager-3" is terminated: "Error" - "me-plugin-dir="/etc/kubernetes/kubelet-plugins/volume/exec"
I0409 16:56:41.142795 1 flags.go:33] FLAG: --pv-recycler-increment-timeout-nfs="30"
I0409 16:56:41.142801 1 flags.go:33] FLAG: --pv-recycler-minimum-timeout-hostpath="60"
I0409 16:56:41.142809 1 flags.go:33] FLAG: --pv-recycler-minimum-timeout-nfs="300"
I0409 16:56:41.142816 1 flags.go:33] FLAG: --pv-recycler-pod-template-filepath-hostpath=""
I0409 16:56:41.142822 1 flags.go:33] FLAG: --pv-recycler-pod-template-filepath-nfs=""
I0409 16:56:41.142829 1 flags.go:33] FLAG: --pv-recycler-timeout-increment-hostpath="30"
I0409 16:56:41.142835 1 flags.go:33] FLAG: --pvclaimbinder-sync-period="15s"
I0409 16:56:41.142843 1 glog.go:58] FLAGSET: podgc controller
I0409 16:56:41.142850 1 flags.go:33] FLAG: --terminated-pod-gc-threshold="12500"
I0409 16:56:41.142857 1 glog.go:58] FLAGSET: misc
I0409 16:56:41.142874 1 flags.go:33] FLAG: --insecure-experimental-approve-all-kubelet-csrs-for-group=""
I0409 16:56:41.142879 1 flags.go:33] FLAG: --kubeconfig="/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig"
I0409 16:56:41.142885 1 flags.go:33] FLAG: --master=""
I0409 16:56:41.142889 1 flags.go:33] FLAG: --openshift-config="/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml"
I0409 16:56:41.142895 1 glog.go:58] FLAGSET: endpoint controller
I0409 16:56:41.142900 1 flags.go:33] FLAG: --concurrent-endpoint-syncs="5"
I0409 16:56:41.142904 1 glog.go:58] FLAGSET: namespace controller
I0409 16:56:41.142908 1 flags.go:33] FLAG: --concurrent-namespace-syncs="10"
I0409 16:56:41.142912 1 flags.go:33] FLAG: --namespace-sync-period="5m0s"
I0409 16:56:42.051147 1 serving.go:312] Generated self-signed cert (/var/run/kubernetes/kube-controller-manager.crt, /var/run/kubernetes/kube-controller-manager.key)
failed to create listener: failed to listen on 0.0.0.0:10257: listen tcp 0.0.0.0:10257: bind: address already in use"

That cluster includes kube-controller-manager-operator#207:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1564/pull-ci-openshift-installer-master-e2e-aws/5051/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "cluster-kube-controller-manager-operator").annotations'
{
  "io.openshift.build.commit.id": "f7af927a236fd0c7607966b948ce28f2c30f2c8c",
  "io.openshift.build.commit.ref": "master",
  "io.openshift.build.source-location": "https://github.com/openshift/cluster-kube-controller-manager-operator"
}
$ git log --first-parent --format='%ad %h %d %s' --date=iso -3 f7af927a236
2019-04-09 06:20:34 -0700 f7af927a (origin/release-4.0, origin/master, origin/HEAD) Merge pull request #212 from sjenning/use-default-node-controller-settings
2019-04-08 07:52:53 -0700 760122d7 Merge pull request #213 from sttts/sttts-authnz
2019-04-04 14:36:25 -0700 b80336d3 Merge pull request #207 from mfojtik/switch-to-secure-port

So was the fix broken, or partial, or has there been a subsequent regression?

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1564/pull-ci-openshift-installer-master-e2e-aws/5051
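Commit ancestry can also be checked directly, which is less error-prone than eyeballing git log output. A sketch, run from a clone of cluster-kube-controller-manager-operator, using the #207 merge commit and the payload commit shown above:

$ # Exit status 0 means the #207 merge commit is an ancestor of the payload's operator commit
$ git merge-base --is-ancestor b80336d3 f7af927a236fd0c7607966b948ce28f2c30f2c8c \
    && echo 'PR #207 is in the payload'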
Created attachment 1553934 [details]
Occurrences of this error in CI from 2019-04-08T20:02 to 2019-04-09T19:58 UTC

This occurred in 45 of our 347 failures (12%) in *-e2e-aws* jobs across the whole CI system over the past 23 hours. Generated with [1]:

$ deck-build-log-plot 'listen tcp 0.0.0.0:10257: bind: address already in use'
45 listen tcp 0.0.0.0:10257: bind: address already in use
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/999 ci-op-mv2kbqps
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/998 ci-op-i0iif1nt
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/997 ci-op-m4cdk5s5
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/996 ci-op-tkh4n2yk
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/993 ci-op-b4n5ypht
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/1037 ci-op-j8613c9l
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/1036 ci-op-flggzz7s
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/1034 ci-op-pbpzpm3f
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/1031 ci-op-d77bkqbp
...

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
Trevor, those failures are from the scheduler; I don't see controller-manager errors anymore. Moving to QA. We can track the scheduler operator in a new bug assigned to the Pod team.
False alarm, I can still see it :-(
Yes, I can still see it in 4.0.0-0.nightly-2019-04-10-182914:

oc get po kube-controller-manager-ip-10-0-138-175.ap-northeast-2.compute.internal -n openshift-kube-controller-manager
NAME                                                                       READY   STATUS    RESTARTS   AGE
kube-controller-manager-ip-10-0-138-175.ap-northeast-2.compute.internal   1/1     Running   3          107m

oc get po kube-controller-manager-ip-10-0-138-175.ap-northeast-2.compute.internal -n openshift-kube-controller-manager -o yaml
  containerStatuses:
  ...
      message: |
        ...
        I0411 01:01:13.599713 1 serving.go:312] Generated self-signed cert (/var/run/kubernetes/kube-controller-manager.crt, /var/run/kubernetes/kube-controller-manager.key)
        failed to create listener: failed to listen on 0.0.0.0:10257: listen tcp 0.0.0.0:10257: bind: address already in use
      reason: Error
      startedAt: "2019-04-11T01:01:12Z"
PR: https://github.com/openshift/origin/pull/22543
https://github.com/openshift/origin/pull/22543 is in the merge queue and should resolve the problem for both KCM and KS.
Confirmed with CI payload 4.0.0-0.ci-2019-04-15-230244: could not see the issue. Will verify when a new nightly payload is ready.
Verified with payload 4.1.0-0.nightly-2019-04-18-210657: can't reproduce the issue.
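For reference, the verification can be scripted as one sweep over both affected namespaces. A sketch that only inspects each pod's last terminated message (namespace names as used in 4.x clusters):

# Grep the last terminated message of every pod in the two affected namespaces
for ns in openshift-kube-controller-manager openshift-kube-scheduler; do
  oc get pods -n "$ns" -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.message}{"\n"}{end}' \
    | grep 'bind: address already in use' || echo "$ns: no bind errors"
done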
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758