Bug 1691055
| Summary: | Bind: Address already in use for clusteroperator/kube-controller-manager | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | ewolinet |
| Component: | Master | Assignee: | Michal Fojtik <mfojtik> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.1.0 | CC: | aos-bugs, bparees, deads, jokerman, mfojtik, mmccomas, rvokal, wking, xxia, yinzhou |
| Target Milestone: | --- | | |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-04 10:46:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1697814 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
Description ewolinet 2019-03-20 18:23:52 UTC
Created attachment 1546781 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 53 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours. Generated with [1]:

$ deck-build-log-plot 'listen tcp 0\.0\.0\.0:10251: bind: address already in use'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

*** Bug 1690168 has been marked as a duplicate of this bug. ***

*** Bug 1691089 has been marked as a duplicate of this bug. ***

Fixes are in https://github.com/openshift/origin/pull/22398 and https://github.com/openshift/cluster-kube-controller-manager-operator/pull/197
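For anyone without the deck-build-log scripts handy, a rough local stand-in for the query quoted above might look like the following. This is purely illustrative: it assumes the CI `build-log.txt` files have already been downloaded under `./build-logs/`, which is not something this bug provides, and it is not the actual `deck-build-log-plot` implementation (that lives in the debug-scripts branch linked at [1]).

```
# Illustrative stand-in for the deck-build-log-plot query above: count how
# many locally downloaded CI build logs hit the same bind failure.
# Assumes build-log.txt files were fetched into ./build-logs/ beforehand.
# The character class covers the ports seen in this bug (10251/10252/10257).
grep -rlE 'listen tcp 0\.0\.0\.0:1025[127]: bind: address already in use' ./build-logs/ | wc -l
```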
In fresh install, I see restart count:

kube-controller-manager-ip-172-31-138-177.ap-southeast-1.compute.internal   1/1   Running   3   68m

`oc edit` each static pod; all show:

...
containerStatuses:
...
    message: |
      s.go:33] FLAG: --service-cluster-ip-range="172.30.0.0/16"
      ...
      I0401 08:39:02.584416 1 flags.go:33] FLAG: --resource-quota-sync-period="5m0s"
      failed to create listener: failed to listen on 0.0.0.0:10252: listen tcp 0.0.0.0:10252: bind: address already in use
    reason: Error
...
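When a pod is in this state, one quick way to see what is still holding the port is to check the listener directly on the affected master. This is only a sketch: the node name is taken from the restart output above and the port (10252 here, 10257 on newer payloads) from the error message, so both would need adjusting.

```
# Sketch: open a debug shell on the affected master (node name is an example
# from the output above) and list whatever still listens on the port from the
# error message (10252 here).
oc debug node/ip-172-31-138-177.ap-southeast-1.compute.internal -- \
  chroot /host ss -ltnp | grep ':10252'
```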
Per Slack confirmation with Maciej and Stefan's https://jira.coreos.com/browse/MSTR-357, assigning back.
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/801
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/805
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/806
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/808
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/809

all note pods/kube-controller-manager crashlooping on "failed to create listener: failed to listen on 0.0.0.0:10252: listen tcp 0.0.0.0:10252: bind: address already in use"

Per the build-cop report this is failing almost all upgrades, raising severity.

The openshift-kube-scheduler pods also have the same error:

failed to create listener: failed to listen on 0.0.0.0:10251: listen tcp 0.0.0.0:10251: bind: address already in use

*** Bug 1695179 has been marked as a duplicate of this bug. ***

The above PR only landed in payload 4.0.0-0.nightly-2019-04-08-225815, the latest one as of now, which failed in env installation (bug 1697814). Will check again once that is fixed.

I'm still seeing this in CI, e.g. [1]:

Apr 09 16:56:44.767 E clusteroperator/kube-controller-manager changed Failing to True: StaticPodsFailingError: StaticPodsFailing: nodes/ip-10-0-151-165.ec2.internal pods/kube-controller-manager-ip-10-0-151-165.ec2.internal container="kube-controller-manager-3" is not ready\nStaticPodsFailing: nodes/ip-10-0-151-165.ec2.internal pods/kube-controller-manager-ip-10-0-151-165.ec2.internal container="kube-controller-manager-3" is terminated: "Error" - "me-plugin-dir=\"/etc/kubernetes/kubelet-plugins/volume/exec\"\nI0409 16:56:41.142795 1 flags.go:33] FLAG: --pv-recycler-increment-timeout-nfs=\"30\"\nI0409 16:56:41.142801 1 flags.go:33] FLAG: --pv-recycler-minimum-timeout-hostpath=\"60\"\nI0409 16:56:41.142809 1 flags.go:33] FLAG: --pv-recycler-minimum-timeout-nfs=\"300\"\nI0409 16:56:41.142816 1 flags.go:33] FLAG: --pv-recycler-pod-template-filepath-hostpath=\"\"\nI0409 16:56:41.142822 1 flags.go:33] FLAG: --pv-recycler-pod-template-filepath-nfs=\"\"\nI0409 16:56:41.142829 1 flags.go:33] FLAG: --pv-recycler-timeout-increment-hostpath=\"30\"\nI0409 16:56:41.142835 1 flags.go:33] FLAG: --pvclaimbinder-sync-period=\"15s\"\nI0409 16:56:41.142843 1 glog.go:58] FLAGSET: podgc controller\nI0409 16:56:41.142850 1 flags.go:33] FLAG: --terminated-pod-gc-threshold=\"12500\"\nI0409 16:56:41.142857 1 glog.go:58] FLAGSET: misc\nI0409 16:56:41.142874 1 flags.go:33] FLAG: --insecure-experimental-approve-all-kubelet-csrs-for-group=\"\"\nI0409 16:56:41.142879 1 flags.go:33] FLAG: --kubeconfig=\"/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig\"\nI0409 16:56:41.142885 1 flags.go:33] FLAG: --master=\"\"\nI0409 16:56:41.142889 1 flags.go:33] FLAG: --openshift-config=\"/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml\"\nI0409 16:56:41.142895 1 glog.go:58] FLAGSET: endpoint controller\nI0409 16:56:41.142900 1 flags.go:33] FLAG: --concurrent-endpoint-syncs=\"5\"\nI0409 16:56:41.142904 1 glog.go:58] FLAGSET: namespace controller\nI0409 16:56:41.142908 1 flags.go:33] FLAG: --concurrent-namespace-syncs=\"10\"\nI0409 16:56:41.142912 1 flags.go:33] FLAG: --namespace-sync-period=\"5m0s\"\nI0409 16:56:42.051147 1 serving.go:312] Generated self-signed cert (/var/run/kubernetes/kube-controller-manager.crt, /var/run/kubernetes/kube-controller-manager.key)\nfailed to create listener: failed to listen on 0.0.0.0:10257: listen tcp 0.0.0.0:10257: bind: address already in use\n"

That cluster includes kube-controller-manager-operator#207:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1564/pull-ci-openshift-installer-master-e2e-aws/5051/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "cluster-kube-controller-manager-operator").annotations'
{
  "io.openshift.build.commit.id": "f7af927a236fd0c7607966b948ce28f2c30f2c8c",
  "io.openshift.build.commit.ref": "master",
  "io.openshift.build.source-location": "https://github.com/openshift/cluster-kube-controller-manager-operator"
}
$ git log --first-parent --format='%ad %h %d %s' --date=iso -3 f7af927a236
2019-04-09 06:20:34 -0700 f7af927a  (origin/release-4.0, origin/master, origin/HEAD) Merge pull request #212 from sjenning/use-default-node-controller-settings
2019-04-08 07:52:53 -0700 760122d7  Merge pull request #213 from sttts/sttts-authnz
2019-04-04 14:36:25 -0700 b80336d3  Merge pull request #207 from mfojtik/switch-to-secure-port

So was the fix broken, or partial, or has there been a subsequent regression?

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1564/pull-ci-openshift-installer-master-e2e-aws/5051

Created attachment 1553934 [details]
Occurrences of this error in CI from 2019-04-08T20:02 to 2019-04-09T19:58 UTC

This occurred in 45 of our 347 failures (12%) in *-e2e-aws* jobs across the whole CI system over the past 23 hours. Generated with [1]:

$ deck-build-log-plot 'listen tcp 0.0.0.0:10257: bind: address already in use'
45 listen tcp 0.0.0.0:10257: bind: address already in use
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/999 ci-op-mv2kbqps
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/998 ci-op-i0iif1nt
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/997 ci-op-m4cdk5s5
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/996 ci-op-tkh4n2yk
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/993 ci-op-b4n5ypht
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/1037 ci-op-j8613c9l
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/1036 ci-op-flggzz7s
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/1034 ci-op-pbpzpm3f
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/1031 ci-op-d77bkqbp
...

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

Trevor, those failures are from the scheduler; I don't see controller-manager errors anymore. Moving to QA. We can track the scheduler operator in a new bug assigned to the Pod team.

False alarm, I can still see it :-(

Yes, can still see in 4.0.0-0.nightly-2019-04-10-182914:
oc get po kube-controller-manager-ip-10-0-138-175.ap-northeast-2.compute.internal -n openshift-kube-controller-manager
NAME READY STATUS RESTARTS AGE
kube-controller-manager-ip-10-0-138-175.ap-northeast-2.compute.internal 1/1 Running 3 107m
oc get po kube-controller-manager-ip-10-0-138-175.ap-northeast-2.compute.internal -n openshift-kube-controller-manager -o yaml
containerStatuses:
...
message: |
...
I0411 01:01:13.599713 1 serving.go:312] Generated self-signed cert (/var/run/kubernetes/kube-controller-manager.crt, /var/run/kubernetes/kube-controller-manager.key)
failed to create listener: failed to listen on 0.0.0.0:10257: listen tcp 0.0.0.0:10257: bind: address already in use
reason: Error
startedAt: "2019-04-11T01:01:12Z"
*** Bug 1691089 has been marked as a duplicate of this bug. ***

https://github.com/openshift/origin/pull/22543 is in the merge queue and should remove the problem for both KCM and KS.

Confirmed with CI payload 4.0.0-0.ci-2019-04-15-230244: could not see the issue. Will verify when a new nightly payload is ready.

Verified with payload 4.1.0-0.nightly-2019-04-18-210657: can't reproduce the issue.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
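For completeness, a sketch of the kind of spot checks behind the verification above: confirm the cluster is on a fixed payload and that the controller-manager pods and operator are healthy. The exact conditions to look for are not spelled out in this bug, so the comments below are only a guide.

```
# Sketch of the verification flow described above.
oc get clusterversion                              # confirm the payload, e.g. 4.1.0-0.nightly-2019-04-18-210657
oc get pods -n openshift-kube-controller-manager   # RESTARTS should stay flat, no CrashLoopBackOff
oc get clusteroperator kube-controller-manager     # Available=True, not Failing/Degraded
```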