[build cop] Observing OCP test failures because the kube-controller-manager static pod cannot bind to 0.0.0.0:10252 (address already in use). Seems similar to:
https://bugzilla.redhat.com/show_bug.cgi?id=1678847
https://bugzilla.redhat.com/show_bug.cgi?id=1670727

Snippet:

I0320 14:43:27.406257 1 flags.go:33] FLAG: --use-service-account-credentials="false"
I0320 14:43:27.406263 1 flags.go:33] FLAG: --v="2"
I0320 14:43:27.406270 1 flags.go:33] FLAG: --version="false"
I0320 14:43:27.406280 1 flags.go:33] FLAG: --vmodule=""
I0320 14:43:27.880763 1 serving.go:296] Generated self-signed cert (/var/run/kubernetes/kube-controller-manager.crt, /var/run/kubernetes/kube-controller-manager.key)
failed to create listener: failed to listen on 0.0.0.0:10252: listen tcp 0.0.0.0:10252: bind: address already in use

Mar 20 14:43:55.876 E clusteroperator/kube-apiserver changed Failing to True: RevisionControllerFailing: RevisionControllerFailing: Post https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/configmaps: read tcp 10.130.0.6:37742->172.30.0.1:443: read: connection reset by peer

Test build ref: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5925
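For anyone triaging a similar failure, the crash message can be pulled straight from the pod status instead of scrolling through the full YAML. A minimal sketch (the pod name is a placeholder; the jsonpath fields are standard pod status fields):

$ # Print the last terminated message of the crashing static pod container
$ oc get pod <kube-controller-manager-pod> -n openshift-kube-controller-manager \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.message}'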
Created attachment 1546781 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 53 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours. Generated with [1]:

$ deck-build-log-plot 'listen tcp 0\\.0\\.0\\.0:10251: bind: address already in use'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
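For a rough cross-check without the plotting script, the same count can be approximated with plain grep, assuming the jobs' build-log.txt files have already been mirrored into a local directory (the ./logs path is hypothetical):

$ # Count how many downloaded CI build logs contain the bind error
$ grep -rl 'bind: address already in use' ./logs | wc -l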
*** Bug 1690168 has been marked as a duplicate of this bug. ***
*** Bug 1691089 has been marked as a duplicate of this bug. ***
Fixes are in https://github.com/openshift/origin/pull/22398 and https://github.com/openshift/cluster-kube-controller-manager-operator/pull/197
In a fresh install, I see a restart count:

kube-controller-manager-ip-172-31-138-177.ap-southeast-1.compute.internal   1/1   Running   3   68m

oc edit on each static pod shows:

...
containerStatuses:
...
    message: |
      s.go:33] FLAG: --service-cluster-ip-range="172.30.0.0/16"
      ...
      I0401 08:39:02.584416 1 flags.go:33] FLAG: --resource-quota-sync-period="5m0s"
      failed to create listener: failed to listen on 0.0.0.0:10252: listen tcp 0.0.0.0:10252: bind: address already in use
    reason: Error
...

Per Slack confirmation with Maciej and Stefan's https://jira.coreos.com/browse/MSTR-357, assigning back.
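Restart counts across all the static pods can also be listed in one shot rather than editing each pod. A sketch (the field paths are standard pod status fields):

$ # Show the restart count of the first container in every kube-controller-manager pod
$ oc get pods -n openshift-kube-controller-manager \
    -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount'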
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/801
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/805
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/806
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/808
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/809
all note pods/kube-controller-manager crashlooping on "failed to create listener: failed to listen on 0.0.0.0:10252: listen tcp 0.0.0.0:10252: bind: address already in use"
Per the build-cop report, this is failing almost all upgrades; raising severity.
The openshift-kube-scheduler pods also have the same error: failed to create listener: failed to listen on 0.0.0.0:10251: listen tcp 0.0.0.0:10251: bind: address already in use
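To see which process is actually holding one of these ports on the node, a host-level check along these lines should work (a sketch; the node name is a placeholder, and it assumes ss is reachable via chroot /host from the debug pod):

$ # List listeners on the scheduler and controller-manager ports from the affected master
$ oc debug node/<master-node> -- chroot /host ss -tlnp | grep ':1025[127] '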
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/207
https://bugzilla.redhat.com/show_bug.cgi?id=1695179#c2
*** Bug 1695179 has been marked as a duplicate of this bug. ***
The above PR only landed in payload 4.0.0-0.nightly-2019-04-08-225815, the latest one as of now, which failed during environment installation (bug 1697814). Will check again once that is fixed.
I'm still seeing this in CI, e.g. [1]:

Apr 09 16:56:44.767 E clusteroperator/kube-controller-manager changed Failing to True: StaticPodsFailingError: StaticPodsFailing: nodes/ip-10-0-151-165.ec2.internal pods/kube-controller-manager-ip-10-0-151-165.ec2.internal container="kube-controller-manager-3" is not ready
StaticPodsFailing: nodes/ip-10-0-151-165.ec2.internal pods/kube-controller-manager-ip-10-0-151-165.ec2.internal container="kube-controller-manager-3" is terminated: "Error" - "me-plugin-dir="/etc/kubernetes/kubelet-plugins/volume/exec"
I0409 16:56:41.142795 1 flags.go:33] FLAG: --pv-recycler-increment-timeout-nfs="30"
I0409 16:56:41.142801 1 flags.go:33] FLAG: --pv-recycler-minimum-timeout-hostpath="60"
I0409 16:56:41.142809 1 flags.go:33] FLAG: --pv-recycler-minimum-timeout-nfs="300"
I0409 16:56:41.142816 1 flags.go:33] FLAG: --pv-recycler-pod-template-filepath-hostpath=""
I0409 16:56:41.142822 1 flags.go:33] FLAG: --pv-recycler-pod-template-filepath-nfs=""
I0409 16:56:41.142829 1 flags.go:33] FLAG: --pv-recycler-timeout-increment-hostpath="30"
I0409 16:56:41.142835 1 flags.go:33] FLAG: --pvclaimbinder-sync-period="15s"
I0409 16:56:41.142843 1 glog.go:58] FLAGSET: podgc controller
I0409 16:56:41.142850 1 flags.go:33] FLAG: --terminated-pod-gc-threshold="12500"
I0409 16:56:41.142857 1 glog.go:58] FLAGSET: misc
I0409 16:56:41.142874 1 flags.go:33] FLAG: --insecure-experimental-approve-all-kubelet-csrs-for-group=""
I0409 16:56:41.142879 1 flags.go:33] FLAG: --kubeconfig="/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig"
I0409 16:56:41.142885 1 flags.go:33] FLAG: --master=""
I0409 16:56:41.142889 1 flags.go:33] FLAG: --openshift-config="/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml"
I0409 16:56:41.142895 1 glog.go:58] FLAGSET: endpoint controller
I0409 16:56:41.142900 1 flags.go:33] FLAG: --concurrent-endpoint-syncs="5"
I0409 16:56:41.142904 1 glog.go:58] FLAGSET: namespace controller
I0409 16:56:41.142908 1 flags.go:33] FLAG: --concurrent-namespace-syncs="10"
I0409 16:56:41.142912 1 flags.go:33] FLAG: --namespace-sync-period="5m0s"
I0409 16:56:42.051147 1 serving.go:312] Generated self-signed cert (/var/run/kubernetes/kube-controller-manager.crt, /var/run/kubernetes/kube-controller-manager.key)
failed to create listener: failed to listen on 0.0.0.0:10257: listen tcp 0.0.0.0:10257: bind: address already in use"

That cluster includes kube-controller-manager-operator#207:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1564/pull-ci-openshift-installer-master-e2e-aws/5051/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "cluster-kube-controller-manager-operator").annotations'
{
  "io.openshift.build.commit.id": "f7af927a236fd0c7607966b948ce28f2c30f2c8c",
  "io.openshift.build.commit.ref": "master",
  "io.openshift.build.source-location": "https://github.com/openshift/cluster-kube-controller-manager-operator"
}
$ git log --first-parent --format='%ad %h %d %s' --date=iso -3 f7af927a236
2019-04-09 06:20:34 -0700 f7af927a (origin/release-4.0, origin/master, origin/HEAD) Merge pull request #212 from sjenning/use-default-node-controller-settings
2019-04-08 07:52:53 -0700 760122d7 Merge pull request #213 from sttts/sttts-authnz
2019-04-04 14:36:25 -0700 b80336d3 Merge pull request #207 from mfojtik/switch-to-secure-port

So was the fix broken, or partial, or has there been a subsequent regression?

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1564/pull-ci-openshift-installer-master-e2e-aws/5051
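Commit ancestry can also be checked directly, which is less error-prone than eyeballing git log output. A sketch, run from a clone of cluster-kube-controller-manager-operator, using the #207 merge commit and the payload commit shown above:

$ # Exit status 0 means the #207 merge commit is an ancestor of the payload's operator commit
$ git merge-base --is-ancestor b80336d3 f7af927a236fd0c7607966b948ce28f2c30f2c8c \
    && echo 'PR #207 is in the payload'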
Created attachment 1553934 [details]
Occurrences of this error in CI from 2019-04-08T20:02 to 2019-04-09T19:58 UTC

This occurred in 45 of our 347 failures (12%) in *-e2e-aws* jobs across the whole CI system over the past 23 hours. Generated with [1]:

$ deck-build-log-plot 'listen tcp 0.0.0.0:10257: bind: address already in use'
45 listen tcp 0.0.0.0:10257: bind: address already in use
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/999 ci-op-mv2kbqps
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/998 ci-op-i0iif1nt
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/997 ci-op-m4cdk5s5
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/996 ci-op-tkh4n2yk
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/993 ci-op-b4n5ypht
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/1037 ci-op-j8613c9l
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/1036 ci-op-flggzz7s
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/1034 ci-op-pbpzpm3f
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/1031 ci-op-d77bkqbp
...

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
Trevor, those failures are from the scheduler; I don't see controller-manager errors anymore. Moving to QA. We can track the scheduler operator in a new bug assigned to the Pod team.
False alarm, I can still see it :-(
Yes, I can still see it in 4.0.0-0.nightly-2019-04-10-182914:

oc get po kube-controller-manager-ip-10-0-138-175.ap-northeast-2.compute.internal -n openshift-kube-controller-manager
NAME                                                                       READY   STATUS    RESTARTS   AGE
kube-controller-manager-ip-10-0-138-175.ap-northeast-2.compute.internal   1/1     Running   3          107m

oc get po kube-controller-manager-ip-10-0-138-175.ap-northeast-2.compute.internal -n openshift-kube-controller-manager -o yaml
  containerStatuses:
  ...
      message: |
        ...
        I0411 01:01:13.599713 1 serving.go:312] Generated self-signed cert (/var/run/kubernetes/kube-controller-manager.crt, /var/run/kubernetes/kube-controller-manager.key)
        failed to create listener: failed to listen on 0.0.0.0:10257: listen tcp 0.0.0.0:10257: bind: address already in use
      reason: Error
      startedAt: "2019-04-11T01:01:12Z"
PR: https://github.com/openshift/origin/pull/22543
https://github.com/openshift/origin/pull/22543 is in the merge queue and should resolve the problem for both KCM and KS.
Confirmed with CI payload 4.0.0-0.ci-2019-04-15-230244: could not see the issue. Will verify when a new nightly payload is ready.
Verified with payload 4.1.0-0.nightly-2019-04-18-210657: can't reproduce the issue.
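For reference, the verification can be scripted as one sweep over both affected namespaces. A sketch that only inspects each pod's last terminated message (namespace names as used in 4.x clusters):

# Grep the last terminated message of every pod in the two affected namespaces
for ns in openshift-kube-controller-manager openshift-kube-scheduler; do
  oc get pods -n "$ns" -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.message}{"\n"}{end}' \
    | grep 'bind: address already in use' || echo "$ns: no bind errors"
done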
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758