Bug 1691055
| Summary: | Bind: Address already in use for clusteroperator/kube-controller-manager | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | ewolinet |
| Component: | Master | Assignee: | Michal Fojtik <mfojtik> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.1.0 | CC: | aos-bugs, bparees, deads, jokerman, mfojtik, mmccomas, rvokal, wking, xxia, yinzhou |
| Target Milestone: | --- | | |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-04 10:46:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1697814 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
Description ewolinet 2019-03-20 18:23:52 UTC
Created attachment 1546781 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 53 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours. Generated with [1]:

$ deck-build-log-plot 'listen tcp 0\.0\.0\.0:10251: bind: address already in use'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

*** Bug 1690168 has been marked as a duplicate of this bug. ***

*** Bug 1691089 has been marked as a duplicate of this bug. ***

Fixes are in https://github.com/openshift/origin/pull/22398 and https://github.com/openshift/cluster-kube-controller-manager-operator/pull/197
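For anyone without the deck-build-log scripts handy, a rough local stand-in for the query quoted above might look like the following. This is purely illustrative: it assumes the CI `build-log.txt` files have already been downloaded under `./build-logs/`, which is not something this bug provides, and it is not the actual `deck-build-log-plot` implementation (that lives in the debug-scripts branch linked at [1]).

```
# Illustrative stand-in for the deck-build-log-plot query above: count how
# many locally downloaded CI build logs hit the same bind failure.
# Assumes build-log.txt files were fetched into ./build-logs/ beforehand.
# The character class covers the ports seen in this bug (10251/10252/10257).
grep -rlE 'listen tcp 0\.0\.0\.0:1025[127]: bind: address already in use' ./build-logs/ | wc -l
```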
In fresh install, I see restart count:

kube-controller-manager-ip-172-31-138-177.ap-southeast-1.compute.internal   1/1   Running   3   68m

`oc edit` each static pod; all show:

...
containerStatuses:
...
    message: |
      s.go:33] FLAG: --service-cluster-ip-range="172.30.0.0/16"
      ...
      I0401 08:39:02.584416 1 flags.go:33] FLAG: --resource-quota-sync-period="5m0s"
      failed to create listener: failed to listen on 0.0.0.0:10252: listen tcp 0.0.0.0:10252: bind: address already in use
    reason: Error
...
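When a pod is in this state, one quick way to see what is still holding the port is to check the listener directly on the affected master. This is only a sketch: the node name is taken from the restart output above and the port (10252 here, 10257 on newer payloads) from the error message, so both would need adjusting.

```
# Sketch: open a debug shell on the affected master (node name is an example
# from the output above) and list whatever still listens on the port from the
# error message (10252 here).
oc debug node/ip-172-31-138-177.ap-southeast-1.compute.internal -- \
  chroot /host ss -ltnp | grep ':10252'
```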
Per Slack confirmation with Maciej and Stefan's https://jira.coreos.com/browse/MSTR-357, assigning back.
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/801
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/805
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/806
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/808
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/809

all note pods/kube-controller-manager crashlooping on "failed to create listener: failed to listen on 0.0.0.0:10252: listen tcp 0.0.0.0:10252: bind: address already in use"

Per the build-cop report this is failing almost all upgrades, raising severity.

The openshift-kube-scheduler pods also have the same error:

failed to create listener: failed to listen on 0.0.0.0:10251: listen tcp 0.0.0.0:10251: bind: address already in use

*** Bug 1695179 has been marked as a duplicate of this bug. ***

The above PR only landed in payload 4.0.0-0.nightly-2019-04-08-225815, the latest one as of now, which failed in env installation (bug 1697814). Will check again once that is fixed.

I'm still seeing this in CI, e.g. [1]:

Apr 09 16:56:44.767 E clusteroperator/kube-controller-manager changed Failing to True: StaticPodsFailingError: StaticPodsFailing: nodes/ip-10-0-151-165.ec2.internal pods/kube-controller-manager-ip-10-0-151-165.ec2.internal container="kube-controller-manager-3" is not ready\nStaticPodsFailing: nodes/ip-10-0-151-165.ec2.internal pods/kube-controller-manager-ip-10-0-151-165.ec2.internal container="kube-controller-manager-3" is terminated: "Error" - "me-plugin-dir=\"/etc/kubernetes/kubelet-plugins/volume/exec\"\nI0409 16:56:41.142795 1 flags.go:33] FLAG: --pv-recycler-increment-timeout-nfs=\"30\"\nI0409 16:56:41.142801 1 flags.go:33] FLAG: --pv-recycler-minimum-timeout-hostpath=\"60\"\nI0409 16:56:41.142809 1 flags.go:33] FLAG: --pv-recycler-minimum-timeout-nfs=\"300\"\nI0409 16:56:41.142816 1 flags.go:33] FLAG: --pv-recycler-pod-template-filepath-hostpath=\"\"\nI0409 16:56:41.142822 1 flags.go:33] FLAG: --pv-recycler-pod-template-filepath-nfs=\"\"\nI0409 16:56:41.142829 1 flags.go:33] FLAG: --pv-recycler-timeout-increment-hostpath=\"30\"\nI0409 16:56:41.142835 1 flags.go:33] FLAG: --pvclaimbinder-sync-period=\"15s\"\nI0409 16:56:41.142843 1 glog.go:58] FLAGSET: podgc controller\nI0409 16:56:41.142850 1 flags.go:33] FLAG: --terminated-pod-gc-threshold=\"12500\"\nI0409 16:56:41.142857 1 glog.go:58] FLAGSET: misc\nI0409 16:56:41.142874 1 flags.go:33] FLAG: --insecure-experimental-approve-all-kubelet-csrs-for-group=\"\"\nI0409 16:56:41.142879 1 flags.go:33] FLAG: --kubeconfig=\"/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig\"\nI0409 16:56:41.142885 1 flags.go:33] FLAG: --master=\"\"\nI0409 16:56:41.142889 1 flags.go:33] FLAG: --openshift-config=\"/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml\"\nI0409 16:56:41.142895 1 glog.go:58] FLAGSET: endpoint controller\nI0409 16:56:41.142900 1 flags.go:33] FLAG: --concurrent-endpoint-syncs=\"5\"\nI0409 16:56:41.142904 1 glog.go:58] FLAGSET: namespace controller\nI0409 16:56:41.142908 1 flags.go:33] FLAG: --concurrent-namespace-syncs=\"10\"\nI0409 16:56:41.142912 1 flags.go:33] FLAG: --namespace-sync-period=\"5m0s\"\nI0409 16:56:42.051147 1 serving.go:312] Generated self-signed cert (/var/run/kubernetes/kube-controller-manager.crt, /var/run/kubernetes/kube-controller-manager.key)\nfailed to create listener: failed to listen on 0.0.0.0:10257: listen tcp 0.0.0.0:10257: bind: address already in use\n"

That cluster includes kube-controller-manager-operator#207:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1564/pull-ci-openshift-installer-master-e2e-aws/5051/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "cluster-kube-controller-manager-operator").annotations'
{
  "io.openshift.build.commit.id": "f7af927a236fd0c7607966b948ce28f2c30f2c8c",
  "io.openshift.build.commit.ref": "master",
  "io.openshift.build.source-location": "https://github.com/openshift/cluster-kube-controller-manager-operator"
}
$ git log --first-parent --format='%ad %h %d %s' --date=iso -3 f7af927a236
2019-04-09 06:20:34 -0700 f7af927a  (origin/release-4.0, origin/master, origin/HEAD) Merge pull request #212 from sjenning/use-default-node-controller-settings
2019-04-08 07:52:53 -0700 760122d7  Merge pull request #213 from sttts/sttts-authnz
2019-04-04 14:36:25 -0700 b80336d3  Merge pull request #207 from mfojtik/switch-to-secure-port

So was the fix broken, or partial, or has there been a subsequent regression?

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1564/pull-ci-openshift-installer-master-e2e-aws/5051

Created attachment 1553934 [details]
Occurrences of this error in CI from 2019-04-08T20:02 to 2019-04-09T19:58 UTC

This occurred in 45 of our 347 failures (12%) in *-e2e-aws* jobs across the whole CI system over the past 23 hours. Generated with [1]:

$ deck-build-log-plot 'listen tcp 0.0.0.0:10257: bind: address already in use'
45 listen tcp 0.0.0.0:10257: bind: address already in use
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/999 ci-op-mv2kbqps
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/998 ci-op-i0iif1nt
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/997 ci-op-m4cdk5s5
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/996 ci-op-tkh4n2yk
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/993 ci-op-b4n5ypht
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/1037 ci-op-j8613c9l
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/1036 ci-op-flggzz7s
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/1034 ci-op-pbpzpm3f
  1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/1031 ci-op-d77bkqbp
...

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

Trevor, those failures are from the scheduler; I don't see controller-manager errors anymore. Moving to QA. We can track the scheduler operator in a new bug assigned to the Pod team.

False alarm, I can still see it :-(

Yes, can still see in 4.0.0-0.nightly-2019-04-10-182914:
oc get po kube-controller-manager-ip-10-0-138-175.ap-northeast-2.compute.internal -n openshift-kube-controller-manager
NAME READY STATUS RESTARTS AGE
kube-controller-manager-ip-10-0-138-175.ap-northeast-2.compute.internal 1/1 Running 3 107m
oc get po kube-controller-manager-ip-10-0-138-175.ap-northeast-2.compute.internal -n openshift-kube-controller-manager -o yaml
containerStatuses:
...
message: |
...
I0411 01:01:13.599713 1 serving.go:312] Generated self-signed cert (/var/run/kubernetes/kube-controller-manager.crt, /var/run/kubernetes/kube-controller-manager.key)
failed to create listener: failed to listen on 0.0.0.0:10257: listen tcp 0.0.0.0:10257: bind: address already in use
reason: Error
startedAt: "2019-04-11T01:01:12Z"
*** Bug 1691089 has been marked as a duplicate of this bug. ***

https://github.com/openshift/origin/pull/22543 is in the merge queue and should remove the problem for both KCM and KS.

Confirmed with CI payload 4.0.0-0.ci-2019-04-15-230244: could not see the issue. Will verify when a new nightly payload is ready.

Verified with payload 4.1.0-0.nightly-2019-04-18-210657: can't reproduce the issue.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
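For completeness, a sketch of the kind of spot checks behind the verification above: confirm the cluster is on a fixed payload and that the controller-manager pods and operator are healthy. The exact conditions to look for are not spelled out in this bug, so the comments below are only a guide.

```
# Sketch of the verification flow described above.
oc get clusterversion                              # confirm the payload, e.g. 4.1.0-0.nightly-2019-04-18-210657
oc get pods -n openshift-kube-controller-manager   # RESTARTS should stay flat, no CrashLoopBackOff
oc get clusteroperator kube-controller-manager     # Available=True, not Failing/Degraded
```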