Bug 1698672
| Summary: | kube-controller-manager crashlooping: cannot get resource "configmaps" in API group "" | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Master | Assignee: | David Eads <deads> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.1.0 | CC: | aos-bugs, danw, deads, jokerman, mfojtik, mifiedle, mkarg, mmccomas, nagrawal, vlaad, wking, xxia |
| Target Milestone: | --- | Keywords: | BetaBlocker, TestBlocker |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-04 10:47:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1698950, 1700504 | | |
| Bug Blocks: | 1694226 | | |
| Attachments: | | | |
Description W. Trevor King 2019-04-10 23:44:23 UTC
Created attachment 1554367 [details]
Occurrences of this error in CI from 2019-04-09T23:53 to 2019-04-10T23:39 UTC

This occurred in 39 of our 468 failures (8%) in *-e2e-aws* jobs across the whole CI system over the past 23 hours. Generated with [1]:

```
$ deck-build-log-plot 'restarting failed container=kube-controller-manager'
39 restarting failed container=kube-controller-manager
2 https://github.com/openshift/origin/pull/22450 ci-op-16d8t700
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/9 ci-op-98ldn7y1
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/8 ci-op-058jydv3
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/7 ci-op-325vpl5h
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6 ci-op-8d4y588j
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4 ci-op-klfftt4z
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2 ci-op-vvpnjln3
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/13 ci-op-vrp1m58s
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11 ci-op-pfvkwr5g
...
```

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

Moving this over to auth because this is caused by:

```
configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
```

No idea why the user can't get configmaps, but this looks awfully similar to the Unauthorized errors we saw earlier this week when pods were not able to get self-references to report events. I suspect there is some new bug in the RBAC controller, maybe?

Created attachment 1554887 [details]
Occurrences of this error in CI from 2019-04-12T04:31 to 2019-04-13T03:44 UTC

This occurred in 19 of our 726 failures (2%) in *-e2e-aws* jobs across the whole CI system over the past 23 hours. Generated with [1]:

```
$ deck-build-log-plot 'clusteroperator/kube-controller-manager changed Failing to True.*cannot get resource.*configmaps.*in API group'
'clusteroperator/kube-controller-manager changed Failing to True.*cannot get resource.*configmaps.*in API group'
```

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

19 is not all that many, but as you can see in the attached plot, it also shows up in the pods.json entries of 296 jobs (40% of failures), although there it apparently gets resolved before the operator throws in the towel.
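The forbidden message quoted above is an RBAC denial on the extension-apiserver-authentication configmap in kube-system. As a rough, hypothetical spot-check (standard oc/kubectl usage, not taken from this bug), one could ask the API server whether the kube-controller-manager user currently holds that permission; the second command only illustrates the rolebinding the warning message suggests, adapted for a user rather than a service account, with a made-up binding name. In a healthy 4.1 cluster the platform is expected to manage this binding itself, so this is diagnosis, not the actual fix tracked here:

```
# Does the failing user have the permission the log complains about? Prints yes/no.
$ oc auth can-i get configmaps -n kube-system --as=system:kube-controller-manager

# Illustration only: the binding the warning suggests, granted to the
# system:kube-controller-manager user. The binding name is hypothetical.
$ oc create rolebinding kcm-extension-apiserver-authentication-reader \
    -n kube-system \
    --role=extension-apiserver-authentication-reader \
    --user=system:kube-controller-manager
```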
For example, [1] doesn't surface much about kube-controller-manager in the build log:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1590/pull-ci-openshift-installer-master-e2e-aws/5285/build-log.txt | grep kube-controller-manager
Apr 13 02:08:57.483 I ns/openshift-kube-controller-manager-operator deployment/kube-controller-manager-operator Changed loglevel level to "2" (91 times)
Apr 13 02:18:57.472 I ns/openshift-kube-controller-manager-operator deployment/kube-controller-manager-operator Changed loglevel level to "2" (92 times)
```

but it is reporting this in pods.json:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1590/pull-ci-openshift-installer-master-e2e-aws/5285/artifacts/e2e-aws/pods.json | jq -r '[.items[] | (.status.containerStatuses // [])[] | .restartCount = (.restartCount | tonumber)] | sort_by(-.restartCount)[] | select(.restartCount > 0) | .previousMessage = (.lastState.terminated.message // (.lastState.terminated.exitCode | tostring) // "?") | (.restartCount | tostring) + "\t" + .name + "\t" + .previousMessage'
3	kube-controller-manager-5	ue,LocalStorageCapacityIsolation=false,RotateKubeletServerCertificate=true,SupportPodPidsLimit=true" ... I0413 01:53:24.877669 1 flags.go:33] FLAG: --disable-attach-detach-reconcile-sync="false" W0413 01:53:30.656803 1 authentication.go:272] Unable to get configmap/extension-apiserver-authentication in kube-system. Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA' configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system" ...
2	kube-controller-manager-5	--authorization-always-allow-paths="[/healthz]" ... I0413 01:49:12.528798 1 flags.go:33] FLAG: --service-account-private-key-file="/etc/kubernetes/static-pod-resources/secrets/service-account-private-key/service-account.key" W0413 01:49:15.896633 1 authentication.go:272] Unable to get configmap/extension-apiserver-authentication in kube-system. Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA' configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system" ...
1	cluster-version-operator	0
1	console	3 01:48:26 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found 2019/04/13 01:48:36 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found ...
1	console	3 01:48:26 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found 2019/04/13 01:48:36 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found ...
1	sdn	I0413 01:35:11.474413 2270 cmd.go:230] Overriding kubernetes api to https://api.ci-op-0gty3y99-1d3f3.origin-ci-int-aws.dev.rhcloud.com:6443 I0413 01:35:11.474518 2270 cmd.go:133] Reading node configuration from /config/sdn-config.yaml W0413 01:35:11.477492 2270 server.go:198] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP. I0413 01:35:11.477610 2270 feature_gate.go:206] feature gates: &{map[]} I0413 01:35:11.477728 2270 cmd.go:256] Watching config file /config/sdn-config.yaml for changes I0413 01:35:11.477846 2270 cmd.go:256] Watching config file /config/..2019_04_13_01_35_02.247920586/sdn-config.yaml for changes I0413 01:35:11.480663 2270 node.go:148] Initializing SDN node of type "redhat/openshift-ovs-networkpolicy" with configured hostname "ip-10-0-169-215.ec2.internal" (IP ""), iptables sync period "30s" I0413 01:35:11.522827 2270 cmd.go:197] Starting node networking (v4.0.0-alpha.0+6f8c841-2022-dirty) I0413 01:35:11.522862 2270 node.go:267] Starting openshift-sdn network plugin F0413 01:35:11.568828 2270 cmd.go:114] Failed to start sdn: failed to validate network configuration: master has not created a default cluster network, network plugin "redhat/openshift-ovs-networkpolicy" can not start
```

That also contains OAuth 404s like we saw in bug 1699469, although cluster-config-operator#43 landed at 2019-04-12T19:16Z [2], this cluster started at 2019-04-13T01:11Z [1], and:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1590/pull-ci-openshift-installer-master-e2e-aws/5291/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "cluster-config-operator").annotations'
{
  "io.openshift.build.commit.id": "8d0878a7418dc6c514f87ec9c46444d600a55b70",
  "io.openshift.build.commit.ref": "master",
  "io.openshift.build.source-location": "https://github.com/openshift/cluster-config-operator"
}
```

has the fix [3].

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1590/pull-ci-openshift-installer-master-e2e-aws/5285
[2]: https://github.com/openshift/cluster-config-operator/pull/43#event-2273345198
[3]: https://github.com/openshift/cluster-config-operator/commit/8d0878a7418dc6c514f87ec9c46444d600a55b70

There is a whole lot of overlap between the ~300 jobs whose pods.json mentions:

kube-controller-manager.*cannot get resource.*configmaps.*in API group

and those mentioning:

master has not created a default cluster network

(as seen above for the SDN pod). Looping in Dan and David from bug 1698950, since these may be the same thing.

The single SDN crash/restart early on is apparently expected (or at least acceptable), per bug 1674384, so maybe a red herring on that front. I'm not sure if there's a fundamental difference between the clusters that get three kube-controller-manager restarts (maybe ok, maybe not?) and those that get five (which trips our "Managed cluster should have no crashlooping pods in core namespaces over two minutes" test).

*** Bug 1698950 has been marked as a duplicate of this bug. ***

We think this will be fixed by https://github.com/openshift/cluster-kube-controller-manager-operator/pull/224.

@vlaad I see no reason to believe these are related.

Created attachment 1555379 [details]
kube-controller-manager-operator#224 has fixed this in CI :)
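As an aside on the overlap noted above between the configmaps denial and the SDN "master has not created a default cluster network" message: for a single job it can be spot-checked without the deck-build-log tooling. A minimal sketch, reusing the pods.json artifact URL quoted earlier; the ARTIFACTS variable and the jq/grep pipeline are illustrative, not part of the original triage:

```
$ ARTIFACTS=https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1590/pull-ci-openshift-installer-master-e2e-aws/5285/artifacts/e2e-aws

# Decode every JSON string in pods.json, then count lines mentioning each symptom;
# a job that matches both patterns is part of the overlap described above.
$ curl -s "${ARTIFACTS}/pods.json" | jq -r '.. | strings' \
    | grep -c 'cannot get resource "configmaps" in API group'
$ curl -s "${ARTIFACTS}/pods.json" | jq -r '.. | strings' \
    | grep -c 'master has not created a default cluster network'
```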
Can't reproduce the issue with latest payload: 4.1.0-0.nightly-2019-04-18-210657, so will verify it.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
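For reference, a minimal sketch of how a verification pass like the one above could be scripted; RELEASE_IMAGE is a placeholder for the nightly payload pullspec, and the commands are standard oc usage rather than anything recorded in this bug:

```
# Confirm which cluster-kube-controller-manager-operator commit the payload carries
# (it should include the fix from pull 224).
$ oc adm release info --commits "${RELEASE_IMAGE}" | grep cluster-kube-controller-manager-operator

# On a cluster installed from that payload, confirm the kube-controller-manager
# pods are no longer accumulating restarts.
$ oc get pods -n openshift-kube-controller-manager \
    -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount
```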