Description of problem: Bunch of these in CI today. For example: Apr 10 22:52:50.864 E clusteroperator/kube-controller-manager changed Failing to True: StaticPodsFailingError: StaticPodsFailing: nodes/ip-10-0-160-159.ec2.internal pods/kube-controller-manager-ip-10-0-160-159.ec2.internal container="kube-controller-manager-5" is not ready\nStaticPodsFailing: nodes/ip-10-0-160-159.ec2.internal pods/kube-controller-manager-ip-10-0-160-159.ec2.internal container="kube-controller-manager-5" is waiting: "CrashLoopBackOff" - "Back-off 10s restarting failed container=kube-controller-manager-5 pod=kube-controller-manager-ip-10-0-160-159.ec2.internal_openshift-kube-controller-manager(c9b1abb0dbe03675f63cfe347c0f22db)" Lots of issues in that cluster though: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6639/artifacts/e2e-aws/pods.json | jq -r '.items[] | select(.status.phase != "Running" and .status.phase != "Succeeded") | .status.phase + " " + .metadata.name + " " + ([.status.containerStatuses[].state.waiting.message][0])' Failed installer-1-ip-10-0-147-170.ec2.internal $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6639/artifacts/e2e-aws/pods.json | jq -r '[.items[] | .status.containerStatuses[] | .restartCount = (.restartCount | tonumber)] | sort_by(-.restartCount)[] | select(.restartCount > 0) | .previousMessage = (.lastState.terminated.message // (.lastState.terminated.exitCode | tostring) // "?") | (.restartCount | tostring) + " " + .name + " " + .previousMessage' 5 kube-controller-manager-5 ces/configmaps/config/config.yaml" I0410 22:49:10.920193 1 glog.go:58] FLAGSET: endpoint controller I0410 22:49:10.920200 1 flags.go:33] FLAG: --concurrent-endpoint-syncs="5" I0410 22:49:10.920207 1 glog.go:58] FLAGSET: nodelifecycle controller I0410 22:49:10.920216 1 flags.go:33] FLAG: --enable-taint-manager="true" I0410 22:49:10.920222 1 flags.go:33] FLAG: --large-cluster-size-threshold="50" I0410 22:49:10.920229 1 flags.go:33] FLAG: --node-eviction-rate="0.1" I0410 22:49:10.920236 1 flags.go:33] FLAG: --node-monitor-grace-period="40s" I0410 22:49:10.920243 1 flags.go:33] FLAG: --node-startup-grace-period="1m0s" I0410 22:49:10.920250 1 flags.go:33] FLAG: --pod-eviction-timeout="5m0s" I0410 22:49:10.920257 1 flags.go:33] FLAG: --secondary-node-eviction-rate="0.01" I0410 22:49:10.920264 1 flags.go:33] FLAG: --unhealthy-zone-threshold="0.55" I0410 22:49:10.920272 1 glog.go:58] FLAGSET: serviceaccount controller I0410 22:49:10.920279 1 flags.go:33] FLAG: --concurrent-serviceaccount-token-syncs="5" I0410 22:49:10.920286 1 flags.go:33] FLAG: --root-ca-file="/etc/kubernetes/static-pod-resources/configmaps/serviceaccount-ca/ca-bundle.crt" I0410 22:49:10.920295 1 flags.go:33] FLAG: --service-account-private-key-file="/etc/kubernetes/static-pod-resources/secrets/service-account-private-key/service-account.key" I0410 22:49:11.969969 1 serving.go:312] Generated self-signed cert (/var/run/kubernetes/kube-controller-manager.crt, /var/run/kubernetes/kube-controller-manager.key) W0410 22:49:16.621781 1 authentication.go:272] Unable to get configmap/extension-apiserver-authentication in kube-system. 
Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA' configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system" 3 kube-controller-manager-5 belet-plugins/volume/exec" I0410 22:51:36.596105 1 flags.go:33] FLAG: --pv-recycler-increment-timeout-nfs="30" I0410 22:51:36.596112 1 flags.go:33] FLAG: --pv-recycler-minimum-timeout-hostpath="60" I0410 22:51:36.596119 1 flags.go:33] FLAG: --pv-recycler-minimum-timeout-nfs="300" I0410 22:51:36.596145 1 flags.go:33] FLAG: --pv-recycler-pod-template-filepath-hostpath="" I0410 22:51:36.596152 1 flags.go:33] FLAG: --pv-recycler-pod-template-filepath-nfs="" I0410 22:51:36.596158 1 flags.go:33] FLAG: --pv-recycler-timeout-increment-hostpath="30" I0410 22:51:36.596165 1 flags.go:33] FLAG: --pvclaimbinder-sync-period="15s" I0410 22:51:36.596173 1 glog.go:58] FLAGSET: podgc controller I0410 22:51:36.596180 1 flags.go:33] FLAG: --terminated-pod-gc-threshold="12500" I0410 22:51:36.596188 1 glog.go:58] FLAGSET: resourcequota controller I0410 22:51:36.596195 1 flags.go:33] FLAG: --concurrent-resource-quota-syncs="5" I0410 22:51:36.596202 1 flags.go:33] FLAG: --resource-quota-sync-period="5m0s" I0410 22:51:36.596210 1 glog.go:58] FLAGSET: ttl-after-finished controller I0410 22:51:36.596217 1 flags.go:33] FLAG: --concurrent-ttl-after-finished-syncs="5" I0410 22:51:36.596224 1 glog.go:58] FLAGSET: misc I0410 22:51:36.596232 1 flags.go:33] FLAG: --insecure-experimental-approve-all-kubelet-csrs-for-group="" I0410 22:51:36.596239 1 flags.go:33] FLAG: --kubeconfig="/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig" I0410 22:51:36.596249 1 flags.go:33] FLAG: --master="" I0410 22:51:36.596255 1 flags.go:33] FLAG: --openshift-config="/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml" I0410 22:51:37.519611 1 serving.go:312] Generated self-signed cert (/var/run/kubernetes/kube-controller-manager.crt, /var/run/kubernetes/kube-controller-manager.key) failed to create listener: failed to listen on 0.0.0.0:10257: listen tcp 0.0.0.0:10257: bind: address already in use 3 kube-controller-manager-5 ,-bootstrapsigner,-tokencleaner]" I0410 22:53:08.221996 1 flags.go:33] FLAG: --external-cloud-volume-plugin="" I0410 22:53:08.222000 1 flags.go:33] FLAG: --feature-gates="ExperimentalCriticalPodAnnotation=true,LocalStorageCapacityIsolation=false,RotateKubeletServerCertificate=true,SupportPodPidsLimit=true" I0410 22:53:08.222021 1 flags.go:33] FLAG: --kube-api-burst="300" I0410 22:53:08.222026 1 flags.go:33] FLAG: --kube-api-content-type="application/vnd.kubernetes.protobuf" I0410 22:53:08.222031 1 flags.go:33] FLAG: --kube-api-qps="150" I0410 22:53:08.222036 1 flags.go:33] FLAG: --leader-elect="true" I0410 22:53:08.222040 1 flags.go:33] FLAG: --leader-elect-lease-duration="15s" I0410 22:53:08.222044 1 flags.go:33] FLAG: --leader-elect-renew-deadline="10s" I0410 22:53:08.222048 1 flags.go:33] FLAG: --leader-elect-resource-lock="configmaps" I0410 22:53:08.222053 1 flags.go:33] FLAG: --leader-elect-retry-period="3s" I0410 22:53:08.222057 1 flags.go:33] FLAG: --min-resync-period="12h0m0s" I0410 22:53:08.222061 1 flags.go:33] FLAG: --node-monitor-period="5s" I0410 22:53:08.222065 1 flags.go:33] FLAG: --node-sync-period="0s" I0410 22:53:08.222069 1 flags.go:33] FLAG: --route-reconciliation-period="10s" I0410 22:53:08.222073 
1 flags.go:33] FLAG: --use-service-account-credentials="true" I0410 22:53:08.222078 1 glog.go:58] FLAGSET: insecure serving I0410 22:53:08.222082 1 flags.go:33] FLAG: --address="0.0.0.0" I0410 22:53:08.222087 1 flags.go:33] FLAG: --port="0" I0410 22:53:08.222091 1 glog.go:58] FLAGSET: ttl-after-finished controller I0410 22:53:08.222095 1 flags.go:33] FLAG: --concurrent-ttl-after-finished-syncs="5" I0410 22:53:08.923528 1 serving.go:312] Generated self-signed cert (/var/run/kubernetes/kube-controller-manager.crt, /var/run/kubernetes/kube-controller-manager.key) failed to create listener: failed to listen on 0.0.0.0:10257: listen tcp 0.0.0.0:10257: bind: address already in use 3 scheduler I0410 22:47:15.590634 1 flags.go:33] FLAG: --requestheader-group-headers="[x-remote-group]" I0410 22:47:15.590639 1 flags.go:33] FLAG: --requestheader-username-headers="[x-remote-user]" I0410 22:47:15.590646 1 flags.go:33] FLAG: --scheduler-name="default-scheduler" I0410 22:47:15.590650 1 flags.go:33] FLAG: --secure-port="10259" I0410 22:47:15.590654 1 flags.go:33] FLAG: --skip-headers="false" I0410 22:47:15.590658 1 flags.go:33] FLAG: --stderrthreshold="2" I0410 22:47:15.590662 1 flags.go:33] FLAG: --storage-driver-buffer-duration="1m0s" I0410 22:47:15.590666 1 flags.go:33] FLAG: --storage-driver-db="cadvisor" I0410 22:47:15.590670 1 flags.go:33] FLAG: --storage-driver-host="localhost:8086" I0410 22:47:15.590674 1 flags.go:33] FLAG: --storage-driver-password="root" I0410 22:47:15.590678 1 flags.go:33] FLAG: --storage-driver-secure="false" I0410 22:47:15.590682 1 flags.go:33] FLAG: --storage-driver-table="stats" I0410 22:47:15.590686 1 flags.go:33] FLAG: --storage-driver-user="root" I0410 22:47:15.590690 1 flags.go:33] FLAG: --tls-cert-file="" I0410 22:47:15.590694 1 flags.go:33] FLAG: --tls-cipher-suites="[]" I0410 22:47:15.590698 1 flags.go:33] FLAG: --tls-min-version="" I0410 22:47:15.590702 1 flags.go:33] FLAG: --tls-private-key-file="" I0410 22:47:15.590706 1 flags.go:33] FLAG: --tls-sni-cert-key="[]" I0410 22:47:15.590712 1 flags.go:33] FLAG: --use-legacy-policy-config="false" I0410 22:47:15.590716 1 flags.go:33] FLAG: --v="2" I0410 22:47:15.590720 1 flags.go:33] FLAG: --version="false" I0410 22:47:15.590726 1 flags.go:33] FLAG: --vmodule="" I0410 22:47:15.590730 1 flags.go:33] FLAG: --write-config-to="" I0410 22:47:16.335085 1 serving.go:312] Generated self-signed cert (/var/run/kubernetes/kube-scheduler.crt, /var/run/kubernetes/kube-scheduler.key) failed to create listener: failed to listen on 0.0.0.0:10251: listen tcp 0.0.0.0:10251: bind: address already in use 3 scheduler I0410 22:47:48.307527 1 flags.go:33] FLAG: --requestheader-group-headers="[x-remote-group]" I0410 22:47:48.307536 1 flags.go:33] FLAG: --requestheader-username-headers="[x-remote-user]" I0410 22:47:48.307548 1 flags.go:33] FLAG: --scheduler-name="default-scheduler" I0410 22:47:48.307555 1 flags.go:33] FLAG: --secure-port="10259" I0410 22:47:48.307562 1 flags.go:33] FLAG: --skip-headers="false" I0410 22:47:48.307570 1 flags.go:33] FLAG: --stderrthreshold="2" I0410 22:47:48.307576 1 flags.go:33] FLAG: --storage-driver-buffer-duration="1m0s" I0410 22:47:48.307584 1 flags.go:33] FLAG: --storage-driver-db="cadvisor" I0410 22:47:48.307591 1 flags.go:33] FLAG: --storage-driver-host="localhost:8086" I0410 22:47:48.307598 1 flags.go:33] FLAG: --storage-driver-password="root" I0410 22:47:48.307604 1 flags.go:33] FLAG: --storage-driver-secure="false" I0410 22:47:48.307611 1 flags.go:33] FLAG: --storage-driver-table="stats" I0410 
22:47:48.307618 1 flags.go:33] FLAG: --storage-driver-user="root" I0410 22:47:48.307625 1 flags.go:33] FLAG: --tls-cert-file="" I0410 22:47:48.307631 1 flags.go:33] FLAG: --tls-cipher-suites="[]" I0410 22:47:48.307640 1 flags.go:33] FLAG: --tls-min-version="" I0410 22:47:48.307646 1 flags.go:33] FLAG: --tls-private-key-file="" I0410 22:47:48.307653 1 flags.go:33] FLAG: --tls-sni-cert-key="[]" I0410 22:47:48.307663 1 flags.go:33] FLAG: --use-legacy-policy-config="false" I0410 22:47:48.307670 1 flags.go:33] FLAG: --v="2" I0410 22:47:48.307676 1 flags.go:33] FLAG: --version="false" I0410 22:47:48.307687 1 flags.go:33] FLAG: --vmodule="" I0410 22:47:48.307694 1 flags.go:33] FLAG: --write-config-to="" I0410 22:47:49.130245 1 serving.go:312] Generated self-signed cert (/var/run/kubernetes/kube-scheduler.crt, /var/run/kubernetes/kube-scheduler.key) failed to create listener: failed to listen on 0.0.0.0:10251: listen tcp 0.0.0.0:10251: bind: address already in use 2 console 2 2 console 2 2 network-operator 2019/04/10 22:37:46 Go Version: go1.10.8 2019/04/10 22:37:46 Go OS/Arch: linux/amd64 2019/04/10 22:37:46 operator-sdk Version: v0.4.1 2019/04/10 22:37:46 overriding kubernetes api to https://api.ci-op-1mzys9cv-5a633.origin-ci-int-aws.dev.rhcloud.com:6443 2019/04/10 22:37:46 Get https://api.ci-op-1mzys9cv-5a633.origin-ci-int-aws.dev.rhcloud.com:6443/api?timeout=32s: dial tcp 10.0.146.57:6443: connect: connection refused 1 operator 255 1 operator 255 1 operator ", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MissingVersion' no image found for operand pod I0410 22:39:58.146886 1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"40ec51e3-5be1-11e9-86d4-12a17bbd683e", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ConfigMapUpdated' Updated ConfigMap/revision-status-1 -n openshift-kube-apiserver: cause by changes in data.status I0410 22:40:00.145621 1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"40ec51e3-5be1-11e9-86d4-12a17bbd683e", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MissingVersion' no image found for operand pod I0410 22:40:14.992323 1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.crt (current: "3dfb8a49ce2bf6a428549f07b24b9062d309cf0faa54a440008567912b6cdde0", lastKnown: "") W0410 22:40:14.992750 1 builder.go:107] Restart triggered because of file /var/run/secrets/serving-cert/tls.crt was created I0410 22:40:14.992820 1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.key (current: "a94256ff7fb509379617976b4b679f27c99d1fa1243d1e72ec629f97b49af38a", lastKnown: "") F0410 22:40:14.992847 1 leaderelection.go:65] leaderelection lost I0410 22:40:15.017377 1 node_controller.go:133] Shutting down NodeController I0410 22:40:15.017451 1 status_controller.go:194] Shutting down StatusSyncer-kube-apiserver I0410 22:40:15.017477 1 monitoring_resource_controller.go:163] Shutting down MonitoringResourceController I0410 22:40:15.017489 1 unsupportedconfigoverrides_controller.go:164] Shutting down UnsupportedConfigOverridesController I0410 22:40:15.017517 1 targetconfigcontroller.go:332] Shutting down TargetConfigController F0410 22:40:15.017562 1 builder.go:242] stopped 1 operator 22:39:59.931772 1 monitoring_resource_controller.go:183] key failed with : the server 
could not find the requested resource I0410 22:39:59.932051 1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"40f3eb39-5be1-11e9-86d4-12a17bbd683e", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'ServiceMonitorCreateFailed' Failed to create ServiceMonitor.monitoring.coreos.com/v1: the server could not find the requested resource I0410 22:40:00.101758 1 request.go:530] Throttling request took 1.196718245s, request: GET:https://172.30.0.1:443/api/v1/namespaces/default/endpoints/kubernetes E0410 22:40:00.105365 1 satokensigner_controller.go:356] key failed with : unexpected addresses: 10.0.11.72 I0410 22:40:00.105742 1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"40f3eb39-5be1-11e9-86d4-12a17bbd683e", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SATokenSignerControllerStuck' unexpected addresses: 10.0.11.72 I0410 22:40:00.301864 1 request.go:530] Throttling request took 1.196471403s, request: DELETE:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/cloud-config I0410 22:40:00.453636 1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.crt (current: "3a84e84ac280cd1f968b4be36daff4481a69779621e8ae30aae5e666e642b6a9", lastKnown: "") W0410 22:40:00.453669 1 builder.go:107] Restart triggered because of file /var/run/secrets/serving-cert/tls.crt was created I0410 22:40:00.453734 1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.key (current: "bcc4f6e5250b21110b4f5c52dc1aa6acd21ce978813ba825ff8bba2860bb0544", lastKnown: "") F0410 22:40:00.453800 1 leaderelection.go:65] leaderelection lost 1 kube-scheduler-operator-container :443/api/v1/namespaces/openshift-kube-scheduler/serviceaccounts/openshift-kube-scheduler-sa I0410 22:39:19.170757 1 target_config_reconciler_v311_00.go:98] Error while listing configmap scheduler.config.openshift.io "cluster" not found I0410 22:39:19.170775 1 target_config_reconciler_v311_00.go:132] Error while getting scheduler type scheduler.config.openshift.io "cluster" not found and using default algorithm provider in kubernetes scheduler I0410 22:39:19.356506 1 request.go:530] Throttling request took 185.584562ms, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/configmaps/config I0410 22:39:19.556483 1 request.go:530] Throttling request took 152.757062ms, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/configmaps/serviceaccount-ca I0410 22:39:19.756540 1 request.go:530] Throttling request took 196.412339ms, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/configmaps/kube-scheduler-pod I0410 22:39:19.956518 1 request.go:530] Throttling request took 314.984401ms, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-config-managed/secrets/kube-scheduler-client-cert-key I0410 22:39:20.156496 1 request.go:530] Throttling request took 195.478373ms, request: DELETE:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/secrets/kube-scheduler-client-cert-key I0410 22:40:15.599067 1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.crt (current: "f216654f4f257725538388c95b3c650b1863affb1fbb0179bde9ab94394f11e0", lastKnown: "") W0410 22:40:15.599466 1 
builder.go:107] Restart triggered because of file /var/run/secrets/serving-cert/tls.crt was created I0410 22:40:15.599516 1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.key (current: "05d5ed64720d6a2e8a2807dfae0f614a99ed70be6556a0d8ec5b877426276097", lastKnown: "") F0410 22:40:15.599554 1 leaderelection.go:65] leaderelection lost 1 scheduler I0410 22:47:45.689339 1 flags.go:33] FLAG: --requestheader-group-headers="[x-remote-group]" I0410 22:47:45.689348 1 flags.go:33] FLAG: --requestheader-username-headers="[x-remote-user]" I0410 22:47:45.689359 1 flags.go:33] FLAG: --scheduler-name="default-scheduler" I0410 22:47:45.689366 1 flags.go:33] FLAG: --secure-port="10259" I0410 22:47:45.689373 1 flags.go:33] FLAG: --skip-headers="false" I0410 22:47:45.689380 1 flags.go:33] FLAG: --stderrthreshold="2" I0410 22:47:45.689386 1 flags.go:33] FLAG: --storage-driver-buffer-duration="1m0s" I0410 22:47:45.689393 1 flags.go:33] FLAG: --storage-driver-db="cadvisor" I0410 22:47:45.689400 1 flags.go:33] FLAG: --storage-driver-host="localhost:8086" I0410 22:47:45.689407 1 flags.go:33] FLAG: --storage-driver-password="root" I0410 22:47:45.689413 1 flags.go:33] FLAG: --storage-driver-secure="false" I0410 22:47:45.689447 1 flags.go:33] FLAG: --storage-driver-table="stats" I0410 22:47:45.689455 1 flags.go:33] FLAG: --storage-driver-user="root" I0410 22:47:45.689461 1 flags.go:33] FLAG: --tls-cert-file="" I0410 22:47:45.689467 1 flags.go:33] FLAG: --tls-cipher-suites="[]" I0410 22:47:45.689475 1 flags.go:33] FLAG: --tls-min-version="" I0410 22:47:45.689481 1 flags.go:33] FLAG: --tls-private-key-file="" I0410 22:47:45.689488 1 flags.go:33] FLAG: --tls-sni-cert-key="[]" I0410 22:47:45.689497 1 flags.go:33] FLAG: --use-legacy-policy-config="false" I0410 22:47:45.689504 1 flags.go:33] FLAG: --v="2" I0410 22:47:45.689510 1 flags.go:33] FLAG: --version="false" I0410 22:47:45.689520 1 flags.go:33] FLAG: --vmodule="" I0410 22:47:45.689527 1 flags.go:33] FLAG: --write-config-to="" I0410 22:47:46.107552 1 serving.go:312] Generated self-signed cert (/var/run/kubernetes/kube-scheduler.crt, /var/run/kubernetes/kube-scheduler.key) failed to create listener: failed to listen on 0.0.0.0:10251: listen tcp 0.0.0.0:10251: bind: address already in use 1 prometheus 1 1 prometheus 1 1 sdn 48] Associate netid 8225777 to namespace "e2e-tests-proxy-7hdpq" with mcEnabled false I0410 23:13:57.979666 1369 service.go:319] Adding new service port "e2e-tests-proxy-7hdpq/proxy-service-wjmj2:portname1" at 172.30.128.99:80/TCP I0410 23:13:57.979706 1369 service.go:319] Adding new service port "e2e-tests-proxy-7hdpq/proxy-service-wjmj2:portname2" at 172.30.128.99:81/TCP I0410 23:13:57.979724 1369 service.go:319] Adding new service port "e2e-tests-proxy-7hdpq/proxy-service-wjmj2:tlsportname1" at 172.30.128.99:443/TCP I0410 23:13:57.979740 1369 service.go:319] Adding new service port "e2e-tests-proxy-7hdpq/proxy-service-wjmj2:tlsportname2" at 172.30.128.99:444/TCP I0410 23:13:58.603590 1369 vnids.go:148] Associate netid 10022278 to namespace "e2e-tests-volumes-r7twb" with mcEnabled false I0410 23:13:59.139600 1369 vnids.go:162] Dissociate netid 2106653 from namespace "e2e-test-unprivileged-router-t9k9w" I0410 23:13:59.840380 1369 vnids.go:148] Associate netid 11305619 to namespace "e2e-tests-volumes-lbsn8" with mcEnabled false I0410 23:14:00.414157 1369 roundrobin.go:310] LoadBalancerRR: Setting endpoints for e2e-tests-statefulset-c22cn/test:http to [10.128.3.19:80 10.129.3.31:80 10.131.1.79:80] I0410 
23:14:00.414195 1369 roundrobin.go:240] Delete endpoint 10.131.1.79:80 for service "e2e-tests-statefulset-c22cn/test:http" I0410 23:14:00.414492 1369 vnids.go:162] Dissociate netid 335978 from namespace "e2e-tests-volumes-qlk64" I0410 23:14:00.419836 1369 roundrobin.go:310] LoadBalancerRR: Setting endpoints for e2e-tests-statefulset-c22cn/test:http to [10.129.3.31:80 10.131.1.79:80] I0410 23:14:00.419864 1369 roundrobin.go:240] Delete endpoint 10.128.3.19:80 for service "e2e-tests-statefulset-c22cn/test:http" I0410 23:14:00.615436 1369 ovs.go:169] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Broken pipe) F0410 23:14:00.615465 1369 healthcheck.go:96] SDN healthcheck detected unhealthy OVS server, restarting: plugin is not setup 1 sdn I0410 22:38:19.819972 2279 cmd.go:230] Overriding kubernetes api to https://api.ci-op-1mzys9cv-5a633.origin-ci-int-aws.dev.rhcloud.com:6443 I0410 22:38:19.820041 2279 cmd.go:133] Reading node configuration from /config/sdn-config.yaml W0410 22:38:19.822639 2279 server.go:198] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP. I0410 22:38:19.822776 2279 feature_gate.go:206] feature gates: &{map[]} I0410 22:38:19.822914 2279 cmd.go:256] Watching config file /config/sdn-config.yaml for changes I0410 22:38:19.823019 2279 cmd.go:256] Watching config file /config/..2019_04_10_22_38_13.135639008/sdn-config.yaml for changes I0410 22:38:19.825149 2279 node.go:148] Initializing SDN node of type "redhat/openshift-ovs-networkpolicy" with configured hostname "ip-10-0-160-159.ec2.internal" (IP ""), iptables sync period "30s" I0410 22:38:19.852074 2279 cmd.go:197] Starting node networking (v4.0.0-alpha.0+2108314-1987-dirty) I0410 22:38:19.852108 2279 node.go:267] Starting openshift-sdn network plugin F0410 22:38:19.890043 2279 cmd.go:114] Failed to start sdn: failed to validate network configuration: master has not created a default cluster network, network plugin "redhat/openshift-ovs-networkpolicy" can not start So my guess is that this is a symptom of some more fundamental issue, but I thought I'd file a tracker so we can look up that symptom -> issue later ;). [1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6639
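For repeating this triage against another job, here is a minimal sketch that wraps the same pods.json query used above into one command. The ARTIFACTS default is just the example job from this report; swap in whichever job you are looking at.

```
#!/usr/bin/env bash
# Triage sketch: list restarted containers and their last termination message
# from a CI job's pods.json artifact. The default URL is the example job above;
# pass a different artifacts URL as the first argument.
ARTIFACTS="${1:-https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6639/artifacts/e2e-aws}"

curl -s "${ARTIFACTS}/pods.json" | jq -r '
  [.items[] | (.status.containerStatuses // [])[] | .restartCount = (.restartCount | tonumber)]
  | sort_by(-.restartCount)[]
  | select(.restartCount > 0)
  | .previousMessage = (.lastState.terminated.message // (.lastState.terminated.exitCode | tostring) // "?")
  | (.restartCount | tostring) + "\t" + .name + "\t" + .previousMessage'
```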
Created attachment 1554367 [details]
Occurrences of this error in CI from 2019-04-09T23:53 to 2019-04-10T23:39 UTC

This occurred in 39 of our 468 failures (8%) in *-e2e-aws* jobs across the whole CI system over the past 23 hours. Generated with [1]:

$ deck-build-log-plot 'restarting failed container=kube-controller-manager'
39 restarting failed container=kube-controller-manager
2 https://github.com/openshift/origin/pull/22450 ci-op-16d8t700
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/9 ci-op-98ldn7y1
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/8 ci-op-058jydv3
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/7 ci-op-325vpl5h
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6 ci-op-8d4y588j
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4 ci-op-klfftt4z
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2 ci-op-vvpnjln3
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/13 ci-op-vrp1m58s
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11 ci-op-pfvkwr5g
...

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
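deck-build-log-plot is the helper from [1]; if you only have a local mirror of build logs, a rough stand-in could look like the sketch below. The ci-logs/<job>/build-log.txt layout is an assumption about a local mirror, not how the helper itself works.

```
# Rough stand-in for counting affected jobs over locally mirrored CI logs.
# Assumes one build-log.txt per job under ./ci-logs/<job>/build-log.txt.
grep -rl 'restarting failed container=kube-controller-manager' \
  ci-logs --include=build-log.txt | wc -l
```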
Moving this over to auth because this is caused by:

```
configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
```

No idea why the user can't get configmaps, but this looks awfully similar to the Unauthorized errors we saw earlier this week when pods were not able to get self-references to report events. I suspect there may be a new bug in the RBAC controller.
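On a live cluster, a quick way to confirm whether the grant is actually missing is sketched below. The role/binding names are the upstream kube-system defaults, not something confirmed from this job's artifacts.

```
# Does the controller-manager user currently have the permission the error complains about?
kubectl auth can-i get configmaps/extension-apiserver-authentication \
  -n kube-system --as=system:kube-controller-manager

# And are the default reader role and its bindings present in kube-system?
kubectl get role extension-apiserver-authentication-reader -n kube-system
kubectl get rolebinding -n kube-system -o wide | grep extension-apiserver-authentication
```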
Created attachment 1554887 [details]
Occurrences of this error in CI from 2019-04-12T04:31 to 2019-04-13T03:44 UTC

This occurred in 19 of our 726 failures (2%) in *-e2e-aws* jobs across the whole CI system over the past 23 hours. Generated with [1]:

$ deck-build-log-plot 'clusteroperator/kube-controller-manager changed Failing to True.*cannot get resource.*configmaps.*in API group'
'clusteroperator/kube-controller-manager changed Failing to True.*cannot get resource.*configmaps.*in API group'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
19 is not all that many, but as you can see in the attached plot, it also occurs in 296 (40% of failures) jobs in pods.json entries (but apparently gets resolved before the operator throws in the towel). For example, [1] doesn't surface much about kube-controller-manager in the build log: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1590/pull-ci-openshift-installer-master-e2e-aws/5285/build-log.txt | grep kube-controller-manager Apr 13 02:08:57.483 I ns/openshift-kube-controller-manager-operator deployment/kube-controller-manager-operator Changed loglevel level to "2" (91 times) Apr 13 02:18:57.472 I ns/openshift-kube-controller-manager-operator deployment/kube-controller-manager-operator Changed loglevel level to "2" (92 times) but it is reporting this in pods.json: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1590/pull-ci-openshift-installer-master-e2e-aws/5285/artifacts/e2e-aws/pods.json | jq -r '[.items[] | (.status.containerStatuses // [])[] | .restartCount = (.restartCount | tonumber)] | sort_by(-.restartCount)[] | select(.restartCount > 0) | .previousMessage = (.lastState.terminated.message // (.lastState.terminated.exitCode | tostring) // "?") | (.restartCount | tostring) + "\t" + .name + "\t" + .previousMessage' 3 kube-controller-manager-5 ue,LocalStorageCapacityIsolation=false,RotateKubeletServerCertificate=true,SupportPodPidsLimit=true" ... I0413 01:53:24.877669 1 flags.go:33] FLAG: --disable-attach-detach-reconcile-sync="false" W0413 01:53:30.656803 1 authentication.go:272] Unable to get configmap/extension-apiserver-authentication in kube-system. Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA' configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system" ... 2 kube-controller-manager-5 --authorization-always-allow-paths="[/healthz]" ... I0413 01:49:12.528798 1 flags.go:33] FLAG: --service-account-private-key-file="/etc/kubernetes/static-pod-resources/secrets/service-account-private-key/service-account.key" W0413 01:49:15.896633 1 authentication.go:272] Unable to get configmap/extension-apiserver-authentication in kube-system. Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA' configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system" ... 1 cluster-version-operator 0 1 console 3 01:48:26 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found 2019/04/13 01:48:36 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found ... 1 console 3 01:48:26 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found 2019/04/13 01:48:36 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found ... 
1 sdn I0413 01:35:11.474413 2270 cmd.go:230] Overriding kubernetes api to https://api.ci-op-0gty3y99-1d3f3.origin-ci-int-aws.dev.rhcloud.com:6443 I0413 01:35:11.474518 2270 cmd.go:133] Reading node configuration from /config/sdn-config.yaml W0413 01:35:11.477492 2270 server.go:198] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP. I0413 01:35:11.477610 2270 feature_gate.go:206] feature gates: &{map[]} I0413 01:35:11.477728 2270 cmd.go:256] Watching config file /config/sdn-config.yaml for changes I0413 01:35:11.477846 2270 cmd.go:256] Watching config file /config/..2019_04_13_01_35_02.247920586/sdn-config.yaml for changes I0413 01:35:11.480663 2270 node.go:148] Initializing SDN node of type "redhat/openshift-ovs-networkpolicy" with configured hostname "ip-10-0-169-215.ec2.internal" (IP ""), iptables sync period "30s" I0413 01:35:11.522827 2270 cmd.go:197] Starting node networking (v4.0.0-alpha.0+6f8c841-2022-dirty) I0413 01:35:11.522862 2270 node.go:267] Starting openshift-sdn network plugin F0413 01:35:11.568828 2270 cmd.go:114] Failed to start sdn: failed to validate network configuration: master has not created a default cluster network, network plugin "redhat/openshift-ovs-networkpolicy" can not start That also contains OAuth 404s like we saw in bug 1699469, although cluster-config-operator#43 landed at 2019-04-12T19:16Z [2], and this cluster started at 2019-04-13T01:11Z [1] and: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1590/pull-ci-openshift-installer-master-e2e-aws/5291/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "cluster-config-operator").annotations' { "io.openshift.build.commit.id": "8d0878a7418dc6c514f87ec9c46444d600a55b70", "io.openshift.build.commit.ref": "master", "io.openshift.build.source-location": "https://github.com/openshift/cluster-config-operator" } has the fix [3]. [1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1590/pull-ci-openshift-installer-master-e2e-aws/5285 [2]: https://github.com/openshift/cluster-config-operator/pull/43#event-2273345198 [3]: https://github.com/openshift/cluster-config-operator/commit/8d0878a7418dc6c514f87ec9c46444d600a55b70
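For future occurrences of the "does this payload already contain the fix?" question, the same image-references check can also be run against the release image itself. A sketch, assuming an oc that supports `oc adm release info --commits` and that the registry/tag below (purely illustrative) is reachable:

```
# Show the source repo and commit for one component of a release payload.
# RELEASE_IMAGE is an assumption; substitute the payload under investigation.
RELEASE_IMAGE=registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-04-18-210657

oc adm release info --commits "${RELEASE_IMAGE}" | grep cluster-config-operator
```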
There is a whole lot of overlap between the ~300 jobs whose pods.json mentions:

kube-controller-manager.*cannot get resource.*configmaps.*in API group

and those mentioning:

master has not created a default cluster network

(as seen above for the SDN pod). Looping in Dan and David from bug 1698950, since these may be the same thing.
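For what it's worth, here is a sketch of how such an overlap can be checked. The ci-artifacts/<job>/pods.json layout is an assumption about locally mirrored artifacts, not part of the CI system itself.

```
# Compare the set of jobs whose pods.json mentions the RBAC error with the set
# whose pods.json mentions the SDN "no default cluster network" error.
# Assumes artifacts mirrored locally as ci-artifacts/<job>/pods.json.
grep -rlE 'kube-controller-manager.*cannot get resource.*configmaps.*in API group' \
  ci-artifacts --include=pods.json | sort > /tmp/rbac-jobs
grep -rl 'master has not created a default cluster network' \
  ci-artifacts --include=pods.json | sort > /tmp/sdn-jobs
comm -12 /tmp/rbac-jobs /tmp/sdn-jobs | wc -l   # jobs hitting both
```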
The single SDN crash/restart early on is apparently expected (or at least acceptable), per bug 1674384, so that may be a red herring. I'm not sure whether there's a fundamental difference between the clusters that get three kube-controller-manager restarts (maybe ok, maybe not?) and those that get five, which trips our "Managed cluster should have no crashlooping pods in core namespaces over two minutes" test.
*** Bug 1698950 has been marked as a duplicate of this bug. ***
We think this will be fixed by https://github.com/openshift/cluster-kube-controller-manager-operator/pull/224.

@vlaad, I see no reason to believe these are related.
Created attachment 1555379 [details]
kube-controller-manager-operator#224 has fixed this in CI :)
Can't reproduce the issue with the latest payload (4.1.0-0.nightly-2019-04-18-210657), so will verify it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758