Description of problem: Bunch of these in CI today. For example: Apr 10 22:52:50.864 E clusteroperator/kube-controller-manager changed Failing to True: StaticPodsFailingError: StaticPodsFailing: nodes/ip-10-0-160-159.ec2.internal pods/kube-controller-manager-ip-10-0-160-159.ec2.internal container="kube-controller-manager-5" is not ready\nStaticPodsFailing: nodes/ip-10-0-160-159.ec2.internal pods/kube-controller-manager-ip-10-0-160-159.ec2.internal container="kube-controller-manager-5" is waiting: "CrashLoopBackOff" - "Back-off 10s restarting failed container=kube-controller-manager-5 pod=kube-controller-manager-ip-10-0-160-159.ec2.internal_openshift-kube-controller-manager(c9b1abb0dbe03675f63cfe347c0f22db)" Lots of issues in that cluster though: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6639/artifacts/e2e-aws/pods.json | jq -r '.items[] | select(.status.phase != "Running" and .status.phase != "Succeeded") | .status.phase + " " + .metadata.name + " " + ([.status.containerStatuses[].state.waiting.message][0])' Failed installer-1-ip-10-0-147-170.ec2.internal $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6639/artifacts/e2e-aws/pods.json | jq -r '[.items[] | .status.containerStatuses[] | .restartCount = (.restartCount | tonumber)] | sort_by(-.restartCount)[] | select(.restartCount > 0) | .previousMessage = (.lastState.terminated.message // (.lastState.terminated.exitCode | tostring) // "?") | (.restartCount | tostring) + " " + .name + " " + .previousMessage' 5 kube-controller-manager-5 ces/configmaps/config/config.yaml" I0410 22:49:10.920193 1 glog.go:58] FLAGSET: endpoint controller I0410 22:49:10.920200 1 flags.go:33] FLAG: --concurrent-endpoint-syncs="5" I0410 22:49:10.920207 1 glog.go:58] FLAGSET: nodelifecycle controller I0410 22:49:10.920216 1 flags.go:33] FLAG: --enable-taint-manager="true" I0410 22:49:10.920222 1 flags.go:33] FLAG: --large-cluster-size-threshold="50" I0410 22:49:10.920229 1 flags.go:33] FLAG: --node-eviction-rate="0.1" I0410 22:49:10.920236 1 flags.go:33] FLAG: --node-monitor-grace-period="40s" I0410 22:49:10.920243 1 flags.go:33] FLAG: --node-startup-grace-period="1m0s" I0410 22:49:10.920250 1 flags.go:33] FLAG: --pod-eviction-timeout="5m0s" I0410 22:49:10.920257 1 flags.go:33] FLAG: --secondary-node-eviction-rate="0.01" I0410 22:49:10.920264 1 flags.go:33] FLAG: --unhealthy-zone-threshold="0.55" I0410 22:49:10.920272 1 glog.go:58] FLAGSET: serviceaccount controller I0410 22:49:10.920279 1 flags.go:33] FLAG: --concurrent-serviceaccount-token-syncs="5" I0410 22:49:10.920286 1 flags.go:33] FLAG: --root-ca-file="/etc/kubernetes/static-pod-resources/configmaps/serviceaccount-ca/ca-bundle.crt" I0410 22:49:10.920295 1 flags.go:33] FLAG: --service-account-private-key-file="/etc/kubernetes/static-pod-resources/secrets/service-account-private-key/service-account.key" I0410 22:49:11.969969 1 serving.go:312] Generated self-signed cert (/var/run/kubernetes/kube-controller-manager.crt, /var/run/kubernetes/kube-controller-manager.key) W0410 22:49:16.621781 1 authentication.go:272] Unable to get configmap/extension-apiserver-authentication in kube-system. 
Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA' configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system" 3 kube-controller-manager-5 belet-plugins/volume/exec" I0410 22:51:36.596105 1 flags.go:33] FLAG: --pv-recycler-increment-timeout-nfs="30" I0410 22:51:36.596112 1 flags.go:33] FLAG: --pv-recycler-minimum-timeout-hostpath="60" I0410 22:51:36.596119 1 flags.go:33] FLAG: --pv-recycler-minimum-timeout-nfs="300" I0410 22:51:36.596145 1 flags.go:33] FLAG: --pv-recycler-pod-template-filepath-hostpath="" I0410 22:51:36.596152 1 flags.go:33] FLAG: --pv-recycler-pod-template-filepath-nfs="" I0410 22:51:36.596158 1 flags.go:33] FLAG: --pv-recycler-timeout-increment-hostpath="30" I0410 22:51:36.596165 1 flags.go:33] FLAG: --pvclaimbinder-sync-period="15s" I0410 22:51:36.596173 1 glog.go:58] FLAGSET: podgc controller I0410 22:51:36.596180 1 flags.go:33] FLAG: --terminated-pod-gc-threshold="12500" I0410 22:51:36.596188 1 glog.go:58] FLAGSET: resourcequota controller I0410 22:51:36.596195 1 flags.go:33] FLAG: --concurrent-resource-quota-syncs="5" I0410 22:51:36.596202 1 flags.go:33] FLAG: --resource-quota-sync-period="5m0s" I0410 22:51:36.596210 1 glog.go:58] FLAGSET: ttl-after-finished controller I0410 22:51:36.596217 1 flags.go:33] FLAG: --concurrent-ttl-after-finished-syncs="5" I0410 22:51:36.596224 1 glog.go:58] FLAGSET: misc I0410 22:51:36.596232 1 flags.go:33] FLAG: --insecure-experimental-approve-all-kubelet-csrs-for-group="" I0410 22:51:36.596239 1 flags.go:33] FLAG: --kubeconfig="/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig" I0410 22:51:36.596249 1 flags.go:33] FLAG: --master="" I0410 22:51:36.596255 1 flags.go:33] FLAG: --openshift-config="/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml" I0410 22:51:37.519611 1 serving.go:312] Generated self-signed cert (/var/run/kubernetes/kube-controller-manager.crt, /var/run/kubernetes/kube-controller-manager.key) failed to create listener: failed to listen on 0.0.0.0:10257: listen tcp 0.0.0.0:10257: bind: address already in use 3 kube-controller-manager-5 ,-bootstrapsigner,-tokencleaner]" I0410 22:53:08.221996 1 flags.go:33] FLAG: --external-cloud-volume-plugin="" I0410 22:53:08.222000 1 flags.go:33] FLAG: --feature-gates="ExperimentalCriticalPodAnnotation=true,LocalStorageCapacityIsolation=false,RotateKubeletServerCertificate=true,SupportPodPidsLimit=true" I0410 22:53:08.222021 1 flags.go:33] FLAG: --kube-api-burst="300" I0410 22:53:08.222026 1 flags.go:33] FLAG: --kube-api-content-type="application/vnd.kubernetes.protobuf" I0410 22:53:08.222031 1 flags.go:33] FLAG: --kube-api-qps="150" I0410 22:53:08.222036 1 flags.go:33] FLAG: --leader-elect="true" I0410 22:53:08.222040 1 flags.go:33] FLAG: --leader-elect-lease-duration="15s" I0410 22:53:08.222044 1 flags.go:33] FLAG: --leader-elect-renew-deadline="10s" I0410 22:53:08.222048 1 flags.go:33] FLAG: --leader-elect-resource-lock="configmaps" I0410 22:53:08.222053 1 flags.go:33] FLAG: --leader-elect-retry-period="3s" I0410 22:53:08.222057 1 flags.go:33] FLAG: --min-resync-period="12h0m0s" I0410 22:53:08.222061 1 flags.go:33] FLAG: --node-monitor-period="5s" I0410 22:53:08.222065 1 flags.go:33] FLAG: --node-sync-period="0s" I0410 22:53:08.222069 1 flags.go:33] FLAG: --route-reconciliation-period="10s" I0410 22:53:08.222073 
1 flags.go:33] FLAG: --use-service-account-credentials="true" I0410 22:53:08.222078 1 glog.go:58] FLAGSET: insecure serving I0410 22:53:08.222082 1 flags.go:33] FLAG: --address="0.0.0.0" I0410 22:53:08.222087 1 flags.go:33] FLAG: --port="0" I0410 22:53:08.222091 1 glog.go:58] FLAGSET: ttl-after-finished controller I0410 22:53:08.222095 1 flags.go:33] FLAG: --concurrent-ttl-after-finished-syncs="5" I0410 22:53:08.923528 1 serving.go:312] Generated self-signed cert (/var/run/kubernetes/kube-controller-manager.crt, /var/run/kubernetes/kube-controller-manager.key) failed to create listener: failed to listen on 0.0.0.0:10257: listen tcp 0.0.0.0:10257: bind: address already in use 3 scheduler I0410 22:47:15.590634 1 flags.go:33] FLAG: --requestheader-group-headers="[x-remote-group]" I0410 22:47:15.590639 1 flags.go:33] FLAG: --requestheader-username-headers="[x-remote-user]" I0410 22:47:15.590646 1 flags.go:33] FLAG: --scheduler-name="default-scheduler" I0410 22:47:15.590650 1 flags.go:33] FLAG: --secure-port="10259" I0410 22:47:15.590654 1 flags.go:33] FLAG: --skip-headers="false" I0410 22:47:15.590658 1 flags.go:33] FLAG: --stderrthreshold="2" I0410 22:47:15.590662 1 flags.go:33] FLAG: --storage-driver-buffer-duration="1m0s" I0410 22:47:15.590666 1 flags.go:33] FLAG: --storage-driver-db="cadvisor" I0410 22:47:15.590670 1 flags.go:33] FLAG: --storage-driver-host="localhost:8086" I0410 22:47:15.590674 1 flags.go:33] FLAG: --storage-driver-password="root" I0410 22:47:15.590678 1 flags.go:33] FLAG: --storage-driver-secure="false" I0410 22:47:15.590682 1 flags.go:33] FLAG: --storage-driver-table="stats" I0410 22:47:15.590686 1 flags.go:33] FLAG: --storage-driver-user="root" I0410 22:47:15.590690 1 flags.go:33] FLAG: --tls-cert-file="" I0410 22:47:15.590694 1 flags.go:33] FLAG: --tls-cipher-suites="[]" I0410 22:47:15.590698 1 flags.go:33] FLAG: --tls-min-version="" I0410 22:47:15.590702 1 flags.go:33] FLAG: --tls-private-key-file="" I0410 22:47:15.590706 1 flags.go:33] FLAG: --tls-sni-cert-key="[]" I0410 22:47:15.590712 1 flags.go:33] FLAG: --use-legacy-policy-config="false" I0410 22:47:15.590716 1 flags.go:33] FLAG: --v="2" I0410 22:47:15.590720 1 flags.go:33] FLAG: --version="false" I0410 22:47:15.590726 1 flags.go:33] FLAG: --vmodule="" I0410 22:47:15.590730 1 flags.go:33] FLAG: --write-config-to="" I0410 22:47:16.335085 1 serving.go:312] Generated self-signed cert (/var/run/kubernetes/kube-scheduler.crt, /var/run/kubernetes/kube-scheduler.key) failed to create listener: failed to listen on 0.0.0.0:10251: listen tcp 0.0.0.0:10251: bind: address already in use 3 scheduler I0410 22:47:48.307527 1 flags.go:33] FLAG: --requestheader-group-headers="[x-remote-group]" I0410 22:47:48.307536 1 flags.go:33] FLAG: --requestheader-username-headers="[x-remote-user]" I0410 22:47:48.307548 1 flags.go:33] FLAG: --scheduler-name="default-scheduler" I0410 22:47:48.307555 1 flags.go:33] FLAG: --secure-port="10259" I0410 22:47:48.307562 1 flags.go:33] FLAG: --skip-headers="false" I0410 22:47:48.307570 1 flags.go:33] FLAG: --stderrthreshold="2" I0410 22:47:48.307576 1 flags.go:33] FLAG: --storage-driver-buffer-duration="1m0s" I0410 22:47:48.307584 1 flags.go:33] FLAG: --storage-driver-db="cadvisor" I0410 22:47:48.307591 1 flags.go:33] FLAG: --storage-driver-host="localhost:8086" I0410 22:47:48.307598 1 flags.go:33] FLAG: --storage-driver-password="root" I0410 22:47:48.307604 1 flags.go:33] FLAG: --storage-driver-secure="false" I0410 22:47:48.307611 1 flags.go:33] FLAG: --storage-driver-table="stats" I0410 
22:47:48.307618 1 flags.go:33] FLAG: --storage-driver-user="root" I0410 22:47:48.307625 1 flags.go:33] FLAG: --tls-cert-file="" I0410 22:47:48.307631 1 flags.go:33] FLAG: --tls-cipher-suites="[]" I0410 22:47:48.307640 1 flags.go:33] FLAG: --tls-min-version="" I0410 22:47:48.307646 1 flags.go:33] FLAG: --tls-private-key-file="" I0410 22:47:48.307653 1 flags.go:33] FLAG: --tls-sni-cert-key="[]" I0410 22:47:48.307663 1 flags.go:33] FLAG: --use-legacy-policy-config="false" I0410 22:47:48.307670 1 flags.go:33] FLAG: --v="2" I0410 22:47:48.307676 1 flags.go:33] FLAG: --version="false" I0410 22:47:48.307687 1 flags.go:33] FLAG: --vmodule="" I0410 22:47:48.307694 1 flags.go:33] FLAG: --write-config-to="" I0410 22:47:49.130245 1 serving.go:312] Generated self-signed cert (/var/run/kubernetes/kube-scheduler.crt, /var/run/kubernetes/kube-scheduler.key) failed to create listener: failed to listen on 0.0.0.0:10251: listen tcp 0.0.0.0:10251: bind: address already in use 2 console 2 2 console 2 2 network-operator 2019/04/10 22:37:46 Go Version: go1.10.8 2019/04/10 22:37:46 Go OS/Arch: linux/amd64 2019/04/10 22:37:46 operator-sdk Version: v0.4.1 2019/04/10 22:37:46 overriding kubernetes api to https://api.ci-op-1mzys9cv-5a633.origin-ci-int-aws.dev.rhcloud.com:6443 2019/04/10 22:37:46 Get https://api.ci-op-1mzys9cv-5a633.origin-ci-int-aws.dev.rhcloud.com:6443/api?timeout=32s: dial tcp 10.0.146.57:6443: connect: connection refused 1 operator 255 1 operator 255 1 operator ", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MissingVersion' no image found for operand pod I0410 22:39:58.146886 1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"40ec51e3-5be1-11e9-86d4-12a17bbd683e", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ConfigMapUpdated' Updated ConfigMap/revision-status-1 -n openshift-kube-apiserver: cause by changes in data.status I0410 22:40:00.145621 1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"40ec51e3-5be1-11e9-86d4-12a17bbd683e", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MissingVersion' no image found for operand pod I0410 22:40:14.992323 1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.crt (current: "3dfb8a49ce2bf6a428549f07b24b9062d309cf0faa54a440008567912b6cdde0", lastKnown: "") W0410 22:40:14.992750 1 builder.go:107] Restart triggered because of file /var/run/secrets/serving-cert/tls.crt was created I0410 22:40:14.992820 1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.key (current: "a94256ff7fb509379617976b4b679f27c99d1fa1243d1e72ec629f97b49af38a", lastKnown: "") F0410 22:40:14.992847 1 leaderelection.go:65] leaderelection lost I0410 22:40:15.017377 1 node_controller.go:133] Shutting down NodeController I0410 22:40:15.017451 1 status_controller.go:194] Shutting down StatusSyncer-kube-apiserver I0410 22:40:15.017477 1 monitoring_resource_controller.go:163] Shutting down MonitoringResourceController I0410 22:40:15.017489 1 unsupportedconfigoverrides_controller.go:164] Shutting down UnsupportedConfigOverridesController I0410 22:40:15.017517 1 targetconfigcontroller.go:332] Shutting down TargetConfigController F0410 22:40:15.017562 1 builder.go:242] stopped 1 operator 22:39:59.931772 1 monitoring_resource_controller.go:183] key failed with : the server 
could not find the requested resource I0410 22:39:59.932051 1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"40f3eb39-5be1-11e9-86d4-12a17bbd683e", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'ServiceMonitorCreateFailed' Failed to create ServiceMonitor.monitoring.coreos.com/v1: the server could not find the requested resource I0410 22:40:00.101758 1 request.go:530] Throttling request took 1.196718245s, request: GET:https://172.30.0.1:443/api/v1/namespaces/default/endpoints/kubernetes E0410 22:40:00.105365 1 satokensigner_controller.go:356] key failed with : unexpected addresses: 10.0.11.72 I0410 22:40:00.105742 1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"40f3eb39-5be1-11e9-86d4-12a17bbd683e", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SATokenSignerControllerStuck' unexpected addresses: 10.0.11.72 I0410 22:40:00.301864 1 request.go:530] Throttling request took 1.196471403s, request: DELETE:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/cloud-config I0410 22:40:00.453636 1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.crt (current: "3a84e84ac280cd1f968b4be36daff4481a69779621e8ae30aae5e666e642b6a9", lastKnown: "") W0410 22:40:00.453669 1 builder.go:107] Restart triggered because of file /var/run/secrets/serving-cert/tls.crt was created I0410 22:40:00.453734 1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.key (current: "bcc4f6e5250b21110b4f5c52dc1aa6acd21ce978813ba825ff8bba2860bb0544", lastKnown: "") F0410 22:40:00.453800 1 leaderelection.go:65] leaderelection lost 1 kube-scheduler-operator-container :443/api/v1/namespaces/openshift-kube-scheduler/serviceaccounts/openshift-kube-scheduler-sa I0410 22:39:19.170757 1 target_config_reconciler_v311_00.go:98] Error while listing configmap scheduler.config.openshift.io "cluster" not found I0410 22:39:19.170775 1 target_config_reconciler_v311_00.go:132] Error while getting scheduler type scheduler.config.openshift.io "cluster" not found and using default algorithm provider in kubernetes scheduler I0410 22:39:19.356506 1 request.go:530] Throttling request took 185.584562ms, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/configmaps/config I0410 22:39:19.556483 1 request.go:530] Throttling request took 152.757062ms, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/configmaps/serviceaccount-ca I0410 22:39:19.756540 1 request.go:530] Throttling request took 196.412339ms, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/configmaps/kube-scheduler-pod I0410 22:39:19.956518 1 request.go:530] Throttling request took 314.984401ms, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-config-managed/secrets/kube-scheduler-client-cert-key I0410 22:39:20.156496 1 request.go:530] Throttling request took 195.478373ms, request: DELETE:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/secrets/kube-scheduler-client-cert-key I0410 22:40:15.599067 1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.crt (current: "f216654f4f257725538388c95b3c650b1863affb1fbb0179bde9ab94394f11e0", lastKnown: "") W0410 22:40:15.599466 1 
builder.go:107] Restart triggered because of file /var/run/secrets/serving-cert/tls.crt was created I0410 22:40:15.599516 1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.key (current: "05d5ed64720d6a2e8a2807dfae0f614a99ed70be6556a0d8ec5b877426276097", lastKnown: "") F0410 22:40:15.599554 1 leaderelection.go:65] leaderelection lost 1 scheduler I0410 22:47:45.689339 1 flags.go:33] FLAG: --requestheader-group-headers="[x-remote-group]" I0410 22:47:45.689348 1 flags.go:33] FLAG: --requestheader-username-headers="[x-remote-user]" I0410 22:47:45.689359 1 flags.go:33] FLAG: --scheduler-name="default-scheduler" I0410 22:47:45.689366 1 flags.go:33] FLAG: --secure-port="10259" I0410 22:47:45.689373 1 flags.go:33] FLAG: --skip-headers="false" I0410 22:47:45.689380 1 flags.go:33] FLAG: --stderrthreshold="2" I0410 22:47:45.689386 1 flags.go:33] FLAG: --storage-driver-buffer-duration="1m0s" I0410 22:47:45.689393 1 flags.go:33] FLAG: --storage-driver-db="cadvisor" I0410 22:47:45.689400 1 flags.go:33] FLAG: --storage-driver-host="localhost:8086" I0410 22:47:45.689407 1 flags.go:33] FLAG: --storage-driver-password="root" I0410 22:47:45.689413 1 flags.go:33] FLAG: --storage-driver-secure="false" I0410 22:47:45.689447 1 flags.go:33] FLAG: --storage-driver-table="stats" I0410 22:47:45.689455 1 flags.go:33] FLAG: --storage-driver-user="root" I0410 22:47:45.689461 1 flags.go:33] FLAG: --tls-cert-file="" I0410 22:47:45.689467 1 flags.go:33] FLAG: --tls-cipher-suites="[]" I0410 22:47:45.689475 1 flags.go:33] FLAG: --tls-min-version="" I0410 22:47:45.689481 1 flags.go:33] FLAG: --tls-private-key-file="" I0410 22:47:45.689488 1 flags.go:33] FLAG: --tls-sni-cert-key="[]" I0410 22:47:45.689497 1 flags.go:33] FLAG: --use-legacy-policy-config="false" I0410 22:47:45.689504 1 flags.go:33] FLAG: --v="2" I0410 22:47:45.689510 1 flags.go:33] FLAG: --version="false" I0410 22:47:45.689520 1 flags.go:33] FLAG: --vmodule="" I0410 22:47:45.689527 1 flags.go:33] FLAG: --write-config-to="" I0410 22:47:46.107552 1 serving.go:312] Generated self-signed cert (/var/run/kubernetes/kube-scheduler.crt, /var/run/kubernetes/kube-scheduler.key) failed to create listener: failed to listen on 0.0.0.0:10251: listen tcp 0.0.0.0:10251: bind: address already in use 1 prometheus 1 1 prometheus 1 1 sdn 48] Associate netid 8225777 to namespace "e2e-tests-proxy-7hdpq" with mcEnabled false I0410 23:13:57.979666 1369 service.go:319] Adding new service port "e2e-tests-proxy-7hdpq/proxy-service-wjmj2:portname1" at 172.30.128.99:80/TCP I0410 23:13:57.979706 1369 service.go:319] Adding new service port "e2e-tests-proxy-7hdpq/proxy-service-wjmj2:portname2" at 172.30.128.99:81/TCP I0410 23:13:57.979724 1369 service.go:319] Adding new service port "e2e-tests-proxy-7hdpq/proxy-service-wjmj2:tlsportname1" at 172.30.128.99:443/TCP I0410 23:13:57.979740 1369 service.go:319] Adding new service port "e2e-tests-proxy-7hdpq/proxy-service-wjmj2:tlsportname2" at 172.30.128.99:444/TCP I0410 23:13:58.603590 1369 vnids.go:148] Associate netid 10022278 to namespace "e2e-tests-volumes-r7twb" with mcEnabled false I0410 23:13:59.139600 1369 vnids.go:162] Dissociate netid 2106653 from namespace "e2e-test-unprivileged-router-t9k9w" I0410 23:13:59.840380 1369 vnids.go:148] Associate netid 11305619 to namespace "e2e-tests-volumes-lbsn8" with mcEnabled false I0410 23:14:00.414157 1369 roundrobin.go:310] LoadBalancerRR: Setting endpoints for e2e-tests-statefulset-c22cn/test:http to [10.128.3.19:80 10.129.3.31:80 10.131.1.79:80] I0410 
23:14:00.414195 1369 roundrobin.go:240] Delete endpoint 10.131.1.79:80 for service "e2e-tests-statefulset-c22cn/test:http" I0410 23:14:00.414492 1369 vnids.go:162] Dissociate netid 335978 from namespace "e2e-tests-volumes-qlk64" I0410 23:14:00.419836 1369 roundrobin.go:310] LoadBalancerRR: Setting endpoints for e2e-tests-statefulset-c22cn/test:http to [10.129.3.31:80 10.131.1.79:80] I0410 23:14:00.419864 1369 roundrobin.go:240] Delete endpoint 10.128.3.19:80 for service "e2e-tests-statefulset-c22cn/test:http" I0410 23:14:00.615436 1369 ovs.go:169] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Broken pipe) F0410 23:14:00.615465 1369 healthcheck.go:96] SDN healthcheck detected unhealthy OVS server, restarting: plugin is not setup 1 sdn I0410 22:38:19.819972 2279 cmd.go:230] Overriding kubernetes api to https://api.ci-op-1mzys9cv-5a633.origin-ci-int-aws.dev.rhcloud.com:6443 I0410 22:38:19.820041 2279 cmd.go:133] Reading node configuration from /config/sdn-config.yaml W0410 22:38:19.822639 2279 server.go:198] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP. I0410 22:38:19.822776 2279 feature_gate.go:206] feature gates: &{map[]} I0410 22:38:19.822914 2279 cmd.go:256] Watching config file /config/sdn-config.yaml for changes I0410 22:38:19.823019 2279 cmd.go:256] Watching config file /config/..2019_04_10_22_38_13.135639008/sdn-config.yaml for changes I0410 22:38:19.825149 2279 node.go:148] Initializing SDN node of type "redhat/openshift-ovs-networkpolicy" with configured hostname "ip-10-0-160-159.ec2.internal" (IP ""), iptables sync period "30s" I0410 22:38:19.852074 2279 cmd.go:197] Starting node networking (v4.0.0-alpha.0+2108314-1987-dirty) I0410 22:38:19.852108 2279 node.go:267] Starting openshift-sdn network plugin F0410 22:38:19.890043 2279 cmd.go:114] Failed to start sdn: failed to validate network configuration: master has not created a default cluster network, network plugin "redhat/openshift-ovs-networkpolicy" can not start So my guess is that this is a symptom of some more fundamental issue, but I thought I'd file a tracker so we can look up that symptom -> issue later ;). [1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6639
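For repeating this triage against another job, here is a minimal sketch that wraps the same pods.json query used above into one command. The ARTIFACTS default is just the example job from this report; swap in whichever job you are looking at.

```
#!/usr/bin/env bash
# Triage sketch: list restarted containers and their last termination message
# from a CI job's pods.json artifact. The default URL is the example job above;
# pass a different artifacts URL as the first argument.
ARTIFACTS="${1:-https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6639/artifacts/e2e-aws}"

curl -s "${ARTIFACTS}/pods.json" | jq -r '
  [.items[] | (.status.containerStatuses // [])[] | .restartCount = (.restartCount | tonumber)]
  | sort_by(-.restartCount)[]
  | select(.restartCount > 0)
  | .previousMessage = (.lastState.terminated.message // (.lastState.terminated.exitCode | tostring) // "?")
  | (.restartCount | tostring) + "\t" + .name + "\t" + .previousMessage'
```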
Created attachment 1554367 [details]
Occurrences of this error in CI from 2019-04-09T23:53 to 2019-04-10T23:39 UTC

This occurred in 39 of our 468 failures (8%) in *-e2e-aws* jobs across the whole CI system over the past 23 hours. Generated with [1]:

$ deck-build-log-plot 'restarting failed container=kube-controller-manager'
39 restarting failed container=kube-controller-manager
2 https://github.com/openshift/origin/pull/22450 ci-op-16d8t700
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/9 ci-op-98ldn7y1
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/8 ci-op-058jydv3
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/7 ci-op-325vpl5h
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6 ci-op-8d4y588j
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4 ci-op-klfftt4z
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2 ci-op-vvpnjln3
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/13 ci-op-vrp1m58s
1 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11 ci-op-pfvkwr5g
...

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
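deck-build-log-plot is the helper from [1]; if you only have a local mirror of build logs, a rough stand-in could look like the sketch below. The ci-logs/<job>/build-log.txt layout is an assumption about a local mirror, not how the helper itself works.

```
# Rough stand-in for counting affected jobs over locally mirrored CI logs.
# Assumes one build-log.txt per job under ./ci-logs/<job>/build-log.txt.
grep -rl 'restarting failed container=kube-controller-manager' \
  ci-logs --include=build-log.txt | wc -l
```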
Moving this over to auth because this is caused by:

```
configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
```

No idea why the user can't get configmaps, but this looks awfully similar to the Unauthorized errors we saw earlier this week when pods were not able to get self-references to report events. I suspect there may be a new bug in the RBAC controller.
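On a live cluster, a quick way to confirm whether the grant is actually missing is sketched below. The role/binding names are the upstream kube-system defaults, not something confirmed from this job's artifacts.

```
# Does the controller-manager user currently have the permission the error complains about?
kubectl auth can-i get configmaps/extension-apiserver-authentication \
  -n kube-system --as=system:kube-controller-manager

# And are the default reader role and its bindings present in kube-system?
kubectl get role extension-apiserver-authentication-reader -n kube-system
kubectl get rolebinding -n kube-system -o wide | grep extension-apiserver-authentication
```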
Created attachment 1554887 [details]
Occurrences of this error in CI from 2019-04-12T04:31 to 2019-04-13T03:44 UTC

This occurred in 19 of our 726 failures (2%) in *-e2e-aws* jobs across the whole CI system over the past 23 hours. Generated with [1]:

$ deck-build-log-plot 'clusteroperator/kube-controller-manager changed Failing to True.*cannot get resource.*configmaps.*in API group'
'clusteroperator/kube-controller-manager changed Failing to True.*cannot get resource.*configmaps.*in API group'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
19 is not all that many, but as you can see in the attached plot, it also occurs in 296 (40% of failures) jobs in pods.json entries (but apparently gets resolved before the operator throws in the towel). For example, [1] doesn't surface much about kube-controller-manager in the build log: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1590/pull-ci-openshift-installer-master-e2e-aws/5285/build-log.txt | grep kube-controller-manager Apr 13 02:08:57.483 I ns/openshift-kube-controller-manager-operator deployment/kube-controller-manager-operator Changed loglevel level to "2" (91 times) Apr 13 02:18:57.472 I ns/openshift-kube-controller-manager-operator deployment/kube-controller-manager-operator Changed loglevel level to "2" (92 times) but it is reporting this in pods.json: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1590/pull-ci-openshift-installer-master-e2e-aws/5285/artifacts/e2e-aws/pods.json | jq -r '[.items[] | (.status.containerStatuses // [])[] | .restartCount = (.restartCount | tonumber)] | sort_by(-.restartCount)[] | select(.restartCount > 0) | .previousMessage = (.lastState.terminated.message // (.lastState.terminated.exitCode | tostring) // "?") | (.restartCount | tostring) + "\t" + .name + "\t" + .previousMessage' 3 kube-controller-manager-5 ue,LocalStorageCapacityIsolation=false,RotateKubeletServerCertificate=true,SupportPodPidsLimit=true" ... I0413 01:53:24.877669 1 flags.go:33] FLAG: --disable-attach-detach-reconcile-sync="false" W0413 01:53:30.656803 1 authentication.go:272] Unable to get configmap/extension-apiserver-authentication in kube-system. Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA' configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system" ... 2 kube-controller-manager-5 --authorization-always-allow-paths="[/healthz]" ... I0413 01:49:12.528798 1 flags.go:33] FLAG: --service-account-private-key-file="/etc/kubernetes/static-pod-resources/secrets/service-account-private-key/service-account.key" W0413 01:49:15.896633 1 authentication.go:272] Unable to get configmap/extension-apiserver-authentication in kube-system. Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA' configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system" ... 1 cluster-version-operator 0 1 console 3 01:48:26 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found 2019/04/13 01:48:36 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found ... 1 console 3 01:48:26 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found 2019/04/13 01:48:36 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found ... 
1 sdn I0413 01:35:11.474413 2270 cmd.go:230] Overriding kubernetes api to https://api.ci-op-0gty3y99-1d3f3.origin-ci-int-aws.dev.rhcloud.com:6443 I0413 01:35:11.474518 2270 cmd.go:133] Reading node configuration from /config/sdn-config.yaml W0413 01:35:11.477492 2270 server.go:198] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP. I0413 01:35:11.477610 2270 feature_gate.go:206] feature gates: &{map[]} I0413 01:35:11.477728 2270 cmd.go:256] Watching config file /config/sdn-config.yaml for changes I0413 01:35:11.477846 2270 cmd.go:256] Watching config file /config/..2019_04_13_01_35_02.247920586/sdn-config.yaml for changes I0413 01:35:11.480663 2270 node.go:148] Initializing SDN node of type "redhat/openshift-ovs-networkpolicy" with configured hostname "ip-10-0-169-215.ec2.internal" (IP ""), iptables sync period "30s" I0413 01:35:11.522827 2270 cmd.go:197] Starting node networking (v4.0.0-alpha.0+6f8c841-2022-dirty) I0413 01:35:11.522862 2270 node.go:267] Starting openshift-sdn network plugin F0413 01:35:11.568828 2270 cmd.go:114] Failed to start sdn: failed to validate network configuration: master has not created a default cluster network, network plugin "redhat/openshift-ovs-networkpolicy" can not start That also contains OAuth 404s like we saw in bug 1699469, although cluster-config-operator#43 landed at 2019-04-12T19:16Z [2], and this cluster started at 2019-04-13T01:11Z [1] and: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1590/pull-ci-openshift-installer-master-e2e-aws/5291/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "cluster-config-operator").annotations' { "io.openshift.build.commit.id": "8d0878a7418dc6c514f87ec9c46444d600a55b70", "io.openshift.build.commit.ref": "master", "io.openshift.build.source-location": "https://github.com/openshift/cluster-config-operator" } has the fix [3]. [1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1590/pull-ci-openshift-installer-master-e2e-aws/5285 [2]: https://github.com/openshift/cluster-config-operator/pull/43#event-2273345198 [3]: https://github.com/openshift/cluster-config-operator/commit/8d0878a7418dc6c514f87ec9c46444d600a55b70
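For future occurrences of the "does this payload already contain the fix?" question, the same image-references check can also be run against the release image itself. A sketch, assuming an oc that supports `oc adm release info --commits` and that the registry/tag below (purely illustrative) is reachable:

```
# Show the source repo and commit for one component of a release payload.
# RELEASE_IMAGE is an assumption; substitute the payload under investigation.
RELEASE_IMAGE=registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-04-18-210657

oc adm release info --commits "${RELEASE_IMAGE}" | grep cluster-config-operator
```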
There is a whole lot of overlap between the ~300 jobs whose pods.json mentions:

kube-controller-manager.*cannot get resource.*configmaps.*in API group

and those mentioning:

master has not created a default cluster network

(as seen above for the SDN pod). Looping in Dan and David from bug 1698950, since these may be the same thing.
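For what it's worth, here is a sketch of how such an overlap can be checked. The ci-artifacts/<job>/pods.json layout is an assumption about locally mirrored artifacts, not part of the CI system itself.

```
# Compare the set of jobs whose pods.json mentions the RBAC error with the set
# whose pods.json mentions the SDN "no default cluster network" error.
# Assumes artifacts mirrored locally as ci-artifacts/<job>/pods.json.
grep -rlE 'kube-controller-manager.*cannot get resource.*configmaps.*in API group' \
  ci-artifacts --include=pods.json | sort > /tmp/rbac-jobs
grep -rl 'master has not created a default cluster network' \
  ci-artifacts --include=pods.json | sort > /tmp/sdn-jobs
comm -12 /tmp/rbac-jobs /tmp/sdn-jobs | wc -l   # jobs hitting both
```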
The single SDN crash/restart early on is apparently expected (or at least acceptable), per bug 1674384, so that may be a red herring. I'm not sure whether there's a fundamental difference between the clusters that get three kube-controller-manager restarts (maybe ok, maybe not?) and those that get five, which trips our "Managed cluster should have no crashlooping pods in core namespaces over two minutes" test.
*** Bug 1698950 has been marked as a duplicate of this bug. ***
We think this will be fixed by https://github.com/openshift/cluster-kube-controller-manager-operator/pull/224.

@vlaad, I see no reason to believe these are related.
Created attachment 1555379 [details]
kube-controller-manager-operator#224 has fixed this in CI :)
Can't reproduce the issue with the latest payload (4.1.0-0.nightly-2019-04-18-210657), so will verify it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758