Bug 1769931

Summary: Kubelet logging FATAL level leader election lost messages on master every 60-90 seconds. New: system:kube-scheduler failure to list configmaps
Product: OpenShift Container Platform Reporter: Mike Fiedler <mifiedle>
Component: kube-scheduler    Assignee: Maciej Szulik <maszulik>
Status: CLOSED DUPLICATE QA Contact: ge liu <geliu>
Severity: low Docs Contact:
Priority: unspecified    
Version: 4.3.0    CC: aos-bugs, deads, jokerman, mfojtik, mpatel, sttts, xxia
Target Milestone: ---    Keywords: Reopened
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-12-11 13:25:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
Embargoed:

Description Mike Fiedler 2019-11-07 18:02:10 UTC
Description of problem:

Running journalctl -u kubelet | grep "F1107" (on 7 November) on a master node shows fatal-level messages being logged every 60-90 seconds:

Nov 07 16:05:03 mffied-mzgsv-m-0.c.openshift-qe.internal hyperkube[1870]: F1107 14:35:11.393205       1 leaderelection.go:66] leaderelection lost
Nov 07 16:06:14 mffied-mzgsv-m-0.c.openshift-qe.internal hyperkube[1870]: F1107 14:35:11.393205       1 leaderelection.go:66] leaderelection lost
Nov 07 16:07:26 mffied-mzgsv-m-0.c.openshift-qe.internal hyperkube[1870]: F1107 14:35:11.393205       1 leaderelection.go:66] leaderelection lost
Nov 07 16:08:30 mffied-mzgsv-m-0.c.openshift-qe.internal hyperkube[1870]: F1107 14:35:11.393205       1 leaderelection.go:66] leaderelection lost
Nov 07 16:09:59 mffied-mzgsv-m-0.c.openshift-qe.internal hyperkube[1870]: F1107 14:35:11.393205       1 leaderelection.go:66] leaderelection lost
Nov 07 16:11:01 mffied-mzgsv-m-0.c.openshift-qe.internal hyperkube[1870]: F1107 14:35:11.393205       1 leaderelection.go:66] leaderelection lost
Nov 07 16:12:17 mffied-mzgsv-m-0.c.openshift-qe.internal hyperkube[1870]: F1107 14:35:11.393205       1 leaderelection.go:66] leaderelection lost
Nov 07 16:13:38 mffied-mzgsv-m-0.c.openshift-qe.internal hyperkube[1870]: F1107 14:35:11.393205       1 leaderelection.go:66] leaderelection lost
Nov 07 16:14:58 mffied-mzgsv-m-0.c.openshift-qe.internal hyperkube[1870]: F1107 14:35:11.393205       1 leaderelection.go:66] leaderelection lost
Nov 07 16:16:13 mffied-mzgsv-m-0.c.openshift-qe.internal hyperkube[1870]: F1107 14:35:11.393205       1 leaderelection.go:66] leaderelection lost
Nov 07 16:17:30 mffied-mzgsv-m-0.c.openshift-qe.internal hyperkube[1870]: F1107 14:35:11.393205       1 leaderelection.go:66] leaderelection lost
Nov 07 16:18:56 mffied-mzgsv-m-0.c.openshift-qe.internal hyperkube[1870]: F1107 14:35:11.393205       1 leaderelection.go:66] leaderelection lost

Despite these messages, the cluster remains nominally healthy.


Version-Release number of selected component (if applicable): 4.3.0-0.nightly-2019-11-07-062654


How reproducible: Always


Steps to Reproduce:
1. Install a cluster with 4.3.0-0.nightly-2019-11-07-062654
2. Let it run for a while
3. oc adm node-logs <master-node> | grep "F<date>"    e.g. grep F1107 on 7-Nov
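
The grep pattern in step 3 can be sketched as follows; this is a minimal, self-contained illustration assuming klog's line format (a severity letter I/W/E/F, then the date as MMDD, then the timestamp), and the sample log is a hypothetical stand-in for real node logs:

```shell
# Create a small sample log in klog format (hypothetical contents).
cat > node.log <<'EOF'
I1107 14:35:10.000001       1 main.go:10] starting up
F1107 14:35:11.393205       1 leaderelection.go:66] leaderelection lost
EOF

# Fatal-level entries only, for 7 November (severity F + date 1107):
grep '^F1107' node.log
```

With real clusters, node.log would instead come from journalctl -u kubelet or oc adm node-logs <master-node>.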

Actual results:

Fatal errors reported every 60-90 seconds


Expected results:

No fatal errors in a healthy cluster


Additional info:

Will provide location of must-gather and node logs in follow-up comment

Comment 3 Mike Fiedler 2019-11-08 01:57:06 UTC
Let me know if the logs requested in comment 2 are not in the must-gather.

Comment 4 Michal Fojtik 2019-11-12 10:37:10 UTC
Possible fix: https://github.com/openshift/library-go/pull/586

Comment 7 Mike Fiedler 2019-12-03 22:18:24 UTC
Still an issue on 4.3.0-0.nightly-2019-12-03-094421. It now includes an additional F-level message for a system:kube-scheduler failure to list configmaps.

Let me know if you want a new bz.

Dec 03 21:05:25 ip-10-0-131-128 hyperkube[2536]: F1203 21:05:02.592648       1 server.go:264] leaderelection lost
Dec 03 21:05:35 ip-10-0-131-128 hyperkube[2536]: -         s"&PodStatus{Phase:Running,Conditions:[]PodCondition{PodCondition{Type:Initialized,Status:True,LastProbeTime:0001-01-01 00:00:00 +0000 UTC,LastTransitionTime:2019-12-03 21:03:56 +0000 UTC,Reason:,Message:,},PodCondition{Type:Ready,Status:False,LastProbeTime:0001-01-01 00:00:00 +0000 UTC,LastTransitionTime:2019-12-03 21:05:15 +0000 UTC,Reason:ContainersNotReady,Message:containers with unready status: [scheduler],},PodCondition{Type:ContainersReady,Status:False,LastProbeTime:0001-01-01 00:00:00 +0000 UTC,LastTransitionTime:2019-12-03 21:05:15 +0000 UTC,Reason:ContainersNotReady,Message:containers with unready status: [scheduler],},PodCondition{Type:PodScheduled,Status:True,LastProbeTime:0001-01-01 00:00:00 +0000 UTC,LastTransitionTime:2019-12-03 20:58:24 +0000 UTC,Reason:,Message:,},},Message:,Reason:,HostIP:10.0.131.128,PodIP:10.0.131.128,StartTime:2019-12-03 20:58:24 +0000 UTC,ContainerStatuses:[]ContainerStatus{ContainerStatus{Name:scheduler,State:ContainerState{Waiting:nil,Running:&ContainerStateRunning{StartedAt:2019-12-03 21:05:16 +0000 UTC,},Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:&ContainerStateTerminated{ExitCode:255,Signal:0,Reason:Error,Message:m\"\nE1203 21:04:57.705581       1 reflector.go:123] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209: Failed to list *v1.ConfigMap: configmaps \"extension-apiserver-authentication\" is forbidden: User \"system:kube-scheduler\" cannot list resource \"configmaps\" in API group \"\" in the namespace \"kube-system\"\nE1203 21:04:58.706954       1 reflector.go:123] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209: Failed to list *v1.ConfigMap: configmaps \"extension-apiserver-authentication\" is forbidden: User \"system:kube-scheduler\" cannot list resource \"configmaps\" in API group \"\" in the namespace \"kube-system\"\nE1203 21:04:59.708420       1 
reflector.go:123] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209: Failed to list *v1.ConfigMap: configmaps \"extension-apiserver-authentication\" is forbidden: User \"system:kube-scheduler\" cannot list resource \"configmaps\" in API group \"\" in the namespace \"kube-system\"\nE1203 21:04:59.800263       1 leaderelection.go:330] error retrieving resource lock openshift-kube-scheduler/kube-scheduler: etcdserver: request timed out\nE1203 21:05:00.709747       1 reflector.go:123] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209: Failed to list *v1.ConfigMap: configmaps \"extension-apiserver-authentication\" is forbidden: User \"system:kube-scheduler\" cannot list resource \"configmaps\" in API group \"\" in the namespace \"kube-system\"\nE1203 21:05:01.711156       1 reflector.go:123] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209: Failed to list *v1.ConfigMap: configmaps \"extension-apiserver-authentication\" is forbidden: User \"system:kube-scheduler\" cannot list resource \"configmaps\" in API group \"\" in the namespace \"kube-system\"\nI1203 21:05:02.592623       1 leaderelection.go:287] failed to renew lease openshift-kube-scheduler/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded\nF1203 21:05:02.592648       1 server.go:264] leaderelection lost\n,StartedAt:2019-12-03 21:03:56 +0000 UTC,FinishedAt:2019-12-03 21:05:14 +0000 
UTC,ContainerID:cri-o://2c8d7761b920953f619848f0f12d542ff230c7ea4c3c7a0f350c61333089454e,},},Ready:false,RestartCount:1,Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cf2e48a865a9754f6340600cb9c6b7a015f2eaf364d105ac70825455d296136e,ImageID:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cf2e48a865a9754f6340600cb9c6b7a015f2eaf364d105ac70825455d296136e,ContainerID:cri-o://ce17338d014d46786cb9e8167c9a9dc96cf1515a6cb992aa158bed5b1bdecf6a,Started:*true,},},QOSClass:Burstable,InitContainerStatuses:[]ContainerStatus{ContainerStatus{Name:wait-for-host-port,State:ContainerState{Waiting:nil,Running:nil,Terminated:&ContainerStateTerminated{ExitCode:0,Signal:0,Reason:Completed,Message:,StartedAt:2019-12-03 21:02:59 +0000 UTC,FinishedAt:2019-12-03 21:03:00 +0000 UTC,ContainerID:cri-o://cb06f359720c302cf32c4fe08d1e3ea1ba36e35164211adbb0a4d445fbd47cf9,},},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:true,RestartCount:0,Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cf2e48a865a9754f6340600cb9c6b7a015f2eaf364d105ac70825455d296136e,ImageID:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cf2e48a865a9754f6340600cb9c6b7a015f2eaf364d105ac70825455d296136e,ContainerID:cri-o://cb06f359720c302cf32c4fe08d1e3ea1ba36e35164211adbb0a4d445fbd47cf9,Started:nil,},},NominatedNodeName:,PodIPs:[]PodIP{PodIP{IP:10.0.131.128,},},EphemeralContainerStatuses:[]ContainerStatus{},}",
Dec 03 21:05:35 ip-10-0-131-128 hyperkube[2536]: +                                         Terminated: s"&ContainerStateTerminated{ExitCode:255,Signal:0,Reason:Error,Message:m\"\nE1203 21:04:57.705581       1 reflector.go:123] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209: Failed to list *v1.ConfigMap: configmaps \"extension-apiserver-authentication\" is forbidden: User \"system:kube-scheduler\" cannot list resource \"configmaps\" in API group \"\" in the namespace \"kube-system\"\nE1203 21:04:58.706954       1 reflector.go:123] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209: Failed to list *v1.ConfigMap: configmaps \"extension-apiserver-authentication\" is forbidden: User \"system:kube-scheduler\" cannot list resource \"configmaps\" in API group \"\" in the namespace \"kube-system\"\nE1203 21:04:59.708420       1 reflector.go:123] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209: Failed to list *v1.ConfigMap: configmaps \"extension-apiserver-authentication\" is forbidden: User \"system:kube-scheduler\" cannot list resource \"configmaps\" in API group \"\" in the namespace \"kube-system\"\nE1203 21:04:59.800263       1 leaderelection.go:330] error retrieving resource lock openshift-kube-scheduler/kube-scheduler: etcdserver: request timed out\nE1203 21:05:00.709747       1 reflector.go:123] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209: Failed to list *v1.ConfigMap: configmaps \"extension-apiserver-authentication\" is forbidden: User \"system:kube-scheduler\" cannot list resource \"configmaps\" in API group \"\" in the namespace \"kube-system\"\nE1203 21:05:01.711156       1 reflector.go:123] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209: Failed to list *v1.ConfigMap: configmaps \"extension-apiserver-authentication\" is forbidden: User \"system:kube-scheduler\" cannot list resource \"configmaps\" in API 
group \"\" in the namespace \"kube-system\"\nI1203 21:05:02.592623       1 leaderelection.go:287] failed to renew lease openshift-kube-scheduler/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded\nF1203 21:05:02.592648       1 server.go:264] leaderelection lost\n,StartedAt:2019-12-03 21:03:56 +0000 UTC,FinishedAt:2019-12-03 21:05:14 +0000 UTC,ContainerID:cri-o://2c8d7761b920953f619848f0f12d542ff230c7ea4c3c7a0f350c61333089454e,}",
Dec 03 21:05:35 ip-10-0-131-128 hyperkube[2536]: F1203 21:05:02.592648       1 server.go:264] leaderelection lost

Comment 9 David Eads 2019-12-05 12:25:30 UTC
The code for operators and some non-critical controller managers simply suicides and restarts.  The fatal message is cosmetic in such cases.  Recommend reducing severity and deferring.
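
The crash-and-restart pattern described above can be sketched in shell; "worker" below is a hypothetical stand-in for a controller that logs Fatal and exits non-zero when it loses its leader lease, and the loop plays the role of the kubelet/systemd supervisor restarting it:

```shell
#!/bin/sh
# Minimal sketch of the supervision pattern: the controller exits fatally
# on lost leadership and relies on its supervisor to restart it.
attempts=0
worker() {
  attempts=$((attempts + 1))
  echo "F$(date +%m%d) leaderelection lost (attempt $attempts)"
  return 255   # controller logs Fatal and exits with a non-zero status
}
while [ "$attempts" -lt 3 ]; do
  worker || echo "supervisor: restarting worker"
done
```

From the supervisor's point of view the process failed, so an F-level log line is emitted each time, even though the restart itself is routine.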

Comment 11 Mike Fiedler 2019-12-06 01:11:06 UTC
IMO we should not log cosmetic messages at the Fatal level.  Info? Warning? Debug?  Re-opening just to confirm we want Fatal here.

Comment 12 Mike Fiedler 2019-12-06 01:13:14 UTC
As an example for comment 11: I search Elasticsearch for Fxxxx messages in the OCP ops logs as an indicator of failures.  In previous releases, this has been reliable.

Comment 14 Maciej Szulik 2019-12-11 13:25:59 UTC
I assume it is clear from the above that the Fatal level is not going to be changed. As for the scheduler part, https://bugzilla.redhat.com/show_bug.cgi?id=1778072 tracks the configmaps problem.

*** This bug has been marked as a duplicate of bug 1778072 ***