Bug 1873412 - OAuth server returned error after cluster being used for few days
Summary: OAuth server returned error after cluster being used for few days
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Luis Sanchez
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-28 08:54 UTC by Mikhail Krutov
Modified: 2020-10-27 16:35 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:35:35 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
dmesg from a node affected. (144.64 KB, application/gzip) - 2020-08-28 08:54 UTC, Mikhail Krutov
etcd disk and peer RTT latency (100.70 KB, image/png) - 2020-09-21 20:24 UTC, Sam Batschelet
heap profile (61.72 KB, image/png) - 2020-09-21 20:24 UTC, Sam Batschelet
cpu profile (30.95 KB, application/gzip) - 2020-09-21 20:55 UTC, Sam Batschelet
heap source (55.20 KB, application/gzip) - 2020-09-21 20:56 UTC, Sam Batschelet
podnetworkconnectivitychecks key sizes in bytes (10.88 KB, text/plain) - 2020-09-21 21:18 UTC, Sam Batschelet


Links
GitHub openshift/cluster-kube-apiserver-operator pull 956 (closed): Bug 1873412: OAuth server returned error after cluster being used for few days - last updated 2020-10-02 07:14:29 UTC
Red Hat Product Errata RHBA-2020:4196 - last updated 2020-10-27 16:35:51 UTC

Description Mikhail Krutov 2020-08-28 08:54:41 UTC
Created attachment 1712932 [details]
dmesg from a node affected.

Description of problem:
I've installed OCP 4.6 (on OpenStack). After the cluster had been in use for a few days, the OAuth server started returning errors (see Actual results below).

Version-Release number of selected component (if applicable):
/root/ocp46-new/openshift-install 4.6.0-0.nightly-2020-08-16-072105
built from commit c52bc55044a19220ec292da3428f4c3d84fdfdc6
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:4cb280f7473d303ce9ec7ab9b4daa54f00147ea637bbc848c9d1f7a72af5660a


How reproducible:
This is not the first time something similar has happened; it eventually occurred on 4.3, 4.4, and 4.5 as well.

Steps to Reproduce:
1. Install OCP 4.6 
2. Use the cluster as normal for a week
3.

Actual results:
{"metadata":{},"status":"Failure","message":"Timeout: request did not complete within 1m0s","reason":"Timeout","details":{},"code":504}

Expected results:
Auth continues to work.

Additional info:

I have SSH access to the masters in case additional logs are required.
However, `podman ps` returns no containers.
dmesg shows a lot of OOM killer activity; a short snippet:

http://pastebin.test.redhat.com/897433 (full dmesg available in the attachment)


top shows that etcd is severely leaking:

   3418 root      20   0   22.9g  14.1g      0 R  55.8  90.3  49:46.42 etcd                                                                       

I'll attach more logs in comments

Comment 5 Standa Laznicka 2020-08-28 10:27:53 UTC
In private conversation, the reporter claimed etcd was leaking memory.

Comment 6 Sam Batschelet 2020-08-28 10:42:52 UTC
I have seen some interesting memory usage recently. It would be good if we could get a heap profile to analyze.

oc rsh -T -n openshift-etcd $ETCD_POD  sh -c 'curl --cert $ETCDCTL_CERT --key $ETCDCTL_KEY --cacert $ETCDCTL_CACERT https://localhost:2379/debug/pprof/heap' > /tmp/heap

I understand that the apiserver might not be up, in which case you should be able to perform similar actions against the container directly on the host using `crictl exec` (see the sketch below). Once we have the profile we should be able to better understand what is going on.
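For completeness, a minimal sketch of the same heap capture run directly on a master host via `crictl`, assuming the etcd container is named "etcd" and exports the ETCDCTL_* variables (both assumptions):

```
# Sketch only: capture etcd's heap profile from the host when the apiserver is down.
ETCD_CONTAINER=$(crictl ps --name etcd --quiet | head -n1)
crictl exec "$ETCD_CONTAINER" sh -c \
  'curl -s --cert $ETCDCTL_CERT --key $ETCDCTL_KEY --cacert $ETCDCTL_CACERT \
   https://localhost:2379/debug/pprof/heap' > /tmp/heap
```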

Comment 9 Sam Batschelet 2020-08-28 14:07:35 UTC
After reviewing the cluster I took memory profiles of etcd. I also noticed that the etcd in this cluster was compiled with the wrong version of Go. This was a problem that happened during the build process with ART. We are working with the ART team to clean this up, but for the time being any build of etcd that uses Go 1.14 should be considered unstable. I will try to update you with details on when good builds will be available.

```
2020-08-28 09:36:34.116351 I | etcdmain: etcd Version: 3.4.9
2020-08-28 09:36:34.116356 I | etcdmain: Git SHA: 47063fb
2020-08-28 09:36:34.116360 I | etcdmain: Go Version: go1.14.4
2020-08-28 09:36:34.116366 I | etcdmain: Go OS/Arch: linux/amd64
```
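As a quick check, the Go release a given etcd binary was built with can also be read from the binary itself (a hedged sketch; it requires a local Go toolchain, and the binary path is an assumption):

```
# Sketch: print the Go release embedded in the etcd binary; adjust the path for your image.
go version /usr/bin/etcd
# The etcd startup log prints the same information, e.g. "etcdmain: Go Version: go1.14.4".
```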

Comment 10 Sam Batschelet 2020-08-28 17:26:35 UTC
For the reasons listed above I am marking this as a duplicate of bug 1873590. Let's track it there.

*** This bug has been marked as a duplicate of bug 1873590 ***

Comment 11 Sam Batschelet 2020-09-21 20:24:12 UTC
Created attachment 1715584 [details]
etcd disk and peer RTT latency

Comment 12 Sam Batschelet 2020-09-21 20:24:56 UTC
Created attachment 1715585 [details]
heap profile

Comment 13 Sam Batschelet 2020-09-21 20:51:49 UTC
I took a deep look at this today. What I believe is happening is that podnetworkconnectivitychecks keys are being written and read at sizes ranging from a couple hundred bytes to over 1 MB. As this cluster has been around longer, I believe the key size keeps growing. On clusters without exceptional storage, this bloat begins to manifest as memory pressure.

```
etcdctl get /kubernetes.io/controlplane.operator.openshift.io/podnetworkconnectivitychecks/openshift-apiserver/apiserver-broker46lpt-s6bxc-master-1-to-kubernetes-apiserver-service-cluster | wc -c
1006802
```

The total size of these keys is around 9 MB:
```
etcdctl get --prefix /kubernetes.io/controlplane.operator.openshift.io/podnetworkconnectivitychecks/openshift-apiserver/ | wc -c                                                           
9260933
```
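For reference, a minimal sketch (assuming the standard ETCDCTL_* environment inside the etcd pod) that lists each key under the prefix together with its value size in bytes, roughly the kind of listing captured in the attached "podnetworkconnectivitychecks key sizes in bytes" file:

```
# Sketch only: print value size (bytes) and key name for every podnetworkconnectivitychecks key.
PREFIX=/kubernetes.io/controlplane.operator.openshift.io/podnetworkconnectivitychecks/openshift-apiserver/
etcdctl get --prefix --keys-only "$PREFIX" | grep -v '^$' | while read -r key; do
  printf '%8d %s\n' "$(etcdctl get "$key" --print-value-only | wc -c)" "$key"
done | sort -n
```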

I looked at the heap profile from pprof (https://bugzilla.redhat.com/attachment.cgi?id=1715585) and it appears to me that all of the memory is in Unmarshal, which would be explainable by the I/O backpressure. As you can see from the metrics query (https://bugzilla.redhat.com/attachment.cgi?id=1715584), etcd latency is off the chart.

I am in the process of uploading the must-gather (1 GB) to Google Drive; once that is complete I will reference the link.

Comment 14 Sam Batschelet 2020-09-21 20:55:59 UTC
Created attachment 1715595 [details]
cpu profile

Comment 15 Sam Batschelet 2020-09-21 20:56:35 UTC
Created attachment 1715596 [details]
heap source

Comment 16 Sam Batschelet 2020-09-21 21:13:12 UTC
link to must-gather https://drive.google.com/file/d/1seggzMdNznqWRCY5rqN6-cI2c-P_Ylt1/view?usp=sharing

```
$ go/bin/go tool pprof ./heap
File: etcd
Type: inuse_space
Time: Sep 21, 2020 at 5:07pm (EDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 6904.41MB, 99.30% of 6953.01MB total
Dropped 137 nodes (cum <= 34.77MB)
Showing top 10 nodes out of 36
      flat  flat%   sum%        cum   cum%
 6380.90MB 91.77% 91.77%  6380.90MB 91.77%  go.etcd.io/etcd/raft/raftpb.(*Entry).Unmarshal
  523.51MB  7.53% 99.30%   523.51MB  7.53%  go.etcd.io/etcd/etcdserver/etcdserverpb.(*InternalRaftRequest).Marshal
         0     0% 99.30%     1700MB 24.45%  go.etcd.io/etcd/embed.StartEtcd
         0     0% 99.30%     1700MB 24.45%  go.etcd.io/etcd/etcdmain.Main
         0     0% 99.30%     1700MB 24.45%  go.etcd.io/etcd/etcdmain.startEtcd
         0     0% 99.30%     1700MB 24.45%  go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2
         0     0% 99.30%   523.51MB  7.53%  go.etcd.io/etcd/etcdserver.(*EtcdServer).Txn
         0     0% 99.30%   523.51MB  7.53%  go.etcd.io/etcd/etcdserver.(*EtcdServer).processInternalRaftRequestOnce
         0     0% 99.30%   523.51MB  7.53%  go.etcd.io/etcd/etcdserver.(*EtcdServer).raftRequest
         0     0% 99.30%   523.51MB  7.53%  go.etcd.io/etcd/etcdserver.(*EtcdServer).raftRequestOnce
```
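A couple of standard pprof follow-ups (sketched here as possible next steps, not commands taken from the must-gather) can show which call paths feed the raftpb Entry allocations:

```
$ go tool pprof ./heap
(pprof) top -cum        # sort by cumulative size to see the paths leading into Unmarshal
(pprof) peek Unmarshal  # callers/callees of the Unmarshal frames
(pprof) web             # optional call-graph rendering (requires graphviz)
```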

Comment 18 Sam Batschelet 2020-09-21 21:18:52 UTC
Created attachment 1715598 [details]
podnetworkconnectivitychecks key sizes in bytes

Comment 21 Sam Batschelet 2020-09-22 14:17:24 UTC
I would like to add that etcd 3.4 appears to have performance issues that it would be nice to have time to address upstream. For the record, I updated the etcd container to run the latest upstream 3.4.13 [1] and saw the exact same memory profile.

[1] gcr.io/etcd-development/etcd:v3.4.13

Comment 23 Ke Wang 2020-09-29 09:25:34 UTC
Verified with OCP 4.6.0-0.nightly-2020-09-27-075304.

Check the cluster health:
$ oc get nodes
NAME                                STATUS   ROLES    AGE   VERSION
kewang27osp2-ghtnn-master-0         Ready    master   41h   v1.19.0+fff8183
kewang27osp2-ghtnn-master-1         Ready    master   41h   v1.19.0+fff8183
kewang27osp2-ghtnn-master-2         Ready    master   41h   v1.19.0+fff8183
kewang27osp2-ghtnn-worker-0-84vdd   Ready    worker   40h   v1.19.0+fff8183
kewang27osp2-ghtnn-worker-0-ftzc6   Ready    worker   40h   v1.19.0+fff8183
kewang27osp2-ghtnn-worker-0-kzp66   Ready    worker   40h   v1.19.0+fff8183

$ oc get co | grep -v '.True.*False.*False'
NAME          VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE

$  oc get pods -A | grep -vE 'Running|Completed'
NAMESPACE         NAME               READY   STATUS      RESTARTS   AGE

We can see that nodes, cluster operators, and pods are all healthy, as expected.

$ oc debug node/kewang27osp2-ghtnn-master-0
sh-4.4# dmesg | grep oom

sh-4.4# top 

top - 09:10:57 up 1 day, 16:24,  0 users,  load average: 2.01, 2.00, 1.77
Tasks: 454 total,   2 running, 452 sleeping,   0 stopped,   0 zombie
%Cpu(s): 30.7 us, 15.7 sy,  0.0 ni, 45.8 id,  3.0 wa,  1.8 hi,  2.4 si,  0.6 st
MiB Mem :  16033.5 total,    167.6 free,   9080.0 used,   6785.9 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   9902.4 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                 
 519361 root      20   0   38876   2608   2160 R  85.0   0.0   0:00.39 lsof                                                                                                                    
4108558 root      20   0 3105300   2.0g  67276 S  65.0  12.7 127:12.51 kube-apiserver                                                                                                          
  12456 root      20   0 2040616 230756  28064 S  45.0   1.4  64:38.24 openshift-apise                                                                                                         
   5713 root      20   0   10.3g   2.4g 192868 S  40.0  15.4 424:19.05 etcd  
   
$ oc rsh -n openshift-etcd etcd-kewang27osp2-ghtnn-master-0

sh-4.4# etcd --version
etcd Version: 3.4.9
Git SHA: edaae9f
Go Version: go1.12.12
Go OS/Arch: linux/amd64

The total size of these podnetworkconnectivitychecks keys:
sh-4.4# etcdctl get --prefix /kubernetes.io/controlplane.operator.openshift.io/podnetworkconnectivitychecks/openshift-apiserver/ | wc -c
731478

etcd db size:
sh-4.4# etcdctl endpoint status
https://192.168.1.132:2379, 64e4d45778428cb9, 3.4.9, 182 MB, true, false, 3468, 2266207, 2266207, 
https://192.168.3.174:2379, 1b4bae5b585a17be, 3.4.9, 182 MB, false, false, 3468, 2266207, 2266207, 
https://192.168.1.208:2379, 3fc6b761bb2a2713, 3.4.9, 182 MB, false, false, 3468, 2266207, 2266207, 
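For readability, the same status can also be printed as a labeled table using standard etcdctl flags (an optional sketch, assuming the same environment as above):

```
# Optional sketch: labeled table output, querying all cluster members at once.
etcdctl endpoint status --cluster -w table
```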

While the cluster was running, I performed some operations, including creating new resources and disconnecting one master for a few hours; the etcd db size grew normally.

Comment 29 errata-xmlrpc 2020-10-27 16:35:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

