Created attachment 1691969 [details]
CNI-kuryr and cri-o logs on problematic worker

Description of problem:

Versions: 4.3.0-0.nightly-2020-05-22-083448 over OSP16.0 (RHOS_TRUNK-16.0-RHEL-8-20200506.n.2)

After running NP + Conformance tests, the kuryr-daemon listen queue got saturated on one of the workers (note the Recv-Q of 129 on port 5036):

[root@ostest-glbdk-worker-768p6 ~]# sudo netstat -plnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address      Foreign Address   State    PID/Program name
tcp        0      0 0.0.0.0:45097      0.0.0.0:*         LISTEN   1328/rpc.statd
tcp      129      0 127.0.0.1:5036     0.0.0.0:*         LISTEN   9211/kuryr-daemon:
tcp        0      0 10.196.1.38:9100   0.0.0.0:*         LISTEN   3272/kube-rbac-prox
tcp        0      0 127.0.0.1:9100     0.0.0.0:*         LISTEN   2723/node_exporter
tcp        0      0 0.0.0.0:111        0.0.0.0:*         LISTEN   1/systemd
tcp        0      0 127.0.0.1:4180     0.0.0.0:*         LISTEN   5713/oauth-proxy
tcp        0      0 0.0.0.0:22         0.0.0.0:*         LISTEN   1294/sshd
tcp        0      0 0.0.0.0:8090       0.0.0.0:*         LISTEN   3518056/kuryr-daemo
tcp        0      0 10.196.1.38:10010  0.0.0.0:*         LISTEN   1291/crio
tcp        0      0 127.0.0.1:8797     0.0.0.0:*         LISTEN   5537/machine-config
tcp        0      0 127.0.0.1:10248    0.0.0.0:*         LISTEN   1373/hyperkube
tcp6       0      0 :::9001            :::*              LISTEN   5713/oauth-proxy
tcp6       0      0 :::10250           :::*              LISTEN   1373/hyperkube
tcp6       0      0 :::37227           :::*              LISTEN   1328/rpc.statd
tcp6       0      0 :::111             :::*              LISTEN   1/systemd
tcp6       0      0 :::53              :::*              LISTEN   1889/coredns
tcp6       0      0 :::22              :::*              LISTEN   1294/sshd
tcp6       0      0 :::18080           :::*              LISTEN   1889/coredns
tcp6       0      0 :::9537            :::*              LISTEN   1291/crio

Version-Release number of selected component (if applicable):

How reproducible: not persistent.

Steps to Reproduce:
1. Run NP tests (Parallelism 1).
2. Run Conformance tests (Parallelism 10).

Actual results:
Any pod created on that worker remains in ContainerCreating status indefinitely.

[stack@undercloud-0 ~]$ oc get pods -o wide
NAME             READY   STATUS              RESTARTS   AGE     IP       NODE                        NOMINATED NODE   READINESS GATES
demo1-1-deploy   0/1     ContainerCreating   0          8m38s   <none>   ostest-glbdk-worker-768p6   <none>           <none>

Expected results:
Queue elements expired.

Additional info:
I was debugging this, and I had two observations:

1. The kuryr-daemon logs showed it was answering requests the whole time.
2. cri-o was calling CNI for pods that were already gone from the cluster. This takes a lot of time, as kuryr-daemon will wait for them to show up in its local cache.

Probably the way to go here is to have DEL requests check for the pod's existence in the K8s API. If the pod is gone and not in the registry, we should return early.
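A minimal sketch of the proposed early-return logic (not the actual Kuryr code; `get_pod` and `registry` are hypothetical stand-ins for Kuryr's K8s API client call and its in-memory pod registry):

```python
def can_finish_del_early(pod_key, registry, get_pod):
    """Decide whether a CNI DEL request can return immediately.

    If the pod is absent from the local registry *and* already gone from
    the K8s API (get_pod returns None), there is nothing left to unplug,
    so we return early instead of blocking until the cache wait times out.
    """
    if pod_key not in registry and get_pod(pod_key) is None:
        return True   # pod already deleted everywhere; skip the cache wait
    return False      # pod still known somewhere: run the normal DEL path


# Usage: a pod missing from both the API and the registry short-circuits.
print(can_finish_del_early("test/demo1-1-deploy",
                           registry={},
                           get_pod=lambda key: None))  # prints True
```

The key design point is that the DEL path no longer trusts the local cache alone; a single extra API read prevents stale DEL requests from piling up in the daemon's queue.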
Rebooting the worker helps to free the queue:

(shiftstack) [stack@undercloud-0 ~]$ openstack server reboot ostest-glbdk-worker-768p6

----

(shiftstack) [stack@undercloud-0 ~]$ oc get pods -o wide -n test
NAME             READY   STATUS      RESTARTS   AGE     IP              NODE                        NOMINATED NODE   READINESS GATES
demo1-1-7trtv    1/1     Running     0          2m3s    10.128.106.41   ostest-glbdk-worker-768p6   <none>           <none>
demo1-1-deploy   0/1     Completed   0          2m37s   10.128.106.29   ostest-glbdk-worker-768p6   <none>           <none>

After the reboot the Recv-Q on port 5036 is back to 0:

[core@ostest-glbdk-worker-768p6 ~]$ sudo netstat -puntl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address      Foreign Address   State    PID/Program name
tcp        0      0 127.0.0.1:5036     0.0.0.0:*         LISTEN   4424/kuryr-daemon:
tcp        0      0 10.196.1.38:9100   0.0.0.0:*         LISTEN   3930/kube-rbac-prox
tcp        0      0 127.0.0.1:9100     0.0.0.0:*         LISTEN   3840/node_exporter
tcp        0      0 0.0.0.0:55117      0.0.0.0:*         LISTEN   1310/rpc.statd
tcp        0      0 0.0.0.0:111        0.0.0.0:*         LISTEN   1/systemd
tcp        0      0 127.0.0.1:4180     0.0.0.0:*         LISTEN   3663/oauth-proxy
tcp        0      0 0.0.0.0:22         0.0.0.0:*         LISTEN   1281/sshd
tcp        0      0 0.0.0.0:8090       0.0.0.0:*         LISTEN   4428/kuryr-daemon:
tcp        0      0 10.196.1.38:10010  0.0.0.0:*         LISTEN   1279/crio
tcp        0      0 127.0.0.1:8797     0.0.0.0:*         LISTEN   3465/machine-config
tcp        0      0 127.0.0.1:10248    0.0.0.0:*         LISTEN   1919/hyperkube
tcp6       0      0 :::9001            :::*              LISTEN   3663/oauth-proxy
tcp6       0      0 :::10250           :::*              LISTEN   1919/hyperkube
tcp6       0      0 :::111             :::*              LISTEN   1/systemd
tcp6       0      0 :::53              :::*              LISTEN   2886/coredns
tcp6       0      0 :::22              :::*              LISTEN   1281/sshd
tcp6       0      0 :::49755           :::*              LISTEN   1310/rpc.statd
tcp6       0      0 :::18080           :::*              LISTEN   2886/coredns
tcp6       0      0 :::9537            :::*              LISTEN   1279/crio
This got fixed with the PR I'm adding here. There is not much we can do to verify this one; it was an extremely rare occurrence. If https://bugzilla.redhat.com/show_bug.cgi?id=1846225 gets verified, I believe we're good with this one too.
Verified on 4.6.0-0.nightly-2020-07-25-065959 on OSP16.1 (RHOS-16.1-RHEL-8-20200723.n.0).

https://bugzilla.redhat.com/show_bug.cgi?id=1846225 is already verified, and the NP and conformance tests ran with the expected results. No queued elements on either the workers or the masters:

(overcloud) [stack@undercloud-0 ~]$ for i in $(oc get nodes -o wide | awk '{print $6}' | grep -v INTERNAL); do ssh -J core@10.46.22.242 core@$i sudo netstat -puntl | grep 5036; done
Warning: Permanently added '10.46.22.242' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.196.1.38' (ECDSA) to the list of known hosts.
tcp        0      0 127.0.0.1:5036     0.0.0.0:*         LISTEN   98251/kuryr-daemon:
Warning: Permanently added '10.46.22.242' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.196.1.177' (ECDSA) to the list of known hosts.
tcp        0      0 127.0.0.1:5036     0.0.0.0:*         LISTEN   102873/kuryr-daemon
Warning: Permanently added '10.46.22.242' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.196.0.193' (ECDSA) to the list of known hosts.
tcp        0      0 127.0.0.1:5036     0.0.0.0:*         LISTEN   95284/kuryr-daemon:
Warning: Permanently added '10.46.22.242' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.196.0.66' (ECDSA) to the list of known hosts.
tcp        0      0 127.0.0.1:5036     0.0.0.0:*         LISTEN   499004/kuryr-daemon
Warning: Permanently added '10.46.22.242' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.196.2.64' (ECDSA) to the list of known hosts.
tcp        0      0 127.0.0.1:5036     0.0.0.0:*         LISTEN   727484/kuryr-daemon
Warning: Permanently added '10.46.22.242' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.196.0.117' (ECDSA) to the list of known hosts.
tcp        0      0 127.0.0.1:5036     0.0.0.0:*         LISTEN   416121/kuryr-daemon
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196