Created attachment 1691969 [details]
CNI-kuryr and cri-o logs on problematic worker

Description of problem:

Versions: 4.3.0-0.nightly-2020-05-22-083448 over OSP16.0 (RHOS_TRUNK-16.0-RHEL-8-20200506.n.2)

After running NP + Conformance tests, the kuryr-daemon listen queue got saturated on one of the workers (note the Recv-Q of 129 on port 5036):

[root@ostest-glbdk-worker-768p6 ~]# sudo netstat -plnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address      Foreign Address   State    PID/Program name
tcp        0      0 0.0.0.0:45097      0.0.0.0:*         LISTEN   1328/rpc.statd
tcp      129      0 127.0.0.1:5036     0.0.0.0:*         LISTEN   9211/kuryr-daemon:
tcp        0      0 10.196.1.38:9100   0.0.0.0:*         LISTEN   3272/kube-rbac-prox
tcp        0      0 127.0.0.1:9100     0.0.0.0:*         LISTEN   2723/node_exporter
tcp        0      0 0.0.0.0:111        0.0.0.0:*         LISTEN   1/systemd
tcp        0      0 127.0.0.1:4180     0.0.0.0:*         LISTEN   5713/oauth-proxy
tcp        0      0 0.0.0.0:22         0.0.0.0:*         LISTEN   1294/sshd
tcp        0      0 0.0.0.0:8090       0.0.0.0:*         LISTEN   3518056/kuryr-daemo
tcp        0      0 10.196.1.38:10010  0.0.0.0:*         LISTEN   1291/crio
tcp        0      0 127.0.0.1:8797     0.0.0.0:*         LISTEN   5537/machine-config
tcp        0      0 127.0.0.1:10248    0.0.0.0:*         LISTEN   1373/hyperkube
tcp6       0      0 :::9001            :::*              LISTEN   5713/oauth-proxy
tcp6       0      0 :::10250           :::*              LISTEN   1373/hyperkube
tcp6       0      0 :::37227           :::*              LISTEN   1328/rpc.statd
tcp6       0      0 :::111             :::*              LISTEN   1/systemd
tcp6       0      0 :::53              :::*              LISTEN   1889/coredns
tcp6       0      0 :::22              :::*              LISTEN   1294/sshd
tcp6       0      0 :::18080           :::*              LISTEN   1889/coredns
tcp6       0      0 :::9537            :::*              LISTEN   1291/crio

Version-Release number of selected component (if applicable):

How reproducible: not persistent.

Steps to Reproduce:
1. Run NP tests (Parallelism 1).
2. Run Conformance tests (Parallelism 10).

Actual results:
Any pod created on that worker remains in ContainerCreating status indefinitely.

[stack@undercloud-0 ~]$ oc get pods -o wide
NAME             READY   STATUS              RESTARTS   AGE     IP       NODE                        NOMINATED NODE   READINESS GATES
demo1-1-deploy   0/1     ContainerCreating   0          8m38s   <none>   ostest-glbdk-worker-768p6   <none>           <none>

Expected results:
Queue elements expired.

Additional info:
I was debugging this, and I had two observations:

1. The kuryr-daemon logs showed it was answering requests the whole time.
2. cri-o was calling CNI for pods that were already gone from the cluster. This takes a lot of time, as kuryr-daemon will wait for them to show up in its local cache.

Probably the way to go here is to have DEL requests check for the pod's existence in the K8s API. If the pod is gone and not in the registry, we should return early.
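A minimal sketch of the proposed early-return logic (not the actual Kuryr code; `get_pod` and `registry` are hypothetical stand-ins for Kuryr's K8s API client call and its in-memory pod registry):

```python
def can_finish_del_early(pod_key, registry, get_pod):
    """Decide whether a CNI DEL request can return immediately.

    If the pod is absent from the local registry *and* already gone from
    the K8s API (get_pod returns None), there is nothing left to unplug,
    so we return early instead of blocking until the cache wait times out.
    """
    if pod_key not in registry and get_pod(pod_key) is None:
        return True   # pod already deleted everywhere; skip the cache wait
    return False      # pod still known somewhere: run the normal DEL path


# Usage: a pod missing from both the API and the registry short-circuits.
print(can_finish_del_early("test/demo1-1-deploy",
                           registry={},
                           get_pod=lambda key: None))  # prints True
```

The key design point is that the DEL path no longer trusts the local cache alone; a single extra API read prevents stale DEL requests from piling up in the daemon's queue.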
Rebooting the worker helps to free the queue:

(shiftstack) [stack@undercloud-0 ~]$ openstack server reboot ostest-glbdk-worker-768p6

----

(shiftstack) [stack@undercloud-0 ~]$ oc get pods -o wide -n test
NAME             READY   STATUS      RESTARTS   AGE     IP              NODE                        NOMINATED NODE   READINESS GATES
demo1-1-7trtv    1/1     Running     0          2m3s    10.128.106.41   ostest-glbdk-worker-768p6   <none>           <none>
demo1-1-deploy   0/1     Completed   0          2m37s   10.128.106.29   ostest-glbdk-worker-768p6   <none>           <none>

After the reboot the Recv-Q on port 5036 is back to 0:

[core@ostest-glbdk-worker-768p6 ~]$ sudo netstat -puntl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address      Foreign Address   State    PID/Program name
tcp        0      0 127.0.0.1:5036     0.0.0.0:*         LISTEN   4424/kuryr-daemon:
tcp        0      0 10.196.1.38:9100   0.0.0.0:*         LISTEN   3930/kube-rbac-prox
tcp        0      0 127.0.0.1:9100     0.0.0.0:*         LISTEN   3840/node_exporter
tcp        0      0 0.0.0.0:55117      0.0.0.0:*         LISTEN   1310/rpc.statd
tcp        0      0 0.0.0.0:111        0.0.0.0:*         LISTEN   1/systemd
tcp        0      0 127.0.0.1:4180     0.0.0.0:*         LISTEN   3663/oauth-proxy
tcp        0      0 0.0.0.0:22         0.0.0.0:*         LISTEN   1281/sshd
tcp        0      0 0.0.0.0:8090       0.0.0.0:*         LISTEN   4428/kuryr-daemon:
tcp        0      0 10.196.1.38:10010  0.0.0.0:*         LISTEN   1279/crio
tcp        0      0 127.0.0.1:8797     0.0.0.0:*         LISTEN   3465/machine-config
tcp        0      0 127.0.0.1:10248    0.0.0.0:*         LISTEN   1919/hyperkube
tcp6       0      0 :::9001            :::*              LISTEN   3663/oauth-proxy
tcp6       0      0 :::10250           :::*              LISTEN   1919/hyperkube
tcp6       0      0 :::111             :::*              LISTEN   1/systemd
tcp6       0      0 :::53              :::*              LISTEN   2886/coredns
tcp6       0      0 :::22              :::*              LISTEN   1281/sshd
tcp6       0      0 :::49755           :::*              LISTEN   1310/rpc.statd
tcp6       0      0 :::18080           :::*              LISTEN   2886/coredns
tcp6       0      0 :::9537            :::*              LISTEN   1279/crio
This got fixed with the PR I'm adding here. There is not much we can do to verify this one; it was an extremely rare occurrence. If https://bugzilla.redhat.com/show_bug.cgi?id=1846225 gets verified, I believe we're good with this one too.
Verified on 4.6.0-0.nightly-2020-07-25-065959 on OSP16.1 (RHOS-16.1-RHEL-8-20200723.n.0).

https://bugzilla.redhat.com/show_bug.cgi?id=1846225 is already verified, and the NP and conformance tests ran with the expected results. No queued elements on either the workers or the masters:

(overcloud) [stack@undercloud-0 ~]$ for i in $(oc get nodes -o wide | awk '{print $6}' | grep -v INTERNAL); do ssh -J core@10.46.22.242 core@$i sudo netstat -puntl | grep 5036; done
Warning: Permanently added '10.46.22.242' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.196.1.38' (ECDSA) to the list of known hosts.
tcp        0      0 127.0.0.1:5036     0.0.0.0:*         LISTEN   98251/kuryr-daemon:
Warning: Permanently added '10.46.22.242' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.196.1.177' (ECDSA) to the list of known hosts.
tcp        0      0 127.0.0.1:5036     0.0.0.0:*         LISTEN   102873/kuryr-daemon
Warning: Permanently added '10.46.22.242' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.196.0.193' (ECDSA) to the list of known hosts.
tcp        0      0 127.0.0.1:5036     0.0.0.0:*         LISTEN   95284/kuryr-daemon:
Warning: Permanently added '10.46.22.242' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.196.0.66' (ECDSA) to the list of known hosts.
tcp        0      0 127.0.0.1:5036     0.0.0.0:*         LISTEN   499004/kuryr-daemon
Warning: Permanently added '10.46.22.242' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.196.2.64' (ECDSA) to the list of known hosts.
tcp        0      0 127.0.0.1:5036     0.0.0.0:*         LISTEN   727484/kuryr-daemon
Warning: Permanently added '10.46.22.242' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.196.0.117' (ECDSA) to the list of known hosts.
tcp        0      0 127.0.0.1:5036     0.0.0.0:*         LISTEN   416121/kuryr-daemon
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196