Created attachment 1837924
CNI log

Description of problem:

A pod is stuck in ContainerCreating status when running on a node where the CNI pod has the following message in its log:

OSError: [Errno 98] Address already in use

(Attached CNI log)

Processes in the worker node:

[root@ostest-9smhv-worker-0-6k5br /]# ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.4 389136 77988 ?        Ssl  Oct27   0:07 kuryr-daemon: master process [/usr/bin/kuryr-daemon --config-file /etc/kuryr/kuryr.conf]
root          18  0.0  0.4 538896 77228 ?        Sl   Oct27   0:19 kuryr-daemon: master process [/usr/bin/kuryr-daemon --config-file /etc/kuryr/kuryr.conf]
root          28  0.1  0.4 476660 75820 ?        Sl   Oct27   1:26 kuryr-daemon: watcher worker(0)
root         157  0.1  0.5 767132 95412 ?        Sl   Oct27   2:01 kuryr-daemon: health worker(0)
root       39019  0.0  0.0  19240  3652 pts/0    Ss   10:25   0:00 bash
root       39140  0.0  0.0  54776  3864 pts/0    R+   10:28   0:00 ps -aux

[root@ostest-9smhv-worker-0-6k5br /]# ss -tulpn | grep LISTEN
tcp   LISTEN 0      128    127.0.0.1:8797     0.0.0.0:*
tcp   LISTEN 0      128    127.0.0.1:10248    0.0.0.0:*
tcp   LISTEN 0      128    10.196.3.39:10250  0.0.0.0:*
tcp   LISTEN 0      128    127.0.0.1:10443    0.0.0.0:*
tcp   LISTEN 0      128    127.0.0.1:10444    0.0.0.0:*
tcp   LISTEN 0      128    10.196.3.39:9100   0.0.0.0:*
tcp   LISTEN 0      128    127.0.0.1:9100     0.0.0.0:*
tcp   LISTEN 0      128    0.0.0.0:111        0.0.0.0:*
tcp   LISTEN 0      128    0.0.0.0:80         0.0.0.0:*
tcp   LISTEN 0      128    0.0.0.0:54961      0.0.0.0:*
tcp   LISTEN 0      128    127.0.0.1:4180     0.0.0.0:*
tcp   LISTEN 0      128    0.0.0.0:22         0.0.0.0:*
tcp   LISTEN 0      128    10.196.3.39:10010  0.0.0.0:*
tcp   LISTEN 0      128    0.0.0.0:443        0.0.0.0:*
tcp   LISTEN 0      128    *:18080            *:*
tcp   LISTEN 0      128    *:9537             *:*
tcp   LISTEN 0      128    [::]:54147         [::]:*
tcp   LISTEN 0      128    *:9001             *:*
tcp   LISTEN 0      128    [::]:111           [::]:*
tcp   LISTEN 0      128    *:1936             *:*
tcp   LISTEN 0      128    *:53               *:*
tcp   LISTEN 0      128    [::]:22            [::]:*
tcp   LISTEN 0      128    *:8090             *:*        users:(("kuryr-daemon: h",pid=157,fd=6))
tcp   LISTEN 0      128    *:10300            *:*

Version-Release number of selected component (if applicable):
OCP 4.9.1
Kuryr

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
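When triaging an "Address already in use" error, it can help to turn the `ss -tulpn | grep LISTEN` listing into a lookup by port. The following is a minimal sketch, not part of Kuryr: the function name `listening_ports` and the `SAMPLE` text are illustrative, and the parser assumes the column layout shown in the listing in this report (local address in the fifth column, optional `users:(...)` owner at the end).

```python
def listening_ports(ss_output):
    """Map each listening port to a list of (local_address, owner) tuples.

    `owner` is the trailing `users:(...)` text when ss printed one,
    otherwise None. Lines that are not in LISTEN state are skipped.
    """
    ports = {}
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[1] != "LISTEN":
            continue
        # Split on the last colon so IPv6 forms such as [::]:111 work too.
        address, _, port = fields[4].rpartition(":")
        if not port.isdigit():
            continue
        owner = " ".join(fields[6:]) or None
        ports.setdefault(int(port), []).append((address, owner))
    return ports


# A few lines copied from the listing in this report.
SAMPLE = """\
tcp LISTEN 0 128 127.0.0.1:8797 0.0.0.0:*
tcp LISTEN 0 128 [::]:111 [::]:*
tcp LISTEN 0 128 *:8090 *:* users:(("kuryr-daemon: h",pid=157,fd=6))
"""

if __name__ == "__main__":
    for port, listeners in sorted(listening_ports(SAMPLE).items()):
        print(port, listeners)
```

For the listing in this report, such a lookup would show port 8090 held by the kuryr-daemon health worker (pid 157); most other entries carry no owner information in the output as captured.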
Verified using the following steps:

1. oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas 0
2. oc -n openshift-network-operator scale deploy network-operator --replicas 0
3. oc -n openshift-kuryr delete ds kuryr-cni
4. Wait for kuryr-cni pods to disappear.
5. <SSH into any node> (You'll need to add a security group rule)
6. Run on the node: nc -k -l 5036
7. oc -n openshift-network-operator scale deploy network-operator --replicas 1
8. Wait for the kuryr-cni pods to be back.
9. One of the pods should be in a CrashLoop.
10. Exit the node
11. Verify that all the CNI pods are ready
12. oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas 1

OCP 4.10.0-0.nightly-2021-11-27-004934
OSP RHOS-16.1-RHEL-8-20210903.n.0

Thanks to Michal
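The verification relies on `nc -k -l 5036` holding a port, presumably so that the kuryr-cni pod on that node cannot bind it. The sketch below illustrates the same kind of conflict in plain Python: a second bind to a port that already has a listener fails with `errno.EADDRINUSE` (Errno 98 on Linux). The helper `try_bind` is hypothetical and is not how Kuryr itself binds; the example uses a kernel-chosen free port on 127.0.0.1 rather than 5036.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Bind and listen on (host, port).

    Returns the listening socket, or None if the address is already in use.
    Any other OSError is re-raised.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None
        raise


if __name__ == "__main__":
    # Plays the role of `nc -k -l <port>`: port 0 lets the kernel pick a free port.
    holder = try_bind(0)
    port = holder.getsockname()[1]

    # A second listener on the same port is refused while the first is alive.
    print("second bind succeeded:", try_bind(port) is not None)

    # After the holder exits (step 10 above), the port can be bound again.
    holder.close()
    retry = try_bind(port)
    print("bind after release succeeded:", retry is not None)
    if retry is not None:
        retry.close()
```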
Removing the Triaged keyword because:
* the QE automation assessment (flag qe_test_coverage) is missing
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056