2018152 – CNI pod is not restarted when it cannot start servers due to ports being used

Summary: CNI pod is not restarted when it cannot start servers due to ports being used

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Michał Dulko
QA Contact:	Itzik Brown
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2023383

Reported:	2021-10-28 11:13 UTC by Itzik Brown
Modified:	2022-03-10 16:23 UTC
CC List:	0 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Clones:	2023383
Environment:
Last Closed:	2022-03-10 16:22:54 UTC
Target Upstream Version:
Embargoed:

Attachments
CNI log (138.78 KB, text/plain) 2021-10-28 11:13 UTC, Itzik Brown	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift kuryr-kubernetes pull 598	0	None	open	Bug 2018152: Do not start kuryr-daemon when worker_num <= 1	2021-11-09 17:06:49 UTC
Red Hat Product Errata	RHSA-2022:0056	0	None	None	None	2022-03-10 16:23:10 UTC

Description Itzik Brown 2021-10-28 11:13:36 UTC

Created attachment 1837924
CNI log

Description of problem:
A pod is stuck in ContainerCreating status when it runs on a node where the CNI pod's log contains the following message: OSError: [Errno 98] Address already in use (see the attached CNI log).
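
For context, the error in the log is what Python raises when a process tries to bind a TCP port that another socket already holds. The following is a minimal sketch that reproduces it on the loopback interface; `bind_twice` is a hypothetical helper written for this illustration and is not part of kuryr-kubernetes.

```python
import errno
import socket


def bind_twice(host: str = "127.0.0.1") -> int:
    """Bind two sockets to the same port and return the errno of the second bind.

    Returns 0 if the second bind unexpectedly succeeds.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the OS for any free port, so the sketch does not
        # depend on a specific port being available.
        first.bind((host, 0))
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # The port is now taken, so this bind is expected to fail.
            second.bind((host, port))
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    code = bind_twice()
    # On Linux, errno.EADDRINUSE is 98, matching "[Errno 98]" in the CNI log.
    print(code, code == errno.EADDRINUSE, errno.errorcode.get(code))
```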

Processes on the worker node:

[root@ostest-9smhv-worker-0-6k5br /]# ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.4 389136 77988 ?        Ssl  Oct27   0:07 kuryr-daemon: master process [/usr/bin/kuryr-daemon --config-file /etc/kuryr/kuryr.conf]
root          18  0.0  0.4 538896 77228 ?        Sl   Oct27   0:19 kuryr-daemon: master process [/usr/bin/kuryr-daemon --config-file /etc/kuryr/kuryr.conf]
root          28  0.1  0.4 476660 75820 ?        Sl   Oct27   1:26 kuryr-daemon: watcher worker(0)
root         157  0.1  0.5 767132 95412 ?        Sl   Oct27   2:01 kuryr-daemon: health worker(0)
root       39019  0.0  0.0  19240  3652 pts/0    Ss   10:25   0:00 bash
root       39140  0.0  0.0  54776  3864 pts/0    R+   10:28   0:00 ps -aux
[root@ostest-9smhv-worker-0-6k5br /]# ss -tulpn | grep LISTEN
tcp   LISTEN 0      128        127.0.0.1:8797       0.0.0.0:*                                            
tcp   LISTEN 0      128        127.0.0.1:10248      0.0.0.0:*                                            
tcp   LISTEN 0      128      10.196.3.39:10250      0.0.0.0:*                                            
tcp   LISTEN 0      128        127.0.0.1:10443      0.0.0.0:*                                            
tcp   LISTEN 0      128        127.0.0.1:10444      0.0.0.0:*                                            
tcp   LISTEN 0      128      10.196.3.39:9100       0.0.0.0:*                                            
tcp   LISTEN 0      128        127.0.0.1:9100       0.0.0.0:*                                            
tcp   LISTEN 0      128          0.0.0.0:111        0.0.0.0:*                                            
tcp   LISTEN 0      128          0.0.0.0:80         0.0.0.0:*                                            
tcp   LISTEN 0      128          0.0.0.0:54961      0.0.0.0:*                                            
tcp   LISTEN 0      128        127.0.0.1:4180       0.0.0.0:*                                            
tcp   LISTEN 0      128          0.0.0.0:22         0.0.0.0:*                                            
tcp   LISTEN 0      128      10.196.3.39:10010      0.0.0.0:*                                            
tcp   LISTEN 0      128          0.0.0.0:443        0.0.0.0:*                                            
tcp   LISTEN 0      128                *:18080            *:*                                            
tcp   LISTEN 0      128                *:9537             *:*                                            
tcp   LISTEN 0      128             [::]:54147         [::]:*                                            
tcp   LISTEN 0      128                *:9001             *:*                                            
tcp   LISTEN 0      128             [::]:111           [::]:*                                            
tcp   LISTEN 0      128                *:1936             *:*                                            
tcp   LISTEN 0      128                *:53               *:*                                            
tcp   LISTEN 0      128             [::]:22            [::]:*                                            
tcp   LISTEN 0      128                *:8090             *:*    users:(("kuryr-daemon: h",pid=157,fd=6))
tcp   LISTEN 0      128                *:10300            *:*          
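
To see which listener holds a given port without reading the table by eye, output like the above can be parsed. This is a rough sketch that assumes the column layout shown here (netid, state, two queue sizes, local address, peer address, optional process); `Listener` and `parse_ss_listeners` are hypothetical names introduced for the example.

```python
from typing import List, NamedTuple, Optional


class Listener(NamedTuple):
    address: str
    port: int
    process: Optional[str]


def parse_ss_listeners(output: str) -> List[Listener]:
    """Extract listening sockets from `ss -tulpn` style output."""
    listeners = []
    for line in output.splitlines():
        # Split into at most 7 fields so that a process column containing
        # spaces (e.g. "kuryr-daemon: h") stays in one piece.
        fields = line.split(None, 6)
        if len(fields) < 6 or fields[1] != "LISTEN":
            continue
        # The port follows the last colon; the address may itself contain
        # colons, as in "[::]".
        address, _, port = fields[4].rpartition(":")
        process = fields[6].strip() if len(fields) > 6 else None
        listeners.append(Listener(address, int(port), process or None))
    return listeners


if __name__ == "__main__":
    sample = (
        'tcp   LISTEN 0      128        127.0.0.1:8797       0.0.0.0:*   \n'
        'tcp   LISTEN 0      128             [::]:54147         [::]:*   \n'
        'tcp   LISTEN 0      128                *:8090             *:*    '
        'users:(("kuryr-daemon: h",pid=157,fd=6))\n'
    )
    for entry in parse_ss_listeners(sample):
        print(entry)
```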


Version-Release number of selected component (if applicable):
OCP 4.9.1 Kuryr


Comment 3 Itzik Brown 2021-11-29 13:56:14 UTC

Verified using the following steps:
1. oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas 0
2. oc -n openshift-network-operator scale deploy network-operator --replicas 0
3. oc -n openshift-kuryr delete ds kuryr-cni
4. Wait for the kuryr-cni pods to disappear.
5. SSH into any node (you'll need to add a security group rule).
6. Run on the node: nc -k -l 5036
7. oc -n openshift-network-operator scale deploy network-operator --replicas 1
8. Wait for the kuryr-cni pods to come back.
9. One of the pods should be in CrashLoopBackOff.
10. Exit the node.
11. Verify that all the CNI pods are ready.
12. oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas 1
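
The behaviour these steps check for (the pod crash-loops while the port is occupied, then recovers once it is free) amounts to the process exiting when it cannot bind, so that the container gets restarted. The following is a minimal sketch of that fail-fast pattern only. It is not the actual kuryr-kubernetes change, and `start_server_or_exit` is a hypothetical helper name.

```python
import socket
import sys


def start_server_or_exit(host: str, port: int) -> socket.socket:
    """Return a listening socket, or exit with status 1 if the bind fails."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.bind((host, port))
        server.listen()
    except OSError as exc:
        server.close()
        # Exit non-zero instead of carrying on without a server, so the
        # failure is visible to whatever supervises the process.
        print(f"cannot listen on {host}:{port}: {exc}", file=sys.stderr)
        raise SystemExit(1)
    return server


if __name__ == "__main__":
    # 5036 is the port occupied with nc in step 6; any port works here.
    srv = start_server_or_exit("127.0.0.1", 5036)
    print("listening on", srv.getsockname())
    srv.close()
```

With `nc -k -l 5036` running on the same host, the script is expected to exit with status 1; once the listener is gone, it should bind successfully.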

OCP 4.10.0-0.nightly-2021-11-27-004934
OSP RHOS-16.1-RHEL-8-20210903.n.0


Thanks to Michal

Comment 5 ShiftStack Bugwatcher 2022-03-05 07:07:20 UTC

Removing the Triaged keyword because:
* the QE automation assessment (flag qe_test_coverage) is missing

Comment 7 errata-xmlrpc 2022-03-10 16:22:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
