Upstream: https://review.opendev.org/c/openstack/kuryr-kubernetes/+/696240/
Downstream: https://github.com/openshift/kuryr-kubernetes/pull/439
Launchpad Bug: https://launchpad.net/bugs/1854134

Trying to backport the upstream patch to downstream 3.11.

=== OCP 3.11/OSP13 (CRI-O) without https://github.com/openshift/kuryr-kubernetes/pull/439 ===

```bash
root 97164 1 97164 0 1 Oct09 ? 00:00:00 [kuryr-daemon: s] <defunct>
root 105810 1 105810 0 1 Oct10 ? 00:00:00 [kuryr-daemon: s] <defunct>
root 105820 1 105820 0 1 Oct10 ? 00:00:00 [kuryr-daemon: s] <defunct>
root 105824 1 105824 0 1 Oct10 ? 00:00:00 [kuryr-daemon: s] <defunct>
root 105830 1 105830 0 1 Oct10 ? 00:00:00 [kuryr-daemon: s] <defunct>
root 114476 1 114476 0 1 Oct11 ? 00:00:00 [kuryr-daemon: s] <defunct>
root 114486 1 114486 0 1 Oct11 ? 00:00:00 [kuryr-daemon: s] <defunct>
root 114490 1 114490 0 1 Oct11 ? 00:00:00 [kuryr-daemon: s] <defunct>
root 114496 1 114496 0 1 Oct11 ? 00:00:00 [kuryr-daemon: s] <defunct>
root 117817 29 117817 0 1 01:36 ? 00:00:00 [kuryr-daemon: s] <defunct>
root 117819 1 117819 0 1 01:36 ? 00:00:00 [kuryr-daemon: s] <defunct>
root 117829 1 117829 0 1 01:36 ? 00:00:00 [kuryr-daemon: s] <defunct>
root 117833 1 117833 0 1 01:36 ? 00:00:00 [kuryr-daemon: s] <defunct>
```

This eventually exhausts the crio `pid-max` value of `1024` and renders the CNI unusable.

=== OCP 3.11/OSP13 (CRI-O) with https://github.com/openshift/kuryr-kubernetes/pull/439 ===

```bash
for POD in $(oc get pods |grep cni |awk '{print $1}'); do echo "$POD processes: $(oc rsh -c kuryr-cni $POD ps -eLf | wc -l )" ; done
kuryr-cni-ds-6nkxc processes: 15
kuryr-cni-ds-8z6bq processes: 15
kuryr-cni-ds-cfwl6 processes: 15
kuryr-cni-ds-l82qr processes: 15
kuryr-cni-ds-nvqwv processes: 15
```

Scale from 15 pods to 10 pods:

```bash
oc scale -n momo --replicas=10 deployment.apps/echo
deployment.apps/echo scaled
```

For each HTTP request `werkzeug` creates a zombie process:

```bash
for POD in $(oc get pods |grep cni |awk '{print $1}'); do echo "$POD processes: $(oc rsh -c kuryr-cni $POD ps -eLf | wc -l )" ; done
kuryr-cni-ds-6nkxc processes: 17
kuryr-cni-ds-8z6bq processes: 15
kuryr-cni-ds-cfwl6 processes: 18
kuryr-cni-ds-l82qr processes: 18
kuryr-cni-ds-nvqwv processes: 15
```

The process numbers only grow. With the patch it no longer creates zombies, however it leaves behind an un-terminated child. When we scale the pods we can see that the daemon forks additional worker processes:

```bash
strace -ff -s 1024 -p 28 -o 28.log
strace: Process 28 attached with 3 threads
strace: Process 7200 attached
strace: Process 7201 attached
strace: Process 7202 attached
strace: Process 7203 attached
^C
strace: Process 28 detached
strace: Process 34 detached
strace: Process 38 detached
strace: Process 7200 detached
```

Not all of them are cleaned up properly:

```bash
sh-4.2# ps -eLf | egrep "7200|7201|7202"
root 7200 28 7200 0 1 04:19 ? 00:00:00 kuryr-daemon: server worker(0)
root 7211 7029 7211 0 1 04:20 pts/1 00:00:00 grep -E 7200|7201|7202
```

This process `7200` is stuck in CNI forever.
```bash
sh-4.2# tail 28.log.7200
select(584, [581 583], [], [581 583], NULL) = 1 (in [581])
read(581, "\20\2\0\0\0\0\0\0", 8) = 8
read(581, "(dp0\nS'kwarg'\np1\nNsS'cookie'\np2\nNsS'name'\np3\nNsS'argv'\np4\n(lp5\ncpyroute2.netlink.rtnl.ifinfmsg\nifinfmsg\np6\naS\"(dp0\\nS'index'\\np1\\nI183\\nsS'family'\\np2\\nI0\\nsS'__align'\\np3\\nI0\\nsS'value'\\np4\\ncpyroute2.netlink\\nNotInitialized\\np5\\nsS'header'\\np6\\n(dp7\\nS'pid'\\np8\\nI28\\nsS'flags'\\np9\\nI1541\\nsS'sequence_number'\\np10\\nI262\\nsS'type'\\np11\\nI17\\nssS'flags'\\np12\\nI0\\nsS'ifi_type'\\np13\\nI0\\nsS'change'\\np14\\nI0\\nsS'attrs'\\np15\\n(lp16\\n(lp17\\nS'IFLA_INDEX'\\np18\\naI0\\naas.\"\np7\na(I0\nI0\ntp8\nasS'stage'\np9\nS'reconstruct'\np10\ns.", 520) = 520
sendto(583, {{len=32, type=0x11 /* NLMSG_??? */, flags=NLM_F_REQUEST|NLM_F_ACK|0x600, seq=262, pid=28}, "\x00\x00\x00\x00\xb7\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"}, 32, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 32
write(580, "Y\0\0\0\0\0\0\0(dp0\nS'cookie'\np1\nNsS'error'\np2\nNsS'return'\np3\nNsS'stage'\np4\nS'reconstruct'\np5\ns.", 89) = 89
select(584, [581 583], [], [581 583], NULL) = 1 (in [583])
getsockopt(583, SOL_SOCKET, SO_RCVBUF, [425984], [4]) = 0
recvfrom(583, {{len=36, type=NLMSG_ERROR, flags=0, seq=262, pid=7200}, {error=0, msg={len=32, type=0x11 /* NLMSG_??? */, flags=NLM_F_REQUEST|NLM_F_ACK|0x600, seq=262, pid=28}}}, 212992, 0, NULL, NULL) = 36
write(580, "\323\0\0\0\0\0\0\0(dp0\nS'error'\np1\nNsS'data'\np2\nS'$\\x00\\x00\\x00\\x02\\x00\\x00\\x00\\x06\\x01\\x00\\x00 \\x1c\\x00\\x00\\x00\\x00\\x00\\x00 \\x00\\x00\\x00\\x11\\x00\\x05\\x06\\x06\\x01\\x00\\x00\\x1c\\x00\\x00\\x00'\np3\nsS'stage'\np4\nS'broadcast'\np5\ns.", 211) = 211
select(584, [581 583], [], [581 583], NULL <detached ...>
sh-4.2# strace -s 1024 -p 7200
strace: Process 7200 attached
select(584, [581 583], [], [581 583], NULL
```
When I did the write-up I ran the test multiple times, and sometimes more processes are left stuck in an un-terminated state. In this particular run it was only one, but in an earlier run there were three, hence the correction above:

- These three processes `7200` is stuck in CNI forever.
+ This process `7200` is stuck in CNI forever.
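As a quick way to spot this situation directly on a node, a minimal check like the following can be used. This is not part of the original report; the process names are taken from the `ps` output above and serve only as an illustration.

```bash
# Count defunct (zombie) kuryr-daemon processes on the node; without the fix
# this number keeps growing until the crio pid limit (1024 here) is exhausted.
ps -eo stat,ppid,pid,cmd | awk '$1 ~ /^Z/ && /kuryr-daemon/' | wc -l

# List leftover "server worker" children that never terminated.
ps -eLf | grep "[k]uryr-daemon: server worker"
```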
How to verify:

1. Create a deployment.
2. Run a script that scales the deployment to 20 pods, waits 60 seconds, scales it back to 0 and waits 60 seconds again, in a loop (a sketch of such a script is shown below). Let it run for 15 minutes, stop it and check the process count; it shouldn't grow.

```bash
# Checking the processes in CNI
$ for POD in $(oc -n kuryr get pods |grep cni |awk '{print $1}'); do echo "$POD processes: $(oc -n kuryr rsh -c kuryr-cni $POD ps -eLf | wc -l )" ; done
```
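A minimal sketch of the scale loop from step 2, assuming the `echo` deployment in the `momo` namespace used earlier in this report (names, replica count and duration are only illustrative):

```bash
#!/bin/bash
# Scale the test deployment up and down in a loop for ~15 minutes.
# Assumes the "echo" deployment in the "momo" namespace from the example above.
END=$((SECONDS + 900))
while [ "$SECONDS" -lt "$END" ]; do
    oc scale -n momo --replicas=20 deployment.apps/echo
    sleep 60
    oc scale -n momo --replicas=0 deployment.apps/echo
    sleep 60
done
```

After stopping the loop, re-run the process check above; the counts per CNI pod should stay roughly constant instead of growing.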
Checked with version: v3.11.550

At the end of the scale loop:

```bash
[stack@undercloud-0 ~]$ for POD in $(oc -n kuryr get pods |grep cni |awk '{print $1}'); do echo "$POD processes: $(oc -n kuryr rsh -c kuryr-cni $POD ps -eLf | wc -l )" ; done
kuryr-cni-ds-2fl86 processes: 19
kuryr-cni-ds-84ck7 processes: 17
kuryr-cni-ds-ftf2v processes: 20
kuryr-cni-ds-g8qqv processes: 17
kuryr-cni-ds-gntts processes: 18
kuryr-cni-ds-kj9ld processes: 21
kuryr-cni-ds-qxfnp processes: 17
kuryr-cni-ds-zpg7v processes: 18
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 3.11.569 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4827