Bug 1939605 - cri-o calling CNI ADD for missing network namespaces
Summary: cri-o calling CNI ADD for missing network namespaces
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Peter Hunt
QA Contact: MinLi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-16 17:04 UTC by Michał Dulko
Modified: 2021-06-09 17:06 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-09 17:06:30 UTC
Target Upstream Version:
Embargoed:


Attachments
Kuryr-cni logs (192.47 KB, application/zip) - 2021-03-17 10:57 UTC, Mohammad
CNI logs after crio config changes (218.19 KB, application/zip) - 2021-03-17 22:23 UTC, Mohammad
Kuryr cni pod yaml (32.77 KB, application/zip) - 2021-03-17 22:24 UTC, Mohammad
Crio logs of all nodes attached. (14.72 MB, application/gzip) - 2021-03-18 23:07 UTC, Mohammad


Links
System ID Private Priority Status Summary Last Updated
Github openshift openshift-ansible pull 12316 0 None open Bug 1939605: cri-o+kuryr: Switch to proper netns management … 2021-05-18 18:45:49 UTC
Red Hat Product Errata RHSA-2021:2150 0 None None None 2021-06-09 17:06:45 UTC

Description Michał Dulko 2021-03-16 17:04:38 UTC
Description of problem:
A customer sees problems [1] like the one below when cri-o on 3.11 calls the kuryr-kubernetes CNI plugin:

2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service Traceback (most recent call last):
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/daemon/service.py", line 81, in add
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     vif = self.plugin.add(params)
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 74, in add
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     vifs = self._do_work(params, b_base.connect, timeout)
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 179, in _do_work
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     container_id=params.CNI_CONTAINERID)
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/binding/base.py", line 119, in connect
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     driver.connect(vif, ifname, netns, container_id)
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/binding/nested.py", line 61, in connect
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     with b_base.get_ipdb(netns) as c_ipdb:
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/binding/base.py", line 70, in get_ipdb
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     ipdb = pyroute2.IPDB(nl=pyroute2.NetNS(netns))
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/pyroute2/netns/nslink.py", line 167, in __init__
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     super(NetNS, self).__init__()
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/pyroute2/iproute/linux.py", line 127, in __init__
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     super(RTNL_API, self).__init__(*argv, **kwarg)
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/pyroute2/netlink/nlsocket.py", line 301, in __init__
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     super(NetlinkMixin, self).__init__()
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/pyroute2/remote/__init__.py", line 176, in __init__
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     raise init['error']
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service OSError: [Errno 2] No such file or directory: '/host_proc/104348/ns/net'

Clearly, while the CNI request is being served by kuryr-kubernetes, the requested netns is missing. This is a violation of the CNI spec [2]:

> The container runtime must create a new network namespace for the container before invoking any plugins.

This was worked on in 4.x ([3], [4], [5], [6]) but apparently never addressed in 3.11.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1937991
[2] https://github.com/containernetworking/cni/blob/spec-v0.4.0/SPEC.md#general-considerations
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1820118
[4] https://bugzilla.redhat.com/show_bug.cgi?id=1838116
[5] https://github.com/openshift/machine-config-operator/pull/1600
[6] https://github.com/openshift/machine-config-operator/pull/1689

Version-Release number of selected component (if applicable):


How reproducible:
Occasionally?

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Peter Hunt 2021-03-16 18:26:54 UTC
I've opened a PR that should fix this.

Note: it relies on a feature that has not been tested in its current iteration. The 4.x version is a rewrite. We'll need some serious testing of this change to make sure we don't regress.


Another (less risky) option is that customers who want to use kuryr enable this change themselves, while we don't merge it in-tree.

Comment 2 Mohammad 2021-03-16 19:37:48 UTC
Hi,

Our customer has some major production workloads on OCP3.11/Kuryr/OSP13 using CRI-O and is impacted; further advice on how to enable this change would be appreciated.

Regards

Comment 3 Peter Hunt 2021-03-16 19:48:25 UTC
Updating their /etc/crio/crio.conf to have
```
[crio.runtime]
manage_network_ns_lifecycle = true
```

and then draining a node and restarting crio, would enable this change.

I would recommend being wary about making this change everywhere immediately. Maybe do it on one node first and let the change soak a bit.
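
A minimal sketch of that per-node sequence, assuming a placeholder node name and the same config line as above:

```
# Hedged sketch: enable the option on one node at a time.
oc adm drain <node> --ignore-daemonsets --delete-local-data --force   # evacuate pods first
# edit /etc/crio/crio.conf on the node to add manage_network_ns_lifecycle = true
# under [crio.runtime], then:
sudo systemctl restart crio
oc adm uncordon <node>   # bring the node back once crio is up
```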

Comment 5 Mohammad 2021-03-17 02:57:01 UTC
After adding this change, I can't seem to create any new pods on the nodes that had this change and got restarted.

I see a lot of these messages in the kuryr-cni logs:


-13c495761791),plugin='noop',port_profile=<?>,preserve_on_delete=False,vif_name='tapa940beed-dd',vlan_id=282)^[[00m
2021-03-17 02:15:07.139 236 WARNING kuryr_kubernetes.cni.binding.base [-] Found hanging interface eth0 inside /var/run/netns/k8s_echo-24-76dd6d6984-s5j57_momo_3ac6ed6f-86c4-11eb-b88d-fa163e6dae63_0-844570f6 netns. Most likely it is a leftover from a kuryr-daemon restart. Trying to delete it.^[[00m
2021-03-17 02:15:07.141 214 INFO werkzeug [-] XXX.ZZZ.0.1 - - [17/Mar/2021 02:15:07] "POST /delNetwork HTTP/1.1" 204 -^[[00m
2021-03-17 02:15:07.143 245 WARNING kuryr_kubernetes.cni.binding.base [-] Found hanging interface eth0 inside /var/run/netns/k8s_echo-16-6f77fc5574-t6s55_momo_3a9723eb-86c4-11eb-b88d-fa163e6dae63_0-63409683 netns. Most likely it is a leftover from a kuryr-daemon restart. Trying to delete it.^[[00m
2021-03-17 02:15:07.152 216 INFO os_vif [-] Successfully unplugged vif VIFVlanNested(active=True,address=fa:16:3e:18:76:1e,has_traffic_filtering=False,id=df5bf65a-6f5a-4ebc-a3d9-59a9736f4e68,network=Network(807728c3-fedc-491c-87ae-02b3fc701f97),plugin='noop',port_profile=<?>,preserve_on_delete=False,vif_name='tapdf5bf65a-6f',vlan_id=1340)^[[00m
2021-03-17 02:15:07.160 216 INFO werkzeug [-] XXX.ZZZ.0.1 - - [17/Mar/2021 02:15:07] "POST /delNetwork HTTP/1.1" 204 -^[[00m
2021-03-17 02:15:07.167 121 INFO werkzeug [-] XXX.ZZZ.0.1 - - [17/Mar/2021 02:15:07] "POST /addNetwork HTTP/1.1" 202 -^[[00m
2021-03-17 02:15:07.178 254 WARNING kuryr_kubernetes.cni.binding.base [-] Found hanging interface eth0 inside /var/run/netns/k8s_echo-03-7b8d5bb654-zcqbv_momo_3a969a38-86c4-11eb-b88d-fa163e6dae63_0-2d3cad36 netns. Most likely it is a leftover from a kuryr-daemon restart. Trying to delete it.^[[00m
2021-03-17 02:15:07.185 236 INFO os_vif [-] Successfully unplugged vif VIFVlanNested(active=True,address=fa:16:3e:c3:3e:3b,has_traffic_filtering=False,id=136e4aef-a77d-44fa-ac1f-e535e820a5c7,network=Network(783feff0-f5eb-4ade-9abb-13c495761791),plugin='noop',port_profile=<?>,preserve_on_delete=False,vif_name='tap136e4aef-a7',vlan_id=2479)^[[00m
2021-03-17 02:15:07.185 256 WARNING kuryr_kubernetes.cni.binding.base [-] Found hanging interface eth0 inside /var/run/netns/k8s_echo-07-c56ffd74b-2mj89_momo_3a8f3c18-86c4-11eb-b88d-fa163e6dae63_0-e045c4d7 netns. Most likely it is a leftover from a kuryr-daemon restart. Trying to delete it.^[[00m
2021-03-17 02:15:07.188 236 INFO werkzeug [-] XXX.ZZZ.0.1 - - [17/Mar/2021 02:15:07] "POST /delNetwork HTTP/1.1" 204 -^[[00m
2021-03-17 02:15:07.196 259 WARNING kuryr_kubernetes.cni.binding.base [-] Found hanging interface eth0 inside /var/run/netns/k8s_echo-36-5c4d97cb8b-wk7qk_momo_3a97e58b-86c4-11eb-b88d-fa163e6dae63_0-b56a4ea0 netns. Most likely it is a leftover from a kuryr-daemon restart. Trying to delete it.^[[00m

Comment 6 Mohammad 2021-03-17 03:00:58 UTC
And:

2021-03-17 02:19:17.838 7160 INFO werkzeug [-] XXX.ZZZ.0.1 - - [17/Mar/2021 02:19:17] "POST /addNetwork HTTP/1.1" 500 -^[[00m
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing delNetwork request. CNI Params: {'CNI_IFNAME': u'eth0', 'CNI_NETNS': u'/var/run/netns/k8s_echo-16-6f77fc5574-t6s55_momo_3a9723eb-86c4-11eb-b88d-fa163e6dae63_0-106a5be6', 'CNI_PATH': u'/opt/cni/bin/:/opt/kuryr-cni/bin', 'CNI_ARGS': u'IgnoreUnknown=1;K8S_POD_NAMESPACE=momo;K8S_POD_NAME=echo-16-6f77fc5574-t6s55;K8S_POD_INFRA_CONTAINER_ID=7d66b29680878a45ff22b19244279a6434f18dcdf8c076a6f7dff24ac5fd9e72', 'CNI_COMMAND': u'DEL', 'CNI_CONTAINERID': u'7d66b29680878a45ff22b19244279a6434f18dcdf8c076a6f7dff24ac5fd9e72'}.: error: [Errno 2] No such file or directory
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service Traceback (most recent call last):
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/daemon/service.py", line 103, in delete
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service     self.plugin.delete(params)
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 109, in delete
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service     reg_ci = self.registry[pod_name]['containerid']
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service   File "<string>", line 2, in __getitem__
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib64/python2.7/multiprocessing/managers.py", line 755, in _callmethod
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service     self._connect()
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib64/python2.7/multiprocessing/managers.py", line 742, in _connect
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service     conn = self._Client(self._token.address, authkey=self._authkey)
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib64/python2.7/multiprocessing/connection.py", line 173, in Client
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service     c = SocketClient(address)
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib64/python2.7/multiprocessing/connection.py", line 308, in SocketClient
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service     s.connect(address)
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib64/python2.7/socket.py", line 224, in meth
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service     return getattr(self._sock,name)(*args)
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service error: [Errno 2] No such file or directory
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service ^[[00m

Comment 7 Mohammad 2021-03-17 03:13:21 UTC
As soon as I removed this setting:
ansible nodes -m lineinfile -b -a 'path=/etc/crio/crio.conf state="absent" regexp="^manage_network_ns_lifecycle"'

and restarted crio, pods were able to get created again.
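
For completeness, a hedged sketch of that rollback as a single ansible pass (the lineinfile call is the one above; restarting crio through the systemd module is an assumption about how it was done):

```
# Hedged rollback sketch: drop the config line, then restart cri-o on each node.
ansible nodes -m lineinfile -b -a 'path=/etc/crio/crio.conf state="absent" regexp="^manage_network_ns_lifecycle"'
ansible nodes -m systemd -b -a 'name=crio state=restarted'
```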

Comment 8 Mohammad 2021-03-17 03:15:07 UTC
[root@master-2 ~]# rpm -qa |grep cri-o
cri-o-1.11.16-0.13.rhaos3.11.git5218c73.el7.x86_64

Comment 9 Michał Dulko 2021-03-17 09:01:00 UTC
Mohammad, can you post full log of that kuryr-cni? First log you show seems okay (kubelet is trying to clean up the node as containers are gone) and the second one seems unrelated to the change Peter proposed (kuryr-cni complains about Python's multiprocessing utilities, we never had problems with these).

Comment 10 Michał Dulko 2021-03-17 09:02:04 UTC
(In reply to Michał Dulko from comment #9)
> Mohammad, can you post full log of that kuryr-cni? First log you show seems
> okay (kubelet is trying to clean up the node as containers are gone) and the
> second one seems unrelated to the change Peter proposed (kuryr-cni complains
> about Python's multiprocessing utilities, we never had problems with these).

Had you recreated all the pods on that affected node?

Comment 11 Mohammad 2021-03-17 10:57:21 UTC
Created attachment 1763965 [details]
Kuryr-cni logs

Logs are for the pods below.

[openshift@master-2 momologs]$ oc get pods -o wide
NAME                                READY     STATUS    RESTARTS   AGE       IP                NODE                                           NOMINATED NODE
kuryr-cni-ds-2bbv5                  2/2       Running   4          41m       XX.XX.192.227   infra-node-0.openshift-mydev2-mydc.mydomain   <none>
kuryr-cni-ds-47fk5                  1/2       Running   3          41m       XX.XX.192.156   master-1.openshift-mydev2-mydc.mydomain       <none>
kuryr-cni-ds-6lp69                  2/2       Running   4          41m       XX.XX.192.198   master-0.openshift-mydev2-mydc.mydomain       <none>
kuryr-cni-ds-8p5n2                  1/2       Running   2          41m       XX.XX.192.63    app-node-2.openshift-mydev2-mydc.mydomain     <none>
kuryr-cni-ds-brqs6                  2/2       Running   3          40m       XX.XX.192.104   master-2.openshift-mydev2-mydc.mydomain       <none>
kuryr-cni-ds-gck4q                  2/2       Running   4          41m       XX.XX.192.59    infra-node-2.openshift-mydev2-mydc.mydomain   <none>
kuryr-cni-ds-m7ttl                  2/2       Running   6          41m       XX.XX.192.197   app-node-0.openshift-mydev2-mydc.mydomain     <none>
kuryr-cni-ds-mngjd                  2/2       Running   0          41m       XX.XX.192.173   app-node-3.openshift-mydev2-mydc.mydomain     <none>
kuryr-cni-ds-rft49                  2/2       Running   5          41m       XX.XX.192.33    infra-node-1.openshift-mydev2-mydc.mydomain   <none>
kuryr-cni-ds-rnm2q                  2/2       Running   4          41m       XX.XX.192.130   app-node-1.openshift-mydev2-mydc.mydomain     <none>
kuryr-controller-59d5d4fd5b-tx8cf   1/1       Running   1          2h        XX.XX.192.33    infra-node-1.openshift-mydev2-mydc.mydomain   <none>

Comment 12 Mohammad 2021-03-17 10:59:35 UTC
(In reply to Michał Dulko from comment #10)
> (In reply to Michał Dulko from comment #9)
> > Mohammad, can you post full log of that kuryr-cni? First log you show seems
> > okay (kubelet is trying to clean up the node as containers are gone) and the
> > second one seems unrelated to the change Peter proposed (kuryr-cni complains
> > about Python's multiprocessing utilities, we never had problems with these).
> 
> Had you recreated all the pods on that affected node?

As soon as I reverted the change, some pods came back online. I didn't wait and just recreated the rest.

Comment 13 Michał Dulko 2021-03-17 11:18:13 UTC
Mohammad: Looks like you've started hitting NetlinkError(17, 'File exists'), which would make sense - cri-o started to manage network namespaces on its own and hadn't cleaned up the previous ones first, causing VLAN conflicts. You need to drain all the (non-host-networking) pods from the node before the config change.
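
A hedged way to check for those leftovers on a node before turning the option back on (targetnode is a placeholder host/group):

```
# Hedged sketch: list network namespaces still present on the host.
# After a full drain this should be (nearly) empty before re-enabling
# manage_network_ns_lifecycle.
ansible targetnode -m shell -b -a 'ls /var/run/netns'
```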

Comment 14 Michał Dulko 2021-03-17 17:43:02 UTC
Mohammad, I just realized that to test Peter's idea correctly you also need to make sure kuryr-cni mounts /run/netns from the host, by changing the kuryr-cni-ds DaemonSet (oc -n kuryr edit ds kuryr-cni-ds) to have this volume:

      - name: netns
        hostPath:
          path: /run/netns

And this volumeMount (on the kuryr-cni container):

        - name: netns
          mountPath: /run/netns
          mountPropagation: HostToContainer
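
If hand-editing is error-prone, a hedged alternative is a strategic merge patch; this assumes the container in the DaemonSet is named kuryr-cni (as the later `oc logs -c kuryr-cni` commands suggest) and is untested against how openshift-ansible manages that DaemonSet:

```
# Hedged sketch: add the netns volume and volumeMount in one patch.
oc -n kuryr patch ds kuryr-cni-ds -p '{
  "spec": {"template": {"spec": {
    "containers": [{"name": "kuryr-cni", "volumeMounts": [
      {"name": "netns", "mountPath": "/run/netns", "mountPropagation": "HostToContainer"}]}],
    "volumes": [{"name": "netns", "hostPath": {"path": "/run/netns"}}]
  }}}
}'
```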

Comment 15 Mohammad 2021-03-17 19:51:47 UTC
Thanks Michal,

So, I'll try the following:

1- Drain node
2- Update /etc/crio/crio.conf with `ansible targetnode -m lineinfile -b -a 'path=/etc/crio/crio.conf line="manage_network_ns_lifecycle = true" insertafter="\[crio.runtime\]"'`
3- Update Daemonset as per [1] and figure out a way to do it in a rolling manner
4- Create a load and observe.

I'll post back findings.


References:

1- Diff:

$ diff -u kuryr-cni-ds-before.yml kuryr-cni-ds-after.yml
--- kuryr-cni-ds-before.yml     2021-03-17 15:33:34.315354396 -0400
+++ kuryr-cni-ds-after.yml      2021-03-17 15:46:10.632727119 -0400
@@ -158,6 +158,12 @@
           name: proc
         - mountPath: /var/run/openvswitch
           name: openvswitch
+        - mountPath: /run/netns
+          name: netns
+          mountPropegation: HostToContainer
+        hostPath:
+        - path: /run/netns
+          name: netns
       dnsPolicy: ClusterFirst
       hostNetwork: true
       initContainers:

Comment 16 Mohammad 2021-03-17 20:32:17 UTC
Correcting [1]

References"
1- [openshift@master-2 ~]$ diff -u kuryr-cni-ds-before.yml kuryr-cni-ds-after.yml

--- kuryr-cni-ds-before.yml     2021-03-17 15:33:34.315354396 -0400
+++ kuryr-cni-ds-after.yml      2021-03-17 16:29:30.837409022 -0400
@@ -148,6 +148,9 @@
         terminationMessagePath: /dev/termination-log
         terminationMessagePolicy: File
         volumeMounts:
+        - mountPath: /run/netns
+          name: netns
+          mountPropegation: HostToContainer
         - mountPath: /opt/cni/bin
           name: bin
         - mountPath: /etc/cni/net.d
@@ -193,6 +196,10 @@
         operator: Exists
       volumes:
       - hostPath:
+          path: /run/netns
+          type: ""
+        name: netns
+      - hostPath:
           path: /opt/cni/bin
           type: ""
         name: bin

Comment 17 Mohammad 2021-03-17 22:21:39 UTC
Actual steps:

## Check config:

$ ansible nodes -m shell -b -a 'cat /etc/crio/crio.conf |grep manage_network_ns_lifecycle'
[WARNING]: Found both group and host with same name: localhost
infra-node-2.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
manage_network_ns_lifecycle = true
master-2.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
manage_network_ns_lifecycle = true
infra-node-1.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
manage_network_ns_lifecycle = true
app-node-2.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
manage_network_ns_lifecycle = true
master-1.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
manage_network_ns_lifecycle = true
app-node-3.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
manage_network_ns_lifecycle = true
master-0.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
manage_network_ns_lifecycle = true
infra-node-0.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
manage_network_ns_lifecycle = true
app-node-0.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
manage_network_ns_lifecycle = true
app-node-1.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
manage_network_ns_lifecycle = true

## Restart CRI-O after corresponding node is drained

$ ansible nodes -m shell -b -a 'systemctl status crio |grep "Active:"'
[WARNING]: Found both group and host with same name: localhost
app-node-3.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
   Active: active (running) since Wed 2021-03-17 17:31:28 EDT; 8min ago
master-0.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
   Active: active (running) since Wed 2021-03-17 17:23:46 EDT; 15min ago
master-2.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
   Active: active (running) since Wed 2021-03-17 17:31:28 EDT; 8min ago
master-1.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
   Active: active (running) since Wed 2021-03-17 17:31:28 EDT; 8min ago
infra-node-1.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
   Active: active (running) since Wed 2021-03-17 17:31:28 EDT; 8min ago
infra-node-2.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
   Active: active (running) since Wed 2021-03-17 17:31:27 EDT; 8min ago
app-node-1.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
   Active: active (running) since Wed 2021-03-17 17:23:46 EDT; 15min ago
infra-node-0.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
   Active: active (running) since Wed 2021-03-17 17:23:46 EDT; 15min ago
app-node-0.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
   Active: active (running) since Wed 2021-03-17 17:23:46 EDT; 15min ago
app-node-2.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
   Active: active (running) since Wed 2021-03-17 17:31:28 EDT; 8min ago

## Restart kuryr-cni pods

[openshift@master-2 ~]$ oc get pods -o wide
NAME                                READY     STATUS             RESTARTS   AGE       IP                NODE                                           NOMINATED NODE
kuryr-cni-ds-5vjv6                  2/2       Running            0          8m        XX.XX.192.59    infra-node-2.openshift-mydev2-mydc.mydomain   <none>
kuryr-cni-ds-642vx                  2/2       Running            0          8m        XX.XX.192.156   master-1.openshift-mydev2-mydc.mydomain       <none>
kuryr-cni-ds-9xmbv                  2/2       Running            0          8m        XX.XX.192.33    infra-node-1.openshift-mydev2-mydc.mydomain   <none>
kuryr-cni-ds-cxmnb                  1/2       CrashLoopBackOff   6          17m       XX.XX.192.197   app-node-0.openshift-mydev2-mydc.mydomain     <none>
kuryr-cni-ds-fcqsv                  2/2       Running            0          8m        XX.XX.192.63    app-node-2.openshift-mydev2-mydc.mydomain     <none>
kuryr-cni-ds-mffqr                  2/2       Running            6          17m       XX.XX.192.198   master-0.openshift-mydev2-mydc.mydomain       <none>
kuryr-cni-ds-qv27z                  1/2       CrashLoopBackOff   6          17m       XX.XX.192.227   infra-node-0.openshift-mydev2-mydc.mydomain   <none>
kuryr-cni-ds-s8rd2                  2/2       Running            0          8m        XX.XX.192.173   app-node-3.openshift-mydev2-mydc.mydomain     <none>
kuryr-cni-ds-tdfwb                  2/2       Running            0          8m        XX.XX.192.104   master-2.openshift-mydev2-mydc.mydomain       <none>
kuryr-cni-ds-w7zqf                  1/2       CrashLoopBackOff   6          17m       XX.XX.192.130   app-node-1.openshift-mydev2-mydc.mydomain     <none>
kuryr-controller-59d5d4fd5b-c85dj   1/1       Running            1          13m       XX.XX.192.227   infra-node-0.openshift-mydev2-mydc.mydomain   <none>

## Logs and pod definitions will be added.

Comment 18 Mohammad 2021-03-17 22:23:10 UTC
Created attachment 1764201 [details]
CNI logs after crio config changes

Comment 19 Mohammad 2021-03-17 22:24:02 UTC
Created attachment 1764202 [details]
Kuryr cni pod yaml

Comment 20 Mohammad 2021-03-17 22:26:22 UTC
Attention CRI-O engineers: are there any logs I can add to support this BZ?

Comment 21 Mohammad 2021-03-17 22:29:06 UTC
These pods were running and now they are stuck in ContainerCreating:

[openshift@master-2 ~]$ oc get pods -n momo
NAME                       READY     STATUS              RESTARTS   AGE
echo-01-5996bf84f8-j4jxs   0/1       ContainerCreating   0          1h
echo-02-774674b9dd-7pcdq   0/1       ContainerCreating   0          1h
echo-03-7b8d5bb654-c5qpp   0/1       ContainerCreating   0          1h
.
.
.

Comment 23 Peter Hunt 2021-03-18 14:01:31 UTC
Can I have the cri-o logs?
`journalctl -u crio`
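
A hedged sketch of pulling that from every node with the same ansible inventory used earlier in this bug (the time window is arbitrary):

```
# Hedged sketch: dump recent cri-o journal per node for attaching to the BZ.
ansible nodes -m shell -b -a 'journalctl -u crio --since "2 hours ago" --no-pager > /tmp/crio-$(hostname -s).log'
```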

Comment 24 Mohammad 2021-03-18 23:07:33 UTC
Created attachment 1764500 [details]
Crio logs of all nodes attached.

Comment 25 Michał Dulko 2021-03-19 10:02:16 UTC
(In reply to Mohammad from comment #16)
> Correcting [1]
> 
> References"
> 1- [openshift@master-2 ~]$ diff -u kuryr-cni-ds-before.yml
> kuryr-cni-ds-after.yml
> 
> --- kuryr-cni-ds-before.yml     2021-03-17 15:33:34.315354396 -0400
> +++ kuryr-cni-ds-after.yml      2021-03-17 16:29:30.837409022 -0400
> @@ -148,6 +148,9 @@
>          terminationMessagePath: /dev/termination-log
>          terminationMessagePolicy: File
>          volumeMounts:
> +        - mountPath: /run/netns
> +          name: netns
> +          mountPropegation: HostToContainer
>          - mountPath: /opt/cni/bin
>            name: bin
>          - mountPath: /etc/cni/net.d
> @@ -193,6 +196,10 @@
>          operator: Exists
>        volumes:
>        - hostPath:
> +          path: /run/netns
> +          type: ""
> +        name: netns
> +      - hostPath:
>            path: /opt/cni/bin
>            type: ""
>          name: bin

You seem to have a typo in mountPropagation. I'm not 100% sure whether that can cause the issues you're seeing; it might also be a bug in the pyroute2 version used in 3.11, as this configuration change was never done with 3.11.

Comment 27 Michał Dulko 2021-03-19 17:23:58 UTC
(In reply to Michał Dulko from comment #25)
> (In reply to Mohammad from comment #16)
> > Correcting [1]
> > 
> > References"
> > 1- [openshift@master-2 ~]$ diff -u kuryr-cni-ds-before.yml
> > kuryr-cni-ds-after.yml
> > 
> > --- kuryr-cni-ds-before.yml     2021-03-17 15:33:34.315354396 -0400
> > +++ kuryr-cni-ds-after.yml      2021-03-17 16:29:30.837409022 -0400
> > @@ -148,6 +148,9 @@
> >          terminationMessagePath: /dev/termination-log
> >          terminationMessagePolicy: File
> >          volumeMounts:
> > +        - mountPath: /run/netns
> > +          name: netns
> > +          mountPropegation: HostToContainer
> >          - mountPath: /opt/cni/bin
> >            name: bin
> >          - mountPath: /etc/cni/net.d
> > @@ -193,6 +196,10 @@
> >          operator: Exists
> >        volumes:
> >        - hostPath:
> > +          path: /run/netns
> > +          type: ""
> > +        name: netns
> > +      - hostPath:
> >            path: /opt/cni/bin
> >            type: ""
> >          name: bin
> 
> You seem to have a typo in mountPropagation. I'm not 100% sure if that can
> cause the issues you're seeing, this might be a bug in pyroute2 version used
> in 3.11 as this configuration change was never done with 3.11.

I observed the same issue on my env and confirmed that setting mountPropagation correctly fixes this problem.
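
For anyone reproducing this, a hedged sketch of verifying the propagation actually took effect inside a running kuryr-cni container (pod name is a placeholder):

```
# Hedged sketch: inspect the /run/netns mount entry inside the container.
# With mountPropagation spelled correctly the entry should carry a
# "master:<id>" (or "shared:<id>") tag; a private mount shows neither.
oc -n kuryr exec <kuryr-cni-pod> -c kuryr-cni -- grep ' /run/netns ' /proc/self/mountinfo
```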

Comment 28 Mohammad 2021-03-21 21:13:56 UTC
Thanks Michal, you are right, there is a typo (my apologies). I'm testing this today and will report back.

Comment 29 Mohammad 2021-03-21 21:13:57 UTC
Thanks Michal, you are right, there is a typo (my apologies). I'm testing this today and will report back.

Comment 30 Mohammad 2021-03-22 06:22:37 UTC
Hi Michal,

You'd be pleased to know that I've fixed the typo and restarted all the cni pods, and they have been running all day with zero restarts:

[openshift@master-2 ~]$ oc get pods
NAME                                READY     STATUS    RESTARTS   AGE
kuryr-cni-ds-5lhlr                  2/2       Running   0          9h
kuryr-cni-ds-8kg5b                  2/2       Running   0          9h
kuryr-cni-ds-ct9f7                  2/2       Running   0          9h
kuryr-cni-ds-d8spf                  2/2       Running   0          8h
kuryr-cni-ds-f7dhv                  2/2       Running   0          9h
kuryr-cni-ds-f9bxg                  2/2       Running   0          9h
kuryr-cni-ds-hhxpz                  2/2       Running   0          9h
kuryr-cni-ds-hrw5v                  2/2       Running   0          9h
kuryr-cni-ds-n6wbt                  2/2       Running   0          8h
kuryr-cni-ds-w2k68                  2/2       Running   0          8h
kuryr-controller-59d5d4fd5b-q4f5z   1/1       Running   3          9h

All my application pods are running. 

So what would be the next step? I'm guessing you won't be able to give us the answer since this is CRI-O related, so hopefully someone from the CRI-O team can answer me.

The exact steps I followed are:

1- Drain node
2- Update /etc/crio/crio.conf with `ansible targetnode -m lineinfile -b -a 'path=/etc/crio/crio.conf line="manage_network_ns_lifecycle = true" insertafter="\[crio.runtime\]"'`
3- Update Daemonset as per [1] and restart cni pods one at a time
4- Create a load and observe


References:
1- Update daemonset with netns mounts as per:

[openshift@master-2 ~]$ diff -u kuryr-cni-ds-before.yaml kuryr-cni-ds-after.yaml
--- kuryr-cni-ds-before.yaml    2021-03-22 02:12:05.894023362 -0400
+++ kuryr-cni-ds-after.yaml     2021-03-22 02:12:46.761584217 -0400
@@ -148,6 +148,9 @@
         terminationMessagePath: /dev/termination-log
         terminationMessagePolicy: File
         volumeMounts:
+        - mountPath: /run/netns
+          mountPropagation: HostToContainer
+          name: netns
         - mountPath: /opt/cni/bin
           name: bin
         - mountPath: /etc/cni/net.d
@@ -193,6 +196,10 @@
         operator: Exists
       volumes:
       - hostPath:
+          path: /run/netns
+          type: ""
+        name: netns
+      - hostPath:
           path: /opt/cni/bin
           type: ""
         name: bin
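
A hedged sketch of the "restart cni pods one at a time" part of step 3 above; the sleep is a crude stand-in for waiting until the replacement pod is 2/2 Running:

```
# Hedged sketch: delete kuryr-cni pods one by one so the DaemonSet
# recreates each with the updated template (netns mount included).
for POD in $(oc -n kuryr get pods -o name | grep kuryr-cni-ds); do
  oc -n kuryr delete "$POD"
  sleep 60
done
```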

Comment 33 Michał Dulko 2021-03-22 15:50:09 UTC
(In reply to Mohammad from comment #30)
> Hi Michal,
> 
> You'd be pleased to know that I've fixed the typo and restarted all the cni
> pods, and they have been running all day with zero restarts:
> 
> [openshift@master-2 ~]$ oc get pods
> NAME                                READY     STATUS    RESTARTS   AGE
> kuryr-cni-ds-5lhlr                  2/2       Running   0          9h
> kuryr-cni-ds-8kg5b                  2/2       Running   0          9h
> kuryr-cni-ds-ct9f7                  2/2       Running   0          9h
> kuryr-cni-ds-d8spf                  2/2       Running   0          8h
> kuryr-cni-ds-f7dhv                  2/2       Running   0          9h
> kuryr-cni-ds-f9bxg                  2/2       Running   0          9h
> kuryr-cni-ds-hhxpz                  2/2       Running   0          9h
> kuryr-cni-ds-hrw5v                  2/2       Running   0          9h
> kuryr-cni-ds-n6wbt                  2/2       Running   0          8h
> kuryr-cni-ds-w2k68                  2/2       Running   0          8h
> kuryr-controller-59d5d4fd5b-q4f5z   1/1       Running   3          9h
> 
> All my application pods are running. 
> 
> So what would be the next step? I'm guessing you won't be able to give us
> the answer since this is CRI-O related, so I'm guessing someone from the
> CRI-O team would hopefully answer me.

I've only done very basic testing of this, but looking through [1] and [2], which happened around the 4.3 timeframe, it seemed to just work once Kuryr correctly mounted /var/run/netns, so hopefully that's enough to make this more stable. I'd like confirmation from the cri-o team, as they should know best which cri-o versions were used in the different OCPs.

[1] https://github.com/openshift/machine-config-operator/pull/1600
[2] https://github.com/openshift/machine-config-operator/pull/1689

> The steps I followed precisely are:
> 
> Steps:
> 
> 1- Drain node
> 2- Update /etc/crio/crio.conf with `ansible targetnode -m lineinfile -b -a
> 'path=/etc/crio/crio.conf line="manage_network_ns_lifecycle = true"
> insertafter="\[crio.runtime\]"'`
> 3- Update Daemonset as per [1] and restart cni pods one at a time
> 4- Create a load and observe
> 
> 
> References:
> 1- Update daemonset with netns mounts as per:
> 
> [openshift@master-2 ~]$ diff -u kuryr-cni-ds-before.yaml
> kuryr-cni-ds-after.yaml
> --- kuryr-cni-ds-before.yaml    2021-03-22 02:12:05.894023362 -0400
> +++ kuryr-cni-ds-after.yaml     2021-03-22 02:12:46.761584217 -0400
> @@ -148,6 +148,9 @@
>          terminationMessagePath: /dev/termination-log
>          terminationMessagePolicy: File
>          volumeMounts:
> +        - mountPath: /run/netns
> +          mountPropagation: HostToContainer
> +          name: netns
>          - mountPath: /opt/cni/bin
>            name: bin
>          - mountPath: /etc/cni/net.d
> @@ -193,6 +196,10 @@
>          operator: Exists
>        volumes:
>        - hostPath:
> +          path: /run/netns
> +          type: ""
> +        name: netns
> +      - hostPath:
>            path: /opt/cni/bin
>            type: ""
>          name: bin

Comment 35 Michał Dulko 2021-03-22 16:00:33 UTC
(In reply to Peter Hunt from comment #34)
> https://github.com/cri-o/cri-o/blob/5218c7371c2911663c1cc687c7baaa0df5921e0c/
> vendor/github.com/containernetworking/plugins/pkg/ns/ns_linux.go#L43
> 
> yeah we put the new namespaces in /var/run/netns in 3.11

Sure, but isn't 3.11 cri-o missing any fixes that 4.4/4.3 cri-o has regarding netns management? The sudden revert of the PR enabling manage_ns_lifecycle [1] made me doubt that, but now I think it might have happened only because kuryr-kubernetes and ovn-kubernetes weren't ready to do the switch.

[1] https://github.com/openshift/machine-config-operator/pull/1600/files

Comment 36 Peter Hunt 2021-03-22 17:17:40 UTC
For 1.17 or 4.4 (when we introduced the new config value manage_ns_lifecycle instead of the legacy manage_network_ns_lifecycle), we refactored the implementation to extend to other namespaces. The mentioned revert and fixes were needed because of bugs in the new version. The legacy implementation should have everything that's needed for network namespace management, AFAICT.

Comment 37 Mohammad 2021-03-23 01:07:49 UTC
Current Environment:
====================

1- Customer is currently running cri-o-1.11.16-0.13.rhaos3.11.git5218c73.el7.x86_64 on all nodes (AFAIK).
2- Customer is using RHEL versions between 7.7 and 7.9.

Assuming this fix works (so far it looks like it does), this is my understanding of what needs to be done (please correct me if I'm wrong, for both the long-term and short-term fixes):

Short term fix (for existing environments):
===========================================

1- Add the cri-o change as per [1]
2- Update the daemonset to include changes in [2]
3- Per node, a. Drain node, b. Restart cri-o on that node, c. Restart kuryr-cni pod on node, d. Uncordon node
4- Repeat for each node until finished

Long term fix (for new environments):
=====================================

1- Update openshift-ansible playbooks to automatically add the extra crio.conf as per [1]
2- Update openshift-ansible playbooks to automatically add the extra daemonset netns parameters as per [2]

Questions:
==========

1- My assumption is that crio upgrades will not affect existing environments once the fix is applied, is that correct?
2- What are the downsides/risks (long and short term) of doing these changes on cri-o in OCP3.11?
3- Would we need to avoid the cri-o change [1] if we upgrade to 1.17 but still need to do [2]?


References:
===========

1- Update /etc/crio/crio.conf with `ansible targetnode -m lineinfile -b -a 'path=/etc/crio/crio.conf line="manage_network_ns_lifecycle = true" insertafter="\[crio.runtime\]"'`

2- Update daemonset with netns mounts as per:

[openshift@master-2 ~]$ diff -u kuryr-cni-ds-before.yaml kuryr-cni-ds-after.yaml
--- kuryr-cni-ds-before.yaml    2021-03-22 02:12:05.894023362 -0400
+++ kuryr-cni-ds-after.yaml     2021-03-22 02:12:46.761584217 -0400
@@ -148,6 +148,9 @@
         terminationMessagePath: /dev/termination-log
         terminationMessagePolicy: File
         volumeMounts:
+        - mountPath: /run/netns
+          mountPropagation: HostToContainer
+          name: netns
         - mountPath: /opt/cni/bin
           name: bin
         - mountPath: /etc/cni/net.d
@@ -193,6 +196,10 @@
         operator: Exists
       volumes:
       - hostPath:
+          path: /run/netns
+          type: ""
+        name: netns
+      - hostPath:
           path: /opt/cni/bin
           type: ""
           name: bin
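
A hedged sketch tying the short-term steps into one per-node pass; it assumes the ansible inventory hostnames match the OpenShift node names (as they appear to in this environment) and that pauses/readiness checks are handled more carefully in practice:

```
# Hedged sketch: roll the short-term fix out node by node.
for NODE in $(oc get nodes | awk 'NR>1 {print $1}'); do
  oc adm drain "$NODE" --ignore-daemonsets --delete-local-data --force
  ansible "$NODE" -m lineinfile -b -a 'path=/etc/crio/crio.conf line="manage_network_ns_lifecycle = true" insertafter="\[crio.runtime\]"'
  ansible "$NODE" -m systemd -b -a 'name=crio state=restarted'
  # recreate the kuryr-cni pod on this node so it picks up the netns mount
  oc -n kuryr get pods -o wide | awk -v n="$NODE" '$0 ~ n && /kuryr-cni-ds/ {print $1}' | xargs -r oc -n kuryr delete pod
  oc adm uncordon "$NODE"
done
```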

Comment 40 Mohammad 2021-03-23 04:16:37 UTC
I decided to do further tests. I have 4 app nodes with around 50 application pods running between them.

I ran this node drain, then uncordon, loop about 4 times:

for NODE in $(oc get nodes |grep app-node- |awk '{print $1}'); do oc adm drain $NODE --ignore-daemonsets --delete-local-data --force; sleep 60; oc adm uncordon $NODE; echo "Drained and uncordoned $NODE"; done

This is what the pods looked like:

[openshift@master-2 mologs]$ oc get pods -o wide
NAME                                READY     STATUS    RESTARTS   AGE       IP                NODE                                           NOMINATED NODE
kuryr-cni-ds-5lhlr                  2/2       Running   0          1d        XX.XX.192.104   master-2.openshift.mydomain       <none>
kuryr-cni-ds-8kg5b                  2/2       Running   0          1d        XX.XX.192.59    infra-node-2.openshift.mydomain   <none>
kuryr-cni-ds-ct9f7                  2/2       Running   0          1d        XX.XX.192.227   infra-node-0.openshift.mydomain   <none>
kuryr-cni-ds-d8spf                  2/2       Running   6          1d        XX.XX.192.130   app-node-1.openshift.mydomain     <none>
kuryr-cni-ds-f7dhv                  2/2       Running   0          1d        XX.XX.192.156   master-1.openshift.mydomain       <none>
kuryr-cni-ds-f9bxg                  2/2       Running   0          1d        XX.XX.192.198   master-0.openshift.mydomain       <none>
kuryr-cni-ds-hhxpz                  2/2       Running   6          1d        XX.XX.192.173   app-node-3.openshift.mydomain     <none>
kuryr-cni-ds-hrw5v                  2/2       Running   0          1d        XX.XX.192.33    infra-node-1.openshift.mydomain   <none>
kuryr-cni-ds-n6wbt                  2/2       Running   7          1d        XX.XX.192.63    app-node-2.openshift.mydomain     <none>
kuryr-cni-ds-sl87c                  2/2       Running   3          35m       XX.XX.192.197   app-node-0.openshift.mydomain     <none>
kuryr-controller-59d5d4fd5b-frptd   1/1       Running   2          20h       XX.XX.192.59    infra-node-2.openshift.mydomain   <none>

[openshift@master-2 mologs]$ oc logs -p -c kuryr-cni kuryr-cni-ds-d8spf > kuryr-cni-ds-d8spf_app-node-1.log
[openshift@master-2 mologs]$ oc logs -p -c kuryr-cni kuryr-cni-ds-hhxpz > kuryr-cni-ds-hhxpz_app-node-3.log
[openshift@master-2 mologs]$ oc logs -p -c kuryr-cni kuryr-cni-ds-n6wbt > kuryr-cni-ds-n6wbt_app-node-2.log
[openshift@master-2 mologs]$ oc logs -p -c kuryr-cni kuryr-cni-ds-sl87c > kuryr-cni-ds-sl87c_app-node-0.log

I'm seeing a few of these errors:

kuryr-cni-ds-hhxpz_app-node-3.log:2021-03-23 03:57:24.564 551 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork request. CNI Params: {'CNI_IFNAME': u'eth0', 'CNI_NETNS': u'/var/run/netns/k8s_echo-31-79bc4b496-fx85n_momo_9a12c78f-8b8b-11eb-b88d-fa163e6dae63_0-f2c33ef4', 'CNI_PATH': u'/opt/cni/bin/:/opt/kuryr-cni/bin', 'CNI_ARGS': u'IgnoreUnknown=1;K8S_POD_NAMESPACE=momo;K8S_POD_NAME=echo-31-79bc4b496-fx85n;K8S_POD_INFRA_CONTAINER_ID=07e1232e9bf1ebc898c63bc97f939dfe79e21f35d352907a203e98929a3270f6', 'CNI_COMMAND': u'ADD', 'CNI_CONTAINERID': u'07e1232e9bf1ebc898c63bc97f939dfe79e21f35d352907a203e98929a3270f6'}: IOError: [Errno 32] Broken pipe
kuryr-cni-ds-n6wbt_app-node-2.log:2021-03-23 03:45:36.690 240 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork request. CNI Params: {'CNI_IFNAME': u'eth0', 'CNI_NETNS': u'/var/run/netns/k8s_echo-31-79bc4b496-k5hrw_momo_099d7cdd-8b89-11eb-b88d-fa163e6dae63_0-31048169', 'CNI_PATH': u'/opt/cni/bin/:/opt/kuryr-cni/bin', 'CNI_ARGS': u'IgnoreUnknown=1;K8S_POD_NAMESPACE=momo;K8S_POD_NAME=echo-31-79bc4b496-k5hrw;K8S_POD_INFRA_CONTAINER_ID=f50d75ee0194ff8f4096591691f4b59949cea11449e8af9f4249497c3b6958c4', 'CNI_COMMAND': u'ADD', 'CNI_CONTAINERID': u'f50d75ee0194ff8f4096591691f4b59949cea11449e8af9f4249497c3b6958c4'}: RuntimeError
kuryr-cni-ds-n6wbt_app-node-2.log:2021-03-23 03:55:52.915 1213 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork request. CNI Params: {'CNI_IFNAME': u'eth0', 'CNI_NETNS': u'/var/run/netns/k8s_echo-17-5475cdc776-nsxk8_momo_38298655-8b8b-11eb-b88d-fa163e6dae63_0-8e0b0aeb', 'CNI_PATH': u'/opt/cni/bin/:/opt/kuryr-cni/bin', 'CNI_ARGS': u'IgnoreUnknown=1;K8S_POD_NAMESPACE=momo;K8S_POD_NAME=echo-17-5475cdc776-nsxk8;K8S_POD_INFRA_CONTAINER_ID=e2e2eb222c1b0697dcecd7be98885eebf542423eeb2109a6efa12b8dce78de7d', 'CNI_COMMAND': u'ADD', 'CNI_CONTAINERID': u'e2e2eb222c1b0697dcecd7be98885eebf542423eeb2109a6efa12b8dce78de7d'}: IOError: [Errno 32] Broken pipe
kuryr-cni-ds-sl87c_app-node-0.log:2021-03-23 03:56:05.350 345 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork request. CNI Params: {'CNI_IFNAME': u'eth0', 'CNI_NETNS': u'/var/run/netns/k8s_echo-21-57f7ccfd54-hx5dp_momo_99470682-8b8b-11eb-b88d-fa163e6dae63_0-70c47700', 'CNI_PATH': u'/opt/cni/bin/:/opt/kuryr-cni/bin', 'CNI_ARGS': u'IgnoreUnknown=1;K8S_POD_NAMESPACE=momo;K8S_POD_NAME=echo-21-57f7ccfd54-hx5dp;K8S_POD_INFRA_CONTAINER_ID=39078f44a6308d1aa46c558b17399c0b2fcbeb21ab11aecd776b58f245e83f04', 'CNI_COMMAND': u'ADD', 'CNI_CONTAINERID': u'39078f44a6308d1aa46c558b17399c0b2fcbeb21ab11aecd776b58f245e83f04'}: RuntimeError


I'm guessing these are harmless as all the application pods are running.
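
A hedged, throwaway check of how frequent these are across the log files collected above:

```
# Hedged sketch: count addNetwork failures per captured kuryr-cni log.
grep -c 'Error when processing addNetwork' kuryr-cni-ds-*_app-node-*.log
```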

Comment 42 Peter Hunt 2021-03-23 15:18:06 UTC
answers to the questions:

1- My assumption is that crio upgrades will not affect existing environments once the fix is applied, is that correct?
sorry I don't understand what this means
2- What are the downsides/risks (long and short term) of doing these changes on cri-o in OCP3.11?
It's hard to tell. Many consumers use that package, and after we reimplemented this feature, we set it to default and have now dropped the config field entirely.
Unfortunately, we don't have a lot of testing against this version. I would naively say it's very likely to work without a problem.
3- Would we need to avoid the cri-o change [1] if we upgrade to 1.17 but still need to do [2]?
The MCO already has the cri-o changes in 4.4, so they're not applicable. I am not sure exactly what needs to be done for kuryr, but I would guess that is already taken care of as well.

Comment 43 Michał Dulko 2021-03-23 16:57:07 UTC
(In reply to Peter Hunt from comment #42)
> answers to the questions:
> 
> 1- My assumption is that crio upgrades will not affect existing environments
> once the fix is applied, is that correct?
> sorry I don't understand what this means

I think the question is whether a 3.11 minor upgrade will revert the config change.

> 2- What are the downsides/risks (long and short term) of doing these changes
> on cri-o in OCP3.11?
> It's hard to tell. Many consumers use that package, and after we
> reimplemented this feature, we've set it to default and now have dropped the
> config field entirely.
> Unfortunately, we don't have a lot of testing against this version. I would
> naively say it's very likely to work without a problem.
> 3- Would we need to avoid the cri-o change [1] if we upgrade to 1.17 but
> still need to do [2]?
> the MCO has the cri-o changes already in 4.4, so they're not applicable. I
> am not sure exactly what needs to be done for kuryr, but i would guess that
> is already taken care of as well

Yeah, in 4.x Kuryr works just fine with this change, but obviously the 3.11 code has already diverged from 4.x.

Comment 44 Michał Dulko 2021-03-23 17:14:06 UTC
(In reply to Mohammad from comment #40)
> I decided to do further tests, I have 4 app nodes that have around 50
> application pods running between them.
> 
> I ran this node drain, the uncordon about 4 times:
> 
> for NODE in $(oc get nodes |grep app-node- |awk '{print $1}'); do oc adm
> drain $NODE --ignore-daemonsets --delete-local-data --force; sleep 60; oc
> adm uncordon $NODE; echo "Drained and uncordoned $NODE"; done
> 
> This is what the pods looked like:
> 
> [openshift@master-2 mologs]$ oc get pods -o wide
> NAME                                READY     STATUS    RESTARTS   AGE      
> IP                NODE                                           NOMINATED
> NODE
> kuryr-cni-ds-5lhlr                  2/2       Running   0          1d       
> XX.XX.192.104   master-2.openshift.mydomain       <none>
> kuryr-cni-ds-8kg5b                  2/2       Running   0          1d       
> XX.XX.192.59    infra-node-2.openshift.mydomain   <none>
> kuryr-cni-ds-ct9f7                  2/2       Running   0          1d       
> XX.XX.192.227   infra-node-0.openshift.mydomain   <none>
> kuryr-cni-ds-d8spf                  2/2       Running   6          1d       
> XX.XX.192.130   app-node-1.openshift.mydomain     <none>
> kuryr-cni-ds-f7dhv                  2/2       Running   0          1d       
> XX.XX.192.156   master-1.openshift.mydomain       <none>
> kuryr-cni-ds-f9bxg                  2/2       Running   0          1d       
> XX.XX.192.198   master-0.openshift.mydomain       <none>
> kuryr-cni-ds-hhxpz                  2/2       Running   6          1d       
> XX.XX.192.173   app-node-3.openshift.mydomain     <none>
> kuryr-cni-ds-hrw5v                  2/2       Running   0          1d       
> XX.XX.192.33    infra-node-1.openshift.mydomain   <none>
> kuryr-cni-ds-n6wbt                  2/2       Running   7          1d       
> XX.XX.192.63    app-node-2.openshift.mydomain     <none>
> kuryr-cni-ds-sl87c                  2/2       Running   3          35m      
> XX.XX.192.197   app-node-0.openshift.mydomain     <none>
> kuryr-controller-59d5d4fd5b-frptd   1/1       Running   2          20h      
> XX.XX.192.59    infra-node-2.openshift.mydomain   <none>
> 
> [openshift@master-2 mologs]$ oc logs -p -c kuryr-cni kuryr-cni-ds-d8spf >
> kuryr-cni-ds-d8spf_app-node-1.log
> [openshift@master-2 mologs]$ oc logs -p -c kuryr-cni kuryr-cni-ds-hhxpz >
> kuryr-cni-ds-hhxpz_app-node-3.log
> [openshift@master-2 mologs]$ oc logs -p -c kuryr-cni kuryr-cni-ds-n6wbt >
> kuryr-cni-ds-n6wbt_app-node-2.log
> [openshift@master-2 mologs]$ oc logs -p -c kuryr-cni kuryr-cni-ds-sl87c >
> kuryr-cni-ds-sl87c_app-node-0.log
> 
> I'm seeing a few of these errors:
> 
> kuryr-cni-ds-hhxpz_app-node-3.log:2021-03-23 03:57:24.564 551 ERROR
> kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork
> request. CNI Params: {'CNI_IFNAME': u'eth0', 'CNI_NETNS':
> u'/var/run/netns/k8s_echo-31-79bc4b496-fx85n_momo_9a12c78f-8b8b-11eb-b88d-
> fa163e6dae63_0-f2c33ef4', 'CNI_PATH': u'/opt/cni/bin/:/opt/kuryr-cni/bin',
> 'CNI_ARGS':
> u'IgnoreUnknown=1;K8S_POD_NAMESPACE=momo;K8S_POD_NAME=echo-31-79bc4b496-
> fx85n;
> K8S_POD_INFRA_CONTAINER_ID=07e1232e9bf1ebc898c63bc97f939dfe79e21f35d352907a20
> 3e98929a3270f6', 'CNI_COMMAND': u'ADD', 'CNI_CONTAINERID':
> u'07e1232e9bf1ebc898c63bc97f939dfe79e21f35d352907a203e98929a3270f6'}:
> IOError: [Errno 32] Broken pipe
> kuryr-cni-ds-n6wbt_app-node-2.log:2021-03-23 03:45:36.690 240 ERROR
> kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork
> request. CNI Params: {'CNI_IFNAME': u'eth0', 'CNI_NETNS':
> u'/var/run/netns/k8s_echo-31-79bc4b496-k5hrw_momo_099d7cdd-8b89-11eb-b88d-
> fa163e6dae63_0-31048169', 'CNI_PATH': u'/opt/cni/bin/:/opt/kuryr-cni/bin',
> 'CNI_ARGS':
> u'IgnoreUnknown=1;K8S_POD_NAMESPACE=momo;K8S_POD_NAME=echo-31-79bc4b496-
> k5hrw;
> K8S_POD_INFRA_CONTAINER_ID=f50d75ee0194ff8f4096591691f4b59949cea11449e8af9f42
> 49497c3b6958c4', 'CNI_COMMAND': u'ADD', 'CNI_CONTAINERID':
> u'f50d75ee0194ff8f4096591691f4b59949cea11449e8af9f4249497c3b6958c4'}:
> RuntimeError
> kuryr-cni-ds-n6wbt_app-node-2.log:2021-03-23 03:55:52.915 1213 ERROR
> kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork
> request. CNI Params: {'CNI_IFNAME': u'eth0', 'CNI_NETNS':
> u'/var/run/netns/k8s_echo-17-5475cdc776-nsxk8_momo_38298655-8b8b-11eb-b88d-
> fa163e6dae63_0-8e0b0aeb', 'CNI_PATH': u'/opt/cni/bin/:/opt/kuryr-cni/bin',
> 'CNI_ARGS':
> u'IgnoreUnknown=1;K8S_POD_NAMESPACE=momo;K8S_POD_NAME=echo-17-5475cdc776-
> nsxk8;
> K8S_POD_INFRA_CONTAINER_ID=e2e2eb222c1b0697dcecd7be98885eebf542423eeb2109a6ef
> a12b8dce78de7d', 'CNI_COMMAND': u'ADD', 'CNI_CONTAINERID':
> u'e2e2eb222c1b0697dcecd7be98885eebf542423eeb2109a6efa12b8dce78de7d'}:
> IOError: [Errno 32] Broken pipe
> kuryr-cni-ds-sl87c_app-node-0.log:2021-03-23 03:56:05.350 345 ERROR
> kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork
> request. CNI Params: {'CNI_IFNAME': u'eth0', 'CNI_NETNS':
> u'/var/run/netns/k8s_echo-21-57f7ccfd54-hx5dp_momo_99470682-8b8b-11eb-b88d-
> fa163e6dae63_0-70c47700', 'CNI_PATH': u'/opt/cni/bin/:/opt/kuryr-cni/bin',
> 'CNI_ARGS':
> u'IgnoreUnknown=1;K8S_POD_NAMESPACE=momo;K8S_POD_NAME=echo-21-57f7ccfd54-
> hx5dp;
> K8S_POD_INFRA_CONTAINER_ID=39078f44a6308d1aa46c558b17399c0b2fcbeb21ab11aecd77
> 6b58f245e83f04', 'CNI_COMMAND': u'ADD', 'CNI_CONTAINERID':
> u'39078f44a6308d1aa46c558b17399c0b2fcbeb21ab11aecd776b58f245e83f04'}:
> RuntimeError
> 
> 
> I'm guessing these are harmless as all the application pods are running.

Honestly I'm not sure where those come from. We've seen this happening in the past, but a simple pyroute2 upgrade helped to clear it up. Unfortunately pyroute2 doesn't seem to provide good debugging info in this case. It does, however, seem to fix itself after a retry made by cri-o.

Comment 45 Mohammad 2021-03-23 20:20:31 UTC
(In reply to Peter Hunt from comment #42)
> answers to the questions:
> 
> 1- My assumption is that crio upgrades will not affect existing environments
> once the fix is applied, is that correct?
> sorry I don't understand what this means
> 2- What are the downsides/risks (long and short term) of doing these changes
> on cri-o in OCP3.11?
> It's hard to tell. Many consumers use that package, and after we
> reimplemented this feature, we've set it to default and now have dropped the
> config field entirely.
> Unfortunately, we don't have a lot of testing against this version. I would
> naively say it's very likely to work without a problem.
> 3- Would we need to avoid the cri-o change [1] if we upgrade to 1.17 but
> still need to do [2]?
> the MCO has the cri-o changes already in 4.4, so they're not applicable. I
> am not sure exactly what needs to be done for kuryr, but i would guess that
> is already taken care of as well

It looks like this works well and the errors we see in https://bugzilla.redhat.com/show_bug.cgi?id=1939605#c44 are not a show stopper.

To ask differently, what are our next steps? Do we apply the short-term fix as per https://bugzilla.redhat.com/show_bug.cgi?id=1939605#c37 and wait for a long-term fix?

Then how do we maintain the short-term fix through crio upgrades (OS updates) and openshift-ansible upgrades?

I need to consider this carefully because we have 10 production clusters we need to do this on.
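
One hedged way to keep the crio.conf half of the short-term fix from silently disappearing after OS or openshift-ansible updates is to periodically re-assert it with the same idempotent lineinfile call (a sketch, not a supported mechanism):

```
# Hedged sketch: re-assert the setting; hosts reported as CHANGED had lost
# the line and need a drain + crio restart as in comment 37.
ansible nodes -m lineinfile -b -a 'path=/etc/crio/crio.conf line="manage_network_ns_lifecycle = true" insertafter="\[crio.runtime\]"'
```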

Comment 66 Itzik Brown 2021-06-02 10:44:04 UTC
Checked with: 
OCP v3.11.448
OSP 13-2021-03-24.1
cri-o 1.11.17-dev

$ oc get pods -n kuryr 
NAME                                READY     STATUS    RESTARTS   AGE
kuryr-cni-ds-bsjsh                  2/2       Running   0          2d
kuryr-cni-ds-c9n2x                  2/2       Running   0          2d
kuryr-cni-ds-d4bz9                  2/2       Running   0          2d
kuryr-cni-ds-ddqsb                  2/2       Running   0          2d
kuryr-cni-ds-dr2rc                  2/2       Running   0          2d
kuryr-cni-ds-rb2xq                  2/2       Running   0          2d
kuryr-cni-ds-rpfsf                  2/2       Running   0          2d
kuryr-cni-ds-vbg6l                  2/2       Running   0          2d
kuryr-controller-567485999f-29wbt   1/1       Running   0          2d

I haven't found the errors mentioned in the bug description.
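
For reference, a hedged sketch of the kind of check used here, grepping a kuryr-cni container log for the failure from the description (pod name is a placeholder):

```
# Hedged sketch: look for the original "missing netns" symptom.
oc -n kuryr logs -c kuryr-cni <kuryr-cni-pod> | grep "No such file or directory: '/host_proc"
```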

Comment 68 errata-xmlrpc 2021-06-09 17:06:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 3.11.452 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2150

