Description of problem:

The client sees problems [1] like the following when, in 3.11, cri-o calls the kuryr-kubernetes CNI plugin:

2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service Traceback (most recent call last):
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/daemon/service.py", line 81, in add
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     vif = self.plugin.add(params)
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 74, in add
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     vifs = self._do_work(params, b_base.connect, timeout)
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 179, in _do_work
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     container_id=params.CNI_CONTAINERID)
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/binding/base.py", line 119, in connect
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     driver.connect(vif, ifname, netns, container_id)
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/binding/nested.py", line 61, in connect
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     with b_base.get_ipdb(netns) as c_ipdb:
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/binding/base.py", line 70, in get_ipdb
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     ipdb = pyroute2.IPDB(nl=pyroute2.NetNS(netns))
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/pyroute2/netns/nslink.py", line 167, in __init__
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     super(NetNS, self).__init__()
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/pyroute2/iproute/linux.py", line 127, in __init__
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     super(RTNL_API, self).__init__(*argv, **kwarg)
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/pyroute2/netlink/nlsocket.py", line 301, in __init__
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     super(NetlinkMixin, self).__init__()
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/pyroute2/remote/__init__.py", line 176, in __init__
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     raise init['error']
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service OSError: [Errno 2] No such file or directory: '/host_proc/104348/ns/net'

Clearly, while the CNI request is being served by kuryr-kubernetes, the requested netns is missing. This is a violation of the CNI spec [2]:

> The container runtime must create a new network namespace for the container before invoking any plugins.

This was worked on in 4.x ([3], [4], [5], [6]) but apparently never addressed in 3.11.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1937991
[2] https://github.com/containernetworking/cni/blob/spec-v0.4.0/SPEC.md#general-considerations
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1820118
[4] https://bugzilla.redhat.com/show_bug.cgi?id=1838116
[5] https://github.com/openshift/machine-config-operator/pull/1600
[6] https://github.com/openshift/machine-config-operator/pull/1689

Version-Release number of selected component (if applicable):

How reproducible:
Occasionally?
Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
I've opened a PR that should fix this. Note: it relies on a feature that has not been tested in its current iteration; the 4.x version is a rewrite. We'll need some serious testing of this change to make sure we don't regress. Another (less risky) option is that customers who want to use Kuryr can enable this change themselves, but we don't merge it in-tree.
Hi,

Our customer has some major production workloads on OCP 3.11/Kuryr/OSP 13 using CRI-O and is impacted. Further advice on how to enable this change would be appreciated.

Regards
Updating their /etc/crio/crio.conf to have

```
[crio.runtime]
manage_network_ns_lifecycle = true
```

draining the node, and then restarting cri-o would enable this change. I would recommend being wary about making this change immediately: maybe do it on one node first and allow the change to soak a bit.
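For reference, the config edit above can be made idempotent so that repeated runs don't duplicate the line. This is only a sketch operating on a local sample copy (the file path, section name, and option are taken from this comment); review before touching the real /etc/crio/crio.conf on a node:

```shell
# Sketch: idempotently enable the option under [crio.runtime].
# Works on a local sample file, not the real /etc/crio/crio.conf.
CONF=./crio.conf.sample
printf '[crio.runtime]\nlog_level = "error"\n' > "$CONF"

# Only insert the key if it is not already present.
grep -q '^manage_network_ns_lifecycle' "$CONF" || \
  sed -i '/^\[crio\.runtime\]/a manage_network_ns_lifecycle = true' "$CONF"

grep '^manage_network_ns_lifecycle' "$CONF"
# prints: manage_network_ns_lifecycle = true
```

After applying the real edit on a node, the node still needs to be drained and cri-o restarted, as described above.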
After adding this change, I can't seem to create any new pods on the nodes that had this change and got restarted. I see a lot of these messages in the kuryr-cni:

-13c495761791),plugin='noop',port_profile=<?>,preserve_on_delete=False,vif_name='tapa940beed-dd',vlan_id=282)
2021-03-17 02:15:07.139 236 WARNING kuryr_kubernetes.cni.binding.base [-] Found hanging interface eth0 inside /var/run/netns/k8s_echo-24-76dd6d6984-s5j57_momo_3ac6ed6f-86c4-11eb-b88d-fa163e6dae63_0-844570f6 netns. Most likely it is a leftover from a kuryr-daemon restart. Trying to delete it.
2021-03-17 02:15:07.141 214 INFO werkzeug [-] XXX.ZZZ.0.1 - - [17/Mar/2021 02:15:07] "POST /delNetwork HTTP/1.1" 204 -
2021-03-17 02:15:07.143 245 WARNING kuryr_kubernetes.cni.binding.base [-] Found hanging interface eth0 inside /var/run/netns/k8s_echo-16-6f77fc5574-t6s55_momo_3a9723eb-86c4-11eb-b88d-fa163e6dae63_0-63409683 netns. Most likely it is a leftover from a kuryr-daemon restart. Trying to delete it.
2021-03-17 02:15:07.152 216 INFO os_vif [-] Successfully unplugged vif VIFVlanNested(active=True,address=fa:16:3e:18:76:1e,has_traffic_filtering=False,id=df5bf65a-6f5a-4ebc-a3d9-59a9736f4e68,network=Network(807728c3-fedc-491c-87ae-02b3fc701f97),plugin='noop',port_profile=<?>,preserve_on_delete=False,vif_name='tapdf5bf65a-6f',vlan_id=1340)
2021-03-17 02:15:07.160 216 INFO werkzeug [-] XXX.ZZZ.0.1 - - [17/Mar/2021 02:15:07] "POST /delNetwork HTTP/1.1" 204 -
2021-03-17 02:15:07.167 121 INFO werkzeug [-] XXX.ZZZ.0.1 - - [17/Mar/2021 02:15:07] "POST /addNetwork HTTP/1.1" 202 -
2021-03-17 02:15:07.178 254 WARNING kuryr_kubernetes.cni.binding.base [-] Found hanging interface eth0 inside /var/run/netns/k8s_echo-03-7b8d5bb654-zcqbv_momo_3a969a38-86c4-11eb-b88d-fa163e6dae63_0-2d3cad36 netns. Most likely it is a leftover from a kuryr-daemon restart. Trying to delete it.
2021-03-17 02:15:07.185 236 INFO os_vif [-] Successfully unplugged vif VIFVlanNested(active=True,address=fa:16:3e:c3:3e:3b,has_traffic_filtering=False,id=136e4aef-a77d-44fa-ac1f-e535e820a5c7,network=Network(783feff0-f5eb-4ade-9abb-13c495761791),plugin='noop',port_profile=<?>,preserve_on_delete=False,vif_name='tap136e4aef-a7',vlan_id=2479)
2021-03-17 02:15:07.185 256 WARNING kuryr_kubernetes.cni.binding.base [-] Found hanging interface eth0 inside /var/run/netns/k8s_echo-07-c56ffd74b-2mj89_momo_3a8f3c18-86c4-11eb-b88d-fa163e6dae63_0-e045c4d7 netns. Most likely it is a leftover from a kuryr-daemon restart. Trying to delete it.
2021-03-17 02:15:07.188 236 INFO werkzeug [-] XXX.ZZZ.0.1 - - [17/Mar/2021 02:15:07] "POST /delNetwork HTTP/1.1" 204 -
2021-03-17 02:15:07.196 259 WARNING kuryr_kubernetes.cni.binding.base [-] Found hanging interface eth0 inside /var/run/netns/k8s_echo-36-5c4d97cb8b-wk7qk_momo_3a97e58b-86c4-11eb-b88d-fa163e6dae63_0-b56a4ea0 netns. Most likely it is a leftover from a kuryr-daemon restart. Trying to delete it.
And:

2021-03-17 02:19:17.838 7160 INFO werkzeug [-] XXX.ZZZ.0.1 - - [17/Mar/2021 02:19:17] "POST /addNetwork HTTP/1.1" 500 -
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing delNetwork request. CNI Params: {'CNI_IFNAME': u'eth0', 'CNI_NETNS': u'/var/run/netns/k8s_echo-16-6f77fc5574-t6s55_momo_3a9723eb-86c4-11eb-b88d-fa163e6dae63_0-106a5be6', 'CNI_PATH': u'/opt/cni/bin/:/opt/kuryr-cni/bin', 'CNI_ARGS': u'IgnoreUnknown=1;K8S_POD_NAMESPACE=momo;K8S_POD_NAME=echo-16-6f77fc5574-t6s55;K8S_POD_INFRA_CONTAINER_ID=7d66b29680878a45ff22b19244279a6434f18dcdf8c076a6f7dff24ac5fd9e72', 'CNI_COMMAND': u'DEL', 'CNI_CONTAINERID': u'7d66b29680878a45ff22b19244279a6434f18dcdf8c076a6f7dff24ac5fd9e72'}.: error: [Errno 2] No such file or directory
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service Traceback (most recent call last):
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/daemon/service.py", line 103, in delete
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service     self.plugin.delete(params)
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 109, in delete
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service     reg_ci = self.registry[pod_name]['containerid']
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service   File "<string>", line 2, in __getitem__
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib64/python2.7/multiprocessing/managers.py", line 755, in _callmethod
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service     self._connect()
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib64/python2.7/multiprocessing/managers.py", line 742, in _connect
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service     conn = self._Client(self._token.address, authkey=self._authkey)
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib64/python2.7/multiprocessing/connection.py", line 173, in Client
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service     c = SocketClient(address)
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib64/python2.7/multiprocessing/connection.py", line 308, in SocketClient
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service     s.connect(address)
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib64/python2.7/socket.py", line 224, in meth
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service     return getattr(self._sock,name)(*args)
2021-03-17 02:19:17.848 7161 ERROR kuryr_kubernetes.cni.daemon.service error: [Errno 2] No such file or directory
As soon as I removed this setting:

ansible nodes -m lineinfile -b -a 'path=/etc/crio/crio.conf state="absent" regexp="^manage_network_ns_lifecycle"'

and restarted cri-o, pods were able to get created again.
[root@master-2 ~]# rpm -qa | grep cri-o
cri-o-1.11.16-0.13.rhaos3.11.git5218c73.el7.x86_64
Mohammad, can you post the full log of that kuryr-cni? The first log you show seems okay (kubelet is trying to clean up the node as containers are gone) and the second one seems unrelated to the change Peter proposed (kuryr-cni complains about Python's multiprocessing utilities; we never had problems with those).
(In reply to Michał Dulko from comment #9)
> Mohammad, can you post full log of that kuryr-cni? First log you show seems
> okay (kubelet is trying to clean up the node as containers are gone) and the
> second one seems unrelated to the change Peter proposed (kuryr-cni complains
> about Python's multiprocessing utilities, we never had problems with these).

Had you recreated all the pods on that affected node?
Created attachment 1763965 [details]
Kuryr-cni logs

Logs are for the pods below.

[openshift@master-2 momologs]$ oc get pods -o wide
NAME                                READY     STATUS    RESTARTS   AGE       IP              NODE                                          NOMINATED NODE
kuryr-cni-ds-2bbv5                  2/2       Running   4          41m       XX.XX.192.227   infra-node-0.openshift-mydev2-mydc.mydomain   <none>
kuryr-cni-ds-47fk5                  1/2       Running   3          41m       XX.XX.192.156   master-1.openshift-mydev2-mydc.mydomain       <none>
kuryr-cni-ds-6lp69                  2/2       Running   4          41m       XX.XX.192.198   master-0.openshift-mydev2-mydc.mydomain       <none>
kuryr-cni-ds-8p5n2                  1/2       Running   2          41m       XX.XX.192.63    app-node-2.openshift-mydev2-mydc.mydomain     <none>
kuryr-cni-ds-brqs6                  2/2       Running   3          40m       XX.XX.192.104   master-2.openshift-mydev2-mydc.mydomain       <none>
kuryr-cni-ds-gck4q                  2/2       Running   4          41m       XX.XX.192.59    infra-node-2.openshift-mydev2-mydc.mydomain   <none>
kuryr-cni-ds-m7ttl                  2/2       Running   6          41m       XX.XX.192.197   app-node-0.openshift-mydev2-mydc.mydomain     <none>
kuryr-cni-ds-mngjd                  2/2       Running   0          41m       XX.XX.192.173   app-node-3.openshift-mydev2-mydc.mydomain     <none>
kuryr-cni-ds-rft49                  2/2       Running   5          41m       XX.XX.192.33    infra-node-1.openshift-mydev2-mydc.mydomain   <none>
kuryr-cni-ds-rnm2q                  2/2       Running   4          41m       XX.XX.192.130   app-node-1.openshift-mydev2-mydc.mydomain     <none>
kuryr-controller-59d5d4fd5b-tx8cf   1/1       Running   1          2h        XX.XX.192.33    infra-node-1.openshift-mydev2-mydc.mydomain   <none>
(In reply to Michał Dulko from comment #10)
> (In reply to Michał Dulko from comment #9)
> > Mohammad, can you post full log of that kuryr-cni? First log you show seems
> > okay (kubelet is trying to clean up the node as containers are gone) and the
> > second one seems unrelated to the change Peter proposed (kuryr-cni complains
> > about Python's multiprocessing utilities, we never had problems with these).
>
> Had you recreated all the pods on that affected node?

As soon as I made the change back, some pods came online. I didn't wait, and just recreated the rest.
Mohammad: Looks like you've started hitting NetlinkError(17, 'File exists'), which would make sense: cri-o started to manage network namespaces on its own and hadn't cleaned up the previous ones first, causing VLAN conflicts. You need to drain the node of all pods (the non-host-networking ones) before making the config change.
Mohammad, I just realized that to test Peter's idea correctly you also need to make sure kuryr-cni mounts /run/netns from the host, by changing the kuryr-cni-ds DaemonSet (oc -n kuryr edit ds kuryr-cni-ds) to have this volume:

- name: netns
  hostPath:
    path: /run/netns

and this volumeMount (on the kuryr-cni container):

- name: netns
  mountPath: /run/netns
  mountPropagation: HostToContainer
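If editing the DaemonSet by hand is error-prone, the same change can be expressed as a strategic-merge patch. This is only a sketch: the namespace, DaemonSet name, and container name ("kuryr-cni") are taken from this thread and should be verified against your cluster first. Here the patch is only generated and syntax-checked locally; the actual oc command is left commented out:

```shell
# Generate the strategic-merge patch equivalent to the manual edit.
# Strategic merge matches list entries by "name", so only the kuryr-cni
# container's volumeMounts list is extended.
cat > netns-patch.json <<'EOF'
{"spec":{"template":{"spec":{
  "containers":[{"name":"kuryr-cni",
    "volumeMounts":[{"name":"netns","mountPath":"/run/netns",
                     "mountPropagation":"HostToContainer"}]}],
  "volumes":[{"name":"netns","hostPath":{"path":"/run/netns"}}]}}}}
EOF

# Local syntax check before going anywhere near the cluster.
python3 -c 'import json; json.load(open("netns-patch.json")); print("patch JSON is valid")'

# Then, against the cluster (not run here):
#   oc -n kuryr patch ds kuryr-cni-ds -p "$(cat netns-patch.json)"
```

A patch keeps the change reviewable and repeatable across environments, which matters here since the same edit has to be reapplied if the DaemonSet is ever regenerated.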
Thanks Michal,

So, I'll try the following:
1- Drain node
2- Update /etc/crio/crio.conf with `ansible targetnode -m lineinfile -b -a 'path=/etc/crio/crio.conf line="manage_network_ns_lifecycle = true" insertafter="\[crio.runtime\]"'`
3- Update the DaemonSet as per [1] and figure out a way to do it in a rolling manner
4- Create a load and observe

I'll post back findings.

References:
1- Diff:

$ diff -u kuryr-cni-ds-before.yml kuryr-cni-ds-after.yml
--- kuryr-cni-ds-before.yml	2021-03-17 15:33:34.315354396 -0400
+++ kuryr-cni-ds-after.yml	2021-03-17 15:46:10.632727119 -0400
@@ -158,6 +158,12 @@
         name: proc
       - mountPath: /var/run/openvswitch
         name: openvswitch
+      - mountPath: /run/netns
+        name: netns
+        mountPropegation: HostToContainer
+      hostPath:
+      - path: /run/netns
+        name: netns
       dnsPolicy: ClusterFirst
       hostNetwork: true
       initContainers:
Correcting [1]:

References:
1- [openshift@master-2 ~]$ diff -u kuryr-cni-ds-before.yml kuryr-cni-ds-after.yml

--- kuryr-cni-ds-before.yml	2021-03-17 15:33:34.315354396 -0400
+++ kuryr-cni-ds-after.yml	2021-03-17 16:29:30.837409022 -0400
@@ -148,6 +148,9 @@
         terminationMessagePath: /dev/termination-log
         terminationMessagePolicy: File
         volumeMounts:
+        - mountPath: /run/netns
+          name: netns
+          mountPropegation: HostToContainer
         - mountPath: /opt/cni/bin
           name: bin
         - mountPath: /etc/cni/net.d
@@ -193,6 +196,10 @@
           operator: Exists
       volumes:
       - hostPath:
+          path: /run/netns
+          type: ""
+        name: netns
+      - hostPath:
           path: /opt/cni/bin
           type: ""
         name: bin
Actual steps:

## Check config:
$ ansible nodes -m shell -b -a 'cat /etc/crio/crio.conf |grep manage_network_ns_lifecycle'
[WARNING]: Found both group and host with same name: localhost
infra-node-2.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
manage_network_ns_lifecycle = true
master-2.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
manage_network_ns_lifecycle = true
infra-node-1.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
manage_network_ns_lifecycle = true
app-node-2.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
manage_network_ns_lifecycle = true
master-1.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
manage_network_ns_lifecycle = true
app-node-3.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
manage_network_ns_lifecycle = true
master-0.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
manage_network_ns_lifecycle = true
infra-node-0.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
manage_network_ns_lifecycle = true
app-node-0.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
manage_network_ns_lifecycle = true
app-node-1.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
manage_network_ns_lifecycle = true

## Restart CRI-O after the corresponding node is drained
$ ansible nodes -m shell -b -a 'systemctl status crio |grep "Active:"'
[WARNING]: Found both group and host with same name: localhost
app-node-3.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
Active: active (running) since Wed 2021-03-17 17:31:28 EDT; 8min ago
master-0.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
Active: active (running) since Wed 2021-03-17 17:23:46 EDT; 15min ago
master-2.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
Active: active (running) since Wed 2021-03-17 17:31:28 EDT; 8min ago
master-1.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
Active: active (running) since Wed 2021-03-17 17:31:28 EDT; 8min ago
infra-node-1.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
Active: active (running) since Wed 2021-03-17 17:31:28 EDT; 8min ago
infra-node-2.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
Active: active (running) since Wed 2021-03-17 17:31:27 EDT; 8min ago
app-node-1.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
Active: active (running) since Wed 2021-03-17 17:23:46 EDT; 15min ago
infra-node-0.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
Active: active (running) since Wed 2021-03-17 17:23:46 EDT; 15min ago
app-node-0.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
Active: active (running) since Wed 2021-03-17 17:23:46 EDT; 15min ago
app-node-2.openshift-mydev2-mydc.mydomain | CHANGED | rc=0 >>
Active: active (running) since Wed 2021-03-17 17:31:28 EDT; 8min ago

## Restart kuryr-cni pods
[openshift@master-2 ~]$ oc get pods -o wide
NAME                                READY     STATUS             RESTARTS   AGE       IP              NODE                                          NOMINATED NODE
kuryr-cni-ds-5vjv6                  2/2       Running            0          8m        XX.XX.192.59    infra-node-2.openshift-mydev2-mydc.mydomain   <none>
kuryr-cni-ds-642vx                  2/2       Running            0          8m        XX.XX.192.156   master-1.openshift-mydev2-mydc.mydomain       <none>
kuryr-cni-ds-9xmbv                  2/2       Running            0          8m        XX.XX.192.33    infra-node-1.openshift-mydev2-mydc.mydomain   <none>
kuryr-cni-ds-cxmnb                  1/2       CrashLoopBackOff   6          17m       XX.XX.192.197   app-node-0.openshift-mydev2-mydc.mydomain     <none>
kuryr-cni-ds-fcqsv                  2/2       Running            0          8m        XX.XX.192.63    app-node-2.openshift-mydev2-mydc.mydomain     <none>
kuryr-cni-ds-mffqr                  2/2       Running            6          17m       XX.XX.192.198   master-0.openshift-mydev2-mydc.mydomain       <none>
kuryr-cni-ds-qv27z                  1/2       CrashLoopBackOff   6          17m       XX.XX.192.227   infra-node-0.openshift-mydev2-mydc.mydomain   <none>
kuryr-cni-ds-s8rd2                  2/2       Running            0          8m        XX.XX.192.173   app-node-3.openshift-mydev2-mydc.mydomain     <none>
kuryr-cni-ds-tdfwb                  2/2       Running            0          8m        XX.XX.192.104   master-2.openshift-mydev2-mydc.mydomain       <none>
kuryr-cni-ds-w7zqf                  1/2       CrashLoopBackOff   6          17m       XX.XX.192.130   app-node-1.openshift-mydev2-mydc.mydomain     <none>
kuryr-controller-59d5d4fd5b-c85dj   1/1       Running            1          13m       XX.XX.192.227   infra-node-0.openshift-mydev2-mydc.mydomain   <none>

## Logs and pod definitions will be added.
Created attachment 1764201 [details]
CNI logs after crio config changes
Created attachment 1764202 [details]
Kuryr cni pod yaml
Attention CRI-O engineers: are there any logs I can add to support this BZ?
These pods were running and now they are stuck in ContainerCreating:

[openshift@master-2 ~]$ oc get pods -n momo
NAME                       READY     STATUS              RESTARTS   AGE
echo-01-5996bf84f8-j4jxs   0/1       ContainerCreating   0          1h
echo-02-774674b9dd-7pcdq   0/1       ContainerCreating   0          1h
echo-03-7b8d5bb654-c5qpp   0/1       ContainerCreating   0          1h
.
.
.
Can I have the cri-o logs? `journalctl -u crio`
Created attachment 1764500 [details]
Crio logs of all nodes attached.
(In reply to Mohammad from comment #16)
> Correcting [1]
>
> References"
> 1- [openshift@master-2 ~]$ diff -u kuryr-cni-ds-before.yml
> kuryr-cni-ds-after.yml
>
> --- kuryr-cni-ds-before.yml 2021-03-17 15:33:34.315354396 -0400
> +++ kuryr-cni-ds-after.yml 2021-03-17 16:29:30.837409022 -0400
> @@ -148,6 +148,9 @@
>          terminationMessagePath: /dev/termination-log
>          terminationMessagePolicy: File
>          volumeMounts:
> +        - mountPath: /run/netns
> +          name: netns
> +          mountPropegation: HostToContainer
>          - mountPath: /opt/cni/bin
>            name: bin
>          - mountPath: /etc/cni/net.d
> @@ -193,6 +196,10 @@
>            operator: Exists
>        volumes:
>        - hostPath:
> +          path: /run/netns
> +          type: ""
> +        name: netns
> +      - hostPath:
>            path: /opt/cni/bin
>            type: ""
>          name: bin

You seem to have a typo in mountPropagation. I'm not 100% sure if that can cause the issues you're seeing; this might be a bug in the pyroute2 version used in 3.11, as this configuration change was never done with 3.11.
(In reply to Michał Dulko from comment #25)
> (In reply to Mohammad from comment #16)
> > Correcting [1]
> >
> > References"
> > 1- [openshift@master-2 ~]$ diff -u kuryr-cni-ds-before.yml
> > kuryr-cni-ds-after.yml
> > [diff quoted above, trimmed]
>
> You seem to have a typo in mountPropagation. I'm not 100% sure if that can
> cause the issues you're seeing, this might be a bug in pyroute2 version used
> in 3.11 as this configuration change was never done with 3.11.

I observed the same issue on my env and confirmed that setting mountPropagation correctly fixes this problem.
Thanks Michal, you are right, there is a typo (my apologies). I'm testing this today and will report back.
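Worth noting for anyone repeating this: Kubernetes silently drops unknown fields, so a typo like this produces no error at all, the mount simply gets no propagation. A quick hypothetical check (the manifest filename and sample content here are made up for the example) that flags any mountProp* key not spelled exactly `mountPropagation`:

```shell
# Create a sample manifest containing the typo from the earlier diff.
DS=./kuryr-cni-ds.yaml
cat > "$DS" <<'EOF'
        volumeMounts:
        - mountPath: /run/netns
          mountPropegation: HostToContainer
          name: netns
EOF

# Flag any mountProp* key that is not the exact, valid spelling.
grep -En 'mountProp' "$DS" | grep -v 'mountPropagation:' > typo-report.txt || true
cat typo-report.txt   # non-empty output means a misspelled key
```

Running the same two greps against the real exported DaemonSet YAML before applying it would have caught this immediately.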
Hi Michal,

You'd be pleased to know that I've fixed the typo and restarted all the cni pods, and they have been running all day with zero restarts:

[openshift@master-2 ~]$ oc get pods
NAME                                READY     STATUS    RESTARTS   AGE
kuryr-cni-ds-5lhlr                  2/2       Running   0          9h
kuryr-cni-ds-8kg5b                  2/2       Running   0          9h
kuryr-cni-ds-ct9f7                  2/2       Running   0          9h
kuryr-cni-ds-d8spf                  2/2       Running   0          8h
kuryr-cni-ds-f7dhv                  2/2       Running   0          9h
kuryr-cni-ds-f9bxg                  2/2       Running   0          9h
kuryr-cni-ds-hhxpz                  2/2       Running   0          9h
kuryr-cni-ds-hrw5v                  2/2       Running   0          9h
kuryr-cni-ds-n6wbt                  2/2       Running   0          8h
kuryr-cni-ds-w2k68                  2/2       Running   0          8h
kuryr-controller-59d5d4fd5b-q4f5z   1/1       Running   3          9h

All my application pods are running.

So what would be the next step? I'm guessing you won't be able to give us the answer since this is CRI-O related, so I'm hoping someone from the CRI-O team would answer me.

The steps I followed precisely are:

Steps:
1- Drain node
2- Update /etc/crio/crio.conf with `ansible targetnode -m lineinfile -b -a 'path=/etc/crio/crio.conf line="manage_network_ns_lifecycle = true" insertafter="\[crio.runtime\]"'`
3- Update the DaemonSet as per [1] and restart the cni pods one at a time
4- Create a load and observe

References:
1- Update the DaemonSet with the netns mounts as per:

[openshift@master-2 ~]$ diff -u kuryr-cni-ds-before.yaml kuryr-cni-ds-after.yaml
--- kuryr-cni-ds-before.yaml	2021-03-22 02:12:05.894023362 -0400
+++ kuryr-cni-ds-after.yaml	2021-03-22 02:12:46.761584217 -0400
@@ -148,6 +148,9 @@
         terminationMessagePath: /dev/termination-log
         terminationMessagePolicy: File
         volumeMounts:
+        - mountPath: /run/netns
+          mountPropagation: HostToContainer
+          name: netns
         - mountPath: /opt/cni/bin
           name: bin
         - mountPath: /etc/cni/net.d
@@ -193,6 +196,10 @@
           operator: Exists
       volumes:
       - hostPath:
+          path: /run/netns
+          type: ""
+        name: netns
+      - hostPath:
           path: /opt/cni/bin
           type: ""
         name: bin
(In reply to Mohammad from comment #30)
> Hi Michal,
>
> You'd be pleased to know that I've fixed the typo and restarted all the cni
> pods, and they have been running all day with zero restarts:
>
> [openshift@master-2 ~]$ oc get pods
> NAME                                READY     STATUS    RESTARTS   AGE
> kuryr-cni-ds-5lhlr                  2/2       Running   0          9h
> kuryr-cni-ds-8kg5b                  2/2       Running   0          9h
> kuryr-cni-ds-ct9f7                  2/2       Running   0          9h
> kuryr-cni-ds-d8spf                  2/2       Running   0          8h
> kuryr-cni-ds-f7dhv                  2/2       Running   0          9h
> kuryr-cni-ds-f9bxg                  2/2       Running   0          9h
> kuryr-cni-ds-hhxpz                  2/2       Running   0          9h
> kuryr-cni-ds-hrw5v                  2/2       Running   0          9h
> kuryr-cni-ds-n6wbt                  2/2       Running   0          8h
> kuryr-cni-ds-w2k68                  2/2       Running   0          8h
> kuryr-controller-59d5d4fd5b-q4f5z   1/1       Running   3          9h
>
> All my application pods are running.
>
> So what would be the next step? I'm guessing you won't be able to give us
> the answer since this is CRI-O related, so I'm guessing someone from the
> CRI-O team would hopefully answer me.

I've only done very basic testing of this, but looking through [1] and [2], which happened around the 4.3 timeframe, it seemed to just work once Kuryr correctly mounted /var/run/netns, so hopefully it's enough for this to be more stable. I'd like a confirmation from the cri-o team, as they should know most about what versions of cri-o were used in different OCPs.
[1] https://github.com/openshift/machine-config-operator/pull/1600
[2] https://github.com/openshift/machine-config-operator/pull/1689

> The steps I followed precisely are:
>
> Steps:
>
> 1- Drain node
> 2- Update /etc/crio/crio.conf with `ansible targetnode -m lineinfile -b -a
> 'path=/etc/crio/crio.conf line="manage_network_ns_lifecycle = true"
> insertafter="\[crio.runtime\]"'`
> 3- Update Daemonset as per [1] and restart cni pods one at a time
> 4- Create a load and observe
>
> References:
> 1- Update daemonset with netns mounts as per:
>
> [openshift@master-2 ~]$ diff -u kuryr-cni-ds-before.yaml
> kuryr-cni-ds-after.yaml
> --- kuryr-cni-ds-before.yaml 2021-03-22 02:12:05.894023362 -0400
> +++ kuryr-cni-ds-after.yaml 2021-03-22 02:12:46.761584217 -0400
> @@ -148,6 +148,9 @@
>          terminationMessagePath: /dev/termination-log
>          terminationMessagePolicy: File
>          volumeMounts:
> +        - mountPath: /run/netns
> +          mountPropagation: HostToContainer
> +          name: netns
>          - mountPath: /opt/cni/bin
>            name: bin
>          - mountPath: /etc/cni/net.d
> @@ -193,6 +196,10 @@
>            operator: Exists
>        volumes:
>        - hostPath:
> +          path: /run/netns
> +          type: ""
> +        name: netns
> +      - hostPath:
>            path: /opt/cni/bin
>            type: ""
>          name: bin
https://github.com/cri-o/cri-o/blob/5218c7371c2911663c1cc687c7baaa0df5921e0c/vendor/github.com/containernetworking/plugins/pkg/ns/ns_linux.go#L43

Yeah, we put the new namespaces in /var/run/netns in 3.11.
(In reply to Peter Hunt from comment #34)
> https://github.com/cri-o/cri-o/blob/5218c7371c2911663c1cc687c7baaa0df5921e0c/
> vendor/github.com/containernetworking/plugins/pkg/ns/ns_linux.go#L43
>
> yeah we put the new namespaces in /var/run/netns in 3.11

Sure, but isn't 3.11 cri-o missing any fixes that 4.4/4.3 cri-o has regarding netns management? The sudden revert of the PR enabling manage_ns_lifecycle [1] made me doubt that, but now I think it might have happened only because kuryr-kubernetes and ovn-kubernetes weren't ready to do the switch.

[1] https://github.com/openshift/machine-config-operator/pull/1600/files
For 1.17 or 4.4 (when introducing the new config value manage_ns_lifecycle, instead of the legacy manage_network_ns_lifecycle), we refactored the implementation to extend it to other namespaces. The need for the mentioned revert and fixes was due to bugs in the new version. The legacy implementation should have everything that's needed for network namespace management, AFAICT.
Current Environment:
====================
1- Customer is currently running cri-o-1.11.16-0.13.rhaos3.11.git5218c73.el7.x86_64 on all nodes (AFAIK).
2- Customer is using RHEL versions between 7.7 and 7.9.

Assuming this fix works (so far it looks like it), this is my understanding of what needs to be done (please correct me if I'm wrong, for both the long-term and short-term fixes):

Short term fix (for existing environments):
===========================================
1- Add the cri-o change as per [1]
2- Update the DaemonSet to include the changes in [2]
3- Per node:
   a. Drain the node,
   b. Restart cri-o on that node,
   c. Restart the kuryr-cni pod on the node,
   d. Uncordon the node
4- Repeat for each node until finished

Long term fix (for new environments):
=====================================
1- Update the openshift-ansible playbooks to automatically add the extra crio.conf setting as per [1]
2- Update the openshift-ansible playbooks to automatically add the extra DaemonSet netns parameters as per [2]

Questions:
==========
1- My assumption is that cri-o upgrades will not affect existing environments once the fix is applied; is that correct?
2- What are the downsides/risks (long and short term) of making these changes to cri-o in OCP 3.11?
3- Would we need to avoid the cri-o change [1] if we upgrade to 1.17, but still need to do [2]?
References:
===========
1- Update /etc/crio/crio.conf with `ansible targetnode -m lineinfile -b -a 'path=/etc/crio/crio.conf line="manage_network_ns_lifecycle = true" insertafter="\[crio.runtime\]"'`
2- Update the DaemonSet with the netns mounts as per:

[openshift@master-2 ~]$ diff -u kuryr-cni-ds-before.yaml kuryr-cni-ds-after.yaml
--- kuryr-cni-ds-before.yaml	2021-03-22 02:12:05.894023362 -0400
+++ kuryr-cni-ds-after.yaml	2021-03-22 02:12:46.761584217 -0400
@@ -148,6 +148,9 @@
         terminationMessagePath: /dev/termination-log
         terminationMessagePolicy: File
         volumeMounts:
+        - mountPath: /run/netns
+          mountPropagation: HostToContainer
+          name: netns
         - mountPath: /opt/cni/bin
           name: bin
         - mountPath: /etc/cni/net.d
@@ -193,6 +196,10 @@
           operator: Exists
       volumes:
       - hostPath:
+          path: /run/netns
+          type: ""
+        name: netns
+      - hostPath:
           path: /opt/cni/bin
           type: ""
         name: bin
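The per-node procedure above can also be turned into a reviewable script rather than run ad hoc. A dry-run sketch: the node names are placeholders, the generated commands mirror the steps and references above (ansible's systemd module is used here for the crio restart, which is an assumption about the setup), and nothing is executed against a cluster, the commands are only written to a file for inspection:

```shell
# Generate (but do not run) the per-node rollout commands.
NODES="app-node-0 app-node-1 app-node-2"
: > rollout.sh
for NODE in $NODES; do
  cat >> rollout.sh <<EOF
oc adm drain $NODE --ignore-daemonsets --delete-local-data --force
ansible $NODE -m lineinfile -b -a 'path=/etc/crio/crio.conf line="manage_network_ns_lifecycle = true" insertafter="\\[crio.runtime\\]"'
ansible $NODE -m systemd -b -a 'name=crio state=restarted'
oc adm uncordon $NODE
EOF
done
wc -l < rollout.sh   # 4 commands per node
```

The DaemonSet edit (reference 2 above) is cluster-wide and only needs to be applied once, so it is deliberately not part of the per-node loop; the kuryr-cni pod restart per node would be added once the pod-to-node mapping is known.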
I decided to do further tests. I have 4 app nodes that have around 50 application pods running between them.

I ran this node drain, then uncordon, loop about 4 times:

for NODE in $(oc get nodes | grep app-node- | awk '{print $1}'); do oc adm drain $NODE --ignore-daemonsets --delete-local-data --force; sleep 60; oc adm uncordon $NODE; echo "Drained and uncordoned $NODE"; done

This is what the pods looked like:

[openshift@master-2 mologs]$ oc get pods -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP              NODE                              NOMINATED NODE
kuryr-cni-ds-5lhlr                  2/2     Running   0          1d    XX.XX.192.104   master-2.openshift.mydomain       <none>
kuryr-cni-ds-8kg5b                  2/2     Running   0          1d    XX.XX.192.59    infra-node-2.openshift.mydomain   <none>
kuryr-cni-ds-ct9f7                  2/2     Running   0          1d    XX.XX.192.227   infra-node-0.openshift.mydomain   <none>
kuryr-cni-ds-d8spf                  2/2     Running   6          1d    XX.XX.192.130   app-node-1.openshift.mydomain     <none>
kuryr-cni-ds-f7dhv                  2/2     Running   0          1d    XX.XX.192.156   master-1.openshift.mydomain       <none>
kuryr-cni-ds-f9bxg                  2/2     Running   0          1d    XX.XX.192.198   master-0.openshift.mydomain       <none>
kuryr-cni-ds-hhxpz                  2/2     Running   6          1d    XX.XX.192.173   app-node-3.openshift.mydomain     <none>
kuryr-cni-ds-hrw5v                  2/2     Running   0          1d    XX.XX.192.33    infra-node-1.openshift.mydomain   <none>
kuryr-cni-ds-n6wbt                  2/2     Running   7          1d    XX.XX.192.63    app-node-2.openshift.mydomain     <none>
kuryr-cni-ds-sl87c                  2/2     Running   3          35m   XX.XX.192.197   app-node-0.openshift.mydomain     <none>
kuryr-controller-59d5d4fd5b-frptd   1/1     Running   2          20h   XX.XX.192.59    infra-node-2.openshift.mydomain   <none>

[openshift@master-2 mologs]$ oc logs -p -c kuryr-cni kuryr-cni-ds-d8spf > kuryr-cni-ds-d8spf_app-node-1.log
[openshift@master-2 mologs]$ oc logs -p -c kuryr-cni kuryr-cni-ds-hhxpz > kuryr-cni-ds-hhxpz_app-node-3.log
[openshift@master-2 mologs]$ oc logs -p -c kuryr-cni kuryr-cni-ds-n6wbt > kuryr-cni-ds-n6wbt_app-node-2.log
[openshift@master-2 mologs]$ oc logs -p -c kuryr-cni kuryr-cni-ds-sl87c > kuryr-cni-ds-sl87c_app-node-0.log

I'm seeing a few of these errors:

kuryr-cni-ds-hhxpz_app-node-3.log:2021-03-23 03:57:24.564 551 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork request. CNI Params: {'CNI_IFNAME': u'eth0', 'CNI_NETNS': u'/var/run/netns/k8s_echo-31-79bc4b496-fx85n_momo_9a12c78f-8b8b-11eb-b88d-fa163e6dae63_0-f2c33ef4', 'CNI_PATH': u'/opt/cni/bin/:/opt/kuryr-cni/bin', 'CNI_ARGS': u'IgnoreUnknown=1;K8S_POD_NAMESPACE=momo;K8S_POD_NAME=echo-31-79bc4b496-fx85n;K8S_POD_INFRA_CONTAINER_ID=07e1232e9bf1ebc898c63bc97f939dfe79e21f35d352907a203e98929a3270f6', 'CNI_COMMAND': u'ADD', 'CNI_CONTAINERID': u'07e1232e9bf1ebc898c63bc97f939dfe79e21f35d352907a203e98929a3270f6'}: IOError: [Errno 32] Broken pipe

kuryr-cni-ds-n6wbt_app-node-2.log:2021-03-23 03:45:36.690 240 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork request. CNI Params: {'CNI_IFNAME': u'eth0', 'CNI_NETNS': u'/var/run/netns/k8s_echo-31-79bc4b496-k5hrw_momo_099d7cdd-8b89-11eb-b88d-fa163e6dae63_0-31048169', 'CNI_PATH': u'/opt/cni/bin/:/opt/kuryr-cni/bin', 'CNI_ARGS': u'IgnoreUnknown=1;K8S_POD_NAMESPACE=momo;K8S_POD_NAME=echo-31-79bc4b496-k5hrw;K8S_POD_INFRA_CONTAINER_ID=f50d75ee0194ff8f4096591691f4b59949cea11449e8af9f4249497c3b6958c4', 'CNI_COMMAND': u'ADD', 'CNI_CONTAINERID': u'f50d75ee0194ff8f4096591691f4b59949cea11449e8af9f4249497c3b6958c4'}: RuntimeError

kuryr-cni-ds-n6wbt_app-node-2.log:2021-03-23 03:55:52.915 1213 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork request. CNI Params: {'CNI_IFNAME': u'eth0', 'CNI_NETNS': u'/var/run/netns/k8s_echo-17-5475cdc776-nsxk8_momo_38298655-8b8b-11eb-b88d-fa163e6dae63_0-8e0b0aeb', 'CNI_PATH': u'/opt/cni/bin/:/opt/kuryr-cni/bin', 'CNI_ARGS': u'IgnoreUnknown=1;K8S_POD_NAMESPACE=momo;K8S_POD_NAME=echo-17-5475cdc776-nsxk8;K8S_POD_INFRA_CONTAINER_ID=e2e2eb222c1b0697dcecd7be98885eebf542423eeb2109a6efa12b8dce78de7d', 'CNI_COMMAND': u'ADD', 'CNI_CONTAINERID': u'e2e2eb222c1b0697dcecd7be98885eebf542423eeb2109a6efa12b8dce78de7d'}: IOError: [Errno 32] Broken pipe

kuryr-cni-ds-sl87c_app-node-0.log:2021-03-23 03:56:05.350 345 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork request. CNI Params: {'CNI_IFNAME': u'eth0', 'CNI_NETNS': u'/var/run/netns/k8s_echo-21-57f7ccfd54-hx5dp_momo_99470682-8b8b-11eb-b88d-fa163e6dae63_0-70c47700', 'CNI_PATH': u'/opt/cni/bin/:/opt/kuryr-cni/bin', 'CNI_ARGS': u'IgnoreUnknown=1;K8S_POD_NAMESPACE=momo;K8S_POD_NAME=echo-21-57f7ccfd54-hx5dp;K8S_POD_INFRA_CONTAINER_ID=39078f44a6308d1aa46c558b17399c0b2fcbeb21ab11aecd776b58f245e83f04', 'CNI_COMMAND': u'ADD', 'CNI_CONTAINERID': u'39078f44a6308d1aa46c558b17399c0b2fcbeb21ab11aecd776b58f245e83f04'}: RuntimeError

I'm guessing these are harmless, as all the application pods are running.
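For triaging collected logs like these across many nodes, a small helper can tally which exception each failed addNetwork request ended with. This is a hypothetical sketch (the function name and regex are mine, not kuryr code); it only keys off the fields visible in the log lines above:

```python
import re
from collections import Counter

def tally_addnetwork_errors(log_text):
    """Count the exception type at the end of each failed addNetwork line.

    Illustrative triage helper: the exception name follows the closing
    brace of the CNI Params dict, e.g. "...}: IOError: [Errno 32] ...".
    """
    counts = Counter()
    for line in log_text.splitlines():
        if "Error when processing addNetwork request" not in line:
            continue
        m = re.search(r"\}: (\w+)", line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

Run over the four saved *.log files, this distinguishes at a glance how many failures were "IOError: [Errno 32] Broken pipe" versus the bare "RuntimeError".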
answers to the questions:

1- My assumption is that crio upgrades will not affect existing environments once the fix is applied, is that correct?

sorry, I don't understand what this means

2- What are the downsides/risks (long and short term) of doing these changes on cri-o in OCP 3.11?

It's hard to tell. Many consumers use that package, and after we reimplemented this feature, we set it to default and have now dropped the config field entirely. Unfortunately, we don't have a lot of testing against this version. I would naively say it's very likely to work without a problem.

3- Would we need to avoid the cri-o change [1] if we upgrade to 1.17 but still need to do [2]?

The MCO has the cri-o changes already in 4.4, so they're not applicable. I am not sure exactly what needs to be done for kuryr, but I would guess that is already taken care of as well.
(In reply to Peter Hunt from comment #42)
> answers to the questions:
>
> 1- My assumption is that crio upgrades will not affect existing environments
> once the fix is applied, is that correct?
> sorry I don't understand what this means

I think the question is whether a 3.11 minor upgrade will revert the config change.

> 2- What are the downsides/risks (long and short term) of doing these changes
> on cri-o in OCP3.11?
> It's hard to tell. Many consumers use that package, and after we
> reimplemented this feature, we've set it to default and now have dropped the
> config field entirely.
> Unfortunately, we don't have a lot of testing against this version. I would
> naively say it's very likely to work without a problem.

> 3- Would we need to avoid the cri-o change [1] if we upgrade to 1.17 but
> still need to do [2]?
> the MCO has the cri-o changes already in 4.4, so they're not applicable. I
> am not sure exactly what needs to be done for kuryr, but i would guess that
> is already taken care of as well

Yeah, in 4.x Kuryr works just fine with this change, but obviously the 3.11 code has already diverged from 4.x.
(In reply to Mohammad from comment #40)
> I decided to do further tests, I have 4 app nodes that have around 50
> application pods running between them.
[...]
> I'm guessing these are harmless as all the application pods are running.

Honestly, I'm not sure where those come from. We've seen this happening in the past, but a simple pyroute2 upgrade helped to clear it up. Unfortunately, pyroute2 doesn't seem to provide good debugging info in this case. It does, however, seem to fix itself after a retry made by cri-o.
(In reply to Peter Hunt from comment #42)
> answers to the questions:
[...]

It looks like this works well, and the errors we see in https://bugzilla.redhat.com/show_bug.cgi?id=1939605#c44 are not a show stopper.

To ask differently: what are our next steps? Do we apply the short-term fix as per https://bugzilla.redhat.com/show_bug.cgi?id=1939605#c37 and wait for a long-term fix? Then how do we maintain the short-term fix throughout upgrades to cri-o (OS updates) and openshift-ansible upgrades? I need to consider this carefully, because we have 10 production clusters we need to do this on.
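On maintaining the short-term fix across upgrades: because the ansible lineinfile step is idempotent, re-running it after every cri-o or openshift-ansible upgrade is safe. A hypothetical Python sketch of the same "ensure line present" logic (names are mine, not from the product):

```python
def ensure_line(conf_text, line="manage_network_ns_lifecycle = true",
                after_section="[crio.runtime]"):
    """Idempotently ensure `line` appears right after `after_section`.

    Mirrors the ansible lineinfile workaround step; calling it again on
    an already-fixed config is a no-op, so it can run on every upgrade.
    """
    lines = conf_text.splitlines()
    if line in (l.strip() for l in lines):
        return conf_text  # already present: nothing to do
    out = []
    for l in lines:
        out.append(l)
        if l.strip() == after_section:
            out.append(line)  # insert right after the section header
    return "\n".join(out)
```

Hooking this (or the original ansible task) into whatever post-upgrade playbook the 10 clusters already run would keep the config from silently reverting.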
Checked with:
OCP v3.11.448
OSP 13-2021-03-24.1
cri-o 1.11.17-dev

$ oc get pods -n kuryr
NAME                                READY     STATUS    RESTARTS   AGE
kuryr-cni-ds-bsjsh                  2/2       Running   0          2d
kuryr-cni-ds-c9n2x                  2/2       Running   0          2d
kuryr-cni-ds-d4bz9                  2/2       Running   0          2d
kuryr-cni-ds-ddqsb                  2/2       Running   0          2d
kuryr-cni-ds-dr2rc                  2/2       Running   0          2d
kuryr-cni-ds-rb2xq                  2/2       Running   0          2d
kuryr-cni-ds-rpfsf                  2/2       Running   0          2d
kuryr-cni-ds-vbg6l                  2/2       Running   0          2d
kuryr-controller-567485999f-29wbt   1/1       Running   0          2d

Haven't found the errors mentioned in the bug description.
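To make this verification repeatable on other clusters, collected kuryr-cni logs can be scanned for the original failure signature from the bug description. A hypothetical sketch (the helper name is mine; the signature string is taken from the traceback reported above):

```python
# Signature of the original failure: the netns path under /host_proc is
# gone by the time kuryr serves the CNI ADD request.
SIGNATURE = "No such file or directory: '/host_proc"

def has_netns_error(log_text):
    """Return True if any log line shows the missing-netns OSError."""
    return any(SIGNATURE in line for line in log_text.splitlines())
```

A clean run over all nodes' logs (returning False everywhere) corresponds to the "haven't found the errors" result above.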
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 3.11.452 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2150