Description of problem:

We can get the following traceback when there are unscheduled pods matching an NP on that NP's deletion:

2020-11-04 13:41:33.559 19857 ERROR kuryr_kubernetes.handlers.logging Traceback (most recent call last):
2020-11-04 13:41:33.559 19857 ERROR kuryr_kubernetes.handlers.logging   File "/opt/stack/kuryr-kubernetes/kuryr_kubernetes/handlers/logging.py", line 37, in __call__
2020-11-04 13:41:33.559 19857 ERROR kuryr_kubernetes.handlers.logging     self._handler(event, *args, **kwargs)
2020-11-04 13:41:33.559 19857 ERROR kuryr_kubernetes.handlers.logging   File "/opt/stack/kuryr-kubernetes/kuryr_kubernetes/handlers/retry.py", line 81, in __call__
2020-11-04 13:41:33.559 19857 ERROR kuryr_kubernetes.handlers.logging     self._handler(event, *args, **kwargs)
2020-11-04 13:41:33.559 19857 ERROR kuryr_kubernetes.handlers.logging   File "/opt/stack/kuryr-kubernetes/kuryr_kubernetes/handlers/k8s_base.py", line 81, in __call__
2020-11-04 13:41:33.559 19857 ERROR kuryr_kubernetes.handlers.logging     self.on_finalize(obj)
2020-11-04 13:41:33.559 19857 ERROR kuryr_kubernetes.handlers.logging   File "/opt/stack/kuryr-kubernetes/kuryr_kubernetes/controller/handlers/kuryrnetworkpolicy.py", line 288, in on_finalize
2020-11-04 13:41:33.559 19857 ERROR kuryr_kubernetes.handlers.logging     self._drv_vif_pool.update_vif_sgs(pod, pod_sgs)
2020-11-04 13:41:33.559 19857 ERROR kuryr_kubernetes.handlers.logging   File "/opt/stack/kuryr-kubernetes/kuryr_kubernetes/controller/drivers/vif_pool.py", line 1224, in update_vif_sgs
2020-11-04 13:41:33.559 19857 ERROR kuryr_kubernetes.handlers.logging     pod_vif_type = self._get_pod_vif_type(pod)
2020-11-04 13:41:33.559 19857 ERROR kuryr_kubernetes.handlers.logging   File "/opt/stack/kuryr-kubernetes/kuryr_kubernetes/controller/drivers/vif_pool.py", line 1244, in _get_pod_vif_type
2020-11-04 13:41:33.559 19857 ERROR kuryr_kubernetes.handlers.logging     node_name = pod['spec']['nodeName']
2020-11-04 13:41:33.559 19857 ERROR kuryr_kubernetes.handlers.logging KeyError: 'nodeName'
2020-11-04 13:41:33.559 19857 ERROR kuryr_kubernetes.handlers.logging

Version-Release number of selected component (if applicable):

How reproducible:
Always if hit.

Steps to Reproduce:
1. Disable kube-scheduler.
2. Create an NP.
3. Create a pod that matches this NP. Double-check that it hangs unscheduled because kube-scheduler is disabled.
4. Delete the NP from step 2.

Actual results:
Traceback in the kuryr-controller logs.

Expected results:
The unscheduled pod is ignored.

Additional info:
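For context, the crash comes from _get_pod_vif_type() indexing pod['spec']['nodeName'] directly, which does not exist until the scheduler places the pod. A guard along these lines in on_finalize() would be enough to skip such pods; this is a sketch only, with names taken from the traceback above, not the actual patch:

# Sketch only; helper and variable names come from the traceback above,
# not from the real fix.
def is_scheduled(pod):
    """Return True once the scheduler has assigned the pod to a node.

    An unscheduled pod has no spec.nodeName and therefore no VIF whose
    security groups could be updated.
    """
    return bool(pod.get('spec', {}).get('nodeName'))

# Usage inside on_finalize(), before touching the VIF pool driver:
#     for pod in pods:
#         if not is_scheduled(pod):
#             continue
#         self._drv_vif_pool.update_vif_sgs(pod, pod_sgs)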
Failed on OCP4.7.0-0.nightly-2020-12-14-080124 over OSP16.1 with OVN-Octavia (RHOS-16.1-RHEL-8-20201124.n.0).

kuryr-controller still goes into CrashLoopBackOff in the described scenario:

1. Disable kube-scheduler by moving the manifest file (kube-scheduler-pod.yaml) on the three masters.

Before moving the files, the pods are in place:

(shiftstack) [stack@undercloud-0 ~]$ oc get pods -A | grep kube-scheduler
[...]
openshift-kube-scheduler   openshift-kube-scheduler-ostest-hlxdv-master-0   3/3   Running   4   18h
openshift-kube-scheduler   openshift-kube-scheduler-ostest-hlxdv-master-1   3/3   Running   3   18h
openshift-kube-scheduler   openshift-kube-scheduler-ostest-hlxdv-master-2   3/3   Running   3   19h

Now, moving the files:

$ openstack server list
+--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+--------+
| ID                                   | Name                        | Status | Networks                            | Image              | Flavor |
+--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+--------+
| 94b0c150-cf2b-4c8f-a5ca-9b2e011931c0 | ostest-hlxdv-worker-0-dm5jz | ACTIVE | ostest-hlxdv-openshift=10.196.3.13  | ostest-hlxdv-rhcos |        |
| 2646058e-932d-41d2-acc3-7ffcf8553bd5 | ostest-hlxdv-worker-0-f6dhp | ACTIVE | ostest-hlxdv-openshift=10.196.1.2   | ostest-hlxdv-rhcos |        |
| 4aa63711-5871-4998-bfdd-447a2ff9aaa5 | ostest-hlxdv-worker-0-85w2q | ACTIVE | ostest-hlxdv-openshift=10.196.1.124 | ostest-hlxdv-rhcos |        |
| fce689d3-995c-4a1a-b061-3b064d7521f7 | ostest-hlxdv-master-2       | ACTIVE | ostest-hlxdv-openshift=10.196.2.68  | ostest-hlxdv-rhcos |        |
| 56538749-5fd5-49a7-96b8-8a3c0b071f0b | ostest-hlxdv-master-1       | ACTIVE | ostest-hlxdv-openshift=10.196.2.224 | ostest-hlxdv-rhcos |        |
| 48b95246-b460-4497-bbc1-fefdfd775e20 | ostest-hlxdv-master-0       | ACTIVE | ostest-hlxdv-openshift=10.196.1.51  | ostest-hlxdv-rhcos |        |
+--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+--------+

$ ssh -J core@10.46.22.93 core@10.196.2.224
Warning: Permanently added '10.46.22.93' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.196.2.224' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 47.83.202012140447-0
  Part of OpenShift 4.7, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead, make
configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.7/architecture/architecture-rhcos.html

---
[core@ostest-hlxdv-master-1 ~]$ sudo mv /etc/kubernetes/manifests/kube-scheduler-pod.yaml /tmp/
[core@ostest-hlxdv-master-1 ~]$ exit
logout

(^ same repeated for the other two masters)

As a result, the pods are not present:

$ oc get pods -n openshift-kube-scheduler | grep -v Completed
NAME   READY   STATUS   RESTARTS   AGE

2. Create the NP and the pods:

$ oc new-project test

$ cat np_resource.yaml
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: np
spec:
  podSelector:
    matchLabels:
      run: demo
  ingress:
    - from:
        - podSelector:
            matchLabels:
              run: demo-allowed-caller

$ oc apply -f np_resource.yaml
$ oc run --image kuryr/demo demo-allowed-caller
$ oc run --image kuryr/demo demo
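With the scheduler disabled, both pods stay in Pending with no node assigned. One quick way to confirm before moving on (illustrative check; the empty jsonpath output shows spec.nodeName is unset, which is exactly the key the controller trips over):

$ oc get pod demo -o jsonpath='{.spec.nodeName}'
(empty output: the pod is unscheduled and spec.nodeName does not exist yet)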
3. kuryr-controller starts being restarted with the exception below:

2020-12-15 12:18:10.251 1 ERROR kuryr_kubernetes.handlers.logging Traceback (most recent call last):
2020-12-15 12:18:10.251 1 ERROR kuryr_kubernetes.handlers.logging   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/logging.py", line 37, in __call__
2020-12-15 12:18:10.251 1 ERROR kuryr_kubernetes.handlers.logging     self._handler(event, *args, **kwargs)
2020-12-15 12:18:10.251 1 ERROR kuryr_kubernetes.handlers.logging   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/retry.py", line 81, in __call__
2020-12-15 12:18:10.251 1 ERROR kuryr_kubernetes.handlers.logging     self._handler(event, *args, **kwargs)
2020-12-15 12:18:10.251 1 ERROR kuryr_kubernetes.handlers.logging   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 90, in __call__
2020-12-15 12:18:10.251 1 ERROR kuryr_kubernetes.handlers.logging     self.on_present(obj)
2020-12-15 12:18:10.251 1 ERROR kuryr_kubernetes.handlers.logging   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrnetworkpolicy.py", line 204, in on_present
2020-12-15 12:18:10.251 1 ERROR kuryr_kubernetes.handlers.logging     self._drv_vif_pool.update_vif_sgs(pod, pod_sgs)
2020-12-15 12:18:10.251 1 ERROR kuryr_kubernetes.handlers.logging   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/drivers/vif_pool.py", line 1224, in update_vif_sgs
2020-12-15 12:18:10.251 1 ERROR kuryr_kubernetes.handlers.logging     pod_vif_type = self._get_pod_vif_type(pod)
2020-12-15 12:18:10.251 1 ERROR kuryr_kubernetes.handlers.logging   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/drivers/vif_pool.py", line 1244, in _get_pod_vif_type
2020-12-15 12:18:10.251 1 ERROR kuryr_kubernetes.handlers.logging     node_name = pod['spec']['nodeName']
2020-12-15 12:18:10.251 1 ERROR kuryr_kubernetes.handlers.logging KeyError: 'nodeName'
2020-12-15 12:18:10.251 1 ERROR kuryr_kubernetes.handlers.logging
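Note that this on_present path and the on_finalize path from the description crash on the same line of _get_pod_vif_type(), so any fix needs to cover both. A minimal runnable illustration of the failure mode (the pod manifest below is a made-up stub, not real cluster data):

# Made-up stub of an unscheduled pod: the scheduler has not filled in
# spec.nodeName, so the key simply does not exist.
unscheduled_pod = {'metadata': {'name': 'demo'}, 'spec': {'containers': []}}

try:
    unscheduled_pod['spec']['nodeName']         # what _get_pod_vif_type() does
except KeyError as exc:
    print('KeyError:', exc)                     # the crash in both tracebacks

print(unscheduled_pod['spec'].get('nodeName'))  # None: the safe lookup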
Verified on OCP4.7.0-0.nightly-2021-01-10-070949 over OSP13 with Amphoras (2020-10-06.2).

Performing the above procedure, the pods remain in Pending state but there are no backtraces in kuryr-controller:

[stack@undercloud-0 ~]$ oc logs -n openshift-kuryr kuryr-controller-74cbc895fc-8hbsg | grep -i KeyError
[stack@undercloud-0 ~]$ oc get all
NAME                      READY   STATUS    RESTARTS   AGE
pod/demo                  0/1     Pending   0          8m45s
pod/demo-allowed-caller   0/1     Pending   0          8m50s

# Restoring the kube-scheduler-pod.yaml:

(shiftstack) [stack@undercloud-0 ~]$ ssh -J core@10.46.22.232 core@10.196.0.141 sudo mv /tmp/kube-scheduler-pod.yaml /etc/kubernetes/manifests/kube-scheduler-pod.yaml
Warning: Permanently added '10.46.22.232' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.196.0.141' (ECDSA) to the list of known hosts.
Killed by signal 1.

(shiftstack) [stack@undercloud-0 ~]$ ssh -J core@10.46.22.232 core@10.196.3.88 sudo mv /tmp/kube-scheduler-pod.yaml /etc/kubernetes/manifests/kube-scheduler-pod.yaml
Warning: Permanently added '10.46.22.232' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.196.3.88' (ECDSA) to the list of known hosts.
Killed by signal 1.

(shiftstack) [stack@undercloud-0 ~]$ ssh -J core@10.46.22.232 core@10.196.1.211 sudo mv /tmp/kube-scheduler-pod.yaml /etc/kubernetes/manifests/kube-scheduler-pod.yaml
Warning: Permanently added '10.46.22.232' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.196.1.211' (ECDSA) to the list of known hosts.
Killed by signal 1.

# The pods start to be deployed:

(shiftstack) [stack@undercloud-0 ~]$ oc get all
NAME                      READY   STATUS              RESTARTS   AGE
pod/demo                  0/1     ContainerCreating   0          10m
pod/demo-allowed-caller   0/1     ContainerCreating   0          10m

(shiftstack) [stack@undercloud-0 ~]$ oc get all
NAME                      READY   STATUS    RESTARTS   AGE
pod/demo                  1/1     Running   0          11m
pod/demo-allowed-caller   1/1     Running   0          11m

(shiftstack) [stack@undercloud-0 ~]$ oc expose pod/demo --port 80 --target-port 8080
service/demo exposed

(shiftstack) [stack@undercloud-0 ~]$ oc get all
NAME                      READY   STATUS    RESTARTS   AGE
pod/demo                  1/1     Running   0          12m
pod/demo-allowed-caller   1/1     Running   0          12m

NAME           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/demo   ClusterIP   172.30.77.173   <none>        80/TCP    4s

(shiftstack) [stack@undercloud-0 ~]$ oc rsh pod/demo-allowed-caller curl 172.30.77.173
demo: HELLO! I AM ALIVE!!!

Kuryr controller remains stable during this procedure.
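One extra check that is useful while re-running the scenario is watching the controller's restart counter stay at zero (illustrative output; the pod name is the one from the logs above and the AGE column will differ per deployment):

$ oc get pods -n openshift-kuryr | grep kuryr-controller
kuryr-controller-74cbc895fc-8hbsg   1/1   Running   0   ...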
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633