Description of problem:
On an OVN-Kubernetes cluster on GCP, reboot a node that has an egressIP assigned. The node comes back and becomes NodeReady, but the egressIP previously assigned to it is lost.

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-10-101431   True        False         4h1m    Cluster version is 4.10.0-0.nightly-2022-01-10-101431

How reproducible:
Create an OVN-Kubernetes cluster on GCP with 3 worker nodes.

$ oc get node
NAME                                                        STATUS   ROLES    AGE     VERSION
jechen-0110a-lh2jd-master-0.c.openshift-qe.internal         Ready    master   4h24m   v1.22.1+6859754
jechen-0110a-lh2jd-master-1.c.openshift-qe.internal         Ready    master   4h24m   v1.22.1+6859754
jechen-0110a-lh2jd-master-2.c.openshift-qe.internal         Ready    master   4h23m   v1.22.1+6859754
jechen-0110a-lh2jd-worker-a-2ltqm.c.openshift-qe.internal   Ready    worker   4h11m   v1.22.1+6859754
jechen-0110a-lh2jd-worker-b-jm4nj.c.openshift-qe.internal   Ready    worker   4h11m   v1.22.1+6859754
jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal   Ready    worker   4h11m   v1.22.1+6859754

Steps to Reproduce:
1. Label the worker nodes as egress-assignable:

$ oc label node jechen-0110a-lh2jd-worker-a-2ltqm.c.openshift-qe.internal "k8s.ovn.org/egress-assignable"=""
node/jechen-0110a-lh2jd-worker-a-2ltqm.c.openshift-qe.internal labeled
$ oc label node jechen-0110a-lh2jd-worker-b-jm4nj.c.openshift-qe.internal "k8s.ovn.org/egress-assignable"=""
node/jechen-0110a-lh2jd-worker-b-jm4nj.c.openshift-qe.internal labeled
$ oc label node jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal "k8s.ovn.org/egress-assignable"=""
node/jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal labeled

2. Create an EgressIP object for namespaces labeled team=red:

$ cat config_egressip1_ovn_ns_team_red.yaml
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip1
spec:
  egressIPs:
  - 10.0.128.101
  - 10.0.128.102
  - 10.0.128.103
  namespaceSelector:
    matchLabels:
      team: red

$ oc create -f config_egressip1_ovn_ns_team_red.yaml
egressip.k8s.ovn.org/egressip1 created

$ oc get egressip
NAME        EGRESSIPS      ASSIGNED NODE                                               ASSIGNED EGRESSIPS
egressip1   10.0.128.101   jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal   10.0.128.101

$ oc get egressip -oyaml
apiVersion: v1
items:
- apiVersion: k8s.ovn.org/v1
  kind: EgressIP
  metadata:
    creationTimestamp: "2022-01-10T23:09:59Z"
    generation: 4
    name: egressip1
    resourceVersion: "117389"
    uid: 892940e6-4363-4e1c-a8f5-65a40286a19f
  spec:
    egressIPs:
    - 10.0.128.101
    - 10.0.128.102
    - 10.0.128.103
    namespaceSelector:
      matchLabels:
        team: red
    podSelector: {}
  status:
    items:
    - egressIP: 10.0.128.101
      node: jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal
    - egressIP: 10.0.128.103
      node: jechen-0110a-lh2jd-worker-a-2ltqm.c.openshift-qe.internal
    - egressIP: 10.0.128.102
      node: jechen-0110a-lh2jd-worker-b-jm4nj.c.openshift-qe.internal
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

3. Reboot one of the nodes, jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal, and wait until it comes back:

$ oc describe node jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal
<--snip-->
Events:
  Type     Reason                   Age                    From     Message
  ----     ------                   ----                   ----     -------
  Normal   Starting                 3m28s                  kubelet  Starting kubelet.
  Normal   NodeHasSufficientMemory  3m28s (x2 over 3m28s)  kubelet  Node jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    3m28s (x2 over 3m28s)  kubelet  Node jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     3m28s (x2 over 3m28s)  kubelet  Node jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal status is now: NodeHasSufficientPID
  Warning  Rebooted                 3m28s                  kubelet  Node jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal has been rebooted, boot id: c78abedd-f7bf-4b76-82fc-f77285378439
  Normal   NodeNotReady             3m28s                  kubelet  Node jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal status is now: NodeNotReady
  Normal   NodeAllocatableEnforced  3m27s                  kubelet  Updated Node Allocatable limit across pods
  Normal   NodeReady                3m17s                  kubelet  Node jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal status is now: NodeReady

Actual results:
Node jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal lost the egressIP previously assigned to it:

$ oc get egressip -oyaml
apiVersion: v1
items:
- apiVersion: k8s.ovn.org/v1
  kind: EgressIP
  metadata:
    creationTimestamp: "2022-01-10T23:09:59Z"
    generation: 7
    name: egressip1
    resourceVersion: "124609"
    uid: 892940e6-4363-4e1c-a8f5-65a40286a19f
  spec:
    egressIPs:
    - 10.0.128.101
    - 10.0.128.102
    - 10.0.128.103
    namespaceSelector:
      matchLabels:
        team: red
    podSelector: {}
  status:
    items:
    - egressIP: 10.0.128.103
      node: jechen-0110a-lh2jd-worker-a-2ltqm.c.openshift-qe.internal
    - egressIP: 10.0.128.102
      node: jechen-0110a-lh2jd-worker-b-jm4nj.c.openshift-qe.internal
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Expected results:
The node should not lose the egressIP previously assigned to it.

Additional info:
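As additional context for anyone reproducing this, the sketch below is one way to watch whether the rebooted node regains its assignment. It is only a rough helper, not part of the original report; it assumes `oc` is logged in, `jq` is installed, and the EgressIP object is named egressip1 as above.

#!/bin/bash
# Rough helper (assumption: oc is logged in, jq is installed, and the
# EgressIP object is named egressip1). Polls the EgressIP status and
# reports whether a given node currently holds an egress IP.
NODE="jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal"

while true; do
  # Print "<egressIP> <node>" pairs from the current status
  oc get egressip egressip1 -o json \
    | jq -r '.status.items[] | "\(.egressIP) \(.node)"'
  # Flag whether the node of interest holds any assignment right now
  if oc get egressip egressip1 -o json \
       | jq -e --arg n "$NODE" '.status.items[] | select(.node == $n)' > /dev/null; then
    echo "$(date +%T)  $NODE holds an egress IP"
  else
    echo "$(date +%T)  $NODE has NO egress IP assigned"
  fi
  sleep 10
done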
@jechen assigning this bug to you for verification, thanks.
Verified in 4.10.0-0.nightly-2022-01-25-023600

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-25-023600   True        False         12m     Cluster version is 4.10.0-0.nightly-2022-01-25-023600

$ oc get node
NAME                                                        STATUS   ROLES    AGE   VERSION
jechen-0125c-r6m4q-master-0.c.openshift-qe.internal         Ready    master   68m   v1.23.0+06791f6
jechen-0125c-r6m4q-master-1.c.openshift-qe.internal         Ready    master   68m   v1.23.0+06791f6
jechen-0125c-r6m4q-master-2.c.openshift-qe.internal         Ready    master   68m   v1.23.0+06791f6
jechen-0125c-r6m4q-worker-a-6zf4c.c.openshift-qe.internal   Ready    worker   53m   v1.23.0+06791f6
jechen-0125c-r6m4q-worker-b-ppgq4.c.openshift-qe.internal   Ready    worker   53m   v1.23.0+06791f6
jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal   Ready    worker   53m   v1.23.0+06791f6

$ oc label node jechen-0125c-r6m4q-worker-a-6zf4c.c.openshift-qe.internal "k8s.ovn.org/egress-assignable"=""
node/jechen-0125c-r6m4q-worker-a-6zf4c.c.openshift-qe.internal labeled
$ oc label node jechen-0125c-r6m4q-worker-b-ppgq4.c.openshift-qe.internal "k8s.ovn.org/egress-assignable"=""
node/jechen-0125c-r6m4q-worker-b-ppgq4.c.openshift-qe.internal labeled
$ oc label node jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal "k8s.ovn.org/egress-assignable"=""
node/jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal labeled

$ oc create -f ./SDN-1332-test/config_egressip1_ovn_ns_team_red.yaml
egressip.k8s.ovn.org/egressip1 created

$ oc get egressip
NAME        EGRESSIPS      ASSIGNED NODE                                               ASSIGNED EGRESSIPS
egressip1   10.0.128.101   jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal   10.0.128.103

$ oc get egressip -oyaml
apiVersion: v1
items:
- apiVersion: k8s.ovn.org/v1
  kind: EgressIP
  metadata:
    creationTimestamp: "2022-01-26T02:44:14Z"
    generation: 4
    name: egressip1
    resourceVersion: "42056"
    uid: c0d0c881-d566-4c31-b984-1b26854447a1
  spec:
    egressIPs:
    - 10.0.128.101
    - 10.0.128.102
    - 10.0.128.103
    namespaceSelector:
      matchLabels:
        team: red
  status:
    items:
    - egressIP: 10.0.128.103
      node: jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal
    - egressIP: 10.0.128.102
      node: jechen-0125c-r6m4q-worker-a-6zf4c.c.openshift-qe.internal
    - egressIP: 10.0.128.101
      node: jechen-0125c-r6m4q-worker-b-ppgq4.c.openshift-qe.internal
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

[jechen@jechen ~]$ oc new-project test
$ oc label ns test team=red
namespace/test labeled

$ oc create -f ./SDN-1332-test/list_for_pods.json
replicationcontroller/test-rc created
service/test-service created

$ oc get pod
NAME            READY   STATUS              RESTARTS   AGE
test-rc-749c9   0/1     ContainerCreating   0          2s
test-rc-99lxj   0/1     ContainerCreating   0          2s
test-rc-mx5zb   0/1     ContainerCreating   0          2s

$ oc rsh test-rc-749c9
~ $ curl 10.0.0.2:8888
10.0.128.101~ $
~ $ curl 10.0.0.2:8888
10.0.128.101~ $
~ $ curl 10.0.0.2:8888
10.0.128.101~ $
~ $ curl 10.0.0.2:8888
10.0.128.103~ $
~ $ curl 10.0.0.2:8888
10.0.128.103~ $
~ $ exit

$ oc debug node/jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal
Starting pod/jechen-0125c-r6m4q-worker-c-b787bcopenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.4
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4#
sh-4.4# reboot
Terminated
sh-4.4#
Removing debug pod ...

### wait until the node comes back

$ oc describe node jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal
Events:
  Type     Reason                   Age                From     Message
  ----     ------                   ----               ----     -------
  Normal   Starting                 14s                kubelet  Starting kubelet.
  Normal   NodeHasSufficientMemory  14s (x2 over 14s)  kubelet  Node jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    14s (x2 over 14s)  kubelet  Node jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     14s (x2 over 14s)  kubelet  Node jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal status is now: NodeHasSufficientPID
  Warning  Rebooted                 14s                kubelet  Node jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal has been rebooted, boot id: d693e91e-272c-4420-9125-66931845e6e5
  Normal   NodeNotReady             14s                kubelet  Node jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal status is now: NodeNotReady
  Normal   NodeAllocatableEnforced  14s                kubelet  Updated Node Allocatable limit across pods

$ oc get egressip -oyaml
apiVersion: v1
items:
- apiVersion: k8s.ovn.org/v1
  kind: EgressIP
  metadata:
    creationTimestamp: "2022-01-26T02:44:14Z"
    generation: 6
    name: egressip1
    resourceVersion: "44232"
    uid: c0d0c881-d566-4c31-b984-1b26854447a1
  spec:
    egressIPs:
    - 10.0.128.101
    - 10.0.128.102
    - 10.0.128.103
    namespaceSelector:
      matchLabels:
        team: red
  status:
    items:
    - egressIP: 10.0.128.102
      node: jechen-0125c-r6m4q-worker-a-6zf4c.c.openshift-qe.internal
    - egressIP: 10.0.128.101
      node: jechen-0125c-r6m4q-worker-b-ppgq4.c.openshift-qe.internal
    - egressIP: 10.0.128.103
      node: jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

The rebooted node regained an egress IP assignment (10.0.128.103), and egress traffic from the test pod is seen using all three egress IPs:

$ oc rsh test-rc-749c9
~ $ curl 10.0.0.2:8888
10.0.128.102~ $
~ $ curl 10.0.0.2:8888
10.0.128.102~ $
~ $ curl 10.0.0.2:8888
10.0.128.101~ $
~ $ curl 10.0.0.2:8888
10.0.128.103~ $
~ $ curl 10.0.0.2:8888
10.0.128.103~ $
~ $ curl 10.0.0.2:8888
10.0.128.102~ $
~ $ curl 10.0.0.2:8888
10.0.128.103~ $
~ $ curl 10.0.0.2:8888
10.0.128.102~ $
~ $ curl 10.0.0.2:8888
10.0.128.103~ $
~ $ curl 10.0.0.2:8888
10.0.128.101~ $
~ $ curl 10.0.0.2:8888
10.0.128.101~ $
~ $ curl 10.0.0.2:8888
10.0.128.102~ $
~ $ curl 10.0.0.2:8888
10.0.128.103~ $
~ $ exit
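For what it's worth, the manual curl checks above could also be scripted. The sketch below is only a rough helper, not part of the original verification; it assumes the test pod name (test-rc-749c9) and the external echo host (10.0.0.2:8888) used in this run, where the response body is the observed client source IP.

#!/bin/bash
# Rough helper (assumption: pod test-rc-749c9 exists and 10.0.0.2:8888
# echoes back the client source IP, as in this verification run).
# Repeats the source-IP check and counts how often each egress IP appears.
POD="test-rc-749c9"
TARGET="10.0.0.2:8888"

for i in $(seq 1 20); do
  # Run curl inside the pod; print the returned source IP on its own line
  oc rsh "$POD" curl -s "$TARGET"
  echo
done | sort | uniq -c
# Every observed IP should be one of the spec egress IPs
# (10.0.128.101/.102/.103), including the one re-assigned to the rebooted node.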
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056