Bug 2039099
| Summary: | [OVN EgressIP GCP] After reboot egress node, egressip that was previously assigned got lost | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | jechen <jechen> |
| Component: | Networking | Assignee: | Ben Bennett <bbennett> |
| Networking sub component: | ovn-kubernetes | QA Contact: | jechen <jechen> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | bpickard, huirwang, zzhao |
| Version: | 4.10 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-10 16:38:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
@jechen assigning this bug to you for verification, thanks.

Verified in 4.10.0-0.nightly-2022-01-25-023600
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.0-0.nightly-2022-01-25-023600 True False 12m Cluster version is 4.10.0-0.nightly-2022-01-25-023600
$ oc get node
NAME STATUS ROLES AGE VERSION
jechen-0125c-r6m4q-master-0.c.openshift-qe.internal Ready master 68m v1.23.0+06791f6
jechen-0125c-r6m4q-master-1.c.openshift-qe.internal Ready master 68m v1.23.0+06791f6
jechen-0125c-r6m4q-master-2.c.openshift-qe.internal Ready master 68m v1.23.0+06791f6
jechen-0125c-r6m4q-worker-a-6zf4c.c.openshift-qe.internal Ready worker 53m v1.23.0+06791f6
jechen-0125c-r6m4q-worker-b-ppgq4.c.openshift-qe.internal Ready worker 53m v1.23.0+06791f6
jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal Ready worker 53m v1.23.0+06791f6
$ oc label node jechen-0125c-r6m4q-worker-a-6zf4c.c.openshift-qe.internal "k8s.ovn.org/egress-assignable"=""
node/jechen-0125c-r6m4q-worker-a-6zf4c.c.openshift-qe.internal labeled
$ oc label node jechen-0125c-r6m4q-worker-b-ppgq4.c.openshift-qe.internal "k8s.ovn.org/egress-assignable"=""
node/jechen-0125c-r6m4q-worker-b-ppgq4.c.openshift-qe.internal labeled
$ oc label node jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal "k8s.ovn.org/egress-assignable"=""
node/jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal labeled
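The three label commands above can also be generated from a node list when scripting the setup. A minimal sketch (node names are this cluster's workers; `label_commands` is an illustrative helper, not part of any tooling):

```python
# Build one `oc label` invocation per worker, marking it egress-assignable.
nodes = [
    "jechen-0125c-r6m4q-worker-a-6zf4c.c.openshift-qe.internal",
    "jechen-0125c-r6m4q-worker-b-ppgq4.c.openshift-qe.internal",
    "jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal",
]

def label_commands(node_names):
    """Return the `oc label` command line for each node."""
    return [
        f'oc label node {n} "k8s.ovn.org/egress-assignable"=""'
        for n in node_names
    ]

for cmd in label_commands(nodes):
    print(cmd)
```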
$ oc create -f ./SDN-1332-test/config_egressip1_ovn_ns_team_red.yaml
egressip.k8s.ovn.org/egressip1 created
$ oc get egressip
NAME EGRESSIPS ASSIGNED NODE ASSIGNED EGRESSIPS
egressip1 10.0.128.101 jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal 10.0.128.103
$ oc get egressip -oyaml
apiVersion: v1
items:
- apiVersion: k8s.ovn.org/v1
  kind: EgressIP
  metadata:
    creationTimestamp: "2022-01-26T02:44:14Z"
    generation: 4
    name: egressip1
    resourceVersion: "42056"
    uid: c0d0c881-d566-4c31-b984-1b26854447a1
  spec:
    egressIPs:
    - 10.0.128.101
    - 10.0.128.102
    - 10.0.128.103
    namespaceSelector:
      matchLabels:
        team: red
  status:
    items:
    - egressIP: 10.0.128.103
      node: jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal
    - egressIP: 10.0.128.102
      node: jechen-0125c-r6m4q-worker-a-6zf4c.c.openshift-qe.internal
    - egressIP: 10.0.128.101
      node: jechen-0125c-r6m4q-worker-b-ppgq4.c.openshift-qe.internal
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
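The object above can be checked mechanically: every IP in `spec.egressIPs` should appear in `status.items` with a node assignment. A minimal sketch of that check, with the data copied from the YAML above (`unassigned_ips` is an illustrative helper):

```python
# Mirror of the EgressIP object above: requested IPs and their status assignments.
egressip = {
    "spec": {"egressIPs": ["10.0.128.101", "10.0.128.102", "10.0.128.103"]},
    "status": {"items": [
        {"egressIP": "10.0.128.103",
         "node": "jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal"},
        {"egressIP": "10.0.128.102",
         "node": "jechen-0125c-r6m4q-worker-a-6zf4c.c.openshift-qe.internal"},
        {"egressIP": "10.0.128.101",
         "node": "jechen-0125c-r6m4q-worker-b-ppgq4.c.openshift-qe.internal"},
    ]},
}

def unassigned_ips(obj):
    """Return spec egress IPs that have no node assignment in status."""
    assigned = {item["egressIP"] for item in obj["status"]["items"]}
    return [ip for ip in obj["spec"]["egressIPs"] if ip not in assigned]

print(unassigned_ips(egressip))  # → []  (all three IPs are assigned)
```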
$ oc new-project test
$ oc label ns test team=red
namespace/test labeled
$ oc create -f ./SDN-1332-test/list_for_pods.json
replicationcontroller/test-rc created
service/test-service created
$ oc get pod
NAME READY STATUS RESTARTS AGE
test-rc-749c9 0/1 ContainerCreating 0 2s
test-rc-99lxj 0/1 ContainerCreating 0 2s
test-rc-mx5zb 0/1 ContainerCreating 0 2s
$ oc rsh test-rc-749c9
~ $ curl 10.0.0.2:8888
10.0.128.101~ $
~ $ curl 10.0.0.2:8888
10.0.128.101~ $
~ $ curl 10.0.0.2:8888
10.0.128.101~ $
~ $ curl 10.0.0.2:8888
10.0.128.103~ $
~ $ curl 10.0.0.2:8888
10.0.128.103~ $
~ $ exit
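The curl loop above works because the host at 10.0.0.2:8888 echoes back the client's source IP; verification consists of checking that every observed source IP is one of the configured egress IPs. A minimal sketch of that check, with the IPs copied from the transcript:

```python
# Egress IPs configured in the EgressIP object above.
configured = {"10.0.128.101", "10.0.128.102", "10.0.128.103"}

# Source IPs echoed back by 10.0.0.2:8888 across the five curl calls above.
observed = ["10.0.128.101"] * 3 + ["10.0.128.103"] * 2

# Every packet must have left through a configured egress IP.
assert set(observed) <= configured, "pod traffic used a non-egress source IP"
print(sorted(set(observed)))  # → ['10.0.128.101', '10.0.128.103']
```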
$ oc debug node/jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal
Starting pod/jechen-0125c-r6m4q-worker-c-b787bcopenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.4
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4#
sh-4.4# reboot
Terminated
sh-4.4#
Removing debug pod ...
### wait until the node comes back
$ oc describe node jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 14s kubelet Starting kubelet.
Normal NodeHasSufficientMemory 14s (x2 over 14s) kubelet Node jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 14s (x2 over 14s) kubelet Node jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 14s (x2 over 14s) kubelet Node jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal status is now: NodeHasSufficientPID
Warning Rebooted 14s kubelet Node jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal has been rebooted, boot id: d693e91e-272c-4420-9125-66931845e6e5
Normal NodeNotReady 14s kubelet Node jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal status is now: NodeNotReady
Normal NodeAllocatableEnforced 14s kubelet Updated Node Allocatable limit across pods
$ oc get egressip -oyaml
apiVersion: v1
items:
- apiVersion: k8s.ovn.org/v1
  kind: EgressIP
  metadata:
    creationTimestamp: "2022-01-26T02:44:14Z"
    generation: 6
    name: egressip1
    resourceVersion: "44232"
    uid: c0d0c881-d566-4c31-b984-1b26854447a1
  spec:
    egressIPs:
    - 10.0.128.101
    - 10.0.128.102
    - 10.0.128.103
    namespaceSelector:
      matchLabels:
        team: red
  status:
    items:
    - egressIP: 10.0.128.102
      node: jechen-0125c-r6m4q-worker-a-6zf4c.c.openshift-qe.internal
    - egressIP: 10.0.128.101
      node: jechen-0125c-r6m4q-worker-b-ppgq4.c.openshift-qe.internal
    - egressIP: 10.0.128.103
      node: jechen-0125c-r6m4q-worker-c-b787b.c.openshift-qe.internal
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
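Comparing the two `oc get egressip -oyaml` status blocks (before and after rebooting worker-c) is what demonstrates the fix: no IP disappeared, and the rebooted node kept 10.0.128.103. A minimal sketch of that comparison, with node names abbreviated for readability:

```python
# IP -> node assignments copied from the status blocks above
# (node names shortened to their worker suffix).
before = {
    "10.0.128.103": "worker-c",
    "10.0.128.102": "worker-a",
    "10.0.128.101": "worker-b",
}
after = {
    "10.0.128.102": "worker-a",
    "10.0.128.101": "worker-b",
    "10.0.128.103": "worker-c",
}

lost = set(before) - set(after)  # IPs with no assignment after the reboot
assert not lost, f"egress IPs lost after reboot: {lost}"
assert before == after  # every IP came back on the same node
print("no egress IP lost across the reboot")
```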
$ oc rsh test-rc-749c9
~ $ curl 10.0.0.2:8888
10.0.128.102~ $
~ $ curl 10.0.0.2:8888
10.0.128.102~ $
~ $ curl 10.0.0.2:8888
10.0.128.101~ $
~ $ curl 10.0.0.2:8888
10.0.128.103~ $
~ $ curl 10.0.0.2:8888
10.0.128.103~ $
~ $ curl 10.0.0.2:8888
10.0.128.102~ $
~ $ curl 10.0.0.2:8888
10.0.128.103~ $
~ $ curl 10.0.0.2:8888
10.0.128.102~ $
~ $ curl 10.0.0.2:8888
10.0.128.103~ $
~ $ curl 10.0.0.2:8888
10.0.128.101~ $
~ $ curl 10.0.0.2:8888
10.0.128.101~ $
~ $ curl 10.0.0.2:8888
10.0.128.102~ $
~ $ curl 10.0.0.2:8888
10.0.128.103~ $
~ $ exit
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
Description of problem:
On an OVN-Kubernetes cluster on GCP with EgressIP configured, reboot a node that has an egressIP assigned. The node comes back and becomes NodeReady, but the egressIP on it is lost.

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-10-101431   True        False         4h1m    Cluster version is 4.10.0-0.nightly-2022-01-10-101431

How reproducible:
Create an OVN-Kubernetes cluster on GCP with 3 worker nodes.
$ oc get node
NAME                                                        STATUS   ROLES    AGE     VERSION
jechen-0110a-lh2jd-master-0.c.openshift-qe.internal         Ready    master   4h24m   v1.22.1+6859754
jechen-0110a-lh2jd-master-1.c.openshift-qe.internal         Ready    master   4h24m   v1.22.1+6859754
jechen-0110a-lh2jd-master-2.c.openshift-qe.internal         Ready    master   4h23m   v1.22.1+6859754
jechen-0110a-lh2jd-worker-a-2ltqm.c.openshift-qe.internal   Ready    worker   4h11m   v1.22.1+6859754
jechen-0110a-lh2jd-worker-b-jm4nj.c.openshift-qe.internal   Ready    worker   4h11m   v1.22.1+6859754
jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal   Ready    worker   4h11m   v1.22.1+6859754

Steps to Reproduce:
1. Label the nodes as egress-assignable:
$ oc label node jechen-0110a-lh2jd-worker-a-2ltqm.c.openshift-qe.internal "k8s.ovn.org/egress-assignable"=""
node/jechen-0110a-lh2jd-worker-a-2ltqm.c.openshift-qe.internal labeled
$ oc label node jechen-0110a-lh2jd-worker-b-jm4nj.c.openshift-qe.internal "k8s.ovn.org/egress-assignable"=""
node/jechen-0110a-lh2jd-worker-b-jm4nj.c.openshift-qe.internal labeled
$ oc label node jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal "k8s.ovn.org/egress-assignable"=""
node/jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal labeled

2. Create an EgressIP object:
$ cat config_egressip1_ovn_ns_team_red.yaml
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip1
spec:
  egressIPs:
  - 10.0.128.101
  - 10.0.128.102
  - 10.0.128.103
  namespaceSelector:
    matchLabels:
      team: red

$ oc create -f config_egressip1_ovn_ns_team_red.yaml
egressip.k8s.ovn.org/egressip1 created

$ oc get egressip
NAME        EGRESSIPS      ASSIGNED NODE                                               ASSIGNED EGRESSIPS
egressip1   10.0.128.101   jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal   10.0.128.101

$ oc get egressip -oyaml
apiVersion: v1
items:
- apiVersion: k8s.ovn.org/v1
  kind: EgressIP
  metadata:
    creationTimestamp: "2022-01-10T23:09:59Z"
    generation: 4
    name: egressip1
    resourceVersion: "117389"
    uid: 892940e6-4363-4e1c-a8f5-65a40286a19f
  spec:
    egressIPs:
    - 10.0.128.101
    - 10.0.128.102
    - 10.0.128.103
    namespaceSelector:
      matchLabels:
        team: red
    podSelector: {}
  status:
    items:
    - egressIP: 10.0.128.101
      node: jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal
    - egressIP: 10.0.128.103
      node: jechen-0110a-lh2jd-worker-a-2ltqm.c.openshift-qe.internal
    - egressIP: 10.0.128.102
      node: jechen-0110a-lh2jd-worker-b-jm4nj.c.openshift-qe.internal
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

3. Reboot node jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal and wait until it comes back:
$ oc describe node jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal
<--snip-->
Events:
  Type     Reason                   Age                    From     Message
  ----     ------                   ----                   ----     -------
  Normal   Starting                 3m28s                  kubelet  Starting kubelet.
  Normal   NodeHasSufficientMemory  3m28s (x2 over 3m28s)  kubelet  Node jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    3m28s (x2 over 3m28s)  kubelet  Node jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     3m28s (x2 over 3m28s)  kubelet  Node jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal status is now: NodeHasSufficientPID
  Warning  Rebooted                 3m28s                  kubelet  Node jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal has been rebooted, boot id: c78abedd-f7bf-4b76-82fc-f77285378439
  Normal   NodeNotReady             3m28s                  kubelet  Node jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal status is now: NodeNotReady
  Normal   NodeAllocatableEnforced  3m27s                  kubelet  Updated Node Allocatable limit across pods
  Normal   NodeReady                3m17s                  kubelet  Node jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal status is now: NodeReady

Actual results:
Node jechen-0110a-lh2jd-worker-c-ksx8w.c.openshift-qe.internal lost the egressIP previously assigned to it:
$ oc get egressip -oyaml
apiVersion: v1
items:
- apiVersion: k8s.ovn.org/v1
  kind: EgressIP
  metadata:
    creationTimestamp: "2022-01-10T23:09:59Z"
    generation: 7
    name: egressip1
    resourceVersion: "124609"
    uid: 892940e6-4363-4e1c-a8f5-65a40286a19f
  spec:
    egressIPs:
    - 10.0.128.101
    - 10.0.128.102
    - 10.0.128.103
    namespaceSelector:
      matchLabels:
        team: red
    podSelector: {}
  status:
    items:
    - egressIP: 10.0.128.103
      node: jechen-0110a-lh2jd-worker-a-2ltqm.c.openshift-qe.internal
    - egressIP: 10.0.128.102
      node: jechen-0110a-lh2jd-worker-b-jm4nj.c.openshift-qe.internal
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Expected results:
The node should not lose the egressIP previously assigned to it.

Additional info:
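The failure in "Actual results" is visible directly in the object: the spec lists three egress IPs, but the post-reboot status carries only two assignments. A minimal sketch of that check, with the values copied from the report:

```python
# Egress IPs requested in the spec.
spec_ips = {"10.0.128.101", "10.0.128.102", "10.0.128.103"}

# Post-reboot status from the report: worker-c's assignment is gone.
status_after_reboot = {
    "10.0.128.103": "jechen-0110a-lh2jd-worker-a-2ltqm.c.openshift-qe.internal",
    "10.0.128.102": "jechen-0110a-lh2jd-worker-b-jm4nj.c.openshift-qe.internal",
}

lost = spec_ips - set(status_after_reboot)
print(lost)  # → {'10.0.128.101'}  -- the bug: this IP was never re-assigned
```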