Description of problem:
[SDN AWS] sdn-controller crashed after re-configuring EgressIPs

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-12-21-130047

How reproducible:

Steps to Reproduce:
1. Patch two nodes as EgressCIDRs nodes.

2. Create 20 namespaces and patch one egress IP to each namespace:

$ for i in {1..10};do oc create ns p$i;sleep 1;done
namespace/p1 created
namespace/p2 created
namespace/p3 created
namespace/p4 created
namespace/p5 created
namespace/p6 created
namespace/p7 created
namespace/p8 created
namespace/p9 created
namespace/p10 created
$ for i in {11..20};do oc create ns p$i;sleep 1;done
namespace/p11 created
namespace/p12 created
namespace/p13 created
namespace/p14 created
namespace/p15 created
namespace/p16 created
namespace/p17 created
namespace/p18 created
namespace/p19 created
namespace/p20 created

$ for i in {1..10};do oc patch netnamespace p$i -p "{\"egressIPs\":[\"10.0.59.$i\"]}" --type=merge ;sleep 1;done
netnamespace.network.openshift.io/p1 patched (no change)
netnamespace.network.openshift.io/p2 patched
netnamespace.network.openshift.io/p3 patched
netnamespace.network.openshift.io/p4 patched
netnamespace.network.openshift.io/p5 patched
netnamespace.network.openshift.io/p6 patched
netnamespace.network.openshift.io/p7 patched
netnamespace.network.openshift.io/p8 patched
netnamespace.network.openshift.io/p9 patched
netnamespace.network.openshift.io/p10 patched
$ for i in {11..20};do oc patch netnamespace p$i -p "{\"egressIPs\":[\"10.0.59.$i\"]}" --type=merge ;sleep 1;done
netnamespace.network.openshift.io/p11 patched
netnamespace.network.openshift.io/p12 patched
netnamespace.network.openshift.io/p13 patched
netnamespace.network.openshift.io/p14 patched
netnamespace.network.openshift.io/p15 patched
netnamespace.network.openshift.io/p16 patched
netnamespace.network.openshift.io/p17 patched
netnamespace.network.openshift.io/p18 patched
netnamespace.network.openshift.io/p19 patched
netnamespace.network.openshift.io/p20 patched

3. Check the hostsubnets: each egress node was assigned 9 egress IPs, since the egress IP capacity of each node is 9.

$ oc get hostsubnet
NAME                                        HOST                                        HOST IP       SUBNET          EGRESS CIDRS       EGRESS IPS
ip-10-0-50-111.us-east-2.compute.internal   ip-10-0-50-111.us-east-2.compute.internal   10.0.50.111   10.131.0.0/23   ["10.0.48.0/20"]   ["10.0.59.13","10.0.59.7","10.0.59.11","10.0.59.100","10.0.59.1","10.0.59.5","10.0.59.15","10.0.59.9","10.0.59.17"]
ip-10-0-52-192.us-east-2.compute.internal   ip-10-0-52-192.us-east-2.compute.internal   10.0.52.192   10.130.0.0/23
ip-10-0-57-18.us-east-2.compute.internal    ip-10-0-57-18.us-east-2.compute.internal    10.0.57.18    10.129.0.0/23
ip-10-0-59-247.us-east-2.compute.internal   ip-10-0-59-247.us-east-2.compute.internal   10.0.59.247   10.129.2.0/23   ["10.0.48.0/20"]   ["10.0.59.12","10.0.59.10","10.0.59.14","10.0.59.2","10.0.59.3","10.0.59.4","10.0.59.6","10.0.59.8","10.0.59.16"]
ip-10-0-65-70.us-east-2.compute.internal    ip-10-0-65-70.us-east-2.compute.internal    10.0.65.70    10.128.0.0/23
ip-10-0-77-149.us-east-2.compute.internal   ip-10-0-77-149.us-east-2.compute.internal   10.0.77.149   10.128.2.0/23

4. Delete all the namespaces above and repeat step 2.

Actual results:
Only a few EgressIPs were configured.
$ oc get hostsubnet
NAME                                        HOST                                        HOST IP       SUBNET          EGRESS CIDRS       EGRESS IPS
ip-10-0-50-111.us-east-2.compute.internal   ip-10-0-50-111.us-east-2.compute.internal   10.0.50.111   10.131.0.0/23   ["10.0.48.0/20"]   ["10.0.59.2","10.0.59.100","10.0.59.4"]
ip-10-0-52-192.us-east-2.compute.internal   ip-10-0-52-192.us-east-2.compute.internal   10.0.52.192   10.130.0.0/23
ip-10-0-57-18.us-east-2.compute.internal    ip-10-0-57-18.us-east-2.compute.internal    10.0.57.18    10.129.0.0/23
ip-10-0-59-247.us-east-2.compute.internal   ip-10-0-59-247.us-east-2.compute.internal   10.0.59.247   10.129.2.0/23   ["10.0.48.0/20"]   ["10.0.59.1","10.0.59.21","10.0.59.3"]
ip-10-0-65-70.us-east-2.compute.internal    ip-10-0-65-70.us-east-2.compute.internal    10.0.65.70    10.128.0.0/23
ip-10-0-77-149.us-east-2.compute.internal   ip-10-0-77-149.us-east-2.compute.internal   10.0.77.149   10.128.2.0/23

And one sdn-controller pod was in CrashLoopBackOff:

$ oc get pods -n openshift-sdn -o wide
NAME                   READY   STATUS             RESTARTS      AGE     IP            NODE                                        NOMINATED NODE   READINESS GATES
sdn-bscb4              2/2     Running            0             4h54m   10.0.57.18    ip-10-0-57-18.us-east-2.compute.internal    <none>           <none>
sdn-controller-949mj   1/1     Running            0             4h54m   10.0.52.192   ip-10-0-52-192.us-east-2.compute.internal   <none>           <none>
sdn-controller-cg8gd   0/1     CrashLoopBackOff   5 (41s ago)   4h54m   10.0.57.18    ip-10-0-57-18.us-east-2.compute.internal    <none>           <none>
sdn-controller-fpvfg   1/1     Running            0             4h54m   10.0.65.70    ip-10-0-65-70.us-east-2.compute.internal    <none>           <none>
sdn-hcvcr              2/2     Running            0             4h41m   10.0.59.247   ip-10-0-59-247.us-east-2.compute.internal   <none>           <none>
sdn-k8kk9              2/2     Running            0             4h54m   10.0.52.192   ip-10-0-52-192.us-east-2.compute.internal   <none>           <none>
sdn-mck5t              2/2     Running            0             4h46m   10.0.77.149   ip-10-0-77-149.us-east-2.compute.internal   <none>           <none>
sdn-tpb68              2/2     Running            0             4h46m   10.0.50.111   ip-10-0-50-111.us-east-2.compute.internal   <none>           <none>
sdn-ztlm8              2/2     Running            0             4h54m   10.0.65.70    ip-10-0-65-70.us-east-2.compute.internal    <none>           <none>

$ oc describe pod sdn-controller-cg8gd -n openshift-sdn
Name:                 sdn-controller-cg8gd
Namespace:            openshift-sdn
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 ip-10-0-57-18.us-east-2.compute.internal/10.0.57.18
Start Time:           Thu, 23 Dec 2021 09:36:59 +0800
Labels:               app=sdn-controller
                      controller-revision-hash=58747c9748
                      pod-template-generation=1
Annotations:          <none>
Status:               Running
IP:                   10.0.57.18
IPs:
  IP:  10.0.57.18
Controlled By:  DaemonSet/sdn-controller
Containers:
  sdn-controller:
    Container ID:  cri-o://16369d3d891e93bc037ce46d868d64734eaf0375ef99298f98c7d777f90eaff8
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:39c6b94542b3297130a345d82db7620e79950ee58b6c692d3c933fe426f2e0de
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:39c6b94542b3297130a345d82db7620e79950ee58b6c692d3c933fe426f2e0de
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      if [[ -f /env/_master ]]; then
        set -o allexport
        source /env/_master
        set +o allexport
      fi
      exec openshift-sdn-controller \
        --platform-type AWS \
        --v=${OPENSHIFT_SDN_LOG_LEVEL:-2}
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   0600d20, 0x30)
        github.com/openshift/sdn/pkg/network/common/egressip.go:625 +0x471
        github.com/openshift/sdn/pkg/network/common.(*EgressIPTracker).syncEgressIPs(0xc000600460)
        github.com/openshift/sdn/pkg/network/common/egressip.go:600 +0xeb
        github.com/openshift/sdn/pkg/network/common.(*EgressIPTracker).UpdateHostSubnetEgress(0xc000600460, 0xc0004aa6f0)
        github.com/openshift/sdn/pkg/network/common/egressip.go:373 +0xab0
        github.com/openshift/sdn/pkg/network/common.(*EgressIPTracker).handleAddOrUpdateHostSubnet(0xc000600460, {0x195b4a0, 0xc0004aa6f0}, {0x100000000000000, 0x0}, {0x19841b8, 0x5})
        github.com/openshift/sdn/pkg/network/common/egressip.go:254 +0x6fa
        github.com/openshift/sdn/pkg/network/common.InformerFuncs.func1({0x195b4a0, 0xc0004aa6f0})
        github.com/openshift/sdn/pkg/network/common/informers.go:19 +0x39
        k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
        k8s.io/client-go.0-rc.0/tools/cache/controller.go:231
        k8s.io/client-go/tools/cache.(*processorListener).run.func1()
        k8s.io/client-go.0-rc.0/tools/cache/shared_informer.go:777 +0x9f
        k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f8a88061da0)
        k8s.io/apimachinery.0-rc.0/pkg/util/wait/wait.go:155 +0x67
        k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000147738, {0x1ba1cc0, 0xc0000ae240}, 0x1, 0xc0005ec8a0)
        k8s.io/apimachinery.0-rc.0/pkg/util/wait/wait.go:156 +0xb6
        k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000138000, 0x3b9aca00, 0x0, 0xc0, 0xc0001477b0)
        k8s.io/apimachinery.0-rc.0/pkg/util/wait/wait.go:133 +0x89
        k8s.io/apimachinery/pkg/util/wait.Until(...)
        k8s.io/apimachinery.0-rc.0/pkg/util/wait/wait.go:90
        k8s.io/client-go/tools/cache.(*processorListener).run(0xc000478680)
        k8s.io/client-go.0-rc.0/tools/cache/shared_informer.go:771 +0x6b
        k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
        k8s.io/apimachinery.0-rc.0/pkg/util/wait/wait.go:73 +0x5a
        created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
        k8s.io/apimachinery.0-rc.0/pkg/util/wait/wait.go:71 +0x88
      Exit Code:  2
      Started:    Thu, 23 Dec 2021 14:31:14 +0800
      Finished:   Thu, 23 Dec 2021 14:31:14 +0800
    Ready:          False
    Restart Count:  5
    Requests:
      cpu:     10m
      memory:  50Mi
    Environment:
      KUBERNETES_SERVICE_PORT:  6443
      KUBERNETES_SERVICE_HOST:  api-int.huirwang-23a.qe.devcluster.openshift.com
    Mounts:
      /env from env-overrides (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gnjdw (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  env-overrides:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      env-overrides
    Optional:  true
  kube-api-access-gnjdw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:       Burstable
Node-Selectors:
  node-role.kubernetes.io/master=
Tolerations:
  node-role.kubernetes.io/master:NoSchedule op=Exists
  node.kubernetes.io/disk-pressure:NoSchedule op=Exists
  node.kubernetes.io/memory-pressure:NoSchedule op=Exists
  node.kubernetes.io/network-unavailable:NoSchedule op=Exists
  node.kubernetes.io/not-ready:NoSchedule op=Exists
  node.kubernetes.io/not-ready:NoExecute op=Exists
  node.kubernetes.io/pid-pressure:NoSchedule op=Exists
  node.kubernetes.io/unreachable:NoExecute op=Exists
  node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason   Age                    From     Message
  ----     ------   ----                   ----     -------
  Normal   Created  2m25s (x5 over 4h55m)  kubelet  Created container sdn-controller
  Normal   Started  2m25s (x5 over 4h55m)  kubelet  Started container sdn-controller
  Warning  BackOff  78s (x12 over 3m45s)   kubelet  Back-off restarting failed container
  Normal   Pulled   63s (x5 over 3m46s)    kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:39c6b94542b3297130a345d82db7620e79950ee58b6c692d3c933fe426f2e0de" already present on machine

Expected results:
No sdn-controller crashes, and re-configuring EgressIPs works.

Additional info:
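As a quick way to quantify the symptom, the EGRESS IPS column of the hostsubnet output can be counted per node. The helper below is a hypothetical sketch (not part of the original report); `count_egress_ips` is an illustrative name, and the sample strings are copied from the outputs above. On a live cluster the column could be pulled with something like `oc get hostsubnet -o jsonpath='{range .items[*]}{.host}{"\t"}{.egressIPs}{"\n"}{end}'`.

```shell
# Hypothetical helper: count how many egress IPs are present in a
# hostsubnet EGRESS IPS value such as ["10.0.59.2","10.0.59.100","10.0.59.4"].
count_egress_ips() {
  # Each IP appears as one quoted token; count the quoted tokens.
  grep -o '"[0-9.]*"' <<<"$1" | grep -c .
}

# Values copied from the report: node ip-10-0-50-111 before and after step 4.
before='["10.0.59.13","10.0.59.7","10.0.59.11","10.0.59.100","10.0.59.1","10.0.59.5","10.0.59.15","10.0.59.9","10.0.59.17"]'
after='["10.0.59.2","10.0.59.100","10.0.59.4"]'

echo "before: $(count_egress_ips "$before") IPs"   # 9, matching the per-node capacity
echo "after:  $(count_egress_ips "$after") IPs"    # only 3 after re-configuring
```

The drop from 9 assigned IPs per node to 3 is the "only a few EgressIPs configured" state described in the actual results.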
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056