Bug 2038840

Summary: [SDN EgressIP]cloud-network-config-controller pod was CrashLoopBackOff after some operation
Product: OpenShift Container Platform
Reporter: huirwang
Component: Networking
Assignee: Ben Bennett <bbennett>
Networking sub component: openshift-sdn
QA Contact: huirwang
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
CC: dbrahane, jechen
Version: 4.10
Keywords: Reopened
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-03-10 16:38:09 UTC
Type: Bug

Description huirwang 2022-01-10 09:40:14 UTC
Description of problem:
Tested on an AWS SDN cluster: after adding and removing EgressIPs from the namespace many times, the cloud-network-config-controller pod went into CrashLoopBackOff.

Version-Release number of selected component (if applicable):
 4.10.0-0.nightly-2022-01-09-195852 

How reproducible:
Not sure

Steps to Reproduce:
1. Before the cloud-network-config-controller pod crashed, I observed that some unused cloudprivateipconfigs were not removed; in this environment, 10.0.51.100 and 10.0.67.21. I then re-added egress IP 10.0.51.100 to the hostsubnet and netnamespace, but 10.0.51.100 stayed in an incorrect status.
$ oc get cloudprivateipconfigs  
NAME          AGE
10.0.51.100   7h31m
10.0.67.21    37m
10.0.73.100   7m33s

$  oc get cloudprivateipconfigs  10.0.51.100 -o yaml
apiVersion: cloud.network.openshift.io/v1
kind: CloudPrivateIPConfig
metadata:
  creationTimestamp: "2022-01-10T01:52:09Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2022-01-10T02:14:17Z"
  finalizers:
  - cloudprivateipconfig.cloud.network.openshift.io/finalizer
  generation: 2
  name: 10.0.51.100
  resourceVersion: "44970"
  uid: 77aa9143-2e1c-4976-ab26-fa759866f14c
spec:
  node: ip-10-0-51-186.us-east-2.compute.internal
status:
  conditions:
  - lastTransitionTime: "2022-01-10T02:14:17Z"
    message: Deleting IP address
    observedGeneration: 2
    reason: CloudResponsePending
    status: Unknown
    type: Assigned
  node: ""

2. Then I tried to reproduce the issue.
Assign 3 egress IPs to different hosts (see the hostsubnet output below and the patch sketch that follows it), create a new namespace test3, patch the 3 egress IPs onto test3, then remove all egress IPs from test3.
$ oc get hostsubnet
NAME                                        HOST                                        HOST IP       SUBNET          EGRESS CIDRS   EGRESS IPS
ip-10-0-51-186.us-east-2.compute.internal   ip-10-0-51-186.us-east-2.compute.internal   10.0.51.186   10.129.0.0/23                  ["10.0.51.100"]
ip-10-0-57-103.us-east-2.compute.internal   ip-10-0-57-103.us-east-2.compute.internal   10.0.57.103   10.129.2.0/23                  ["10.0.57.50"]
ip-10-0-57-202.us-east-2.compute.internal   ip-10-0-57-202.us-east-2.compute.internal   10.0.57.202   10.128.2.0/23                  
ip-10-0-67-247.us-east-2.compute.internal   ip-10-0-67-247.us-east-2.compute.internal   10.0.67.247   10.128.0.0/23                  ["10.0.67.50"]
ip-10-0-71-99.us-east-2.compute.internal    ip-10-0-71-99.us-east-2.compute.internal    10.0.71.99    10.131.0.0/23                  []
ip-10-0-73-87.us-east-2.compute.internal    ip-10-0-73-87.us-east-2.compute.internal    10.0.73.87    10.130.0.0/23                  ["10.0.73.50"]
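
The hostsubnet egress IPs shown above were assigned manually, presumably with patches along these lines (a sketch; node names and IPs are taken from the table above):

$ oc patch hostsubnet ip-10-0-73-87.us-east-2.compute.internal --type=merge -p '{"egressIPs": ["10.0.73.50"]}'
$ oc patch hostsubnet ip-10-0-57-103.us-east-2.compute.internal --type=merge -p '{"egressIPs": ["10.0.57.50"]}'
$ oc patch hostsubnet ip-10-0-67-247.us-east-2.compute.internal --type=merge -p '{"egressIPs": ["10.0.67.50"]}'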

$ oc patch netnamespace test3 --type=merge -p '{"egressIPs": ["10.0.73.50","10.0.57.50","10.0.67.50"]}'
netnamespace.network.openshift.io/test3 patched

$ oc get cloudprivateipconfigs  
NAME          AGE
10.0.51.100   7h30m
10.0.57.50    6s
10.0.67.21    36m
10.0.67.50    6s
10.0.73.100   7m33s
10.0.73.50    6s
$ oc patch netnamespace  test3  --type=merge -p '{"egressIPs": []}'
netnamespace.network.openshift.io/test3 patched
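
To see whether the controller actually releases the addresses after the patch, the objects can be watched while it runs (a sketch; -w streams updates until interrupted):

$ oc get cloudprivateipconfigs -w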

10.0.67.50 and 10.0.57.50 were left:
$  oc get cloudprivateipconfigs  
NAME          AGE
10.0.51.100   7h44m
10.0.57.50    13m
10.0.67.21    50m
10.0.67.50    13m
10.0.73.100   21m

$ oc get pods -n openshift-cloud-network-config-controller
NAME                                              READY   STATUS             RESTARTS         AGE
cloud-network-config-controller-6999cd7db-l8bjh   0/1     CrashLoopBackOff   29 (2m48s ago)   8h

$ oc logs cloud-network-config-controller-6999cd7db-l8bjh -n  openshift-cloud-network-config-controller
W0110 09:22:46.563873       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0110 09:22:46.564492       1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock...
I0110 09:22:46.578052       1 leaderelection.go:258] successfully acquired lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock
I0110 09:22:46.578575       1 controller.go:88] Starting node controller
I0110 09:22:46.578585       1 controller.go:91] Waiting for informer caches to sync for node workqueue
I0110 09:22:46.578622       1 controller.go:88] Starting cloud-private-ip-config controller
I0110 09:22:46.578669       1 controller.go:91] Waiting for informer caches to sync for cloud-private-ip-config workqueue
I0110 09:22:46.578623       1 controller.go:88] Starting secret controller
I0110 09:22:46.578720       1 controller.go:91] Waiting for informer caches to sync for secret workqueue
I0110 09:22:46.581035       1 controller.go:182] Assigning key: 10.0.67.21 to cloud-private-ip-config workqueue
I0110 09:22:46.581054       1 controller.go:182] Assigning key: 10.0.73.100 to cloud-private-ip-config workqueue
I0110 09:22:46.581060       1 controller.go:182] Assigning key: 10.0.51.100 to cloud-private-ip-config workqueue
I0110 09:22:46.583295       1 controller.go:182] Assigning key: ip-10-0-51-186.us-east-2.compute.internal to node workqueue
I0110 09:22:46.583311       1 controller.go:182] Assigning key: ip-10-0-57-103.us-east-2.compute.internal to node workqueue
I0110 09:22:46.583316       1 controller.go:182] Assigning key: ip-10-0-57-202.us-east-2.compute.internal to node workqueue
I0110 09:22:46.583318       1 controller.go:182] Assigning key: ip-10-0-67-247.us-east-2.compute.internal to node workqueue
I0110 09:22:46.583322       1 controller.go:182] Assigning key: ip-10-0-71-99.us-east-2.compute.internal to node workqueue
I0110 09:22:46.583325       1 controller.go:182] Assigning key: ip-10-0-73-87.us-east-2.compute.internal to node workqueue
I0110 09:22:46.678714       1 controller.go:96] Starting node workers
I0110 09:22:46.678747       1 controller.go:102] Started node workers
I0110 09:22:46.678783       1 controller.go:160] Dropping key 'ip-10-0-51-186.us-east-2.compute.internal' from the node workqueue
I0110 09:22:46.678788       1 controller.go:160] Dropping key 'ip-10-0-57-103.us-east-2.compute.internal' from the node workqueue
I0110 09:22:46.678803       1 controller.go:160] Dropping key 'ip-10-0-67-247.us-east-2.compute.internal' from the node workqueue
I0110 09:22:46.678806       1 controller.go:160] Dropping key 'ip-10-0-71-99.us-east-2.compute.internal' from the node workqueue
I0110 09:22:46.678809       1 controller.go:160] Dropping key 'ip-10-0-73-87.us-east-2.compute.internal' from the node workqueue
I0110 09:22:46.678818       1 controller.go:160] Dropping key 'ip-10-0-57-202.us-east-2.compute.internal' from the node workqueue
I0110 09:22:46.678842       1 controller.go:96] Starting secret workers
I0110 09:22:46.678852       1 controller.go:102] Started secret workers
I0110 09:22:46.678859       1 controller.go:96] Starting cloud-private-ip-config workers
I0110 09:22:46.678877       1 controller.go:102] Started cloud-private-ip-config workers
I0110 09:22:46.681301       1 controller.go:160] Dropping key '10.0.51.100' from the cloud-private-ip-config workqueue
I0110 09:22:46.681302       1 controller.go:160] Dropping key '10.0.73.100' from the cloud-private-ip-config workqueue
I0110 09:22:46.682331       1 controller.go:160] Dropping key '10.0.67.21' from the cloud-private-ip-config workqueue
I0110 09:22:56.395401       1 controller.go:182] Assigning key: 10.0.57.50 to cloud-private-ip-config workqueue
I0110 09:22:56.395446       1 controller.go:182] Assigning key: 10.0.67.50 to cloud-private-ip-config workqueue
I0110 09:22:56.398059       1 cloudprivateipconfig_controller.go:257] CloudPrivateIPConfig: "10.0.67.50" will be added to node: "ip-10-0-67-247.us-east-2.compute.internal"
I0110 09:22:56.398304       1 cloudprivateipconfig_controller.go:257] CloudPrivateIPConfig: "10.0.57.50" will be added to node: "ip-10-0-57-103.us-east-2.compute.internal"
I0110 09:22:56.399266       1 controller.go:182] Assigning key: 10.0.73.50 to cloud-private-ip-config workqueue
I0110 09:22:56.401326       1 cloudprivateipconfig_controller.go:257] CloudPrivateIPConfig: "10.0.73.50" will be added to node: "ip-10-0-73-87.us-east-2.compute.internal"
I0110 09:22:56.406525       1 cloudprivateipconfig_controller.go:281] Adding finalizer to CloudPrivateIPConfig: "10.0.67.50"
I0110 09:22:56.408671       1 cloudprivateipconfig_controller.go:281] Adding finalizer to CloudPrivateIPConfig: "10.0.57.50"
I0110 09:22:56.408884       1 cloudprivateipconfig_controller.go:281] Adding finalizer to CloudPrivateIPConfig: "10.0.73.50"
I0110 09:22:57.607979       1 cloudprivateipconfig_controller.go:338] Added IP address to node: "ip-10-0-67-247.us-east-2.compute.internal" for CloudPrivateIPConfig: "10.0.67.50"
I0110 09:22:57.738806       1 cloudprivateipconfig_controller.go:338] Added IP address to node: "ip-10-0-57-103.us-east-2.compute.internal" for CloudPrivateIPConfig: "10.0.57.50"
I0110 09:22:57.799929       1 controller.go:160] Dropping key '10.0.67.50' from the cloud-private-ip-config workqueue
I0110 09:22:58.002283       1 cloudprivateipconfig_controller.go:338] Added IP address to node: "ip-10-0-73-87.us-east-2.compute.internal" for CloudPrivateIPConfig: "10.0.73.50"
I0110 09:22:58.202889       1 controller.go:160] Dropping key '10.0.57.50' from the cloud-private-ip-config workqueue
I0110 09:22:58.601868       1 controller.go:160] Dropping key '10.0.73.50' from the cloud-private-ip-config workqueue
I0110 09:23:20.643913       1 controller.go:182] Assigning key: 10.0.73.50 to cloud-private-ip-config workqueue
I0110 09:23:20.702755       1 controller.go:182] Assigning key: 10.0.57.50 to cloud-private-ip-config workqueue
I0110 09:23:20.702780       1 controller.go:182] Assigning key: 10.0.67.50 to cloud-private-ip-config workqueue
I0110 09:23:20.703333       1 cloudprivateipconfig_controller.go:174] CloudPrivateIPConfig: "10.0.73.50" will be deleted from node: "ip-10-0-73-87.us-east-2.compute.internal"
I0110 09:23:20.704984       1 cloudprivateipconfig_controller.go:174] CloudPrivateIPConfig: "10.0.57.50" will be deleted from node: "ip-10-0-57-103.us-east-2.compute.internal"
I0110 09:23:20.706223       1 cloudprivateipconfig_controller.go:174] CloudPrivateIPConfig: "10.0.67.50" will be deleted from node: "ip-10-0-67-247.us-east-2.compute.internal"
I0110 09:23:20.711438       1 controller.go:182] Assigning key: 10.0.73.50 to cloud-private-ip-config workqueue
I0110 09:23:20.711461       1 controller.go:182] Assigning key: 10.0.73.50 to cloud-private-ip-config workqueue
I0110 09:23:20.712551       1 controller.go:182] Assigning key: 10.0.57.50 to cloud-private-ip-config workqueue
I0110 09:23:20.712600       1 controller.go:182] Assigning key: 10.0.57.50 to cloud-private-ip-config workqueue
I0110 09:23:20.715140       1 controller.go:182] Assigning key: 10.0.67.50 to cloud-private-ip-config workqueue
I0110 09:23:20.715158       1 controller.go:182] Assigning key: 10.0.67.50 to cloud-private-ip-config workqueue
I0110 09:23:21.214815       1 cloudprivateipconfig_controller.go:228] CloudPrivateIPConfig: 10.0.73.50 object has been marked for complete deletion
I0110 09:23:21.214833       1 cloudprivateipconfig_controller.go:235] Cleaning up IP address and finalizer for CloudPrivateIPConfig: "10.0.73.50", deleting it completely
I0110 09:23:21.224238       1 controller.go:160] Dropping key '10.0.73.50' from the cloud-private-ip-config workqueue
I0110 09:23:21.224288       1 controller.go:182] Assigning key: 10.0.73.50 to cloud-private-ip-config workqueue
I0110 09:23:21.225985       1 cloudprivateipconfig_controller.go:405] CloudPrivateIPConfig: "10.0.73.50" in work queue no longer exists
E0110 09:23:21.226056       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 130 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1f7dca0, 0x3913df0})
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x7d
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000973e10})
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
panic({0x1f7dca0, 0x3913df0})
	/usr/lib/golang/src/runtime/panic.go:1038 +0x215
github.com/openshift/cloud-network-config-controller/pkg/controller/cloudprivateipconfig.(*CloudPrivateIPConfigController).SyncHandler(0xc0002a6580, {0xc00059a916, 0xa})
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/cloudprivateipconfig/cloudprivateipconfig_controller.go:165 +0x57
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem.func1(0xc0002a51a0, {0x1db1440, 0xc000973e10})
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:152 +0x126
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem(0xc0002a51a0)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:162 +0x46
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).runWorker(...)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:113
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f66fdf6dff8)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0, {0x26b4260, 0xc0004c2f60}, 0x1, 0xc00009c120)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0, 0x3b9aca00, 0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x25
created by github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).Run
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:99 +0x398
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x1b744b7]

goroutine 130 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000973e10})
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0xd8
panic({0x1f7dca0, 0x3913df0})
	/usr/lib/golang/src/runtime/panic.go:1038 +0x215
github.com/openshift/cloud-network-config-controller/pkg/controller/cloudprivateipconfig.(*CloudPrivateIPConfigController).SyncHandler(0xc0002a6580, {0xc00059a916, 0xa})
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/cloudprivateipconfig/cloudprivateipconfig_controller.go:165 +0x57
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem.func1(0xc0002a51a0, {0x1db1440, 0xc000973e10})
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:152 +0x126
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem(0xc0002a51a0)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:162 +0x46
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).runWorker(...)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:113
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f66fdf6dff8)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0, {0x26b4260, 0xc0004c2f60}, 0x1, 0xc00009c120)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0, 0x3b9aca00, 0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x25
created by github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).Run
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:99 +0x398


Actual results:
The cloud-network-config-controller pod is in CrashLoopBackOff, and the leftover cloudprivateipconfigs (10.0.57.50 and 10.0.67.50) are never removed.

Expected results:
The cloud-network-config-controller should not crash, and the corresponding cloudprivateipconfigs should be removed once their egress IPs are removed from the namespace.

Additional info:

Comment 2 Alexander Constantinescu 2022-01-10 11:59:48 UTC

*** This bug has been marked as a duplicate of bug 2034144 ***

Comment 3 Alexander Constantinescu 2022-01-10 12:54:58 UTC
Re-opening. The bug I referenced is partially caused by the same problem, but having this report explicitly capture this particular issue makes things clearer.

Comment 10 errata-xmlrpc 2022-03-10 16:38:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056