+++ This bug was initially created as a clone of Bug #2004632 +++

Description of problem:

During scale testing, I saw pods periodically created and destroyed, then stuck in a DEL loop. Example output from the whereabouts (WB) debug log:

    19:46:12Z [debug] ADD
    20:02:58Z [debug] DEL
    20:03:59Z [debug] DEL
    20:04:59Z [debug] DEL
    20:05:59Z [debug] DEL
    etc.

If the pod is not created within ~5 minutes, kubelet kills the pod because it does not detect that a sandbox status has been populated:

    Sep 10 19:29:33 ip-10-0-165-101 hyperkube[1430]: I0910 19:29:33.549539 1430 kuberuntime_manager.go:483] "No sandbox for pod can be found. Need to start a new one" pod="default/whereabouts-scale-test-748g8"
    Sep 10 19:29:34 ip-10-0-165-101 hyperkube[1430]: I0910 19:29:34.550139 1430 kuberuntime_manager.go:483] "No sandbox for pod can be found. Need to start a new one" pod="default/whereabouts-scale-test-7zqzr"

The old WB instance remains active. If a pod requesting an IP via whereabouts does not start within ~5 minutes, another connection to the backend is added while the old connection persists. See the attached output from the WB log file during the allocate-IP phase: all instances were elected as leader at the same time, and all were trying to allocate an IP for the pod.

Version-Release number of selected component (if applicable):

How reproducible:
At scale: with more than 600 pods that have a WB-managed IP, you will see this.

Steps to Reproduce:
1. Use this script to set up the test, which causes congestion and a long leader-election (LE) period for pods: https://github.com/maiqueb/whereabouts-stopwatch/pull/5
2. Cordon a node.
3. Uncordon the node.
4. Schedule a new pod requesting an IP via WB.
5. Check whether the pod landed on the node that was previously uncordoned; if not, use a nodeSelector and try again.
6. Wait more than 5 minutes and confirm the pod has not received an IP.
7. On the node, use ps to view the WB instances. There should be at least two instances of WB.

Actual results:
Multiple WB instances for a single pod.

Expected results:
One WB instance per pod.

Additional info:

--- Additional comment from Martin Kennelly on 2021-09-16 14:01:13 UTC ---

I think I see the issue in the code, in pkg/storage/kubernetes/ipam.go:

    go func() {
        defer wg.Done()
        ctx, cancel := context.WithCancel(context.Background())
        res := make(chan error)

        go func() {
            logging.Debugf("Started leader election")
            le.Run(ctx)
            logging.Debugf("Finished leader election")
            res <- nil
        }()

The leader election never ends and needs a timeout; it should use context.WithTimeout().

--- Additional comment from Martin Kennelly on 2021-09-20 11:29:29 UTC ---

Here is a possible implementation of a fix: https://github.com/k8snetworkplumbingwg/whereabouts/pull/142
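For illustration, here is a minimal, self-contained sketch of the timeout approach described in the comments above. It is not the actual whereabouts change (see the PR linked above for that); runLeaderElection, the standard library log package, and the leaderElectionTimeout value are stand-ins assumed for this example, not names or values from the whereabouts code.

    package main

    import (
        "context"
        "log"
        "sync"
        "time"
    )

    // runLeaderElection stands in for le.Run(ctx) from the excerpt above; in the
    // real code this is client-go leader election, which blocks until its context
    // is cancelled.
    func runLeaderElection(ctx context.Context) {
        <-ctx.Done()
    }

    func main() {
        // Illustrative deadline only; not the value chosen in the actual fix.
        const leaderElectionTimeout = 30 * time.Second

        var wg sync.WaitGroup
        wg.Add(1)

        go func() {
            defer wg.Done()

            // WithTimeout instead of WithCancel: the context is cancelled
            // automatically once the deadline passes, so a congested leader
            // election cannot block this goroutine forever.
            ctx, cancel := context.WithTimeout(context.Background(), leaderElectionTimeout)
            defer cancel()

            res := make(chan error, 1)
            go func() {
                log.Println("Started leader election")
                runLeaderElection(ctx)
                log.Println("Finished leader election")
                res <- nil
            }()

            select {
            case <-ctx.Done():
                log.Printf("leader election did not complete within %v", leaderElectionTimeout)
            case err := <-res:
                if err != nil {
                    log.Printf("leader election error: %v", err)
                }
            }
        }()

        wg.Wait()
    }

With context.WithTimeout, the context is cancelled automatically when the deadline expires, so a congested election eventually terminates and the stale WB instance can exit instead of lingering as an extra process holding a backend connection.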
Tested and verified in 4.9.0-0.nightly-2021-11-03-043308
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.9.6 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:4119