Description of problem:

Started with an IPI AWS 4.2.12 cluster with 3 worker nodes and 3 masters (m5.xlarge). Deployed 10 projects for each of the 7 quickstart apps (cakephp-mysql, dancer-mysql, django-postgresql, nodejs-mongodb, rails-postgresql, eap71-mysql, tomcat8-mongodb), 70 projects in total, with no memory or CPU reservations, using our python scripts. Then proceeded with the upgrade to 4.3.0-0.nightly-2020-01-02-141332. After the upgrade completed, the ingress operator degraded first, followed later by networking, monitoring and image-registry. After several hours one worker node was NotReady.

Events in openshift-ingress show the system is running out of IP addresses:

# oc get events -n openshift-ingress
LAST SEEN   TYPE      REASON                   OBJECT                               MESSAGE
4m23s       Warning   FailedCreatePodSandBox   pod/router-default-6957c7f94-5bcqq   (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_router-default-6957c7f94-5bcqq_openshift-ingress_cbd05364-ef26-4524-8b21-a09d731a0292_0(df00fe2d53e899ec7a0cd068a0414070c087113a356f86e7518b100479348fc9): Multus: error adding pod to network "openshift-sdn": delegateAdd: error invoking DelegateAdd - "openshift-sdn": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to run IPAM for df00fe2d53e899ec7a0cd068a0414070c087113a356f86e7518b100479348fc9: failed to run CNI IPAM ADD: failed to allocate for range 0: no IP addresses available in range set: 10.128.2.1-10.128.3.254'

Version-Release number of selected component (if applicable):

Before upgrade:
# oc version
Client Version: openshift-clients-4.3.0-201910250623-42-gc276ecb7
Server Version: 4.2.12
Kubernetes Version: v1.14.6+32dc4a0

After upgrade:
# oc version
Client Version: openshift-clients-4.2.2-201910250432-8-g98a84c61
Server Version: 4.3.0-0.nightly-2020-01-02-141332
Kubernetes Version: v1.16.2

How reproducible:
Once so far

Steps to Reproduce:
1. IPI AWS install of OCP 4.2.12
2. git clone https://github.com/openshift/svt.git
3. cd svt/openshift_scalability/config/ ; edit all-quickstarts-no-limits.yaml and increase the number of projects from 1 to 10
4. cd svt/openshift_scalability ; ./cluster-loader.py -vf config/all-quickstarts-no-limits.yaml
5. Wait for all the apps to get deployed in all 70 projects
6. Proceed with the upgrade:

oc patch clusterversion/version --patch '{"spec":{"upstream":"https://openshift-release.svc.ci.openshift.org/graph"}}' --type=merge
oc adm upgrade --force=true --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-01-02-141332 --allow-explicit-upgrade

Actual results:

ingress and network operators are degraded

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2020-01-02-141332   True        False         True       8h
cloud-credential                           4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
cluster-autoscaler                         4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
console                                    4.3.0-0.nightly-2020-01-02-141332   True        False         False      141m
dns                                        4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
image-registry                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      4m2s
ingress                                    4.3.0-0.nightly-2020-01-02-141332   False       True          True       67m
insights                                   4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
kube-apiserver                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
kube-controller-manager                    4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
kube-scheduler                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
machine-api                                4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
machine-config                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
marketplace                                4.3.0-0.nightly-2020-01-02-141332   True        False         False      134m
monitoring                                 4.3.0-0.nightly-2020-01-02-141332   False       True          True       67m
network                                    4.3.0-0.nightly-2020-01-02-141332   True        True          False      8h
node-tuning                                4.3.0-0.nightly-2020-01-02-141332   True        False         False      137m
openshift-apiserver                        4.3.0-0.nightly-2020-01-02-141332   True        False         False      134m
openshift-controller-manager               4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
openshift-samples                          4.3.0-0.nightly-2020-01-02-141332   True        False         False      167m
operator-lifecycle-manager                 4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2020-01-02-141332   True        False         False      138m
service-ca                                 4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
service-catalog-apiserver                  4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
service-catalog-controller-manager         4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
storage                                    4.3.0-0.nightly-2020-01-02-141332   True        False         False      166m

After several hours:
--------------------
# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
cloud-credential                           4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
cluster-autoscaler                         4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
console                                    4.3.0-0.nightly-2020-01-02-141332   True        False         False      18h
dns                                        4.3.0-0.nightly-2020-01-02-141332   True        True          False      24h
image-registry                             4.3.0-0.nightly-2020-01-02-141332   False       True          False      14h
ingress                                    4.3.0-0.nightly-2020-01-02-141332   False       True          True       14h
insights                                   4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
kube-apiserver                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
kube-controller-manager                    4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
kube-scheduler                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
machine-api                                4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
machine-config                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
marketplace                                4.3.0-0.nightly-2020-01-02-141332   True        False         False      18h
monitoring                                 4.3.0-0.nightly-2020-01-02-141332   False       True          True       17h
network                                    4.3.0-0.nightly-2020-01-02-141332   True        True          True       24h
node-tuning                                4.3.0-0.nightly-2020-01-02-141332   True        False         False      18h
openshift-apiserver                        4.3.0-0.nightly-2020-01-02-141332   True        False         False      18h
openshift-controller-manager               4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
openshift-samples                          4.3.0-0.nightly-2020-01-02-141332   True        False         False      18h
operator-lifecycle-manager                 4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2020-01-02-141332   True        False         False      5h26m
service-ca                                 4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
service-catalog-apiserver                  4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
service-catalog-controller-manager         4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
storage                                    4.3.0-0.nightly-2020-01-02-141332   True        False         False      18h

Expected results:
All cluster operators should be available and not degraded

Additional info:
Links to must-gather logs and oc logs will be provided in the next comment
Looks like we are not cleaning up something in the IPAM code. I don't see enough pods in the logs to exhaust the range.
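As a rough way to confirm a leak (a diagnostic sketch added for illustration, not from the original report; <affected-node> is a placeholder and the exact contents of the host-local IPAM directory can vary slightly between releases), compare the number of reservation files on the affected node with the number of pod sandboxes actually running there. On the node, e.g. via "oc debug node/<affected-node>" followed by "chroot /host", run:

ls /var/lib/cni/networks/openshift-sdn/ | wc -l    # roughly one file per reserved pod IP, plus a couple of bookkeeping files
crictl pods --state ready -q | wc -l               # pod sandboxes CRI-O actually has running on this node

If the first count is far higher than the second, the node is holding stale reservations and will eventually exhaust its /23 pod range, which would match the "no IP addresses available in range set" errors above.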
Weibin, can you try to reproduce this? There are insufficient logs to work out what the problem really is, so it would help to have a broken cluster we can dissect.
*** Bug 1788683 has been marked as a duplicate of this bug. ***
Setting priority appropriately; note that we had a CI failure in the dupe bug.
Going to set this to 4.4 and the clone to 4.3 (currently 4.3.z)
*** Bug 1789248 has been marked as a duplicate of this bug. ***
It looks like OpenShift can leak IP address allocations when a node reboots, since the kubelet does not call us for all of the pods that have gone away. The fix will be to remove the contents of /var/lib/cni/networks/openshift-sdn/ on a reboot.
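For illustration only, a minimal sketch of that kind of boot-time cleanup expressed as a systemd oneshot unit (hypothetical; the real fix lands in openshift-sdn itself and may be wired up differently):

[Unit]
Description=Flush stale openshift-sdn host-local IPAM reservations from before the reboot
# Must run before the container runtime and kubelet start creating new sandboxes.
Before=crio.service kubelet.service

[Service]
Type=oneshot
# After a reboot none of the pre-reboot sandboxes exist any more, so every
# reservation file left in this directory is stale and safe to remove.
ExecStart=/bin/sh -c 'rm -rf /var/lib/cni/networks/openshift-sdn/*'

[Install]
WantedBy=multi-user.target

The sketch is only meant to make the "clean on reboot" behaviour concrete; the shipped fix would deliver the equivalent cleanup through the SDN components rather than a hand-installed unit.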
Created attachment 1651387 [details] cluster-loader.py config file
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581