Description of problem:

Started with an IPI AWS 4.2.12 cluster with 3 worker nodes and 3 masters (m5.xlarge). Deployed 10 projects for each of the 7 quickstart apps (cakephp-mysql, dancer-mysql, django-postgresql, nodejs-mongodb, rails-postgresql, eap71-mysql, tomcat8-mongodb), 70 projects in total, with no memory or CPU reservations, using our python scripts. Then proceeded with the upgrade to 4.3.0-0.nightly-2020-01-02-141332. After the upgrade completed, the ingress operator degraded first, followed later by networking, monitoring and image-registry. After several hours one worker node was NotReady.

Events in openshift-ingress show the system is running out of IP addresses:

# oc get events -n openshift-ingress
LAST SEEN   TYPE      REASON                   OBJECT                               MESSAGE
4m23s       Warning   FailedCreatePodSandBox   pod/router-default-6957c7f94-5bcqq   (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_router-default-6957c7f94-5bcqq_openshift-ingress_cbd05364-ef26-4524-8b21-a09d731a0292_0(df00fe2d53e899ec7a0cd068a0414070c087113a356f86e7518b100479348fc9): Multus: error adding pod to network "openshift-sdn": delegateAdd: error invoking DelegateAdd - "openshift-sdn": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to run IPAM for df00fe2d53e899ec7a0cd068a0414070c087113a356f86e7518b100479348fc9: failed to run CNI IPAM ADD: failed to allocate for range 0: no IP addresses available in range set: 10.128.2.1-10.128.3.254'

Version-Release number of selected component (if applicable):

Before upgrade:
# oc version
Client Version: openshift-clients-4.3.0-201910250623-42-gc276ecb7
Server Version: 4.2.12
Kubernetes Version: v1.14.6+32dc4a0

After upgrade:
# oc version
Client Version: openshift-clients-4.2.2-201910250432-8-g98a84c61
Server Version: 4.3.0-0.nightly-2020-01-02-141332
Kubernetes Version: v1.16.2

How reproducible:
Once so far

Steps to Reproduce:
1. IPI AWS install of OCP 4.2.12
2. git clone https://github.com/openshift/svt.git
3. cd svt/openshift_scalability/config/ ; edit all-quickstarts-no-limits.yaml and increase the number of projects from 1 to 10
4. cd svt/openshift_scalability ; ./cluster-loader.py -vf config/all-quickstarts-no-limits.yaml
5. Wait for all the apps to get deployed in all 70 projects
6. Proceed with the upgrade:

oc patch clusterversion/version --patch '{"spec":{"upstream":"https://openshift-release.svc.ci.openshift.org/graph"}}' --type=merge
oc adm upgrade --force=true --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-01-02-141332 --allow-explicit-upgrade

Actual results:

ingress and network operators are degraded

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2020-01-02-141332   True        False         True       8h
cloud-credential                           4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
cluster-autoscaler                         4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
console                                    4.3.0-0.nightly-2020-01-02-141332   True        False         False      141m
dns                                        4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
image-registry                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      4m2s
ingress                                    4.3.0-0.nightly-2020-01-02-141332   False       True          True       67m
insights                                   4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
kube-apiserver                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
kube-controller-manager                    4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
kube-scheduler                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
machine-api                                4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
machine-config                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
marketplace                                4.3.0-0.nightly-2020-01-02-141332   True        False         False      134m
monitoring                                 4.3.0-0.nightly-2020-01-02-141332   False       True          True       67m
network                                    4.3.0-0.nightly-2020-01-02-141332   True        True          False      8h
node-tuning                                4.3.0-0.nightly-2020-01-02-141332   True        False         False      137m
openshift-apiserver                        4.3.0-0.nightly-2020-01-02-141332   True        False         False      134m
openshift-controller-manager               4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
openshift-samples                          4.3.0-0.nightly-2020-01-02-141332   True        False         False      167m
operator-lifecycle-manager                 4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2020-01-02-141332   True        False         False      138m
service-ca                                 4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
service-catalog-apiserver                  4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
service-catalog-controller-manager         4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
storage                                    4.3.0-0.nightly-2020-01-02-141332   True        False         False      166m

After several hours:
--------------------
# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
cloud-credential                           4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
cluster-autoscaler                         4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
console                                    4.3.0-0.nightly-2020-01-02-141332   True        False         False      18h
dns                                        4.3.0-0.nightly-2020-01-02-141332   True        True          False      24h
image-registry                             4.3.0-0.nightly-2020-01-02-141332   False       True          False      14h
ingress                                    4.3.0-0.nightly-2020-01-02-141332   False       True          True       14h
insights                                   4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
kube-apiserver                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
kube-controller-manager                    4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
kube-scheduler                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
machine-api                                4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
machine-config                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
marketplace                                4.3.0-0.nightly-2020-01-02-141332   True        False         False      18h
monitoring                                 4.3.0-0.nightly-2020-01-02-141332   False       True          True       17h
network                                    4.3.0-0.nightly-2020-01-02-141332   True        True          True       24h
node-tuning                                4.3.0-0.nightly-2020-01-02-141332   True        False         False      18h
openshift-apiserver                        4.3.0-0.nightly-2020-01-02-141332   True        False         False      18h
openshift-controller-manager               4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
openshift-samples                          4.3.0-0.nightly-2020-01-02-141332   True        False         False      18h
operator-lifecycle-manager                 4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2020-01-02-141332   True        False         False      5h26m
service-ca                                 4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
service-catalog-apiserver                  4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
service-catalog-controller-manager         4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
storage                                    4.3.0-0.nightly-2020-01-02-141332   True        False         False      18h

Expected results:
All cluster operators should be available and not degraded

Additional info:
Links to must-gather logs and oc logs will be provided in the next comment
Looks like we are not cleaning up something in the IPAM code. I don't see enough pods in the logs to exhaust the range.
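As a rough way to confirm a leak (a diagnostic sketch added for illustration, not from the original report; <affected-node> is a placeholder and the exact contents of the host-local IPAM directory can vary slightly between releases), compare the number of reservation files on the affected node with the number of pod sandboxes actually running there. On the node, e.g. via "oc debug node/<affected-node>" followed by "chroot /host", run:

ls /var/lib/cni/networks/openshift-sdn/ | wc -l    # roughly one file per reserved pod IP, plus a couple of bookkeeping files
crictl pods --state ready -q | wc -l               # pod sandboxes CRI-O actually has running on this node

If the first count is far higher than the second, the node is holding stale reservations and will eventually exhaust its /23 pod range, which would match the "no IP addresses available in range set" errors above.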
Weibin, can you try to reproduce this? There are insufficient logs to work out what the problem really is, so it would help to have a broken cluster we can dissect.
*** Bug 1788683 has been marked as a duplicate of this bug. ***
Setting priority appropriately; note that we had a CI failure in the dupe bug.
Going to set this to 4.4 and the clone to 4.3 (currently 4.3.z)
*** Bug 1789248 has been marked as a duplicate of this bug. ***
It looks like OpenShift can leak IP address allocations when a node reboots, since the kubelet does not call us for all of the pods that have gone away. The fix will be to remove the contents of /var/lib/cni/networks/openshift-sdn/ on a reboot.
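For illustration only, a minimal sketch of that kind of boot-time cleanup expressed as a systemd oneshot unit (hypothetical; the real fix lands in openshift-sdn itself and may be wired up differently):

[Unit]
Description=Flush stale openshift-sdn host-local IPAM reservations from before the reboot
# Must run before the container runtime and kubelet start creating new sandboxes.
Before=crio.service kubelet.service

[Service]
Type=oneshot
# After a reboot none of the pre-reboot sandboxes exist any more, so every
# reservation file left in this directory is stale and safe to remove.
ExecStart=/bin/sh -c 'rm -rf /var/lib/cni/networks/openshift-sdn/*'

[Install]
WantedBy=multi-user.target

The sketch is only meant to make the "clean on reboot" behaviour concrete; the shipped fix would deliver the equivalent cleanup through the SDN components rather than a hand-installed unit.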
Created attachment 1651387 [details] cluster-loader.py config file
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581