Bug 1787581

Summary: OCP 4.2.12: ingress and network operators degraded after upgrade to 4.3
Product: OpenShift Container Platform
Reporter: Walid A. <wabouham>
Component: Networking
Assignee: Alexander Constantinescu <aconstan>
Networking sub component: openshift-sdn
QA Contact: huirwang
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
CC: aos-bugs, bbennett, bparees, ccoleman, huirwang, jokerman, lmohanty, mifiedle, scuppett, wking, zzhao
Version: 4.2.z
Target Milestone: ---
Target Release: 4.4.0
Hardware: x86_64
OS: Linux
Type: Bug
Bug Blocks: 1787635
Last Closed: 2020-05-04 11:22:00 UTC
Attachments: cluster-loader.py config file

Description Walid A. 2020-01-03 14:15:26 UTC
Description of problem:
Started with an IPI AWS 4.2.12 cluster with 3 master and 3 worker nodes (m5.xlarge). Deployed 10 projects for each of the 7 quickstart apps (cakephp-mysql, dancer-mysql, django-postgresql, nodejs-mongodb, rails-postgresql, eap71-mysql, tomcat8-mongodb), 70 projects in total, with no memory or CPU reservations, using our python scripts. Then proceeded with the upgrade to 4.3.0-0.nightly-2020-01-02-141332.

After the upgrade completed, the ingress operator degraded first, followed later by the network, monitoring, and image-registry operators.

After several hours one worker node was NotReady.

Events in openshift-ingress show the system is running out of IP addresses:

# oc get events -n openshift-ingress
LAST SEEN   TYPE      REASON                   OBJECT                               MESSAGE
4m23s       Warning   FailedCreatePodSandBox   pod/router-default-6957c7f94-5bcqq   (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_router-default-6957c7f94-5bcqq_openshift-ingress_cbd05364-ef26-4524-8b21-a09d731a0292_0(df00fe2d53e899ec7a0cd068a0414070c087113a356f86e7518b100479348fc9): Multus: error adding pod to network "openshift-sdn": delegateAdd: error invoking DelegateAdd - "openshift-sdn": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to run IPAM for df00fe2d53e899ec7a0cd068a0414070c087113a356f86e7518b100479348fc9: failed to run CNI IPAM ADD: failed to allocate for range 0: no IP addresses available in range set: 10.128.2.1-10.128.3.254
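One way to confirm the exhaustion on the suspect node is to count the host-local IPAM allocation files directly. This is a hypothetical diagnostic, not from the original report: it assumes node access via `oc debug node/<node>` with the host root mounted at /host, and that each file named for an IP under the IPAM state directory is one allocated pod address (the range 10.128.2.1-10.128.3.254 holds 510 usable addresses):

```shell
# Count host-local IPAM allocation files on this node. The directory also holds
# bookkeeping files (e.g. last-reserved-IP markers), so match only names that
# look like dotted-quad IPs.
ALLOC_DIR=/host/var/lib/cni/networks/openshift-sdn
allocated=$(ls "$ALLOC_DIR" | grep -cE '^[0-9]+(\.[0-9]+){3}$')
echo "IPAM allocations on this node: $allocated (range holds 510)"
```

If that count is near 510 while `oc get pods -A -o wide | grep <node> | wc -l` reports far fewer pods, the gap is leaked allocations rather than real pod density.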

Version-Release number of selected component (if applicable):
Before upgrade:
# oc version
Client Version: openshift-clients-4.3.0-201910250623-42-gc276ecb7
Server Version: 4.2.12
Kubernetes Version: v1.14.6+32dc4a0

After upgrade:
# oc version
Client Version: openshift-clients-4.2.2-201910250432-8-g98a84c61
Server Version: 4.3.0-0.nightly-2020-01-02-141332
Kubernetes Version: v1.16.2


How reproducible:
Once so far

Steps to Reproduce:
1.  IPI AWS install of OCP 4.2.12
2.  git clone https://github.com/openshift/svt.git
3.  cd svt/openshift_scalability/config/ ; edit all-quickstarts-no-limits.yaml and increase the number of projects from 1 to 10
4.  cd svt/openshift_scalability ; ./cluster-loader.py -vf config/all-quickstarts-no-limits.yaml
5.  Wait for all the apps to get deployed in all 70 projects
6.  Proceed with upgrade:
    oc patch clusterversion/version --patch '{"spec":{"upstream":"https://openshift-release.svc.ci.openshift.org/graph"}}' --type=merge
    oc adm upgrade --force=true --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-01-02-141332 --allow-explicit-upgrade
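Not part of the original procedure, but while the upgrade runs a small polling loop like this can record when each operator first reports Degraded=True (it assumes the column layout of `oc get co --no-headers`, where column 5 is DEGRADED):

```shell
# Hypothetical monitor: once a minute, timestamp and print any cluster operator
# whose DEGRADED column (field 5 of `oc get co --no-headers`) reads True.
while sleep 60; do
  date -u +%FT%TZ
  oc get co --no-headers | awk '$5 == "True" {print "  DEGRADED:", $1}'
done
```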

Actual results: 
ingress and network operators are degraded
# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2020-01-02-141332   True        False         True       8h
cloud-credential                           4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
cluster-autoscaler                         4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
console                                    4.3.0-0.nightly-2020-01-02-141332   True        False         False      141m
dns                                        4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
image-registry                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      4m2s
ingress                                    4.3.0-0.nightly-2020-01-02-141332   False       True          True       67m
insights                                   4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
kube-apiserver                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
kube-controller-manager                    4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
kube-scheduler                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
machine-api                                4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
machine-config                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
marketplace                                4.3.0-0.nightly-2020-01-02-141332   True        False         False      134m
monitoring                                 4.3.0-0.nightly-2020-01-02-141332   False       True          True       67m
network                                    4.3.0-0.nightly-2020-01-02-141332   True        True          False      8h
node-tuning                                4.3.0-0.nightly-2020-01-02-141332   True        False         False      137m
openshift-apiserver                        4.3.0-0.nightly-2020-01-02-141332   True        False         False      134m
openshift-controller-manager               4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
openshift-samples                          4.3.0-0.nightly-2020-01-02-141332   True        False         False      167m
operator-lifecycle-manager                 4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2020-01-02-141332   True        False         False      138m
service-ca                                 4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
service-catalog-apiserver                  4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
service-catalog-controller-manager         4.3.0-0.nightly-2020-01-02-141332   True        False         False      8h
storage                                    4.3.0-0.nightly-2020-01-02-141332   True        False         False      166m

After several hours:
--------------------
# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
cloud-credential                           4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
cluster-autoscaler                         4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
console                                    4.3.0-0.nightly-2020-01-02-141332   True        False         False      18h
dns                                        4.3.0-0.nightly-2020-01-02-141332   True        True          False      24h
image-registry                             4.3.0-0.nightly-2020-01-02-141332   False       True          False      14h
ingress                                    4.3.0-0.nightly-2020-01-02-141332   False       True          True       14h
insights                                   4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
kube-apiserver                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
kube-controller-manager                    4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
kube-scheduler                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
machine-api                                4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
machine-config                             4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
marketplace                                4.3.0-0.nightly-2020-01-02-141332   True        False         False      18h
monitoring                                 4.3.0-0.nightly-2020-01-02-141332   False       True          True       17h
network                                    4.3.0-0.nightly-2020-01-02-141332   True        True          True       24h
node-tuning                                4.3.0-0.nightly-2020-01-02-141332   True        False         False      18h
openshift-apiserver                        4.3.0-0.nightly-2020-01-02-141332   True        False         False      18h
openshift-controller-manager               4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
openshift-samples                          4.3.0-0.nightly-2020-01-02-141332   True        False         False      18h
operator-lifecycle-manager                 4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2020-01-02-141332   True        False         False      5h26m
service-ca                                 4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
service-catalog-apiserver                  4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
service-catalog-controller-manager         4.3.0-0.nightly-2020-01-02-141332   True        False         False      24h
storage                                    4.3.0-0.nightly-2020-01-02-141332   True        False         False      18h


Expected results:
All cluster operators should be available and not degraded

Additional info:
Links to must-gather logs and oc logs will be provided in the next comment

Comment 2 Ben Bennett 2020-01-03 18:07:42 UTC
Looks like we are not cleaning up something in the IPAM code.  I don't see enough pods in the logs to exhaust the range.

Comment 3 Ben Bennett 2020-01-06 15:52:09 UTC
Weibin, can you try to reproduce this?  There are insufficient logs to work out what the problem really is, so it would help to have a broken cluster we can dissect.

Comment 4 Ben Bennett 2020-01-07 23:25:59 UTC
*** Bug 1788683 has been marked as a duplicate of this bug. ***

Comment 5 Clayton Coleman 2020-01-08 00:59:56 UTC
Setting priority appropriately; note that we had a CI failure in the dupe bug

Comment 7 Stephen Cuppett 2020-01-08 13:20:42 UTC
Going to set this to 4.4 and the clone to 4.3 (currently 4.3.z)

Comment 10 Alexander Constantinescu 2020-01-09 10:25:50 UTC
*** Bug 1789248 has been marked as a duplicate of this bug. ***

Comment 12 Ben Bennett 2020-01-09 20:00:22 UTC
It looks like OpenShift can leak IP address allocations when a node reboots since we don't get called by Kubelet for all of the pods that have gone away.

The fix will be to remove the contents of /var/lib/cni/networks/openshift-sdn/ on a reboot.
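Pending that fix, a manual cleanup along these lines could reclaim the leaked addresses on an affected node. This is a hedged sketch, not a tested procedure: the path comes from the comment above, while the quiescing steps and `oc debug` access (host root at /host) are assumptions:

```shell
# Sketch only: clear stale host-local IPAM state on one affected node.
# Run from `oc debug node/<node>`.
chroot /host /bin/bash <<'EOF'
systemctl stop kubelet                          # quiesce pod churn first
rm -rf /var/lib/cni/networks/openshift-sdn/*    # drop per-IP allocation files
systemctl start kubelet                         # new sandboxes re-allocate cleanly
EOF
```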

Comment 14 Walid A. 2020-01-10 20:07:44 UTC
Created attachment 1651387 [details]
cluster-loader.py config file

Comment 17 errata-xmlrpc 2020-05-04 11:22:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581