Bug 1781345

Summary: OCP 4.3: Azure - ingress operator degraded and worker nodes go NotReady when deploying 250 pause-pods per node
Product: OpenShift Container Platform
Component: Networking
Sub component: router
Version: 4.3.0
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
Target Milestone: ---
Target Release: 4.4.0
Reporter: Walid A. <wabouham>
Assignee: Dan Mace <dmace>
QA Contact: Walid A. <wabouham>
CC: aos-bugs, dhansen, mifiedle, mmasters
Doc Type: If docs needed, set a value
Story Points: ---
Cloned to: 1781948
Type: Bug
Last Closed: 2020-01-27 17:36:53 UTC

Description Walid A. 2019-12-09 20:05:11 UTC
Description of problem:
This is on a 4.3 OCP IPI-installed cluster on Azure.  When trying to run the node-vertical test, which deploys up to 250 gcr.io/google_containers/pause-amd64:3.0 pods per worker node in a single namespace, the ingress operator became degraded and 2 worker nodes went NotReady.
This cluster is FIPS-enabled and uses the SDN network type.

root@ip-172-31-40-229: ~/openshift-scale/workloads/workloads # oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2019-12-09-035405   True        False         False      148m
cloud-credential                           4.3.0-0.nightly-2019-12-09-035405   True        False         False      170m
cluster-autoscaler                         4.3.0-0.nightly-2019-12-09-035405   True        False         False      161m
console                                    4.3.0-0.nightly-2019-12-09-035405   True        False         False      156m
dns                                        4.3.0-0.nightly-2019-12-09-035405   True        False         False      166m
image-registry                             4.3.0-0.nightly-2019-12-09-035405   True        False         False      17m
ingress                                    4.3.0-0.nightly-2019-12-09-035405   False       True          True       23m
insights                                   4.3.0-0.nightly-2019-12-09-035405   True        False         False      167m
kube-apiserver                             4.3.0-0.nightly-2019-12-09-035405   True        False         False      165m
kube-controller-manager                    4.3.0-0.nightly-2019-12-09-035405   True        False         False      164m
kube-scheduler                             4.3.0-0.nightly-2019-12-09-035405   True        False         False      163m
machine-api                                4.3.0-0.nightly-2019-12-09-035405   True        False         False      166m
machine-config                             4.3.0-0.nightly-2019-12-09-035405   True        False         False      161m
marketplace                                4.3.0-0.nightly-2019-12-09-035405   True        False         False      162m
monitoring                                 4.3.0-0.nightly-2019-12-09-035405   False       True          True       22m
network                                    4.3.0-0.nightly-2019-12-09-035405   True        True          True       165m
node-tuning                                4.3.0-0.nightly-2019-12-09-035405   True        False         False      162m
openshift-apiserver                        4.3.0-0.nightly-2019-12-09-035405   True        False         False      161m
openshift-controller-manager               4.3.0-0.nightly-2019-12-09-035405   True        False         False      165m
openshift-samples                          4.3.0-0.nightly-2019-12-09-035405   True        False         False      161m
operator-lifecycle-manager                 4.3.0-0.nightly-2019-12-09-035405   True        False         False      166m
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2019-12-09-035405   True        False         False      166m
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2019-12-09-035405   True        False         False      162m
service-ca                                 4.3.0-0.nightly-2019-12-09-035405   True        False         False      167m
service-catalog-apiserver                  4.3.0-0.nightly-2019-12-09-035405   True        False         False      164m
service-catalog-controller-manager         4.3.0-0.nightly-2019-12-09-035405   True        False         False      164m
storage                                    4.3.0-0.nightly-2019-12-09-035405   True        False         False      162m
root@ip-172-31-40-229: ~/openshift-scale/workloads/workloads # 


In the openshift-ingress-operator logs, I am seeing:

2019-12-09T15:48:01.016Z        ERROR   operator.init.controller-runtime.controller     controller/controller.go:218    Reconciler error        {"controller": "ingress_controller", "request": "openshift-ingress-operator/default", "error": "failed to sync ingresscontroller status: IngressController is degraded", "errorCauses": [{"error": "failed to sync ingresscontroller status: IngressController is degraded"}]}
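
For reference, the ingresscontroller status conditions and the full operator logs behind this error can be pulled with commands along these lines (standard oc commands; the deployment and container names below are the ones from a default 4.x install):

# oc -n openshift-ingress-operator get ingresscontroller default -o yaml
# oc -n openshift-ingress-operator logs deployment/ingress-operator -c ingress-operator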


In the openshift-ingress namespace events, I am seeing:
34m         Warning   UpdateLoadBalancerFailed   service/router-default                Error updating load balancer with new hosts map[wal43fp-2wbz2-master-0:{} wal43fp-2wbz2-master-1:{} wal43fp-2wbz2-master-2:{} wal43fp-2wbz2-worker-centralus1-mt8nl:{} wal43fp-2wbz2-worker-centralus3-2fm9n:{} wal43fp-2wbz2-workload-centralus1-spnpc:{}]: ensure(openshift-ingress/router-default): backendPoolID(/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wal43fp-2wbz2-rg/providers/Microsoft.Network/loadBalancers/wal43fp-2wbz2/backendAddressPools/wal43fp-2wbz2) - failed to ensure host in pool: "azure - cloud provider rate limited(read) for operation:NicGet"
19m         Warning   UpdateLoadBalancerFailed   service/router-default                Error updating load balancer with new hosts map[wal43fp-2wbz2-master-0:{} wal43fp-2wbz2-master-1:{} wal43fp-2wbz2-master-2:{} wal43fp-2wbz2-worker-centralus3-2fm9n:{} wal43fp-2wbz2-workload-centralus1-spnpc:{}]: azure - cloud provider rate limited(read) for operation:NSGGet
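
These events can be listed, most recent last, with something like:

# oc get events -n openshift-ingress --sort-by=.lastTimestamp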


Version-Release number of selected component (if applicable):
# oc version
Client Version: openshift-clients-4.3.0-201910250623-70-g0ed83003
Server Version: 4.3.0-0.nightly-2019-12-09-035405
Kubernetes Version: v1.16.2

How reproducible:
Twice so far on two different clusters

Steps to Reproduce:
1. IPI install of OCP 4.3.0-0.nightly-2019-12-09-035405 on Azure.
workers: Standard_D2s_v3
masters: Standard_D16s_v3.
We use a Jenkins job to run the openshift-install.
2. Deploy containerized tooling from https://github.com/openshift-scale/scale-ci-deploy/blob/master/OCP-4.X/install-on-aws.yml.
3. Run the nodevertical workload, which deploys gcr.io/google_containers pause pods on each of the 3 worker nodes until each node reaches the maximum of 250 total pods, all in a single namespace:

VIPERCONFIG=/tmp/nodevertical.yaml openshift-tests run-test "[Feature:Performance][Serial][Slow] Load cluster should load the cluster [Suite:openshift]" | tee "${result_dir}/clusterloader.txt"

I can provide the exact steps, as they are very detailed.
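
For anyone without the scale-ci tooling, a rough manual approximation of the same load (not the exact clusterloader workload; the project and deployment names below are just examples) would be something like:

# oc new-project pause-test
# oc create deployment pause --image=gcr.io/google_containers/pause-amd64:3.0
# oc scale deployment/pause --replicas=750
# oc get pods -o wide | grep -c Running

Scale the replica count until each worker approaches 250 total pods (including the existing system pods), since that is what the nodevertical test drives toward.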


Actual results:
Google pause-pods get deployed on the 3 worker nodes, the ingress operator becomes degraded, and 2 out of 3 worker nodes go NotReady.

Expected results:
All 3 worker nodes should end up with a total of 250 pods per node (the existing system pods plus enough google pause pods to make up the difference), with all nodes remaining Ready and no operators degraded.

Additional info:
must-gather logs, along with oc get events and oc describe node output, are located in the next private comment.

Comment 2 Miciah Dashiel Butler Masters 2019-12-10 20:05:32 UTC
The "failed to sync ingresscontroller status" errors are spurious.  In fact, the ingresscontroller status is being updated and reporting, "Deployment does not have minimum availability."  I made https://github.com/openshift/cluster-ingress-operator/pull/337 to fix the spurious "failed to sync" errors.

As for why the deployment cannot achieve minimum availability, note this event in the openshift-ingress namespace: "0/7 nodes are available: 2 Insufficient cpu, 3 Insufficient pods, 4 node(s) didn't match node selector."  That is the most recent event in the namespace.  Could the cluster simply be too small to run 250 pods?
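
For reference, per-node allocatable pod, CPU, and memory capacity can be checked with something like:

# oc get nodes -o custom-columns=NAME:.metadata.name,PODS:.status.allocatable.pods,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory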

If the cluster should be able to handle 250 pods, we can clone this Bugzilla report and use the new report to track the "failed to sync" errors.  Otherwise, if the "failed to sync" errors are the only real issue, we can use this report to track it.

Comment 3 Miciah Dashiel Butler Masters 2019-12-10 23:25:16 UTC
I went ahead and cloned to bug 1781950 for the spurious "failed to sync" errors.

Comment 4 Miciah Dashiel Butler Masters 2019-12-10 23:28:12 UTC
Correction: cloned to bug 1781948 for 4.4.0, then cloned that to bug 1781950 for 4.3.0.

Comment 5 Daneyon Hansen 2020-01-06 18:47:45 UTC
Setting the BZ to needinfo to get feedback on the question in https://bugzilla.redhat.com/show_bug.cgi?id=1781345#c2.

Comment 6 Walid A. 2020-01-17 22:25:06 UTC
In response to comment 5 and original comment https://bugzilla.redhat.com/show_bug.cgi?id=1781345#c2:

The cluster could be too small to run 250 pods per node, even though the deployed pods (gcr.io/google_containers/pause-amd64:3.0) do not have memory or cpu requests.  There are other system pods on the worker nodes that do have memory and cpu requests.  Also, half a CPU core (500 millicores) is reserved by the kubelet on the worker nodes, leaving 1500 millicores available per worker node on the Standard_D2s_v3 (8 GB memory and 2 vCPUs) VMs.
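
As a quick check of that math, per-node capacity vs. allocatable CPU can be compared with something like (the node name is a placeholder):

# oc get node <worker-node-name> -o jsonpath='{.status.capacity.cpu}{" capacity / "}{.status.allocatable.cpu}{" allocatable"}{"\n"}'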

I repeated the node-vertical tests on a more recent IPI Azure cluster, on build 4.3.0-0.nightly-2020-01-16-123848, with larger Standard_D4s_v3 worker instances (16 GB memory and 4 vCPUs).
With the larger instances I was able to deploy the maximum of 250 pods per node successfully; all nodes remained Ready and no operators were degraded.

# oc describe node -l node-role.kubernetes.io/worker= | grep memory ; oc describe node -l node-role.kubernetes.io/worker= | grep cpu
  MemoryPressure   False   Fri, 17 Jan 2020 22:06:31 +0000   Fri, 17 Jan 2020 19:17:12 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
 memory:                         16398024Ki
 memory:                         15783624Ki
  memory                         1797Mi (11%)  537Mi (3%)
  MemoryPressure   False   Fri, 17 Jan 2020 22:06:30 +0000   Fri, 17 Jan 2020 19:16:11 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
 memory:                         16398032Ki
 memory:                         15783632Ki
  memory                         2915Mi (18%)  587Mi (3%)
  MemoryPressure   False   Fri, 17 Jan 2020 22:06:42 +0000   Fri, 17 Jan 2020 19:16:12 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
 memory:                         16398032Ki
 memory:                         15783632Ki
  memory                         3813Mi (24%)  587Mi (3%)
 cpu:                            4
 cpu:                            3500m
  cpu                            740m (21%)    100m (2%)
 cpu:                            4
 cpu:                            3500m
  cpu                            1120m (32%)   300m (8%)
 cpu:                            4
 cpu:                            3500m
  cpu                            1460m (41%)   300m (8%)


Note: one worker node is already at 1460m in CPU requests, which is close to the 1500m maximum available on the smaller Standard_D2s_v3 (2 vCPUs).


# oc describe node -l node-role.kubernetes.io/worker= | grep "Non-terminated Pods:"
Non-terminated Pods:                      (250 in total)
Non-terminated Pods:                      (250 in total)
Non-terminated Pods:                      (250 in total)
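
The same count can be cross-checked per node without oc describe, e.g. (the node name is a placeholder; this counts all pods on the node, including any completed ones):

# oc get pods --all-namespaces -o wide --field-selector spec.nodeName=<worker-node-name> --no-headers | wc -l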

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2020-01-16-123848   True        False         False      163m
cloud-credential                           4.3.0-0.nightly-2020-01-16-123848   True        False         False      3h10m
cluster-autoscaler                         4.3.0-0.nightly-2020-01-16-123848   True        False         False      173m
console                                    4.3.0-0.nightly-2020-01-16-123848   True        False         False      165m
dns                                        4.3.0-0.nightly-2020-01-16-123848   True        False         False      179m
image-registry                             4.3.0-0.nightly-2020-01-16-123848   True        False         False      169m
ingress                                    4.3.0-0.nightly-2020-01-16-123848   True        False         False      169m
insights                                   4.3.0-0.nightly-2020-01-16-123848   True        False         False      3h8m
kube-apiserver                             4.3.0-0.nightly-2020-01-16-123848   True        False         False      3h
kube-controller-manager                    4.3.0-0.nightly-2020-01-16-123848   True        False         False      179m
kube-scheduler                             4.3.0-0.nightly-2020-01-16-123848   True        False         False      179m
machine-api                                4.3.0-0.nightly-2020-01-16-123848   True        False         False      179m
machine-config                             4.3.0-0.nightly-2020-01-16-123848   True        False         False      176m
marketplace                                4.3.0-0.nightly-2020-01-16-123848   True        False         False      176m
monitoring                                 4.3.0-0.nightly-2020-01-16-123848   True        False         False      151m
network                                    4.3.0-0.nightly-2020-01-16-123848   True        False         False      3h2m
node-tuning                                4.3.0-0.nightly-2020-01-16-123848   True        False         False      176m
openshift-apiserver                        4.3.0-0.nightly-2020-01-16-123848   True        False         False      175m
openshift-controller-manager               4.3.0-0.nightly-2020-01-16-123848   True        False         False      3h
openshift-samples                          4.3.0-0.nightly-2020-01-16-123848   True        False         False      172m
operator-lifecycle-manager                 4.3.0-0.nightly-2020-01-16-123848   True        False         False      179m
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2020-01-16-123848   True        False         False      179m
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2020-01-16-123848   True        False         False      177m
service-ca                                 4.3.0-0.nightly-2020-01-16-123848   True        False         False      3h8m
service-catalog-apiserver                  4.3.0-0.nightly-2020-01-16-123848   True        False         False      178m
service-catalog-controller-manager         4.3.0-0.nightly-2020-01-16-123848   True        False         False      176m
storage                                    4.3.0-0.nightly-2020-01-16-123848   True        False         False      177m

Comment 7 Daneyon Hansen 2020-01-27 17:36:53 UTC
Closing since https://github.com/openshift/cluster-ingress-operator/pull/337 fixed the spurious "failed to sync" errors and https://bugzilla.redhat.com/show_bug.cgi?id=1781345#c6 confirmed the original cluster did not have enough resources to support 250 pods/node.