Bug 1781345
| Field | Value |
|---|---|
| Summary | OCP 4.3: Azure - ingress operator degraded and worker nodes go NotReady when deploying 250 pause-pods per node |
| Product | OpenShift Container Platform |
| Reporter | Walid A. <wabouham> |
| Component | Networking |
| Assignee | Dan Mace <dmace> |
| Networking sub component | router |
| QA Contact | Walid A. <wabouham> |
| Status | CLOSED NOTABUG |
| Docs Contact | |
| Severity | high |
| Priority | unspecified |
| CC | aos-bugs, dhansen, mifiedle, mmasters |
| Version | 4.3.0 |
| Target Milestone | --- |
| Target Release | 4.4.0 |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | |
| Fixed In Version | |
| Doc Type | If docs needed, set a value |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| Clones | 1781948 (view as bug list) |
| Environment | |
| Last Closed | 2020-01-27 17:36:53 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
Description
Walid A.
2019-12-09 20:05:11 UTC
The "failed to sync ingresscontroller status" errors are spurious. In fact, the ingresscontroller status is being updated and reporting, "Deployment does not have minimum availability." I made https://github.com/openshift/cluster-ingress-operator/pull/337 to fix the spurious "failed to sync" errors. As for why the deployment cannot achieve minimum availability, note this event in the openshift-ingress namespace: "0/7 nodes are available: 2 Insufficient cpu, 3 Insufficient pods, 4 node(s) didn't match node selector." That is the most recent event in the namespace. Could the cluster simply be too small to run 250 pods? If the cluster should be able to handle 250 pods, we can clone this Bugzilla report and use the new report to track the "failed to sync" errors. Otherwise, if the "failed to sync" errors are the only real issue, we can use this report to track it. I went ahead and cloned to bug 1781950 for the spurious "failed to sync" errors. Correction: cloned to bug 1781948 for 4.4.0, then cloned that to bug 1781950 for 4.3.0. Setting bz to needinfo to get feedback from the question in https://bugzilla.redhat.com/show_bug.cgi?id=1781345#c2. In response to comment 5 and original comment https://bugzilla.redhat.com/show_bug.cgi?id=1781345#c2: The cluster could be too small to run 250 pods per node, event though the pods deployed (gcr.io/google_containers/pause-amd64:3.0 pods) do not have memory or cpu requests. There are other system pods on the worker nodes with memory and cpu requests. Also there's a half CPU core (500 millicore) reserved by kubelet on the worker nodes, leaving 1500 millicore avail per worker node on the Standard_D2s_v3 (8GB mem and 2 vCPUs) VMs. I repeated the node vertical tests on a more recent IPI Azure cluster on build 4.3.0-0.nightly-2020-01-16-123848 with larger worker instances Standard_D4s_v3 (16GB memory and 4 vCPUs). With the larger instances I was able to deploy the max 250 pods per node successfully, and also all nodes remained Ready, with no degraded operators. # oc describe node -l node-role.kubernetes.io/worker= | grep memory ; oc describe node -l node-role.kubernetes.io/worker= | grep cpu MemoryPressure False Fri, 17 Jan 2020 22:06:31 +0000 Fri, 17 Jan 2020 19:17:12 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available memory: 16398024Ki memory: 15783624Ki memory 1797Mi (11%) 537Mi (3%) MemoryPressure False Fri, 17 Jan 2020 22:06:30 +0000 Fri, 17 Jan 2020 19:16:11 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available memory: 16398032Ki memory: 15783632Ki memory 2915Mi (18%) 587Mi (3%) MemoryPressure False Fri, 17 Jan 2020 22:06:42 +0000 Fri, 17 Jan 2020 19:16:12 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available memory: 16398032Ki memory: 15783632Ki memory 3813Mi (24%) 587Mi (3%) cpu: 4 cpu: 3500m cpu 740m (21%) 100m (2%) cpu: 4 cpu: 3500m cpu 1120m (32%) 300m (8%) cpu: 4 cpu: 3500m cpu 1460m (41%) 300m (8%) Note: one worker node is already at 1460m in CPU requests which is close the 1500m max available on the smaller Standard_D2s_v3 (2 vCPUs). 
# oc describe node -l node-role.kubernetes.io/worker= | grep "Non-terminated Pods:"
  Non-terminated Pods:   (250 in total)
  Non-terminated Pods:   (250 in total)
  Non-terminated Pods:   (250 in total)

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2020-01-16-123848   True        False         False      163m
cloud-credential                           4.3.0-0.nightly-2020-01-16-123848   True        False         False      3h10m
cluster-autoscaler                         4.3.0-0.nightly-2020-01-16-123848   True        False         False      173m
console                                    4.3.0-0.nightly-2020-01-16-123848   True        False         False      165m
dns                                        4.3.0-0.nightly-2020-01-16-123848   True        False         False      179m
image-registry                             4.3.0-0.nightly-2020-01-16-123848   True        False         False      169m
ingress                                    4.3.0-0.nightly-2020-01-16-123848   True        False         False      169m
insights                                   4.3.0-0.nightly-2020-01-16-123848   True        False         False      3h8m
kube-apiserver                             4.3.0-0.nightly-2020-01-16-123848   True        False         False      3h
kube-controller-manager                    4.3.0-0.nightly-2020-01-16-123848   True        False         False      179m
kube-scheduler                             4.3.0-0.nightly-2020-01-16-123848   True        False         False      179m
machine-api                                4.3.0-0.nightly-2020-01-16-123848   True        False         False      179m
machine-config                             4.3.0-0.nightly-2020-01-16-123848   True        False         False      176m
marketplace                                4.3.0-0.nightly-2020-01-16-123848   True        False         False      176m
monitoring                                 4.3.0-0.nightly-2020-01-16-123848   True        False         False      151m
network                                    4.3.0-0.nightly-2020-01-16-123848   True        False         False      3h2m
node-tuning                                4.3.0-0.nightly-2020-01-16-123848   True        False         False      176m
openshift-apiserver                        4.3.0-0.nightly-2020-01-16-123848   True        False         False      175m
openshift-controller-manager               4.3.0-0.nightly-2020-01-16-123848   True        False         False      3h
openshift-samples                          4.3.0-0.nightly-2020-01-16-123848   True        False         False      172m
operator-lifecycle-manager                 4.3.0-0.nightly-2020-01-16-123848   True        False         False      179m
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2020-01-16-123848   True        False         False      179m
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2020-01-16-123848   True        False         False      177m
service-ca                                 4.3.0-0.nightly-2020-01-16-123848   True        False         False      3h8m
service-catalog-apiserver                  4.3.0-0.nightly-2020-01-16-123848   True        False         False      178m
service-catalog-controller-manager         4.3.0-0.nightly-2020-01-16-123848   True        False         False      176m
storage                                    4.3.0-0.nightly-2020-01-16-123848   True        False         False      177m

Closing since https://github.com/openshift/cluster-ingress-operator/pull/337 fixed the spurious "failed to sync" errors and https://bugzilla.redhat.com/show_bug.cgi?id=1781345#c6 confirmed the original cluster did not have enough resources to support 250 pods per node.
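For anyone reproducing the scenario, the load behind this report amounts to scheduling a large number of pause pods with no resource requests onto the worker nodes. A minimal sketch, assuming a throwaway namespace and the standard worker node label (the namespace, names, and replica count are illustrative, not the exact node-vertical test tooling):

```sh
# Deploy pause pods with no resource requests onto the worker nodes.
# Scale the replica count toward 250 pods per worker to exercise the
# default per-node pod limit.
oc new-project pause-test
oc apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause-pods
  namespace: pause-test
spec:
  replicas: 250
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      nodeSelector:
        node-role.kubernetes.io/worker: ""
      containers:
      - name: pause
        image: gcr.io/google_containers/pause-amd64:3.0
        # Intentionally no resources.requests/limits: scheduling is then
        # bounded by the node's pod capacity and by the requests of the
        # system pods already running on the node.
EOF
```

Because the pause pods request nothing themselves, the limiting factors are the per-node pod capacity (250 by default) and the CPU and memory already requested by system pods, which is why the Standard_D2s_v3 workers ran out of headroom while the Standard_D4s_v3 workers did not.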