Bug 1802544
| Summary: | The default worker count cannot get all the monitoring pods to the Running state with the default IPI installation settings | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> |
| Component: | Node | Assignee: | Ryan Phillips <rphillips> |
| Status: | CLOSED ERRATA | QA Contact: | Sunil Choudhary <schoudha> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.4 | CC: | aos-bugs, jialiu, jokerman, sdodson, yanyang |
| Target Milestone: | --- | Keywords: | Regression, Reopened |
| Target Release: | 4.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-04 11:36:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Seems reasonable to mark this as a dupe of 1803239 then?

*** This bug has been marked as a duplicate of bug 1803239 ***

Yep. Thanks!

QE caught this issue from a black-box testing perspective the first time around. Now that the PR has been reverted, the bug is in effect again, so it is moved back to ON_QA.

Verified this bug with 4.4.0-0.nightly-2020-02-17-211020, and it passed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
Description of problem:

Using "openshift-install create install-config" to create a 4.4 AWS cluster, the default master count is 3 (instance type m4.xlarge: 4 CPU / 16Gi memory) and the default worker count is 3 (instance type m4.large: 2 CPU / 8Gi memory).

install-config:

```yaml
apiVersion: v1
baseDomain: qe.devcluster.openshift.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
metadata:
  creationTimestamp: null
  name: xx
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  aws:
    region: us-east-2
publish: External
```

Because the systemReserved CPU/memory was increased (see https://github.com/openshift/machine-config-operator/commit/b811616049d7990c70fcfd56ff1d5b746b1a1121), an m4.large worker has only 2000m - 768m = 1232m allocatable CPU.

Each prometheus-k8s pod requests 480m CPU, so if no node has 480m CPU left, the second prometheus-k8s pod cannot be scheduled and the installation reports:

```
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=error msg="Cluster operator monitoring Degraded is True with UpdatingPrometheusK8SFailed: Failed to rollout the stack. Error: running task Updating Prometheus-k8s failed: waiting for Prometheus object changes failed: waiting for Prometheus: expected 2 replicas, updated 1 and available 1"
```

```
# oc -n openshift-monitoring get pod | grep prometheus-k8s
prometheus-k8s-0   7/7   Running   1   165m   10.129.2.7   ip-10-0-60-1.us-east-2.compute.internal   <none>   <none>
prometheus-k8s-1   0/7   Pending   0   159m   <none>       <none>                                    <none>   <none>

# oc -n openshift-monitoring describe pod prometheus-k8s-1
Type     Reason            Age        From               Message
----     ------            ----       ----               -------
Warning  FailedScheduling  <unknown>  default-scheduler  0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.
Warning  FailedScheduling  <unknown>  default-scheduler  0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.
```

But each worker only has 1232m allocatable CPU, and after the CPU already requested by the other pods is accounted for, no worker has 480m left for the prometheus-k8s-1 pod: 932m + 480m, 1192m + 480m, and 1212m + 480m all exceed 1232m. The per-node totals are shown below, after a sketch of how to compute them.
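One way to compute those per-node totals directly (a hypothetical sketch, not taken from this report: it assumes all CPU requests use the millicore form shown above, and counting only Running pods is only an approximation of the scheduler's accounting):

```sh
# Hypothetical diagnostic sketch: for each worker, compare allocatable CPU
# with the CPU already requested by the pods running on it. Assumes every
# CPU request uses the millicore form ("200m").
for node in $(oc get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[*].metadata.name}'); do
  alloc=$(oc get node "$node" -o jsonpath='{.status.allocatable.cpu}')
  requested=$(oc get pods --all-namespaces \
      --field-selector spec.nodeName="$node",status.phase=Running \
      -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.resources.requests.cpu}{"\n"}{end}{end}' \
    | sed 's/m$//' | awk '{sum += $1} END {print sum}')
  echo "$node: allocatable=$alloc requested=${requested}m"
done
```

Against the cluster above, this should report roughly the 1192m/1212m/932m request totals seen in the `oc describe node` output that follows; any worker already above 752m (1232m - 480m) of requests cannot host the second replica, and here all three are.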
```
# for i in $(oc get node | grep worker | awk '{print $1}'); do echo $i; oc describe node $i | tail; done
ip-10-0-60-1.us-east-2.compute.internal
  openshift-sdn               sdn-nc8zv     100m (8%)     0 (0%)      200Mi (2%)    0 (0%)    3h16m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1192m (96%)   300m (24%)
  memory                      2679Mi (39%)  587Mi (8%)
  ephemeral-storage           0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
Events:                       <none>

ip-10-0-64-143.us-east-2.compute.internal
  openshift-sdn               sdn-pxk6r     100m (8%)     0 (0%)      200Mi (2%)    0 (0%)    3h16m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1212m (98%)   100m (8%)
  memory                      2837Mi (41%)  537Mi (7%)
  ephemeral-storage           0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
Events:                       <none>

ip-10-0-75-97.us-east-2.compute.internal
  openshift-sdn               sdn-l4bw8     100m (8%)     0 (0%)      200Mi (2%)    0 (0%)    3h16m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         932m (75%)    100m (8%)
  memory                      1951Mi (28%)  537Mi (7%)
  ephemeral-storage           0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
Events:                       <none>

# for i in $(oc get node | grep worker | awk '{print $1}'); do echo $i; oc get node $i -o jsonpath="{.status.allocatable.cpu}"; echo -e "\n"; done
ip-10-0-60-1.us-east-2.compute.internal
1232m

ip-10-0-64-143.us-east-2.compute.internal
1232m

ip-10-0-75-97.us-east-2.compute.internal
1232m

# kubectl -n openshift-monitoring get pod prometheus-k8s-1 -o go-template='{{range.spec.containers}}{{"Container Name: "}}{{.name}}{{"\r\nresources: "}}{{.resources}}{{"\n"}}{{end}}'
Container Name: prometheus
resources: map[requests:map[cpu:200m memory:1Gi]]
Container Name: prometheus-config-reloader
resources: map[limits:map[cpu:100m memory:25Mi] requests:map[cpu:100m memory:25Mi]]
Container Name: rules-configmap-reloader
resources: map[limits:map[cpu:100m memory:25Mi] requests:map[cpu:100m memory:25Mi]]
Container Name: thanos-sidecar
resources: map[requests:map[cpu:50m memory:100Mi]]
Container Name: prometheus-proxy
resources: map[requests:map[cpu:10m memory:20Mi]]
Container Name: kube-rbac-proxy
resources: map[requests:map[cpu:10m memory:20Mi]]
Container Name: prom-label-proxy
resources: map[requests:map[cpu:10m memory:20Mi]]
```

(The requests sum to 200m + 100m + 100m + 50m + 10m + 10m + 10m = 480m, matching the 480m per-pod figure above.)

Version-Release number of the following components:
4.4.0-0.nightly-2020-02-12-211301

How reproducible:
Always

Steps to Reproduce:
1. Run "openshift-install create install-config" with the default settings, then create the cluster.

Actual results:
Cluster monitoring is degraded.

Expected results:
The cluster should install cleanly, with all monitoring pods Running.

Additional info:
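As a quick cross-check of the 480m figure (a hypothetical sketch, not from the report; it assumes every CPU request of the pod uses the millicore form shown in the listing above):

```sh
# Hypothetical cross-check: sum the per-container CPU requests of the pending
# prometheus-k8s-1 pod. Assumes every request uses the millicore form ("200m").
oc -n openshift-monitoring get pod prometheus-k8s-1 \
  -o jsonpath='{range .spec.containers[*]}{.resources.requests.cpu}{"\n"}{end}' \
  | sed 's/m$//' | awk '{sum += $1} END {print sum "m"}'
# prints: 480m
```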