Bug 1802332
| Summary: | CI: occasional: 'failed to initialize the cluster: Cluster operator monitoring is still updating' with insufficient compute CPU | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Node | Assignee: | Ryan Phillips <rphillips> |
| Status: | CLOSED DUPLICATE | QA Contact: | Sunil Choudhary <schoudha> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.4 | CC: | aos-bugs, eparis, jokerman |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-02-24 20:01:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | Pod CPU request differences between 4.3 and 4.4 tips (attachment 1662829) | | |
Description
W. Trevor King
2020-02-12 22:44:48 UTC
Created attachment 1662829: Pod CPU request differences between 4.3 and 4.4 tips

In case it turns out to be an over-request issue, I'm attaching the changes in pod CPU requests between recent 4.3 and 4.4 release images. Generated with:

```
$ diff -U0 <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/818/artifacts/e2e-aws/pods.json | jq -r '.items[] | .metadata.namespace as $namespace | .spec.containers[] | $namespace + " " + .name + " " + .resources.requests.cpu' | sort | uniq | awk 'BEGIN{a = ""}{if ($1 != a) {print ""}; a = $1; print $0}') <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.4/1177/artifacts/e2e-aws/pods.json | jq -r '.items[] | .metadata.namespace as $namespace | .spec.containers[] | $namespace + " " + .name + " " + .resources.requests.cpu' | sort | uniq | awk 'BEGIN{a = ""}{if ($1 != a) {print ""}; a = $1; print $0}')
```

Consumption on one of the compute nodes in the CI job I picked in comment 0:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/125/pull-ci-openshift-cluster-etcd-operator-master-e2e-azure/420/artifacts/e2e-azure/pods.json | jq -r '.items[] | select(.spec.nodeName == "ci-op-wj40jzv7-52597-7qblc-worker-eastus22-nqdqk") | .metadata.namespace as $namespace | .metadata.name as $name | .status.phase as $phase | .spec.containers[] | $phase + " " + $namespace + " " + $name + " " + .name + " " + .resources.requests.cpu'
Running openshift-cluster-node-tuning-operator tuned-2pzdm tuned 10m
Running openshift-csi-snapshot-controller-operator csi-snapshot-controller-operator-5555c45855-hhfts operator 10m
Running openshift-dns dns-default-57kkz dns 100m
Running openshift-dns dns-default-57kkz dns-node-resolver 10m
Running openshift-image-registry image-registry-698d4b86f8-62pvp registry 100m
Running openshift-image-registry node-ca-qd2pl node-ca 10m
Running openshift-ingress router-default-5479579b9f-hwk9g router 100m
Running openshift-kube-storage-version-migrator migrator-78dcfcb958-x44ff migrator 100m
Running openshift-machine-config-operator machine-config-daemon-j4mm6 machine-config-daemon 20m
Running openshift-machine-config-operator machine-config-daemon-j4mm6 oauth-proxy 20m
Running openshift-marketplace certified-operators-7b56bc5f66-7gfjp certified-operators 10m
Running openshift-marketplace community-operators-58f88b46bd-xfv8p community-operators 10m
Running openshift-marketplace redhat-marketplace-6f68bc576f-t7hbf redhat-marketplace 10m
Running openshift-marketplace redhat-operators-58dcd77c84-9nnwx redhat-operators 10m
Running openshift-monitoring alertmanager-main-2 alertmanager
Running openshift-monitoring alertmanager-main-2 config-reloader 100m
Running openshift-monitoring alertmanager-main-2 alertmanager-proxy 10m
Running openshift-monitoring kube-state-metrics-b57d78654-7zcz4 kube-rbac-proxy-main 10m
Running openshift-monitoring kube-state-metrics-b57d78654-7zcz4 kube-rbac-proxy-self 10m
Running openshift-monitoring kube-state-metrics-b57d78654-7zcz4 kube-state-metrics 10m
Running openshift-monitoring node-exporter-f52v8 node-exporter 102m
Running openshift-monitoring node-exporter-f52v8 kube-rbac-proxy 10m
Running openshift-monitoring openshift-state-metrics-68747b76d9-6cvhh kube-rbac-proxy-main 10m
Running openshift-monitoring openshift-state-metrics-68747b76d9-6cvhh kube-rbac-proxy-self 10m
Running openshift-monitoring openshift-state-metrics-68747b76d9-6cvhh openshift-state-metrics 100m
Running openshift-monitoring prometheus-adapter-5c6d59c87d-2d9rn prometheus-adapter 10m
Running openshift-monitoring prometheus-adapter-5c6d59c87d-lz7r5 prometheus-adapter 10m
Running openshift-monitoring telemeter-client-7f759bd4b-qgk4p telemeter-client
Running openshift-monitoring telemeter-client-7f759bd4b-qgk4p reload
Running openshift-monitoring telemeter-client-7f759bd4b-qgk4p kube-rbac-proxy 10m
Running openshift-multus multus-9rw56 kube-multus 10m
Running openshift-sdn ovs-9psjp openvswitch 200m
Running openshift-sdn sdn-dw4v9 sdn 100m
```

So that's 1232m. Ah, is the node status.allocatable.cpu really "how much we've already handed out" instead of "how much we have left"? But even if it is, and we've handed out 1232m of the full 2 CPU, we should have 768m left, which is plenty for the 480m prometheus pod that is failing to schedule. I dunno; hopefully component team folks understand this better than me ;)
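For reference, the 1232m total above can be reproduced mechanically from the same pods.json artifact. This is a minimal jq sketch, not part of the original report; it assumes every CPU request is expressed either in millicores such as `10m` or in whole cores, and it ignores init containers and pod overhead, just like the listing above:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/125/pull-ci-openshift-cluster-etcd-operator-master-e2e-azure/420/artifacts/e2e-azure/pods.json |
    jq '[.items[]
         | select(.spec.nodeName == "ci-op-wj40jzv7-52597-7qblc-worker-eastus22-nqdqk")
         | .spec.containers[].resources.requests.cpu // empty   # containers with no CPU request contribute nothing
         | if endswith("m")
           then rtrimstr("m") | tonumber                        # "10m" -> 10 millicores
           else tonumber * 1000                                 # "1"   -> 1000 millicores
           end
        ] | add'                                                # total requested millicores on the node
```

For this node that should come out to 1232, matching the hand count above.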
Allocatable == 1232m = Capacity (2) - System Reserved (768m):
https://github.com/openshift/machine-config-operator/blob/master/templates/worker/01-worker-kubelet/_base/files/kubelet.yaml#L25

So we need to calculate how much we handed out by summing up the pod requests manually (that sum is included in `oc describe node` output, but that doesn't help here).

> So we need to calculate how much we handed out by summing up the pod requests manually.

That's what I did in comment 2, and it shows 1232m requested by running pods. So we've neatly filled our capacity? But MemoryPressure is still False [1]?

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/125/pull-ci-openshift-cluster-etcd-operator-master-e2e-azure/420/artifacts/e2e-azure/must-gather/registry-svc-ci-openshift-org-ci-op-wj40jzv7-stable-sha256-a273f5ac7f1ad8f7ffab45205ac36c8dff92d9107ef3ae429eeb135fa8057b8b/cluster-scoped-resources/core/nodes/ci-op-wj40jzv7-52597-7qblc-worker-eastus22-nqdqk.yaml

CPU reservations were bumped back down. Marking as a dup of bug 1803239.

*** This bug has been marked as a duplicate of bug 1803239 ***
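As a footnote to the allocatable arithmetic above: on a live cluster the capacity and allocatable values can be read straight off the node object. A minimal sketch, assuming `oc` access and using the worker name from the job above:

```
$ oc get node ci-op-wj40jzv7-52597-7qblc-worker-eastus22-nqdqk \
    -o jsonpath='capacity={.status.capacity.cpu} allocatable={.status.allocatable.cpu}{"\n"}'
```

With a capacity of 2 CPU and 768m system-reserved, allocatable should report as 1232m, i.e. the scheduler only ever had 1232m to place pods into; the running pods listed above already request exactly that, which is why the 480m prometheus pod could not be scheduled.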