Bug 1927478
Summary: | Default CatalogSources deployed by marketplace do not have toleration for tainted nodes. | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Anik <anbhatta> |
Component: | OLM | Assignee: | Per da Silva <pegoncal> |
OLM sub component: | OLM | QA Contact: | Bruno Andrade <bandrade> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | medium | ||
Priority: | medium | CC: | dsilvaju, gagore, jiazha, krizza, naygupta, nhale, nivarma, oarribas, pegoncal, rgudimet, ssonigra, tflannag, ychoukse |
Version: | 4.8 | Keywords: | Triaged |
Target Milestone: | --- | ||
Target Release: | 4.10.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | 4.10.0 | Doc Type: | Bug Fix |
Doc Text: |
Cause:
The underlying pod spec for a CatalogSource pod does not set the required tolerations, nodeSelector, or priorityClassName by default. Other than priorityClassName, there was no way for the user to influence the tolerations or nodeSelector.
Consequence:
The CatalogSource pod would keep the default settings for tolerations, nodeSelector, and priorityClassName, so it could not be scheduled onto tainted nodes.
Fix:
The CatalogSource spec has been expanded and now includes an optional field, spec.grpcPodConfig, that can be used to override the tolerations, nodeSelector, and priorityClassName of the underlying pod. The default CatalogSources created by the marketplace-operator were updated to make use of this new field.
Result:
The CatalogSource pods for the default CatalogSources now have the expected nodeSelector, tolerations, and priorityClassName.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2022-03-10 16:02:37 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Anik
2021-02-10 20:20:56 UTC
There is no workaround available; there is no trivial way to add these tolerations with the current scope of the CatalogSource API. We would need to explicitly add a knob for this feature in the upstream or start managing these pods manually. In lieu of that, for now, there's no workaround.

*** Bug 2019963 has been marked as a duplicate of this bug. ***

Is this something we'd need to backport?

> Is this something we'd need to backport?
Per,
This is where the question of whether this is a bug or a feature request comes into play. Since this (most likely) involves changes to the CatalogSource API, I think backporting this change will be a hard sell. To me it feels like it's enough to say "if you want to specify taints and tolerations for your CatalogSource, upgrade to the newest OCP version".
We've added the upstream changes here: https://github.com/operator-framework/operator-lifecycle-manager/pull/2512 to be able to override the tolerations. We may also need to update the catalog source definitions to make use of the new optional fields for overriding tolerations. Additional documentation can be found here: https://olm.operatorframework.io/docs/advanced-tasks/overriding-catalog-source-pod-scheduling-configuration/

The only thing left is to make the change in the marketplace to actually make use of the new API fields.
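For illustration, a CatalogSource that overrides the pod scheduling via the new spec.grpcPodConfig field might look like the sketch below. The field names (nodeSelector, priorityClassName, tolerations) come from the fix description and the OLM documentation linked above; the catalog name, image, and the specific selector, priority class, and toleration values are assumptions for demonstration, not values mandated by this bug.

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: example-catalog                          # hypothetical name
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/example/catalog-index:latest    # hypothetical index image
  displayName: Example Catalog
  grpcPodConfig:
    nodeSelector:                                # assumed labels; choose labels matching your target nodes
      kubernetes.io/os: linux
      node-role.kubernetes.io/master: ""
    priorityClassName: system-cluster-critical   # assumed priority class
    tolerations:                                 # mirrors the master taint shown in the verification output below
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule

With a spec like this, the catalog pod created for the CatalogSource is expected to carry the specified tolerations, nodeSelector, and priorityClassName, which is what the verification below checks for the default catalogs.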
LGTM, marking as VERIFIED.

Cluster version: 4.10.0-0.nightly-2022-01-15-092722

oc exec catalog-operator-6fbcf6cc9f-mnpqf -n openshift-operator-lifecycle-manager -- olm --version
OLM version: 0.19.0
git commit: 79c782526c3c1c2da88f63b34707b23fb04f7da5

oc get pods certified-operators-dcz4n -o yaml -n openshift-marketplace | grep toleration
  tolerations:
    tolerationSeconds: 120
    tolerationSeconds: 120

1. Check the version of the marketplace-operator; the fixed PR has been merged, as follows:

[cloud-user@preserve-olm-env jian]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-17-182202   True        False         169m    Cluster version is 4.10.0-0.nightly-2022-01-17-182202

[cloud-user@preserve-olm-env jian]$ oc exec marketplace-operator-595c466c46-96hxm -- marketplace-operator --version
time="2022-01-18T02:24:08Z" level=info msg="Go Version: go1.17.2"
time="2022-01-18T02:24:08Z" level=info msg="Go OS/Arch: linux/amd64"
time="2022-01-18T02:24:08Z" level=info msg="Marketplace source git commit: 80b92ecff398578b389cd953605a7b0f7bbd4f24\n"

2. After the fixed PR, the pods of the default CatalogSources were scheduled onto the 'master' nodes, as follows:

[cloud-user@preserve-olm-env jian]$ oc get pods -o wide
NAME                                                              READY   STATUS      RESTARTS       AGE     IP            NODE                                              NOMINATED NODE   READINESS GATES
3fa839059996b63185244c32e43eb14f576c6549a69a0fde60a2013130bq2sf   0/1     Completed   0              161m    10.129.2.11   ip-10-0-168-171.ap-northeast-2.compute.internal   <none>           <none>
c53346a710e71a53959eb1c9104cb2c3c0bb496af3f24c8f0d68a6d7e127xkb   0/1     Completed   0              161m    10.129.2.12   ip-10-0-168-171.ap-northeast-2.compute.internal   <none>           <none>
certified-operators-9x9cc                                         1/1     Running     0              3h6m    10.128.0.17   ip-10-0-131-114.ap-northeast-2.compute.internal   <none>           <none>
community-operators-6fsjq                                         1/1     Running     0              3h6m    10.128.0.18   ip-10-0-131-114.ap-northeast-2.compute.internal   <none>           <none>
marketplace-operator-595c466c46-96hxm                             1/1     Running     4 (3h1m ago)   3h11m   10.130.0.29   ip-10-0-204-118.ap-northeast-2.compute.internal   <none>           <none>
qe-app-registry-sjs8l                                             1/1     Running     0              98m     10.129.2.28   ip-10-0-168-171.ap-northeast-2.compute.internal   <none>           <none>
redhat-marketplace-lbc2d                                          1/1     Running     0              3h6m    10.128.0.19   ip-10-0-131-114.ap-northeast-2.compute.internal   <none>           <none>
redhat-operators-897kf                                            1/1     Running     0              3h6m    10.128.0.16   ip-10-0-131-114.ap-northeast-2.compute.internal   <none>           <none>

[cloud-user@preserve-olm-env jian]$ oc get nodes
NAME                                              STATUS   ROLES    AGE    VERSION
ip-10-0-131-114.ap-northeast-2.compute.internal   Ready    master   3h9m   v1.23.0+60f5a1c
ip-10-0-141-27.ap-northeast-2.compute.internal    Ready    worker   178m   v1.23.0+60f5a1c
ip-10-0-168-171.ap-northeast-2.compute.internal   Ready    worker   178m   v1.23.0+60f5a1c
ip-10-0-181-155.ap-northeast-2.compute.internal   Ready    master   3h9m   v1.23.0+60f5a1c
ip-10-0-204-118.ap-northeast-2.compute.internal   Ready    master   3h9m   v1.23.0+60f5a1c
ip-10-0-220-187.ap-northeast-2.compute.internal   Ready    worker   178m   v1.23.0+60f5a1c

All default CatalogSource pods have the 'tolerations' to match the master nodes' 'taints', as follows:

[cloud-user@preserve-olm-env jian]$ oc get nodes ip-10-0-131-114.ap-northeast-2.compute.internal -o=jsonpath={.spec.taints}
[{"effect":"NoSchedule","key":"node-role.kubernetes.io/master"}]

[cloud-user@preserve-olm-env jian]$ oc get pods certified-operators-9x9cc -o=jsonpath={.spec.tolerations}
[{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"},{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120},{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120},{"effect":"NoSchedule","key":"node.kubernetes.io/memory-pressure","operator":"Exists"}]

[cloud-user@preserve-olm-env jian]$ oc get pods community-operators-6fsjq -o=jsonpath={.spec.tolerations}
[{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"},{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120},{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120},{"effect":"NoSchedule","key":"node.kubernetes.io/memory-pressure","operator":"Exists"}]

[cloud-user@preserve-olm-env jian]$ oc get pods redhat-marketplace-lbc2d -o=jsonpath={.spec.tolerations}
[{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"},{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120},{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120},{"effect":"NoSchedule","key":"node.kubernetes.io/memory-pressure","operator":"Exists"}]

[cloud-user@preserve-olm-env jian]$ oc get pods redhat-operators-897kf -o=jsonpath={.spec.tolerations}
[{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"},{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120},{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120},{"effect":"NoSchedule","key":"node.kubernetes.io/memory-pressure","operator":"Exists"}]

Hi @Kevin, all of the default CatalogSource pods are now running on the master nodes, but as I remember, we previously did not schedule them on the master nodes in order to reduce master node resource pressure. Is this solution really as expected? Thanks!

@Sonigra Surab: we won't be backporting this. I will talk to the team tomorrow to decide whether the doc should be handled as a bug fix or a feature as well.

@Jian: I've spoken to Kevin and he's confirmed this fix is as expected.

@naygupta: I don't think anything is required from the customer side. The information you got is accurate. The memory footprint of the catalog source comes from the content within it, which is shipped out-of-band and isn't part of the core payload. The current fix in this ticket means that from 4.10.0 the catalog source pods will be executed on the master nodes. This may alleviate the customer's issue. Actually, maybe that's a worthwhile question for the customer: "If CatalogSource pods are scheduled on the master node, would that improve the status quo?" If not, then maybe it would be worth re-opening BZ #2010599. I hope this helps.

I've discussed it internally. This is medium/medium and non-blocking. We won't have the bandwidth to backport, unfortunately.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056