Description of problem:
The pods created by the marketplace-operator's default CatalogSources do not have standard tolerations consistent with other core OCP components.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Install an OCP cluster
2. oc get pod <a-catsrc-pod> -o yaml | grep toleration
3.

Actual results:
Nothing is returned. No tolerations are specified.

Expected results:
CatalogSource pods are created with tolerations consistent with core OCP components.

Additional info:
E.g.: https://github.com/operator-framework/operator-marketplace/blob/master/manifests/09_operator.yaml#L20-L34
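For reference, a sketch of the kind of tolerations block core OCP components carry (values here mirror the toleration set observed on the fixed pods later in this bug; treat it as illustrative, not the exact manifest):

```yaml
# Illustrative tolerations for a control-plane-schedulable core component.
# The keys/effects below match what was later verified on the default
# CatalogSource pods in this bug.
tolerations:
- key: node-role.kubernetes.io/master
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 120
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 120
```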
There is no workaround available: with the current scope of the CatalogSource API there is no trivial way to add these tolerations. We would need to explicitly add a knob for this upstream or start managing these pods manually. In lieu of that, for now, there is no workaround.
*** Bug 2019963 has been marked as a duplicate of this bug. ***
Is this something we'd need to backport?
> Is this something we'd need to backport?

Per, this is where the question of whether this is a bug or a feature request comes into play. Since this (most likely) involves changes to the CatalogSource API, I think backporting this change will be a hard sell. To me it feels like it's enough to say "if you want to specify taints and tolerations for your CatalogSource, upgrade to the newest OCP version".
We've added the upstream changes here: https://github.com/operator-framework/operator-lifecycle-manager/pull/2512 to be able to override the tolerations. We may also need to update the catalog source definitions to make use of the new optional fields for overriding tolerations.
Additional documentation can be found here: https://olm.operatorframework.io/docs/advanced-tasks/overriding-catalog-source-pod-scheduling-configuration/
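Based on the linked upstream PR and docs, the override lands as an optional `grpcPodConfig` stanza on the CatalogSource spec. A minimal sketch (catalog name and image are hypothetical; verify the exact field names against the OLM version shipped in your cluster):

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: example-catalog                 # hypothetical name
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/example/catalog:latest # hypothetical index image
  # Optional scheduling overrides for the catalog pod, per the OLM docs above:
  grpcPodConfig:
    nodeSelector:
      kubernetes.io/os: linux
    tolerations:
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule
```

The marketplace-operator would then set these fields on the default CatalogSources it manages, which is the remaining work tracked in this bug.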
The only thing left is to make the change in the marketplace-operator to actually make use of the new API fields.
The PR: https://github.com/operator-framework/operator-marketplace
LGTM, marking as VERIFIED.

Cluster version: 4.10.0-0.nightly-2022-01-15-092722

$ oc exec catalog-operator-6fbcf6cc9f-mnpqf -n openshift-operator-lifecycle-manager -- olm --version
OLM version: 0.19.0
git commit: 79c782526c3c1c2da88f63b34707b23fb04f7da5

$ oc get pods certified-operators-dcz4n -o yaml -n openshift-marketplace | grep toleration
  tolerations:
    tolerationSeconds: 120
    tolerationSeconds: 120
1. Check the version of the marketplace-operator; the fix PR has been merged:

[cloud-user@preserve-olm-env jian]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-17-182202   True        False         169m    Cluster version is 4.10.0-0.nightly-2022-01-17-182202

[cloud-user@preserve-olm-env jian]$ oc exec marketplace-operator-595c466c46-96hxm -- marketplace-operator --version
time="2022-01-18T02:24:08Z" level=info msg="Go Version: go1.17.2"
time="2022-01-18T02:24:08Z" level=info msg="Go OS/Arch: linux/amd64"
time="2022-01-18T02:24:08Z" level=info msg="Marketplace source git commit: 80b92ecff398578b389cd953605a7b0f7bbd4f24\n"

2. With the fix, the pods of the default CatalogSources are scheduled onto the 'master' nodes:

[cloud-user@preserve-olm-env jian]$ oc get pods -o wide
NAME                                                              READY   STATUS      RESTARTS       AGE     IP            NODE                                              NOMINATED NODE   READINESS GATES
3fa839059996b63185244c32e43eb14f576c6549a69a0fde60a2013130bq2sf   0/1     Completed   0              161m    10.129.2.11   ip-10-0-168-171.ap-northeast-2.compute.internal   <none>           <none>
c53346a710e71a53959eb1c9104cb2c3c0bb496af3f24c8f0d68a6d7e127xkb   0/1     Completed   0              161m    10.129.2.12   ip-10-0-168-171.ap-northeast-2.compute.internal   <none>           <none>
certified-operators-9x9cc                                         1/1     Running     0              3h6m    10.128.0.17   ip-10-0-131-114.ap-northeast-2.compute.internal   <none>           <none>
community-operators-6fsjq                                         1/1     Running     0              3h6m    10.128.0.18   ip-10-0-131-114.ap-northeast-2.compute.internal   <none>           <none>
marketplace-operator-595c466c46-96hxm                             1/1     Running     4 (3h1m ago)   3h11m   10.130.0.29   ip-10-0-204-118.ap-northeast-2.compute.internal   <none>           <none>
qe-app-registry-sjs8l                                             1/1     Running     0              98m     10.129.2.28   ip-10-0-168-171.ap-northeast-2.compute.internal   <none>           <none>
redhat-marketplace-lbc2d                                          1/1     Running     0              3h6m    10.128.0.19   ip-10-0-131-114.ap-northeast-2.compute.internal   <none>           <none>
redhat-operators-897kf                                            1/1     Running     0              3h6m    10.128.0.16   ip-10-0-131-114.ap-northeast-2.compute.internal   <none>           <none>

[cloud-user@preserve-olm-env jian]$ oc get nodes
NAME                                              STATUS   ROLES    AGE    VERSION
ip-10-0-131-114.ap-northeast-2.compute.internal   Ready    master   3h9m   v1.23.0+60f5a1c
ip-10-0-141-27.ap-northeast-2.compute.internal    Ready    worker   178m   v1.23.0+60f5a1c
ip-10-0-168-171.ap-northeast-2.compute.internal   Ready    worker   178m   v1.23.0+60f5a1c
ip-10-0-181-155.ap-northeast-2.compute.internal   Ready    master   3h9m   v1.23.0+60f5a1c
ip-10-0-204-118.ap-northeast-2.compute.internal   Ready    master   3h9m   v1.23.0+60f5a1c
ip-10-0-220-187.ap-northeast-2.compute.internal   Ready    worker   178m   v1.23.0+60f5a1c

All default CatalogSource pods have 'tolerations' that match the master nodes' 'taints':

[cloud-user@preserve-olm-env jian]$ oc get nodes ip-10-0-131-114.ap-northeast-2.compute.internal -o=jsonpath={.spec.taints}
[{"effect":"NoSchedule","key":"node-role.kubernetes.io/master"}]

[cloud-user@preserve-olm-env jian]$ oc get pods certified-operators-9x9cc -o=jsonpath={.spec.tolerations}
[{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"},{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120},{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120},{"effect":"NoSchedule","key":"node.kubernetes.io/memory-pressure","operator":"Exists"}]

[cloud-user@preserve-olm-env jian]$ oc get pods community-operators-6fsjq -o=jsonpath={.spec.tolerations}
[{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"},{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120},{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120},{"effect":"NoSchedule","key":"node.kubernetes.io/memory-pressure","operator":"Exists"}]

[cloud-user@preserve-olm-env jian]$ oc get pods redhat-marketplace-lbc2d -o=jsonpath={.spec.tolerations}
[{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"},{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120},{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120},{"effect":"NoSchedule","key":"node.kubernetes.io/memory-pressure","operator":"Exists"}]

[cloud-user@preserve-olm-env jian]$ oc get pods redhat-operators-897kf -o=jsonpath={.spec.tolerations}
[{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"},{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120},{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120},{"effect":"NoSchedule","key":"node.kubernetes.io/memory-pressure","operator":"Exists"}]

Hi @Kevin, all of the default CatalogSource pods are now running on the master nodes. As I remember, we previously avoided scheduling them on the master nodes in order to reduce pressure on master node resources. Is this behavior really as expected? Thanks!
@Sonigra Surab: we won't be backporting this. I will talk to the team tomorrow to decide whether the doc should be handled as a bug fix or a feature as well.
@Jian I've spoken to Kevin and he's confirmed this fix is as expected.
@naygupta: I don't think anything is required from the customer side. The information you got is accurate. As for the memory footprint of the catalog source, it comes from the content within it, which is shipped out-of-band and isn't part of the core payload. The current fix in this ticket means that from 4.10.0 the catalog source pods can be scheduled on the master nodes. This may alleviate the customer's issue. Actually, maybe that's a worthwhile question for the customer: "If CatalogSource pods are scheduled on the master nodes, would that improve the status quo?". If not, then maybe it would be worth re-opening BZ #2010599. I hope this helps.
I've discussed it internally. This is medium/medium and non-blocking. We won't have the bandwidth to backport, unfortunately.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056