Bug 2057740
| Field | Value |
|---|---|
| Summary | openshift-marketplace pods with no 'controller: true' ownerReferences |
| Product | OpenShift Container Platform |
| Component | OLM |
| OLM sub component | OLM |
| Version | 4.11 |
| Status | CLOSED WONTFIX |
| Severity | low |
| Priority | low |
| Reporter | W. Trevor King <wking> |
| Assignee | Per da Silva <pegoncal> |
| QA Contact | Bruno Andrade <bandrade> |
| CC | aos-bugs, jiazha, jkeister, kramraja, mbargenq, mimccune, oarribas, pegoncal, pmagotra, wking |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| Clone Of | 2053343 |
|  | 2057756 (view as bug list) |
| Last Closed | 2023-02-13 21:57:16 UTC |
Description
W. Trevor King
2022-02-24 00:11:24 UTC
(In reply to W. Trevor King from comment #0)

> Dropping into Loki, machine-config-daemon-zk9tj logs have:
>
>   E0223 16:07:08.199572 195651 daemon.go:340] WARNING: ignoring DaemonSet-managed Pods: ..., openshift-marketplace/certified-operators-zbb6r, openshift-marketplace/community-operators-qpvff, openshift-marketplace/redhat-marketplace-dxpbn, openshift-marketplace/redhat-operators-mhlf5
>   ...
>   I0223 16:07:08.201839 195651 daemon.go:340] evicting pod openshift-marketplace/certified-operators-zbb6r
>   ...
>   I0223 16:07:19.831014 195651 daemon.go:325] Evicted pod openshift-marketplace/certified-operators-zbb6r
>
> That's... not entirely clear to me. It certainly doesn't look like a DaemonSet pod to me. But whatever, it seems like the MCO is able to drain this pod without the 'controller: true' setting.

Aha, this is because the MCO is forcing the drain [1]. So when we fix this bug and declare 'controller: true' on an ownerReferences entry, folks will no longer need to force when using the upstream drain library to drain these openshift-marketplace pods.

[1]: https://github.com/openshift/machine-config-operator/blob/b7f7bb950e1d1ee66c90ed6761a162d402b74664/pkg/daemon/daemon.go#L315

(In reply to W. Trevor King from comment #0)

> E0223 16:07:08.199572 195651 daemon.go:340] WARNING: ignoring DaemonSet-managed Pods: ..., openshift-marketplace/certified-operators-zbb6r, ...

A better ellipsis for this log line:

  E0223 16:07:08.199572 195651 daemon.go:340] WARNING: ignoring DaemonSet-managed Pods: ...; deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: openshift-kube-apiserver/kube-apiserver-guard-ip-10-0-151-30.us-west-1.compute.internal, openshift-kube-controller-manager/kube-controller-manager-guard-ip-10-0-151-30.us-west-1.compute.internal, openshift-kube-scheduler/openshift-kube-scheduler-guard-ip-10-0-151-30.us-west-1.compute.internal, openshift-marketplace/certified-operators-zbb6r, openshift-marketplace/community-operators-qpvff, openshift-marketplace/redhat-marketplace-dxpbn, openshift-marketplace/redhat-operators-mhlf5

I've filed [1] to clean up the messaging a bit. And it looks like I need to follow up with whoever creates those guard-ip pods too...

[1]: https://github.com/kubernetes/kubernetes/pull/108314

I've spun off bug 2057756 to track those guard pods.

The issue is that the ownerReference on the marketplace pods is the CatalogSource fronting the pod (so that when a user deletes the CatalogSource, the pod itself gets removed) instead of the underlying controller. This has been the behavior since the API was introduced, so users expect deletion to work that way.

It may make sense to update the ownerReference to refer to the marketplace operator explicitly, which would help for draining purposes, but then the CatalogSource would no longer own its own pod, and users who rely on the current behavior (deleting the CatalogSource deletes the underlying pod) may be impacted.

If we could have multiple ownerReferences on the pod (one for the catalog and one for the controller), that might resolve the issue. Since the metadata.ownerReferences field is plural, this might be OK, and it would be a backwards-compatible change.

(In reply to Daniel Sover from comment #4)

> If we could have multiple ownerrefs on the pod...

That sounds plausible to me.
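To make the forcing behavior from the first reply above concrete, here is a minimal sketch using the equivalent oc adm drain flags. This is an illustration only: the MCO drives the upstream drain library directly rather than shelling out to this command, and the node name is just a placeholder borrowed from the guard-pod names in the log line above.

  # Without a 'controller: true' ownerReference, the drain library treats the
  # catalog pods as unmanaged, so a plain drain refuses with an error along the
  # lines of "cannot delete Pods not managed by ReplicationController,
  # ReplicaSet, Job, DaemonSet or StatefulSet":
  oc adm drain ip-10-0-151-30.us-west-1.compute.internal --ignore-daemonsets --delete-emptydir-data

  # Forcing downgrades that refusal to the WARNING quoted above and evicts the
  # pods anyway, which is effectively what the MCO does:
  oc adm drain ip-10-0-151-30.us-west-1.compute.internal --ignore-daemonsets --delete-emptydir-data --force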
But looking for precedent in a Kube-core controller (ReplicaSet -> Pod):

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1496494490028871680/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods.json \
      | jq -r '.items[].metadata | select(.name | startswith("cluster-version-operator-")).ownerReferences[]'
  {
    "apiVersion": "apps/v1",
    "blockOwnerDeletion": true,
    "controller": true,
    "kind": "ReplicaSet",
    "name": "cluster-version-operator-66bcf76d6b",
    "uid": "49a02058-a7c8-4edd-8b61-cfe6aba21d85"
  }

So at least the ReplicaSet controller (which consumes the ReplicaSet and spits out Pods) doesn't point at itself; it claims the ReplicaSet itself as the controller. And I see no pods today that declare multiple owners:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1496494490028871680/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods.json \
      | jq -r '.items[].metadata | select(.ownerReferences | length > 1) | .namespace + " " + .name'
  ...no hits...

But yeah, I'd personally have expected things like:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1496494490028871680/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods.json \
      | jq -r '.items[].metadata | select(.name | startswith("cluster-version-operator-")).ownerReferences[]'
  {
    "apiVersion": "apps/v1",
    "blockOwnerDeletion": true,
    "controller": false,
    "kind": "ReplicaSet",
    "name": "cluster-version-operator-66bcf76d6b",
    "uid": "49a02058-a7c8-4edd-8b61-cfe6aba21d85"
  }
  {
    "apiVersion": "apps/v1",
    "blockOwnerDeletion": true,
    "controller": true,
    "kind": "Deployment",
    "namespace": "openshift-kube-controller-manager",
    "name": "openshift-controller-manager-operator",
    "uid": "..."
  }

That would read as "created by the Deployment openshift-controller-manager-operator, via ReplicaSet cluster-version-operator-66bcf76d6b". I'm not actually sure that's where the ReplicaSet controller lives, but whatever, that's the idea. However, that's not going to work because [1]:

> Cross-namespace owner references are disallowed by design. Namespaced dependents can specify cluster-scoped or namespaced owners. A namespaced owner must exist in the same namespace as the dependent. If it does not, the owner reference is treated as absent, and the dependent is subject to deletion once all owners are verified absent.
>
> Cluster-scoped dependents can only specify cluster-scoped owners.

So in situations where we can't actually point at the controller, but where we can point at the triggering resource, claiming that the triggering resource is a controller seems like a reasonable hack. The alternative would be to find some other resource upstream of the controller (like a ClusterOperator or some such), and glossing over the fact that the actual controller is a subsystem beneath that more-central resource seems like a reasonable hack too.

[1]: https://kubernetes.io/docs/concepts/architecture/garbage-collection/#owners-dependents

The OLM team has moved to Jira. Closing this as won't fix, since it has not bubbled up for action since 05 Apr 2022. Please open an OCPBUGS Jira, component OLM, if this is still an issue.

I've ported this to Jira as https://issues.redhat.com/browse/OCPBUGS-7431
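For anyone revisiting this from the Jira side, here is a quick way to check whether the catalog pods on a live cluster still lack a 'controller: true' owner, in the same spirit as the gather-extra jq queries above. This is a sketch assuming oc access to the cluster and jq installed locally.

  # List openshift-marketplace pods that have no ownerReference with controller: true.
  # Non-empty output means the drain library still considers them unmanaged.
  oc get pods -n openshift-marketplace -o json \
    | jq -r '.items[].metadata
             | select(([.ownerReferences[]? | select(.controller == true)] | length) == 0)
             | .name'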