+++ This bug was initially created as a clone of Bug #2057740 +++

+++ This bug was initially created as a clone of Bug #2053343 +++

--- Additional comment from W. Trevor King on 2022-02-24 00:21:08 UTC ---

(In reply to W. Trevor King from comment #0)
> Dropping into Loki, machine-config-daemon-zk9tj logs have:
>
>   E0223 16:07:08.199572  195651 daemon.go:340] WARNING: ignoring
> DaemonSet-managed Pods: ...,
> openshift-marketplace/certified-operators-zbb6r,
> openshift-marketplace/community-operators-qpvff,
> openshift-marketplace/redhat-marketplace-dxpbn,
> openshift-marketplace/redhat-operators-mhlf5
>   ...
>   I0223 16:07:08.201839  195651 daemon.go:340] evicting pod
> openshift-marketplace/certified-operators-zbb6r
>   ...
>   I0223 16:07:19.831014  195651 daemon.go:325] Evicted pod
> openshift-marketplace/certified-operators-zbb6r
>
> That's... not entirely clear to me. Certainly doesn't look like a DaemonSet
> pod to me. But whatever, seems like MCO is able to drain this pod without
> the 'controller: true' setting.

Aha, this is because the MCO is forcing the drain [1]. So when we fix this bug and declare 'controller: true' on an ownerReferences entry, folks will no longer need to force when using the upstream drain library to drain these openshift-marketplace pods.

[1]: https://github.com/openshift/machine-config-operator/blob/b7f7bb950e1d1ee66c90ed6761a162d402b74664/pkg/daemon/daemon.go#L315

--- Additional comment from W. Trevor King on 2022-02-24 02:36:41 UTC ---

(In reply to W. Trevor King from comment #0)
> E0223 16:07:08.199572  195651 daemon.go:340] WARNING: ignoring
> DaemonSet-managed Pods: ...,
> openshift-marketplace/certified-operators-zbb6r,
> ...

Better ellipsis for this log line:

  E0223 16:07:08.199572  195651 daemon.go:340] WARNING: ignoring DaemonSet-managed Pods: ...; deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: openshift-kube-apiserver/kube-apiserver-guard-ip-10-0-151-30.us-west-1.compute.internal, openshift-kube-controller-manager/kube-controller-manager-guard-ip-10-0-151-30.us-west-1.compute.internal, openshift-kube-scheduler/openshift-kube-scheduler-guard-ip-10-0-151-30.us-west-1.compute.internal, openshift-marketplace/certified-operators-zbb6r, openshift-marketplace/community-operators-qpvff, openshift-marketplace/redhat-marketplace-dxpbn, openshift-marketplace/redhat-operators-mhlf5

I've filed [1] to clean up the messaging a bit. And it looks like I need to follow up with whoever creates those guard-ip pods too...

[1]: https://github.com/kubernetes/kubernetes/pull/108314

---

Bug 2057740 covers a lack of 'controller: true' ownerReferences keeping some openshift-marketplace pods from being drained without --force. This bug tracks the new-in-4.10 guard pods, which lack ownerReferences entirely. Ideally they'd be marked so that it's clear to drain (and everyone else) that there is a controller in charge of creating those Pods, with the reference pointing at some resource associated with that controller.
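For illustration, here's a minimal Go sketch (not the actual guard-pod controller code) of stamping a 'controller: true' ownerReference onto a Pod via metav1.NewControllerRef, which is the shape of fix this bug is asking for. The owner object and names below are hypothetical; the real guard pods would point at whatever resource their creating controller actually manages.

  // Sketch only: stamp a controller ownerReference on a guard-style Pod.
  package main

  import (
  	"fmt"

  	corev1 "k8s.io/api/core/v1"
  	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  )

  func main() {
  	// Hypothetical owner; in practice this would be the resource managed
  	// by the controller that creates the guard pods.
  	owner := &corev1.ConfigMap{
  		ObjectMeta: metav1.ObjectMeta{
  			Name: "guard-owner",
  			UID:  "00000000-0000-0000-0000-000000000000",
  		},
  	}

  	pod := &corev1.Pod{
  		ObjectMeta: metav1.ObjectMeta{
  			Name:      "kube-apiserver-guard-example",
  			Namespace: "openshift-kube-apiserver",
  		},
  	}

  	// metav1.NewControllerRef sets Controller: true (and
  	// BlockOwnerDeletion: true) on the returned reference.
  	pod.OwnerReferences = append(pod.OwnerReferences,
  		*metav1.NewControllerRef(owner, corev1.SchemeGroupVersion.WithKind("ConfigMap")))

  	fmt.Printf("%+v\n", pod.OwnerReferences)
  }

With a reference like that in place, the drain library's filters would classify the guard pods as controller-managed instead of lumping them into the "Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet" warning above.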
Definition of done is emptying out the following query:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1496494490028871680/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[].metadata | select((.name | contains("-guard-")) and ([(.ownerReferences // [])[] | select(.controller)] | length == 0)) | .namespace + " " + .name + " " + (.ownerReferences | tostring)'
  openshift-kube-apiserver kube-apiserver-guard-ip-10-0-151-30.us-west-1.compute.internal null
  openshift-kube-apiserver kube-apiserver-guard-ip-10-0-184-130.us-west-1.compute.internal null
  openshift-kube-apiserver kube-apiserver-guard-ip-10-0-193-89.us-west-1.compute.internal null
  openshift-kube-controller-manager kube-controller-manager-guard-ip-10-0-151-30.us-west-1.compute.internal null
  openshift-kube-controller-manager kube-controller-manager-guard-ip-10-0-184-130.us-west-1.compute.internal null
  openshift-kube-controller-manager kube-controller-manager-guard-ip-10-0-193-89.us-west-1.compute.internal null
  openshift-kube-scheduler openshift-kube-scheduler-guard-ip-10-0-151-30.us-west-1.compute.internal null
  openshift-kube-scheduler openshift-kube-scheduler-guard-ip-10-0-184-130.us-west-1.compute.internal null
  openshift-kube-scheduler openshift-kube-scheduler-guard-ip-10-0-193-89.us-west-1.compute.internal null

where I'm using [1] as an example CI run.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1496494490028871680
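For reference, a rough Go equivalent of that jq filter (a sketch, not actual CI tooling; the placeholder pod stands in for pods.json content): metav1.GetControllerOf is the same kind of check the drain library applies when deciding whether a pod is controller-managed, so this should come back empty once the guard pods grow a controller ownerReference.

  // Sketch only: list guard pods that lack a 'controller: true' ownerReference.
  package main

  import (
  	"fmt"
  	"strings"

  	corev1 "k8s.io/api/core/v1"
  	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  )

  // podsMissingController returns guard pods with no controller ownerReference,
  // mirroring the jq select() above.
  func podsMissingController(pods []corev1.Pod) []string {
  	var missing []string
  	for i := range pods {
  		if !strings.Contains(pods[i].Name, "-guard-") {
  			continue
  		}
  		if metav1.GetControllerOf(&pods[i]) == nil {
  			missing = append(missing, pods[i].Namespace+" "+pods[i].Name)
  		}
  	}
  	return missing
  }

  func main() {
  	// Placeholder standing in for one entry from the pods.json artifact.
  	pod := corev1.Pod{ObjectMeta: metav1.ObjectMeta{
  		Namespace: "openshift-kube-apiserver",
  		Name:      "kube-apiserver-guard-ip-10-0-151-30.us-west-1.compute.internal",
  	}}
  	fmt.Println(podsMissingController([]corev1.Pod{pod}))
  }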
Due to higher priority tasks I have not been able to resolve this issue in time. Moving to the next sprint.
Ported to https://issues.redhat.com/browse/WRKLDS-646