I believe this has been fixed in 4.11, but given the nature of this issue (it has blocked IBM for months), we would like to request that it be back-ported to 4.10.

Here is the Jira ticket which the IBM team opened: https://issues.redhat.com/browse/CPSYN-63

+++ This bug was initially created as a clone of Bug #1999325 +++

[sig-arch] events should not repeat pathologically

We've seen an uptick in these messages:

event happened 24 times, something is wrong: ns/openshift-kube-scheduler pod/revision-pruner-8-ci-op-k3ifhkdh-875d2-xjdv2-master-2 node/ci-op-k3ifhkdh-875d2-xjdv2-master-2 - reason/FailedMount MountVolume.SetUp failed for volume "kube-api-access" : object "openshift-kube-scheduler"/"kube-root-ca.crt" not registered

See https://search.ci.openshift.org/?search=FailedMount+MountVolume.SetUp+failed+for+volume.*kube-api-access.*not+registered&maxAge=168h&context=1&type=bug%2Bjunit&name=4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

From the Slack thread discussing this: "The latter (kube-root-ca.crt) is a Node bug (the kubelet cache probably being slow)"

--- Additional comment from Jan Safranek on 2021-09-10 12:57:57 UTC ---

I am sorry, but here I must exercise the agreement with sig-node - secret / config map / downward API volumes are their territory. We can help with issues there (and we often do), but here it's the kubelet that can't get a config map from the API server in time. Storage is not involved here.

In addition, the name "kube-root-ca.crt" suggests it might be one of the volumes projected by the BoundServiceAccountTokenVolume feature (but please double-check).

--- Additional comment from Yang Yang on 2021-09-17 08:14:13 UTC ---

Observed a similar event in the KCM operator.
# oc get event | grep FailedMount
130m   Warning   FailedMount   pod/installer-9-yangyang0917-tv2fd-master-1.c.openshift-qe.internal          MountVolume.SetUp failed for volume "kube-api-access" : object "openshift-kube-controller-manager"/"kube-root-ca.crt" not registered
129m   Warning   FailedMount   pod/revision-pruner-9-yangyang0917-tv2fd-master-1.c.openshift-qe.internal    MountVolume.SetUp failed for volume "kube-api-access" : object "openshift-kube-controller-manager"/"kube-root-ca.crt" not registered
147m   Warning   FailedMount   pod/yangyang0917-tv2fd-master-1copenshift-qeinternal-debug                   MountVolume.SetUp failed for volume "kube-api-access-gsmrj" : [object "openshift-kube-controller-manager"/"kube-root-ca.crt" not registered, object "openshift-kube-controller-manager"/"openshift-service-ca.crt" not registered]

--- Additional comment from Jan Safranek on 2021-09-17 08:53:27 UTC ---

Sorry, I forgot to reset the assignee.

--- Additional comment from RamaKasturi on 2021-09-21 13:35:08 UTC ---

Hit this issue almost 7 times in the e2e CI nightly jobs. Below is the must-gather link.

http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1999325/must-gather-43219-423866308.tar.gz

--- Additional comment from Tereza on 2021-10-08 14:51:42 UTC ---

Hello team, I am hitting the same issues cluster-wide during 4.8.12 -> 4.9.0-rc.5 osde2e upgrades in the stage environment as part of the OSD Gap Analysis of 4.9. For this reason I am adding the ServiceDeliveryBlocker label.

>>>
...
"openshift-kube-scheduler"/"kube-root-ca.crt" not registered
"openshift-compliance"/"kube-root-ca.crt" not registered
"openshift-kube-apiserver"/"kube-root-ca.crt" not registered
...
>>>

Findings from my investigation:

It all started with a couple of pods (to prevent confusion, the name is empty-node-labels) in CrashLoopBackOff state that I have reason to believe are just a symptom, as I don't see "successfully pulled image" in the event messages.
>>>
Normal    Pulling   104m (x244 over 22h)     kubelet   Pulling image "registry.access.redhat.com/ubi8/ubi-minimal"
Warning   BackOff   4m46s (x6083 over 22h)   kubelet   Back-off restarting failed container
>>>

During the e2e upgrade I was seeing cluster health checks fail for one specific pod (to prevent confusion, the name is osd-delete-backplane-serviceaccounts). The list of events in the namespace openshift-backplane gave me the following error:

>>>
62m   Warning   FailedMount   pod/osd-delete-backplane-serviceaccounts-27227840--1-8bgmj   MountVolume.SetUp failed for volume "kube-api-access-d94r5" : [object "openshift-backplane"/"kube-root-ca.crt" not registered, object "openshift-backplane"/"openshift-service-ca.crt" not registered]
>>>

At this point, I scanned the errors from the kubelet and found out that this is probably a cluster-wide issue.

>>>
Oct 08 07:00:05.679120 ip-10-0-134-56 hyperkube[1475]: I1008 07:00:05.679038 1475 kubelet_pods.go:898] "Unable to retrieve pull secret, the image pull may not succeed." pod="openshift-image-registry/image-pruner-27227940--1-wmb5g" secret="" err="object \"openshift-image-registry\"/\"pruner-dockercfg-qc6tv\" not registered"
>>>

The link I saw between the failing empty-node-labels pods and the "not registered" errors is the kube-api-access volumes having difficulties with mount setup. In detail, the kubelet on the control plane is running a loop of: VerifyControllerAttachedVolume started for volume -> MountVolume started for volume -> SetUp succeeded for volume -> Error preparing data for projected volume (certs are not registered) -> UnmountVolume started for volume -> TearDown succeeded for volume -> Volume detached for volume.

>>>
Oct 08 01:02:21.199571 ip-10-0-190-212 hyperkube[1603]: E1008 01:02:21.199473 1603 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/projected/28f54e40-f566-4ccc-9ba4-dccbdb493da2-kube-api-access-wlm72 podName:28f54e40-f566-4ccc-9ba4-dccbdb493da2 nodeName:}" failed. No retries permitted until 2021-10-08 01:02:21.699456816 +0000 UTC m=+14760.769272070 (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume "kube-api-access-wlm72" (UniqueName: "kubernetes.io/projected/28f54e40-f566-4ccc-9ba4-dccbdb493da2-kube-api-access-wlm72") pod "aggregator-pod-rhcos4-moderate-master" (UID: "28f54e40-f566-4ccc-9ba4-dccbdb493da2") : [object "openshift-compliance"/"kube-root-ca.crt" not registered, object "openshift-compliance"/"openshift-service-ca.crt" not registered]
>>>

The last two things I noticed in the kubelet logs were `clearQuota called, but quotas disabled` messages regarding the volumes, and two failing OCP e2e tests `[sig-devex][Feature:ImageEcosystem][Slow] openshift sample application repositories` for user "system:serviceaccount:osde2e-rveds:default" about forbidden access. I am not sure if this piece of information is helpful for the investigation.

>>>
Oct 08 01:01:34.621169 ip-10-0-185-218 hyperkube[1603]: W1008 01:01:34.621159 1603 empty_dir.go:517] Warning: Failed to clear quota on /var/lib/kubelet/pods/25f1fd79-7851-4183-a237-bfd1a4950c5d/volumes/kubernetes.io~configmap/rhcos4-moderate-master-openscap-container-entrypoint: clearQuota called, but quotas disabled
Oct 07 20:57:27.976215 ip-10-0-221-29 hyperkube[1609]: W1007 20:57:27.976195 1609 empty_dir.go:520] Warning: Failed to clear quota on /var/lib/kubelet/pods/e39ba8bd-8ad0-4266-b56f-a4d352b1e857/volumes/kubernetes.io~configmap/config: clearQuota called, but quotas disabled
>>>

>>>
Error from server (Forbidden): services is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "services" in API group "" in the namespace "kube-system"
Error from server (Forbidden): infrastructures.config.openshift.io "cluster" is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot get resource "infrastructures" in API group "config.openshift.io" at the cluster scope
Error from server (Forbidden): nodes is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "nodes" in API group "" at the cluster scope
Error from server (Forbidden): Failed to watch *v1.ClusterOperator: failed to list *v1.ClusterOperator: clusteroperators.config.openshift.io is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "clusteroperators" in API group "config.openshift.io" at the cluster scope
Error from server (Forbidden): Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: clusterversions.config.openshift.io "version" is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "clusterversions" in API group "config.openshift.io" at the cluster scope
Error from server (Forbidden): Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "nodes" in API group "" at the cluster scope
Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "pods" in API group "" at the cluster scope
Oct 7 18:56:11.768: FAIL: users.user.openshift.io "e2e-test-rails-postgresql-repo-test-fnvct-user" is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot get resource "users" in API group "user.openshift.io" at the cluster scope
>>>

This is as far as I have got. I would be happy if someone could look at this issue. I am attaching the full must-gather so you can also investigate.
Thank you :)

--- Additional comment from Tereza on 2021-10-08 14:56:20 UTC ---

Link to the full must-gather of the 'gap-upgrade' stage cluster: https://drive.google.com/file/d/1J4EUyFSEPyKmYYOCIUTNCb-VAEaDvYC6/view?usp=sharing

--- Additional comment from Nikos Stamatelopoulos on 2021-12-07 14:08:32 UTC ---

Hello team, I double-checked with the associate in the attached ticket that these FailedMount "not registered" errors are not affecting the normal operation of the container, and that despite their presence in the logs the volumes are mounted to the pod. His issue was incorrect syntax in the cronjob object's YAML. I'm not sure what this "not registered" message means and it needs to be investigated; that's why I am reopening this bug.

--- Additional comment from Tom Sweeney on 2022-01-06 16:20:30 UTC ---

Not completed this sprint.

--- Additional comment from Alexander Greene on 2022-03-11 19:33:49 UTC ---

--- Additional comment from Elana Hashman on 2022-03-17 22:51:43 UTC ---

The corresponding code is https://github.com/openshift/kubernetes/blob/4d08582d1fa21e1f5887e73380001ac827371553/pkg/kubelet/util/manager/watch_based_manager.go#L292

I suspect, given the timing (4.9+ releases), this is related to the Kubernetes pod lifecycle refactor that landed in Kubernetes 1.22. One more patch landed a few hours ago to resolve some lingering issues, which we plan to backport to 4.9: https://github.com/openshift/kubernetes/pull/1199

Let's wait and see if this fixes the errors we've been seeing in CI.

--- Additional comment from Elana Hashman on 2022-03-17 23:09:46 UTC ---

Ah wait, this is:
https://github.com/kubernetes/kubernetes/pull/107821
https://github.com/kubernetes/kubernetes/pull/107831

We have open backports upstream for these. I'll look into getting them merged.
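The watch-based manager linked in the comments above keeps a reference-counted registry of the secrets and config maps that running pods reference; "not registered" is the error a lookup hits when it races ahead of the pod's registration (or runs after cleanup has already unregistered it). A stripped-down sketch of that bookkeeping, not the real kubelet code, just an illustration of why a timing race produces this message:

```go
package main

import (
	"fmt"
	"sync"
)

// objectCache is a toy stand-in for the kubelet's watch-based object
// manager: objects must be registered (per referencing pod) before Get
// will serve them.
type objectCache struct {
	mu         sync.Mutex
	registered map[string]int // reference count per "namespace/name"
}

func newObjectCache() *objectCache {
	return &objectCache{registered: make(map[string]int)}
}

// Register records that some pod references namespace/name.
func (c *objectCache) Register(namespace, name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.registered[namespace+"/"+name]++
}

// Unregister drops one reference, deleting the entry at zero.
func (c *objectCache) Unregister(namespace, name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	key := namespace + "/" + name
	if c.registered[key] > 0 {
		if c.registered[key]--; c.registered[key] == 0 {
			delete(c.registered, key)
		}
	}
}

// Get fails with the "not registered" error when the lookup happens
// outside the Register/Unregister window. The real code returns the
// cached object on success.
func (c *objectCache) Get(namespace, name string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.registered[namespace+"/"+name] == 0 {
		return fmt.Errorf("object %q/%q not registered", namespace, name)
	}
	return nil
}

func main() {
	c := newObjectCache()
	// A Get that races ahead of Register reproduces the error shape
	// seen in the bug's events.
	fmt.Println(c.Get("openshift-kube-scheduler", "kube-root-ca.crt"))
	c.Register("openshift-kube-scheduler", "kube-root-ca.crt")
	fmt.Println(c.Get("openshift-kube-scheduler", "kube-root-ca.crt"))
}
```

Since the mount controller retries and the registration eventually lands, the lookup succeeds on a later attempt, which is consistent with the observation that the volumes do end up mounted and the error is effectively log spam.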
--- Additional comment from Elana Hashman on 2022-03-21 16:42:44 UTC ---

The bug is fixed in 4.11 via the PRs linked above in https://bugzilla.redhat.com/show_bug.cgi?id=1999325#c22

Upstream backports have now merged for:
1.23: https://github.com/kubernetes/kubernetes/pull/108756
1.22: https://github.com/kubernetes/kubernetes/pull/108754

--- Additional comment from Elana Hashman on 2022-03-22 17:58:37 UTC ---

Based on the attached patches, we believe this is just log spam due to a race condition and doesn't have any negative effect on running workloads, per https://bugzilla.redhat.com/show_bug.cgi?id=1999325#c12

We expect the patch to be picked up in OpenShift rebases that include Kubernetes versions at least v1.24.0-alpha.2, 1.23.6, and 1.22.9. The 1.23.x and 1.22.x patches are scheduled to be released in April. I believe we have not yet begun the 1.24 rebase for 4.11, but I expect the bug will be fixed once that rebase lands, as well as in 4.10 and 4.9 once the April patch rebases land. At that time QE can confirm the fix.
@kaushal

(In reply to Kaushal from comment #0)
> I believe this has been fixed in 4.11 but given the nature of this (it
> blocks IBM for months), we would like to request that it be back-ported to
> 4.10

This is not yet fixed in 4.11 because the rebase hasn't happened yet. We plan on backporting to all affected versions, both 4.10 and 4.9, once we've confirmed the fix.

Can you share a bit more information about the impact of this bug? My understanding from the parent BZ and the patches in Kubernetes is that this is just log spam: a warning is printed suggesting there is a mount issue, but the mounts are successful and their contents are available in pods, which is why I've marked the bug as severity medium.
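For context on the volume named in these events: "kube-api-access" is the projected service-account volume that the BoundServiceAccountTokenVolume feature injects into every pod, combining the bound token, the kube-root-ca.crt config map, and the pod's namespace. A typical generated volume looks roughly like the following sketch (the name suffix and numeric values are illustrative, not taken from this cluster):

```yaml
volumes:
- name: kube-api-access-wlm72        # suffix is generated per pod
  projected:
    defaultMode: 420
    sources:
    - serviceAccountToken:
        expirationSeconds: 3607
        path: token
    - configMap:                     # the source failing with "not registered"
        name: kube-root-ca.crt
        items:
        - key: ca.crt
          path: ca.crt
    - downwardAPI:
        items:
        - fieldRef:
            apiVersion: v1
            fieldPath: metadata.namespace
          path: namespace
```

Because all sources of a projected volume are assembled together, a stale cache entry for the kube-root-ca.crt config map fails the whole SetUp attempt, even though the token itself is available; the next retry then succeeds once the object is registered.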
The IBM team did some further testing and has verified that this is only a logging error; the mounts are successful despite the error.

Should we keep this ticket open for the back-port request, or close it and track it in the original ticket?

CC @kangell
CC @ehashman
CC @patmarti
@kmudalia I would say it depends on whether this is already tracked elsewhere.
I think at this time I will close this bug, and we'll have the bot open backport bugs automatically when we are ready to backport the patches. It will be a little cleaner and more centralized that way.

*** This bug has been marked as a duplicate of bug 1999325 ***
Reopening for backport.
Checked the CI logs and don't see these errors for 4.10:

https://search.ci.openshift.org/?search=FailedMount+MountVolume.SetUp+failed+for+volume.*kube-api-access.*not+registered&maxAge=168h&context=1&type=junit&name=4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.15 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2258
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days