[sig-arch] events should not repeat pathologically

We've seen an uptick in these messages:

event happened 24 times, something is wrong: ns/openshift-kube-scheduler pod/revision-pruner-8-ci-op-k3ifhkdh-875d2-xjdv2-master-2 node/ci-op-k3ifhkdh-875d2-xjdv2-master-2 - reason/FailedMount MountVolume.SetUp failed for volume "kube-api-access" : object "openshift-kube-scheduler"/"kube-root-ca.crt" not registered

See https://search.ci.openshift.org/?search=FailedMount+MountVolume.SetUp+failed+for+volume.*kube-api-access.*not+registered&maxAge=168h&context=1&type=bug%2Bjunit&name=4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

From the Slack thread discussing this: "The latter (kube-root-ca.crt) is Node bug (kubelet cache probably being slow)"
I am sorry, but here I must exercise the agreement with sig-node: secret / config map / downward API volumes are their territory. We can help with issues there (and we often do), but here it's the kubelet that can't get a config map from the API server in time. Storage is not involved here. In addition, the name "kube-root-ca.crt" suggests it might be one of the volumes projected by the BoundServiceAccountTokenVolume feature (but please double-check).
Observed a similar event in the KCM operator.

# oc get event | grep FailedMount
130m   Warning   FailedMount   pod/installer-9-yangyang0917-tv2fd-master-1.c.openshift-qe.internal         MountVolume.SetUp failed for volume "kube-api-access" : object "openshift-kube-controller-manager"/"kube-root-ca.crt" not registered
129m   Warning   FailedMount   pod/revision-pruner-9-yangyang0917-tv2fd-master-1.c.openshift-qe.internal   MountVolume.SetUp failed for volume "kube-api-access" : object "openshift-kube-controller-manager"/"kube-root-ca.crt" not registered
147m   Warning   FailedMount   pod/yangyang0917-tv2fd-master-1copenshift-qeinternal-debug                  MountVolume.SetUp failed for volume "kube-api-access-gsmrj" : [object "openshift-kube-controller-manager"/"kube-root-ca.crt" not registered, object "openshift-kube-controller-manager"/"openshift-service-ca.crt" not registered]
Sorry, I forgot to reset the assignee.
Hit this issue almost 7 times in the e2e CI nightly jobs. Below is the must-gather link:

http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1999325/must-gather-43219-423866308.tar.gz
Hello team,

I am hitting the same issue cluster-wide during 4.8.12 -> 4.9.0-rc.5 osde2e upgrades in the stage environment, as part of the OSD Gap Analysis of 4.9. For this reason I am adding the ServiceDeliveryBlocker label.

>>>
...
"openshift-kube-scheduler"/"kube-root-ca.crt" not registered
"openshift-compliance"/"kube-root-ca.crt" not registered
"openshift-kube-apiserver"/"kube-root-ca.crt" not registered
...
>>>

Findings from my investigation:

It all started with a couple of pods (to prevent confusion, the name is empty-node-labels) in CrashLoopBackOff state, which I have reason to believe are just a symptom, as I don't see "successfully pulled image" in the event messages.

>>>
Normal    Pulling   104m (x244 over 22h)     kubelet   Pulling image "registry.access.redhat.com/ubi8/ubi-minimal"
Warning   BackOff   4m46s (x6083 over 22h)   kubelet   Back-off restarting failed container
>>>

During the e2e upgrade I was seeing cluster health checks fail for one specific pod (to prevent confusion, the name is osd-delete-backplane-serviceaccounts). The list of events in the namespace openshift-backplane gave me the following error:

>>>
62m   Warning   FailedMount   pod/osd-delete-backplane-serviceaccounts-27227840--1-8bgmj   MountVolume.SetUp failed for volume "kube-api-access-d94r5" : [object "openshift-backplane"/"kube-root-ca.crt" not registered, object "openshift-backplane"/"openshift-service-ca.crt" not registered]
>>>

At this point, I scanned the errors from the kubelet and found out that this is probably a cluster-wide issue.

>>>
Oct 08 07:00:05.679120 ip-10-0-134-56 hyperkube[1475]: I1008 07:00:05.679038 1475 kubelet_pods.go:898] "Unable to retrieve pull secret, the image pull may not succeed." pod="openshift-image-registry/image-pruner-27227940--1-wmb5g" secret="" err="object \"openshift-image-registry\"/\"pruner-dockercfg-qc6tv\" not registered"
>>>

The link I see between the failing empty-node-labels pods and the "not registered" errors is that the kube-api-access volumes have difficulties with mount setup. In detail, the kubelet on the control plane is running a loop of: VerifyControllerAttachedVolume started for volume -> MountVolume started for volume -> SetUp succeeded for volume -> Error preparing data for projected volume (certs are not registered) -> UnmountVolume started for volume -> TearDown succeeded for volume -> Volume detached for volume

>>>
Oct 08 01:02:21.199571 ip-10-0-190-212 hyperkube[1603]: E1008 01:02:21.199473 1603 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/projected/28f54e40-f566-4ccc-9ba4-dccbdb493da2-kube-api-access-wlm72 podName:28f54e40-f566-4ccc-9ba4-dccbdb493da2 nodeName:}" failed. No retries permitted until 2021-10-08 01:02:21.699456816 +0000 UTC m=+14760.769272070 (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume "kube-api-access-wlm72" (UniqueName: "kubernetes.io/projected/28f54e40-f566-4ccc-9ba4-dccbdb493da2-kube-api-access-wlm72") pod "aggregator-pod-rhcos4-moderate-master" (UID: "28f54e40-f566-4ccc-9ba4-dccbdb493da2") : [object "openshift-compliance"/"kube-root-ca.crt" not registered, object "openshift-compliance"/"openshift-service-ca.crt" not registered]
>>>

The last two things I noticed in the kubelet logs were `clearQuota called, but quotas disabled` messages regarding the volumes, and two failing OCP e2e tests, `[sig-devex][Feature:ImageEcosystem][Slow] openshift sample application repositories`, for user "system:serviceaccount:osde2e-rveds:default" about forbidden access. I am not sure if this piece of information is helpful for the investigation.
>>>
Oct 08 01:01:34.621169 ip-10-0-185-218 hyperkube[1603]: W1008 01:01:34.621159 1603 empty_dir.go:517] Warning: Failed to clear quota on /var/lib/kubelet/pods/25f1fd79-7851-4183-a237-bfd1a4950c5d/volumes/kubernetes.io~configmap/rhcos4-moderate-master-openscap-container-entrypoint: clearQuota called, but quotas disabled
Oct 07 20:57:27.976215 ip-10-0-221-29 hyperkube[1609]: W1007 20:57:27.976195 1609 empty_dir.go:520] Warning: Failed to clear quota on /var/lib/kubelet/pods/e39ba8bd-8ad0-4266-b56f-a4d352b1e857/volumes/kubernetes.io~configmap/config: clearQuota called, but quotas disabled
>>>

>>>
Error from server (Forbidden): services is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "services" in API group "" in the namespace "kube-system"
Error from server (Forbidden): infrastructures.config.openshift.io "cluster" is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot get resource "infrastructures" in API group "config.openshift.io" at the cluster scope
Error from server (Forbidden): nodes is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "nodes" in API group "" at the cluster scope
Error from server (Forbidden): Failed to watch *v1.ClusterOperator: failed to list *v1.ClusterOperator: clusteroperators.config.openshift.io is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "clusteroperators" in API group "config.openshift.io" at the cluster scope
Error from server (Forbidden): Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: clusterversions.config.openshift.io "version" is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "clusterversions" in API group "config.openshift.io" at the cluster scope
Error from server (Forbidden): Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "nodes" in API group "" at the cluster scope
Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "pods" in API group "" at the cluster scope
Oct 7 18:56:11.768: FAIL: users.user.openshift.io "e2e-test-rails-postgresql-repo-test-fnvct-user" is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot get resource "users" in API group "user.openshift.io" at the cluster scope
>>>

This is as far as I have got. I would be happy if someone could look at this issue. I am attaching the full must-gather so you can also investigate. Thank you :)
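For context on why these FailedMount events repeat rather than fail the pod outright: the kubelet retries volume SetUp after a short backoff (the "durationBeforeRetry 500ms" in the nestedpendingoperations log above), and each failed attempt surfaces another event until the projected sources become available. The Go code below is a minimal sketch of that retry behavior, written under those assumptions; the names (setUpProjectedVolume, errNotRegistered) are hypothetical and are not the actual kubelet symbols.

// Simplified, hypothetical sketch of the retry loop behind the repeated
// FailedMount events; not the actual kubelet implementation.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errNotRegistered = errors.New(`object "openshift-backplane"/"kube-root-ca.crt" not registered`)

// setUpProjectedVolume stands in for MountVolume.SetUp on a projected volume;
// it fails while the pod's ConfigMap/Secret sources are not yet registered.
func setUpProjectedVolume(registered bool) error {
	if !registered {
		return errNotRegistered
	}
	return nil
}

func main() {
	backoff := 500 * time.Millisecond // matches "durationBeforeRetry 500ms" in the log above
	for attempt := 1; attempt <= 5; attempt++ {
		registered := attempt >= 4 // pretend registration catches up on the 4th try
		if err := setUpProjectedVolume(registered); err != nil {
			// Each failed attempt is reported as another FailedMount event on the pod.
			fmt.Printf("attempt %d: MountVolume.SetUp failed: %v (retry in %v)\n", attempt, err, backoff)
			time.Sleep(backoff)
			continue
		}
		fmt.Printf("attempt %d: volume mounted, pod can start\n", attempt)
		return
	}
}

The only point of the sketch is that the loop is a retry, not a terminal failure, which matches the reports in this thread of pods eventually mounting the volumes despite the events.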
Link to the full must-gather of the 'gap-upgrade' stage cluster:
https://drive.google.com/file/d/1J4EUyFSEPyKmYYOCIUTNCb-VAEaDvYC6/view?usp=sharing
Hello team,

I double-checked with the associate in the attached ticket that these FailedMount "not registered" errors are not affecting the normal operation of the container, and despite their presence in the logs the volumes are mounted to the pod. His issue was incorrect syntax in the YAML of the CronJob object. I'm still not sure what the "not registered" message means, and that needs to be investigated, which is why I am reopening this bug.
Not completed this sprint.
*** Bug 2029836 has been marked as a duplicate of this bug. ***
The corresponding code is https://github.com/openshift/kubernetes/blob/4d08582d1fa21e1f5887e73380001ac827371553/pkg/kubelet/util/manager/watch_based_manager.go#L292

Given the timing (4.9+ releases), I suspect this is related to the Kubernetes pod lifecycle refactor that landed in Kubernetes 1.22. One more patch landed a few hours ago to resolve some lingering issues, and we plan to backport it to 4.9: https://github.com/openshift/kubernetes/pull/1199

Let's wait and see if this fixes the errors we've been seeing in CI.
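To make the "not registered" error easier to interpret, here is a minimal, self-contained sketch of the register-before-get contract that the kubelet's watch-based manager enforces: a pod's ConfigMap/Secret references must be registered before lookups are served, so a volume SetUp that races ahead of pod registration gets exactly this error. This is an assumption-labeled illustration, not the actual objectCache code at the link above; the type and method names are illustrative only.

// Hypothetical sketch of the register-before-get contract behind the
// "not registered" error; not the kubelet's watch_based_manager.go code.
package main

import (
	"fmt"
	"sync"
)

type key struct{ namespace, name string }

type objectCache struct {
	mu   sync.Mutex
	refs map[key]int // reference counts bumped when a pod that mounts the object is registered
}

func newObjectCache() *objectCache {
	return &objectCache{refs: map[key]int{}}
}

// Register models the kubelet starting to manage a pod: every ConfigMap/Secret
// the pod mounts gets its reference count incremented.
func (c *objectCache) Register(namespace, name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.refs[key{namespace, name}]++
}

// Get refuses to serve an object that no registered pod references,
// which is the condition that produces the "not registered" message.
func (c *objectCache) Get(namespace, name string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.refs[key{namespace, name}] == 0 {
		return "", fmt.Errorf("object %q/%q not registered", namespace, name)
	}
	return "object-data", nil
}

func main() {
	cache := newObjectCache()

	// Volume setup asks for the projected source before pod registration has run:
	if _, err := cache.Get("openshift-kube-scheduler", "kube-root-ca.crt"); err != nil {
		fmt.Println("MountVolume.SetUp failed:", err)
	}

	// Once the pod is registered, the same lookup succeeds and the mount completes.
	cache.Register("openshift-kube-scheduler", "kube-root-ca.crt")
	if _, err := cache.Get("openshift-kube-scheduler", "kube-root-ca.crt"); err == nil {
		fmt.Println("lookup succeeds after registration")
	}
}

Per the later comments in this bug, the fixes narrow that window between pod registration and volume setup, which is why the symptom is event/log noise rather than a persistent mount failure.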
Ah wait, this is:

https://github.com/kubernetes/kubernetes/pull/107821
https://github.com/kubernetes/kubernetes/pull/107831

We have open backports upstream for these. I'll look into getting them merged.
Bug is fixed in 4.11 via the PRs linked above in https://bugzilla.redhat.com/show_bug.cgi?id=1999325#c22

Upstream backports have now merged:
1.23: https://github.com/kubernetes/kubernetes/pull/108756
1.22: https://github.com/kubernetes/kubernetes/pull/108754
Based on the attached patches we believe this is just logspam due to a race condition and doesn't have any negative effects on running workloads, per https://bugzilla.redhat.com/show_bug.cgi?id=1999325#c12

We expect the patch to be picked up in OpenShift rebases that include Kubernetes versions at least v1.24.0-alpha.2, 1.23.6, and 1.22.9. The 1.23.x and 1.22.x patches are scheduled to be released in April. I believe we have not yet begun the 1.24 rebase for 4.11, but I expect the bug will be fixed once that rebase lands, as well as in 4.10 and 4.9 once the April patch rebases land. At that time QE can confirm the fix.
I see the same behavior implementing the cron for LDAP sync: https://docs.openshift.com/container-platform/4.10/authentication/ldap-syncing.html#ldap-auto-syncing_ldap-syncing-groups

Server Version: 4.10.5
Kubernetes Version: v1.23.3+e419edf

Warning   FailedMount   18m (x2 over 18m)   kubelet   MountVolume.SetUp failed for volume "ldap-sync-volume" : object "openshift-authentication"/"ldap-group-syncer" not registered
Warning   FailedMount   18m (x2 over 18m)   kubelet   MountVolume.SetUp failed for volume "ldap-ca-cert" : object "openshift-authentication"/"ldap-ca" not registered
Warning   FailedMount   18m (x2 over 18m)   kubelet   MountVolume.SetUp failed for volume "ldap-bind-password" : object "openshift-authentication"/"v4-0-config-user-idp-0-bind-password" not registered
Warning   FailedMount   18m (x2 over 18m)   kubelet   MountVolume.SetUp failed for volume "kube-api-access-w9262" : [object "openshift-authentication"/"kube-root-ca.crt" not registered, object "openshift-authentication"/"openshift-service-ca.crt" not registered]

The cron/pod could mount:
ldap-sync-volume --> configmap
ldap-bind-password --> secret

It cannot mount this configmap: ldap-ca-cert
We are also waiting on one more patch to address the issue: https://github.com/kubernetes/kubernetes/pull/107824
*** Bug 2067464 has been marked as a duplicate of this bug. ***
Pulling in https://bugzilla.redhat.com/show_bug.cgi?id=2067464#c3

> IBM team did some further testing and have verified that this is only a logging error. The mounts are successful despite the error.

@aguadarr Are you sure the configmap in question is failing to mount? You've only provided the events and not the kubelet logs/pod info. I'd expect it to have successfully mounted despite the errors.

The backport of the required patches in https://github.com/openshift/kubernetes/pull/1236 is passing CI; we are just waiting for everything upstream to merge. The 1.24 release is delayed a couple of weeks due to a golang 1.18 bug, so we are just waiting on https://github.com/kubernetes/kubernetes/pull/107824 to merge upstream.
(In reply to Elana Hashman from comment #30)
> Pulling in https://bugzilla.redhat.com/show_bug.cgi?id=2067464#c3
>
> > IBM team did some further testing and have verified that this is only a logging error. The mounts are successful despite the error.
>
> @aguadarr Are you sure the configmap in question is failing to mount? You've only provided the events and not the kubelet logs/pod info. I'd expect it to have successfully mounted despite the errors.
>
> The backport of the required patches in https://github.com/openshift/kubernetes/pull/1236 is passing CI; we are just waiting for everything upstream to merge. The 1.24 release is delayed a couple of weeks due to a golang 1.18 bug, so we are just waiting on https://github.com/kubernetes/kubernetes/pull/107824 to merge upstream.

100% sure. I have the cron working perfectly in OCP 4.8. When we installed version 4.10.5, the cron stopped working. Debugging the cron/pod, I double-checked that the config map with the certificate is not mounted. It's weird: I would expect it to mount everything or nothing, but the behavior is that 1 secret and 1 configmap are mounted while the second configmap is not.
After discussion with @hekumar I confirmed that we shouldn't need the third patch. It fixes a related bug, but not this one: https://github.com/kubernetes/kubernetes/issues/105204#issuecomment-1099296988

I will update the PR and initiate backports.
Hi Elana Hashman, Ryan Phillips,

I get the following error from CI search (https://search.ci.openshift.org/?search=MountVolume.SetUp+failed+for+volume.*not+registered&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job):

periodic-ci-openshift-release-master-okd-4.11-e2e-aws (all) - 23 runs, 22% failed, 20% of failures match = 4% impact
#1519583571889623040 junit 24 hours ago
event happened 31 times, something is wrong: ns/e2e-statefulset-2909 pod/ss-1 node/ip-10-0-220-99.us-east-2.compute.internal - reason/FailedMount MountVolume.SetUp failed for volume "kube-api-access-hzlrn" : [object "e2e-statefulset-2909"/"kube-root-ca.crt" not registered, object "e2e-statefulset-2909"/"openshift-service-ca.crt" not registered]

And I checked the prowjob.json of the job "periodic-ci-openshift-release-master-okd-4.11-e2e-aws":

"env": [
  {
    "name": "RELEASE_IMAGE_LATEST",
    "value": "registry.ci.openshift.org/origin/release:4.11.0-0.okd-2022-04-28-073729"
  },
  {
    "name": "RELEASE_IMAGE_INITIAL",
    "value": "registry.ci.openshift.org/origin/release:4.11.0-0.okd-2022-04-28-073729"
  }

Can we confirm whether this release image contains the fix? The upstream has also been fixed. (I did not see this error in OCP 4.11 jobs installed with a release image like registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-04-26-181148.) So I'm not sure whether this shows that the bug reproduced. Please double-check. Thank you.
We see this same issue on 4.10.4, and it prevents us from running the Operator CI pipeline for operator bundle certification, among other side effects. Which particular 4.10.z release will have the fix, and when is that 4.10.z version expected to be released? Any update on the timeline will help. Thanks!
Bertrand - Can you help with Hersh's question in Comment 40?
*** Bug 2074673 has been marked as a duplicate of this bug. ***
I didn't get a similar error from 4.11 jobs in CI search in the last 7 days: https://search.ci.openshift.org/?search=MountVolume.SetUp+failed+for+volume.*not+registered&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Setting this to VERIFIED.
We also had this issue on OCP version 4.10.6 in some cronjobs, but they finish without errors. I just saw this while checking the events of the job:

Warning   FailedMount   pod/POD_NAME   MountVolume.SetUp failed for volume "kube-api-access-tx9fx" : [object "NAMESPACE"/"kube-root-ca.crt" not registered, object "mcdbinit"/"openshift-service-ca.crt" not registered]
Warning   FailedMount   pod/POD_NAME   MountVolume.SetUp failed for volume "CONFIG_MAP" : object "NAMESPACE"/"db-init-CONFIG_MAP" not registered

Can we confirm whether this is a false-positive event message? Thanks
*** Bug 2106547 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
(In reply to Pragyan Pathi from comment #41)
> Bertrand - Can you help with Hersh's question in Comment 40?

OpenShift Container Platform 4.10.15
https://bugzilla.redhat.com/show_bug.cgi?id=2067464#c12