Bug 1999325 - FailedMount MountVolume.SetUp failed for volume "kube-api-access" : object "openshift-kube-scheduler"/"kube-root-ca.crt" not registered
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Ryan Phillips
QA Contact: MinLi
URL:
Whiteboard:
Duplicates: 2029836 2074673 2106547
Depends On:
Blocks: 2067464 2076669
 
Reported: 2021-08-30 21:44 UTC by Stephen Benjamin
Modified: 2022-10-28 08:33 UTC
CC: 38 users

Fixed In Version:
Doc Type: Rebase: Bug Fixes Only
Doc Text:
Clone Of:
Clones: 2067464 2075704
Environment:
Last Closed: 2022-08-10 10:37:25 UTC
Target Upstream Version:
Embargoed:
tgabriel: needinfo-


Attachments


Links
- GitHub kubernetes/kubernetes issue 105204 (open): sporadic "MountVolume.SetUp failed for volume ... not registered" failures seen (last updated 2022-04-04 13:32:56 UTC)
- GitHub openshift/kubernetes pull 1225 (open): Bug 1999325: Backport 107821 and 107831 (last updated 2022-04-14 15:30:57 UTC)
- Red Hat Knowledge Base Solution 6594981 (last updated 2022-02-08 18:34:50 UTC)
- Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 10:37:45 UTC)

Description Stephen Benjamin 2021-08-30 21:44:26 UTC
[sig-arch] events should not repeat pathologically

We've seen an uptick in these messages:

event happened 24 times, something is wrong: ns/openshift-kube-scheduler pod/revision-pruner-8-ci-op-k3ifhkdh-875d2-xjdv2-master-2 node/ci-op-k3ifhkdh-875d2-xjdv2-master-2 - reason/FailedMount MountVolume.SetUp failed for volume "kube-api-access" : object "openshift-kube-scheduler"/"kube-root-ca.crt" not registered


See https://search.ci.openshift.org/?search=FailedMount+MountVolume.SetUp+failed+for+volume.*kube-api-access.*not+registered&maxAge=168h&context=1&type=bug%2Bjunit&name=4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

From Slack thread discussing this: "The latter (kube-root-ca.crt) is Node bug (kubelet cache probably being slow)"
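On a live cluster the same events can be listed directly, for example (illustrative command; assumes a recent oc/kubectl that supports event field selectors):

# oc get events -A --field-selector reason=FailedMount | grep "not registered"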

Comment 1 Jan Safranek 2021-09-10 12:57:57 UTC
I am sorry, but here I must exercise the agreement with sig-node: secret, config map, and downward API volumes are their territory. We can help with issues there (and we often do), but here it is the kubelet that cannot get a config map from the API server in time. Storage is not involved here. In addition, the name "kube-root-ca.crt" suggests it might be one of the volumes projected by the BoundServiceAccountTokenVolume feature (but please double check).
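To double-check that, the pod spec should show kube-api-access as a projected volume whose sources include the kube-root-ca.crt ConfigMap, for example (illustrative command; the pod name is a placeholder):

# oc get pod <pod-name> -n openshift-kube-scheduler -o yaml | grep -A 15 "kube-api-access"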

Comment 2 Yang Yang 2021-09-17 08:14:13 UTC
Observed a similar event in the KCM operator.

# oc get event | grep FailedMount
130m        Warning   FailedMount                   pod/installer-9-yangyang0917-tv2fd-master-1.c.openshift-qe.internal               MountVolume.SetUp failed for volume "kube-api-access" : object "openshift-kube-controller-manager"/"kube-root-ca.crt" not registered
129m        Warning   FailedMount                   pod/revision-pruner-9-yangyang0917-tv2fd-master-1.c.openshift-qe.internal         MountVolume.SetUp failed for volume "kube-api-access" : object "openshift-kube-controller-manager"/"kube-root-ca.crt" not registered
147m        Warning   FailedMount                   pod/yangyang0917-tv2fd-master-1copenshift-qeinternal-debug                        MountVolume.SetUp failed for volume "kube-api-access-gsmrj" : [object "openshift-kube-controller-manager"/"kube-root-ca.crt" not registered, object "openshift-kube-controller-manager"/"openshift-service-ca.crt" not registered]

Comment 3 Jan Safranek 2021-09-17 08:53:27 UTC
sorry, I forgot to reset the assignee.

Comment 4 RamaKasturi 2021-09-21 13:35:08 UTC
Hit this issue about 7 times in the e2e CI nightly jobs. Below is the must-gather link.

http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1999325/must-gather-43219-423866308.tar.gz

Comment 5 Tereza 2021-10-08 14:51:42 UTC
Hello team,

I am hitting the same issues cluster-wide during 4.8.12 -> 4.9.0-rc.5 osde2e upgrades in the stage environment as part of the OSD Gap Analysis of 4.9. For this reason I am adding the ServiceDeliveryBlocker label.

>>>
...
  "openshift-kube-scheduler"/"kube-root-ca.crt" not registered
  "openshift-compliance"/"kube-root-ca.crt" not registered
  "openshift-kube-apiserver"/"kube-root-ca.crt" not registered
...
>>>

Findings from my investigation:
It all started with a couple of pods (to prevent confusion: the name is empty-node-labels) in CrashLoopBackOff state, which I have reason to believe are just a symptom, as I don't see "successfully pulled image" in the event messages.

>>>
  Normal   Pulling  104m (x244 over 22h)    kubelet  Pulling image "registry.access.redhat.com/ubi8/ubi-minimal"
  Warning  BackOff  4m46s (x6083 over 22h)  kubelet  Back-off restarting failed container
>>>

During the e2e upgrade I saw cluster health checks fail for one specific pod (to prevent confusion: the name is osd-delete-backplane-serviceaccounts). The list of events in the openshift-backplane namespace gave me the following error:

>>>
  62m         Warning   FailedMount        pod/osd-delete-backplane-serviceaccounts-27227840--1-8bgmj   MountVolume.SetUp failed for volume "kube-api-access-d94r5" : [object "openshift-backplane"/"kube-root-ca.crt" not registered, object "openshift-backplane"/"openshift-service-ca.crt" not registered]
>>>

At this point, I scanned the errors from the kubelet and found that this is probably a cluster-wide issue.

>>>
  Oct 08 07:00:05.679120 ip-10-0-134-56 hyperkube[1475]: I1008 07:00:05.679038    1475 kubelet_pods.go:898] "Unable to retrieve pull secret, the image pull may not succeed." pod="openshift-image-registry/image-pruner-27227940--1-wmb5g" secret="" err="object \"openshift-image-registry\"/\"pruner-dockercfg-qc6tv\" not registered"
>>>

The link I see between the failing empty-node-labels pods and the "not registered" errors is that the kube-api-access volumes have difficulties with mount setup. In detail, the kubelet on the control plane runs a loop of:
VerifyControllerAttachedVolume started for volume -> MountVolume started for volume -> SetUp succeeded for volume -> Error preparing data for projected volume (certs are not registered) -> UnmountVolume started for volume -> TearDown succeeded for volume -> Volume detached for volume

>>>
  Oct 08 01:02:21.199571 ip-10-0-190-212 hyperkube[1603]: E1008 01:02:21.199473    1603 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/projected/28f54e40-f566-4ccc-9ba4-dccbdb493da2-kube-api-access-wlm72 podName:28f54e40-f566-4ccc-9ba4-dccbdb493da2 nodeName:}" failed. No retries permitted until 2021-10-08 01:02:21.699456816 +0000 UTC m=+14760.769272070 (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume "kube-api-access-wlm72" (UniqueName: "kubernetes.io/projected/28f54e40-f566-4ccc-9ba4-dccbdb493da2-kube-api-access-wlm72") pod "aggregator-pod-rhcos4-moderate-master" (UID: "28f54e40-f566-4ccc-9ba4-dccbdb493da2") : [object "openshift-compliance"/"kube-root-ca.crt" not registered, object "openshift-compliance"/"openshift-service-ca.crt" not registered]
>>>

The last two things I noticed in the kubelet logs were `clearQuota called, but quotas disabled` messages regarding the volumes, and two failing OCP e2e tests (`[sig-devex][Feature:ImageEcosystem][Slow] openshift sample application repositories`) reporting forbidden access for user "system:serviceaccount:osde2e-rveds:default". I am not sure whether this information is helpful for the investigation.

>>>
  Oct 08 01:01:34.621169 ip-10-0-185-218 hyperkube[1603]: W1008 01:01:34.621159    1603 empty_dir.go:517] Warning: Failed to clear quota on /var/lib/kubelet/pods/25f1fd79-7851-4183-a237-bfd1a4950c5d/volumes/kubernetes.io~configmap/rhcos4-moderate-master-openscap-container-entrypoint: clearQuota called, but quotas disabled
  Oct 07 20:57:27.976215 ip-10-0-221-29 hyperkube[1609]: W1007 20:57:27.976195    1609 empty_dir.go:520] Warning: Failed to clear quota on /var/lib/kubelet/pods/e39ba8bd-8ad0-4266-b56f-a4d352b1e857/volumes/kubernetes.io~configmap/config: clearQuota called, but quotas disabled
>>>
>>>
  Error from server (Forbidden): services is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "services" in API group "" in the namespace "kube-system"
  Error from server (Forbidden): infrastructures.config.openshift.io "cluster" is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot get resource "infrastructures" in API group "config.openshift.io" at the cluster scope
  Error from server (Forbidden): nodes is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "nodes" in API group "" at the cluster scope
  Error from server (Forbidden): Failed to watch *v1.ClusterOperator: failed to list *v1.ClusterOperator: clusteroperators.config.openshift.io is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "clusteroperators" in API group "config.openshift.io" at the cluster scope
  Error from server (Forbidden): Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: clusterversions.config.openshift.io "version" is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "clusterversions" in API group "config.openshift.io" at the cluster scope
  Error from server (Forbidden): Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "nodes" in API group "" at the cluster scope
  Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "pods" in API group "" at the cluster scope
Oct  7 18:56:11.768: FAIL: users.user.openshift.io "e2e-test-rails-postgresql-repo-test-fnvct-user" is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot get resource "users" in API group "user.openshift.io" at the cluster scope
>>>

This is as far as I have got.

I would be happy if someone could look at this issue. I am attaching the full must-gather so you can investigate as well.

Thank you :)

Comment 6 Tereza 2021-10-08 14:56:20 UTC
Link to full must-gather of 'gap-upgrade' stg cluster https://drive.google.com/file/d/1J4EUyFSEPyKmYYOCIUTNCb-VAEaDvYC6/view?usp=sharing

Comment 12 Nikos Stamatelopoulos 2021-12-07 14:08:32 UTC
Hello team,

I double-checked with the associate in the attached ticket that these FailedMount ... "not registered" errors are not affecting the normal operation of the container, and despite their presence in the logs the volumes are mounted to the pod.
His issue was incorrect syntax in the YAML of the cronjob object.

I am not sure what this "not registered" message means; it needs to be investigated, which is why I am reopening this bug.

Comment 14 Tom Sweeney 2022-01-06 16:20:30 UTC
Not completed this sprint.

Comment 20 Alexander Greene 2022-03-11 19:33:49 UTC
*** Bug 2029836 has been marked as a duplicate of this bug. ***

Comment 21 Elana Hashman 2022-03-17 22:51:43 UTC
Corresponding code is https://github.com/openshift/kubernetes/blob/4d08582d1fa21e1f5887e73380001ac827371553/pkg/kubelet/util/manager/watch_based_manager.go#L292

I suspect given the timing (4.9+ releases) this is related to the Kubernetes pod lifecycle refactor that landed in Kubernetes 1.22.

One more patch to resolve some lingering issues landed a few hours ago; we plan to backport it to 4.9: https://github.com/openshift/kubernetes/pull/1199

Let's wait and see if this fixes the errors we've been seeing in CI.

Comment 22 Elana Hashman 2022-03-17 23:09:46 UTC
Ah wait, this is https://github.com/kubernetes/kubernetes/pull/107821 https://github.com/kubernetes/kubernetes/pull/107831

We have open backports upstream for these. I'll look into getting them merged.

Comment 24 Elana Hashman 2022-03-21 16:42:44 UTC
The bug is fixed in 4.11 via the PRs linked above in https://bugzilla.redhat.com/show_bug.cgi?id=1999325#c22.

Upstream backports have now merged for

1.23: https://github.com/kubernetes/kubernetes/pull/108756
1.22: https://github.com/kubernetes/kubernetes/pull/108754

Comment 25 Elana Hashman 2022-03-22 17:58:37 UTC
Based on the attached patches we believe this is just logspam due to a race condition and doesn't have any negative effects on running workloads, per https://bugzilla.redhat.com/show_bug.cgi?id=1999325#c12

We expect the patch to be picked up in OpenShift rebases that include Kubernetes versions at least v1.24.0-alpha.2, 1.23.6, and 1.22.9. The 1.23.x and 1.22.x patches are scheduled to be released in April.

I believe we have not yet begun the 1.24 rebase for 4.11, but I expect the bug will be fixed once that rebase lands, and in 4.10 and 4.9 once the April patch rebases land. At that time QE can confirm the fix.
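Once those rebases land, one way to check whether a given cluster already carries a fixed kubelet is to compare the node versions against the thresholds above (illustrative commands):

# oc get clusterversion
# oc get nodes -o wide    # the VERSION column shows the kubelet's Kubernetes level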

Comment 27 Alejandro G 2022-03-28 17:40:58 UTC
I see the same behavior when implementing the cron job for LDAP sync.

https://docs.openshift.com/container-platform/4.10/authentication/ldap-syncing.html#ldap-auto-syncing_ldap-syncing-groups

Server Version: 4.10.5
Kubernetes Version: v1.23.3+e419edf

Warning FailedMount 18m (x2 over 18m) kubelet MountVolume.SetUp failed for volume "ldap-sync-volume" : object "openshift-authentication"/"ldap-group-syncer" not registered
Warning FailedMount 18m (x2 over 18m) kubelet MountVolume.SetUp failed for volume "ldap-ca-cert" : object "openshift-authentication"/"ldap-ca" not registered
Warning FailedMount 18m (x2 over 18m) kubelet MountVolume.SetUp failed for volume "ldap-bind-password" : object "openshift-authentication"/"v4-0-config-user-idp-0-bind-password" not registered
Warning FailedMount 18m (x2 over 18m) kubelet MountVolume.SetUp failed for volume "kube-api-access-w9262" : [object "openshift-authentication"/"kube-root-ca.crt" not registered, object "openshift-authentication"/"openshift-service-ca.crt" not registered]

The cron/pod could mount:
ldap-sync-volume   --> configmap
ldap-bind-password --> secret

It cannot mount this configmap:
ldap-ca-cert

Comment 28 Elana Hashman 2022-04-08 21:14:11 UTC
We are also waiting on one more patch to address the issue: https://github.com/kubernetes/kubernetes/pull/107824

Comment 29 Elana Hashman 2022-04-12 21:01:17 UTC
*** Bug 2067464 has been marked as a duplicate of this bug. ***

Comment 30 Elana Hashman 2022-04-12 21:25:18 UTC
Pulling in https://bugzilla.redhat.com/show_bug.cgi?id=2067464#c3

> IBM team did some further testing and have verified that this is only a logging error. The mounts are successful despite the error.


@aguadarr Are you sure the configmap in question is failing to mount? You've only provided the events and not the kubelet logs/pod info. I'd expect it to have successfully mounted despite the errors.
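For reference, the kind of data that would confirm or rule out a real mount failure looks roughly like this (illustrative commands; pod, namespace, and node names are placeholders):

# oc get pod <pod-name> -n <namespace> -o yaml                                            # volume definitions and container status
# oc exec <pod-name> -n <namespace> -- ls /var/run/secrets/kubernetes.io/serviceaccount   # should list ca.crt, token, etc. if kube-api-access mounted
# oc adm node-logs <node-name> -u kubelet | grep "not registered"                         # kubelet-side view of the error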


Backport of required patches in https://github.com/openshift/kubernetes/pull/1236 is passing CI; just waiting for everything upstream to merge. The 1.24 release is delayed a couple weeks due to a golang 1.18 bug so we are just waiting on https://github.com/kubernetes/kubernetes/pull/107824 to merge upstream.

Comment 31 Alejandro G 2022-04-12 21:39:11 UTC
(In reply to Elana Hashman from comment #30)
> Pulling in https://bugzilla.redhat.com/show_bug.cgi?id=2067464#c3
> 
> > IBM team did some further testing and have verified that this is only a logging error. The mounts are successful despite the error.
> 
> 
> @aguadarr Are you sure the configmap in question is failing to
> mount? You've only provided the events and not the kubelet logs/pod info.
> I'd expect it to have successfully mounted despite the errors.
> 
> 
> Backport of required patches in
> https://github.com/openshift/kubernetes/pull/1236 is passing CI; just
> waiting for everything upstream to merge. The 1.24 release is delayed a
> couple weeks due to a golang 1.18 bug so we are just waiting on
> https://github.com/kubernetes/kubernetes/pull/107824 to merge upstream.

100% sure. I have the cron job working perfectly in OCP 4.8. When we installed version 4.10.5, the cron job stopped working; debugging the cron/pod, I double-checked that the config map with the certificate is not mounted.

It's weird: I would expect it to mount everything or nothing, but the behavior is that one secret and one configmap are mounted while the second configmap is not.

Comment 32 Elana Hashman 2022-04-14 15:29:26 UTC
After discussion with @hekumar I confirmed that we shouldn't need the third patch. It fixes a related bug, but not this one: https://github.com/kubernetes/kubernetes/issues/105204#issuecomment-1099296988

I will update the PR and initiate backports.

Comment 37 MinLi 2022-04-29 10:07:24 UTC
Hi Elana Hashman, Ryan Phillips,

I get the following error from CI search: (https://search.ci.openshift.org/?search=MountVolume.SetUp+failed+for+volume.*not+registered&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job)
periodic-ci-openshift-release-master-okd-4.11-e2e-aws (all) - 23 runs, 22% failed, 20% of failures match = 4% impact
#1519583571889623040	junit	24 hours ago	

event happened 31 times, something is wrong: ns/e2e-statefulset-2909 pod/ss-1 node/ip-10-0-220-99.us-east-2.compute.internal - reason/FailedMount MountVolume.SetUp failed for volume "kube-api-access-hzlrn" : [object "e2e-statefulset-2909"/"kube-root-ca.crt" not registered, object "e2e-statefulset-2909"/"openshift-service-ca.crt" not registered]

And I checked the prowjob.json of job "periodic-ci-openshift-release-master-okd-4.11-e2e-aws":
"env": [
						{
							"name": "RELEASE_IMAGE_LATEST",
							"value": "registry.ci.openshift.org/origin/release:4.11.0-0.okd-2022-04-28-073729"
						},
						{
							"name": "RELEASE_IMAGE_INITIAL",
							"value": "registry.ci.openshift.org/origin/release:4.11.0-0.okd-2022-04-28-073729"
						}

Can we confirm whether this release image contains the fix? In any case, the upstream has also been fixed.
(Yet I didn't see this error in OCP 4.11 jobs installed with a release image like registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-04-26-181148.)

So I'm not sure whether this shows that the bug has reproduced. Please double-check. Thank you.

Comment 40 hersh.pathak 2022-05-03 18:29:51 UTC
We see this same issue on 4.10.4 and it prevents us from running the Operator CI pipeline for operator bundle certification, among other side effects. Which particular 4.10.z release will have the fix, and when is that 4.10.z version expected to be released? Any update on the timeline will help. Thanks!

Comment 41 Pragyan Pathi 2022-05-03 23:38:59 UTC
Bertrand - Can you help with Hersh's question in Comment 40?

Comment 42 Sascha Grunert 2022-05-05 07:18:29 UTC
*** Bug 2074673 has been marked as a duplicate of this bug. ***

Comment 49 Alejandro Villegas 2022-06-02 11:34:20 UTC
We also had this issue on OCP 4.10.6 in some cronjobs, but they finish without errors. I just saw this when checking the events of the job:


Warning FailedMount pod/POD_NAME MountVolume.SetUp failed for volume "kube-api-access-tx9fx" : [object "NAMESPACE"/"kube-root-ca.crt" not registered, object "mcdbinit"/"openshift-service-ca.crt" not registered] 

Warning FailedMount pod/POD_NAME MountVolume.SetUp failed for volume "CONFIG_MAP" : object "NAMESPACE"/"db-init-CONFIG_MAP" not registered

Can we confirm whether this is a false-positive event message?

Thanks

Comment 52 Ryan Phillips 2022-07-28 19:38:19 UTC
*** Bug 2106547 has been marked as a duplicate of this bug. ***

Comment 55 errata-xmlrpc 2022-08-10 10:37:25 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 57 Bertrand 2022-09-20 07:13:56 UTC
(In reply to Pragyan Pathi from comment #41)
> Bertrand - Can you help with Hersh's question in Comment 40?

OpenShift Container Platform 4.10.15

https://bugzilla.redhat.com/show_bug.cgi?id=2067464#c12

