Bug 2067464 - FailedMount MountVolume.SetUp failed for volume "kube-api-access" : object "openshift-kube-scheduler"/"kube-root-ca.crt" not registered
Summary: FailedMount MountVolume.SetUp failed for volume "kube-api-access" : object "openshift-kube-scheduler"/"kube-root-ca.crt" not registered
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.10.z
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On: 1999325
Blocks: 2075704
 
Reported: 2022-03-23 21:00 UTC by Kaushal
Modified: 2023-09-15 01:53 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1999325
Environment:
Last Closed: 2022-05-23 13:25:10 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/kubernetes pull 1241 (open): [release-4.10] Bug 2067464: Backport 107821 and 107831 - last updated 2022-05-10 01:00:37 UTC
Red Hat Issue Tracker CPSYN-63 - last updated 2022-03-23 21:00:06 UTC
Red Hat Issue Tracker SUPPORTEX-9104 - last updated 2022-05-06 18:22:31 UTC
Red Hat Product Errata RHBA-2022:2258 - last updated 2022-05-23 13:25:34 UTC

Description Kaushal 2022-03-23 21:00:06 UTC
I believe this has been fixed in 4.11 but given the nature of this (it blocks IBM for months), we would like to request that it be back-ported to 4.10

Here is the Jira ticket which IBM team opened: https://issues.redhat.com/browse/CPSYN-63

+++ This bug was initially created as a clone of Bug #1999325 +++

[sig-arch] events should not repeat pathologically

We've seen an uptick in these messages:

event happened 24 times, something is wrong: ns/openshift-kube-scheduler pod/revision-pruner-8-ci-op-k3ifhkdh-875d2-xjdv2-master-2 node/ci-op-k3ifhkdh-875d2-xjdv2-master-2 - reason/FailedMount MountVolume.SetUp failed for volume "kube-api-access" : object "openshift-kube-scheduler"/"kube-root-ca.crt" not registered


See https://search.ci.openshift.org/?search=FailedMount+MountVolume.SetUp+failed+for+volume.*kube-api-access.*not+registered&maxAge=168h&context=1&type=bug%2Bjunit&name=4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

From Slack thread discussing this: "The latter (kube-root-ca.crt) is Node bug (kubelet cache probably being slow)"

--- Additional comment from Jan Safranek on 2021-09-10 12:57:57 UTC ---

I am sorry, but here I must exercise the agreement with sig-node - secret / config map / downward API volumes are their territory. We can help with issues there (and we often do), but here it's the kubelet that can't get a config map from the API server in time. Storage is not involved here. In addition, the name "kube-root-ca.crt" suggests it might be one of the volumes projected by the BoundServiceAccountTokenVolume feature (but please double check).
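
For reference, a quick way to see what the kube-api-access volume projects is to dump the pod spec; this is only a hedged sketch, and the pod name is a placeholder taken from the events above, not from the original report:

  # Placeholder pod name; substitute one from the FailedMount events.
  # Dump the pod spec and look at the kube-api-access volume definition:
  $ oc -n openshift-kube-scheduler get pod <revision-pruner-pod> -o yaml \
      | grep -A 15 'name: kube-api-access'
  # Expect a projected volume whose sources include a serviceAccountToken,
  # the "kube-root-ca.crt" ConfigMap (and, on OpenShift, usually
  # "openshift-service-ca.crt"), plus the namespace via the downward API.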

--- Additional comment from Yang Yang on 2021-09-17 08:14:13 UTC ---

Observed a similar event in the KCM operator.

# oc get event | grep FailedMount
130m        Warning   FailedMount                   pod/installer-9-yangyang0917-tv2fd-master-1.c.openshift-qe.internal               MountVolume.SetUp failed for volume "kube-api-access" : object "openshift-kube-controller-manager"/"kube-root-ca.crt" not registered
129m        Warning   FailedMount                   pod/revision-pruner-9-yangyang0917-tv2fd-master-1.c.openshift-qe.internal         MountVolume.SetUp failed for volume "kube-api-access" : object "openshift-kube-controller-manager"/"kube-root-ca.crt" not registered
147m        Warning   FailedMount                   pod/yangyang0917-tv2fd-master-1copenshift-qeinternal-debug                        MountVolume.SetUp failed for volume "kube-api-access-gsmrj" : [object "openshift-kube-controller-manager"/"kube-root-ca.crt" not registered, object "openshift-kube-controller-manager"/"openshift-service-ca.crt" not registered]
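
As a sanity check that the warning is transient rather than a real mount failure, the projected contents can be listed inside a running pod in the affected namespace; a minimal sketch, with the pod name as a placeholder:

  # Placeholder pod name; pick a running pod from the affected namespace.
  $ oc -n openshift-kube-controller-manager exec <pod-name> -- \
      ls /var/run/secrets/kubernetes.io/serviceaccount
  # Expect ca.crt, namespace and token (plus service-ca.crt on OpenShift);
  # if they are present, the kube-api-access volume was mounted in the end.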

--- Additional comment from Jan Safranek on 2021-09-17 08:53:27 UTC ---

sorry, I forgot to reset the assignee.

--- Additional comment from RamaKasturi on 2021-09-21 13:35:08 UTC ---

Hit this issue almost 7 times in the e2e CI nightly jobs. Below is the must-gather link.

http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1999325/must-gather-43219-423866308.tar.gz

--- Additional comment from Tereza on 2021-10-08 14:51:42 UTC ---

Hello team,

I am hitting the same issue cluster-wide during 4.8.12 -> 4.9.0-rc.5 osde2e upgrades in the stage environment, as part of the OSD Gap Analysis of 4.9. For this reason I am adding the ServiceDeliveryBlocker label.

>>>
...
  "openshift-kube-scheduler"/"kube-root-ca.crt" not registered
  "openshift-compliance"/"kube-root-ca.crt" not registered
  "openshift-kube-apiserver"/"kube-root-ca.crt" not registered
...
>>>

Findings from my investigation:
It all started with a couple of pods (named empty-node-labels, to prevent confusion) in CrashLoopBackOff state, which I have reason to believe are just a symptom, as I don't see "Successfully pulled image" in the event messages.

>>>
  Normal   Pulling  104m (x244 over 22h)    kubelet  Pulling image "registry.access.redhat.com/ubi8/ubi-minimal"
  Warning  BackOff  4m46s (x6083 over 22h)  kubelet  Back-off restarting failed container
>>>

During the e2e upgrade I saw cluster health checks fail for one specific pod (named osd-delete-backplane-serviceaccounts, to prevent confusion). The list of events in the openshift-backplane namespace gave me the following error:

>>>
  62m         Warning   FailedMount        pod/osd-delete-backplane-serviceaccounts-27227840--1-8bgmj   MountVolume.SetUp failed for volume "kube-api-access-d94r5" : [object "openshift-backplane"/"kube-root-ca.crt" not registered, object "openshift-backplane"/"openshift-service-ca.crt" not registered]
>>>

At this point I scanned the errors from the kubelet and found that this is probably a cluster-wide issue.

>>>
  Oct 08 07:00:05.679120 ip-10-0-134-56 hyperkube[1475]: I1008 07:00:05.679038    1475 kubelet_pods.go:898] "Unable to retrieve pull secret, the image pull may not succeed." pod="openshift-image-registry/image-pruner-27227940--1-wmb5g" secret="" err="object \"openshift-image-registry\"/\"pruner-dockercfg-qc6tv\" not registered"
>>>

The link I see between the failing empty-node-labels pods and the "not registered" errors is the kube-api-access volumes having difficulties with mount setup. In detail, the kubelet on the control plane is running a loop of:
VerifyControllerAttachedVolume started for volume -> MountVolume started for volume -> SetUp succeeded for volume -> Error preparing data for projected volume (certs are not registered) -> UnmountVolume started for volume -> TearDown succeeded for volume -> Volume detached for volume

>>>
  Oct 08 01:02:21.199571 ip-10-0-190-212 hyperkube[1603]: E1008 01:02:21.199473    1603 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/projected/28f54e40-f566-4ccc-9ba4-dccbdb493da2-kube-api-access-wlm72 podName:28f54e40-f566-4ccc-9ba4-dccbdb493da2 nodeName:}" failed. No retries permitted until 2021-10-08 01:02:21.699456816 +0000 UTC m=+14760.769272070 (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume "kube-api-access-wlm72" (UniqueName: "kubernetes.io/projected/28f54e40-f566-4ccc-9ba4-dccbdb493da2-kube-api-access-wlm72") pod "aggregator-pod-rhcos4-moderate-master" (UID: "28f54e40-f566-4ccc-9ba4-dccbdb493da2") : [object "openshift-compliance"/"kube-root-ca.crt" not registered, object "openshift-compliance"/"openshift-service-ca.crt" not registered]
>>>
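
One way to watch that loop directly is to grep the kubelet journal on a control-plane node; a hedged sketch, with the node name as a placeholder:

  # Placeholder node name; use one of the control-plane nodes.
  $ oc debug node/<master-node> -- chroot /host \
      journalctl -u kubelet --since "1 hour ago" \
      | grep -E 'kube-api-access|not registered' | tail -n 40
  # Repeated "SetUp succeeded" -> "not registered" -> "TearDown succeeded"
  # sequences for the same volume indicate the cycle described above.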

The last two things I noticed in the kubelet logs were `clearQuota called, but quotas disabled` messages regarding the volumes, and two failing OCP e2e tests (`[sig-devex][Feature:ImageEcosystem][Slow] openshift sample application repositories`) reporting forbidden access for user "system:serviceaccount:osde2e-rveds:default". I am not sure if this piece of information is helpful for the investigation.

>>>
  Oct 08 01:01:34.621169 ip-10-0-185-218 hyperkube[1603]: W1008 01:01:34.621159    1603 empty_dir.go:517] Warning: Failed to clear quota on /var/lib/kubelet/pods/25f1fd79-7851-4183-a237-bfd1a4950c5d/volumes/kubernetes.io~configmap/rhcos4-moderate-master-openscap-container-entrypoint: clearQuota called, but quotas disabled
  Oct 07 20:57:27.976215 ip-10-0-221-29 hyperkube[1609]: W1007 20:57:27.976195    1609 empty_dir.go:520] Warning: Failed to clear quota on /var/lib/kubelet/pods/e39ba8bd-8ad0-4266-b56f-a4d352b1e857/volumes/kubernetes.io~configmap/config: clearQuota called, but quotas disabled
>>>
>>>
  Error from server (Forbidden): services is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "services" in API group "" in the namespace "kube-system"
  Error from server (Forbidden): infrastructures.config.openshift.io "cluster" is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot get resource "infrastructures" in API group "config.openshift.io" at the cluster scope
  Error from server (Forbidden): nodes is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "nodes" in API group "" at the cluster scope
  Error from server (Forbidden): Failed to watch *v1.ClusterOperator: failed to list *v1.ClusterOperator: clusteroperators.config.openshift.io is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "clusteroperators" in API group "config.openshift.io" at the cluster scope
  Error from server (Forbidden): Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: clusterversions.config.openshift.io "version" is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "clusterversions" in API group "config.openshift.io" at the cluster scope
  Error from server (Forbidden): Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "nodes" in API group "" at the cluster scope
  Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot list resource "pods" in API group "" at the cluster scope
Oct  7 18:56:11.768: FAIL: users.user.openshift.io "e2e-test-rails-postgresql-repo-test-fnvct-user" is forbidden: User "system:serviceaccount:osde2e-rveds:default" cannot get resource "users" in API group "user.openshift.io" at the cluster scope
>>>

This is as far as I have got.

I would be happy if someone could look at this issue. I am attaching a full must-gather so you can also investigate.

Thank you :)

--- Additional comment from Tereza on 2021-10-08 14:56:20 UTC ---

Link to full must-gather of 'gap-upgrade' stg cluster https://drive.google.com/file/d/1J4EUyFSEPyKmYYOCIUTNCb-VAEaDvYC6/view?usp=sharing

--- Additional comment from Nikos Stamatelopoulos on 2021-12-07 14:08:32 UTC ---

Hello team,

I double-checked with the associate in the attached ticket that these FailedMount "not registered" errors are not affecting the normal operation of the container; despite their presence in the logs, the volumes are mounted to the pod.
His issue was incorrect syntax in the YAML of the cronjob object.

I'm not sure what this "not registered" message means, and it needs to be investigated; that's why I am reopening this bug.

--- Additional comment from Tom Sweeney on 2022-01-06 16:20:30 UTC ---

Not completed this sprint.

--- Additional comment from Alexander Greene on 2022-03-11 19:33:49 UTC ---



--- Additional comment from Elana Hashman on 2022-03-17 22:51:43 UTC ---

Corresponding code is https://github.com/openshift/kubernetes/blob/4d08582d1fa21e1f5887e73380001ac827371553/pkg/kubelet/util/manager/watch_based_manager.go#L292

I suspect given the timing (4.9+ releases) this is related to the Kubernetes pod lifecycle refactor that landed in Kubernetes 1.22.

One more patch landed a few hours ago to resolve some lingering issues, which we plan to backport to 4.9: https://github.com/openshift/kubernetes/pull/1199

Let's wait and see if this fixes the errors we've been seeing in CI.

--- Additional comment from Elana Hashman on 2022-03-17 23:09:46 UTC ---

Ah wait, this is https://github.com/kubernetes/kubernetes/pull/107821 and https://github.com/kubernetes/kubernetes/pull/107831

We have open backports upstream for these. I'll look into getting them merged.

--- Additional comment from Elana Hashman on 2022-03-21 16:42:44 UTC ---

Bug is fixed in 4.11 via the PRs linked above in https://bugzilla.redhat.com/show_bug.cgi?id=1999325#c22

Upstream backports have now merged for:

1.23: https://github.com/kubernetes/kubernetes/pull/108756
1.22: https://github.com/kubernetes/kubernetes/pull/108754

--- Additional comment from Elana Hashman on 2022-03-22 17:58:37 UTC ---

Based on the attached patches we believe this is just logspam due to a race condition and doesn't have any negative effects on running workloads, per https://bugzilla.redhat.com/show_bug.cgi?id=1999325#c12

We expect the patch to be picked up in OpenShift rebases that include Kubernetes versions at least v1.24.0-alpha.2, 1.23.6, and 1.22.9. The 1.23.x and 1.22.x patches are scheduled to be released in April.

I believe we have not yet begun the 1.24 rebase for 4.11, but I expect the bug will be fixed in 4.11 once that rebase lands, and in 4.10 and 4.9 once the April patch rebases land. At that time QE can confirm the fix.
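
For anyone verifying the backport later, the Kubernetes version bundled with the cluster (which determines whether the 107821/107831 fix is included) can be read directly; a minimal check:

  # Cluster and bundled Kubernetes versions (fix expected in >= 1.22.9, 1.23.6,
  # or 1.24.0-alpha.2, per the comment above):
  $ oc version
  # Kubelet version actually running on each node:
  $ oc get nodes -o wide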

Comment 1 Elana Hashman 2022-03-25 18:03:39 UTC
@kaushal (In reply to Kaushal from comment #0)
> I believe this has been fixed in 4.11 but given the nature of this (it
> blocks IBM for months), we would like to request that it be back-ported to
> 4.10

This is not yet fixed in 4.11 because the rebase hasn't happened yet.

We plan on backporting to all affected versions, both 4.10 and 4.9, when we've confirmed the fix.

Can you share a bit more information about the impact of this bug? My understanding from the parent BZ and the patches in Kubernetes is that this is just logspam: a warning is printed suggesting there is a mount issue, but the mounts are successful and their contents are available in pods, which is why I've marked the bug as severity medium.

Comment 3 Kaushal 2022-03-31 19:48:07 UTC
The IBM team did some further testing and has verified that this is only a logging error; the mounts are successful despite it. Should we keep this ticket open for the backport request, or close it and track with the original ticket?

CC @kangell 
CC @ehashman
CC @patmarti

Comment 4 Patrick Martin 2022-04-05 15:24:01 UTC
@kmudalia I would say it depends on whether this is already tracked elsewhere.

Comment 5 Elana Hashman 2022-04-12 21:01:17 UTC
I think at this time I will close this bug, and we'll have the bot open backport bugs automatically when we are ready to backport the patches. It will be a little bit cleaner/more centralized that way.

*** This bug has been marked as a duplicate of bug 1999325 ***

Comment 6 Elana Hashman 2022-04-14 22:17:34 UTC
Reopening for backport.

Comment 12 errata-xmlrpc 2022-05-23 13:25:10 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.15 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2258

Comment 14 Red Hat Bugzilla 2023-09-15 01:53:12 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

