Bug 2255030 - Mount command returning error as MDS is laggy for unauthorized client
Status: VERIFIED
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 7.1
Assignee: Neeraj Pratap Singh
QA Contact: Amarnath
Docs Contact: Akash Raj
Blocks: 2267614
 
Reported: 2023-12-18 16:34 UTC by Amarnath
Modified: 2024-03-20 09:18 UTC
CC: 9 users

Fixed In Version: ceph-18.2.1-20.el9cp
Doc Type: No Doc Update




Links: Red Hat Issue Tracker RHCEPH-8089 (last updated 2023-12-18 16:36:08 UTC)

Description Amarnath 2023-12-18 16:34:20 UTC
Description of problem:
Kernel mount command returns an "MDS is laggy" error for an unauthorized client.

Steps followed:
1. Created 2 filesystems (cephfs, cephfs1)
2. Authorized two clients, assigning each to a distinct filesystem (client1 for "cephfs" and client2 for "cephfs1").
3. Attempted to mount "cephfs1" using client1, resulting in the error message: "mount error: no MDS server is up or the cluster is laggy."
4. Conversely, when attempting the mount operation with client2 on "cephfs1," it succeeded without errors.

Observation and Question:

1. The encountered error suggests that there is no active MDS (Metadata Server) or a potential cluster lag when client1 attempts to mount "cephfs1." But the filesystem is up and running.
2. An expected behavior would be to receive an unauthorized error instead of the "No MDS server is UP" message.

Please find the commands executed:
[root@ceph-mirror-amk-pwsavd-node7 ~]# cat /etc//ceph/ceph.client.client1.keyring
[client.client1]
	key = AQCGcIBlLD/cNBAA4ESJx6hW82bDOu3thGqI5w==
	caps mds = "allow rw fsname=cephfs"
	caps mon = "allow r fsname=cephfs"
	caps osd = "allow rw tag cephfs data=cephfs"
[root@ceph-mirror-amk-pwsavd-node7 ~]# cat /etc//ceph/ceph.client.client2.keyring
[client.client2]
	key = AQCQcIBlb9wiFxAAv0KUVguMQKY4cSejYsAOuQ==
	caps mds = "allow rw fsname=cephfs1"
	caps mon = "allow r fsname=cephfs1"
	caps osd = "allow rw tag cephfs data=cephfs1"



[root@ceph-mirror-amk-pwsavd-node7 ~]# ceph fs status
cephfs - 8 clients
======
RANK  STATE                      MDS                         ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    active  cephfs.ceph-mirror-amk-pwsavd-node5.hjouso  Reqs:    0 /s  1448   1302     87     12   
 1    active  cephfs.ceph-mirror-amk-pwsavd-node4.xxssaw  Reqs:    0 /s   331    282     28     12   
       POOL           TYPE     USED  AVAIL  
cephfs.cephfs.meta  metadata  4060M  47.8G  
cephfs.cephfs.data    data       0   47.8G  
cephfs1 - 1 clients
=======
RANK  STATE                       MDS                         ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    active  cephfs1.ceph-mirror-amk-pwsavd-node2.dljcau  Reqs:    0 /s    10     13     12      1   
        POOL           TYPE     USED  AVAIL  
cephfs.cephfs1.meta  metadata  96.0k  47.8G  
cephfs.cephfs1.data    data       0   47.8G  
                STANDBY MDS                  
cephfs1.ceph-mirror-amk-pwsavd-node5.omovtd  
 cephfs.ceph-mirror-amk-pwsavd-node6.xlqsfz  
MDS version: ceph version 18.2.0-128.el9cp (d38df712b9120eae50f448fe0847719d3567c2d1) reef (stable)
[root@ceph-mirror-amk-pwsavd-node7 ~]# mount -t ceph 10.0.211.1,10.0.210.182,10.0.211.126:/ /mnt/test_client1 -o name=client1,secretfile=/etc/ceph/client1.secret,fs=cephfs1
mount error: no mds server is up or the cluster is laggy
[root@ceph-mirror-amk-pwsavd-node7 ~]# mkdir /mnt/test_client2
[root@ceph-mirror-amk-pwsavd-node7 ~]# mount -t ceph 10.0.211.1,10.0.210.182,10.0.211.126:/ /mnt/test_client2 -o name=client2,secretfile=/etc/ceph/client2.secret,fs=cephfs
mount error: no mds server is up or the cluster is laggy
[root@ceph-mirror-amk-pwsavd-node7 ~]# mount -t ceph 10.0.211.1,10.0.210.182,10.0.211.126:/ /mnt/test_client2 -o name=client2,secretfile=/etc/ceph/client2.secret,fs=cephfs1
[root@ceph-mirror-amk-pwsavd-node7 ~]# mount -t ceph 10.0.211.1,10.0.210.182,10.0.211.126:/ /mnt/test_client1 -o name=client1,secretfile=/etc/ceph/client1.secret,fs=cephfs
[root@ceph-mirror-amk-pwsavd-node7 ~]# 
 




Comment 1 Venky Shankar 2023-12-19 04:49:10 UTC
(In reply to Amarnath from comment #0)
> 1. The encountered error suggests that there is no active MDS (Metadata
> Server) or a potential cluster lag when client1 attempts to mount "cephfs1."
> But the filesystem is up and running.
> 2. An expected behavior would be to receive an unauthorized error instead of
> the "No MDS server is UP" message.

This error message is thrown by the mount helper when mount(2) returns -EHOSTUNREACH, and it is also logged by the kernel driver in the kernel ring buffer:

> [517841.491998] libceph: auth protocol 'cephx' msgr authentication failed: -13
> [517841.492242] ceph: No mds server is up or the cluster is laggy

So, it looks like a generic message is thrown. Since we do get errno -13 (-EACCES), I think a specific error message can be shown.
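
For illustration, here is a minimal userspace sketch (not the actual mount.ceph source; the function name and message strings are assumptions) of how a mount helper can map the errno from mount(2) to this generic message:

#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Sketch only: map the errno handed back by mount(2) to a message. */
static void print_mount_error(int err)
{
    switch (err) {
    case EHOSTUNREACH:
        /* Generic: covers no MDS, a laggy cluster, and (as in this
         * bug) an unauthorized client. */
        fprintf(stderr, "mount error: no mds server is up or the cluster is laggy\n");
        break;
    case EACCES: /* errno 13, the value in the libceph dmesg line above */
        fprintf(stderr, "mount error: permission denied\n");
        break;
    default:
        fprintf(stderr, "mount error %d = %s\n", err, strerror(err));
    }
}

int main(void)
{
    print_mount_error(EHOSTUNREACH); /* what the reporter saw */
    return 0;
}

A helper structured like this can only be as specific as the errno it receives, which is why the kernel-side errno choice discussed below matters.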

Comment 2 Xiubo Li 2023-12-19 06:10:07 UTC
(In reply to Venky Shankar from comment #1)
> This error message is thrown by the mount helper when mount(2) returns
> -EHOSTUNREACH, and it is also logged by the kernel driver in the kernel
> ring buffer:
> 
> > [517841.491998] libceph: auth protocol 'cephx' msgr authentication failed: -13
> > [517841.492242] ceph: No mds server is up or the cluster is laggy
> 
> So, it looks like a generic message is thrown. Since we do get errno -13
> (-EACCES), I think a specific error message can be shown.

Since you specified an 'fsname' parameter the client is not authorized for, the ceph mon returned only the list of allowed fsnames; the requested name failed to match, so the call returned -2 (-ENOENT) and bailed out, leaving the local mdsmap cache empty. Then, just before failing the mount, the upper layer checks the local mdsmap cache, finds no MDS up, and switches the errno to -113 (-EHOSTUNREACH).

The problem is that the same thing can also happen when cephx is disabled, so it's hard to distinguish which case it is.
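
For illustration, a minimal userspace simulation of that flow (all names here are hypothetical, not the actual fs/ceph kernel code):

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct mdsmap { int num_active_mds; };

/* Simulates asking the mon for the mdsmap: an fsname the client is not
 * authorized for fails the match with -ENOENT, leaving the cache empty. */
static int fetch_mdsmap(bool authorized, struct mdsmap *map)
{
    if (!authorized) {
        map->num_active_mds = 0;
        return -ENOENT;
    }
    map->num_active_mds = 1;
    return 0;
}

static int do_mount(bool authorized)
{
    struct mdsmap map;

    (void)fetch_mdsmap(authorized, &map);
    /* Just before failing the mount, only the local mdsmap cache is
     * consulted; an empty map looks identical whether the client is
     * unauthorized, cephx is disabled, or no MDS is really up. */
    if (map.num_active_mds == 0)
        return -EHOSTUNREACH;
    return 0;
}

int main(void)
{
    printf("unauthorized mount -> errno %d (EHOSTUNREACH is %d)\n",
           -do_mount(false), EHOSTUNREACH);
    return 0;
}

Because the unauthorized case and the cephx-disabled case both end at the same empty-map check, only the message text can be broadened, as proposed next.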

Venky,

Maybe we should just change the debug logs to:

"No mds server is up or the cluster is laggy" ---> "No mds server is up or the cluster is laggy or unauthorized" ?

Thanks
- Xiubo

Comment 3 Venky Shankar 2023-12-19 07:10:45 UTC
(In reply to Xiubo Li from comment #2)
> The problem is that the same thing can also happen when cephx is disabled,
> so it's hard to distinguish which case it is.
> 
> Maybe we should just change the debug logs to:
> 
> "No mds server is up or the cluster is laggy" ---> "No mds server is up or
> the cluster is laggy or unauthorized" ?

Absolutely. There's no need to complicate things by sending the mdsmap on a mismatch just for error-string correctness.

Neeraj, please create a tracker.

Comment 4 Greg Farnum 2023-12-20 06:33:27 UTC
How about "No mds server is available — it may be laggy or down, or you may not be authorized"

Comment 5 Xiubo Li 2023-12-21 00:23:19 UTC
(In reply to Greg Farnum from comment #4)
> How about "No mds server is available — it may be laggy or down, or you may
> not be authorized"

Yeah, much better. Thanks!

Comment 10 Amarnath 2024-02-19 08:22:42 UTC
Hi All,

We are seeing the updated error message concerning authorization:


[root@ceph-nfs-fail-pff5tt-node8 ~]# ceph auth get client.client1 -o /etc/ceph/ceph.client.client1.keyring
[root@ceph-nfs-fail-pff5tt-node8 ~]# mount -t ceph 10.0.211.170,10.0.209.108,10.0.211.130:/ /mnt/test_client1 -o name=client1,fs=cephfs
[root@ceph-nfs-fail-pff5tt-node8 ~]# ceph auth get client.client2 -o /etc/ceph/ceph.client.client2.keyring
[root@ceph-nfs-fail-pff5tt-node8 ~]# mount -t ceph 10.0.211.170,10.0.209.108,10.0.211.130:/ /mnt/test_client2 -o name=client2,fs=cephfs
mount error: no mds (Metadata Server) is up. The cluster might be laggy, or you may not be authorized
[root@ceph-nfs-fail-pff5tt-node8 ~]# mount -t ceph 10.0.211.170,10.0.209.108,10.0.211.130:/ /mnt/test_client2 -o name=client2,fs=cephfs1
[root@ceph-nfs-fail-pff5tt-node8 ~]# ceph versions
{
    "mon": {
        "ceph version 18.2.1-20.el9cp (171d20b9d47e6145ad666c10de8e45efe66b8f50) reef (stable)": 3
    },
    "mgr": {
        "ceph version 18.2.1-20.el9cp (171d20b9d47e6145ad666c10de8e45efe66b8f50) reef (stable)": 2
    },
    "osd": {
        "ceph version 18.2.1-20.el9cp (171d20b9d47e6145ad666c10de8e45efe66b8f50) reef (stable)": 12
    },
    "mds": {
        "ceph version 18.2.1-20.el9cp (171d20b9d47e6145ad666c10de8e45efe66b8f50) reef (stable)": 7
    },
    "overall": {
        "ceph version 18.2.1-20.el9cp (171d20b9d47e6145ad666c10de8e45efe66b8f50) reef (stable)": 24
    }
}
[root@ceph-nfs-fail-pff5tt-node8 ~]# 
[root@ceph-nfs-fail-pff5tt-node8 ~]# ceph fs status
cephfs - 1 clients
======
RANK  STATE                     MDS                        ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    active  cephfs.ceph-nfs-fail-pff5tt-node7.gwjydi  Reqs:    0 /s   177     72     71      1   
 1    active  cephfs.ceph-nfs-fail-pff5tt-node4.azrceg  Reqs:    0 /s   371     21     19      0   
       POOL           TYPE     USED  AVAIL  
cephfs.cephfs.meta  metadata  1401M  49.5G  
cephfs.cephfs.data    data       0   49.5G  
cephfs1 - 1 clients
=======
RANK  STATE                      MDS                        ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    active  cephfs1.ceph-nfs-fail-pff5tt-node5.xrkvwi  Reqs:    0 /s    10     13     12      1   
        POOL           TYPE     USED  AVAIL  
cephfs.cephfs1.meta  metadata  96.0k  49.5G  
cephfs.cephfs1.data    data       0   49.5G  
               STANDBY MDS                 
 cephfs.ceph-nfs-fail-pff5tt-node5.sumcxl  
 cephfs.ceph-nfs-fail-pff5tt-node3.ydhhkv  
 cephfs.ceph-nfs-fail-pff5tt-node6.ozpqjo  
cephfs1.ceph-nfs-fail-pff5tt-node2.jzzydo  
MDS version: ceph version 18.2.1-20.el9cp (171d20b9d47e6145ad666c10de8e45efe66b8f50) reef (stable)
[root@ceph-nfs-fail-pff5tt-node8 ~]# 


Regards,
Amarnath

