Bug 2150306

Summary: [Scale] subvolume getpath command failed with ESHUTDOWN: error in stat
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Amarnath <amk>
Component: CephFS
Assignee: Venky Shankar <vshankar>
Status: CLOSED ERRATA
QA Contact: Amarnath <amk>
Severity: low
Docs Contact: Akash Raj <akraj>
Priority: medium
Version: 5.3
CC: akraj, ceph-eng-bugs, cephqe-warriors, gfarnum, hyelloji, ngangadh, sostapov, tserlin, vshankar
Target Milestone: ---
Flags: gfarnum: needinfo-
Target Release: 6.1z1
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: ceph-17.2.6-91.el9cp
Doc Type: Bug Fix
Doc Text:
.Calls to Ceph Manager daemons and volumes no longer return `-ESHUTDOWN`
Previously, calls to Ceph Manager daemons and volumes would return `-ESHUTDOWN` when the `ceph-mgr` process was shutting down, which was not necessary. With this fix, the Ceph Manager plugin handles the shutdown without returning a special error code, and `-ESHUTDOWN` is never returned.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-08-03 16:45:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2221020

Description Amarnath 2022-12-02 13:25:10 UTC
Description of problem:
As part of scale testing, we followed the steps below:
1. Create a subvolume.
2. Get the path of the created subvolume.
3. Mount the subvolume using the above path.
4. Write 1 GB of data.

The above steps passed up to iteration 2491. In iteration 2492, getting the subvolume path failed with the error below:

2022-12-01 23:04:59,778 - INFO - cephci.ceph.ceph.py:1513 - Running command ceph fs subvolume getpath cephfs subvol_max_2942 on 10.1.38.141 timeout 600
2022-12-01 23:05:00,211 - ERROR - cephci.ceph.ceph.py:1548 - Error 108 during cmd, timeout 600
2022-12-01 23:05:00,212 - ERROR - cephci.ceph.ceph.py:1549 - Error ESHUTDOWN: error in stat: /volumes/_nogroup
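Error 108 in the log above is the Linux errno value for ESHUTDOWN, which can be confirmed from Python's standard errno module (the value is Linux-specific; other platforms use different numbers):

```python
import errno
import os

# On Linux, errno 108 is ESHUTDOWN ("Cannot send after transport
# endpoint shutdown"), matching "Error 108" in the test log above.
print(errno.ESHUTDOWN)                  # 108
print(errno.errorcode[108])             # ESHUTDOWN
print(os.strerror(errno.ESHUTDOWN))
```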

MDS Logs : 
http://magna002.ceph.redhat.com/ceph-qe-logs/amar/mds_log_scale/
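For reference, one iteration of the reproduction loop can be sketched as Ceph CLI command strings. The volume name `cephfs` and the `subvol_max_<n>` naming follow the test log above; the mount point and dd invocation are assumptions for illustration:

```python
# Sketch of one scale-test iteration as CLI command strings.
# "cephfs" and "subvol_max_<n>" match the test log; the mount path
# and the dd command are hypothetical illustrations of steps 3-4.
def iteration_cmds(n: int, volume: str = "cephfs") -> list[str]:
    subvol = f"subvol_max_{n}"
    return [
        f"ceph fs subvolume create {volume} {subvol}",
        f"ceph fs subvolume getpath {volume} {subvol}",  # step that failed
        f"mount -t ceph :<getpath-output> /mnt/{subvol}",
        f"dd if=/dev/zero of=/mnt/{subvol}/file bs=1M count=1024",  # 1 GB
    ]

print(iteration_cmds(2492)[1])  # ceph fs subvolume getpath cephfs subvol_max_2492
```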



Comment 1 RHEL Program Management 2022-12-02 13:25:22 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 2 Amarnath 2022-12-02 13:27:41 UTC
[root@f12-h09-000-1029u ~]# ceph versions
{
    "mon": {
        "ceph version 16.2.10-79.el8cp (04a651bbcd8d087dd0fcc0bc71a5871e77732529) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10-79.el8cp (04a651bbcd8d087dd0fcc0bc71a5871e77732529) pacific (stable)": 2
    },
    "osd": {
        "ceph version 16.2.10-79.el8cp (04a651bbcd8d087dd0fcc0bc71a5871e77732529) pacific (stable)": 96
    },
    "mds": {
        "ceph version 16.2.10-79.el8cp (04a651bbcd8d087dd0fcc0bc71a5871e77732529) pacific (stable)": 3
    },
    "rbd-mirror": {
        "ceph version 16.2.10-79.el8cp (04a651bbcd8d087dd0fcc0bc71a5871e77732529) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.10-79.el8cp (04a651bbcd8d087dd0fcc0bc71a5871e77732529) pacific (stable)": 105
    }
}
[root@f12-h09-000-1029u ~]#

Comment 9 Venky Shankar 2023-02-06 09:24:18 UTC
Also share the ceph-mgr logs please.

Comment 12 Scott Ostapovicz 2023-02-07 14:23:19 UTC
Given the low priority of this issue and the fact that we are fixing blockers only in 5.3 z1, I am moving this to 6.1.

Comment 13 Amarnath 2023-02-08 05:41:10 UTC
Hi Venky,

This was done on bare metal servers, and we have reimaged them.
I have to recreate the setup and try this.
Please expect some delay in collecting the logs, as we have limited server resources.

Regards,
Amarnath

Comment 23 Venky Shankar 2023-07-18 04:07:53 UTC
(In reply to Amarnath from comment #22)
> Hi Venky,
> 
> We are not observing the error code in the output, even after filling the
> cluster completely.
> [root@ceph-amk-61-test-xtknkr-node7 subvol_1]# ceph -s
>   cluster:
>     id:     417b4eba-247b-11ee-bf71-fa163e45e70b
>     health: HEALTH_ERR
>             1 MDSs report slow metadata IOs
>             1 MDSs report slow requests
>             1 full osd(s)
>             Degraded data redundancy: 1983/122460 objects degraded (1.619%),
> 6 pgs degraded, 6 pgs undersized
>             Full OSDs blocking recovery: 6 pgs recovery_toofull
>             5 pool(s) full
>  
>   services:
>     mon: 3 daemons, quorum
> ceph-amk-61-test-xtknkr-node1-installer,ceph-amk-61-test-xtknkr-node3,ceph-
> amk-61-test-xtknkr-node2 (age 3h)
>     mgr: ceph-amk-61-test-xtknkr-node1-installer.cmqizy(active, since 11m)
>     mds: 2/2 daemons up, 3 standby
>     osd: 12 osds: 12 up (since 2h), 12 in (since 3h); 6 remapped pgs
>  
>   data:
>     volumes: 2/2 healthy
>     pools:   5 pools, 193 pgs
>     objects: 40.82k objects, 40 GiB
>     usage:   121 GiB used, 59 GiB / 180 GiB avail
>     pgs:     1983/122460 objects degraded (1.619%)
>              1340/122460 objects misplaced (1.094%)
>              187 active+clean
>              6   active+recovery_toofull+undersized+degraded+remapped
>  
> [root@ceph-amk-61-test-xtknkr-node7 subvol_1]# 
> [root@ceph-amk-61-test-xtknkr-node7 subvol_1]# wget -O linux.tar.gz
> http://download.ceph.com/qa/linux-5.4.tar.gz
> --2023-07-17 08:26:15--  http://download.ceph.com/qa/linux-5.4.tar.gz
> Resolving download.ceph.com (download.ceph.com)...
> 2607:5300:201:2000::3:58a1, 158.69.68.124
> Connecting to download.ceph.com
> (download.ceph.com)|2607:5300:201:2000::3:58a1|:80... failed: No route to
> host.
> Connecting to download.ceph.com (download.ceph.com)|158.69.68.124|:80...
> connected.
> HTTP request sent, awaiting response... 200 OK
> Length: 172616875 (165M) [application/octet-stream]
> Saving to: ‘linux.tar.gz’
> 
> linux.tar.gz                                         0%[                    
> ]       0  --.-KB/s    in 0s      
> 
> 
> Cannot write to ‘linux.tar.gz’ (No space left on device).
> 
> 
> We even tried stopping the mgr service using systemctl. When the service is
> stopped, the command hangs, and once the service is started again it returns
> the path of the subvolume correctly.
> 
> @needinfo: does anything more need to be tested on this, Venky?

This should be sufficient.
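The behavior observed above (the command hangs while the mgr is stopped and completes after restart) matches the fix described in the Doc Text. Purely as a hypothetical sketch of that pattern, and not the actual ceph-mgr code: instead of failing in-flight calls with `-ESHUTDOWN` during shutdown, the service holds them until it is available again.

```python
# Hypothetical sketch (NOT actual ceph-mgr code) of the fix pattern:
# during shutdown, hold incoming calls instead of failing them with
# -ESHUTDOWN, so clients block and complete once the service returns.
import threading


class VolumeService:
    def __init__(self) -> None:
        self._up = threading.Event()
        self._up.set()

    def shutdown(self) -> None:
        # Old behavior would have made subsequent calls return -ESHUTDOWN;
        # here we simply mark the service as down.
        self._up.clear()

    def restart(self) -> None:
        self._up.set()

    def getpath(self, subvol: str) -> str:
        # New behavior: block until the service is up, then answer.
        self._up.wait()
        return f"/volumes/_nogroup/{subvol}"


svc = VolumeService()
print(svc.getpath("subvol_1"))  # /volumes/_nogroup/subvol_1
```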

Comment 27 errata-xmlrpc 2023-08-03 16:45:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.1 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:4473