Bug 2150306 - [Scale]subvolume get path command failed with ESHUTDOWN : error in stat:
Summary: [Scale]subvolume get path command failed with ESHUTDOWN : error in stat:
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 5.3
Hardware: x86_64
OS: Linux
medium
low
Target Milestone: ---
Target Release: 6.1z1
Assignee: Venky Shankar
QA Contact: Amarnath
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On:
Blocks: 2221020
Reported: 2022-12-02 13:25 UTC by Amarnath
Modified: 2023-08-03 16:46 UTC
CC List: 9 users

Fixed In Version: ceph-17.2.6-91.el9cp
Doc Type: Bug Fix
Doc Text:
.Calls to Ceph Manager daemons and volumes no longer return `-ESHUTDOWN`
Previously, calls to Ceph Manager daemons and volumes would return `-ESHUTDOWN` when the `ceph-mgr` process was shutting down, which was not necessary. With this fix, the Ceph Manager plugin handles the shutdown without returning a special error code, and `-ESHUTDOWN` is never returned.
Clone Of:
Environment:
Last Closed: 2023-08-03 16:45:09 UTC
Embargoed:
gfarnum: needinfo-


Attachments: none


Links
System                     ID              Last Updated
Ceph Project Bug Tracker   58651           2023-02-07 04:50:43 UTC
Red Hat Issue Tracker      RHCEPH-5727     2022-12-02 13:31:20 UTC
Red Hat Product Errata     RHBA-2023:4473  2023-08-03 16:46:06 UTC
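
The Doc Text field above summarizes the fix. Below is a minimal, hypothetical Python sketch of the described behaviour change; it is not the actual ceph-mgr volumes module code, and the class and method names are illustrative only.

import errno

class SubvolumeCallsSketch:
    """Illustrative only: models the behaviour described in the Doc Text."""

    def __init__(self):
        self.stopping = False  # set when the ceph-mgr process begins shutting down

    def getpath_before_fix(self, volname, subvolname):
        # Before the fix: a call racing with mgr shutdown bailed out early and
        # surfaced -ESHUTDOWN to the CLI, e.g.
        # "Error ESHUTDOWN: error in stat: /volumes/_nogroup".
        if self.stopping:
            return -errno.ESHUTDOWN, "", "error in stat: /volumes/_nogroup"
        return 0, f"/volumes/_nogroup/{subvolname}", ""

    def getpath_after_fix(self, volname, subvolname):
        # After the fix: shutdown is handled inside the plugin without a
        # special error code, so -ESHUTDOWN is never returned to callers.
        return 0, f"/volumes/_nogroup/{subvolname}", ""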

Description Amarnath 2022-12-02 13:25:10 UTC
Description of problem:
As part of scale testing, we followed the steps below:
1. Create a subvolume.
2. Get the path of the created subvolume.
3. Mount the subvolume using the path from step 2.
4. Write 1 GB of data.

The above steps passed for 2491 iterations. In iteration 2492, the subvolume getpath command failed with the error below (a sketch of the test loop follows the log excerpt):

2022-12-01 23:04:59,778 - INFO - cephci.ceph.ceph.py:1513 - Running command ceph fs subvolume getpath cephfs subvol_max_2942 on 10.1.38.141 timeout 600
2022-12-01 23:05:00,211 - ERROR - cephci.ceph.ceph.py:1548 - Error 108 during cmd, timeout 600
2022-12-01 23:05:00,212 - ERROR - cephci.ceph.ceph.py:1549 - Error ESHUTDOWN: error in stat: /volumes/_nogroup
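
For reference, a minimal Python sketch of the loop described above; the actual test lives in the cephci framework, and the mount point, mount method, and iteration count here are assumptions, not the real CI code:

import subprocess

def sh(cmd: str) -> str:
    """Run a shell command, return stripped stdout, raise on a non-zero exit."""
    return subprocess.run(cmd, shell=True, check=True,
                          capture_output=True, text=True).stdout.strip()

for i in range(1, 3001):                                      # iteration count is illustrative
    name = f"subvol_max_{i}"
    sh(f"ceph fs subvolume create cephfs {name}")             # step 1: create subvolume
    path = sh(f"ceph fs subvolume getpath cephfs {name}")     # step 2: the call that hit ESHUTDOWN
    mnt = f"/mnt/{name}"
    sh(f"mkdir -p {mnt}")
    sh(f"ceph-fuse -r {path} {mnt}")                          # step 3: mount the subvolume at its path
    sh(f"dd if=/dev/zero of={mnt}/file_1g bs=1M count=1024")  # step 4: write 1 GB of data
    sh(f"umount {mnt}")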

MDS Logs : 
http://magna002.ceph.redhat.com/ceph-qe-logs/amar/mds_log_scale/



Comment 1 RHEL Program Management 2022-12-02 13:25:22 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 2 Amarnath 2022-12-02 13:27:41 UTC
[root@f12-h09-000-1029u ~]# ceph versions
{
    "mon": {
        "ceph version 16.2.10-79.el8cp (04a651bbcd8d087dd0fcc0bc71a5871e77732529) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10-79.el8cp (04a651bbcd8d087dd0fcc0bc71a5871e77732529) pacific (stable)": 2
    },
    "osd": {
        "ceph version 16.2.10-79.el8cp (04a651bbcd8d087dd0fcc0bc71a5871e77732529) pacific (stable)": 96
    },
    "mds": {
        "ceph version 16.2.10-79.el8cp (04a651bbcd8d087dd0fcc0bc71a5871e77732529) pacific (stable)": 3
    },
    "rbd-mirror": {
        "ceph version 16.2.10-79.el8cp (04a651bbcd8d087dd0fcc0bc71a5871e77732529) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.10-79.el8cp (04a651bbcd8d087dd0fcc0bc71a5871e77732529) pacific (stable)": 105
    }
}
[root@f12-h09-000-1029u ~]#

Comment 9 Venky Shankar 2023-02-06 09:24:18 UTC
Also share the ceph-mgr logs please.

Comment 12 Scott Ostapovicz 2023-02-07 14:23:19 UTC
Given the low priority of this issue and the fact that we are fixing blockers only in 5.3 z1, I am moving this to 6.1.

Comment 13 Amarnath 2023-02-08 05:41:10 UTC
Hi Venky,

This was done on bare metal servers and we have since reimaged them.
I have to recreate the setup and try this again.
Please expect some delay in collecting the logs, as we have limited server resources.

Regards,
Amarnath

Comment 23 Venky Shankar 2023-07-18 04:07:53 UTC
(In reply to Amarnath from comment #22)
> Hi Venky,
> 
> We are not observing the error code in the output even after filling the
> cluster to full.
> [root@ceph-amk-61-test-xtknkr-node7 subvol_1]# ceph -s
>   cluster:
>     id:     417b4eba-247b-11ee-bf71-fa163e45e70b
>     health: HEALTH_ERR
>             1 MDSs report slow metadata IOs
>             1 MDSs report slow requests
>             1 full osd(s)
>             Degraded data redundancy: 1983/122460 objects degraded (1.619%),
> 6 pgs degraded, 6 pgs undersized
>             Full OSDs blocking recovery: 6 pgs recovery_toofull
>             5 pool(s) full
>  
>   services:
>     mon: 3 daemons, quorum
> ceph-amk-61-test-xtknkr-node1-installer,ceph-amk-61-test-xtknkr-node3,ceph-
> amk-61-test-xtknkr-node2 (age 3h)
>     mgr: ceph-amk-61-test-xtknkr-node1-installer.cmqizy(active, since 11m)
>     mds: 2/2 daemons up, 3 standby
>     osd: 12 osds: 12 up (since 2h), 12 in (since 3h); 6 remapped pgs
>  
>   data:
>     volumes: 2/2 healthy
>     pools:   5 pools, 193 pgs
>     objects: 40.82k objects, 40 GiB
>     usage:   121 GiB used, 59 GiB / 180 GiB avail
>     pgs:     1983/122460 objects degraded (1.619%)
>              1340/122460 objects misplaced (1.094%)
>              187 active+clean
>              6   active+recovery_toofull+undersized+degraded+remapped
>  
> [root@ceph-amk-61-test-xtknkr-node7 subvol_1]# 
> [root@ceph-amk-61-test-xtknkr-node7 subvol_1]# wget -O linux.tar.gz
> http://download.ceph.com/qa/linux-5.4.tar.gz
> --2023-07-17 08:26:15--  http://download.ceph.com/qa/linux-5.4.tar.gz
> Resolving download.ceph.com (download.ceph.com)...
> 2607:5300:201:2000::3:58a1, 158.69.68.124
> Connecting to download.ceph.com
> (download.ceph.com)|2607:5300:201:2000::3:58a1|:80... failed: No route to
> host.
> Connecting to download.ceph.com (download.ceph.com)|158.69.68.124|:80...
> connected.
> HTTP request sent, awaiting response... 200 OK
> Length: 172616875 (165M) [application/octet-stream]
> Saving to: ‘linux.tar.gz’
> 
> linux.tar.gz                                         0%[                    
> ]       0  --.-KB/s    in 0s      
> 
> 
> Cannot write to ‘linux.tar.gz’ (No space left on device).
> 
> 
> We even tried stopping the mgr service using systemctl. When the service is
> stopped, the command gets stuck, and when the service is started back up, it
> returns the path of the subvolume correctly.
> 
> @need info: Is there anything more that needs to be tested on this, Venky?

This should be sufficient.
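
A rough Python sketch of the manual check quoted above (stop the active mgr, retry getpath, then start it again); the systemd unit name and subvolume name are placeholders, not values taken from this cluster:

import subprocess

# Placeholders: fill in the cephadm systemd unit for the active mgr daemon and
# the subvolume under test. These are assumptions, not values from the bug.
mgr_unit = "ceph-<fsid>@mgr.<host>.<id>.service"
subvol = "subvol_1"

def sh(cmd, timeout=None):
    """Run a shell command, return stripped stdout, raise on a non-zero exit."""
    return subprocess.run(cmd, shell=True, check=True, text=True,
                          capture_output=True, timeout=timeout).stdout.strip()

sh(f"systemctl stop {mgr_unit}")
try:
    # While the mgr is down the command blocks instead of returning -ESHUTDOWN,
    # so bound it with a timeout.
    print(sh(f"ceph fs subvolume getpath cephfs {subvol}", timeout=60))
except subprocess.TimeoutExpired:
    print("getpath blocked while the mgr was stopped (no ESHUTDOWN returned)")
sh(f"systemctl start {mgr_unit}")
print(sh(f"ceph fs subvolume getpath cephfs {subvol}"))  # succeeds once the mgr is back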

Comment 27 errata-xmlrpc 2023-08-03 16:45:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.1 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:4473

