Bug 2190080 - [GSS][Ceph] cephfs clone are stuck in Pending state
Summary: [GSS][Ceph] cephfs clone are stuck in Pending state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 5.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 5.3z3
Assignee: Xiubo Li
QA Contact: Hemanth Kumar
lysanche
URL:
Whiteboard:
Duplicates: 2219761
Depends On: 2179083
Blocks: 2203283
 
Reported: 2023-04-27 05:33 UTC by Xiubo Li
Modified: 2024-08-28 10:42 UTC
CC List: 16 users

Fixed In Version: ceph-16.2.10-166.el8cp
Doc Type: Bug Fix
Doc Text:
.Users can find the inodes in the metadata pool
Previously, when the MDS crashed and the `openfiletable` could not be flushed, the replacement MDS would not load some already opened `CInodes` into the MDCache. If clients resent requests after reconnection, the MDS would return `-ESTALE` after failing to find the inode in all active peers. With this fix, the MDS loads and opens the inode from the metadata pool when it is not found in the MDS cache, and returns `-ESTALE` only if the inode is also missing from the metadata pool, instead of looping indefinitely. In most cases the inode is found successfully.
Clone Of: 2179083
Environment:
Last Closed: 2023-05-23 00:19:11 UTC
Embargoed:
hyelloji: needinfo+
hyelloji: needinfo-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-6575 0 None None None 2023-04-27 05:34:59 UTC
Red Hat Product Errata RHBA-2023:3259 0 None None None 2023-05-23 00:19:41 UTC

Description Xiubo Li 2023-04-27 05:33:44 UTC
+++ This bug was initially created as a clone of Bug #2179083 +++

Description of problem (please be as detailed as possible and provide log snippets):
Cephfs clones are stuck in pending state.

Version of all relevant components (if applicable):
ceph version 16.2.7-126.el8cp

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, can't restore snapshot backup

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
-

Steps to Reproduce:
1. Create cephfs volume or a cephfs PVC
2. Put some data in it
3. Create snapshot from the volume
4. Create cephfs clone (using ceph command)
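
For reference, the equivalent ceph CLI sequence looks roughly like this (a sketch; the volume, subvolume, and snapshot names are placeholders, not the customer's identifiers):

```
# 1. Create a CephFS subvolume (what a CephFS PVC maps to)
ceph fs subvolume create myfs subvol_1 --group_name csi

# 2. Write some data into the subvolume path returned by getpath
ceph fs subvolume getpath myfs subvol_1 --group_name csi

# 3. Snapshot the subvolume
ceph fs subvolume snapshot create myfs subvol_1 snap_1 --group_name csi

# 4. Clone the snapshot and check progress
ceph fs subvolume snapshot clone myfs subvol_1 snap_1 clone_1 --group_name csi
ceph fs clone status myfs clone_1
```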


Actual results:
The volume clone is stuck for an indefinite period of time.

Expected results:
Cloned volume should provision

Additional info:
In next private comment.

--- Additional comment from Sonal on 2023-03-16 15:43:09 UTC ---

Created clone using below cephfs command :


$ ceph fs subvolume snapshot clone ocs-storagecluster-cephfilesystem csi-vol-466b00be-bf73-11ed-b0ae-0a580afe2ac7 csi-snap-1792d65a-c197-11ed-9576-0a580afe2c7d  csi-target-01 --group_name csi

Here is the status (it is in pending state for more than 20 minutes):

$ ceph fs clone status ocs-storagecluster-cephfilesystem csi-target-01
{
  "status": {
    "state": "pending",
    "source": {
      "volume": "ocs-storagecluster-cephfilesystem",
      "subvolume": "csi-vol-466b00be-bf73-11ed-b0ae-0a580afe2ac7",
      "snapshot": "csi-snap-1792d65a-c197-11ed-9576-0a580afe2c7d",
      "group": "csi"
    }
  }
}


It stays in pending state.

Observed same behavior when creating PVC clone.

Captured debug mgr logs while creating a cephfs clone named csi-target-03 from the same snapshot and subvolume as above:

Data is in supportshell: cd /cases/03456510

Debug mgr logs: 0010-mg-odf.tgz
ODF must-gather: 0010-mg-odf.tgz

Thanks

Regards,
Sonal Arora

--- Additional comment from Sonal on 2023-03-16 15:47:00 UTC ---

Typo in above comment. ceph-mgr log is at below location in supportshell:

0110-ceph-mgr.a.log.6

--- Additional comment from Venky Shankar on 2023-03-17 04:55:33 UTC ---

I think the clone is not progressing at all. There are lots of "Stale file handle" replies that the cephfs client is seeing again and again.

Going through mds logs to see if I can spot anything.

--- Additional comment from Greg Farnum on 2023-03-18 01:11:40 UTC ---



--- Additional comment from Greg Farnum on 2023-03-18 01:12:23 UTC ---



--- Additional comment from Greg Farnum on 2023-03-18 01:12:34 UTC ---



--- Additional comment from Venky Shankar on 2023-03-20 05:00:50 UTC ---

Hi Sonal,

The MDS logs are pretty limited. Do we know why the relevant logs weren't captured?

I assume the issue is still ongoing with the customer. My best guess at the moment is that the MDS restarted (or crashed!) and the failover MDS was unable to load some inodes into its cache. The client, however, retried its requests after reconnection, and the MDS returned -ESTALE after failing to find the inode in its cache and among its active peers. What can be done at this point is to mount the file system and run `find` or `ls -R` to force-load the inodes into the MDS cache. This should let the client doing the clone find the relevant inode and make progress.
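
For example, something along these lines from any client node (a sketch; the monitor address and mount point are placeholders):

```
# Mount the file system (kernel client, admin credentials assumed)
mount -t ceph <mon-addr>:/ /mnt/cephfs -o name=admin

# Walk the tree so the MDS has to load the inodes into its cache
ls -lR /mnt/cephfs > /dev/null
# or
find /mnt/cephfs > /dev/null
```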

Cheers,
Venky

--- Additional comment from Sonal on 2023-03-22 10:24:39 UTC ---

Hi Venky,

ls -lR on mount point didn't help. The clone is still in pending status:

$ ceph fs clone status ocs-storagecluster-cephfilesystem csi-target-03
{
  "status": {
    "state": "pending",
    "source": {
      "volume": "ocs-storagecluster-cephfilesystem",
      "subvolume": "csi-vol-466b00be-bf73-11ed-b0ae-0a580afe2ac7",
      "snapshot": "csi-snap-1792d65a-c197-11ed-9576-0a580afe2c7d",
      "group": "csi"
    }
  }
}

--- Additional comment from Venky Shankar on 2023-03-23 02:27:42 UTC ---

Hi Sonal,

(In reply to Sonal from comment #8)
> Hi Venky,
> 
> ls -lR on mount point didn't help. The clone is still in pending status:
> 
> $ ceph fs clone status ocs-storagecluster-cephfilesystem csi-target-03
> {
>   "status": {
>     "state": "pending",
>     "source": {
>       "volume": "ocs-storagecluster-cephfilesystem",
>       "subvolume": "csi-vol-466b00be-bf73-11ed-b0ae-0a580afe2ac7",
>       "snapshot": "csi-snap-1792d65a-c197-11ed-9576-0a580afe2c7d",
>       "group": "csi"
>     }
>   }
> }

Please capture the latest MG logs and/or sosreports (post running ls -lR/find).

--- Additional comment from Venky Shankar on 2023-03-24 13:12:29 UTC ---

(In reply to Venky Shankar from comment #9)
> Hi Sonal,
> 
> (In reply to Sonal from comment #8)
> > Hi Venky,
> > 
> > ls -lR on mount point didn't help. The clone is still in pending status:
> > 
> > $ ceph fs clone status ocs-storagecluster-cephfilesystem csi-target-03
> > {
> >   "status": {
> >     "state": "pending",
> >     "source": {
> >       "volume": "ocs-storagecluster-cephfilesystem",
> >       "subvolume": "csi-vol-466b00be-bf73-11ed-b0ae-0a580afe2ac7",
> >       "snapshot": "csi-snap-1792d65a-c197-11ed-9576-0a580afe2c7d",
> >       "group": "csi"
> >     }
> >   }
> > }
> 
> Please capture the latest MG logs and/or sosreports (post running ls
> -lR/find).

with debug_mds = 20

--- Additional comment from RHEL Program Management on 2023-03-24 13:12:39 UTC ---

This bug having no release flag set previously, is now set with release flag 'odf‑4.13.0' to '?', and so is being proposed to be fixed at the ODF 4.13.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any previously set while release flag was missing, have now been reset since the Acks are to be set against a release flag.

--- Additional comment from Venky Shankar on 2023-03-24 13:13:23 UTC ---

and debug_mgr = 20 would help too.

--- Additional comment from Sonal on 2023-03-27 09:17:37 UTC ---

Hi Venky,

Here is the ODF must-gather captured after running 'ls -lR': 0130-mustgather-regitry-redhat-io-odf4.tar.gz

Regards,
Sonal Arora

--- Additional comment from Sonal on 2023-03-27 12:55:15 UTC ---

Sharing mgr logs from the node in case it helps.


-rw-rw-rw-. 1 yank yank   80577345 Mar 27 10:03 0140-ceph-mgr.a.log.6.gz
-rw-rw-rw-. 1 yank yank   95995118 Mar 27 10:03 0150-ceph-mgr.a.log.4.gz
-rw-rw-rw-. 1 yank yank   96371198 Mar 27 10:03 0160-ceph-mgr.a.log.5.gz
-rw-rw-rw-. 1 yank yank   97358152 Mar 27 10:03 0170-ceph-mgr.a.log.2.gz
-rw-rw-rw-. 1 yank yank   98779837 Mar 27 10:03 0180-ceph-mgr.a.log.3.gz
-rw-rw-rw-. 1 yank yank  102110975 Mar 27 10:03 0190-ceph-mgr.a.log.7.gz
-rw-rw-rw-. 1 yank yank   97821834 Mar 27 10:04 0200-ceph-mgr.a.log.1.gz

--- Additional comment from Sonal on 2023-03-27 16:42:59 UTC ---

Hi Venky,

Did you get a chance to check the logs shared?

Regards,
Sonal Arora

--- Additional comment from Venky Shankar on 2023-03-28 02:32:48 UTC ---

(In reply to Sonal from comment #15)
> Hi Venky,
> 
> Did you get a chance to check the logs shared?

Will have a look today.

--- Additional comment from Venky Shankar on 2023-03-28 09:10:41 UTC ---

Hi Sonal,

(In reply to Sonal from comment #13)
> Hi Venky,
> 
> Here is the ODF must-gather captured after running 'ls -lR':
> 0130-mustgather-regitry-redhat-io-odf4.tar.gz

I checked the MDS logs in this must-gather and none of them are debug logs. The only reference to "stale file handle" is in the mgr log (which is a debug log):

> $ pwd
> /home/remote/vshankar/03456510/0130-mustgather-regitry-redhat-io-odf4.tar.gz

> $ grep -ir 'file handle' . | awk -F':' '{ print $1 }' | sort -nu
> ./registry-redhat-io-odf4-ocs-must-gather-rhel8-sha256-2d7a886ff2a0a2613758515181c04d696451b160f4552af773c2d954c69c0d21/namespaces/openshift-storage/pods/rook-ceph-mgr-a-cb66d857f-lrxhm/mgr/mgr/logs/current.log

Was `debug_mds = 20` set before capturing MG logs? Could you revert back with the MG logs after setting:

> debug_client = 20
> debug_mgr = 20
> debug_mds = 20

(or use `ceph config set` interface).
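
For example, with the centralized config (a sketch):

```
ceph config set client debug_client 20
ceph config set mgr    debug_mgr    20
ceph config set mds    debug_mds    20

# and reset once the logs are captured:
ceph config rm client debug_client
ceph config rm mgr    debug_mgr
ceph config rm mds    debug_mds
```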

Also, please ensure the customer has run `find .` or `ls -lR` from a cephfs mount point, and capture the relevant debug logs while running this. The MDS logs are essential to be able to provide a resolution for the issue.

Cheers,
Venky

--- Additional comment from Sonal on 2023-03-29 13:49:53 UTC ---

Hi Venky,

Here are the debug logs of mgr and mds in latest must-gather: 0210-must-gather.tgz

less namespaces/openshift-storage/pods/rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-765d545fl4z8x/mds/mds/logs/current.log
~~~
2023-03-29T12:33:08.798046474Z 2023-03-29T12:33:07.178+0000 7fd6c4315700  7 mds.0.server reply_client_request -116 ((116) Stale file handle) client_request(client.39931722:23624629 getattr Fa #0x10000cfd1cd 2023-03-29T12:22:46.825328+0000 RETRY=94 caller_uid=0, caller_gid=0{}) v5
2023-03-29T12:33:08.798046474Z debug 2023-03-29T12:33:07.178+0000 7fd6c4315700 10 mds.0.server apply_allocated_inos 0x0 / [] / 0x0
2023-03-29T12:33:08.798046474Z debug 2023-03-29T12:33:07.178+0000 7fd6c4315700 20 mds.0.server lat 0.000102
2023-03-29T12:33:08.798046474Z debug 2023-03-29T12:33:07.178+0000 7fd6c4315700 10 mds.0.904 send_message_client client.39931722 10.254.28.83:0/2993335853 client_reply(???:23624629 = -116 (116) Stale file handle) v1
2023-03-29T12:33:08.798046474Z debug 2023-03-29T12:33:07.178+0000 7fd6c4315700  7 mds.0.cache request_finish request(client.39931722:23624629 nref=4 cr=0x5607e35beb00)
2023-03-29T12:33:08.798046474Z debug 2023-03-29T12:33:07.178+0000 7fd6c4315700 15 mds.0.cache request_cleanup request(client.39931722:23624629 nref=4 cr=0x5607e35beb00)
2023-03-29T12:33:08.798046474Z debug 2023-03-29T12:33:07.178+0000 7fd6c4315700  4 mds.0.server handle_client_request client_request(client.39931701:23624630 getattr Fa #0x10000cfd1cd 2023-03-29T12:22:43.438788+0000 RETRY=170 caller_uid=0, caller_gid=0{}) v5
2023-03-29T12:33:08.798046474Z debug 2023-03-29T12:33:07.178+0000 7fd6c4315700 20 mds.0.904 get_session have 0x5606c2b86500 client.39931701 10.254.28.83:0/1609393384 state open
2023-03-29T12:33:08.798046474Z debug 2023-03-29T12:33:07.178+0000 7fd6c4315700 15 mds.0.server  oldest_client_tid=23624627
2023-03-29T12:33:08.798046474Z debug 2023-03-29T12:33:07.178+0000 7fd6c4315700  7 mds.0.cache request_start request(client.39931701:23624630 nref=2 cr=0x56077aa29b80)
2023-03-29T12:33:08.798046474Z debug 2023-03-29T12:33:07.178+0000 7fd6c4315700  7 mds.0.server dispatch_client_request client_request(client.39931701:23624630 getattr Fa #0x10000cfd1cd 2023-03-29T12:22:43.438788+0000 RETRY=170 caller_uid=0, caller_gid=0{}) v5
2023-03-29T12:33:08.798046474Z debug 2023-03-29T12:33:07.178+0000 7fd6c4315700  7 mds.0.cache traverse: opening base ino 0x10000cfd1cd snap head
2023-03-29T12:33:08.798046474Z debug 2023-03-29T12:33:07.178+0000 7fd6c4315700 10 mds.0.server rdlock_path_pin_ref request(client.39931701:23624630 nref=2 cr=0x56077aa29b80) #0x10000cfd1cd
2023-03-29T12:33:08.798046474Z debug 2023-03-29T12:33:07.178+0000 7fd6c4315700  7 mds.0.cache traverse: opening base ino 0x10000cfd1cd snap head
2023-03-29T12:33:08.798046474Z debug 2023-03-29T12:33:07.178+0000 7fd6c4315700 10 mds.0.server FAIL on CEPHFS_ESTALE but attempting recovery
2023-03-29T12:33:08.798046474Z debug 2023-03-29T12:33:07.178+0000 7fd6c4315700  5 mds.0.cache find_ino_peers 0x10000cfd1cd hint -1
2023-03-29T12:33:08.798046474Z debug 2023-03-29T12:33:07.178+0000 7fd6c4315700 10 mds.0.cache _do_find_ino_peer 14063217417 0x10000cfd1cd active 0 all 0 checked
2023-03-29T12:33:08.798046474Z debug 2023-03-29T12:33:07.178+0000 7fd6c4315700 10 mds.0.cache _do_find_ino_peer failed on 0x10000cfd1cd
2023-03-29T12:33:08.798046474Z debug 2023-03-29T12:33:07.178+0000 7fd6c4315700 10 MDSContext::complete: 18C_MDS_TryFindInode
2023-03-29T12:33:08.798046474Z debug 2023-03-29T12:33:07.178+0000 7fd6c4315700  7 mds.0.server reply_client_request -116 ((116) Stale file handle) client_request(client.39931701:23624630 getattr Fa #0x10000cfd1cd 2023-03-29T12:22:43.438788+0000 RETRY=170 caller_uid=0, caller_gid=0{}) v5
~~~

--- Additional comment from Venky Shankar on 2023-03-30 08:56:32 UTC ---

Thank you, Sonal. I'll have a look.

--- Additional comment from Sonal on 2023-03-30 17:24:57 UTC ---

Hi Venky,

Any updates?

Regards,
Sonal Arora

--- Additional comment from Venky Shankar on 2023-03-31 04:50:35 UTC ---

(In reply to Sonal from comment #20)
> Hi Venky,
> 
> Any updates?

I'll post an update soon.

--- Additional comment from Venky Shankar on 2023-03-31 05:18:34 UTC ---

The MDS is unable to find the inode from its peers. Normally, if an inode is not in the MDCache, the MDS tries to find it by contacting its peer MDSs. However, this MDS is the only MDS in the cluster (max_mds=1), and an MDS excludes itself from that search, so the lookup finds nothing, which is expected.

```
2023-03-29T12:33:32.101673839Z debug 2023-03-29T12:33:32.100+0000 7fd6c4315700  7 mds.0.cache traverse: opening base ino 0x10000cfd1cd snap head
2023-03-29T12:33:32.101673839Z debug 2023-03-29T12:33:32.100+0000 7fd6c4315700 10 mds.0.server rdlock_path_pin_ref request(client.39931710:23624643 nref=2 cr=0x56073ae99600) #0x10000cfd1cd
2023-03-29T12:33:32.101673839Z debug 2023-03-29T12:33:32.100+0000 7fd6c4315700  7 mds.0.cache traverse: opening base ino 0x10000cfd1cd snap head
2023-03-29T12:33:32.101673839Z debug 2023-03-29T12:33:32.100+0000 7fd6c4315700 10 mds.0.server FAIL on CEPHFS_ESTALE but attempting recovery
2023-03-29T12:33:32.101673839Z debug 2023-03-29T12:33:32.100+0000 7fd6c4315700  5 mds.0.cache find_ino_peers 0x10000cfd1cd hint -1
2023-03-29T12:33:32.101673839Z debug 2023-03-29T12:33:32.100+0000 7fd6c4315700 10 mds.0.cache _do_find_ino_peer 14063259190 0x10000cfd1cd active 0 all 0 checked
2023-03-29T12:33:32.101684532Z debug 2023-03-29T12:33:32.100+0000 7fd6c4315700 10 mds.0.cache _do_find_ino_peer failed on 0x10000cfd1cd
2023-03-29T12:33:32.101684532Z debug 2023-03-29T12:33:32.100+0000 7fd6c4315700 10 MDSContext::complete: 18C_MDS_TryFindInode
2023-03-29T12:33:32.101684532Z debug 2023-03-29T12:33:32.100+0000 7fd6c4315700  7 mds.0.server reply_client_request -116 ((116) Stale file handle) client_request(client.39931710:23624643 getattr Fa #0x10000cfd1cd 2023-03-29T12:22:44.081124+0000 RETRY=236 caller_uid=0, caller_gid=0{}) v5
```

The inode is missing from the MDS cache, which means running `find` or `ls -R` should load the inode into the MDS cache. Another possibility is that the inode is being purged by the MDS. The inode number is 0x10000cfd1cd. Could you please share the output of:

> ceph tell mds.c dump inode 1099525247437

Additionally, could you also run the following from a cephfs mount:

> find <mntpt> -inum 1099525247437
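
(For clarity, 1099525247437 is just the hex inode number from the logs converted to decimal:)

```
$ printf '%d\n' 0x10000cfd1cd
1099525247437
```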

--- Additional comment from Sonal on 2023-04-03 09:24:57 UTC ---

Hi Venky,

Here is the output:

$ ceph tell mds.0 dump inode 1099525247437
2023-04-03T05:03:42.949+0000 7fa0dbfff700  0 client.41477266 ms_handle_reset on v2:10.254.24.68:6800/432486752
2023-04-03T05:03:42.982+0000 7fa0dbfff700  0 client.41477272 ms_handle_reset on v2:10.254.24.68:6800/432486752
dump inode failed, wrong inode number or the inode is not cached

$ ceph tell mds.1 dump inode 1099525247437
2023-04-03T05:03:55.501+0000 7efc6f55bdc0 -1 client.41510867 resolve_mds: gid 1 not in MDS map
2023-04-03T05:03:55.501+0000 7efc6f55bdc0 -1 client.41510867 FSMap: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
Error ENOENT: problem getting command descriptions from mds.1

# find /mnt/tmpmount/ -inum 1099525247437
/mnt/tmpmount/csi/csi-vol-082fbffc-9589-11ec-b4e8-0a580afe2eed/30a8dcb9-fbaf-441b-87fe-ca94399407e0/os1-12/content/98/24

Regards,
Sonal Arora

--- Additional comment from Venky Shankar on 2023-04-03 11:17:13 UTC ---

(In reply to Sonal from comment #23)
> Hi Venky,
> 
> Here is the output:
> 
> $ ceph tell mds.0 dump inode 1099525247437
> 2023-04-03T05:03:42.949+0000 7fa0dbfff700  0 client.41477266 ms_handle_reset
> on v2:10.254.24.68:6800/432486752
> 2023-04-03T05:03:42.982+0000 7fa0dbfff700  0 client.41477272 ms_handle_reset
> on v2:10.254.24.68:6800/432486752
> dump inode failed, wrong inode number or the inode is not cached
> 
> $ ceph tell mds.1 dump inode 1099525247437
> 2023-04-03T05:03:55.501+0000 7efc6f55bdc0 -1 client.41510867 resolve_mds:
> gid 1 not in MDS map
> 2023-04-03T05:03:55.501+0000 7efc6f55bdc0 -1 client.41510867 FSMap:
> ocs-storagecluster-cephfilesystem:1
> {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
> Error ENOENT: problem getting command descriptions from mds.1
> 
> # find /mnt/tmpmount/ -inum 1099525247437
> /mnt/tmpmount/csi/csi-vol-082fbffc-9589-11ec-b4e8-0a580afe2eed/30a8dcb9-fbaf-
> 441b-87fe-ca94399407e0/os1-12/content/98/24

That's weird. Can you fail over the MDS for now? That should clean things up.

> ceph mds fail 0

Post this, if the issue is seen again, please collect must-gather logs (with debug logs enabled as earlier).
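
To confirm the failover completed before re-testing, something like this should suffice (a sketch; the fs name is the one from the earlier comments):

```
ceph mds fail 0
ceph fs status ocs-storagecluster-cephfilesystem   # wait for rank 0 to be up:active again
```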

--- Additional comment from Sonal on 2023-04-03 14:37:55 UTC ---

Hi Venky,

The issue persists after the failover. ODF must-gather: must-gather2.tgz

Regards,
Sonal Arora

--- Additional comment from Venky Shankar on 2023-04-04 02:20:39 UTC ---

(In reply to Sonal from comment #25)
> Hi Venky,
> 
> The issue persist after doing failover. ODF must-gather : must-gather2.tgz 

Thanks, Sonal. I'll have a look. The client(s) seem to have a reference to the inode but the MDS does not have it in its cache.

--- Additional comment from Venky Shankar on 2023-04-04 09:43:26 UTC ---

Hi Sonal,

(In reply to Venky Shankar from comment #26)
> (In reply to Sonal from comment #25)
> > Hi Venky,
> > 
> > The issue persist after doing failover. ODF must-gather : must-gather2.tgz 
> 
> Thanks, Sonal. I'll have a look. The client9s) seem to have a reference to
> the inode but the MDS does not have it in its cache.

Unfortunately, the ceph version the customer is running has a client-side (userspace libcephfs) bug that causes the client to get stuck in the ESTALE loop even after running `find` or `ls -R`. The customer would have to upgrade to 16.2.11. In the meantime, you could try failing over the mgr to see if the clone operation makes progress; however, the client could get stuck again at a future point. Therefore, upgrading to 16.2.11 is the proper fix.

To fail over the mgr use:

> ceph mgr fail <id>
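
To identify the active mgr first (a sketch; "a" is the mgr daemon seen in this cluster's logs, e.g. ceph-mgr.a.log):

```
ceph mgr stat    # shows the currently active mgr
ceph mgr fail a  # fail it so a standby takes over and the volumes/clone machinery restarts
```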

--- Additional comment from Venky Shankar on 2023-04-05 09:09:54 UTC ---

Hi Sonal,

I propose that this BZ be closed with current-release state. Would you concur?

Cheers,
Venky

--- Additional comment from Venky Shankar on 2023-04-11 06:20:36 UTC ---

Sonal, ping?

--- Additional comment from Venky Shankar on 2023-04-12 13:31:14 UTC ---

Moving out to 4.14.

--- Additional comment from Sonal on 2023-04-24 18:12:14 UTC ---

Hi Venky,

I was on long leave. 

Is ceph 16.2.11 GA'ed? I do not see this version in the ceph release matrix in KCS https://access.redhat.com/solutions/2045583, and none of the ODF versions consume ceph-16.2.11 (KCS https://access.redhat.com/articles/4731161).

Regards,
Sonal Arora

--- Additional comment from Venky Shankar on 2023-04-26 09:51:56 UTC ---

Hi Sonal,

(In reply to Sonal from comment #31)
> Hi Venky,
> 
> I was on long leave. 
> 
> Is ceph 16.2.11 GA'ed? I do not see this version from ceph release metrics
> in KCS https://access.redhat.com/solutions/2045583, also none of the ODF
> version consume ceph-16.2.11 (KCS https://access.redhat.com/articles/4731161)

The fix will be available in RHCS 5.3z3 (it's already available in RHCS 6), and the KCS article will be updated with the ODF version that consumes that RHCS release.

--- Additional comment from Venky Shankar on 2023-04-27 05:16:48 UTC ---

Xiubo, please clone this for Ceph component.

Comment 9 Amarnath 2023-05-15 12:45:22 UTC
Hi All,

As part of verification, we followed the steps below:

1. Created a subvolume and filled it with data
2. Created a snapshot of it
3. Created 7 clones, of which 3 go to the pending state and 4 to the in-progress state. We wait for all clones to be created, then delete and recreate them in a for loop (see the polling sketch below):
   for i in {1..10}; do
       echo $i
       for j in {1..7}; do
           echo $j
           ceph fs subvolume snapshot clone cephfs subvol_2 snap_1 clone_${j}
           echo "Creation of clone Done"
       done
       sh clone_status.sh
       for j in {1..7}; do
           echo $j
           ceph fs subvolume rm cephfs clone_${j}
           echo "Deletion of volume done"
       done
       echo "##########################################"
   done

4. In parallel, failed MDS rank 0 with `ceph mds fail 0` in a loop, waited for it to become active again, then waited 30 seconds; this was repeated 100 times:
   for i in {1..100}; do
       echo $i
       ceph mds fail 0
       sh test_mds.sh
       echo "##########################################"
   done
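
The clone_status.sh and test_mds.sh helpers are not attached to this comment; a minimal polling version along these lines would serve the same purpose (a hypothetical sketch, not the actual QA scripts):

```
#!/usr/bin/env bash
# clone_status.sh (hypothetical): poll until no clone is still pending/in-progress
for c in {1..7}; do
    while ceph fs clone status cephfs clone_${c} 2>/dev/null \
            | grep -qE '"state": "(pending|in-progress)"'; do
        sleep 10
    done
done

# test_mds.sh (hypothetical): wait until an MDS is active again, then pause 30 seconds
while ! ceph mds stat | grep -q 'up:active'; do
    sleep 5
done
sleep 30
```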

Please find the logs for the same

http://magna002.ceph.redhat.com/ceph-qe-logs/amar/2190080/

 
Verified on 
[root@ceph-amk-upgrade-1-p2wbr0-node7 ~]# ceph versions
{
    "mon": {
        "ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)": 8
    },
    "mds": {
        "ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)": 3
    },
    "overall": {
        "ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)": 15
    }
}
[root@ceph-amk-upgrade-1-p2wbr0-node7 ~]#

Comment 14 errata-xmlrpc 2023-05-23 00:19:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.3 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3259

Comment 15 Patrick Donnelly 2023-07-06 00:41:39 UTC
*** Bug 2219761 has been marked as a duplicate of this bug. ***

