Bug 2249038

Summary: ceph-fuse mount attempts on client nodes core dumping
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Venky Shankar <vshankar>
Component: CephFS
Assignee: Venky Shankar <vshankar>
Status: CLOSED ERRATA
QA Contact: Hemanth Kumar <hyelloji>
Severity: low
Docs Contact:
Priority: unspecified
Version: 5.0
CC: bengland, ceph-eng-bugs, cephqe-warriors, hyelloji, mcurrier, nravinas, rpollack, sostapov, sweil, tserlin, vereddy, vshankar
Target Milestone: ---
Keywords: Reopened
Target Release: 6.1z7
Flags: rpollack: needinfo+
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: ceph-17.2.6-233
Doc Type: Bug Fix
Doc Text:
.The client process no longer causes loss of access to the Ceph File System due to incorrect lock API usage
Previously, incorrect lock API usage caused the client process to crash, resulting in loss of access to the Ceph File System. With this fix, the correct lock API is used and the Ceph File System works as expected.
Story Points: ---
Clone Of: 2005442
Environment:
Last Closed: 2024-08-28 17:57:36 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2005442, 2249039, 2347559    
Bug Blocks:    

Description Venky Shankar 2023-11-10 10:21:49 UTC
+++ This bug was initially created as a clone of Bug #2005442 +++

D
Environment:  3 node AWS cluster, 3 m5.xlarge VMs, Ceph 5.0 installation

This mount command worked on the admin (bootstrap) host of the cluster:

[root@ip-172-31-32-53 ceph]# ceph-fuse /mnt/cephfs/ -n client.admin --client-fs=cephfs01
2021-09-16T17:57:37.325+0000 7ffbb1858200 -1 init, newargv = 0x55d0f420e9e0 newargc=15
ceph-fuse[238730]: starting ceph client
ceph-fuse[238730]: starting fuse
[root@ip-172-31-32-53 ceph]#
[root@ip-172-31-32-53 ceph]# df /mnt/cephfs
Filesystem     1K-blocks  Used Available Use% Mounted on
ceph-fuse      946319360     0 946319360   0% /mnt/cephfs

However, on the other two nodes within the Ceph cluster, both of the following mount attempts caused this core dump. The ceph.conf and the keyrings had been copied to the /etc/ceph directory.

[root@ip-172-31-42-149 share]# ceph-fuse /mnt/cephfs/ -n client.2 --client-fs=cephfs01
ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
Aborted (core dumped)
[root@ip-172-31-42-149 share]#
[root@ip-172-31-42-149 share]# ceph-fuse /mnt/cephfs/ -n client.admin --client-fs=cephfs01
ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
Aborted (core dumped)

Kernel mounts were also attempted on these two client nodes, such as:

# mount -t ceph ip-172-31-35-184:6789:/ /mnt/cephfs-kernel -o name=admin,fs=cphfs01

--- Additional comment from Patrick Donnelly on 2021-09-17 17:02:30 UTC ---

(In reply to mcurrier from comment #0)
> D
> Environment:  3 node AWS cluster, 3 m5.xlarge VMs, Ceph 5.0 installation
> 
> This mount command worked on the admin (bootstrap) host of the cluster:
> 
> [root@ip-172-31-32-53 ceph]# ceph-fuse /mnt/cephfs/ -n client.admin
> --client-fs=cephfs01
> 2021-09-16T17:57:37.325+0000 7ffbb1858200 -1 init, newargv = 0x55d0f420e9e0
> newargc=15
> ceph-fuse[238730]: starting ceph client
> ceph-fuse[238730]: starting fuse
> [root@ip-172-31-32-53 ceph]#
> [root@ip-172-31-32-53 ceph]# df /mnt/cephfs
> Filesystem     1K-blocks  Used Available Use% Mounted on
> ceph-fuse      946319360     0 946319360   0% /mnt/cephfs
> 
> However, on the other two nodes within the Ceph cluster, these two methods
> to attempt mount caused this core dump. The ceph.conf and the keyrings were
> copied to /etc/ceph directory.
> 
> [root@ip-172-31-42-149 share]# ceph-fuse /mnt/cephfs/ -n client.2
> --client-fs=cephfs01
> ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion
> `mutex->__data.__owner == 0' failed.
> Aborted (core dumped)
> [root@ip-172-31-42-149 share]#
> [root@ip-172-31-42-149 share]# ceph-fuse /mnt/cephfs/ -n client.admin
> --client-fs=cephfs01
> ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion
> `mutex->__data.__owner == 0' failed.
> Aborted (core dumped)

Can you verify the ceph-fuse versions were the same on both hosts? Do you have a coredump you can share?

> Kernel mounts on these two client nodes such as:
> 
> # mount -t ceph ip-172-31-35-184:6789:/ /mnt/cephfs-kernel -o
> name=admin,fs=cphfs01

Do the kernel mounts succeed?

--- Additional comment from  on 2021-09-20 18:29:30 UTC ---

Hello,

Yes, the kernel mounts succeed.

--- Additional comment from Patrick Donnelly on 2021-09-21 17:21:16 UTC ---

> fs=cphfs01

is that a typo?

In any case, please turn up debugging:

> ceph config set client debug_client 20
> ceph config set client debug_ms 1
> ceph config set client debug_monc 10

and retry the ceph-fuse mounts. Please upload the logs.
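
For reference, the same debug settings can also go into the client's local ceph.conf (a minimal sketch, assuming ceph-fuse on those hosts reads /etc/ceph/ceph.conf; the `log file` path is just an example):

```
[client]
    debug_client = 20
    debug_ms = 1
    debug_monc = 10
    # write the client log somewhere easy to grab and attach
    log file = /var/log/ceph/$cluster-$name.$pid.log
```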

--- Additional comment from  on 2021-09-22 12:55:34 UTC ---



The version of Ceph-fuse is:

[root@ip-172-31-32-53 ~]# ceph-fuse --version
ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)

I installed it the same way on each of the three nodes. However, checking the version on the second, non-bootstrap node shows the same core dump issue:

[root@ip-172-31-42-149 ~]# ceph-fuse --verison
ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.


On the 3rd node:

[ec2-user@ip-172-31-40-206 ~]$ ceph-fuse --version
ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)

The fs=cphfs01 is part of the command given. It is not a typo; it is the filesystem name.

I will attach the core file and logs.

--- Additional comment from Ben England on 2021-09-23 17:36:43 UTC ---

Matt, next time create a tarball with all the logs and post one attachment; it's much easier. You can also create an SOS report using the command "sos report". That collects all the RHEL config in one big file automatically. -ben

--- Additional comment from Venky Shankar on 2021-11-23 04:34:38 UTC ---

(In reply to mcurrier from comment #5)
> Created attachment 1825305 [details]
> Bug 2005442 attachments

Hey Matt,

Did you miss uploading the client log (with `debug client = 20`)?

In the meantime, I'll take a look at the core dump.

Cheers,
Venky

--- Additional comment from  on 2021-12-01 18:13:02 UTC ---

Hi Venky,

Sorry for the late reply; I missed this earlier.

I looked through my notes and I see I applied this:
ceph config set client debug_client 20

I hope this helps.
Matt

--- Additional comment from Venky Shankar on 2021-12-03 05:04:53 UTC ---

(In reply to mcurrier from comment #14)
> Hi Venky,
> 
> Sorry for late reply.  I missed this earlier.
> 
> I looked through my notes and I see I applied this:
> ceph config set client debug_client 20

I cannot find the client logs in the attachment - just the core, mgr, audit, and volume logs.

I couldn't get a clean backtrace from the core.

Could you please check?

> 
> I hope this helps.
> Matt

--- Additional comment from Venky Shankar on 2021-12-03 09:20:37 UTC ---

Looks like an uninitialized mutex is being locked:

```
#0  0x00007f2bbcd7537f in raise () from /lib64/libc.so.6
#1  0x00007f2bbcd5fdb5 in abort () from /lib64/libc.so.6
#2  0x00007f2bbcd5fc89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x00007f2bbcd6da76 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f2bbe323b61 in pthread_mutex_lock () from /lib64/libpthread.so.0
#5  0x000055ef2578bdb7 in ?? ()
#6  0x000055ef2652a880 in ?? ()

```

I couldn't get the other stack frames for some reason (though I do have the required packages installed).

Matt - client logs would really help (and/or if you can provide the backtrace through gdb too, that would be great).
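
If it helps, roughly these steps should produce a symbolized backtrace from the core (a sketch, assuming systemd-coredump caught the crash and matching debuginfo packages are available):

```
# dnf debuginfo-install ceph-fuse glibc    # pull in symbols (package names may vary by repo setup)
# coredumpctl list ceph-fuse               # find the PID of the crashed process
# coredumpctl debug <PID>                  # open that core in gdb
(gdb) set logging file ceph-fuse-bt.txt
(gdb) set logging on
(gdb) thread apply all bt full             # full backtrace of every thread
(gdb) set logging off
```

Attaching the resulting ceph-fuse-bt.txt here would be enough.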

--- Additional comment from  on 2021-12-03 17:08:26 UTC ---

Hi Venky,

This no longer appears to be an issue in this RHCS 5 cluster. I can now create the ceph-fuse mounts on the other two hosts. I think we should close this bugzilla.


[root@ip-172-31-42-149 testruns]# ll /etc/ceph
total 24
-rw-------. 1 root root  63 Sep 16 15:47 ceph.client.admin.keyring
-rw-r--r--. 1 root root 175 Sep 16 15:47 ceph.conf
-rw-r--r--. 1 root root 184 Sep 16 14:30 ceph.client.2.keyring
-rw-r--r--. 1 root root  41 Sep 16 15:59 ceph.client.2.keyring.tmp
-rw-------. 1 root root 110 Dec  2 19:03 podman-auth.json
-rw-r--r--. 1 root root  92 Sep 16 14:52 rbdmap

2021-12-03T17:02:49.004+0000 7fb9671f3200 -1 init, newargv = 0x5633521c6740 newargc=15
ceph-fuse[935873]: starting ceph client
ceph-fuse[935873]: starting fuse
[root@ip-172-31-42-149 testruns]# df
Filesystem          Type            Size  Used Avail Use% Mounted on
devtmpfs            devtmpfs         16G     0   16G   0% /dev
tmpfs               tmpfs            16G   84K   16G   1% /dev/shm
tmpfs               tmpfs            16G  1.6G   14G  11% /run
tmpfs               tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/nvme0n1p2      xfs              10G  6.5G  3.6G  65% /
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/9d34db80512c92b9998999f25843420af062d13baa4958c1444283d5b53ae378/merged
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/d224cd3efeb6a738da7d5bc96b70a9fb98aaf6b6e0ef77b467b4e3b07a6b840a/merged
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/c653abe212907b403d124d56d2a1eb6420916cce645b6d1bdf5336bfab7b976d/merged
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/a04b90fec89ba4190749dd06517ce7ff284c2934982c52ec4dd29d68c60da754/merged
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/589f8ca22f4c66b15586afa03ee4989bbd65fa124f650b81c931759b6567319c/merged
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/203bb3a0b0ba30347593a7ccb825461cda3a16d4f42c6065a6866ed7e124e3a9/merged
tmpfs               tmpfs           3.1G     0  3.1G   0% /run/user/0
172.31.32.53:6789:/ ceph            1.9T  142G  1.7T   8% /mnt/kernel-cephfs
tmpfs               tmpfs           3.1G     0  3.1G   0% /run/user/1000
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/ddcc4a2c56cb6050fc9252e7c7d2841c9ccd2a6b52df94e3a83f750e1608238c/merged
ceph-fuse           fuse.ceph-fuse  1.9T  142G  1.7T   8% /mnt/cephfs



2021-12-03T17:04:16.871+0000 7f6489dcf200 -1 init, newargv = 0x55f9f78c9930 newargc=15
ceph-fuse[1115149]: starting ceph client
ceph-fuse[1115149]: starting fuse
[root@ip-172-31-40-206 ~]# 
[root@ip-172-31-40-206 ~]# 
[root@ip-172-31-40-206 ~]# df
Filesystem          Type            Size  Used Avail Use% Mounted on
devtmpfs            devtmpfs         16G     0   16G   0% /dev
tmpfs               tmpfs            16G   84K   16G   1% /dev/shm
tmpfs               tmpfs            16G  1.6G   14G  11% /run
tmpfs               tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/nvme0n1p2      xfs              10G  7.1G  3.0G  71% /
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/f19012f83cc96aee042f5e535ebb6acfea5abe5e7aac7e6b38af2d22d32aa283/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/32a0fc30124cc92c19e8b8b8991849cd8faf373219d305def02f7d5f22f39bae/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/e6cc3f32f914f06115d5129f78f1fc1f4804504da0c4886a1d37baa945102a3b/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/480cdbc4eac1d9db38e5a697c21a1d6f8fdc9946c2b575896d586c48f7ecb42b/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/e3d11731bde9033eaf00b7933cebfe223639345feb4365ab18e524757686a9d1/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/c112930d4c94ff9a8156172dea2b0b43497856aeadbd0f55dfb70c5abf2bd539/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/aa435dfbefef1fa0bb9a2deb1b126dc8b569c14dc9bc4c4df8aea7e388a3f365/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/550e0dab60d22208def86704ef7ceb01eedf241c43897cf02ff0dba7235be9e1/merged
tmpfs               tmpfs           3.1G     0  3.1G   0% /run/user/0
172.31.32.53:6789:/ ceph            1.9T  141G  1.7T   8% /mnt/kernel-cephfs
tmpfs               tmpfs           3.1G     0  3.1G   0% /run/user/1000
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/00b0dc5a7a3b9c316bd63b4733c8f9dc0bd2bd71601ebf294e0c126bbadadc8a/merged
ceph-fuse           fuse.ceph-fuse  1.9T  141G  1.7T   8% /mnt/cephfs
[root@ip-172-31-40-206 ~]#

--- Additional comment from Venky Shankar on 2021-12-03 17:36:31 UTC ---

(In reply to mcurrier from comment #17)
> Hi Venky,
> 
> This appears to no longer be an issue in this RHCS V5 cluster.  I can now
> mount the ceph-fuse mount points on the other two hosts. I think we should
> close this bugzilla.
> 

ACK - please reopen if you hit it again.

--- Additional comment from  on 2023-08-07 08:25:56 UTC ---

I'm observing the same issue described in this bug in my lab.

* Ceph version:

		ceph versions
		{
		    "mon": {
		        "ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)": 5
		    },
		    "mgr": {
		        "ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)": 2
		    },
		    "osd": {
		        "ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)": 3
		    },
		    "mds": {
		        "ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)": 2
		    },
		    "rgw": {
		        "ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)": 2
		    },
		    "overall": {
		        "ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)": 14
		    }
		}

* The CephFS file system can be mounted correctly using the kernel client:

		df | grep cephfs
		10.0.88.144:6789,10.0.88.193:6789,10.0.90.124:6789,10.0.91.14:6789,10.0.94.189:6789:/   9547776       0   9547776   0% /mnt/cephfs


* However, the ceph-fuse mount fails:

		# ceph-fuse -n client.cephfs --client_fs my-filesystem /mnt/ceph-fuse
		ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
		Aborted (core dumped)


		# ceph-fuse version
		ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
		Aborted (core dumped)

* Backtrace of the core file:

		# coredumpctl info 546820
		           PID: 546820 (ceph-fuse)
		           UID: 0 (root)
		           GID: 0 (root)
		        Signal: 6 (ABRT)
		     Timestamp: Mon 2023-08-07 04:16:05 EDT (2min 13s ago)
		  Command Line: ceph-fuse versions
		    Executable: /usr/bin/ceph-fuse
		 Control Group: /user.slice/user-1000.slice/session-7.scope
		          Unit: session-7.scope
		         Slice: user-1000.slice
		       Session: 7
		     Owner UID: 1000 (quickcluster)
		       Boot ID: 44877874079142e981164132ac16023a
		    Machine ID: 29565e14434c4afb8b0afd9f014d71a5
		      Hostname: rgws-1.nravinargw1.lab.upshift.rdu2.redhat.com
		       Storage: /var/lib/systemd/coredump/core.ceph-fuse.0.44877874079142e981164132ac16023a.546820.1691396165000000.lz4
		       Message: Process 546820 (ceph-fuse) of user 0 dumped core.
		
		                Stack trace of thread 546820:
		                #0  0x00007f66b42a2a4f raise (libc.so.6)
		                #1  0x00007f66b4275db5 abort (libc.so.6)
		                #2  0x00007f66b4275c89 __assert_fail_base.cold.0 (libc.so.6)
		                #3  0x00007f66b429b3a6 __assert_fail (libc.so.6)
		                #4  0x00007f66b583ccb1 __pthread_mutex_lock (libpthread.so.0)
		                #5  0x0000563393237387 _Z15global_pre_initPKSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES5_St4lessIS5_ESaISt4pairIKS5_S5_EEERSt6vectorIPKcSaISH_EEj18code_environment_ti (ceph-fuse)
		                #6  0x0000563393239576 _Z11global_initPKSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES5_St4lessIS5_ESaISt4pairIKS5_S5_EEERSt6vectorIPKcSaISH_EEj18code_environment_tib (ceph-fuse)
		                #7  0x000056339314a01e main (ceph-fuse)
		                #8  0x00007f66b428eca3 __libc_start_main (libc.so.6)
		                #9  0x000056339315173e _start (ceph-fuse)  


		(gdb) t a a bt
		
		Thread 1 (Thread 0x7f66bfaf6380 (LWP 546820)):
		#0  0x00007f66b42a2a4f in raise () from /lib64/libc.so.6
		#1  0x00007f66b4275db5 in abort () from /lib64/libc.so.6
		#2  0x00007f66b4275c89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
		#3  0x00007f66b429b3a6 in __assert_fail () from /lib64/libc.so.6
		#4  0x00007f66b583ccb1 in pthread_mutex_lock () from /lib64/libpthread.so.0
		#5  0x0000563393237387 in global_pre_init(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const*, std::vector<char const*, std::allocator<char const*> >&, unsigned int, code_environment_t, int) ()
		#6  0x0000563393239576 in global_init(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const*, std::vector<char const*, std::allocator<char const*> >&, unsigned int, code_environment_t, int, bool) ()
		#7  0x000056339314a01e in main ()
		

* Please note that this issue is not tied to any customer-facing problem, as it was observed while I was working in my lab, so it's a low-priority request. Please let me know any other information you might need.

Thank you,

Natalia

--- Additional comment from Venky Shankar on 2023-08-14 04:44:03 UTC ---

This looks like it's locking an uninitialised mutex - checking.
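
For context, a minimal sketch of the failure mode itself (illustrative only, not the actual ceph-fuse code path): glibc's nptl checks the mutex owner field, and handing pthread_mutex_lock() storage that was never initialised as a mutex is undefined behaviour that can abort with exactly the assertion seen in the backtraces above.

```
// build: g++ -pthread -o mutex_repro mutex_repro.cpp
#include <cstring>
#include <pthread.h>

int main() {
    // Correct: a mutex initialised before first use never trips the assertion.
    pthread_mutex_t good = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_lock(&good);
    pthread_mutex_unlock(&good);

    // Incorrect: locking storage that was never initialised as a mutex.
    // Filling it with garbage makes the problem visible; on glibc this can
    // die with the same assertion reported here:
    //   __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
    pthread_mutex_t bad;
    std::memset(&bad, 0xA5, sizeof(bad));
    pthread_mutex_lock(&bad);   // undefined behaviour; may abort
    return 0;
}
```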

--- Additional comment from Venky Shankar on 2023-11-09 12:14:18 UTC ---

Change is under review: https://github.com/ceph/ceph/pull/54433

Will be backported to RHCS6 and RHCS7 releases.

--- Additional comment from Venky Shankar on 2023-11-10 10:09:28 UTC ---

(In reply to Venky Shankar from comment #21)
> Change is under review: https://github.com/ceph/ceph/pull/54433
> 
> Will be backported to RHCS6 and RHCS7 releases.

Will be ported to RHCS5 too.

Comment 14 errata-xmlrpc 2024-08-28 17:57:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.1 security, bug fix, and enhancement updates.), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:5960