Bug 2005442 - Ceph 5.0 ceph-fuse mount attempts on client nodes core dumping
Summary: Ceph 5.0 ceph-fuse mount attempts on client nodes core dumping
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 5.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 5.3z9
Assignee: Venky Shankar
QA Contact: Hemanth Kumar
URL:
Whiteboard:
Depends On:
Blocks: 2249038 2249039 2347559
 
Reported: 2021-09-17 16:36 UTC by mcurrier
Modified: 2025-05-05 11:01 UTC
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
.The client process no longer causes loss of access to the Ceph File System due to incorrect lock API usage
Previously, incorrect lock API usage caused the client process to crash, resulting in loss of access to the Ceph File System. With this fix, the correct lock API is used and the Ceph File System works as expected.
Clone Of:
Clones: 2249038 2249039
Environment:
Last Closed: 2021-12-03 17:36:31 UTC
Embargoed:
ngangadh: needinfo? (vshankar)
ngangadh: needinfo+
gfarnum: needinfo? (vshankar)
gfarnum: needinfo-


Attachments
Bug 2005442 attachments (202.68 KB, text/plain) - 2021-09-22 13:01 UTC, mcurrier
core file for Bug 2005442 (1.42 MB, application/x-lz4) - 2021-09-22 13:03 UTC, mcurrier
ceph.log for Bug 2005442 (228.47 KB, text/plain) - 2021-09-22 13:05 UTC, mcurrier
ceph.audit.log (75.33 KB, text/plain) - 2021-09-22 13:06 UTC, mcurrier
ceph-volume.log (81.81 KB, text/plain) - 2021-09-22 13:07 UTC, mcurrier
mon ip log (389.48 KB, text/plain) - 2021-09-22 13:08 UTC, mcurrier
ceph mgr log (209.63 KB, text/plain) - 2021-09-22 13:08 UTC, mcurrier


Links
Ceph Project Bug Tracker 63494 - 2023-11-09 12:14:18 UTC
Red Hat Issue Tracker RHCEPH-1726 - 2021-09-17 16:37:54 UTC

Description mcurrier 2021-09-17 16:36:41 UTC
Environment:  3 node AWS cluster, 3 m5.xlarge VMs, Ceph 5.0 installation

This mount command worked on the admin (bootstrap) host of the cluster:

[root@ip-172-31-32-53 ceph]# ceph-fuse /mnt/cephfs/ -n client.admin --client-fs=cephfs01
2021-09-16T17:57:37.325+0000 7ffbb1858200 -1 init, newargv = 0x55d0f420e9e0 newargc=15
ceph-fuse[238730]: starting ceph client
ceph-fuse[238730]: starting fuse
[root@ip-172-31-32-53 ceph]#
[root@ip-172-31-32-53 ceph]# df /mnt/cephfs
Filesystem     1K-blocks  Used Available Use% Mounted on
ceph-fuse      946319360     0 946319360   0% /mnt/cephfs

However, on the other two nodes in the Ceph cluster, both of the following mount attempts produced this core dump. The ceph.conf and the keyrings had been copied to the /etc/ceph directory.

[root@ip-172-31-42-149 share]# ceph-fuse /mnt/cephfs/ -n client.2 --client-fs=cephfs01
ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
Aborted (core dumped)
[root@ip-172-31-42-149 share]#
[root@ip-172-31-42-149 share]# ceph-fuse /mnt/cephfs/ -n client.admin --client-fs=cephfs01
ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
Aborted (core dumped)

Kernel mounts were also tried on these two client nodes, such as:

# mount -t ceph ip-172-31-35-184:6789:/ /mnt/cephfs-kernel -o name=admin,fs=cphfs01

Comment 1 Patrick Donnelly 2021-09-17 17:02:30 UTC
(In reply to mcurrier from comment #0)
> D
> Environment:  3 node AWS cluster, 3 m5.xlarge VMs, Ceph 5.0 installation
> 
> This mount command worked on the admin (bootstrap) host of the cluster:
> 
> [root@ip-172-31-32-53 ceph]# ceph-fuse /mnt/cephfs/ -n client.admin
> --client-fs=cephfs01
> 2021-09-16T17:57:37.325+0000 7ffbb1858200 -1 init, newargv = 0x55d0f420e9e0
> newargc=15
> ceph-fuse[238730]: starting ceph client
> ceph-fuse[238730]: starting fuse
> [root@ip-172-31-32-53 ceph]#
> [root@ip-172-31-32-53 ceph]# df /mnt/cephfs
> Filesystem     1K-blocks  Used Available Use% Mounted on
> ceph-fuse      946319360     0 946319360   0% /mnt/cephfs
> 
> However, on the other two nodes within the Ceph cluster, these two methods
> to attempt mount caused this core dump. The ceph.conf and the keyrings were
> copied to /etc/ceph directory.
> 
> [root@ip-172-31-42-149 share]# ceph-fuse /mnt/cephfs/ -n client.2
> --client-fs=cephfs01
> ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion
> `mutex->__data.__owner == 0' failed.
> Aborted (core dumped)
> [root@ip-172-31-42-149 share]#
> [root@ip-172-31-42-149 share]# ceph-fuse /mnt/cephfs/ -n client.admin
> --client-fs=cephfs01
> ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion
> `mutex->__data.__owner == 0' failed.
> Aborted (core dumped)

Can you verify the ceph-fuse versions were the same on both hosts? Do you have a coredump you can share?

> Kernel mounts on these two client nodes such as:
> 
> # mount -t ceph ip-172-31-35-184:6789:/ /mnt/cephfs-kernel -o
> name=admin,fs=cphfs01

Do the kernel mounts succeed?
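
For reference, a minimal sketch of one way to gather both pieces of information on each node (the coredumpctl commands assume systemd-coredump is catching the crash, which may not hold if abrt is configured instead; paths are examples):

```
# Confirm the installed ceph-fuse build matches across nodes:
ceph-fuse --version
rpm -q ceph-fuse

# If systemd-coredump caught the abort, locate and export the core file:
coredumpctl list ceph-fuse
coredumpctl dump ceph-fuse -o /tmp/ceph-fuse.core
```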

Comment 2 mcurrier 2021-09-20 18:29:30 UTC
Hello,

Yes, the kernel mounts succeed.

Comment 3 Patrick Donnelly 2021-09-21 17:21:16 UTC
> fs=cphfs01

is that a typo?

In any case, please turn up debugging:

> ceph config set client debug_client 20
> ceph config set client debug_ms 1
> ceph config set client debug_monc 10

and retry the ceph-fuse mounts. Please upload the logs.
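
As a sketch of an alternative (assuming the standard Ceph config-override syntax on the command line), the same debug settings can be passed directly to ceph-fuse for a single run, together with an explicit log file so the client log ends up in a known location; the log path below is only an example:

```
# Per-invocation debug overrides plus an explicit client log file
# (instead of, or in addition to, the cluster-wide `ceph config set`):
ceph-fuse /mnt/cephfs/ -n client.2 --client-fs=cephfs01 \
    --debug-client=20 --debug-ms=1 --debug-monc=10 \
    --log-file=/var/log/ceph/ceph-client.2.log
```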

Comment 4 mcurrier 2021-09-22 12:55:34 UTC

The version of Ceph-fuse is:

[root@ip-172-31-32-53 ~]# ceph-fuse --version
ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)

I installed it the same way on each of the three nodes. However, checking the version on the second of the two non-bootstrap nodes shows the same core dump issue:

[root@ip-172-31-42-149 ~]# ceph-fuse --verison
ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.


On the 3rd node:

[ec2-user@ip-172-31-40-206 ~]$ ceph-fuse --version
ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)

The fs=cphfs01 is part of the command as given. It is not a typo; it is the filesystem name.

I will attach the core file and logs.

Comment 5 mcurrier 2021-09-22 13:01:37 UTC
Created attachment 1825305 [details]
Bug 2005442 attachments

Comment 6 mcurrier 2021-09-22 13:03:41 UTC
Created attachment 1825306 [details]
core file for Bug 2005442

Comment 7 mcurrier 2021-09-22 13:05:28 UTC
Created attachment 1825308 [details]
ceph.log for Bug 2005442

Comment 8 mcurrier 2021-09-22 13:06:37 UTC
Created attachment 1825310 [details]
ceph.audit.log

Comment 9 mcurrier 2021-09-22 13:07:24 UTC
Created attachment 1825311 [details]
ceph-volume.log

Comment 10 mcurrier 2021-09-22 13:08:03 UTC
Created attachment 1825312 [details]
mon ip log

Comment 11 mcurrier 2021-09-22 13:08:54 UTC
Created attachment 1825313 [details]
ceph mgr log

Comment 12 Ben England 2021-09-23 17:36:43 UTC
Matt, next time please create a tarball with all the logs and post a single attachment; it's much easier. You can also create an SOS report using the command "sos report", which collects all the RHEL configuration in one big file automatically. -ben
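
(A minimal sketch of the two suggestions above; the archive name and log directory are only examples:)

```
# Bundle the relevant logs into a single attachment:
tar czf bug2005442-logs.tar.gz /var/log/ceph/*.log

# Or collect the full RHEL configuration and logs in one archive:
sos report
```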

Comment 13 Venky Shankar 2021-11-23 04:34:38 UTC
(In reply to mcurrier from comment #5)
> Created attachment 1825305 [details]
> Bug 2005442 attachments

Hey Matt,

Did you miss uploading the client log (with `debug client = 20`)?

In the meantime, I'll take a look at the core dump.

Cheers,
Venky

Comment 14 mcurrier 2021-12-01 18:13:02 UTC
Hi Venky,

Sorry for the late reply. I missed this earlier.

I looked through my notes and I see I applied this:
ceph config set client debug_client 20

I hope this helps.
Matt

Comment 15 Venky Shankar 2021-12-03 05:04:53 UTC
(In reply to mcurrier from comment #14)
> Hi Venky,
> 
> Sorry for late reply.  I missed this earlier.
> 
> I looked through my notes and I see I applied this:
> ceph config set client debug_client 20

I cannot find the client logs in the attachments - just the core, mgr, audit, and volume logs.

I couldn't get a clean backtrace from the core.

Could you please check?

> 
> I hope this helps.
> Matt

Comment 16 Venky Shankar 2021-12-03 09:20:37 UTC
Looks like an uninitialized mutex is being locked:

```
#0  0x00007f2bbcd7537f in raise () from /lib64/libc.so.6
#1  0x00007f2bbcd5fdb5 in abort () from /lib64/libc.so.6
#2  0x00007f2bbcd5fc89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x00007f2bbcd6da76 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f2bbe323b61 in pthread_mutex_lock () from /lib64/libpthread.so.0
#5  0x000055ef2578bdb7 in ?? ()
#6  0x000055ef2652a880 in ?? ()

```

I couldn't get the other stack frames for some reason (even though I have the required packages installed).

Matt, the client logs would really help (and if you can also provide the backtrace through gdb, that would be great).
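
For the gdb part, a rough sketch of how a full backtrace could be pulled from the attached core (the attachment is lz4-compressed; the binary path and file names are examples, and the matching debuginfo packages need to be installed, e.g. via `dnf debuginfo-install ceph-fuse`):

```
# Decompress the attached core and dump a full backtrace in batch mode:
lz4 -d ceph-fuse.core.lz4 ceph-fuse.core
gdb -batch -ex 'set pagination off' \
    -ex 'thread apply all bt full' \
    /usr/bin/ceph-fuse ceph-fuse.core > ceph-fuse-backtrace.txt
```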

Comment 17 mcurrier 2021-12-03 17:08:26 UTC
Hi Venky,

This no longer appears to be an issue in this RHCS v5 cluster. I can now mount the ceph-fuse mount points on the other two hosts, so I think we should close this Bugzilla.


[root@ip-172-31-42-149 testruns]# ll /etc/ceph
total 24
-rw-------. 1 root root  63 Sep 16 15:47 ceph.client.admin.keyring
-rw-r--r--. 1 root root 175 Sep 16 15:47 ceph.conf
-rw-r--r--. 1 root root 184 Sep 16 14:30 ceph.client.2.keyring
-rw-r--r--. 1 root root  41 Sep 16 15:59 ceph.client.2.keyring.tmp
-rw-------. 1 root root 110 Dec  2 19:03 podman-auth.json
-rw-r--r--. 1 root root  92 Sep 16 14:52 rbdmap

2021-12-03T17:02:49.004+0000 7fb9671f3200 -1 init, newargv = 0x5633521c6740 newargc=15
ceph-fuse[935873]: starting ceph client
ceph-fuse[935873]: starting fuse
[root@ip-172-31-42-149 testruns]# df
Filesystem          Type            Size  Used Avail Use% Mounted on
devtmpfs            devtmpfs         16G     0   16G   0% /dev
tmpfs               tmpfs            16G   84K   16G   1% /dev/shm
tmpfs               tmpfs            16G  1.6G   14G  11% /run
tmpfs               tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/nvme0n1p2      xfs              10G  6.5G  3.6G  65% /
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/9d34db80512c92b9998999f25843420af062d13baa4958c1444283d5b53ae378/merged
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/d224cd3efeb6a738da7d5bc96b70a9fb98aaf6b6e0ef77b467b4e3b07a6b840a/merged
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/c653abe212907b403d124d56d2a1eb6420916cce645b6d1bdf5336bfab7b976d/merged
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/a04b90fec89ba4190749dd06517ce7ff284c2934982c52ec4dd29d68c60da754/merged
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/589f8ca22f4c66b15586afa03ee4989bbd65fa124f650b81c931759b6567319c/merged
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/203bb3a0b0ba30347593a7ccb825461cda3a16d4f42c6065a6866ed7e124e3a9/merged
tmpfs               tmpfs           3.1G     0  3.1G   0% /run/user/0
172.31.32.53:6789:/ ceph            1.9T  142G  1.7T   8% /mnt/kernel-cephfs
tmpfs               tmpfs           3.1G     0  3.1G   0% /run/user/1000
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/ddcc4a2c56cb6050fc9252e7c7d2841c9ccd2a6b52df94e3a83f750e1608238c/merged
ceph-fuse           fuse.ceph-fuse  1.9T  142G  1.7T   8% /mnt/cephfs



2021-12-03T17:04:16.871+0000 7f6489dcf200 -1 init, newargv = 0x55f9f78c9930 newargc=15
ceph-fuse[1115149]: starting ceph client
ceph-fuse[1115149]: starting fuse
[root@ip-172-31-40-206 ~]# 
[root@ip-172-31-40-206 ~]# 
[root@ip-172-31-40-206 ~]# df
Filesystem          Type            Size  Used Avail Use% Mounted on
devtmpfs            devtmpfs         16G     0   16G   0% /dev
tmpfs               tmpfs            16G   84K   16G   1% /dev/shm
tmpfs               tmpfs            16G  1.6G   14G  11% /run
tmpfs               tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/nvme0n1p2      xfs              10G  7.1G  3.0G  71% /
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/f19012f83cc96aee042f5e535ebb6acfea5abe5e7aac7e6b38af2d22d32aa283/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/32a0fc30124cc92c19e8b8b8991849cd8faf373219d305def02f7d5f22f39bae/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/e6cc3f32f914f06115d5129f78f1fc1f4804504da0c4886a1d37baa945102a3b/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/480cdbc4eac1d9db38e5a697c21a1d6f8fdc9946c2b575896d586c48f7ecb42b/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/e3d11731bde9033eaf00b7933cebfe223639345feb4365ab18e524757686a9d1/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/c112930d4c94ff9a8156172dea2b0b43497856aeadbd0f55dfb70c5abf2bd539/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/aa435dfbefef1fa0bb9a2deb1b126dc8b569c14dc9bc4c4df8aea7e388a3f365/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/550e0dab60d22208def86704ef7ceb01eedf241c43897cf02ff0dba7235be9e1/merged
tmpfs               tmpfs           3.1G     0  3.1G   0% /run/user/0
172.31.32.53:6789:/ ceph            1.9T  141G  1.7T   8% /mnt/kernel-cephfs
tmpfs               tmpfs           3.1G     0  3.1G   0% /run/user/1000
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/00b0dc5a7a3b9c316bd63b4733c8f9dc0bd2bd71601ebf294e0c126bbadadc8a/merged
ceph-fuse           fuse.ceph-fuse  1.9T  141G  1.7T   8% /mnt/cephfs
[root@ip-172-31-40-206 ~]#

Comment 18 Venky Shankar 2021-12-03 17:36:31 UTC
(In reply to mcurrier from comment #17)
> Hi Venky,
> 
> This appears to no longer be an issue in this RHCS V5 cluster.  I can now
> mount the ceph-fuse mount points on the other two hosts. I think we should
> close this bugzilla.
> 

ACK - please reopen if you hit it again.

Comment 20 Venky Shankar 2023-08-14 04:44:03 UTC
This looks like it's locking an uninitialised mutex; checking.

