Bug 2005442 - Ceph 5.0 ceph-fuse mount attempts on client nodes core dumping [NEEDINFO]
Summary: Ceph 5.0 ceph-fuse mount attempts on client nodes core dumping
Keywords:
Status: NEW
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 5.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 5.3z1
Assignee: Venky Shankar
QA Contact: Hemanth Kumar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-09-17 16:36 UTC by mcurrier
Modified: 2023-08-14 04:44 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-03 17:36:31 UTC
Embargoed:
nravinas: needinfo? (vshankar)


Attachments
Bug 2005442 attachments (202.68 KB, text/plain), 2021-09-22 13:01 UTC, mcurrier
core file for Bug 2005442 (1.42 MB, application/x-lz4), 2021-09-22 13:03 UTC, mcurrier
ceph.log for Bug 2005442 (228.47 KB, text/plain), 2021-09-22 13:05 UTC, mcurrier
ceph.audit.log (75.33 KB, text/plain), 2021-09-22 13:06 UTC, mcurrier
ceph-volume.log (81.81 KB, text/plain), 2021-09-22 13:07 UTC, mcurrier
mon ip log (389.48 KB, text/plain), 2021-09-22 13:08 UTC, mcurrier
ceph mgr log (209.63 KB, text/plain), 2021-09-22 13:08 UTC, mcurrier


Links
Red Hat Issue Tracker RHCEPH-1726 (last updated 2021-09-17 16:37:54 UTC)

Description mcurrier 2021-09-17 16:36:41 UTC
D
Environment:  3 node AWS cluster, 3 m5.xlarge VMs, Ceph 5.0 installation

This mount command worked on the admin (bootstrap) host of the cluster:

[root@ip-172-31-32-53 ceph]# ceph-fuse /mnt/cephfs/ -n client.admin --client-fs=cephfs01
2021-09-16T17:57:37.325+0000 7ffbb1858200 -1 init, newargv = 0x55d0f420e9e0 newargc=15
ceph-fuse[238730]: starting ceph client
ceph-fuse[238730]: starting fuse
[root@ip-172-31-32-53 ceph]#
[root@ip-172-31-32-53 ceph]# df /mnt/cephfs
Filesystem     1K-blocks  Used Available Use% Mounted on
ceph-fuse      946319360     0 946319360   0% /mnt/cephfs

However, on the other two nodes in the Ceph cluster, both of the following mount attempts produced this core dump. The ceph.conf and the keyrings had been copied to the /etc/ceph directory.

[root@ip-172-31-42-149 share]# ceph-fuse /mnt/cephfs/ -n client.2 --client-fs=cephfs01
ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
Aborted (core dumped)
[root@ip-172-31-42-149 share]#
[root@ip-172-31-42-149 share]# ceph-fuse /mnt/cephfs/ -n client.admin --client-fs=cephfs01
ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
Aborted (core dumped)

Kernel mounts were also attempted on these two client nodes, such as:

# mount -t ceph ip-172-31-35-184:6789:/ /mnt/cephfs-kernel -o name=admin,fs=cphfs01

Comment 1 Patrick Donnelly 2021-09-17 17:02:30 UTC
(In reply to mcurrier from comment #0)
> D
> Environment:  3 node AWS cluster, 3 m5.xlarge VMs, Ceph 5.0 installation
> 
> This mount command worked on the admin (bootstrap) host of the cluster:
> 
> [root@ip-172-31-32-53 ceph]# ceph-fuse /mnt/cephfs/ -n client.admin
> --client-fs=cephfs01
> 2021-09-16T17:57:37.325+0000 7ffbb1858200 -1 init, newargv = 0x55d0f420e9e0
> newargc=15
> ceph-fuse[238730]: starting ceph client
> ceph-fuse[238730]: starting fuse
> [root@ip-172-31-32-53 ceph]#
> [root@ip-172-31-32-53 ceph]# df /mnt/cephfs
> Filesystem     1K-blocks  Used Available Use% Mounted on
> ceph-fuse      946319360     0 946319360   0% /mnt/cephfs
> 
> However, on the other two nodes within the Ceph cluster, these two methods
> to attempt mount caused this core dump. The ceph.conf and the keyrings were
> copied to /etc/ceph directory.
> 
> [root@ip-172-31-42-149 share]# ceph-fuse /mnt/cephfs/ -n client.2
> --client-fs=cephfs01
> ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion
> `mutex->__data.__owner == 0' failed.
> Aborted (core dumped)
> [root@ip-172-31-42-149 share]#
> [root@ip-172-31-42-149 share]# ceph-fuse /mnt/cephfs/ -n client.admin
> --client-fs=cephfs01
> ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion
> `mutex->__data.__owner == 0' failed.
> Aborted (core dumped)

Can you verify the ceph-fuse versions were the same on both hosts? Do you have a coredump you can share?

> Kernel mounts on these two client nodes such as:
> 
> # mount -t ceph ip-172-31-35-184:6789:/ /mnt/cephfs-kernel -o
> name=admin,fs=cphfs01

Do the kernel mounts succeed?
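
A minimal sketch of gathering that information on each client node, assuming systemd-coredump is in use (the exported file name is illustrative):

```
# confirm the installed ceph-fuse build matches across hosts
rpm -q ceph-fuse

# on systemd hosts, recent core dumps can usually be located and exported with
coredumpctl list ceph-fuse
coredumpctl dump ceph-fuse -o core.ceph-fuse
```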

Comment 2 mcurrier 2021-09-20 18:29:30 UTC
Hello,

Yes, the kernel mounts succeed.
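
For reference, a quick way to confirm this; kernel CephFS mounts are listed with filesystem type "ceph" (consistent with the df output later in this bug):

```
mount -t ceph
df -hT -t ceph
```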

Comment 3 Patrick Donnelly 2021-09-21 17:21:16 UTC
> fs=cphfs01

is that a typo?

In any case, please turn up debugging:

> ceph config set client debug_client 20
> ceph config set client debug_ms 1
> ceph config set client debug_monc 10

and retry the ceph-fuse mounts. Please upload the logs.
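
A minimal sketch of the retry-and-collect step, assuming the default client log location; the exact file name under /var/log/ceph/ depends on the client name:

```
# reproduce the crash after raising the debug levels
ceph-fuse /mnt/cephfs/ -n client.2 --client-fs=cephfs01

# the ceph-fuse client log is normally written under /var/log/ceph/;
# attach the matching client log to this bug
ls -l /var/log/ceph/ceph-client.*.log
```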

Comment 4 mcurrier 2021-09-22 12:55:34 UTC

The version of Ceph-fuse is:

[root@ip-172-31-32-53 ~]# ceph-fuse --version
ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)

I installed the same way on each of the three nodes. However, checking the version on the second node (one of the two non-bootstrap nodes) shows the same core dump issue:

[root@ip-172-31-42-149 ~]# ceph-fuse --verison
ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.


On the 3rd node:

[ec2-user@ip-172-31-40-206 ~]$ ceph-fuse --version
ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)

The fs=cphfs01 is part of the command as given. It is not a typo; it is the filesystem name.

I will attach the core file and logs.

Comment 5 mcurrier 2021-09-22 13:01:37 UTC
Created attachment 1825305 [details]
Bug 2005442 attachments

Comment 6 mcurrier 2021-09-22 13:03:41 UTC
Created attachment 1825306 [details]
core file for Bug 2005442

Comment 7 mcurrier 2021-09-22 13:05:28 UTC
Created attachment 1825308 [details]
ceph.log for Bug 2005442

Comment 8 mcurrier 2021-09-22 13:06:37 UTC
Created attachment 1825310 [details]
ceph.audit.log

Comment 9 mcurrier 2021-09-22 13:07:24 UTC
Created attachment 1825311 [details]
ceph-volume.log

Comment 10 mcurrier 2021-09-22 13:08:03 UTC
Created attachment 1825312 [details]
mon ip log

Comment 11 mcurrier 2021-09-22 13:08:54 UTC
Created attachment 1825313 [details]
ceph mgr log

Comment 12 Ben England 2021-09-23 17:36:43 UTC
Matt, next time please create a tarball with all the logs and post a single attachment; that's much easier. You can also create an SOS report using the command "sos report", which collects all the RHEL configuration in one big file automatically. -ben
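
A minimal sketch of what that could look like (the tarball name and log paths are illustrative):

```
# bundle the relevant logs into a single compressed attachment
tar czf bug2005442-logs.tar.gz -C /var/log/ceph .

# or gather the logs plus the RHEL configuration in one archive
sos report
```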

Comment 13 Venky Shankar 2021-11-23 04:34:38 UTC
(In reply to mcurrier from comment #5)
> Created attachment 1825305 [details]
> Bug 2005442 attachments

Hey Matt,

Did you miss attaching the client log (captured with `debug client = 20`)?

In the meantime, I'll take a look at the core dump.

Cheers,
Venky

Comment 14 mcurrier 2021-12-01 18:13:02 UTC
Hi Venky,

Sorry for late reply.  I missed this earlier.

I looked through my notes and I see I applied this:
ceph config set client debug_client 20

I hope this helps.
Matt

Comment 15 Venky Shankar 2021-12-03 05:04:53 UTC
(In reply to mcurrier from comment #14)
> Hi Venky,
> 
> Sorry for late reply.  I missed this earlier.
> 
> I looked through my notes and I see I applied this:
> ceph config set client debug_client 20

I cannot find the client logs in the attachments - just the core and the mgr, audit, and volume logs.

I couldn't get a clean backtrace from the core.

Could you please check?

> 
> I hope this helps.
> Matt

Comment 16 Venky Shankar 2021-12-03 09:20:37 UTC
Looks like an uninitialized mutex is being locked:

```
#0  0x00007f2bbcd7537f in raise () from /lib64/libc.so.6
#1  0x00007f2bbcd5fdb5 in abort () from /lib64/libc.so.6
#2  0x00007f2bbcd5fc89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x00007f2bbcd6da76 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f2bbe323b61 in pthread_mutex_lock () from /lib64/libpthread.so.0
#5  0x000055ef2578bdb7 in ?? ()
#6  0x000055ef2652a880 in ?? ()

```

I couldn't get the other stack frames for some reason (even though I have the required packages installed).

Matt - client logs would really help (and if you can also provide a backtrace through gdb, that would be great).
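
A minimal sketch of pulling a full backtrace locally, assuming the attached core decompresses cleanly and the matching debuginfo packages are installed (file names are illustrative):

```
# decompress the attached core (application/x-lz4)
lz4 -d core.lz4 core.ceph-fuse

# resolve symbols against the ceph-fuse binary and dump all thread stacks
gdb -batch -ex 'thread apply all bt' /usr/bin/ceph-fuse core.ceph-fuse > backtrace.txt
```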

Comment 17 mcurrier 2021-12-03 17:08:26 UTC
Hi Venky,

This appears to no longer be an issue in this RHCS V5 cluster.  I can now mount the ceph-fuse mount points on the other two hosts. I think we should close this bugzilla.


[root@ip-172-31-42-149 testruns]# ll /etc/ceph
total 24
-rw-------. 1 root root  63 Sep 16 15:47 ceph.client.admin.keyring
-rw-r--r--. 1 root root 175 Sep 16 15:47 ceph.conf
-rw-r--r--. 1 root root 184 Sep 16 14:30 ceph.client.2.keyring
-rw-r--r--. 1 root root  41 Sep 16 15:59 ceph.client.2.keyring.tmp
-rw-------. 1 root root 110 Dec  2 19:03 podman-auth.json
-rw-r--r--. 1 root root  92 Sep 16 14:52 rbdmap

2021-12-03T17:02:49.004+0000 7fb9671f3200 -1 init, newargv = 0x5633521c6740 newargc=15
ceph-fuse[935873]: starting ceph client
ceph-fuse[935873]: starting fuse
[root@ip-172-31-42-149 testruns]# df
Filesystem          Type            Size  Used Avail Use% Mounted on
devtmpfs            devtmpfs         16G     0   16G   0% /dev
tmpfs               tmpfs            16G   84K   16G   1% /dev/shm
tmpfs               tmpfs            16G  1.6G   14G  11% /run
tmpfs               tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/nvme0n1p2      xfs              10G  6.5G  3.6G  65% /
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/9d34db80512c92b9998999f25843420af062d13baa4958c1444283d5b53ae378/merged
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/d224cd3efeb6a738da7d5bc96b70a9fb98aaf6b6e0ef77b467b4e3b07a6b840a/merged
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/c653abe212907b403d124d56d2a1eb6420916cce645b6d1bdf5336bfab7b976d/merged
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/a04b90fec89ba4190749dd06517ce7ff284c2934982c52ec4dd29d68c60da754/merged
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/589f8ca22f4c66b15586afa03ee4989bbd65fa124f650b81c931759b6567319c/merged
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/203bb3a0b0ba30347593a7ccb825461cda3a16d4f42c6065a6866ed7e124e3a9/merged
tmpfs               tmpfs           3.1G     0  3.1G   0% /run/user/0
172.31.32.53:6789:/ ceph            1.9T  142G  1.7T   8% /mnt/kernel-cephfs
tmpfs               tmpfs           3.1G     0  3.1G   0% /run/user/1000
overlay             overlay          10G  6.5G  3.6G  65% /var/lib/containers/storage/overlay/ddcc4a2c56cb6050fc9252e7c7d2841c9ccd2a6b52df94e3a83f750e1608238c/merged
ceph-fuse           fuse.ceph-fuse  1.9T  142G  1.7T   8% /mnt/cephfs



2021-12-03T17:04:16.871+0000 7f6489dcf200 -1 init, newargv = 0x55f9f78c9930 newargc=15
ceph-fuse[1115149]: starting ceph client
ceph-fuse[1115149]: starting fuse
[root@ip-172-31-40-206 ~]# 
[root@ip-172-31-40-206 ~]# 
[root@ip-172-31-40-206 ~]# df
Filesystem          Type            Size  Used Avail Use% Mounted on
devtmpfs            devtmpfs         16G     0   16G   0% /dev
tmpfs               tmpfs            16G   84K   16G   1% /dev/shm
tmpfs               tmpfs            16G  1.6G   14G  11% /run
tmpfs               tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/nvme0n1p2      xfs              10G  7.1G  3.0G  71% /
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/f19012f83cc96aee042f5e535ebb6acfea5abe5e7aac7e6b38af2d22d32aa283/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/32a0fc30124cc92c19e8b8b8991849cd8faf373219d305def02f7d5f22f39bae/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/e6cc3f32f914f06115d5129f78f1fc1f4804504da0c4886a1d37baa945102a3b/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/480cdbc4eac1d9db38e5a697c21a1d6f8fdc9946c2b575896d586c48f7ecb42b/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/e3d11731bde9033eaf00b7933cebfe223639345feb4365ab18e524757686a9d1/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/c112930d4c94ff9a8156172dea2b0b43497856aeadbd0f55dfb70c5abf2bd539/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/aa435dfbefef1fa0bb9a2deb1b126dc8b569c14dc9bc4c4df8aea7e388a3f365/merged
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/550e0dab60d22208def86704ef7ceb01eedf241c43897cf02ff0dba7235be9e1/merged
tmpfs               tmpfs           3.1G     0  3.1G   0% /run/user/0
172.31.32.53:6789:/ ceph            1.9T  141G  1.7T   8% /mnt/kernel-cephfs
tmpfs               tmpfs           3.1G     0  3.1G   0% /run/user/1000
overlay             overlay          10G  7.1G  3.0G  71% /var/lib/containers/storage/overlay/00b0dc5a7a3b9c316bd63b4733c8f9dc0bd2bd71601ebf294e0c126bbadadc8a/merged
ceph-fuse           fuse.ceph-fuse  1.9T  141G  1.7T   8% /mnt/cephfs
[root@ip-172-31-40-206 ~]#

Comment 18 Venky Shankar 2021-12-03 17:36:31 UTC
(In reply to mcurrier from comment #17)
> Hi Venky,
> 
> This appears to no longer be an issue in this RHCS V5 cluster.  I can now
> mount the ceph-fuse mount points on the other two hosts. I think we should
> close this bugzilla.
> 

ACK - please reopen if you hit it again.

Comment 20 Venky Shankar 2023-08-14 04:44:03 UTC
This looks like it's locking an uninitialised mutex - checking.

