Environment: 3 node AWS cluster, 3 m5.xlarge VMs, Ceph 5.0 installation

This mount command worked on the admin (bootstrap) host of the cluster:

[root@ip-172-31-32-53 ceph]# ceph-fuse /mnt/cephfs/ -n client.admin --client-fs=cephfs01
2021-09-16T17:57:37.325+0000 7ffbb1858200 -1 init, newargv = 0x55d0f420e9e0 newargc=15
ceph-fuse[238730]: starting ceph client
ceph-fuse[238730]: starting fuse
[root@ip-172-31-32-53 ceph]#
[root@ip-172-31-32-53 ceph]# df /mnt/cephfs
Filesystem 1K-blocks Used Available Use% Mounted on
ceph-fuse 946319360 0 946319360 0% /mnt/cephfs

However, on the other two nodes within the Ceph cluster, these two methods to attempt mount caused this core dump. The ceph.conf and the keyrings were copied to /etc/ceph directory.

[root@ip-172-31-42-149 share]# ceph-fuse /mnt/cephfs/ -n client.2 --client-fs=cephfs01
ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
Aborted (core dumped)
[root@ip-172-31-42-149 share]#
[root@ip-172-31-42-149 share]# ceph-fuse /mnt/cephfs/ -n client.admin --client-fs=cephfs01
ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
Aborted (core dumped)

Kernel mounts on these two client nodes such as:

# mount -t ceph ip-172-31-35-184:6789:/ /mnt/cephfs-kernel -o name=admin,fs=cphfs01
(In reply to mcurrier from comment #0)
> Environment: 3 node AWS cluster, 3 m5.xlarge VMs, Ceph 5.0 installation
>
> This mount command worked on the admin (bootstrap) host of the cluster:
>
> [root@ip-172-31-32-53 ceph]# ceph-fuse /mnt/cephfs/ -n client.admin
> --client-fs=cephfs01
> 2021-09-16T17:57:37.325+0000 7ffbb1858200 -1 init, newargv = 0x55d0f420e9e0
> newargc=15
> ceph-fuse[238730]: starting ceph client
> ceph-fuse[238730]: starting fuse
> [root@ip-172-31-32-53 ceph]#
> [root@ip-172-31-32-53 ceph]# df /mnt/cephfs
> Filesystem 1K-blocks Used Available Use% Mounted on
> ceph-fuse 946319360 0 946319360 0% /mnt/cephfs
>
> However, on the other two nodes within the Ceph cluster, these two methods
> to attempt mount caused this core dump. The ceph.conf and the keyrings were
> copied to /etc/ceph directory.
>
> [root@ip-172-31-42-149 share]# ceph-fuse /mnt/cephfs/ -n client.2
> --client-fs=cephfs01
> ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion
> `mutex->__data.__owner == 0' failed.
> Aborted (core dumped)
> [root@ip-172-31-42-149 share]#
> [root@ip-172-31-42-149 share]# ceph-fuse /mnt/cephfs/ -n client.admin
> --client-fs=cephfs01
> ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion
> `mutex->__data.__owner == 0' failed.
> Aborted (core dumped)

Can you verify the ceph-fuse versions were the same on both hosts? Do you have a coredump you can share?

> Kernel mounts on these two client nodes such as:
>
> # mount -t ceph ip-172-31-35-184:6789:/ /mnt/cephfs-kernel -o
> name=admin,fs=cphfs01

Do the kernel mounts succeed?
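For reference, one way to gather that information, assuming systemd-coredump is catching the aborts on these hosts (otherwise the core file lands wherever kernel.core_pattern points); the output path is only an example:

```
# Compare the installed ceph-fuse builds across the nodes:
rpm -q ceph-fuse
ceph-fuse --version

# If systemd-coredump is in use, list and export the most recent ceph-fuse core:
coredumpctl list ceph-fuse
coredumpctl dump ceph-fuse -o /tmp/ceph-fuse.core
```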
Hello,

Yes, the kernel mounts succeed.
> fs=cphfs01

is that a typo?

In any case, please turn up debugging:

> ceph config set client debug_client 20
> ceph config set client debug_ms 1
> ceph config set client debug_monc 10

and retry the ceph-fuse mounts. Please upload the logs.
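A minimal sketch of how those settings could be applied and the resulting client log collected, assuming the default log_file location (/var/log/ceph/$cluster-$name.log) and that the mount is retried with the same credentials as before:

```
# Raise client-side debug levels as requested above:
ceph config set client debug_client 20
ceph config set client debug_ms 1
ceph config set client debug_monc 10

# Retry the mount on an affected node; the extra verbosity goes to the client
# log even if the process aborts:
ceph-fuse /mnt/cephfs/ -n client.admin --client-fs=cephfs01

# The log to upload, assuming the default log_file setting (adjust if
# ceph.conf overrides it):
ls -l /var/log/ceph/ceph-client.admin.log
```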
The version of ceph-fuse is:

[root@ip-172-31-32-53 ~]# ceph-fuse --version
ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)

I installed it the same way on each of the three nodes. However, checking the version on the second (non-bootstrap) node shows the same core dump issue:

[root@ip-172-31-42-149 ~]# ceph-fuse --verison
ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.

On the 3rd node:

[ec2-user@ip-172-31-40-206 ~]$ ceph-fuse --version
ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)

The fs=cphfs01 is part of the command given. It is not a typo; it is the filesystem name.

I will attach the core file and logs.
Created attachment 1825305 [details] Bug 2005442 attachments
Created attachment 1825306 [details] core file for Bug 2005442
Created attachment 1825308 [details] ceph.log for Bug 2005442
Created attachment 1825310 [details] ceph.audit.log
Created attachment 1825311 [details] ceph-volume.log
Created attachment 1825312 [details] mon ip log
Created attachment 1825313 [details] ceph mgr log
Matt, next time please create a tarball with all the logs and post it as a single attachment; that is much easier to work with. You can also create an SOS report using the command "sos report"; that collects all of the RHEL configuration in one big file automatically.

-ben
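A possible way to do both in one go (paths are illustrative; include whichever logs actually exist on the node):

```
# Bundle the Ceph logs into a single attachment:
tar czf bz2005442-logs.tar.gz /var/log/ceph/*.log /var/log/ceph/*/*.log

# Generate an SOS report non-interactively; the archive is written under
# /var/tmp by default:
sos report --batch
```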
(In reply to mcurrier from comment #5)
> Created attachment 1825305 [details]
> Bug 2005442 attachments

Hey Matt,

Did you miss updating with the client log (with `debug client = 20`)?

In the meantime, I'll take a look at the core dump.

Cheers,
Venky
Hi Venky,

Sorry for late reply. I missed this earlier.

I looked through my notes and I see I applied this:

ceph config set client debug_client 20

I hope this helps.
Matt
(In reply to mcurrier from comment #14)
> Hi Venky,
>
> Sorry for late reply. I missed this earlier.
>
> I looked through my notes and I see I applied this:
> ceph config set client debug_client 20

I cannot find the client logs in the attachments - just the core, mgr, audit, and volume logs. I couldn't get a clean backtrace from the core. Could you please check?

>
> I hope this helps.
> Matt
Looks like an uninitialized mutex is being locked:

```
#0  0x00007f2bbcd7537f in raise () from /lib64/libc.so.6
#1  0x00007f2bbcd5fdb5 in abort () from /lib64/libc.so.6
#2  0x00007f2bbcd5fc89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x00007f2bbcd6da76 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f2bbe323b61 in pthread_mutex_lock () from /lib64/libpthread.so.0
#5  0x000055ef2578bdb7 in ?? ()
#6  0x000055ef2652a880 in ?? ()
```

I couldn't get other stack frames for some reason (I have the required packages installed though).

Matt - client logs would really help (and/or if you can provide the backtrace through gdb too, that would be great).
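In case it helps, a sketch of pulling a fuller backtrace from the core with gdb, assuming the matching debuginfo packages can be installed on the node (the core file path below is illustrative):

```
# Install debug symbols matching the installed ceph-fuse build
# (requires the debuginfo repos to be enabled):
dnf debuginfo-install ceph-fuse

# Open the core against the binary that produced it and dump every thread:
gdb --batch -ex 'set pagination off' -ex 'thread apply all bt full' \
    /usr/bin/ceph-fuse /path/to/core > ceph-fuse-backtrace.txt
```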
Hi Venky,

This appears to no longer be an issue in this RHCS V5 cluster. I can now mount the ceph-fuse mount points on the other two hosts. I think we should close this bugzilla.

[root@ip-172-31-42-149 testruns]# ll /etc/ceph
total 24
-rw-------. 1 root root 63 Sep 16 15:47 ceph.client.admin.keyring
-rw-r--r--. 1 root root 175 Sep 16 15:47 ceph.conf
-rw-r--r--. 1 root root 184 Sep 16 14:30 ceph.client.2.keyring
-rw-r--r--. 1 root root 41 Sep 16 15:59 ceph.client.2.keyring.tmp
-rw-------. 1 root root 110 Dec 2 19:03 podman-auth.json
-rw-r--r--. 1 root root 92 Sep 16 14:52 rbdmap

2021-12-03T17:02:49.004+0000 7fb9671f3200 -1 init, newargv = 0x5633521c6740 newargc=15
ceph-fuse[935873]: starting ceph client
ceph-fuse[935873]: starting fuse
[root@ip-172-31-42-149 testruns]# df
Filesystem Type Size Used Avail Use% Mounted on
devtmpfs devtmpfs 16G 0 16G 0% /dev
tmpfs tmpfs 16G 84K 16G 1% /dev/shm
tmpfs tmpfs 16G 1.6G 14G 11% /run
tmpfs tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/nvme0n1p2 xfs 10G 6.5G 3.6G 65% /
overlay overlay 10G 6.5G 3.6G 65% /var/lib/containers/storage/overlay/9d34db80512c92b9998999f25843420af062d13baa4958c1444283d5b53ae378/merged
overlay overlay 10G 6.5G 3.6G 65% /var/lib/containers/storage/overlay/d224cd3efeb6a738da7d5bc96b70a9fb98aaf6b6e0ef77b467b4e3b07a6b840a/merged
overlay overlay 10G 6.5G 3.6G 65% /var/lib/containers/storage/overlay/c653abe212907b403d124d56d2a1eb6420916cce645b6d1bdf5336bfab7b976d/merged
overlay overlay 10G 6.5G 3.6G 65% /var/lib/containers/storage/overlay/a04b90fec89ba4190749dd06517ce7ff284c2934982c52ec4dd29d68c60da754/merged
overlay overlay 10G 6.5G 3.6G 65% /var/lib/containers/storage/overlay/589f8ca22f4c66b15586afa03ee4989bbd65fa124f650b81c931759b6567319c/merged
overlay overlay 10G 6.5G 3.6G 65% /var/lib/containers/storage/overlay/203bb3a0b0ba30347593a7ccb825461cda3a16d4f42c6065a6866ed7e124e3a9/merged
tmpfs tmpfs 3.1G 0 3.1G 0% /run/user/0
172.31.32.53:6789:/ ceph 1.9T 142G 1.7T 8% /mnt/kernel-cephfs
tmpfs tmpfs 3.1G 0 3.1G 0% /run/user/1000
overlay overlay 10G 6.5G 3.6G 65% /var/lib/containers/storage/overlay/ddcc4a2c56cb6050fc9252e7c7d2841c9ccd2a6b52df94e3a83f750e1608238c/merged
ceph-fuse fuse.ceph-fuse 1.9T 142G 1.7T 8% /mnt/cephfs

2021-12-03T17:04:16.871+0000 7f6489dcf200 -1 init, newargv = 0x55f9f78c9930 newargc=15
ceph-fuse[1115149]: starting ceph client
ceph-fuse[1115149]: starting fuse
[root@ip-172-31-40-206 ~]#
[root@ip-172-31-40-206 ~]#
[root@ip-172-31-40-206 ~]# df
Filesystem Type Size Used Avail Use% Mounted on
devtmpfs devtmpfs 16G 0 16G 0% /dev
tmpfs tmpfs 16G 84K 16G 1% /dev/shm
tmpfs tmpfs 16G 1.6G 14G 11% /run
tmpfs tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/nvme0n1p2 xfs 10G 7.1G 3.0G 71% /
overlay overlay 10G 7.1G 3.0G 71% /var/lib/containers/storage/overlay/f19012f83cc96aee042f5e535ebb6acfea5abe5e7aac7e6b38af2d22d32aa283/merged
overlay overlay 10G 7.1G 3.0G 71% /var/lib/containers/storage/overlay/32a0fc30124cc92c19e8b8b8991849cd8faf373219d305def02f7d5f22f39bae/merged
overlay overlay 10G 7.1G 3.0G 71% /var/lib/containers/storage/overlay/e6cc3f32f914f06115d5129f78f1fc1f4804504da0c4886a1d37baa945102a3b/merged
overlay overlay 10G 7.1G 3.0G 71% /var/lib/containers/storage/overlay/480cdbc4eac1d9db38e5a697c21a1d6f8fdc9946c2b575896d586c48f7ecb42b/merged
overlay overlay 10G 7.1G 3.0G 71% /var/lib/containers/storage/overlay/e3d11731bde9033eaf00b7933cebfe223639345feb4365ab18e524757686a9d1/merged
overlay overlay 10G 7.1G 3.0G 71% /var/lib/containers/storage/overlay/c112930d4c94ff9a8156172dea2b0b43497856aeadbd0f55dfb70c5abf2bd539/merged
overlay overlay 10G 7.1G 3.0G 71% /var/lib/containers/storage/overlay/aa435dfbefef1fa0bb9a2deb1b126dc8b569c14dc9bc4c4df8aea7e388a3f365/merged
overlay overlay 10G 7.1G 3.0G 71% /var/lib/containers/storage/overlay/550e0dab60d22208def86704ef7ceb01eedf241c43897cf02ff0dba7235be9e1/merged
tmpfs tmpfs 3.1G 0 3.1G 0% /run/user/0
172.31.32.53:6789:/ ceph 1.9T 141G 1.7T 8% /mnt/kernel-cephfs
tmpfs tmpfs 3.1G 0 3.1G 0% /run/user/1000
overlay overlay 10G 7.1G 3.0G 71% /var/lib/containers/storage/overlay/00b0dc5a7a3b9c316bd63b4733c8f9dc0bd2bd71601ebf294e0c126bbadadc8a/merged
ceph-fuse fuse.ceph-fuse 1.9T 141G 1.7T 8% /mnt/cephfs
[root@ip-172-31-40-206 ~]#
(In reply to mcurrier from comment #17)
> Hi Venky,
>
> This appears to no longer be an issue in this RHCS V5 cluster. I can now
> mount the ceph-fuse mount points on the other two hosts. I think we should
> close this bugzilla.

ACK - please reopen if you hit it again.
This looks like it's locking an uninitialised mutex - checking.