Bug 2005442
| Summary: | Ceph 5.0 ceph-fuse mount attempts on client nodes core dumping | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | mcurrier |
| Component: | CephFS | Assignee: | Venky Shankar <vshankar> |
| Status: | ASSIGNED --- | QA Contact: | Hemanth Kumar <hyelloji> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 5.0 | CC: | bengland, ceph-eng-bugs, gfarnum, ngangadh, nravinas, sweil, vereddy, vshankar |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 5.3z9 | Flags: | ngangadh: needinfo? (vshankar), ngangadh: needinfo+, gfarnum: needinfo? (vshankar), gfarnum: needinfo- |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | The client process no longer causes loss of access to the Ceph File System due to incorrect lock API usage. Previously, incorrect lock API usage caused the client process to crash, resulting in loss of access to the Ceph File System. With this fix, the correct lock API is used and the Ceph File System works as expected. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 2249038 2249039 (view as bug list) | Environment: | |
| Last Closed: | 2021-12-03 17:36:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2249038, 2249039, 2347559 | | |
| Attachments: | | | |
Description mcurrier 2021-09-17 16:36:41 UTC
(In reply to mcurrier from comment #0)
> D
> Environment: 3 node AWS cluster, 3 m5.xlarge VMs, Ceph 5.0 installation
>
> This mount command worked on the admin (bootstrap) host of the cluster:
>
> [root@ip-172-31-32-53 ceph]# ceph-fuse /mnt/cephfs/ -n client.admin --client-fs=cephfs01
> 2021-09-16T17:57:37.325+0000 7ffbb1858200 -1 init, newargv = 0x55d0f420e9e0 newargc=15
> ceph-fuse[238730]: starting ceph client
> ceph-fuse[238730]: starting fuse
> [root@ip-172-31-32-53 ceph]#
> [root@ip-172-31-32-53 ceph]# df /mnt/cephfs
> Filesystem  1K-blocks  Used  Available  Use%  Mounted on
> ceph-fuse   946319360     0  946319360    0%  /mnt/cephfs
>
> However, on the other two nodes within the Ceph cluster, these two methods of attempting the mount caused this core dump. The ceph.conf and the keyrings were copied to the /etc/ceph directory.
>
> [root@ip-172-31-42-149 share]# ceph-fuse /mnt/cephfs/ -n client.2 --client-fs=cephfs01
> ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
> Aborted (core dumped)
> [root@ip-172-31-42-149 share]#
> [root@ip-172-31-42-149 share]# ceph-fuse /mnt/cephfs/ -n client.admin --client-fs=cephfs01
> ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
> Aborted (core dumped)

Can you verify the ceph-fuse versions were the same on both hosts? Do you have a coredump you can share?

> Kernel mounts on these two client nodes such as:
>
> # mount -t ceph ip-172-31-35-184:6789:/ /mnt/cephfs-kernel -o name=admin,fs=cphfs01

Do the kernel mounts succeed?

Hello,

Yes, the kernel mounts succeed.

> fs=cphfs01

Is that a typo? In any case, please turn up debugging:

    ceph config set client debug_client 20
    ceph config set client debug_ms 1
    ceph config set client debug_monc 10

and retry the ceph-fuse mounts. Please upload the logs.

The version of ceph-fuse is:

    [root@ip-172-31-32-53 ~]# ceph-fuse --version
    ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)

I installed the same way on each of the three nodes. However, checking the version on the second (non-bootstrap) node shows the same core dump issue:

    [root@ip-172-31-42-149 ~]# ceph-fuse --verison
    ceph-fuse: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.

On the 3rd node:

    [ec2-user@ip-172-31-40-206 ~]$ ceph-fuse --version
    ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)

The fs=cphfs01 is part of the command given. It is not a typo; it is the filesystem name.

I will attach the core file and logs.

Created attachment 1825305 [details] Bug 2005442 attachments

Created attachment 1825306 [details] core file for Bug 2005442

Created attachment 1825308 [details] ceph.log for Bug 2005442

Created attachment 1825310 [details] ceph.audit.log

Created attachment 1825311 [details] ceph-volume.log

Created attachment 1825312 [details] mon ip log

Created attachment 1825313 [details] ceph mgr log
Matt, next time please create a tarball with all the logs and post one attachment; that is much easier. You can also create an SOS report using the command "sos report", which collects all the RHEL config in one big file automatically.

-ben

(In reply to mcurrier from comment #5)
> Created attachment 1825305 [details]
> Bug 2005442 attachments

Hey Matt,

Did you miss updating with the client log (with `debug client = 20`)? In the meantime, I'll take a look at the core dump.

Cheers,
Venky

Hi Venky,

Sorry for the late reply. I missed this earlier.

I looked through my notes and I see I applied this:

    ceph config set client debug_client 20

I hope this helps.
Matt

(In reply to mcurrier from comment #14)
> Hi Venky,
>
> Sorry for late reply. I missed this earlier.
>
> I looked through my notes and I see I applied this:
> ceph config set client debug_client 20

I cannot find the client logs in the attachments - just the core, mgr, audit, and volume logs. I couldn't get a clean backtrace from the core. Could you please check?

> I hope this helps.
> Matt

Looks like an uninitialized mutex is being locked:

```
#0  0x00007f2bbcd7537f in raise () from /lib64/libc.so.6
#1  0x00007f2bbcd5fdb5 in abort () from /lib64/libc.so.6
#2  0x00007f2bbcd5fc89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x00007f2bbcd6da76 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f2bbe323b61 in pthread_mutex_lock () from /lib64/libpthread.so.0
#5  0x000055ef2578bdb7 in ?? ()
#6  0x000055ef2652a880 in ?? ()
```

I couldn't get the other stack frames for some reason (I have the required packages installed, though). Matt - client logs would really help (and/or if you can provide the backtrace through gdb too, that would be great).

Hi Venky,

This appears to no longer be an issue in this RHCS V5 cluster. I can now mount the ceph-fuse mount points on the other two hosts. I think we should close this bugzilla.

    [root@ip-172-31-42-149 testruns]# ll /etc/ceph
    total 24
    -rw-------. 1 root root  63 Sep 16 15:47 ceph.client.admin.keyring
    -rw-r--r--. 1 root root 175 Sep 16 15:47 ceph.conf
    -rw-r--r--. 1 root root 184 Sep 16 14:30 ceph.client.2.keyring
    -rw-r--r--. 1 root root  41 Sep 16 15:59 ceph.client.2.keyring.tmp
    -rw-------. 1 root root 110 Dec  2 19:03 podman-auth.json
    -rw-r--r--. 1 root root  92 Sep 16 14:52 rbdmap

    2021-12-03T17:02:49.004+0000 7fb9671f3200 -1 init, newargv = 0x5633521c6740 newargc=15
    ceph-fuse[935873]: starting ceph client
    ceph-fuse[935873]: starting fuse

    [root@ip-172-31-42-149 testruns]# df
    Filesystem          Type            Size  Used  Avail  Use%  Mounted on
    devtmpfs            devtmpfs         16G     0    16G    0%  /dev
    tmpfs               tmpfs            16G   84K    16G    1%  /dev/shm
    tmpfs               tmpfs            16G  1.6G    14G   11%  /run
    tmpfs               tmpfs            16G     0    16G    0%  /sys/fs/cgroup
    /dev/nvme0n1p2      xfs              10G  6.5G   3.6G   65%  /
    overlay             overlay          10G  6.5G   3.6G   65%  /var/lib/containers/storage/overlay/9d34db80512c92b9998999f25843420af062d13baa4958c1444283d5b53ae378/merged
    overlay             overlay          10G  6.5G   3.6G   65%  /var/lib/containers/storage/overlay/d224cd3efeb6a738da7d5bc96b70a9fb98aaf6b6e0ef77b467b4e3b07a6b840a/merged
    overlay             overlay          10G  6.5G   3.6G   65%  /var/lib/containers/storage/overlay/c653abe212907b403d124d56d2a1eb6420916cce645b6d1bdf5336bfab7b976d/merged
    overlay             overlay          10G  6.5G   3.6G   65%  /var/lib/containers/storage/overlay/a04b90fec89ba4190749dd06517ce7ff284c2934982c52ec4dd29d68c60da754/merged
    overlay             overlay          10G  6.5G   3.6G   65%  /var/lib/containers/storage/overlay/589f8ca22f4c66b15586afa03ee4989bbd65fa124f650b81c931759b6567319c/merged
    overlay             overlay          10G  6.5G   3.6G   65%  /var/lib/containers/storage/overlay/203bb3a0b0ba30347593a7ccb825461cda3a16d4f42c6065a6866ed7e124e3a9/merged
    tmpfs               tmpfs           3.1G     0   3.1G    0%  /run/user/0
    172.31.32.53:6789:/ ceph            1.9T  142G   1.7T    8%  /mnt/kernel-cephfs
    tmpfs               tmpfs           3.1G     0   3.1G    0%  /run/user/1000
    overlay             overlay          10G  6.5G   3.6G   65%  /var/lib/containers/storage/overlay/ddcc4a2c56cb6050fc9252e7c7d2841c9ccd2a6b52df94e3a83f750e1608238c/merged
    ceph-fuse           fuse.ceph-fuse  1.9T  142G   1.7T    8%  /mnt/cephfs

    2021-12-03T17:04:16.871+0000 7f6489dcf200 -1 init, newargv = 0x55f9f78c9930 newargc=15
    ceph-fuse[1115149]: starting ceph client
    ceph-fuse[1115149]: starting fuse

    [root@ip-172-31-40-206 ~]#
    [root@ip-172-31-40-206 ~]#
    [root@ip-172-31-40-206 ~]# df
    Filesystem          Type            Size  Used  Avail  Use%  Mounted on
    devtmpfs            devtmpfs         16G     0    16G    0%  /dev
    tmpfs               tmpfs            16G   84K    16G    1%  /dev/shm
    tmpfs               tmpfs            16G  1.6G    14G   11%  /run
    tmpfs               tmpfs            16G     0    16G    0%  /sys/fs/cgroup
    /dev/nvme0n1p2      xfs              10G  7.1G   3.0G   71%  /
    overlay             overlay          10G  7.1G   3.0G   71%  /var/lib/containers/storage/overlay/f19012f83cc96aee042f5e535ebb6acfea5abe5e7aac7e6b38af2d22d32aa283/merged
    overlay             overlay          10G  7.1G   3.0G   71%  /var/lib/containers/storage/overlay/32a0fc30124cc92c19e8b8b8991849cd8faf373219d305def02f7d5f22f39bae/merged
    overlay             overlay          10G  7.1G   3.0G   71%  /var/lib/containers/storage/overlay/e6cc3f32f914f06115d5129f78f1fc1f4804504da0c4886a1d37baa945102a3b/merged
    overlay             overlay          10G  7.1G   3.0G   71%  /var/lib/containers/storage/overlay/480cdbc4eac1d9db38e5a697c21a1d6f8fdc9946c2b575896d586c48f7ecb42b/merged
    overlay             overlay          10G  7.1G   3.0G   71%  /var/lib/containers/storage/overlay/e3d11731bde9033eaf00b7933cebfe223639345feb4365ab18e524757686a9d1/merged
    overlay             overlay          10G  7.1G   3.0G   71%  /var/lib/containers/storage/overlay/c112930d4c94ff9a8156172dea2b0b43497856aeadbd0f55dfb70c5abf2bd539/merged
    overlay             overlay          10G  7.1G   3.0G   71%  /var/lib/containers/storage/overlay/aa435dfbefef1fa0bb9a2deb1b126dc8b569c14dc9bc4c4df8aea7e388a3f365/merged
    overlay             overlay          10G  7.1G   3.0G   71%  /var/lib/containers/storage/overlay/550e0dab60d22208def86704ef7ceb01eedf241c43897cf02ff0dba7235be9e1/merged
    tmpfs               tmpfs           3.1G     0   3.1G    0%  /run/user/0
    172.31.32.53:6789:/ ceph            1.9T  141G   1.7T    8%  /mnt/kernel-cephfs
    tmpfs               tmpfs           3.1G     0   3.1G    0%  /run/user/1000
    overlay             overlay          10G  7.1G   3.0G   71%  /var/lib/containers/storage/overlay/00b0dc5a7a3b9c316bd63b4733c8f9dc0bd2bd71601ebf294e0c126bbadadc8a/merged
    ceph-fuse           fuse.ceph-fuse  1.9T  141G   1.7T    8%  /mnt/cephfs
    [root@ip-172-31-40-206 ~]#

(In reply to mcurrier from comment #17)
> Hi Venky,
>
> This appears to no longer be an issue in this RHCS V5 cluster. I can now
> mount the ceph-fuse mount points on the other two hosts. I think we should
> close this bugzilla.

ACK - please reopen if you hit it again.

This looks like it's locking an uninitialised mutex - checking.
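
To make the failure class in the backtrace above concrete, the sketch below is purely illustrative and is not the ceph-fuse code path; the `client_ctx` structure and every name in it are hypothetical. It contrasts correct lock API usage (initialize the mutex before the first lock) with locking a mutex whose memory was never initialized, which is undefined behaviour and, on glibc/NPTL, can surface as exactly the `mutex->__data.__owner == 0` assertion seen in this bug.

```
/* Illustrative only -- NOT the ceph-fuse code path.  Build: gcc -pthread demo.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container holding a lock, purely for illustration. */
struct client_ctx {
    pthread_mutex_t lock;
    int value;
};

int main(void)
{
    /* Correct lock API usage: initialize the mutex before the first lock. */
    struct client_ctx *good = malloc(sizeof(*good));
    pthread_mutex_init(&good->lock, NULL);
    pthread_mutex_lock(&good->lock);
    good->value = 42;
    pthread_mutex_unlock(&good->lock);
    pthread_mutex_destroy(&good->lock);
    free(good);
    printf("initialized mutex: locked and unlocked cleanly\n");

    /* Incorrect usage: the mutex bytes are whatever happened to be in the
     * heap block (simulated here with a memset).  Locking it is undefined
     * behaviour; with glibc/NPTL the internal __owner field can be non-zero,
     * which is what the "Assertion `mutex->__data.__owner == 0' failed"
     * message in the backtrace reports.  The outcome varies: it may abort
     * the way ceph-fuse did, return an error, deadlock, or appear to work. */
    struct client_ctx *bad = malloc(sizeof(*bad));
    memset(&bad->lock, 0x5a, sizeof(bad->lock)); /* stand-in for stale/garbage memory */
    int rc = pthread_mutex_lock(&bad->lock);     /* undefined behaviour */
    /* Only reached if the undefined lock did not abort the process. */
    printf("uninitialized mutex: pthread_mutex_lock returned %d\n", rc);
    free(bad);
    return 0;
}
```

Because the behaviour depends entirely on what the uninitialized bytes contain, the same code can crash on one host and appear to work on another, which is consistent with the mount failing on some client nodes here and not on the bootstrap host. The Doc Text above describes the fix only as switching to the correct lock API; this sketch shows the generic failure class, not the actual fix.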