Description of problem:
When performing parallel writes from different VMs to an RBD image with the mandatory (exclusive) lock feature enabled, the write operations hang.

Version-Release number of selected component (if applicable):
ceph version 0.94.1
librbd1-0.94.1-10.el7cp.x86_64

How reproducible: 100%

Steps to Reproduce:
1. Create an RBD image (here, a clone) with the --image-feature 13 option (layering + exclusive-lock + object-map):

rbd -p Tanay-RBD --image Import-clone info
rbd image 'Import-clone':
        size 10261 MB in 2566 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.6765c515f007c
        format: 2
        features: layering, exclusive, object map
        flags: object map invalid
        parent: Tanay-RBD/Import1@Import1snap
        overlap: 9911 MB

2. Attach the clone as a spare disk to 2 different VMs, say VM-1 and VM-2.
3. Write to the spare disk (e.g. vdd on VM-1 and VM-2) in parallel at the same time.

Actual results:
When the writes start, one of the VMs acquires the lock (say VM-1), which is expected:

rbd -p Tanay-RBD lock list Import-clone
There is 1 exclusive lock on this image.
Locker         ID                    Address
client.471153  auto 140503524950016  10.12.27.44:0/7020710

But it is unable to complete the write operation, so the other VM (i.e. VM-2) starves because it never gets the lock. As a result, I/O is now hung on both VMs:

strace dd if=100-1 of=/dev/vdd bs=1M count=100

Expected results:
The I/O should complete on VM-1, and VM-2 should then be allowed to take over the lock once VM-1 finishes its write.

Additional info:
Copied below are the logs of one of the OSDs where the client (10.12.27.44:0/7020710) establishes a new session.

Information:
Image Name: Import-clone
Pool Name: Tanay-RBD
Monitor IP: 10.12.27.26

OSD Tree:
ceph osd tree
ID  WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1  13.19987 root default
-2   2.00000     host OSD1
 0   0.39999         osd.0        up  1.00000          1.00000
 3   0.39999         osd.3        up  1.00000          1.00000
 6   0.39999         osd.6        up  1.00000          1.00000
 9   0.39999         osd.9        up  1.00000          1.00000
12   0.39999         osd.12       up  1.00000          1.00000
-3   1.99997     host OSD2
 1   0.39999         osd.1        up  1.00000          1.00000
 4   0.39999         osd.4        up  1.00000          1.00000
 7   0.39999         osd.7        up  1.00000          1.00000
10   0.39999         osd.10       up  1.00000          1.00000
13   0.39999         osd.13       up  1.00000          1.00000
-4   2.00000     host OSD3
 2   0.39999         osd.2        up  1.00000          1.00000
 5   0.39999         osd.5        up  1.00000          1.00000
 8   0.39999         osd.8        up  1.00000          1.00000
11   0.39999         osd.11       up  1.00000          1.00000
14   0.39999         osd.14       up  1.00000          1.00000
-5   1.20000     host ADMIN
15   0.39999         osd.15       up  1.00000          1.00000
16   0.39999         osd.16       up  1.00000          1.00000
17   0.39999         osd.17       up  1.00000          1.00000
-6   1.19998     host OSD5
33   0.39999         osd.33       up  1.00000          1.00000
34   0.39999         osd.34       up  1.00000          1.00000
35   0.39999         osd.35       up  1.00000          1.00000
-7   1.19998     host OSD4
30   0.39999         osd.30       up  1.00000          1.00000
31   0.39999         osd.31       up  1.00000          1.00000
32   0.39999         osd.32       up  1.00000          1.00000
-8   1.19998     host OSD6
36   0.39999         osd.36       up  1.00000          1.00000
37   0.39999         osd.37       up  1.00000          1.00000
38   0.39999         osd.38       up  1.00000          1.00000
-9   1.19998     host OSD7
39   0.39999         osd.39       up  1.00000          1.00000
40   0.39999         osd.40       up  1.00000          1.00000
41   0.39999         osd.41       up  1.00000          1.00000
-10  1.19998     host OSD8
42   0.39999         osd.42       up  1.00000          1.00000
43   0.39999         osd.43       up  1.00000          1.00000
44   0.39999         osd.44       up  1.00000          1.00000

Log Snippet:
2015-05-21 01:58:27.117395 7fde1b723700 10 osd.36 10123 new session 0x6ebf080 con=0xa442260 addr=10.12.27.44:0/7020710
2015-05-21 01:58:27.117576 7fde1b723700 10 osd.36 10123 session 0x6ebf080 client.libvirt has caps osdcap[grant(object_prefix rbd_children class-read),grant(pool Tanay-RBD rwx)] 'allow class-read object_prefix rbd_children, allow rwx pool=Tanay-RBD'
2015-05-21 01:58:27.303751 7fde21c88700 10 osd.36 10123 new session 0x6eb9400 con=0xa446f60 addr=10.12.27.26:0/1006311
2015-05-21 01:58:27.303862 7fde21c88700 10 osd.36 10123 session 0x6eb9400 client.admin has caps osdcap[grant(*)] 'allow *'
2015-05-21 01:58:27.305213 7fde21c88700 10 osd.36 10123 class rbd method object_map_load flags=r
2015-05-21 01:58:27.305279 7fde21c88700 15 osd.36 10123 enqueue_op 0xaf99b00 prio 63 cost 18 latency 0.000519 osd_op(client.474066.0:15 rbd_object_map.66dcc2ae8944a.000000000000003f [call rbd.object_map_load] 4.1cdb4fb ack+read+known_if_redirected e10123) v5
2015-05-21 01:58:27.305416 7fde37150700 10 osd.36 10123 dequeue_op 0xaf99b00 prio 63 cost 18 latency 0.000656 osd_op(client.474066.0:15 rbd_object_map.66dcc2ae8944a.000000000000003f [call rbd.object_map_load] 4.1cdb4fb ack+read+known_if_redirected e10123) v5 pg pg[4.4fb( v 10120'3 (0'0,10120'3] local-les=10048 n=2 ec=7234 les/c 10048/10048 10047/10047/9934) [36,5,42] r=0 lpr=10047 crt=0'0 lcod 7281'2 mlcod 7281'2 active+clean]
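For anyone re-running this, a minimal sketch of the client-side commands, assuming the pool/image/snapshot names shown above; the libvirt domain names (VM-1, VM-2) and the rbd-vdd.xml disk definition are placeholders, and the numeric --image-feature form simply mirrors the reproduction steps:

# Clone the parent snapshot with layering + exclusive-lock + object-map (bits 1+4+8 = 13)
rbd -p Tanay-RBD snap create Import1@Import1snap
rbd -p Tanay-RBD snap protect Import1@Import1snap
rbd clone --image-feature 13 Tanay-RBD/Import1@Import1snap Tanay-RBD/Import-clone
rbd -p Tanay-RBD --image Import-clone info

# Attach the same clone to both guests; rbd-vdd.xml is a hypothetical
# <disk type='network'> definition pointing at Tanay-RBD/Import-clone as target vdd
virsh attach-device VM-1 rbd-vdd.xml --live
virsh attach-device VM-2 rbd-vdd.xml --live

# Inside each guest, start the writes at roughly the same time
dd if=/dev/zero of=/dev/vdd bs=1M count=100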
Created attachment 1028001 [details] OSD logs
Writing from multiple clients isn't supposed to work, but also should not deadlock/livelock. Since this is something that shouldn't be done in the first place, I don't think this should block the release. We should fix it at a lower priority.
Also, the flags show the object map became invalid somehow. That's a separate bug. Do you have the steps it took to become invalid, and can you reproduce it?
Josh, w.r.t. comment 3: as I understand it, the mandatory exclusive lock is per-image and is meant to prevent overlapping writes to the same RBD image via a locking mechanism. Only one client holds the lock at a time, only that client is allowed to write to the image, and any other client requests the lock from the current lock holder. So why shouldn't writing from multiple clients to the same image work? Please correct my understanding here; I may be totally wrong.
(In reply to Josh Durgin from comment #4)
> Also, the flags show the object map became invalid somehow. That's a
> separate bug. Do you have the steps it took to become invalid, and can you
> reproduce it?

Josh, I have reproduced it and filed a new bug for it: https://bugzilla.redhat.com/show_bug.cgi?id=1224174
Created attachment 1028916 [details] deadlock backtrace
Federico, are you ok re-targeting this bz to 1.3.1?
Ken and Federico, before making the decision to fix this in 1.3.1, can we please wait for Josh to reply to Tanay's comment 5?
w.r.t. comment 5, the use case for exclusive locking is not for two VMs sharing the same disk (nor three as another BZ tests). The VMs won't expect the disks to be changing underneath them as caching would prevent one VM from seeing the changes of another. If you needed to share a disk between two VMs for the purposes of using a clustered file system (GFS, OCFS), you would just use a standard (non-exclusive) RBD image. This feature was designed to support the object map feature as well as future features such as RBD mirroring. The only time one should expect the same disk to be opened by two different QEMU instances at the same time is in the live-migration case where QEMU temporarily has the image open on the source and destination nodes at the same time for a short period of time.
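To make the shared-disk alternative concrete, here is a minimal sketch of creating a standard image without the exclusive-lock feature; the pool name matches this BZ, the image name and size are hypothetical, and the numeric feature form mirrors the reproduction steps:

# Feature bit 1 = layering only; no exclusive lock, so both guests can write,
# with coordination left to the clustered file system (GFS2/OCFS2) on top
rbd create --size 10240 --image-feature 1 Tanay-RBD/shared-disk
rbd -p Tanay-RBD --image shared-disk info   # 'features:' should list only layering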
What Jason said addresses comment 5. To be more explicit, locking isn't meant to make it safe to use RBD from multiple clients at once. That will still fail with any local filesystem on top due to caching within the VM's kernel: reads in one VM would return cached data that may have changed, and there's no way to invalidate that VM's cache when another VM writes to the same disk. This bug is still valid, but I wouldn't treat it as a blocker for 1.3.0; I'd rather address it in z-stream or 1.3.1.
Hi Jason, what timeframe are we looking at for landing this fix?
commit 704c0e09479d435ecbb84693a786c60694732b19
Author: Jason Dillaman <dillaman>
Date:   Thu May 7 12:51:49 2015 -0400

    librbd: retry lock requests periodically until acquired

    If the exclusive lock owner acks the lock release request but crashes
    before it actually releases the lock, the requestor will wait forever.
    Therefore, after a certain timeout, retry the request again until it
    succeeds.

    Fixes: #11537
    Backport: hammer
    Signed-off-by: Jason Dillaman <dillaman>
    (cherry picked from commit 37c74e6e5274208d1b6efaf315afec03ea7eaa82)

commit dbaaed9cf99121d0c97c22a695a7a6e872a11f48
Author: Jason Dillaman <dillaman>
Date:   Thu May 7 12:35:36 2015 -0400

    librbd: don't hold owner_lock for write during flush

    The various IO callback codepaths will attempt to take the lock, which
    will result in deadlock since the flush cannot complete.

    Backport: hammer
    Fixes: #11537
    Signed-off-by: Jason Dillaman <dillaman>
    (cherry picked from commit 2b6d0633d5c89de3a557cdd72621dfc19f0540de)

# git tag --contains dbaaed9cf99121d0c97c22a695a7a6e872a11f48
v0.94.4
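A quick, non-authoritative way to check whether a given downstream librbd build already carries these backports is to search the package changelog for the upstream tracker ID (assuming the packaging preserves it, which may not always be the case):

rpm -q librbd1
rpm -q --changelog librbd1 | grep -i 11537   # upstream tracker ID from the commits above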
Thanks!
Jason, I am still able to reproduce this bug.

Ceph build:
[ceph@cephqe9 ~]$ rpm -qa |grep ceph
ceph-common-0.94.5-4.el7cp.x86_64
ceph-osd-0.94.5-4.el7cp.x86_64
ceph-0.94.5-4.el7cp.x86_64
ceph-radosgw-0.94.5-4.el7cp.x86_64
ceph-selinux-0.94.5-4.el7cp.x86_64
[ceph@cephqe9 ~]$ rpm -qa |grep rados
librados2-0.94.5-4.el7cp.x86_64
ceph-radosgw-0.94.5-4.el7cp.x86_64
python-rados-0.94.5-4.el7cp.x86_64

I followed the exact steps mentioned in comment #1. I/O is halted on both VMs (i.e. on the spare disk to which I am writing).

Every 0.1s: sudo rbd -p Tanay-RBD lock list testingClone_new9191    Fri Jan 29 23:04:56 2016

There is 1 exclusive lock on this image.
Locker         ID                    Address
client.269331  auto 139931622384896  10.70.44.52:0/2135169

The lock is still being held.
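For the record, this is roughly how the state was watched during the retest (a sketch; the image name matches this comment, and the in-guest ps check is just a generic way to confirm the dd writer is stuck in uninterruptible sleep):

# On the client host: poll the lock holder every 0.1s
watch -n 0.1 "sudo rbd -p Tanay-RBD lock list testingClone_new9191"

# Inside each guest: a hung writer typically shows state 'D' (uninterruptible sleep)
ps -eo pid,stat,wchan:32,cmd | grep '[d]d if='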
The lock holder is stuck waiting for the cache to flush -- but I don't see any other threads blocking that operation from completing.

#0  0x00007f445827d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f4455896753 in Wait (mutex=..., this=0x7f4433190440) at common/Cond.h:55
#2  librbd::ImageCtx::flush_cache (this=this@entry=0x7f445eaa2000) at librbd/ImageCtx.cc:675
#3  0x00007f44558b1a77 in librbd::_flush (ictx=0x7f445eaa2000) at librbd/internal.cc:3322
#4  0x00007f44558a5f5a in librbd::ImageWatcher::release_lock (this=this@entry=0x7f445e96ed80) at librbd/ImageWatcher.cc:393
#5  0x00007f44558a60be in librbd::ImageWatcher::notify_release_lock (this=0x7f445e96ed80) at librbd/ImageWatcher.cc:586
#6  0x00007f44558876ca in operator() (a0=<optimized out>, this=<optimized out>) at /usr/include/boost/function/function_template.hpp:767
#7  FunctionContext::finish (this=<optimized out>, r=<optimized out>) at include/Context.h:461
#8  0x00007f445587f689 in Context::complete (this=0x7f4461ebc060, r=<optimized out>) at include/Context.h:65
#9  0x00007f445587f689 in Context::complete (this=0x7f4461ebdb30, r=<optimized out>) at include/Context.h:65
#10 0x00007f44558f99c8 in Finisher::finisher_thread_entry (this=0x7f445e9b77c0) at common/Finisher.cc:59
#11 0x00007f4458279dc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f44508f321d in clone () from /lib64/libc.so.6

Can you please re-run the test under the following conditions:
1) Set SELinux to permissive mode -- your QEMU processes aren't creating the RBD log.
2) Add "admin socket = /var/log/libvirt/qemu/rbd-$pid.$cctid.asok" to your ceph.conf.
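A sketch of the requested setup on the hypervisor, assuming /etc/ceph/ceph.conf is the client-side configuration and that the guests are restarted afterwards so librbd picks up the change; the <pid>/<cctid> placeholders are expanded by Ceph at runtime:

# 1) Put SELinux into permissive mode for the duration of the test
setenforce 0

# 2) Add the admin socket setting to ceph.conf (see the following comment about sections):
#        admin socket = /var/log/libvirt/qemu/rbd-$pid.$cctid.asok

# Once a guest is running again, the socket can be queried, e.g. to list in-flight OSD requests
ceph --admin-daemon /var/log/libvirt/qemu/rbd-<pid>.<cctid>.asok help
ceph --admin-daemon /var/log/libvirt/qemu/rbd-<pid>.<cctid>.asok objecter_requests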
Jason pointed out to me that ceph.conf has the following setting under [global]:

log file = /var/log/libvirt/qemu/rbd-$pid.$cctid.log

Since this is under [global], it causes all Ceph logs to go to this location -- not just RBD logs, but OSD logs as well -- and SELinux is denying the OSDs from writing their logs there.

Please change the following:
1) Move this "log file" setting into a [client] section, so that only the librbd client writes its logs to /var/log/libvirt/qemu.
2) Create [osd]/[mon] sections for the settings that should apply only to OSDs or only to monitors.

It is OK to test with SELinux in permissive mode to see if the problem vanishes, but ideally we would make this work in enforcing mode.
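A sketch of how the resulting ceph.conf on the hypervisor might look after the change (only the "log file" line comes from this report; the [client] placement, the admin socket line requested earlier, and the [osd]/[mon] stubs follow the suggestions above):

cat /etc/ceph/ceph.conf     # relevant excerpt
[global]
        # cluster-wide settings only; no log/admin-socket overrides here
[client]
        log file = /var/log/libvirt/qemu/rbd-$pid.$cctid.log
        admin socket = /var/log/libvirt/qemu/rbd-$pid.$cctid.asok
[osd]
        # settings that should apply only to OSDs
[mon]
        # settings that should apply only to monitors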
Disregard the SELinux comment; the qemu user did not have permission to write to /var/lib/libvirt/qemu.
Marking this as Failed, as the same steps are now causing Bug 1303728. After discussion with Jason, the new crash/hang is different from this one, so I filed a new defect.
We need clear downstream documentation for setting up librbd with QEMU and libvirt, configuring logging, etc. Without clear instructions, QE cannot conclusively verify these librbd bugs that require configuring the logs and the admin socket. Setting this bug to depend on the documentation bug (bug 1305421).