Bug 1223652 - Deadlock when trying to write from 2 VMs on Mandatory Exclusive Lock enabled RBD image
Summary: Deadlock when trying to write from 2 VMs on Mandatory Exclusive Lock enabled ...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD
Version: 1.3.0
Hardware: x86_64
OS: Linux
medium
low
Target Milestone: rc
Target Release: 1.3.4
Assignee: Jason Dillaman
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On: 1305421
Blocks: 1303728
 
Reported: 2015-05-21 06:54 UTC by Tanay Ganguly
Modified: 2018-03-07 23:52 UTC (History)
CC List: 6 users

Fixed In Version: RHEL: ceph-0.94.5-2.el7cp Ubuntu: ceph_0.94.5-2redhat1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-07 23:52:26 UTC
Embargoed:


Attachments
OSD logs (2.91 MB, application/x-gzip), 2015-05-21 07:10 UTC, Tanay Ganguly
deadlock backtrace (2.70 KB, text/plain), 2015-05-22 19:30 UTC, Jason Dillaman


Links
Ceph Project Bug Tracker 11537

Description Tanay Ganguly 2015-05-21 06:54:28 UTC
Description of problem:
When parallel writes are attempted from different VMs on an RBD image with the mandatory exclusive lock feature enabled, the write operations hang.

Version-Release number of selected component (if applicable):
ceph version 0.94.1
librbd1-0.94.1-10.el7cp.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an RBD image with the --image-feature 13 option (a command sketch follows these steps).

rbd -p Tanay-RBD --image Import-clone info
rbd image 'Import-clone':
        size 10261 MB in 2566 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.6765c515f007c
        format: 2
        features: layering, exclusive, object map
        flags: object map invalid
        parent: Tanay-RBD/Import1@Import1snap
        overlap: 9911 MB

2. Attach the clone as a spare disk to two different VMs, say VM-1 and VM-2.
3. Write to the spare disk (e.g. vdd) from VM-1 and VM-2 in parallel.
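A minimal command sketch of step 1, assuming the parent and clone names shown in the rbd info output above; the size is illustrative, and --image-feature 13 = layering (1) + exclusive-lock (4) + object-map (8):

rbd create Tanay-RBD/Import1 --size 10240 --image-format 2 --image-feature 13
rbd snap create Tanay-RBD/Import1@Import1snap
rbd snap protect Tanay-RBD/Import1@Import1snap
# depending on rbd_default_features, the clone may also need --image-feature 13
rbd clone Tanay-RBD/Import1@Import1snap Tanay-RBD/Import-clone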

Actual results:
When the writes started, one of the VMs (say VM-1) acquired the lock, which is expected.
rbd -p Tanay-RBD lock list Import-clone

There is 1 exclusive lock on this image.
Locker        ID                   Address
client.471153 auto 140503524950016 10.12.27.44:0/7020710


But VM-1 is unable to complete the write operation, so the other VM (i.e. VM-2) is starved because it never gets the lock.

As a result, I/O is now hung on both VMs.

strace dd if=100-1 of=/dev/vdd bs=1M count=100


Expected results:
The IO should complete on VM-1, and it should allow VM-2 to take over the lock once VM-1 completes its write.
 

Additional info:
Copied the logs from one of the OSDs, where the client (10.12.27.44:0/7020710) establishes a new session.

Information:

Image Name: Import-clone
Pool Name: Tanay-RBD
Monitor IP: 10.12.27.26
OSD Tree: 
ceph osd tree
ID  WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY 
 -1 13.19987 root default                                     
 -2  2.00000     host OSD1                                    
  0  0.39999         osd.0       up  1.00000          1.00000 
  3  0.39999         osd.3       up  1.00000          1.00000 
  6  0.39999         osd.6       up  1.00000          1.00000 
  9  0.39999         osd.9       up  1.00000          1.00000 
 12  0.39999         osd.12      up  1.00000          1.00000 
 -3  1.99997     host OSD2                                    
  1  0.39999         osd.1       up  1.00000          1.00000 
  4  0.39999         osd.4       up  1.00000          1.00000 
  7  0.39999         osd.7       up  1.00000          1.00000 
 10  0.39999         osd.10      up  1.00000          1.00000 
 13  0.39999         osd.13      up  1.00000          1.00000 
 -4  2.00000     host OSD3                                    
  2  0.39999         osd.2       up  1.00000          1.00000 
  5  0.39999         osd.5       up  1.00000          1.00000 
  8  0.39999         osd.8       up  1.00000          1.00000 
 11  0.39999         osd.11      up  1.00000          1.00000 
 14  0.39999         osd.14      up  1.00000          1.00000 
 -5  1.20000     host ADMIN                                   
 15  0.39999         osd.15      up  1.00000          1.00000 
 16  0.39999         osd.16      up  1.00000          1.00000 
 17  0.39999         osd.17      up  1.00000          1.00000 
 -6  1.19998     host OSD5                                    
 33  0.39999         osd.33      up  1.00000          1.00000 
 34  0.39999         osd.34      up  1.00000          1.00000 
 35  0.39999         osd.35      up  1.00000          1.00000 
 -7  1.19998     host OSD4                                    
 30  0.39999         osd.30      up  1.00000          1.00000 
 31  0.39999         osd.31      up  1.00000          1.00000 
 32  0.39999         osd.32      up  1.00000          1.00000 
 -8  1.19998     host OSD6                                    
 36  0.39999         osd.36      up  1.00000          1.00000 
 37  0.39999         osd.37      up  1.00000          1.00000 
 38  0.39999         osd.38      up  1.00000          1.00000 
 -9  1.19998     host OSD7                                    
 39  0.39999         osd.39      up  1.00000          1.00000 
 40  0.39999         osd.40      up  1.00000          1.00000 
 41  0.39999         osd.41      up  1.00000          1.00000 
-10  1.19998     host OSD8                                    
 42  0.39999         osd.42      up  1.00000          1.00000 
 43  0.39999         osd.43      up  1.00000          1.00000 
 44  0.39999         osd.44      up  1.00000          1.00000 


Log Snippet:

2015-05-21 01:58:27.117395 7fde1b723700 10 osd.36 10123  new session 0x6ebf080 con=0xa442260 addr=10.12.27.44:0/7020710
2015-05-21 01:58:27.117576 7fde1b723700 10 osd.36 10123  session 0x6ebf080 client.libvirt has caps osdcap[grant(object_prefix rbd_children  class-read),grant(pool Tanay-RBD rwx)] 'allow class-read object_prefix rbd_children, allow rwx pool=Tanay-RBD'
2015-05-21 01:58:27.303751 7fde21c88700 10 osd.36 10123  new session 0x6eb9400 con=0xa446f60 addr=10.12.27.26:0/1006311
2015-05-21 01:58:27.303862 7fde21c88700 10 osd.36 10123  session 0x6eb9400 client.admin has caps osdcap[grant(*)] 'allow *'
2015-05-21 01:58:27.305213 7fde21c88700 10 osd.36 10123 class rbd method object_map_load flags=r
2015-05-21 01:58:27.305279 7fde21c88700 15 osd.36 10123 enqueue_op 0xaf99b00 prio 63 cost 18 latency 0.000519 osd_op(client.474066.0:15 rbd_object_map.66dcc2ae8944a.000000000000003f [call rbd.object_map_load] 4.1cdb4fb ack+read+known_if_redirected e10123) v5
2015-05-21 01:58:27.305416 7fde37150700 10 osd.36 10123 dequeue_op 0xaf99b00 prio 63 cost 18 latency 0.000656 osd_op(client.474066.0:15 rbd_object_map.66dcc2ae8944a.000000000000003f [call rbd.object_map_load] 4.1cdb4fb ack+read+known_if_redirected e10123) v5 pg pg[4.4fb( v 10120'3 (0'0,10120'3] local-les=10048 n=2 ec=7234 les/c 10048/10048 10047/10047/9934) [36,5,42] r=0 lpr=10047 crt=0'0 lcod 7281'2 mlcod 7281'2 active+clean]

Comment 2 Tanay Ganguly 2015-05-21 07:10:28 UTC
Created attachment 1028001 [details]
OSD  logs

Comment 3 Josh Durgin 2015-05-21 23:50:29 UTC
Writing from multiple clients isn't supposed to work, but also should not deadlock/livelock. Since this is something that shouldn't be done in the first place, I don't think this should block the release. We should fix it at a lower priority.

Comment 4 Josh Durgin 2015-05-21 23:51:51 UTC
Also, the flags show the object map became invalid somehow. That's a separate bug. Do you have the steps it took to become invalid, and can you reproduce it?

Comment 5 Tanay Ganguly 2015-05-22 09:41:50 UTC
Josh,
w.r.t comment 3.

The mandatory exclusive lock is a per-image feature intended to prevent overlapping writes to the same RBD image.

Only one client holds the lock at a time, only that client is allowed to write to the image, and any other client must request the lock from the current holder.

So why should writing to the same image from multiple clients not work?

Please correct my understanding here; I may be totally wrong.

Comment 6 Tanay Ganguly 2015-05-22 10:11:38 UTC
(In reply to Josh Durgin from comment #4)
> Also, the flags show the object map became invalid somehow. That's a
> separate bug. Do you have the steps it took to become invalid, and can you
> reproduce it?

Josh,
I have reproduced and filed a new bug for it.

https://bugzilla.redhat.com/show_bug.cgi?id=1224174

Comment 7 Jason Dillaman 2015-05-22 19:30:20 UTC
Created attachment 1028916 [details]
deadlock backtrace

Comment 8 Ken Dreyer (Red Hat) 2015-05-23 00:16:56 UTC
Federico, are you ok re-targeting this bz to 1.3.1?

Comment 9 Harish NV Rao 2015-05-25 05:04:33 UTC
Ken and Federico, before making the decision to fix in 1.3.1 can we please wait for Josh to reply to Tanay's comment 5?

Comment 10 Jason Dillaman 2015-05-26 13:07:46 UTC
w.r.t. comment 5, the use case for exclusive locking is not two VMs sharing the same disk (nor three, as another BZ tests). A VM does not expect its disk to change underneath it, and caching would prevent one VM from seeing another's changes. If you need to share a disk between two VMs for a clustered file system (GFS, OCFS), you would just use a standard (non-exclusive-lock) RBD image.

This feature was designed to support the object map feature as well as future features such as RBD mirroring. The only time the same disk should be expected to be open in two different QEMU instances at once is during live migration, when QEMU briefly has the image open on both the source and destination nodes.
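For the clustered-filesystem case, the point is simply to create the image without the exclusive-lock feature; a sketch with an illustrative name and size (feature value 1 = layering only):

rbd create Tanay-RBD/shared-disk --size 10240 --image-format 2 --image-feature 1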

Comment 11 Josh Durgin 2015-05-28 19:04:38 UTC
What Jason said addresses comment 5. To be more explicit, locking isn't meant to make it safe to use RBD from multiple clients at once. This will still fail with any local filesystem on top because of caching within the VM's kernel: reads in one VM would return cached data that may have changed, and there is no way to invalidate that VM's cache when another VM writes to the same disk.

This bug is still valid, but I wouldn't treat it as a blocker for 1.3.0, and would rather address it in z-stream or 1.3.1.

Comment 13 Ken Dreyer (Red Hat) 2015-12-10 20:43:39 UTC
Hi Jason, what timeframe are we looking at for landing this fix?

Comment 14 Jason Dillaman 2015-12-10 20:57:19 UTC
commit 704c0e09479d435ecbb84693a786c60694732b19
Author: Jason Dillaman <dillaman>
Date:   Thu May 7 12:51:49 2015 -0400

    librbd: retry lock requests periodically until acquired
    
    If the exclusive lock owner acks the lock release request but crashes
    before it actually releases the lock, the requestor will wait forever.
    Therefore, after a certain timeout, retry the request again until it
    succeeds.
    
    Fixes: #11537
    Backport: hammer
    Signed-off-by: Jason Dillaman <dillaman>
    (cherry picked from commit 37c74e6e5274208d1b6efaf315afec03ea7eaa82)

commit dbaaed9cf99121d0c97c22a695a7a6e872a11f48
Author: Jason Dillaman <dillaman>
Date:   Thu May 7 12:35:36 2015 -0400

    librbd: don't hold owner_lock for write during flush
    
    The various IO callback codepaths will attempt to take
    the lock, which will result in deadlock since the flush
    cannot complete.
    
    Backport: hammer
    Fixes: #11537
    Signed-off-by: Jason Dillaman <dillaman>
    
    (cherry picked from commit 2b6d0633d5c89de3a557cdd72621dfc19f0540de)

# git tag --contains dbaaed9cf99121d0c97c22a695a7a6e872a11f48
v0.94.4
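
For reference, the containment check shown for the second commit can be repeated for the first one, and the installed packages compared against the fixed-in version (a sketch, run in a ceph.git checkout and on the test node respectively):

git tag --contains 704c0e09479d435ecbb84693a786c60694732b19
rpm -q librbd1 ceph-common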

Comment 15 Ken Dreyer (Red Hat) 2015-12-10 21:19:09 UTC
Thanks!

Comment 17 Tanay Ganguly 2016-01-29 12:15:22 UTC
Jason,
I am still able to reproduce this Bug.

Ceph Build:
[ceph@cephqe9 ~]$ rpm -qa |grep ceph
ceph-common-0.94.5-4.el7cp.x86_64
ceph-osd-0.94.5-4.el7cp.x86_64
ceph-0.94.5-4.el7cp.x86_64
ceph-radosgw-0.94.5-4.el7cp.x86_64
ceph-selinux-0.94.5-4.el7cp.x86_64
[ceph@cephqe9 ~]$ rpm -qa |grep rados
librados2-0.94.5-4.el7cp.x86_64
ceph-radosgw-0.94.5-4.el7cp.x86_64
python-rados-0.94.5-4.el7cp.x86_64

Followed the exact steps mentioned in Comment #1

I/O is hanging on both VMs (i.e. on the spare disk to which I am writing).

Every 0.1s: sudo rbd -p Tanay-RBD lock list testingClone_new9191                                                                                                 Fri Jan 29 23:04:56 2016

There is 1 exclusive lock on this image.
Locker        ID                   Address
client.269331 auto 139931622384896 10.70.44.52:0/2135169

Lock is still being held

Comment 18 Jason Dillaman 2016-01-29 15:35:06 UTC
The lock holder is stuck waiting for the cache to flush -- but I don't see any other threads blocking that operation from completing.

#0  0x00007f445827d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f4455896753 in Wait (mutex=..., this=0x7f4433190440) at common/Cond.h:55
#2  librbd::ImageCtx::flush_cache (this=this@entry=0x7f445eaa2000) at librbd/ImageCtx.cc:675
#3  0x00007f44558b1a77 in librbd::_flush (ictx=0x7f445eaa2000) at librbd/internal.cc:3322
#4  0x00007f44558a5f5a in librbd::ImageWatcher::release_lock (this=this@entry=0x7f445e96ed80) at librbd/ImageWatcher.cc:393
#5  0x00007f44558a60be in librbd::ImageWatcher::notify_release_lock (this=0x7f445e96ed80) at librbd/ImageWatcher.cc:586
#6  0x00007f44558876ca in operator() (a0=<optimized out>, this=<optimized out>) at /usr/include/boost/function/function_template.hpp:767
#7  FunctionContext::finish (this=<optimized out>, r=<optimized out>) at include/Context.h:461
#8  0x00007f445587f689 in Context::complete (this=0x7f4461ebc060, r=<optimized out>) at include/Context.h:65
#9  0x00007f445587f689 in Context::complete (this=0x7f4461ebdb30, r=<optimized out>) at include/Context.h:65
#10 0x00007f44558f99c8 in Finisher::finisher_thread_entry (this=0x7f445e9b77c0) at common/Finisher.cc:59
#11 0x00007f4458279dc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f44508f321d in clone () from /lib64/libc.so.6

Can you please re-run the test under the following conditions:

1) Set SELinux to permissive mode -- your QEMU processes aren't creating the RBD log.
2) Add "admin socket = /var/log/libvirt/qemu/rbd-$pid.$cctid.asok" to your ceph.conf.

Comment 19 Ken Dreyer (Red Hat) 2016-01-29 18:47:38 UTC
Jason pointed out to me that ceph.conf has the following setting under [global]:

log file = /var/log/libvirt/qemu/rbd-$pid.$cctid.log

Since this is under [global], this causes all Ceph logs to go to this file, not just RBD logs, but OSD logs as well. And SELinux is denying the OSD from writing its logs there.

Please change the following:

1) Move this "log file" setting to a [client] setting, so that only the librbd client will write its logs to /var/log/libvirt/qemu.

2) Please create [osd/mon] sections for the settings that should apply only to Mons or only to OSDs.

It is ok to test with SELinux in permissive mode to see if the problem vanishes, but ideally we would make this work in enforcing mode.
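
For example, a ceph.conf layout along these lines (a sketch; only the log file and admin socket lines come from this bug, the rest are placeholders):

[global]
        # cluster-wide settings only; no client log file here

[client]
        log file = /var/log/libvirt/qemu/rbd-$pid.$cctid.log
        admin socket = /var/log/libvirt/qemu/rbd-$pid.$cctid.asok

[osd]
        # settings that should apply only to OSDs

[mon]
        # settings that should apply only to monitors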

Comment 20 Jason Dillaman 2016-01-29 19:14:56 UTC
Disregard the SELinux comment; the qemu user did not have permission to write to /var/lib/libvirt/qemu.

Comment 21 Tanay Ganguly 2016-02-02 06:17:34 UTC
Marking this as Failed, as the same steps now cause Bug 1303728.
After discussion with Jason, the new crash/hang turned out to be different from this one, so I filed a new defect.

Comment 23 Ken Dreyer (Red Hat) 2016-02-12 22:05:02 UTC
We need clear downstream documentation for setting up librbd with qemu and libvirt, doing logging, etc. Without clear instructions, QE cannot conclusively verify these librbd bugs that require configuring the logs and admin socket.

Setting this bug to depend on the docs one (bug 1305421)

