Bug 1224921 - I/O operation hangs on an RBD image with mandatory exclusive lock enabled
Summary: I/O operation hangs on an RBD image with mandatory exclusive lock enabled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD
Version: 1.3.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 1.3.2
Assignee: Jason Dillaman
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-05-26 08:37 UTC by Tanay Ganguly
Modified: 2017-07-30 15:26 UTC
CC List: 7 users

Fixed In Version: RHEL: ceph-0.94.5-2.el7cp Ubuntu: ceph_0.94.5-2redhat1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-02-29 14:42:12 UTC
Embargoed:


Attachments


Links:
  Ceph Project Bug Tracker 12235 (status: None, last updated: Never)
  Red Hat Product Errata RHBA-2016:0313 (SHIPPED_LIVE): Red Hat Ceph Storage 1.3.2 bug fix and enhancement update, 2016-02-29 19:37:43 UTC

Description Tanay Ganguly 2015-05-26 08:37:52 UTC
Description of problem:
I/O hangs while writing to the same RBD image (mandatory exclusive-lock feature enabled) from 3 different VMs, one after the other.

Version-Release number of selected component (if applicable):
ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
librbd1-0.94.1-10.el7cp.x86_64

How reproducible:
2 out of 2 times

Steps to Reproduce:
1. Create an RBD image, take a snapshot, and clone it with --image-features 5 (exclusive lock enabled); a command sketch follows step 4 below.

rbd image 'snap1':
        size 13240 MB in 3310 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.a0f202eb141f2
        format: 2
        features: layering, exclusive
        flags: 
        parent: Tanay-RBD/Anfield@1
        overlap: 13240 MB

2. Attach the same RBD image to 3 different VMs as a secondary disk.
3. Write a small 10 MB file from each of the 3 VMs, one after the other:

while true; do dd if=10M of=/dev/vdb bs=1M count=10; echo "Sleeping now"; sleep 45; done

4. When the write completes on the 1st VM and it starts sleeping (the write took around 4-5 seconds), start the same command on the 2nd VM, and then on the 3rd VM.
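
For reference, the commands behind step 1 would look roughly like this (a sketch reconstructed from the rbd info output above; the parent image's size, its creation flags, and the clone's pool are assumptions, and feature bitmask 5 = layering (1) + exclusive-lock (4)):

# create a format-2 parent image, snapshot and protect it,
# then clone with layering + exclusive-lock enabled
rbd create --image-format 2 --size 13240 Tanay-RBD/Anfield
rbd snap create Tanay-RBD/Anfield@1
rbd snap protect Tanay-RBD/Anfield@1
rbd clone --image-features 5 Tanay-RBD/Anfield@1 Tanay-RBD/snap1
rbd info Tanay-RBD/snap1    # features should list: layering, exclusive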

Actual results:
The lock ownership changed when VM2 tried to write the file, and likewise for VM3 and VM1:

TG1     client.663435 (VM1)
TG2     client.663762 (VM2)
TEST    client.662967 (VM3)

It worked smoothly for 20-odd iterations, but after that the I/O did not continue and the lock stayed stuck with VM1 forever, so I/O stalled on all 3 VMs.

And I could see that all the clients got blacklisted:

ceph osd blacklist ls
listed 3 entries
10.12.27.45:0/1005465 2015-05-26 04:06:34.530285
10.12.27.45:0/2005465 2015-05-26 04:02:37.331390
10.12.27.45:0/1005376 2015-05-26 04:03:16.993895

Expected results:
The I/O should have continued and the lock should have kept changing hands among the 3 VMs indefinitely.

Additional info:
After a while, the blacklist shows no entries:
listed 0 entries
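
Blacklist entries age out on their own (which matches the list emptying above); if needed, an entry can also be dropped manually, e.g. (a sketch, reusing one of the addresses listed earlier):

ceph osd blacklist rm 10.12.27.45:0/1005465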

Didn't see any log messages on the MONs or OSDs.

Comment 2 Jason Dillaman 2015-05-26 13:10:56 UTC
Is this different from BZ 1223652 (besides three VMs instead of two)?

Comment 3 Tanay Ganguly 2015-05-27 09:08:30 UTC
Yes,

The writes were not parallel from all 3 VMs; they were issued one after the other.
There was a 45-second sleep after every write.

It was good for some 20-odd iterations, then the lock owner stopped changing.

Comment 4 Josh Durgin 2015-05-28 19:05:33 UTC
Since the dd didn't use oflag=direct, the writes went through the page cache in the VM, so there could have been some parallelism.
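
A variant of the reproducer loop that bypasses the guest page cache would look like this (a sketch; the 10M input file is the same one used in the original command):

while true; do dd if=10M of=/dev/vdb bs=1M count=10 oflag=direct; echo "Sleeping now"; sleep 45; done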

This bug may be different in that all three clients became blacklisted. Similar reasoning as for BZ 1223652 makes me inclined to address this in 1.3.1 or z-stream, and not block 1.3.0.

Comment 7 Ken Dreyer (Red Hat) 2015-12-10 20:45:45 UTC
Fixed upstream in v0.94.4

Comment 9 Tanay Ganguly 2016-02-05 10:17:39 UTC
Marking this BUG as Verified.

Ran the same test for 1000+ iterations; the lock owner kept changing throughout.

Comment 11 errata-xmlrpc 2016-02-29 14:42:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:0313

