1325932 – Seeing a Crash while writing and re-sizing on a RBD Image in parallel

Summary: Seeing a Crash while writing and re-sizing on a RBD Image in parallel

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	RBD
Sub Component:
Version:	2.0
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	rc
Target Release:	2.0
Assignee:	Jason Dillaman
QA Contact:	ceph-qe-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-04-11 12:51 UTC by Tanay Ganguly
Modified:	2017-07-30 15:36 UTC
CC List:	6 users
Fixed In Version:	ceph-10.2.0-1.el7cp
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-08-23 19:35:36 UTC
Embargoed:
Dependent Products:

Attachments
Resize Script (523 bytes, text/x-python) 2016-04-11 12:51 UTC, Tanay Ganguly	no flags
BT (5.88 KB, text/plain) 2016-04-11 12:52 UTC, Tanay Ganguly	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Ceph Project Bug Tracker	15456	0	None	None	None	2016-04-11 13:14:40 UTC
Red Hat Product Errata	RHBA-2016:1755	0	normal	SHIPPED_LIVE	Red Hat Ceph Storage 2.0 bug fix and enhancement update	2016-08-23 23:23:52 UTC

Description Tanay Ganguly 2016-04-11 12:51:17 UTC

Created attachment 1145988
Resize Script

Description of problem:
Resizing an image while writing to it (using rbd bench-write) causes rbd to crash.

Version-Release number of selected component (if applicable):
ceph-release-1-1.el7.noarch
ceph-selinux-10.1.0-1.el7cp.x86_64
ceph-osd-10.1.0-1.el7cp.x86_64
libcephfs1-10.1.0-1.el7cp.x86_64
ceph-base-10.1.0-1.el7cp.x86_64
ceph-10.1.0-1.el7cp.x86_64
python-cephfs-10.1.0-1.el7cp.x86_64
ceph-mds-10.1.0-1.el7cp.x86_64
ceph-common-10.1.0-1.el7cp.x86_64
ceph-mon-10.1.0-1.el7cp.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Create an Image
rbd create Tanay-RBD/BIG_Image1 --size 2048000 --image-format 2 --image-feature layering --image-feature exclusive-lock --image-feature object-map --image-feature fast-diff --image-feature deep-flatten
2. Take a snapshot
rbd snap create Tanay-RBD/BIG_Image1@snap
3. Protect the snapshot
rbd snap protect Tanay-RBD/BIG_Image1@snap
4. Create a clone
rbd clone Tanay-RBD/BIG_Image1@snap Tanay-RBD/BUG-CLONE --image-feature layering --image-feature exclusive-lock --image-feature object-map --image-feature fast-diff --image-feature deep-flatten
5. Start rbd bench-write
rbd bench-write -p Tanay-RBD --image BUG-CLONE --io-size 10240 --io-pattern rand
6. While the write is in progress, run the attached resize script
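
The attached resize script is not reproduced in this report. As a rough sketch only, a script for step 6 might build a series of `rbd resize` invocations against the clone from step 5. The sizes, function names, and the dry-run behaviour below are assumptions for illustration, not the contents of the attachment.

```python
import subprocess

# Image spec taken from step 5 of this report.
IMAGE_SPEC = "Tanay-RBD/BUG-CLONE"


def resize_command(image_spec, size_mb):
    """Build one 'rbd resize' command line (size in MB).

    --allow-shrink is included so the same command works for
    both growing and shrinking the image.
    """
    return ["rbd", "resize", "--size", str(size_mb), "--allow-shrink", image_spec]


def resize_plan(sizes_mb, image_spec=IMAGE_SPEC):
    """Return the list of commands for a sequence of target sizes."""
    return [resize_command(image_spec, size) for size in sizes_mb]


def run_plan(plan, runner=subprocess.call):
    """Run each command in order and collect the exit codes."""
    return [runner(cmd) for cmd in plan]


if __name__ == "__main__":
    # Hypothetical sizes: shrink twice, then grow back to the original size.
    plan = resize_plan([1024000, 512000, 2048000])
    # Dry run: print the commands instead of executing them.
    for cmd in plan:
        print(" ".join(cmd))
```

Passing the plan to `run_plan` would execute the commands on a host with the rbd CLI and access to the cluster.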

Actual results:
Initially, step 5 acquires the lock and step 6 waits for the lock to be released, as shown below:

2016-04-11 22:37:43.050150 7f6df3fff700 -1 librbd::image_watcher::NotifyLockOwner: 0x7f6dec0020c0 handle_notify: no lock owners detected
2016-04-11 22:37:48.052145 7f6df3fff700 -1 librbd::image_watcher::NotifyLockOwner: 0x7f6dec0020c0 handle_notify: no lock owners detected

After a while, rbd bench-write crashed, the lock was released to the resize operation, and the resize completed successfully.


Expected results:
There should not be any crash.

Additional info:
Crash dump: see the attached backtrace (BT).

Comment 2 Tanay Ganguly 2016-04-11 12:52:49 UTC

Created attachment 1145989
BT

Comment 3 Jason Dillaman 2016-04-11 13:11:08 UTC

This is a crash within 'rbd bench-write' because you shrunk the image below its in-flight write extent.  The write failed (correctly) because it was out-of-bounds but the CLI has an expectation that the write won't fail.  We can remove the assert and just have bench-write exit with a failure code, but in general it doesn't make sense to shrink images with live IO within the deleted region.

Comment 4 Jason Dillaman 2016-04-13 01:18:03 UTC

Fix is to gracefully stop the "rbd bench-write" operation when an IO error is encountered (e.g. the write was out-of-bounds).

Upstream PR: https://github.com/ceph/ceph/pull/8565
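
To illustrate the behaviour described in comments 3 and 4, here is a minimal in-memory sketch, not the actual rbd or librbd code. `FakeImage`, `OutOfBoundsError`, and `bench_write` are hypothetical names. The point is only that a write past the end of a shrunken image fails, and the benchmark loop stops and reports a failure code rather than asserting that every write succeeds.

```python
class OutOfBoundsError(IOError):
    """Raised when a write extends past the current end of the image."""


class FakeImage:
    """In-memory stand-in for an image that only tracks its size."""

    def __init__(self, size):
        self.size = size

    def resize(self, new_size):
        self.size = new_size

    def write(self, offset, length):
        # A write is rejected if any part of it lies beyond the image size.
        if offset + length > self.size:
            raise OutOfBoundsError(
                "write at offset %d, length %d exceeds size %d"
                % (offset, length, self.size)
            )
        return length


def bench_write(image, offsets, io_size):
    """Issue one write per offset; stop at the first failed write.

    Returns (exit_code, bytes_written): exit_code is 0 if every write
    succeeded and 1 if a write failed.
    """
    written = 0
    for offset in offsets:
        try:
            written += image.write(offset, io_size)
        except IOError as err:
            print("write failed: %s" % err)
            return 1, written
    return 0, written


if __name__ == "__main__":
    image = FakeImage(size=100)
    print(bench_write(image, [0, 10, 20], io_size=10))
    # Shrink the image so that later offsets fall outside it.
    image.resize(15)
    print(bench_write(image, [0, 10, 20], io_size=10))
```

In this sketch the second run stops at the write that crosses the new image boundary and returns a non-zero exit code, which mirrors the graceful stop described for the fix.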

Comment 5 Ken Dreyer (Red Hat) 2016-04-26 20:50:44 UTC

The above PR is present in v10.2.0.

Comment 7 Tanay Ganguly 2016-05-31 04:08:09 UTC

Marking it Verified.

ceph version 10.2.1-6.el7cp

Comment 9 errata-xmlrpc 2016-08-23 19:35:36 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html