Created attachment 1145988
Description of problem:
Resizing an image while writing to it (using rbd bench-write) causes rbd to crash.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Create an Image
rbd create Tanay-RBD/BIG_Image1 --size 2048000 --image-format 2 --image-feature layering --image-feature exclusive-lock --image-feature object-map --image-feature fast-diff --image-feature deep-flatten
2. Create a snapshot
rbd snap create Tanay-RBD/BIG_Image1@snap
3. Protect It
rbd snap protect Tanay-RBD/BIG_Image1@snap
4. Create Clone
rbd clone Tanay-RBD/BIG_Image1@snap Tanay-RBD/BUG-CLONE --image-feature layering --image-feature exclusive-lock --image-feature object-map --image-feature fast-diff --image-feature deep-flatten
5. Start rbd bench write.
rbd bench-write -p Tanay-RBD --image BUG-CLONE --io-size 10240 --io-pattern rand
6. While the write is running, run the resize script (attached).
Initially, step 5 acquires the lock and step 6 waits for the lock to be released, as shown below:
2016-04-11 22:37:43.050150 7f6df3fff700 -1 librbd::image_watcher::NotifyLockOwner: 0x7f6dec0020c0 handle_notify: no lock owners detected
2016-04-11 22:37:48.052145 7f6df3fff700 -1 librbd::image_watcher::NotifyLockOwner: 0x7f6dec0020c0 handle_notify: no lock owners detected
After a while, however, rbd bench-write crashed, the lock was released to the resize, and the resize completed successfully.
There should not be any crash.
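The attached resize script is not reproduced in this report, so the following is only a rough sketch of the whole reproduction flow under some assumptions. The `resize_loop` helper, its start size, step, and floor values, and the `RBD` dry-run variable are all hypothetical; the real script may resize differently. `--allow-shrink` is the option `rbd resize` requires before it will shrink an image.

```shell
#!/bin/sh
# Sketch only. RBD defaults to "echo rbd", so the commands are printed rather
# than executed; set RBD=rbd to run them against a real cluster.
RBD=${RBD:-echo rbd}
POOL=Tanay-RBD
FEATURES="--image-feature layering --image-feature exclusive-lock --image-feature object-map --image-feature fast-diff --image-feature deep-flatten"

# Hypothetical stand-in for the attached resize script: shrink the clone
# step by step while bench-write is still issuing I/O.
resize_loop() {
    size=$1     # starting size in MB (assumed value, not from the attachment)
    step=$2     # amount to shrink per iteration, in MB
    floor=$3    # stop once the size would drop below this value
    while [ "$size" -ge "$floor" ]; do
        $RBD resize --size "$size" --allow-shrink "$POOL/BUG-CLONE"
        size=$((size - step))
    done
}

# Steps 1 to 6 from the report, with the write running in the background
# while the resize loop shrinks the image underneath it.
reproduce() {
    $RBD create "$POOL/BIG_Image1" --size 2048000 --image-format 2 $FEATURES
    $RBD snap create "$POOL/BIG_Image1@snap"
    $RBD snap protect "$POOL/BIG_Image1@snap"
    $RBD clone "$POOL/BIG_Image1@snap" "$POOL/BUG-CLONE" $FEATURES
    $RBD bench-write -p "$POOL" --image BUG-CLONE --io-size 10240 --io-pattern rand &
    resize_loop 1024000 256000 256000
    wait
}
```

Calling `reproduce` with the default `RBD` only prints the command sequence; running it with `RBD=rbd` on a build without the fix would be expected to hit the crash once a shrink cuts below an in-flight write.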
Created attachment 1145989
This is a crash within 'rbd bench-write' because you shrunk the image below its in-flight write extent. The write failed (correctly) because it was out-of-bounds but the CLI has an expectation that the write won't fail. We can remove the assert and just have bench-write exit with a failure code, but in general it doesn't make sense to shrink images with live IO within the deleted region.
Fix is to gracefully stop the "rbd bench-write" operation when an IO error is encountered (e.g. the write was out-of-bounds).
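To illustrate the intended behaviour, here is a toy model in shell. It is not the librbd or rbd CLI code from the PR; `write_at` and `bench_write` are hypothetical names, and sizes are plain integers. It only shows the idea: a write whose extent ends past the current image size fails, and the benchmark loop reports that and returns a failure code instead of asserting.

```shell
#!/bin/sh
# Toy model of the fixed behaviour; not actual rbd code.

# Succeeds only if the extent [offset, offset + len) fits inside the image.
write_at() {
    w_off=$1; w_len=$2; w_size=$3
    [ $((w_off + w_len)) -le "$w_size" ]
}

# Issues one write per offset argument. On the first failed write it prints
# a message and returns 1, rather than aborting the whole process.
bench_write() {
    image_size=$1; io_size=$2; shift 2
    done_ios=0
    for offset in "$@"; do
        if ! write_at "$offset" "$io_size" "$image_size"; then
            echo "write failed at offset $offset: out of bounds" >&2
            echo "completed $done_ios ios"
            return 1
        fi
        done_ios=$((done_ios + 1))
    done
    echo "completed $done_ios ios"
    return 0
}
```

For example, `bench_write 100 10 0 50 95 20` stops at offset 95, because 95 + 10 exceeds the image size of 100, and returns status 1 after two completed writes.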
Upstream PR: https://github.com/ceph/ceph/pull/8565
The above PR is present in v10.2.0.
Marking it Verified.
ceph version 10.2.1-6.el7cp
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.