Bug 1340998

Summary: Seeing a backtrace while writing to and resizing an RBD image in parallel, with journaling enabled
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Tanay Ganguly <tganguly>
Component: RBD
Assignee: Jason Dillaman <jdillama>
Status: CLOSED ERRATA
QA Contact: Tanay Ganguly <tganguly>
Severity: high
Priority: unspecified
Version: 2.0
CC: ceph-eng-bugs, ceph-qe-bugs, flucifre, hnallurv, kdreyer, kurs
Target Milestone: rc
Target Release: 2.0
Hardware: x86_64
OS: Linux
Fixed In Version: RHEL: ceph-10.2.2-1.el7cp Ubuntu: ceph_10.2.2-3redhat1xenial
Last Closed: 2016-08-23 19:40:16 UTC
Type: Bug
Attachments:
Description     Flags
RBD Log         none
Resize Script   none

Description Tanay Ganguly 2016-05-31 05:11:28 UTC
Created attachment 1163026: RBD Log

Description of problem:
While reproducing BZ 1325932 (https://bugzilla.redhat.com/show_bug.cgi?id=1325932),
I hit a crash again, this time with journaling enabled on the image.

Version-Release number of selected component (if applicable):
ceph version 10.2.1-6.el7cp

How reproducible:
Reproduced 2 times.

If it does not reproduce easily, repeat the same steps:
start bench-write and run the resize script in parallel.

Steps to Reproduce:
1. Create an image, take a snapshot, protect it, and create a clone.
rbd image 'NEW_CLone':
        size 2000 GB in 512000 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.1254862ae8944a
        format: 2
        features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, journaling
        flags: 
        parent: cephfs_data/NEW@snap1
        overlap: 2000 GB
        journal: 1254862ae8944a
        mirroring state: disabled

2. Run the resize script (attached) and bench-write in parallel.
rbd bench-write -p cephfs_data --image NEW_CLone --io-size 1024 --io-pattern rand
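
The image geometry listed under step 1 can be cross-checked with a little arithmetic. The sketch below is only an illustration, assuming that "GB" in the output means 2**30 bytes and that the order is the base-2 logarithm of the object size; the function names are made up for this example.

```python
# Cross-check of the geometry reported for NEW_CLone.
# Assumption: "GB" in the output means 2**30 bytes.


def object_size_bytes(order: int) -> int:
    """Object size implied by the image order (2**order bytes)."""
    return 1 << order


def object_count(image_size_bytes: int, order: int) -> int:
    """Number of objects needed to cover the image, rounding up."""
    size = object_size_bytes(order)
    return (image_size_bytes + size - 1) // size


GIB = 1 << 30
image_size = 2000 * GIB  # "size 2000 GB"

print(object_size_bytes(22) // 1024, "kB per object")
print(object_count(image_size, 22), "objects")
```

With order 22 this gives 4096 kB objects and 512000 objects for 2000 GB, which agrees with the reported values. Any shrink performed by the resize script lowers the object count while bench-write may still be targeting offsets chosen for the larger size.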

Actual results:
A crash: the process aborts in the tp_librbd thread (see the log under Additional info).

Expected results:
No crash, even when writes race with a resize.

Additional info:
Logs

-----------------------------------------------------------------

    -4> 2016-05-31 04:42:02.168457 7fb750ff9700 -1 librbd::AioCompletion: 0x7fb73c09f980 fail: (22) Invalid argument
    -3> 2016-05-31 04:42:02.168477 7fb750ff9700 -1 librbd::AioCompletion: completed invalid aio_type: 0
    -2> 2016-05-31 04:42:02.168482 7fb750ff9700 -1 librbd::journal::Replay: AIO modify op failed: (22) Invalid argument
    -1> 2016-05-31 04:42:02.168487 7fb750ff9700 -1 librbd::Journal: failed to commit journal event to disk: (22) Invalid argument
     0> 2016-05-31 04:42:02.169581 7fb750ff9700 -1 *** Caught signal (Aborted) **
 in thread 7fb750ff9700 thread_name:tp_librbd


---------------------------------------------------

Comment 2 Tanay Ganguly 2016-05-31 05:12:14 UTC
Created attachment 1163027: Resize Script

Comment 3 Jason Dillaman 2016-05-31 11:34:20 UTC
@Tanay: while it shouldn't crash, you really shouldn't be sending IO outside the bounds of the image (e.g. you shrink the image to a point where bench-write is writing outside the image extents).
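
To illustrate the point about IO outside the image bounds, here is a minimal sketch of an extent check. It is not librbd code: check_write is a hypothetical helper, and the 1500 GB offset and the shrink to 1000 GB are made-up numbers. Only the 2000 GB size, the 1024-byte IO size, and the (22) Invalid argument error come from this report.

```python
import errno


def check_write(image_size: int, offset: int, length: int) -> int:
    """Return 0 if the write fits inside the image, else -EINVAL.

    Hypothetical helper for illustration only; not librbd code.
    """
    if offset < 0 or length < 0:
        return -errno.EINVAL
    if offset + length > image_size:
        return -errno.EINVAL
    return 0


GIB = 1 << 30
offset = 1500 * GIB  # assumed random offset picked while the image was 2000 GB
io_size = 1024       # --io-size from the bench-write command

# Before the shrink the write is inside the image.
print(check_write(2000 * GIB, offset, io_size))
# After an assumed shrink to 1000 GB the same write is out of bounds.
print(check_write(1000 * GIB, offset, io_size))
```

A write that was valid when it was queued can become invalid once the image shrinks underneath it, which is consistent with the Invalid argument errors in the log. The expected behaviour is for such a request to fail cleanly rather than abort the process.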

Comment 4 Harish NV Rao 2016-05-31 12:48:15 UTC
@Jason, this is not a graceful exit and it gives a bad user experience. Requesting that this be fixed in 2.0. Resetting the target release.

Comment 9 Jason Dillaman 2016-06-12 23:54:01 UTC
Upstream merged Jewel PR: https://github.com/ceph/ceph/pull/9611

Comment 13 Tanay Ganguly 2016-06-28 11:31:56 UTC
Marking it as Verified.

ceph version 10.2.2-5.el7cp

Comment 16 errata-xmlrpc 2016-08-23 19:40:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html