Bug 1344274 - crash while bench-write and disabling Journal in parallel
Summary: crash while bench-write and disabling Journal in parallel
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD
Version: 2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: 2.0
Assignee: Jason Dillaman
QA Contact: Tanay Ganguly
URL:
Whiteboard:
Depends On:
Blocks: 1343229
Reported: 2016-06-09 10:01 UTC by Tanay Ganguly
Modified: 2017-07-31 20:59 UTC (History)

Fixed In Version: ceph-10.2.2-1.el7cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-23 19:41:05 UTC
Embargoed:


Attachments
Crash Log (86.55 KB, text/plain)
2016-06-09 10:01 UTC, Tanay Ganguly
Log and script (146.95 KB, application/x-gzip)
2016-06-10 05:55 UTC, Tanay Ganguly


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 16235 0 None None None 2016-06-10 17:25:12 UTC
Red Hat Product Errata RHBA-2016:1755 0 normal SHIPPED_LIVE Red Hat Ceph Storage 2.0 bug fix and enhancement update 2016-08-23 23:23:52 UTC

Description Tanay Ganguly 2016-06-09 10:01:32 UTC
Created attachment 1166230 [details]
Crash Log

Description of problem:
A crash occurs on the master node when the journal is disabled while bench-write is being run repeatedly against the image.

Version-Release number of selected component (if applicable):
rbd-mirror-10.2.1-12.el7cp.x86_64

How reproducible:
Once

Steps to Reproduce:
1. Create an image without journaling enabled.
2. Write some data to it.
3. Enable journaling so that the image resyncs to the slave node.
4. Start bench-write on the image and kill it after a while.
   Repeat this step 3-4 times.
5. While this is in progress, disable journaling from the master node.
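
For reference, the steps above can be written out as an ordered list of rbd CLI invocations. This is only a rough sketch: the pool name, image name, and size are hypothetical, using bench-write for the initial write in step 2 is an assumption, and it assumes the default image features in the environment do not already include journaling. It only builds and prints the commands; it does not kill bench-write or run the disable in parallel as the report describes.

```python
from typing import List


def repro_commands(pool: str = "rbd", image: str = "img1",
                   size_mb: int = 1024, bench_runs: int = 3) -> List[List[str]]:
    """Build the rbd CLI commands for the reproduction steps, in order.

    pool, image and size_mb are hypothetical placeholders, not values
    taken from the bug report.
    """
    spec = f"{pool}/{image}"
    cmds = [
        # Step 1: create the image (journaling assumed off by default).
        ["rbd", "create", "--size", str(size_mb), spec],
        # Step 2: write some data (bench-write used here as an assumption).
        ["rbd", "bench-write", spec],
        # Step 3: enable journaling so the image starts syncing.
        ["rbd", "feature", "enable", spec, "journaling"],
    ]
    # Step 4: repeated bench-write runs (the report kills each one by hand).
    cmds += [["rbd", "bench-write", spec] for _ in range(bench_runs)]
    # Step 5: disable journaling (the report does this while step 4 is active).
    cmds.append(["rbd", "feature", "disable", spec, "journaling"])
    return cmds


if __name__ == "__main__":
    for cmd in repro_commands():
        print(" ".join(cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run on a machine without a cluster; each list could be passed to subprocess.run on a test setup.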

Actual results:
A crash is seen on the master node.

Expected results:
Disabling the journal should complete gracefully.

Additional info:
Log attached

-------------------------------------------------------------------------
    -2> 2016-06-09 15:19:48.724907 7fc2c86f7700  1 -- 10.70.44.40:0/334889691 <== osd.3 10.70.44.50:6829/133371 16 ==== osd_op_reply(54 journal.136a2ae8944a [call] v0'0 uv1178 ondisk = 0) v7 ==== 140+0+385 (2263775508 0 3014552106) 0x7fc27c001940 con 0x7fc2b0014d70
    -1> 2016-06-09 15:19:48.725028 7fc2edbd3d80  5 librbd::Operations: 0x7fc2f8775da0 snap_remove: snap_name=.rbd-mirror.3e563921-8f1c-45bd-bcd9-7fb0b4bfdc9a.c1691508-4630-4524-95e4-e9a8b0b79e3a
     0> 2016-06-09 15:19:48.725769 7fc2edbd3d80 -1 *** Caught signal (Aborted) **
 in thread 7fc2edbd3d80 thread_name:rbd

 ceph version 10.2.1-12.el7cp (939056d19a2a523223611ef08194666b41086b03)
 1: (()+0x1feafa) [0x7fc2ede06afa]
 2: (()+0xf100) [0x7fc2da1bf100]
 3: (gsignal()+0x37) [0x7fc2d820c5f7]
 4: (abort()+0x148) [0x7fc2d820dce8]

Comment 2 Jason Dillaman 2016-06-09 12:12:55 UTC
@Tanay: where is the full log? Are several processes sharing the same log file in your setup?  It looks like the crash was in the rbd CLI while updating the features, not the rbd-mirror daemon as implied.

Comment 3 Tanay Ganguly 2016-06-10 05:48:41 UTC
(In reply to Jason Dillaman from comment #2)
> @Tanay: where is the full log? Are several processes sharing the same log
> file in your setup?  It looks like the crash was in the rbd CLI while
> updating the features, not the rbd-mirror daemon as implied.


I shared the full log. No, I was not executing anything else; I waited for bench-write to complete and then started disabling the journal.

I am not sure about the rbd CLI crash; the log looks like this:

    -1> 2016-06-10 10:59:19.884312 7f879a61cd80  5 librbd::Operations: 0x7f87a58d10e0 snap_remove: snap_name=.rbd-mirror.3e563921-8f1c-45bd-bcd9-7fb0b4bfdc9a.edaf3ce8-fbfd-4fb9-9f75-1effbd754200
     0> 2016-06-10 10:59:19.885214 7f879a61cd80 -1 *** Caught signal (Aborted) **
 in thread 7f879a61cd80 thread_name:rbd
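
The signal line in the excerpt above carries a thread_name field, which can be read off mechanically when checking which process wrote the crash. A minimal sketch, assuming the "in thread <id> thread_name:<name>" layout shown in these excerpts (the function name is hypothetical):

```python
import re
from typing import Optional

# Matches lines such as " in thread 7f879a61cd80 thread_name:rbd".
_THREAD_RE = re.compile(r"in thread (?P<tid>[0-9a-f]+) thread_name:(?P<name>\S+)")


def crashing_thread_name(log_text: str) -> Optional[str]:
    """Return the thread_name from the first crash line, or None if absent."""
    match = _THREAD_RE.search(log_text)
    return match.group("name") if match else None
```

For the excerpt above this yields "rbd", which is consistent with the reading in comment 2 that the crash was in the rbd CLI.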

Comment 4 Tanay Ganguly 2016-06-10 05:51:23 UTC
@Jason,

I am able to reproduce it again with simpler steps.

1. Create an image without journaling enabled (please find attached the script I use to create it).
   I wrote this script to replicate the functionality of RBD import; it imports a block device.

2. Write some more data using bench-write.
3. Let the write complete and the sync begin on the slave node (it was about 35% complete).
4. Disable the journal.


After disabling, I see the crash on the master node.

Comment 5 Tanay Ganguly 2016-06-10 05:55:10 UTC
Created attachment 1166484 [details]
Log and script

This is a tar file.

Comment 6 Harish NV Rao 2016-06-10 06:43:14 UTC
Monti,
This defect is for rbd mirroring. It needs to be fixed for 2.0. I am setting the target release as 2.0 and adding this to 2.0 GA tracker bz.

Comment 8 Harish NV Rao 2016-06-10 14:43:18 UTC
Monti, please change the target release to 2.0. The rules engine pushes it back to 2.1 when I try to change it from 2.1 to 2.0 (comment 7).

Comment 9 Jason Dillaman 2016-06-10 14:50:06 UTC
@Tanay: just want to explicitly confirm that this was a crash in the rbd CLI, not the rbd-mirror daemon (that is what the logs show) and whether or not re-running the rbd CLI command was successful.

Comment 10 Jason Dillaman 2016-06-10 17:26:33 UTC
It appears the rbd CLI will continue to crash until the sync is complete unless you delete the image's rbd-mirror snapshot before disabling journaling.
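
As a rough illustration of that workaround, the snapshot name can be taken from the snap_remove line in the crash log and removed before journaling is disabled. This is an untested sketch: the helper names and the image spec are hypothetical, and whether "rbd snap rm" succeeds on the rbd-mirror snapshot in a given build is an assumption. It only builds the commands.

```python
import re
from typing import List

# Matches "snap_name=.rbd-mirror.<uuid>.<uuid>" as seen in the crash log.
_SNAP_RE = re.compile(r"snap_name=(\.rbd-mirror\.\S+)")


def mirror_snap_names(log_text: str) -> List[str]:
    """Return unique .rbd-mirror.* snapshot names in order of first appearance."""
    seen: List[str] = []
    for name in _SNAP_RE.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


def workaround_commands(image_spec: str, snap_names: List[str]) -> List[List[str]]:
    """Build commands: remove each rbd-mirror snapshot, then disable journaling.

    image_spec is a hypothetical "pool/image" string; the use of
    "rbd snap rm" for these snapshots is an assumption, not confirmed here.
    """
    cmds = [["rbd", "snap", "rm", f"{image_spec}@{name}"] for name in snap_names]
    cmds.append(["rbd", "feature", "disable", image_spec, "journaling"])
    return cmds
```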

Comment 11 Jason Dillaman 2016-06-12 19:28:16 UTC
Upstream Jewel PR: https://github.com/ceph/ceph/pull/9654

Comment 12 Christina Meno 2016-06-14 18:14:42 UTC
Harish, you need to change the flag ceph-2.Z from ? to "" and set ceph-2.0 to ?.

Comment 14 Tanay Ganguly 2016-06-28 10:48:15 UTC
Marking it as Verified.

ceph version 10.2.2-5.el7cp

Comment 16 errata-xmlrpc 2016-08-23 19:41:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html

