Bug 1344274

Summary: crash while bench-write and disabling Journal in parallel
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Tanay Ganguly <tganguly>
Component: RBD
Assignee: Jason Dillaman <jdillama>
Status: CLOSED ERRATA
QA Contact: Tanay Ganguly <tganguly>
Severity: urgent
Priority: unspecified
Version: 2.0
CC: ceph-eng-bugs, gmeno, hnallurv, kurs, tganguly
Target Milestone: rc
Target Release: 2.0
Hardware: x86_64
OS: Linux
Fixed In Version: ceph-10.2.2-1.el7cp
Last Closed: 2016-08-23 19:41:05 UTC
Type: Bug
Bug Blocks: 1343229
Attachments:
  Crash Log (flags: none)
  Log and script (flags: none)

Description Tanay Ganguly 2016-06-09 10:01:32 UTC
Created attachment 1166230
Crash Log

Description of problem:
Running bench-write continuously and then disabling the Journal from the Master Node causes a crash.

Version-Release number of selected component (if applicable):
rbd-mirror-10.2.1-12.el7cp.x86_64

How reproducible:
Once

Steps to Reproduce:
1. Create an Image without enabling the Journal.
2. Write some data to it.
3. Enable the Journal so that the Image resyncs to the Slave Node.
4. Start bench-write on the Image and kill it after a while.
 Repeat step 4 three or four times.
5. While this is in progress, disable the Journal from the Master Node.
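
The steps above can be sketched as a small Python helper that builds the corresponding `rbd` command lines. This is only an illustration under stated assumptions: the pool and image names, image size, write volume, and kill delay are hypothetical, mirroring between the two clusters is assumed to be configured already, and the disable is issued after the bench-write loop rather than truly in parallel with it.

```python
import subprocess
import time


def repro_plan(pool="rbd", image="repro-img", bench_runs=3, kill_after=30):
    """Return the reproduction steps as (argv, kill_after_seconds) pairs.

    All names and sizes are hypothetical placeholders, not the
    reporter's actual values.
    """
    spec = f"{pool}/{image}"
    bench = ["rbd", "bench-write", spec]
    plan = [
        # 1. Create an image without the journaling feature.
        (["rbd", "create", spec, "--size", "10240",
          "--image-feature", "layering",
          "--image-feature", "exclusive-lock"], None),
        # 2. Write some data (256 MiB, given in bytes).
        (bench + ["--io-total", str(256 * 1024 * 1024)], None),
        # 3. Enable journaling so the image starts syncing.
        (["rbd", "feature", "enable", spec, "journaling"], None),
    ]
    # 4. Start bench-write and kill it after a while, several times.
    plan += [(bench, kill_after)] * bench_runs
    # 5. Disable journaling while the sync is still in progress.
    plan.append((["rbd", "feature", "disable", spec, "journaling"], None))
    return plan


def run_plan(plan):
    """Execute the plan against a real cluster (not called by default)."""
    for argv, kill_after in plan:
        if kill_after is None:
            subprocess.run(argv, check=False)
        else:
            proc = subprocess.Popen(argv)
            time.sleep(kill_after)
            proc.kill()
            proc.wait()


if __name__ == "__main__":
    # Dry run: only print the commands.
    for argv, kill_after in repro_plan():
        note = f"   # kill after {kill_after}s" if kill_after else ""
        print(" ".join(argv) + note)
```

Running the file only prints the commands; `run_plan(repro_plan())` would have to be called explicitly on a test cluster.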

Actual results:
Seeing an Crash in Master Node

Expected results:
Disable should be graceful

Additional info:
Log attached

-------------------------------------------------------------------------
    -2> 2016-06-09 15:19:48.724907 7fc2c86f7700  1 -- 10.70.44.40:0/334889691 <== osd.3 10.70.44.50:6829/133371 16 ==== osd_op_reply(54 journal.136a2ae8944a [call] v0'0 uv1178 ondisk = 0) v7 ==== 140+0+385 (2263775508 0 3014552106) 0x7fc27c001940 con 0x7fc2b0014d70
    -1> 2016-06-09 15:19:48.725028 7fc2edbd3d80  5 librbd::Operations: 0x7fc2f8775da0 snap_remove: snap_name=.rbd-mirror.3e563921-8f1c-45bd-bcd9-7fb0b4bfdc9a.c1691508-4630-4524-95e4-e9a8b0b79e3a
     0> 2016-06-09 15:19:48.725769 7fc2edbd3d80 -1 *** Caught signal (Aborted) **
 in thread 7fc2edbd3d80 thread_name:rbd

 ceph version 10.2.1-12.el7cp (939056d19a2a523223611ef08194666b41086b03)
 1: (()+0x1feafa) [0x7fc2ede06afa]
 2: (()+0xf100) [0x7fc2da1bf100]
 3: (gsignal()+0x37) [0x7fc2d820c5f7]
 4: (abort()+0x148) [0x7fc2d820dce8]

Comment 2 Jason Dillaman 2016-06-09 12:12:55 UTC
@Tanay: where is the full log? Are several processes sharing the same log file in your setup?  It looks like the crash was in the rbd CLI while updating the features, not the rbd-mirror daemon as implied.
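
One way to check which process crashed is to pull the `thread_name` field out of the crash banner in the log. The sketch below assumes the banner format shown in the description and assumes that a crash in the rbd-mirror daemon would report a different thread name; the function name is hypothetical.

```python
import re

# Matches the line that follows "*** Caught signal (...) **" in the log.
CRASH_RE = re.compile(r"in thread ([0-9a-f]+) thread_name:(\S+)")


def crashed_threads(log_text):
    """Return (thread_id, thread_name) pairs for every crash banner found."""
    return CRASH_RE.findall(log_text)


sample = """\
     0> 2016-06-09 15:19:48.725769 7fc2edbd3d80 -1 *** Caught signal (Aborted) **
 in thread 7fc2edbd3d80 thread_name:rbd
"""

print(crashed_threads(sample))  # prints [('7fc2edbd3d80', 'rbd')]
```

Here the thread name is `rbd`, which is consistent with the crash being in the rbd CLI.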

Comment 3 Tanay Ganguly 2016-06-10 05:48:41 UTC
(In reply to Jason Dillaman from comment #2)
> @Tanay: where is the full log? Are several processes sharing the same log
> file in your setup?  It looks like the crash was in the rbd CLI while
> updating the features, not the rbd-mirror daemon as implied.


I shared the full log. No, I was not executing anything else; I waited for bench-write to complete and then started disabling the Journal.

I am not sure about the RBD CLI crash; the log looks like:

    -1> 2016-06-10 10:59:19.884312 7f879a61cd80  5 librbd::Operations: 0x7f87a58d10e0 snap_remove: snap_name=.rbd-mirror.3e563921-8f1c-45bd-bcd9-7fb0b4bfdc9a.edaf3ce8-fbfd-4fb9-9f75-1effbd754200
     0> 2016-06-10 10:59:19.885214 7f879a61cd80 -1 *** Caught signal (Aborted) **
 in thread 7f879a61cd80 thread_name:rbd

Comment 4 Tanay Ganguly 2016-06-10 05:51:23 UTC
@Jason,

I am able to reproduce it again with some simpler steps.

1. Create an Image without Journaling enabled (PFA the script I am using to create it).
I created this script to replicate the functionality of RBD_Import; it imports a Block Device.

2. Write some more data using bench-write.
3. Let the write complete and the sync begin on the Slave Node (it was about 35% complete).
4. Disable the Journal.


After disabling, I see the crash on the Master Node.

Comment 5 Tanay Ganguly 2016-06-10 05:55:10 UTC
Created attachment 1166484
Log and script

This is a Tar File

Comment 6 Harish NV Rao 2016-06-10 06:43:14 UTC
Monti,
This defect is for rbd mirroring. It needs to be fixed for 2.0. I am setting the target release as 2.0 and adding this to 2.0 GA tracker bz.

Comment 8 Harish NV Rao 2016-06-10 14:43:18 UTC
Monti, please change the target release to 2.0. The rules engine pushes it back to 2.1 when I try changing it from 2.1 to 2.0 (comment 7).

Comment 9 Jason Dillaman 2016-06-10 14:50:06 UTC
@Tanay: just want to explicitly confirm that this was a crash in the rbd CLI, not the rbd-mirror daemon (that is what the logs show) and whether or not re-running the rbd CLI command was successful.

Comment 10 Jason Dillaman 2016-06-10 17:26:33 UTC
It appears the rbd CLI will continue to crash until the sync is complete unless you delete the image's rbd-mirror snapshot before disabling journaling.
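
As a rough sketch of that workaround, the helper below builds the `rbd` commands that remove an image's rbd-mirror snapshot and then disable journaling. It assumes the snapshot names have already been collected (for example from `rbd snap ls`) and that rbd-mirror snapshots can be recognized by the `.rbd-mirror.` prefix seen in the crash log; the function name and arguments are hypothetical.

```python
MIRROR_SNAP_PREFIX = ".rbd-mirror."


def workaround_commands(pool, image, snap_names):
    """Build commands: drop rbd-mirror snapshots, then disable journaling."""
    spec = f"{pool}/{image}"
    cmds = [
        ["rbd", "snap", "rm", f"{spec}@{name}"]
        for name in snap_names
        if name.startswith(MIRROR_SNAP_PREFIX)  # leave user snapshots alone
    ]
    cmds.append(["rbd", "feature", "disable", spec, "journaling"])
    return cmds


if __name__ == "__main__":
    # Hypothetical snapshot list for illustration only.
    snaps = [".rbd-mirror.example-local-uuid.example-remote-uuid", "user-snap"]
    for argv in workaround_commands("rbd", "repro-img", snaps):
        print(" ".join(argv))
```

The commands are only printed here; whether removing the snapshot has side effects on the in-progress sync is not covered by this sketch.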

Comment 11 Jason Dillaman 2016-06-12 19:28:16 UTC
Upstream Jewel PR: https://github.com/ceph/ceph/pull/9654

Comment 12 Christina Meno 2016-06-14 18:14:42 UTC
Harish, you need to change the flag ceph-2.Z from "?" to "" and ceph-2.0 to "?".

Comment 14 Tanay Ganguly 2016-06-28 10:48:15 UTC
Marking it as Verified.

ceph version 10.2.2-5.el7cp

Comment 16 errata-xmlrpc 2016-08-23 19:41:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html