Bug 1349288 - RBD Mirror Crashed on Slave Node "librbd/AioObjectRequest.cc: 405: FAILED assert(m_ictx->exclusive_lock->is_lock_owner())"
Summary: RBD Mirror Crashed on Slave Node "librbd/AioObjectRequest.cc: 405: FAILED ass...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RBD
Version: 2.0
Hardware: x86_64
OS: Linux
unspecified
urgent
Target Milestone: rc
: 2.0
Assignee: Jason Dillaman
QA Contact: Tanay Ganguly
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-06-23 07:22 UTC by Tanay Ganguly
Modified: 2017-07-30 15:31 UTC (History)
7 users (show)

Fixed In Version: RHEL: ceph-10.2.2-9.el7cp Ubuntu: ceph_10.2.2-7redhat1xenial
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-23 19:42:23 UTC
Target Upstream Version:


Attachments (Terms of Use)
Test Script (781 bytes, application/x-shellscript)
2016-06-23 07:22 UTC, Tanay Ganguly
no flags Details


Links
System ID Priority Status Summary Last Updated
Ceph Project Bug Tracker 16450 None None None 2016-06-23 14:16:34 UTC
Red Hat Product Errata RHBA-2016:1755 normal SHIPPED_LIVE Red Hat Ceph Storage 2.0 bug fix and enhancement update 2016-08-23 23:23:52 UTC

Description Tanay Ganguly 2016-06-23 07:22:07 UTC
Created attachment 1171319 [details]
Test Script

Description of problem:

Overnight script was running which was creating Image and Cloning it, which in-turn sync with Slave Node. It was properly syncing until today morning when i noticed that Mon Node on the Master Node failed, because it ran out of Space ( Because of the Extra logging which is enabled )
Now when i deleted some log files and restarted the MON process on Master Node, the re-sync again started.

But after a while on Slave Node the rbd-mirror daemon got crashed.

Version-Release number of selected component (if applicable):
ceph version 10.2.2-5.el7cp

How reproducible:
Once

Steps to Reproduce:
1.Executed the attached script.
   The Script does:
   Create Image with Journal enabled, take Snap, Protect Snap, Clone it with Journal enabled, unprotect snap,Flatten it, bench write on Clone, bench write on parent image.
2. In Parallel another bigimage1 also started its sync (Its 100GB in Size)

Actual results:
Morning i noticed the Master Node Mon process died ( because it ran out of space)

2016-06-21 04:53:45.349466 7fab60f44700  0 mon.cephqe5@0(leader).data_health(5) update_stats avail 4% total 51175 MB, used 49066 MB, avail 2108 MB
2016-06-21 04:53:45.349524 7fab60f44700 -1 mon.cephqe5@0(leader).data_health(5) reached critical levels of available space on local monitor storage -- shutdown!
2016-06-21 04:53:45.349532 7fab60f44700  0 ** Shutdown via Data Health Service **
2016-06-21 04:53:45.349667 7fab5eb3c700 -1 mon.cephqe5@0(leader) e1 *** Got Signal Interrupt ***

On the Master Node.

Morning after cleaning up the space i started the mon again.
The Resync of the existing image started.

Few minutes later i saw that the rbd-mirror daemon process got killed On Slave Node.

Expected results:
Should not crash

Additional info:



-----------------------------------------------------------------------------

-flight updates
    -6> 2016-06-23 06:21:52.788274 7fddec3e1700  1 -- 10.8.128.3:0/1446315387 <== osd.0 10.8.128.23:6801/108449 374221 ==== osd_op_reply(2321648 rbd_mirroring [call] v705'153625 uv153625 ack = 0) v7 ==== 133+0+0 (1062358106 0 0) 0x7fddc40200c0 con 0x7fde04d59fd0
    -5> 2016-06-23 06:21:52.788324 7fddec3e1700  1 -- 10.8.128.3:0/1446315387 <== osd.0 10.8.128.23:6801/108449 374222 ==== osd_op_reply(2321648 rbd_mirroring [call] v705'153625 uv153625 ondisk = 0) v7 ==== 133+0+0 (748351248 0 0) 0x7fddc40200c0 con 0x7fde04d59fd0
    -4> 2016-06-23 06:21:52.788334 7fdddd7fa700 20 rbd::mirror::ImageReplayer: 0x7fdd8400fef0 [4/e81dc523-eee3-4f46-997a-7359212eda9b] handle_mirror_status_update: r=0
    -3> 2016-06-23 06:21:52.788342 7fdddd7fa700 20 rbd::mirror::ImageReplayer: 0x7fdd8400fef0 [4/e81dc523-eee3-4f46-997a-7359212eda9b] reschedule_update_status_task:
    -2> 2016-06-23 06:21:52.788343 7fdddd7fa700 20 rbd::mirror::ImageReplayer: 0x7fdd8400fef0 [4/e81dc523-eee3-4f46-997a-7359212eda9b] start_mirror_image_status_update:
    -1> 2016-06-23 06:21:52.788347 7fdddd7fa700 20 rbd::mirror::ImageReplayer: 0x7fdd8400fef0 [4/e81dc523-eee3-4f46-997a-7359212eda9b] finish_mirror_image_status_update: waiting on 1 in-flight updates
     0> 2016-06-23 06:21:52.789303 7fdd459e7700 -1 librbd/AioObjectRequest.cc: In function 'void librbd::AbstractAioObjectWrite::send_pre()' thread 7fdd459e7700 time 2016-06-23 06:21:52.707119
librbd/AioObjectRequest.cc: 405: FAILED assert(m_ictx->exclusive_lock->is_lock_owner())

 ceph version 10.2.2-5.el7cp (49c93ac9dceaa9318129ece8b0f45a3968d9c981)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7fddfafc2275]
 2: (librbd::AbstractAioObjectWrite::send_pre()+0x47c) [0x7fddfaed8d1c]
 3: (librbd::AbstractAioObjectWrite::send()+0x95) [0x7fddfaed8e55]
 4: (librbd::AbstractAioImageWrite::send_object_requests(std::vector<ObjectExtent, std::allocator<ObjectExtent> > const&, SnapContext const&, std::list<librbd::AioObjectRequest*, std::allocator<librbd::AioObjectRequest*> >*)+0x3dd) [0x7fddfaecde7d]
 5: (librbd::AbstractAioImageWrite::send_request()+0x36c) [0x7fddfaecec3c]
 6: (librbd::AioImageRequest<librbd::ImageCtx>::send()+0xc4) [0x7fddfaed08c4]
 7: (librbd::AioImageRequest<librbd::ImageCtx>::aio_write(librbd::ImageCtx*, librbd::AioCompletion*, unsigned long, unsigned long, char const*, int)+0x84) [0x7fddfaed0bd4]
 8: (librbd::journal::Replay<librbd::ImageCtx>::handle_event(librbd::journal::AioWriteEvent const&, Context*, Context*)+0xbc) [0x7fddfae9d60c]
 9: (librbd::journal::Replay<librbd::ImageCtx>::process(ceph::buffer::list::iterator*, Context*, Context*)+0x452) [0x7fddfaea2082]
 10: (rbd::mirror::ImageReplayer<librbd::ImageCtx>::process_entry()+0xfd) [0x7fddfadcb9bd]
 11: (rbd::mirror::ImageReplayer<librbd::ImageCtx>::handle_replay_ready()+0x24d) [0x7fddfadcc74d]
 12: (rbd::mirror::ImageReplayer<librbd::ImageCtx>::handle_process_entry_ready(int)+0x79) [0x7fddfadc84b9]
 13: (Context::complete(int)+0x9) [0x7fddfadb46c9]
 14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa7e) [0x7fddfafb355e]
 15: (ThreadPool::WorkThread::entry()+0x10) [0x7fddfafb4430]
 16: (()+0x7dc5) [0x7fddf04addc5]
 17: (clone()+0x6d) [0x7fddef393ced]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Comment 3 Jason Dillaman 2016-06-23 13:27:42 UTC
Overloaded cluster watch failure:

librbd::ImageWatcher: 0x7fdd780acf60 image watch failed: 140589178452096, (107) Transport endpoint is not connected

Comment 6 Christina Meno 2016-06-24 14:07:54 UTC
slave not crashing while cluster is outside bandwidth parameters is "nice to have" moving to 2.1

Comment 9 Jason Dillaman 2016-06-27 18:20:33 UTC
Upstream master PR: https://github.com/ceph/ceph/pull/9905 and

Comment 13 Tanay Ganguly 2016-07-28 08:55:17 UTC
Marking it as Verified, as i am unable to reproduce it.

Comment 15 errata-xmlrpc 2016-08-23 19:42:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html


Note You need to log in before you can comment on or make changes to this bug.