Created attachment 1171319 [details]
Test Script

Description of problem:
An overnight script was running that repeatedly created an image and cloned it, which in turn synced to the slave node. It was syncing properly until this morning, when I noticed that the mon on the master node had failed because it ran out of space (due to the extra debug logging that is enabled). After I deleted some log files and restarted the mon process on the master node, the re-sync started again, but a while later the rbd-mirror daemon on the slave node crashed.

Version-Release number of selected component (if applicable):
ceph version 10.2.2-5.el7cp

How reproducible:
Once

Steps to Reproduce:
1. Execute the attached script. The script: creates an image with journaling enabled, takes a snapshot, protects the snapshot, clones it with journaling enabled, unprotects the snapshot, flattens the clone, then runs bench writes on the clone and on the parent image (a hedged CLI sketch of these steps follows the backtrace below).
2. In parallel, another image (bigimage1, 100 GB in size) also started its sync.

Actual results:
This morning I noticed that the mon process on the master node had died because it ran out of space:

2016-06-21 04:53:45.349466 7fab60f44700  0 mon.cephqe5@0(leader).data_health(5) update_stats avail 4% total 51175 MB, used 49066 MB, avail 2108 MB
2016-06-21 04:53:45.349524 7fab60f44700 -1 mon.cephqe5@0(leader).data_health(5) reached critical levels of available space on local monitor storage -- shutdown!
2016-06-21 04:53:45.349532 7fab60f44700  0 ** Shutdown via Data Health Service **
2016-06-21 04:53:45.349667 7fab5eb3c700 -1 mon.cephqe5@0(leader) e1 *** Got Signal Interrupt ***

After cleaning up the space this morning I started the mon again, and the re-sync of the existing image started. A few minutes later I saw that the rbd-mirror daemon process on the slave node had been killed.

Expected results:
The rbd-mirror daemon should not crash.

Additional info:
-----------------------------------------------------------------------------
-flight updates
    -6> 2016-06-23 06:21:52.788274 7fddec3e1700  1 -- 10.8.128.3:0/1446315387 <== osd.0 10.8.128.23:6801/108449 374221 ==== osd_op_reply(2321648 rbd_mirroring [call] v705'153625 uv153625 ack = 0) v7 ==== 133+0+0 (1062358106 0 0) 0x7fddc40200c0 con 0x7fde04d59fd0
    -5> 2016-06-23 06:21:52.788324 7fddec3e1700  1 -- 10.8.128.3:0/1446315387 <== osd.0 10.8.128.23:6801/108449 374222 ==== osd_op_reply(2321648 rbd_mirroring [call] v705'153625 uv153625 ondisk = 0) v7 ==== 133+0+0 (748351248 0 0) 0x7fddc40200c0 con 0x7fde04d59fd0
    -4> 2016-06-23 06:21:52.788334 7fdddd7fa700 20 rbd::mirror::ImageReplayer: 0x7fdd8400fef0 [4/e81dc523-eee3-4f46-997a-7359212eda9b] handle_mirror_status_update: r=0
    -3> 2016-06-23 06:21:52.788342 7fdddd7fa700 20 rbd::mirror::ImageReplayer: 0x7fdd8400fef0 [4/e81dc523-eee3-4f46-997a-7359212eda9b] reschedule_update_status_task:
    -2> 2016-06-23 06:21:52.788343 7fdddd7fa700 20 rbd::mirror::ImageReplayer: 0x7fdd8400fef0 [4/e81dc523-eee3-4f46-997a-7359212eda9b] start_mirror_image_status_update:
    -1> 2016-06-23 06:21:52.788347 7fdddd7fa700 20 rbd::mirror::ImageReplayer: 0x7fdd8400fef0 [4/e81dc523-eee3-4f46-997a-7359212eda9b] finish_mirror_image_status_update: waiting on 1 in-flight updates
     0> 2016-06-23 06:21:52.789303 7fdd459e7700 -1 librbd/AioObjectRequest.cc: In function 'void librbd::AbstractAioObjectWrite::send_pre()' thread 7fdd459e7700 time 2016-06-23 06:21:52.707119
librbd/AioObjectRequest.cc: 405: FAILED assert(m_ictx->exclusive_lock->is_lock_owner())

 ceph version 10.2.2-5.el7cp (49c93ac9dceaa9318129ece8b0f45a3968d9c981)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7fddfafc2275]
 2: (librbd::AbstractAioObjectWrite::send_pre()+0x47c) [0x7fddfaed8d1c]
 3: (librbd::AbstractAioObjectWrite::send()+0x95) [0x7fddfaed8e55]
 4: (librbd::AbstractAioImageWrite::send_object_requests(std::vector<ObjectExtent, std::allocator<ObjectExtent> > const&, SnapContext const&, std::list<librbd::AioObjectRequest*, std::allocator<librbd::AioObjectRequest*> >*)+0x3dd) [0x7fddfaecde7d]
 5: (librbd::AbstractAioImageWrite::send_request()+0x36c) [0x7fddfaecec3c]
 6: (librbd::AioImageRequest<librbd::ImageCtx>::send()+0xc4) [0x7fddfaed08c4]
 7: (librbd::AioImageRequest<librbd::ImageCtx>::aio_write(librbd::ImageCtx*, librbd::AioCompletion*, unsigned long, unsigned long, char const*, int)+0x84) [0x7fddfaed0bd4]
 8: (librbd::journal::Replay<librbd::ImageCtx>::handle_event(librbd::journal::AioWriteEvent const&, Context*, Context*)+0xbc) [0x7fddfae9d60c]
 9: (librbd::journal::Replay<librbd::ImageCtx>::process(ceph::buffer::list::iterator*, Context*, Context*)+0x452) [0x7fddfaea2082]
 10: (rbd::mirror::ImageReplayer<librbd::ImageCtx>::process_entry()+0xfd) [0x7fddfadcb9bd]
 11: (rbd::mirror::ImageReplayer<librbd::ImageCtx>::handle_replay_ready()+0x24d) [0x7fddfadcc74d]
 12: (rbd::mirror::ImageReplayer<librbd::ImageCtx>::handle_process_entry_ready(int)+0x79) [0x7fddfadc84b9]
 13: (Context::complete(int)+0x9) [0x7fddfadb46c9]
 14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa7e) [0x7fddfafb355e]
 15: (ThreadPool::WorkThread::entry()+0x10) [0x7fddfafb4430]
 16: (()+0x7dc5) [0x7fddf04addc5]
 17: (clone()+0x6d) [0x7fddef393ced]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
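For reference, a minimal CLI sketch of the reproduction steps above. The pool name (mirror_pool), image names (img1, clone1), size, and bench parameters are assumptions for illustration only; the attached script remains the authoritative reproducer, and pool-mode mirroring between the two clusters is assumed to be configured already.

# Minimal sketch of the reproducer (assumed names/sizes; see the attached script for the real commands)
POOL=mirror_pool
FEATURES=layering,exclusive-lock,journaling   # journaling is what enables journal-based mirroring

# Create the parent image with journaling enabled
rbd create ${POOL}/img1 --size 1024 --image-feature ${FEATURES}

# Snapshot and protect it so it can be cloned
rbd snap create ${POOL}/img1@snap1
rbd snap protect ${POOL}/img1@snap1

# Clone with journaling enabled as well
rbd clone ${POOL}/img1@snap1 ${POOL}/clone1 --image-feature ${FEATURES}

# Flatten the clone, then unprotect the snapshot (flatten first here, since
# unprotect only succeeds once no unflattened clone depends on the snapshot;
# the report lists unprotect before flatten)
rbd flatten ${POOL}/clone1
rbd snap unprotect ${POOL}/img1@snap1

# Journaled write load on the clone and on the parent image
# (bench-write is the Jewel-era subcommand)
rbd bench-write ${POOL}/clone1 --io-size 4096 --io-threads 16 --io-total 256M --io-pattern rand
rbd bench-write ${POOL}/img1   --io-size 4096 --io-threads 16 --io-total 256M --io-pattern rand

To confirm that the images are syncing, "rbd mirror pool status mirror_pool --verbose" can be run on the slave cluster (same assumed pool name as above).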
Overloaded cluster watch failure:
librbd::ImageWatcher: 0x7fdd780acf60 image watch failed: 140589178452096, (107) Transport endpoint is not connected
The slave not crashing while the cluster is outside bandwidth parameters is a "nice to have"; moving this to 2.1.
Upstream master PR: https://github.com/ceph/ceph/pull/9905
Marking it as Verified, as I am unable to reproduce it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-1755.html