Description of problem:
Observed the assert below in an OSD while performing IO on an erasure-coded CephFS data pool.

IO: file-creation workload using the Crefi and smallfiles IO tools.

2018-10-10 11:04:17.526418 7f2e04cef700 -1 /builddir/build/BUILD/ceph-12.2.8/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::eval_repop(PrimaryLogPG::RepGather*)' thread 7f2e04cef700 time 2018-10-10 11:04:17.521648
/builddir/build/BUILD/ceph-12.2.8/src/osd/PrimaryLogPG.cc: 9419: FAILED assert(repop_queue.front() == repop)

 ceph version 12.2.8-14.el7cp (8f51764103fb904d4b9772579a437a193a69cc28) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x559a545ba040]
 2: (PrimaryLogPG::eval_repop(PrimaryLogPG::RepGather*)+0x524) [0x559a541930c4]
 3: (PrimaryLogPG::repop_all_committed(PrimaryLogPG::RepGather*)+0xc5) [0x559a54193ba5]
 4: (Context::complete(int)+0x9) [0x559a54053ae9]
 5: (ECBackend::handle_sub_write_reply(pg_shard_t, ECSubWriteReply const&, ZTracer::Trace const&)+0x29b) [0x559a54328ceb]
 6: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2df) [0x559a5432adbf]
 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x559a54229a90]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x59c) [0x559a541947ec]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9) [0x559a54015479]
 10: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x559a5429c607]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x559a540443ae]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x559a545bfb59]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x559a545c1af0]
 14: (()+0x7dd5) [0x7f2e1fdc3dd5]
 15: (clone()+0x6d) [0x7f2e1eeb3ead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Version-Release number of selected component (if applicable):
ceph version 12.2.8-14.el7cp (8f51764103fb904d4b9772579a437a193a69cc28) luminous (stable)

How reproducible:
1/1. All three OSDs on one particular OSD node hit the same assert.

Steps to Reproduce:
1. Configure an erasure-coded CephFS data pool.
2. Run Crefi and smallfiles IO against it while the cluster is in a degraded state.

Actual results:
The OSD hits FAILED assert(repop_queue.front() == repop) in PrimaryLogPG::eval_repop() and aborts.

Expected results:
IO completes without any OSD hitting an assert.

Additional info:
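For context on what the failed assert checks: repop_queue holds in-flight replicated write transactions (RepGathers) in submission order, and eval_repop() expects the repop whose final commit ack just arrived to be the oldest entry still in the queue, i.e. completions must be FIFO. The sketch below is a minimal illustration of that invariant, not the actual Ceph code; the Op, OpQueue, submit, and on_all_committed names are hypothetical stand-ins for the RepGather/repop_queue/repop_all_committed machinery.

// Minimal sketch of the FIFO-completion invariant behind
// "FAILED assert(repop_queue.front() == repop)".
// Illustrative only -- Op/OpQueue/on_all_committed are hypothetical
// stand-ins for Ceph's RepGather/repop_queue code paths.
#include <cassert>
#include <cstdio>
#include <deque>

struct Op {
    int tid;             // transaction id, assigned in submission order
    bool all_committed;  // set once every replica/shard has acked
};

struct OpQueue {
    std::deque<Op*> repop_queue;  // in-flight ops, oldest at the front

    void submit(Op* op) {
        repop_queue.push_back(op);  // ops enter in submission order
    }

    // Called when the last replica/shard ack for `op` arrives.
    void on_all_committed(Op* op) {
        op->all_committed = true;
        // Invariant: ops finish in the same order they were queued,
        // so the op that just completed must be the oldest in flight.
        // An out-of-order completion fires exactly this assert.
        assert(repop_queue.front() == op);
        repop_queue.pop_front();
        std::printf("op %d committed and dequeued\n", op->tid);
    }
};

int main() {
    OpQueue q;
    Op a{1, false}, b{2, false};
    q.submit(&a);
    q.submit(&b);
    q.on_all_committed(&a);  // OK: a is at the front
    q.on_all_committed(&b);  // OK: b is now the front
    // Completing b before a would trip the assert -- the failure
    // mode reported in the backtrace above.
    return 0;
}

In the backtrace, the completion is driven by ECBackend::handle_sub_write_reply(), i.e. an EC sub-write reply declared a repop fully committed while it was not at the front of repop_queue.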
Created attachment 1492528 [details] OSD logs
Created attachment 1492529 [details] osd.15 logs, which also hit the same assert
A couple of upstream tracker issues appear similar to this one: https://tracker.ceph.com/issues/22570 and https://tracker.ceph.com/issues/21143.