Bug 1637948 - OSD FAILED assert(repop_queue.front() == repop) in EC CephFS data pool
Summary: OSD FAILED assert(repop_queue.front() == repop) in EC CephFS data pool
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 3.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: rc
Target Release: 4.*
Assignee: Neha Ojha
QA Contact: Manohar Murthy
URL:
Whiteboard:
Depends On:
Blocks: 1629656
 
Reported: 2018-10-10 11:38 UTC by Ramakrishnan Periyasamy
Modified: 2020-05-06 21:28 UTC
CC: 11 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
.Performing I/O on CephFS erasure-coded pools can cause an assertion failure
This issue is being investigated as a possible latent bug in the messenger layer that could be causing out-of-order operations on the OSD. The issue causes the following error:
----
FAILED assert(repop_queue.front() == repop)
----
There is no workaround at this time. CephFS with erasure-coded pools is a Technology Preview. For more information, see link:cephfs-guide#creating-ceph-file-systems-with-erasure-coding-fs[Creating Ceph File Systems with erasure coding] in the link:cephfs-guide[Ceph File System Guide].
Clone Of:
Environment:
Last Closed: 2020-05-06 21:28:35 UTC
Embargoed:


Attachments (Terms of Use)
OSD logs (5.42 MB, text/plain)
2018-10-10 11:43 UTC, Ramakrishnan Periyasamy
osd.15 logs which also hit same assert (4.82 MB, text/plain)
2018-10-10 11:45 UTC, Ramakrishnan Periyasamy


Links
Ceph Project Bug Tracker 21143 (last updated 2018-10-18 15:22:11 UTC)

Description Ramakrishnan Periyasamy 2018-10-10 11:38:57 UTC
Description of problem:

Observed the following assert in an OSD when performing I/O on an erasure-coded CephFS data pool.

I/O: file creation workload using the Crefi and smallfiles I/O tools.

2018-10-10 11:04:17.526418 7f2e04cef700 -1 /builddir/build/BUILD/ceph-12.2.8/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::eval_repop(PrimaryLogPG::RepGather*)' thread 7f2e04cef700 time 2018-10-10 11:04:17.521648
/builddir/build/BUILD/ceph-12.2.8/src/osd/PrimaryLogPG.cc: 9419: FAILED assert(repop_queue.front() == repop)

 ceph version 12.2.8-14.el7cp (8f51764103fb904d4b9772579a437a193a69cc28) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x559a545ba040]
 2: (PrimaryLogPG::eval_repop(PrimaryLogPG::RepGather*)+0x524) [0x559a541930c4]
 3: (PrimaryLogPG::repop_all_committed(PrimaryLogPG::RepGather*)+0xc5) [0x559a54193ba5]
 4: (Context::complete(int)+0x9) [0x559a54053ae9]
 5: (ECBackend::handle_sub_write_reply(pg_shard_t, ECSubWriteReply const&, ZTracer::Trace const&)+0x29b) [0x559a54328ceb]
 6: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2df) [0x559a5432adbf]
 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x559a54229a90]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x59c) [0x559a541947ec]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9) [0x559a54015479]
 10: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x559a5429c607]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x559a540443ae]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x559a545bfb59]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x559a545c1af0]
 14: (()+0x7dd5) [0x7f2e1fdc3dd5]
 15: (clone()+0x6d) [0x7f2e1eeb3ead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
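
For context, the failed assert enforces a FIFO invariant on the primary OSD's queue of in-flight replicated writes: a RepGather can only be retired once every older op has committed, so a fully committed repop must be at the head of repop_queue. Below is a minimal standalone sketch of that invariant, not the Ceph source; only the RepGather, eval_repop, and repop_queue names are taken from the backtrace, everything else is simplified:

  // Illustration only -- NOT the Ceph source. Sketches the FIFO
  // invariant behind FAILED assert(repop_queue.front() == repop).
  #include <cassert>
  #include <deque>

  struct RepGather {
    unsigned tid;   // transaction id, assigned in submission order
  };

  std::deque<RepGather*> repop_queue;   // in-flight replicated writes, FIFO

  // Invoked when the final replica commit ack for `repop` arrives.
  void eval_repop(RepGather* repop) {
    // Ops must finish in the order they were queued. If the messenger
    // layer delivered replica replies out of order, a newer op could
    // gather all its acks while an older one still heads the queue,
    // and this is the assert that fires in the report:
    assert(repop_queue.front() == repop);
    repop_queue.pop_front();
    delete repop;
  }

  int main() {
    auto* a = new RepGather{1};
    auto* b = new RepGather{2};
    repop_queue.push_back(a);
    repop_queue.push_back(b);
    eval_repop(a);   // in order: passes
    eval_repop(b);   // in order: passes
    // Calling eval_repop(b) before eval_repop(a) would trip the assert.
    return 0;
  }

Completing b before a reproduces the failure mode here: b is fully committed while a still heads the queue, which is consistent with the out-of-order-messenger theory in the Doc Text field above.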



Version-Release number of selected component (if applicable):
ceph version 12.2.8-14.el7cp (8f51764103fb904d4b9772579a437a193a69cc28) luminous (stable)

How reproducible:
1/1. On one particular OSD node, all 3 OSDs hit the same issue.

Steps to Reproduce:
1. Configure an erasure-coded CephFS data pool (see the example commands after these steps).
2. Run Crefi and smallfiles I/O with the cluster in a degraded state.
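
For reference, the standard upstream procedure for step 1 on a Luminous cluster looks like the following; the pool name, PG counts, and mount path are placeholders, not the exact values used in this run:

  # create an EC data pool and allow partial overwrites (required for CephFS)
  ceph osd pool create cephfs_data_ec 64 64 erasure
  ceph osd pool set cephfs_data_ec allow_ec_overwrites true
  # attach it to the existing file system as an additional data pool
  ceph fs add_data_pool cephfs cephfs_data_ec
  # direct a directory's files to the EC pool via a file layout
  setfattr -n ceph.dir.layout.pool -v cephfs_data_ec /mnt/cephfs/ecdir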

Actual results:


Expected results:


Additional info:

Comment 3 Ramakrishnan Periyasamy 2018-10-10 11:43:52 UTC
Created attachment 1492528 [details]
OSD logs

Comment 4 Ramakrishnan Periyasamy 2018-10-10 11:45:04 UTC
Created attachment 1492529 [details]
osd.15 logs which also hit same assert

Comment 5 Neha Ojha 2018-10-12 23:06:04 UTC
Here are a couple of upstream tracker issues that appear similar to this one: https://tracker.ceph.com/issues/22570 and https://tracker.ceph.com/issues/21143.

