Bug 1637948 - OSD FAILED assert(repop_queue.front() == repop) in EC CephFS data pool
Summary: OSD FAILED assert(repop_queue.front() == repop) in EC CephFS data pool
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 3.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: rc
Target Release: 4.*
Assignee: Neha Ojha
QA Contact: Manohar Murthy
URL:
Whiteboard:
Depends On:
Blocks: 1629656
 
Reported: 2018-10-10 11:38 UTC by Ramakrishnan Periyasamy
Modified: 2020-05-06 21:28 UTC
CC: 11 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
.Performing I/O on CephFS erasure-coded pools can cause an assertion failure
This issue is being investigated as a possible latent bug in the messenger layer that could be causing out-of-order operations on the OSD. The issue causes the following error:
----
FAILED assert(repop_queue.front() == repop)
----
There is no workaround at this time. CephFS with erasure-coded pools is a Technology Preview. For more information, see link:cephfs-guide#creating-ceph-file-systems-with-erasure-coding-fs[Creating Ceph File Systems with erasure coding] in the link:cephfs-guide[Ceph File System Guide].
Clone Of:
Environment:
Last Closed: 2020-05-06 21:28:35 UTC
Embargoed:


Attachments (Terms of Use)
OSD logs (5.42 MB, text/plain)
2018-10-10 11:43 UTC, Ramakrishnan Periyasamy
osd.15 logs which also hit same assert (4.82 MB, text/plain)
2018-10-10 11:45 UTC, Ramakrishnan Periyasamy


Links
Ceph Project Bug Tracker 21143 (last updated 2018-10-18 15:22:11 UTC)

Description Ramakrishnan Periyasamy 2018-10-10 11:38:57 UTC
Description of problem:

Observed the following assert in an OSD when performing I/O on an erasure-coded CephFS data pool.

I/O: file creation workload using the Crefi and smallfiles I/O tools.

2018-10-10 11:04:17.526418 7f2e04cef700 -1 /builddir/build/BUILD/ceph-12.2.8/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::eval_repop(PrimaryLogPG::RepGather*)' thread 7f2e04cef700 time 2018-10-10 11:04:17.521648
/builddir/build/BUILD/ceph-12.2.8/src/osd/PrimaryLogPG.cc: 9419: FAILED assert(repop_queue.front() == repop)

 ceph version 12.2.8-14.el7cp (8f51764103fb904d4b9772579a437a193a69cc28) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x559a545ba040]
 2: (PrimaryLogPG::eval_repop(PrimaryLogPG::RepGather*)+0x524) [0x559a541930c4]
 3: (PrimaryLogPG::repop_all_committed(PrimaryLogPG::RepGather*)+0xc5) [0x559a54193ba5]
 4: (Context::complete(int)+0x9) [0x559a54053ae9]
 5: (ECBackend::handle_sub_write_reply(pg_shard_t, ECSubWriteReply const&, ZTracer::Trace const&)+0x29b) [0x559a54328ceb]
 6: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2df) [0x559a5432adbf]
 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x559a54229a90]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x59c) [0x559a541947ec]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9) [0x559a54015479]
 10: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x559a5429c607]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x559a540443ae]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x559a545bfb59]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x559a545c1af0]
 14: (()+0x7dd5) [0x7f2e1fdc3dd5]
 15: (clone()+0x6d) [0x7f2e1eeb3ead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
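
For context, the failed assert enforces a FIFO invariant on the primary OSD's queue of in-flight replicated writes: a RepGather can only be retired once every older op has committed, so a fully committed repop must be at the head of repop_queue. Below is a minimal standalone sketch of that invariant, not the Ceph source; only the RepGather, eval_repop, and repop_queue names are taken from the backtrace, everything else is simplified:

  // Illustration only -- NOT the Ceph source. Sketches the FIFO
  // invariant behind FAILED assert(repop_queue.front() == repop).
  #include <cassert>
  #include <deque>

  struct RepGather {
    unsigned tid;   // transaction id, assigned in submission order
  };

  std::deque<RepGather*> repop_queue;   // in-flight replicated writes, FIFO

  // Invoked when the final replica commit ack for `repop` arrives.
  void eval_repop(RepGather* repop) {
    // Ops must finish in the order they were queued. If the messenger
    // layer delivered replica replies out of order, a newer op could
    // gather all its acks while an older one still heads the queue,
    // and this is the assert that fires in the report:
    assert(repop_queue.front() == repop);
    repop_queue.pop_front();
    delete repop;
  }

  int main() {
    auto* a = new RepGather{1};
    auto* b = new RepGather{2};
    repop_queue.push_back(a);
    repop_queue.push_back(b);
    eval_repop(a);   // in order: passes
    eval_repop(b);   // in order: passes
    // Calling eval_repop(b) before eval_repop(a) would trip the assert.
    return 0;
  }

Completing b before a reproduces the failure mode here: b is fully committed while a still heads the queue, which is consistent with the out-of-order-messenger theory in the Doc Text field above.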



Version-Release number of selected component (if applicable):
ceph version 12.2.8-14.el7cp (8f51764103fb904d4b9772579a437a193a69cc28) luminous (stable)

How reproducible:
1/1. On one particular OSD node, all 3 OSDs hit the same issue.

Steps to Reproduce:
1. Configure an erasure-coded CephFS data pool (see the example commands after these steps).
2. Run Crefi and smallfiles I/O with the cluster in a degraded state.
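
For reference, the standard upstream procedure for step 1 on a Luminous cluster looks like the following; the pool name, PG counts, and mount path are placeholders, not the exact values used in this run:

  # create an EC data pool and allow partial overwrites (required for CephFS)
  ceph osd pool create cephfs_data_ec 64 64 erasure
  ceph osd pool set cephfs_data_ec allow_ec_overwrites true
  # attach it to the existing file system as an additional data pool
  ceph fs add_data_pool cephfs cephfs_data_ec
  # direct a directory's files to the EC pool via a file layout
  setfattr -n ceph.dir.layout.pool -v cephfs_data_ec /mnt/cephfs/ecdir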

Actual results:


Expected results:


Additional info:

Comment 3 Ramakrishnan Periyasamy 2018-10-10 11:43:52 UTC
Created attachment 1492528 [details]
OSD logs

Comment 4 Ramakrishnan Periyasamy 2018-10-10 11:45:04 UTC
Created attachment 1492529 [details]
osd.15 logs which also hit same assert

Comment 5 Neha Ojha 2018-10-12 23:06:04 UTC
Here are a couple of upstream tracker issues that appear similar to this one: https://tracker.ceph.com/issues/22570 and https://tracker.ceph.com/issues/21143.

