Bug 1455925 - Osd crashing during recovery
Summary: Osd crashing during recovery
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 2.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: 2.3
Assignee: Josh Durgin
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-05-26 12:53 UTC by Parikshith
Modified: 2017-07-30 15:13 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-26 14:45:06 UTC
Embargoed:


Attachments
Crashed osd log (6.25 MB, text/plain)
2017-05-26 12:53 UTC, Parikshith

Description Parikshith 2017-05-26 12:53:40 UTC
Created attachment 1282574
Crashed osd log

Description of problem: One of the OSDs crashed during recovery I/O.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Created a replicated pool (replica count 3).
2. Ran some I/O on the pool using rados bench.
3. Brought down one of the OSDs (osd.0) to induce recovery in the cluster.
4. During recovery, one of the OSDs (osd.5) went down with an assert.
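
For anyone trying to repeat the steps above, the following is a rough sketch of the command sequence, not the exact commands used in this report. The pool name, PG count, and benchmark duration are assumptions made for illustration, since the report does not state them. The script only prints the commands (dry run); the `systemctl stop` step would need to be run on the host that carries osd.0.

```python
import shlex


def build_repro_commands(pool="recovery-test", pg_num=64, bench_seconds=60, osd_id=0):
    """Return the reproduction steps as argument lists; nothing is executed.

    pool, pg_num and bench_seconds are hypothetical values, not taken
    from the bug report. Only osd_id=0 matches the report.
    """
    return [
        # Step 1: replicated pool with three replicas.
        ["ceph", "osd", "pool", "create", pool, str(pg_num), str(pg_num), "replicated"],
        ["ceph", "osd", "pool", "set", pool, "size", "3"],
        # Step 2: write load; --no-cleanup keeps the objects so recovery has data to move.
        ["rados", "bench", "-p", pool, str(bench_seconds), "write", "--no-cleanup"],
        # Step 3: stop one OSD to induce recovery (run on that OSD's host).
        ["systemctl", "stop", f"ceph-osd@{osd_id}"],
        # Step 4: watch cluster state while recovery runs.
        ["ceph", "-s"],
    ]


# Dry run: print each command as a shell-quoted line.
for cmd in build_repro_commands():
    print(shlex.join(cmd))
```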

Assert:

 -35> 2017-05-26 12:48:14.090931 7f4af04ab700  1 -- 10.70.39.4:0/349858 --> 10.70.39.2:6807/160560 -- osd_ping(ping e240 stamp 2017-05-26 12:48:14.090789) v2 -- ?+0 0x7f4b25d61600 con 0x7f4b1f71d180
   -34> 2017-05-26 12:48:14.090950 7f4af04ab700  1 -- 10.70.39.4:0/349858 --> 10.70.39.2:6810/160560 -- osd_ping(ping e240 stamp 2017-05-26 12:48:14.090789) v2 -- ?+0 0x7f4b25d60000 con 0x7f4b1f71c580
   -33> 2017-05-26 12:48:14.090970 7f4af04ab700  1 -- 10.70.39.4:0/349858 --> 10.70.39.3:6805/81414 -- osd_ping(ping e240 stamp 2017-05-26 12:48:14.090789) v2 -- ?+0 0x7f4b25d61c00 con 0x7f4b1f71a480
   -32> 2017-05-26 12:48:14.090979 7f4af04ab700  1 -- 10.70.39.4:0/349858 --> 10.70.39.3:6808/81414 -- osd_ping(ping e240 stamp 2017-05-26 12:48:14.090789) v2 -- ?+0 0x7f4b25d62200 con 0x7f4b1fe43780
   -31> 2017-05-26 12:48:14.090999 7f4af04ab700  1 -- 10.70.39.4:0/349858 --> 10.70.39.4:6807/349853 -- osd_ping(ping e240 stamp 2017-05-26 12:48:14.090789) v2 -- ?+0 0x7f4b25d63400 con 0x7f4b1fe41f80
   -30> 2017-05-26 12:48:14.091008 7f4af04ab700  1 -- 10.70.39.4:0/349858 --> 10.70.39.4:6810/349853 -- osd_ping(ping e240 stamp 2017-05-26 12:48:14.090789) v2 -- ?+0 0x7f4b25d63800 con 0x7f4b1fe41e00
   -29> 2017-05-26 12:48:14.091154 7f4ae9a79700  1 -- 10.70.39.4:0/349858 <== osd.1 10.70.39.4:6808/349861 454 ==== osd_ping(ping_reply e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (3107826512 0 0) 0x7f4b1fdfac00 con 0x7f4b1f970000
   -28> 2017-05-26 12:48:14.091203 7f4ae9a79700  1 -- 10.70.39.4:0/349858 <== osd.1 10.70.39.4:6808/349861 455 ==== osd_ping(you_died e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (4125756681 0 0) 0x7f4b1fdfac00 con 0x7f4b1f970000
   -27> 2017-05-26 12:48:14.091234 7f4aeb594700  1 -- 10.70.39.4:0/349858 <== osd.1 10.70.39.4:6811/349861 454 ==== osd_ping(ping_reply e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (3107826512 0 0) 0x7f4b1fdfae00 con 0x7f4b207bac00
   -26> 2017-05-26 12:48:14.091262 7f4aeb594700  1 -- 10.70.39.4:0/349858 <== osd.1 10.70.39.4:6811/349861 455 ==== osd_ping(you_died e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (4125756681 0 0) 0x7f4b1fdfae00 con 0x7f4b207bac00
   -25> 2017-05-26 12:48:14.091404 7f4aea483700  1 -- 10.70.39.4:0/349858 <== osd.6 10.70.39.2:6810/160560 468 ==== osd_ping(ping_reply e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (3107826512 0 0) 0x7f4b25ddc800 con 0x7f4b1f71c580
   -24> 2017-05-26 12:48:14.091442 7f4aeaf8e700  1 -- 10.70.39.4:0/349858 <== osd.3 10.70.39.2:6806/160555 468 ==== osd_ping(ping_reply e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (3107826512 0 0) 0x7f4b25ddb800 con 0x7f4b1fa01b00
   -23> 2017-05-26 12:48:14.091488 7f4aea483700  1 -- 10.70.39.4:0/349858 <== osd.6 10.70.39.2:6810/160560 469 ==== osd_ping(you_died e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (4125756681 0 0) 0x7f4b25ddc800 con 0x7f4b1f71c580
   -22> 2017-05-26 12:48:14.091480 7f4aea786700  1 -- 10.70.39.4:0/349858 <== osd.8 10.70.39.4:6810/349853 454 ==== osd_ping(ping_reply e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (3107826512 0 0) 0x7f4b1fdf9400 con 0x7f4b1fe41e00
   -21> 2017-05-26 12:48:14.091505 7f4ae9473700  1 -- 10.70.39.4:0/349858 <== osd.3 10.70.39.2:6809/160555 468 ==== osd_ping(ping_reply e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (3107826512 0 0) 0x7f4b1fdf9200 con 0x7f4b1fa02e80
   -20> 2017-05-26 12:48:14.091501 7f4ae9978700  1 -- 10.70.39.4:0/349858 <== osd.6 10.70.39.2:6807/160560 468 ==== osd_ping(ping_reply e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (3107826512 0 0) 0x7f4b25ddc000 con 0x7f4b1f71d180
   -19> 2017-05-26 12:48:14.091486 7f4ae8867700  1 -- 10.70.39.4:0/349858 <== osd.4 10.70.39.3:6807/81415 468 ==== osd_ping(ping_reply e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (3107826512 0 0) 0x7f4b25ddbc00 con 0x7f4b1f71a180
   -18> 2017-05-26 12:48:14.091485 7f4aeaf8e700  1 -- 10.70.39.4:0/349858 <== osd.3 10.70.39.2:6806/160555 469 ==== osd_ping(you_died e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (4125756681 0 0) 0x7f4b25ddb800 con 0x7f4b1fa01b00
   -17> 2017-05-26 12:48:14.091486 7f4ae9877700  1 -- 10.70.39.4:0/349858 <== osd.7 10.70.39.3:6805/81414 468 ==== osd_ping(ping_reply e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (3107826512 0 0) 0x7f4b25dda400 con 0x7f4b1f71a480
   -16> 2017-05-26 12:48:14.091548 7f4aea786700  1 -- 10.70.39.4:0/349858 <== osd.8 10.70.39.4:6810/349853 455 ==== osd_ping(you_died e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (4125756681 0 0) 0x7f4b1fdf9400 con 0x7f4b1fe41e00
   -15> 2017-05-26 12:48:14.091555 7f4ae9978700  1 -- 10.70.39.4:0/349858 <== osd.6 10.70.39.2:6807/160560 469 ==== osd_ping(you_died e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (4125756681 0 0) 0x7f4b25ddc000 con 0x7f4b1f71d180
   -14> 2017-05-26 12:48:14.091557 7f4aea685700  1 -- 10.70.39.4:0/349858 <== osd.7 10.70.39.3:6808/81414 468 ==== osd_ping(ping_reply e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (3107826512 0 0) 0x7f4b25dda200 con 0x7f4b1fe43780
   -13> 2017-05-26 12:48:14.091577 7f4ae8867700  1 -- 10.70.39.4:0/349858 <== osd.4 10.70.39.3:6807/81415 469 ==== osd_ping(you_died e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (4125756681 0 0) 0x7f4b25ddbc00 con 0x7f4b1f71a180
   -12> 2017-05-26 12:48:14.091581 7f4aeab8a700  1 -- 10.70.39.4:0/349858 <== osd.4 10.70.39.3:6804/81415 468 ==== osd_ping(ping_reply e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (3107826512 0 0) 0x7f4b25ddda00 con 0x7f4b1f71c280
   -11> 2017-05-26 12:48:14.091591 7f4aea988700  1 -- 10.70.39.4:0/349858 <== osd.2 10.70.39.3:6810/81413 468 ==== osd_ping(ping_reply e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (3107826512 0 0) 0x7f4b25ddb600 con 0x7f4b207ae900
   -10> 2017-05-26 12:48:14.091602 7f4ae9877700  1 -- 10.70.39.4:0/349858 <== osd.7 10.70.39.3:6805/81414 469 ==== osd_ping(you_died e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (4125756681 0 0) 0x7f4b25dda400 con 0x7f4b1f71a480
    -9> 2017-05-26 12:48:14.091606 7f4ae9574700  1 -- 10.70.39.4:0/349858 <== osd.8 10.70.39.4:6807/349853 454 ==== osd_ping(ping_reply e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (3107826512 0 0) 0x7f4b1fdfb800 con 0x7f4b1fe41f80
    -8> 2017-05-26 12:48:14.091615 7f4aea988700  1 -- 10.70.39.4:0/349858 <== osd.2 10.70.39.3:6810/81413 469 ==== osd_ping(you_died e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (4125756681 0 0) 0x7f4b25ddb600 con 0x7f4b207ae900
    -7> 2017-05-26 12:48:14.091620 7f4ae9473700  1 -- 10.70.39.4:0/349858 <== osd.3 10.70.39.2:6809/160555 469 ==== osd_ping(you_died e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (4125756681 0 0) 0x7f4b1fdf9200 con 0x7f4b1fa02e80
    -6> 2017-05-26 12:48:14.091628 7f4aea685700  1 -- 10.70.39.4:0/349858 <== osd.7 10.70.39.3:6808/81414 469 ==== osd_ping(you_died e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (4125756681 0 0) 0x7f4b25dda200 con 0x7f4b1fe43780
    -5> 2017-05-26 12:48:14.091624 7f4aeab8a700  1 -- 10.70.39.4:0/349858 <== osd.4 10.70.39.3:6804/81415 469 ==== osd_ping(you_died e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (4125756681 0 0) 0x7f4b25ddda00 con 0x7f4b1f71c280
    -4> 2017-05-26 12:48:14.091649 7f4ae9574700  1 -- 10.70.39.4:0/349858 <== osd.8 10.70.39.4:6807/349853 455 ==== osd_ping(you_died e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (4125756681 0 0) 0x7f4b1fdfb800 con 0x7f4b1fe41f80
    -3> 2017-05-26 12:48:14.091680 7f4aeb695700  1 -- 10.70.39.4:0/349858 <== osd.2 10.70.39.3:6811/81413 468 ==== osd_ping(ping_reply e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (3107826512 0 0) 0x7f4b1fdf9c00 con 0x7f4b1fa01c80
    -2> 2017-05-26 12:48:14.091765 7f4aeb695700  1 -- 10.70.39.4:0/349858 <== osd.2 10.70.39.3:6811/81413 469 ==== osd_ping(you_died e244 stamp 2017-05-26 12:48:14.090789) v2 ==== 47+0+0 (4125756681 0 0) 0x7f4b1fdf9c00 con 0x7f4b1fa01c80
    -1> 2017-05-26 12:48:16.085088 7f4af44b3700 -1 filestore(/var/lib/ceph/osd/ceph-5) FileStore::read(2.37_head/#2:eca4013c:::benchmark_data_aircobra.lab.eng.blr.redhat.c_136743_object180:head#) pread error: (5) Input/output error
     0> 2017-05-26 12:48:16.088324 7f4af44b3700 -1 os/filestore/FileStore.cc: In function 'virtual int FileStore::read(const coll_t&, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7f4af44b3700 time 2017-05-26 12:48:16.085107
os/filestore/FileStore.cc: 3016: FAILED assert(0 == "eio on pread")

 ceph version 10.2.7-21.el7cp (ebe0fca146985f59e6ab136a860d1f063a26c700)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f4b14b528e5]
 2: (FileStore::read(coll_t const&, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xd91) [0x7f4b148218c1]
 3: (ReplicatedBackend::build_push_op(ObjectRecoveryInfo const&, ObjectRecoveryProgress const&, ObjectRecoveryProgress*, PushOp*, object_stat_sum_t*, bool)+0x26b) [0x7f4b146af90b]
 4: (ReplicatedBackend::handle_pull(pg_shard_t, PullOp&, PushOp*)+0xd4) [0x7f4b146b0c84]
 5: (ReplicatedBackend::do_pull(std::shared_ptr<OpRequest>)+0x1cc) [0x7f4b146b2c0c]
 6: (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x363) [0x7f4b146b93b3]
 7: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x100) [0x7f4b1460e7c0]
 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x41d) [0x7f4b144c41ad]
 9: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x6d) [0x7f4b144c43fd]
 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x77b) [0x7f4b144c7dbb]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x887) [0x7f4b14b42887]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f4b14b447f0]
 13: (()+0x7dc5) [0x7f4b12a6edc5]
 14: (clone()+0x6d) [0x7f4b110fa73d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I have attached the complete log of the affected OSD (osd.5).

Comment 2 Josh Durgin 2017-05-26 14:45:06 UTC
EIO from the filesystem means it's a hardware problem, usually a dying disk.
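
One way to confirm this from an OSD log is to look for the FileStore read failure that precedes the assert. The snippet below is a minimal sketch that assumes the log line format shown in the description; the function name and regular expression are illustrative and not part of Ceph.

```python
import errno
import re

# Matches lines such as:
#   ... filestore(/var/lib/ceph/osd/ceph-5) FileStore::read(<object>) pread error: (5) Input/output error
PREAD_ERROR = re.compile(
    r"filestore\((?P<path>[^)]+)\) FileStore::read\((?P<obj>.+)\) "
    r"pread error: \((?P<code>\d+)\) (?P<msg>.+)$"
)


def find_eio_reads(lines):
    """Return (osd_data_path, object_name) for each pread failure with errno EIO."""
    hits = []
    for line in lines:
        match = PREAD_ERROR.search(line)
        if match and int(match.group("code")) == errno.EIO:
            hits.append((match.group("path"), match.group("obj")))
    return hits


sample = (
    "    -1> 2017-05-26 12:48:16.085088 7f4af44b3700 -1 "
    "filestore(/var/lib/ceph/osd/ceph-5) FileStore::read(2.37_head/#2:eca4013c:::"
    "benchmark_data_aircobra.lab.eng.blr.redhat.c_136743_object180:head#) "
    "pread error: (5) Input/output error"
)
for path, obj in find_eio_reads([sample]):
    print(path, obj)
```

If a match is found, the data path identifies which OSD's backing disk to inspect, for example through the kernel log and the disk's SMART data on that host.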

