Created attachment 1029363 [details]
OSD log and Mon logs

Description of problem:
OSDs are crashing while I was driving heavy I/O from RBD clients.

Version-Release number of selected component (if applicable):
ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)

How reproducible:
Not sure; will try to reproduce again.

Steps to Reproduce:
1. Create 4 rbd images and attach them to VMs, with the object map feature enabled.
2. Boot all the VMs and start OS installation in all of them at once.
3. Create 10 rbd images.
4. Once the installations kicked in, I started rbd bench-write on all 10 of my OSD hosts, using the rbd images created in step 3 (see the sketch below):
   rbd bench-write --io-pattern rand Tanay-RBD/WBClone_new5

Actual results:
OSDs are crashing.

 -2  2.00000     host OSD1
  0  0.39999         osd.0   down        0          1.00000
  3  0.39999         osd.3   down        0          1.00000
  6  0.39999         osd.6   down        0          1.00000
  9  0.39999         osd.9   down        0          1.00000
 12  0.39999         osd.12  down        0          1.00000

bt:
     0> 2015-05-23 13:28:21.836189 7f906020e700 -1 *** Caught signal (Aborted) **
 in thread 7f906020e700

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-osd() [0x9f6562]
 2: (()+0xf130) [0x7f906d6c0130]
 3: (gsignal()+0x37) [0x7f906c0da5d7]
 4: (abort()+0x148) [0x7f906c0dbcc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f906c9de9b5]
 6: (()+0x5e926) [0x7f906c9dc926]
 7: (()+0x5e953) [0x7f906c9dc953]
 8: (()+0x5eb73) [0x7f906c9dcb73]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xaf640a]
 10: (SyncEntryTimeout::finish(int)+0xbd) [0x8d83bd]
 11: (Context::complete(int)+0x9) [0x67c759]
 12: (SafeTimer::timer_thread()+0x104) [0xadf804]
 13: (SafeTimerThread::entry()+0xd) [0xae07bd]
 14: (()+0x7df5) [0x7f906d6b8df5]
 15: (clone()+0x6d) [0x7f906c19b1ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Expected results:
There should not be any crash.

Additional info:
I started all the bench-write threads and the installation process; this morning I am seeing the OSDs crashed.
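For reference, a minimal sketch of how the bench-write step can be driven across the ten images. The image names other than WBClone_new5 and the loop itself are assumptions about the setup, not the exact commands that were run:

# Assumed image names WBClone_new1..WBClone_new10 in pool Tanay-RBD;
# launch one random-write bench per image in the background.
for i in $(seq 1 10); do
    rbd bench-write --io-pattern rand Tanay-RBD/WBClone_new${i} &
done
wait   # keep the shell open until all bench runs finish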
[root@OSD1 logs]# ceph osd tree
ID  WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 15.59984 root default
 -2  2.00000     host OSD1
  0  0.39999         osd.0     down        0          1.00000
  3  0.39999         osd.3     down        0          1.00000
  6  0.39999         osd.6     down        0          1.00000
  9  0.39999         osd.9     down        0          1.00000
 12  0.39999         osd.12    down        0          1.00000
 -3  1.99997     host OSD2
  1  0.39999         osd.1       up  1.00000          1.00000
  4  0.39999         osd.4       up  1.00000          1.00000
  7  0.39999         osd.7       up  1.00000          1.00000
 10  0.39999         osd.10      up  1.00000          1.00000
 13  0.39999         osd.13      up  1.00000          1.00000
 -4  2.00000     host OSD3
  2  0.39999         osd.2       up  1.00000          1.00000
  5  0.39999         osd.5       up  1.00000          1.00000
  8  0.39999         osd.8       up  1.00000          1.00000
 11  0.39999         osd.11      up  1.00000          1.00000
 14  0.39999         osd.14      up  1.00000          1.00000
 -5  1.20000     host ADMIN
 15  0.39999         osd.15      up  1.00000          1.00000
 16  0.39999         osd.16      up  1.00000          1.00000
 17  0.39999         osd.17      up  1.00000          1.00000
 -6  1.19998     host OSD5
 33  0.39999         osd.33      up  1.00000          1.00000
 34  0.39999         osd.34      up  1.00000          1.00000
 35  0.39999         osd.35      up  1.00000          1.00000
 -7  1.19998     host OSD4
 30  0.39999         osd.30      up  1.00000          1.00000
 31  0.39999         osd.31      up  1.00000          1.00000
 32  0.39999         osd.32      up  1.00000          1.00000
 -8  1.19998     host OSD6
 36  0.39999         osd.36      up  1.00000          1.00000
 37  0.39999         osd.37      up  1.00000          1.00000
 38  0.39999         osd.38      up  1.00000          1.00000
 -9  1.19998     host OSD7
 39  0.39999         osd.39      up  1.00000          1.00000
 40  0.39999         osd.40      up  1.00000          1.00000
 41  0.39999         osd.41      up  1.00000          1.00000
-10  1.19998     host OSD8
 42  0.39999         osd.42      up  1.00000          1.00000
 43  0.39999         osd.43      up  1.00000          1.00000
 44  0.39999         osd.44      up  1.00000          1.00000
-11  1.19998     host OSD9
 18  0.39999         osd.18      up  1.00000          1.00000
 19  0.39999         osd.19      up  1.00000          1.00000
 20  0.39999         osd.20      up  1.00000          1.00000
-12  1.19998     host OSD10
 21  0.39999         osd.21      up  1.00000          1.00000
 22  0.39999         osd.22      up  1.00000          1.00000
 23  0.39999         osd.23      up  1.00000          1.00000

ceph -s
    cluster 1ceee8cc-75a3-4b10-a24f-6be6bbaa6a99
     health HEALTH_WARN
            25 requests are blocked > 32 sec
     monmap e1: 1 mons at {MON1=10.12.27.26:6789/0}
            election epoch 1, quorum 0 MON1
     osdmap e10915: 39 osds: 34 up, 34 in
      pgmap v745916: 4608 pgs, 2 pools, 115 GB data, 31971 objects
            2391 GB used, 11450 GB / 13842 GB avail
                4608 active+clean
  client io 191 kB/s rd, 121 kB/s wr, 107 op/s

Attaching the logs of the OSDs, the MON and the ceph log for further investigation.
The sub-component should be RADOS. It looks like you hit the filestore sync timeout (which defaults to 10 minutes); that indicates severely slow I/O on the OSD.
Apparently, you are exceeding the filestore_commit_timeout threshold, which defaults to 600 s -- meaning that a syncfs(2) took longer than 10 minutes. I think the disk on which this is happening is probably just bad, or the disk/node is hilariously overloaded. I guess this is still the Atom systems?
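For what it's worth, a quick way to check whether the disks behind the down OSDs are failing or saturated; the device name below is an assumption, substitute the data disks backing osd.0/3/6/9/12 on host OSD1:

# Watch per-device utilization and await times on the affected host (sysstat).
iostat -x 5 3
# Check the drive's SMART health (replace /dev/sdb with the actual OSD data disk).
smartctl -a /dev/sdb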
Sam, is an OSD crash expected behavior in this case? Shouldn't it be handled gracefully?
It crashes because a syncfs(2) taking 10 minutes usually indicates that the storage has died, so the OSD should commit suicide. The threshold is a config option (filestore_commit_timeout).
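For anyone hitting this, a sketch of how the option can be inspected and, if really needed, raised. osd.0 and the 1200 s value are just examples, and raising the timeout only papers over genuinely slow storage:

# Show the current value via the admin socket (run on the host where osd.0 lives).
ceph daemon osd.0 config show | grep filestore_commit_timeout
# Raise it at runtime on one OSD (seconds; some options only take effect after an OSD restart).
ceph tell osd.0 injectargs '--filestore_commit_timeout 1200'
# Or persist it in ceph.conf on the OSD hosts:
# [osd]
#     filestore commit timeout = 1200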
Crashing with ceph_abort() seems extreme to me, because it's hard for a non-developer user to know what happened. All that users will see is an obfuscated stack trace with __ceph_assert_fail() somewhere in there. It would be nice if the OSDs logged what happened -- the log message could even point at the filestore_commit_timeout setting -- and then shut down gracefully. John, in terms of documentation, I guess we need a knowledge-base article with this information? Something like "if users see a crash in SyncEntryTimeout::finish, it means that the storage hardware has died and the OSD refused to continue to run. The timeout is configurable via the filestore_commit_timeout setting". Does that sound right to you?
We probably should add some logging to indicate what is going on, but a graceful shutdown is impossible since the disk is not able to commit.
http://tracker.ceph.com/issues/11823
This hasn't been done upstream yet. You probably want to push it for now.
Thanks Sam, re-targeting to 1.3.2.
Clearing the needinfo flag.