| Summary: | osd crash with Segmentation fault in handle_osd_ping | ||
|---|---|---|---|
| Product: | Red Hat Ceph Storage | Reporter: | Vikhyat Umrao <vumrao> |
| Component: | RADOS | Assignee: | David Zafman <dzafman> |
| Status: | CLOSED WORKSFORME | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 1.3.2 | CC: | bhubbard, ceph-eng-bugs, dzafman, kchai, kdreyer, vumrao |
| Target Milestone: | rc | ||
| Target Release: | 1.3.4 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-05-25 18:09:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
```
3: (OSD::handle_osd_ping(MOSDPing*)+0x75a) [0x659e7a]
```

That's here:

```
3652     }
3653
3654     Message *r = new MOSDPing(monc->get_fsid(),
   0x0000000000659e73 <+1875>:  mov    0x160(%rbx),%r12

3655                            curmap->get_epoch(),
   0x0000000000659e5b <+1851>:  mov    0x10(%rsp),%rax
   0x0000000000659e7a <+1882>:  mov    0x10(%rax),%eax   <----HERE
   0x0000000000659e85 <+1893>:  mov    %eax,0x8(%rsp)

3656                            MOSDPing::PING_REPLY,
3657                            m->stamp);
```

If the "epoch" field in an OSDMap class is stored at offset 0x10 (I don't have time to check this and will be out for most of the next week and a half) then it looks like it is curmap or its "epoch" field which is causing the crash.

HTH.

(In reply to Brad Hubbard from comment #6)
> If the "epoch" field in an OSDMap class is stored at offset 0x10 then it
> looks like it is curmap or its "epoch" field which is causing the crash.

Thanks Brad.
I tried with your KCS: https://access.redhat.com/articles/1614603

```
$ objdump -rdS ./ceph-osd > /root/dump.txt
$ grep -m1 -B5 handle_osd_ping dump.txt
  659715:  eb ac                  jmp    6596c3 <_ZN10OSDService24send_message_osd_clusterEiP7Messagej+0x2f3>
  659717:  90                     nop
  659718:  0f 1f 84 00 00 00 00   nopl   0x0(%rax,%rax,1)
  65971f:  00

0000000000659720 <_ZN3OSD15handle_osd_pingEP8MOSDPing>:

$ c++filt _ZN3OSD15handle_osd_pingEP8MOSDPing
OSD::handle_osd_ping(MOSDPing*)

$ echo 'obase=16;ibase=16;659720+75A'|bc
659E7A

# grep 659e7a -A5 -B35 dump.txt
  659de5:  00
  659de6:  e8 45 68 fb ff         callq  610630 <_ZNSt8ios_baseD2Ev@plt>
  659deb:  48 8b 44 24 10         mov    0x10(%rsp),%rax
  659df0:  31 d2                  xor    %edx,%edx
  659df2:  48 89 df               mov    %rbx,%rdi
  659df5:  8b 40 10               mov    0x10(%rax),%eax
  659df8:  8d 70 01               lea    0x1(%rax),%esi
  659dfb:  e8 c0 e8 fe ff         callq  6486c0 <_ZN3OSD16osdmap_subscribeEmb>
  659e00:  e9 fb fb ff ff         jmpq   659a00 <_ZN3OSD15handle_osd_pingEP8MOSDPing+0x2e0>
  659e05:  0f 1f 00               nopl   (%rax)
  659e08:  48 39 ce               cmp    %rcx,%rsi
  659e0b:  0f 84 37 03 00 00      je     65a148 <_ZN3OSD15handle_osd_pingEP8MOSDPing+0xa28>
  659e11:  39 51 20               cmp    %edx,0x20(%rcx)
  659e14:  0f 8f 2e 03 00 00      jg     65a148 <_ZN3OSD15handle_osd_pingEP8MOSDPing+0xa28>
  659e1a:  48 8d 44 24 6c         lea    0x6c(%rsp),%rax
  659e1f:  48 89 ef               mov    %rbp,%rdi
  659e22:  48 89 c6               mov    %rax,%rsi
  659e25:  48 89 44 24 20         mov    %rax,0x20(%rsp)
  659e2a:  e8 61 0f 06 00         callq  6bad90 <_ZNSt3mapIiiSt4lessIiESaISt4pairIKiiEEEixERS3_>
  659e2f:  8b 00                  mov    (%rax),%eax
  659e31:  85 c0                  test   %eax,%eax
  659e33:  0f 85 37 07 00 00      jne    65a570 <_ZN3OSD15handle_osd_pingEP8MOSDPing+0xe50>
  659e39:  48 8b 74 24 20         mov    0x20(%rsp),%rsi
  659e3e:  48 89 ef               mov    %rbp,%rdi
  659e41:  e8 da 5e 04 00         callq  69fd20 <_ZNSt8_Rb_treeIiSt4pairIKiiESt10_Select1stIS2_ESt4lessIiESaIS2_EE5eraseERS1_>
  659e46:  48 8b 43 08            mov    0x8(%rbx),%rax
  659e4a:  48 8b 78 58            mov    0x58(%rax),%rdi
  659e4e:  e8 3d 32 3e 00         callq  a3d090 <_ZN4ceph12HeartbeatMap10is_healthyEv>
  659e53:  84 c0                  test   %al,%al
  659e55:  0f 84 05 02 00 00      je     65a060 <_ZN3OSD15handle_osd_pingEP8MOSDPing+0x940>
```
```
  659e5b:  48 8b 44 24 10         mov    0x10(%rsp),%rax
  659e60:  41 8b 8f dc 01 00 00   mov    0x1dc(%r15),%ecx
  659e67:  bf e8 01 00 00         mov    $0x1e8,%edi
  659e6c:  41 8b 97 e0 01 00 00   mov    0x1e0(%r15),%edx
  659e73:  4c 8b a3 60 01 00 00   mov    0x160(%rbx),%r12
  659e7a:  8b 40 10               mov    0x10(%rax),%eax   <==============
  659e7d:  89 4c 24 28            mov    %ecx,0x28(%rsp)
  659e81:  89 54 24 20            mov    %edx,0x20(%rsp)
  659e85:  89 44 24 08            mov    %eax,0x8(%rsp)
  659e89:  e8 62 6a fb ff         callq  6108f0 <_Znwm@plt>
  659e8e:  b9 01 00 00 00         mov    $0x1,%ecx
```

> If the "epoch" field in an OSDMap class is stored at offset 0x10 then it
> looks like it is curmap or its "epoch" field which is causing the crash.

For this I looked into the code, as I was not able to check whether the OSDMap "epoch" field is stored at offset 0x10. The call in question is:

```
curmap->get_epoch(),
```

As per my understanding we won't crash inside get_epoch() itself:

```
epoch_t get_epoch() const { return epoch; }
```

But this could:

```
OSDMapRef curmap = service.get_osdmap();   <==================
```

OSD.h 391:

```
OSDMapRef get_osdmap() {
  Mutex::Locker l(publish_lock);
  return osdmap;
}
```

PG.h 227:

```
OSDMapRef get_osdmap() const {
  assert(is_locked());
  assert(osdmap_ref);
  return osdmap_ref;
}
```

(In reply to Vikhyat Umrao from comment #8)

I find myself with a small window of time today (I'm on PTO)...

> 659e7a:  8b 40 10               mov    0x10(%rax),%eax   <==============
> 659e7d:  89 4c 24 28            mov    %ecx,0x28(%rsp)
> 659e81:  89 54 24 20            mov    %edx,0x20(%rsp)
> 659e85:  89 44 24 08            mov    %eax,0x8(%rsp)
> 659e89:  e8 62 6a fb ff         callq  6108f0 <_Znwm@plt>
> 659e8e:  b9 01 00 00 00         mov    $0x1,%ecx

Nice work. I just did a "disass /m 0x659e7a" after downloading all matching binaries to cwd, unpacking them, and then entering the following in gdb.

```
(gdb) set debug-file-directory usr/lib/debug
(gdb) set use-deprecated-index-sections on
(gdb) set solib-search-path usr/lib64
(gdb) set sysroot .
```
```
(gdb) file ./usr/bin/ceph-osd
Reading symbols from ./usr/bin/ceph-osd...Reading symbols from /home/brad/working/scratch/bz1328377/usr/lib/debug/usr/bin/ceph-osd.debug...done.
done.
```

> > If the "epoch" field in an OSDMap class is stored at offset 0x10 then it
> > looks like it is curmap or its "epoch" field which is causing the crash.
>
> For this I looked into the code as I was not able to check "OSDMap class is
> stored at offset 0x10"

The main reason I'm looking at this this morning is that it came to me last night that this is easy (once you know how) and I should have been able to do it yesterday.

```
(gdb) p ((OSDMap*)0x0)->epoch
Cannot access memory at address 0x10
```

So the offset of the "epoch" field in an OSDMap class *is* 0x10, and this reinforces my theory.

> Then I looked into the code:
>
> curmap->get_epoch(),
>
> In this code, as per my understanding, we won't crash inside "get_epoch()":
>
> epoch_t get_epoch() const { return epoch; }

Well, here's what I believe is happening. Because get_epoch is such a simple member function the compiler can optimise it away and essentially replace it with a single instruction:

```
mov    0x10(%rax),%eax
```

So this *is* the implementation of get_epoch, and since this is a segfault the instruction pointer should be on the exact instruction that caused the segfault, not a return address.

Since this is the first use of curmap after it is returned by get_osdmap, it could be that the entire OSDMap instance is corrupt but, since this appears to be protected by a lock, I'd suggest it is more likely that just "epoch" is corrupt. Either way I'd suggest a code review to see if there can be a race on curmap or "epoch", or if some other situation could lead to this. The logs should give hints to the code paths involved but it may be difficult to spot. The only other suggestions I have are to capture a core file, which may pinpoint a pair of racing threads, or to run the ceph-osd under valgrind, which may not be practical/possible.
I'm out of time again I'm afraid. HTH.

(In reply to Brad Hubbard from comment #9)
> Since this is the first use of curmap after it is returned by get_osdmap it
> could be the entire OSDMap instance is corrupt but, since this appears to be
> protected by a lock, I'd suggest it is more likely that just "epoch" is
> corrupt.

I retract this statement :)

I'm now leaning towards "curmap" (an OSDMapRef) being NULL or invalid. Here's why...

I found this, it's ancient and fixed long ago, but is amazingly similar.

http://www.spinics.net/lists/ceph-devel/msg07677.html

The fix in that case was the following.

```
$ git show 76efd9772c60b93bbf632e3ecc3b9117dc081427
commit 76efd9772c60b93bbf632e3ecc3b9117dc081427
Author: Samuel Just <sam.just>
Date:   Wed Jul 18 09:26:11 2012 -0700

    OSD: publish_map in init to initialize OSDService map

    Other areas rely on OSDService::get_map() to function, possibly before
    activate_map is first called. In particular, with handle_osd_ping,
    not initializing the map member results in:

    ceph version 0.48argonaut-413-g90ddc5a (commit:90ddc5ae51627e7656459085d7e15105c8b8316d)
    1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x71ba9a]
    2: (()+0xfcb0) [0x7fcd8243dcb0]
    3: (OSD::handle_osd_ping(MOSDPing*)+0x74d) [0x5dbdfd]
    4: (OSD::heartbeat_dispatch(Message*)+0x22b) [0x5dc70b]
    5: (SimpleMessenger::DispatchQueue::entry()+0x92b) [0x7b5b3b]
    6: (SimpleMessenger::dispatch_entry()+0x24) [0x7b6914]
    7: (SimpleMessenger::DispatchThread::entry()+0xd) [0x7762fd]
    8: (()+0x7e9a) [0x7fcd82435e9a]
    9: (clone()+0x6d) [0x7fcd809ea4bd]
    NOTE: a copy of the executable, or `objdump -rdS <executable>` is
    needed to interpret this.

    Signed-off-by: Samuel Just <sam.just>

diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 4acd82c..17e14f4 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -819,6 +819,7 @@ int OSD::init()
     return -EINVAL;
   }
   osdmap = get_map(superblock.current_epoch);
+  service.publish_map(osdmap);

   bind_epoch = osdmap->get_epoch();
```

So I looked at the current incarnation of this function...

```
1789 int OSD::init()
1790 {
1791   CompatSet initial, diff;
1792   Mutex::Locker lock(osd_lock);
----8<----
1886   osdmap = get_map(superblock.current_epoch);    <----NOTE
1887   check_osdmap_features(store);
1888
1889   create_recoverystate_perf();
1890
1891   {
1892     epoch_t bind_epoch = osdmap->get_epoch();
1893     service.set_epochs(NULL, NULL, &bind_epoch);
1894   }
1895
1896   // load up pgs (as they previously existed)
1897   load_pgs();
1898
1899   dout(2) << "superblock: i am osd." << superblock.whoami << dendl;
1900
1901   create_logger();
1902
1903   // i'm ready!
1904   client_messenger->add_dispatcher_head(this);
1905   cluster_messenger->add_dispatcher_head(this);
1906
1907   hbclient_messenger->add_dispatcher_head(&heartbeat_dispatcher);
1908   hb_front_server_messenger->add_dispatcher_head(&heartbeat_dispatcher);
1909   hb_back_server_messenger->add_dispatcher_head(&heartbeat_dispatcher);
1910
1911   objecter_messenger->add_dispatcher_head(service.objecter);
1912
1913   monc->set_want_keys(CEPH_ENTITY_TYPE_MON | CEPH_ENTITY_TYPE_OSD);
1914   r = monc->init();
1915   if (r < 0)
1916     goto out;
1917
1918   // tell monc about log_client so it will know about mon session resets
1919   monc->set_log_client(&log_client);
1920   update_log_config();
1921
1922   osd_tp.start();
1923   osd_op_tp.start();
1924   recovery_tp.start();
1925   disk_tp.start();
1926   command_tp.start();
1927
1928   set_disk_tp_priority();
1929
1930   // start the heartbeat
1931   heartbeat_thread.create();    <----NOTE
1932
1933   // tick
1934   tick_timer.add_event_after(cct->_conf->osd_heartbeat_interval, new C_Tick(this));
1935
1936   service.init();
1937   service.publish_map(osdmap);    <----NOTE
1938   service.publish_superblock(superblock);
```

Note that we see "superblock: i am osd.43" in the above logs, and I guess we must have started the heartbeat_thread because we are in "OSD::heartbeat_dispatch", but have we published the map yet? There is only around one hundredth of a second between the "i am osd" output and the crash...

We can see that a lot of code has crept in between when we get the map and when we actually publish it. I wonder if this has opened up an opportunity to race? The osd_lock is locked during this whole period, but perhaps this is not enough? Perhaps if a ping comes in within that window we see this issue? One possible thing to try would be to move the block 1936-1938 to sometime before line 1931 (heartbeat_thread.create()). IOW, are we calling service.get_osdmap() before we have called service.publish_map(osdmap)?

Vikhyat, it might be worth seeing if Sam would have time to look at my theory and see whether the suggested reordering might resolve the race condition? Unfortunately I'm not back until next Tuesday but, if it can wait until then, I'd be happy to create a test package with the modifications suggested above.

I too was thinking along those lines. To verify, I added a sleep right before the first map is published on osd.0, then started it and a bunch more OSDs. It did not reproduce the problem.

(In reply to Brad Hubbard from comment #13)
> I'm now leaning towards "curmap" (an OSDMapRef) being NULL or invalid.
> ----8<----
> IOW, are we calling service.get_osdmap() before we have called
> service.publish_map(osdmap)?

(In reply to David Zafman from comment #14)
> I too was thinking along those lines. To verify I added a sleep right
> before the first map is published on osd.0 then started it and a bunch more
> OSDs. It did not reproduce the problem.

Hmmm, cool. Nice idea. We must be missing something along those lines though I think...
Description of problem:

osd crash with Segmentation fault in handle_osd_ping

```
-1> 2016-04-18 12:04:55.912562 7f9e871a2700  1 -- 192.168.0.46:6823/6476 <== osd.21 192.168.0.43:0/6251 1 ==== osd_ping(ping e3580 stamp 2016-04-18 12:04:48.875455) v2 ==== 47+0+0 (1708338583 0 0) 0x669d1e00 con 0x663d31e0
 0> 2016-04-18 12:04:55.913613 7f9e871a2700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f9e871a2700

 ceph version 0.94.5-9.el7cp (deef183a81111fa5e128ec88c90a32c9587c615d)
 1: /usr/bin/ceph-osd() [0xa05e32]
 2: (()+0xf100) [0x7f9ea295c100]
 3: (OSD::handle_osd_ping(MOSDPing*)+0x75a) [0x659e7a]
 4: (OSD::heartbeat_dispatch(Message*)+0x2fb) [0x65b0cb]
 5: (DispatchQueue::entry()+0x62a) [0xbc2aba]
 6: (DispatchQueue::DispatchThread::entry()+0xd) [0xae572d]
 7: (()+0x7dc5) [0x7f9ea2954dc5]
 8: (clone()+0x6d) [0x7f9ea143528d]
```

Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 1.3.2
ceph version 0.94.5-9.el7cp