Description of problem:
The customer has ceph-xxx-16.2.7-112.el8cp.x86_6 (Ceph 5.1) installed. They are using the Ceph iSCSI gateway to serve out LUNs for their RHV hosts. To get this working they have a non-RHV test host mapping LUNs. Running a fio test ends up losing connections, and the tcmu-runner processes segfault.

Version-Release number of selected component (if applicable):
ceph-xxx-16.2.7-112.el8cp.x86_6 (Ceph 5.1)

How reproducible:
Has happened multiple times.

Steps to Reproduce:
1. Discover and map iSCSI LUNs for the test client from the iSCSI gateway
2. Run a fio test to generate load
3. The tcmu-runner processes segfault

Actual results:
We lose connectivity when the processes segfault.

Expected results:
If this is a timing issue then we may expect connection issues due to timeouts, but we should not get segfaults.

Additional info:
I have been struggling to get a useful backtrace from the core files. There is no debuginfo for tcmu-runner, and I built from source using rpmbuild but still could not produce a build with symbols to use.
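For reference, load of the kind described in step 2 of the reproducer can be generated with a fio job along these lines. This is an illustrative sketch only: the device path, block size, queue depth, and job names are assumptions, not the customer's actual job.

```ini
; hypothetical-iscsi-stress.fio -- illustrative only, not the customer's job
[global]
ioengine=libaio
direct=1
time_based=1
runtime=300

[iscsi-lun-stress]
; assumed multipath device sitting on top of the iSCSI LUN
filename=/dev/mapper/mpatha
rw=randrw
bs=64k
iodepth=32
numjobs=4
```

Mixed random read/write at a moderate queue depth is the sort of workload that keeps many commands in flight, which is where the timeout/reopen path in tcmu-runner gets exercised.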
I tried on an actual Ceph node and on the system where I built the package.

The logs seen so far:

2023-03-15 14:47:11.081 7:cmdproc-uio0 [DEBUG] handle_unmap_internal:414 rbd/rhv-1.rhv-lun-1: Parameter list 0, start lba: 16353280, end lba: 16386047, nlbas: 32768
2023-03-15 14:47:11.081 7:cmdproc-uio0 [DEBUG] align_and_split_unmap:332 rbd/rhv-1.rhv-lun-1: OPTIMAL UNMAP GRANULARITY: 32768, UNMAP GRANULARITY ALIGNMENT mask: 0, lbas: 32768
2023-03-15 14:47:11.081 7:cmdproc-uio0 [DEBUG] align_and_split_unmap:356 rbd/rhv-1.rhv-lun-1: The first split: start lba: 16353280, end lba: 16386047, lbas: 32768
2023-03-15 14:47:11.081 7:cmdproc-uio0 [DEBUG] align_and_split_unmap:361 rbd/rhv-1.rhv-lun-1: The last split: start lba: 16353280, end lba: 16386047, lbas: 32768
2023-03-15 14:47:11.081 7:cmdproc-uio0 [DEBUG] align_and_split_unmap:362 rbd/rhv-1.rhv-lun-1: There are totally 1 splits
2023-03-15 14:47:11.095 7:cmdproc-uio0 [DEBUG] handle_writesame_check:671 rbd/rhv-1.rhv-lun-1: Start lba: 8194096, number of lba: 20480, last lba: 8214575
2023-03-15 14:47:11.096 7:cmdproc-uio0 [DEBUG] tcmu_rbd_aio_writesame:1600 rbd/rhv-1.rhv-lun-1: Start write same off:4195377152, len:10485760
2023-03-15 14:47:55.861 7:cmdproc-uio0 [DEBUG] tcmu_rbd_has_lock:710 rbd/rhv-1.rhv-lun-1: Is owner
2023-03-15 14:47:55.865 7:cmdproc-uio0 [DEBUG] tcmu_rbd_has_lock:710 rbd/rhv-1.rhv-lun-1: Is owner
2023-03-15 14:50:21.755 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:21.755 7:io_context_pool [ERROR] __tcmu_notify_conn_lost:214 rbd/rhv-1.rhv-lun-1: Handler connection lost (lock state 5)
2023-03-15 14:50:21.757 7:ework-thread [DEBUG] tgt_port_grp_recovery_work_fn:224: Disabling iscsi/iqn.2001-07.com.ceph:rhv-lun/tpgt_1.
2023-03-15 14:50:21.782 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:21.782 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:21.782 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:21.866 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:22.055 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:22.056 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:22.056 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:22.056 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:22.566 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:23.189 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:23.189 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:23.189 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:23.189 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:23.190 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:23.191 7:ework-thread [INFO] tgt_port_grp_recovery_work_fn:246: Disabled iscsi/iqn.2001-07.com.ceph:rhv-lun/tpgt_1.
2023-03-15 14:50:23.191 7:ework-thread [DEBUG] tcmu_dev_flush_ring:385 rbd/rhv-1.rhv-lun-1: waiting for ring to clear
2023-03-15 14:50:23.191 7:ework-thread [DEBUG] tcmu_dev_flush_ring:388 rbd/rhv-1.rhv-lun-1: ring clear
2023-03-15 14:50:23.191 7:ework-thread [DEBUG] __tcmu_reopen_dev:77 rbd/rhv-1.rhv-lun-1: Waiting for outstanding commands to complete
2023-03-15 14:50:23.191 7:ework-thread [DEBUG] __tcmu_reopen_dev:85 rbd/rhv-1.rhv-lun-1: Closing device.
2023-03-15 14:50:23.203 7:ework-thread [INFO] tcmu_rbd_close:1240 rbd/rhv-1.rhv-lun-1: appended blocklist entry: {192.168.56.49:0/4051956005}
2023-03-15 14:50:23.203 7:ework-thread [DEBUG] __tcmu_reopen_dev:95 rbd/rhv-1.rhv-lun-1: Opening device. Attempt 0
2023-03-15 14:50:23.203 7:ework-thread [DEBUG] tcmu_rbd_open:1088 rbd/rhv-1.rhv-lun-1: tcmu_rbd_open config rbd/rhv-1/rhv-lun-1;osd_op_timeout=30;id=iscsi.iscsi.ceph-iscsi-1.tuzhcv block size 512 num lbas 4294967296.

The efforts to get a proper stack dump:

warning: Could not load shared library symbols for /lib64/liblttng-ust-tracepoint.so.0.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/tcmu-runner'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f858c2b60f4 in librados::IoCtx::aio_write(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, librados::AioCompletion*, ceph::buffer::list const&, unsigned long, unsigned long) [clone .cold.1354] () from /lib64/librados.so.2
[Current thread is 1 (Thread 0x7f8571fe5700 (LWP 1953))]
Missing separate debuginfos, use: yum debuginfo-install tcmu-runner-1.5.4-4.el8.x86_64

And then on an actual ceph node it's different:

warning: .dynamic section for "/lib64/librados.so.2" is not at the expected address (wrong library or version mismatch?)
warning: .dynamic section for "/usr/lib64/ceph/libceph-common.so.2" is not at the expected address (wrong library or version mismatch?)
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/tcmu-runner'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f3fa0bbe0f4 in ceph::async::detail::CompletionImpl<boost::asio::io_context::executor_type, librados::v14_2_0::RadosClient::mon_command_async(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, ceph::buffer::v15_2_0::list const&, ceph::buffer::v15_2_0::list*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, Context*)::{lambda(boost::system::error_code, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, ceph::buffer::v15_2_0::list&&)#1}, void, boost::system::error_code, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list>::~CompletionImpl() () from /lib64/librados.so.2
[Current thread is 1 (Thread 0x7f3f8b8f7700 (LWP 885))]
Missing separate debuginfos, use: yum debuginfo-install tcmu-runner-1.5.4-4.el8cp.x86_64

I will make the core files available on my people page in the next update.

Regards
Laurence Oberman
Hello Ilya

Yep, I asked for the messages logs yesterday and they only just provided them. I will update the bug as soon as I extract what you are looking for.

I am uploading the cores to my people page.
My people page has the current core files:
https://people.redhat.com/loberman/customer/.bz2178838/

I will now find the segfault messages.
Hello Ilya

Here we go:

messages-20230219:Feb 16 05:46:31 ceph-iscsi-2 kernel: ework-thread[2702117]: segfault at b0 ip 00007f709f30c0f4 sp 00007f70778239b8 error 4 in librados.so.2.0.0[7f709f25a000+1a4000]
messages-20230219:Feb 18 01:12:47 ceph-iscsi-2 kernel: ework-thread[3374310]: segfault at b0 ip 00007fb1c6a1c0f4 sp 00007fb1a9f479b8 error 4 in librados.so.2.0.0[7fb1c696a000+1a4000]
messages-20230226:Feb 21 04:46:50 ceph-iscsi-2 kernel: ework-thread[3960676]: segfault at b0 ip 00007f3e15a3d0f4 sp 00007f3dedf549b8 error 4 in librados.so.2.0.0[7f3e1598b000+1a4000]
messages-20230305:Mar 3 11:10:49 ceph-iscsi-2 kernel: ework-thread[1440959]: segfault at b0 ip 00007fb19dd200f4 sp 00007fb18ba5c9b8 error 4 in librados.so.2.0.0[7fb19dc6e000+1a4000]

Best Regards
Laurence
(In reply to loberman from comment #4)
> Hello Illya
>
> Here we go
>
> messages-20230219:Feb 16 05:46:31 ceph-iscsi-2 kernel:
> ework-thread[2702117]: segfault at b0 ip 00007f709f30c0f4 sp
> 00007f70778239b8 error 4 in librados.so.2.0.0[7f709f25a000+1a4000]
> messages-20230219:Feb 18 01:12:47 ceph-iscsi-2 kernel:

Hi, just for completeness, for the host ceph-iscsi-1 the segfault line is:

Mar 15 16:18:01 ceph-iscsi-1 kernel: ework-thread[938865]: segfault at b0 ip 00007f858c2b60f4 sp 00007f8571fe49b8 error 4 in librados.so.2.0.0[7f858c204000+1a4000]

This should correspond to the coredump collected for this host yesterday.

We also have the debug 4 tcmu-runner.log, which I attach to this BZ. Please note that there is a difference in (probably) the timezone configuration between the host and the container, as the container tcmu-runner.log shows the time one hour earlier.
Always at the same offset, +1a4000. Of course the start user-mode address will change (7f709f25a000 here).
Thanks Ilya, I am still messing with trying to get a useful backtrace.

Brad, I did rebuild tcmu-runner to see if I could fudge a way to get all the symbols, but it's C++ not C, so that in itself was a challenge: where to add the g++ -g option to get the binary to include symbols etc., and how to include the library symbols too.

The offset is always +1a4000:

Feb 16 05:46:31 ceph-iscsi-2 kernel: ework-thread[2702117]: segfault at b0 ip 00007f709f30c0f4 sp 00007f70778239b8 error 4 in librados.so.2.0.0[7f709f25a000+1a4000]
Feb 18 01:12:47 ceph-iscsi-2 kernel: ework-thread[3374310]: segfault at b0 ip 00007fb1c6a1c0f4 sp 00007fb1a9f479b8 error 4 in librados.so.2.0.0[7fb1c696a000+1a4000]
Feb 21 04:46:50 ceph-iscsi-2 kernel: ework-thread[3960676]: segfault at b0 ip 00007f3e15a3d0f4 sp 00007f3dedf549b8 error 4 in librados.so.2.0.0[7f3e1598b000+1a4000]
Mar 3 11:10:49 ceph-iscsi-2 kernel: ework-thread[1440959]: segfault at b0 ip 00007fb19dd200f4 sp 00007fb18ba5c9b8 error 4 in librados.so.2.0.0[7fb19dc6e000+1a4000]
Mar 15 16:18:01 ceph-iscsi-1 kernel: ework-thread[938865]: segfault at b0 ip 00007f858c2b60f4 sp 00007f8571fe49b8 error 4 in librados.so.2.0.0[7f858c204000+1a4000]

static void *tcmur_work_fn(void *data)
{
	struct private *p = data;

	tcmu_set_thread_name("ework-thread", NULL);

	p->work_fn(p->data);    /* **** Kicks off work here */

	pthread_mutex_lock(&p->work->lock);
	if (--p->work->refcnt == 0)
		pthread_cond_signal(&p->work->cond);
	pthread_mutex_unlock(&p->work->lock);

	free(p);
	return NULL;
}

Looking at what part of this calls functions from librados.

If I look at the librados debuginfo for the symbols and try to track down either librados::IoCtx::aio_write or ceph::async::detail::CompletionImpl<boost::asio::io_context::executor_type, ...>:

0000000000069860 T rados_aio_write
0000000000049618 t rados_aio_write.cold.1587
0000000000069100 T rados_aio_write_full
000000000004952c t rados_aio_write_full.cold.1581
0000000000067210 T rados_aio_write_op_operate
00000000000490fb t rados_aio_write_op_operate.cold.1559
0000000000061a20 T rados_aio_writesame
0000000000048480 t rados_aio_writesame.cold.1490

And:

000000000004718a t _ZN8librados5IoCtx9aio_writeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPNS_13AioCompletionERKN4ceph6buffer4listEmm.cold.1354

The efforts to get a proper stack dump:

warning: Could not load shared library symbols for /lib64/liblttng-ust-tracepoint.so.0.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/tcmu-runner'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f858c2b60f4 in librados::IoCtx::aio_write(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, librados::AioCompletion*, ceph::buffer::list const&, unsigned long, unsigned long) [clone .cold.1354] () from /lib64/librados.so.2
[Current thread is 1 (Thread 0x7f8571fe5700 (LWP 1953))]
Missing separate debuginfos, use: yum debuginfo-install tcmu-runner-1.5.4-4.el8.x86_64   ***** Not available

And then on an actual ceph node it's different:

warning: .dynamic section for "/lib64/librados.so.2" is not at the expected address (wrong library or version mismatch?)
warning: .dynamic section for "/usr/lib64/ceph/libceph-common.so.2" is not at the expected address (wrong library or version mismatch?)
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/tcmu-runner'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f3fa0bbe0f4 in ceph::async::detail::CompletionImpl<boost::asio::io_context::executor_type, librados::v14_2_0::RadosClient::mon_command_async(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, ceph::buffer::v15_2_0::list const&, ceph::buffer::v15_2_0::list*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, Context*)::{lambda(boost::system::error_code, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, ceph::buffer::v15_2_0::list&&)#1}, void, boost::system::error_code, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list>::~CompletionImpl() () from /lib64/librados.so.2
[Current thread is 1 (Thread 0x7f3f8b8f7700 (LWP 885))]
Missing separate debuginfos, use: yum debuginfo-install tcmu-runner-1.5.4-4.el8cp.x86_64
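That all five crashes land on the same instruction inside librados can be sanity-checked by subtracting the mapping base from the faulting ip in each kernel segfault line. A quick sketch of the arithmetic, using the addresses copied from the messages above:

```python
# Each kernel line reads: "segfault at b0 ip <ip> ... in librados.so.2.0.0[<base>+<size>]".
# The faulting offset inside the library is ip - base; it is identical in every crash.
crashes = [
    (0x7f709f30c0f4, 0x7f709f25a000),  # Feb 16, ceph-iscsi-2
    (0x7fb1c6a1c0f4, 0x7fb1c696a000),  # Feb 18, ceph-iscsi-2
    (0x7f3e15a3d0f4, 0x7f3e1598b000),  # Feb 21, ceph-iscsi-2
    (0x7fb19dd200f4, 0x7fb19dc6e000),  # Mar 3,  ceph-iscsi-2
    (0x7f858c2b60f4, 0x7f858c204000),  # Mar 15, ceph-iscsi-1
]
offsets = [hex(ip - base) for ip, base in crashes]
print(offsets)  # the same offset, 0xb20f4, for every crash
```

The bracketed +1a4000 is the same in every line as well (the size of the mapped segment), which is another hint that the same librados build is faulting at the same spot each time.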
Hello Xiubo

Can I get an update on what the workaround would be here if we decide not to fix this? The customer crashes reliably during testing.

We have asked the customer to reconfigure into a supported configuration such that the LUN is only actively served from one gateway at a time. We are hoping that will prevent this from happening and will keep you updated.

Thank you
Laurence
@bhubbard++

Excellent work Sir, many thanks for all your efforts.

Sincerely
Laurence
@raimund

How I built it this morning to include debuginfo (I had to add a lot of dependencies):

I pulled the matching src.rpm package from access.redhat.com into rpmbuild and installed it on my RHEL8 test system.
Ran rpmbuild -bp SPECS/tcmu-runner.spec to make sure I had all the dependencies.
Then tested the build with rpmbuild -bb SPECS/tcmu-runner.spec.

Afterwards:

cd /home/loberman/rpmbuild/BUILD/tcmu-runner-1.5.4

In CMakeLists.txt added/modified these lines:

set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -Werror -Wall -Wdeclaration-after-statement -std=c99 -DCMAKE_BUILD_TYPE=Debug ")
set(CMAKE_BUILD_TYPE Debug)

Ran make. It builds fine (the unmodified build, as Brad and I already showed, does not have symbols):

[loberman@kernel-build-rhel8 tcmu-runner-1.5.4]$ make
[  2%] Building C object CMakeFiles/handler_rbd.dir/rbd.c.o
[  4%] Linking C shared library handler_rbd.so
[  4%] Built target handler_rbd
[  6%] Generating tcmuhandler-generated.c, tcmuhandler-generated.h
[  9%] Building C object CMakeFiles/tcmu.dir/strlcpy.c.o
[ 11%] Building C object CMakeFiles/tcmu.dir/configfs.c.o
[ 13%] Building C object CMakeFiles/tcmu.dir/api.c.o
[ 16%] Building C object CMakeFiles/tcmu.dir/libtcmu.c.o
[ 18%] Building C object CMakeFiles/tcmu.dir/libtcmu-register.c.o
[ 20%] Building C object CMakeFiles/tcmu.dir/tcmuhandler-generated.c.o
[ 23%] Building C object CMakeFiles/tcmu.dir/libtcmu_log.c.o
[ 25%] Building C object CMakeFiles/tcmu.dir/libtcmu_config.c.o
[ 27%] Building C object CMakeFiles/tcmu.dir/libtcmu_time.c.o
[ 30%] Linking C shared library libtcmu.so
[ 30%] Built target tcmu
[ 32%] Building C object CMakeFiles/consumer.dir/scsi.c.o
[ 34%] Building C object CMakeFiles/consumer.dir/consumer.c.o
[ 37%] Linking C executable consumer
[ 37%] Built target consumer
[ 39%] Building C object CMakeFiles/tcmu-runner.dir/tcmur_work.c.o
[ 41%] Building C object CMakeFiles/tcmu-runner.dir/tcmur_cmd_handler.c.o
[ 44%] Building C object CMakeFiles/tcmu-runner.dir/tcmur_aio.c.o
[ 46%] Building C object CMakeFiles/tcmu-runner.dir/tcmur_device.c.o
[ 48%] Building C object CMakeFiles/tcmu-runner.dir/target.c.o
[ 51%] Building C object CMakeFiles/tcmu-runner.dir/alua.c.o
[ 53%] Building C object CMakeFiles/tcmu-runner.dir/scsi.c.o
[ 55%] Building C object CMakeFiles/tcmu-runner.dir/main.c.o
[ 58%] Building C object CMakeFiles/tcmu-runner.dir/tcmuhandler-generated.c.o
[ 60%] Linking C executable tcmu-runner
[ 62%] Built target tcmu-runner
[ 65%] Building C object CMakeFiles/tcmu_static.dir/strlcpy.c.o
[ 67%] Building C object CMakeFiles/tcmu_static.dir/configfs.c.o
[ 69%] Building C object CMakeFiles/tcmu_static.dir/api.c.o
[ 72%] Building C object CMakeFiles/tcmu_static.dir/libtcmu.c.o
[ 74%] Building C object CMakeFiles/tcmu_static.dir/libtcmu-register.c.o
[ 76%] Building C object CMakeFiles/tcmu_static.dir/tcmuhandler-generated.c.o
[ 79%] Building C object CMakeFiles/tcmu_static.dir/libtcmu_log.c.o
[ 81%] Building C object CMakeFiles/tcmu_static.dir/libtcmu_config.c.o
[ 83%] Building C object CMakeFiles/tcmu_static.dir/libtcmu_time.c.o
[ 86%] Linking C static library libtcmu_static.a
[ 88%] Built target tcmu_static
[ 90%] Building C object CMakeFiles/tcmu-synthesizer.dir/scsi.c.o
[ 93%] Building C object CMakeFiles/tcmu-synthesizer.dir/tcmu-synthesizer.c.o
[ 95%] Linking C executable tcmu-synthesizer
[ 95%] Built target tcmu-synthesizer
[ 97%] Building C object CMakeFiles/handler_file.dir/file_example.c.o
[100%] Linking C shared library handler_file.so
[100%] Built target handler_file

And now I have the debuginfo included:

[loberman@kernel-build-rhel8 tcmu-runner-1.5.4]$ file tcmu-runner
tcmu-runner: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=cfc2af2a3a90dd47807b0616359d3e777d34e46f, with debug_info, not stripped, too many notes (256)

Now I get what looks like a sane stack dump:

[root@kernel-build-rhel8 ~]# gdb /tmp/tcmu-runner core.tcmu-runner.0.868f4a2ddc7943acbd90ce05e41421e1.915693.1678893481000000
Reading symbols from /tmp/tcmu-runner...done.
warning: Corrupted shared library list: 0x0 != 0xffa0e90000000468
warning: .dynamic section for "/lib64/libtcmalloc.so.4" is not at the expected address (wrong library or version mismatch?)
warning: .dynamic section for "/lib64/libunwind.so.8" is not at the expected address (wrong library or version mismatch?)
warning: .dynamic section for "/lib64/librbd.so.1" is not at the expected address (wrong library or version mismatch?)
warning: .dynamic section for "/lib64/librados.so.2" is not at the expected address (wrong library or version mismatch?)
warning: .dynamic section for "/usr/lib64/ceph/libceph-common.so.2" is not at the expected address (wrong library or version mismatch?)
warning: .dynamic section for "/lib64/libibverbs.so.1" is not at the expected address (wrong library or version mismatch?)
warning: .dynamic section for "/lib64/librdmacm.so.1" is not at the expected address (wrong library or version mismatch?)
warning: .dynamic section for "/lib64/libnl-route-3.so.200" is not at the expected address (wrong library or version mismatch?)
warning: Could not load shared library symbols for /lib64/liblttng-ust-tracepoint.so.0.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /lib64/libuuid.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/1a/e6dc88d59b2619224ff5bc3050227b53139f72.debug
Core was generated by `/usr/bin/tcmu-runner'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f858c2b60f4 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_dispose (this=0xffffffffffffffa0) at /usr/include/c++/8/bits/basic_string.h:189
warning: Source file is more recent than executable.
189	#endif
[Current thread is 1 (Thread 0x7f8571fe5700 (LWP 1953))]
(gdb) bt
#0  0x00007f858c2b60f4 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_dispose() (this=0xffffffffffffffa0) at /usr/include/c++/8/bits/basic_string.h:189
#1  0x00007f858c2b60f4 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() (this=0xffffffffffffffa0, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/basic_string.h:657
#2  0x00007f858c2b60f4 in object_t::~object_t() (this=0xffffffffffffffa0, __in_chrg=<optimized out>) at /usr/src/debug/ceph-12.2.7-9.el8.x86_64/src/include/object.h:32
#3  0x00007f858c2b60f4 in librados::IoCtx::aio_write(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, librados::AioCompletion*, ceph::buffer::list const&, unsigned long, unsigned long) (this=<optimized out>, oid=..., c=<optimized out>, bl=..., len=140211321291616, off=<optimized out>) at /usr/src/debug/ceph-12.2.7-9.el8.x86_64/src/librados/librados.cc:1908
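As a side observation (my own inference from these two numbers, not something taken from the code): the this pointer in frame #0, 0xffffffffffffffa0, is NULL minus 0x60 when read as a signed 64-bit value, and the kernel-reported fault address b0 is what you get from a member load a small positive offset past such a bogus pointer. A quick sketch of the arithmetic:

```python
# Illustrative arithmetic only: relate the bogus `this` from the backtrace
# to the fault address the kernel reported ("segfault at b0").
MASK = (1 << 64) - 1           # 64-bit wraparound

this_ptr = 0xffffffffffffffa0  # from gdb: this=0xffffffffffffffa0
fault_addr = 0xb0              # from the kernel segfault line

# Read as a signed 64-bit value, `this` is a small negative number,
# i.e. it looks like NULL minus 0x60 (an offset into a NULL/freed object).
signed_this = this_ptr - (1 << 64)
print(hex(-signed_this))       # 0x60

# The faulting load would then sit 0x110 bytes past `this`
# (inferred purely from -0x60 + 0x110 == 0xb0 with 64-bit wraparound).
assert (this_ptr + 0x110) & MASK == fault_addr
```

Consistent with a destructor being run on a pointer derived from a NULL or already-freed object, which matches the use-after-free flavor of the eventual fix in the reopen/event-reporting path.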
Hello

We had another customer upgrade and are now seeing similar segfaults in librados. Case # is 03470814.

Is it possible for them to run the older container for the iscsi gateway to avoid this for now in a mixed cluster?

Regards
Laurence
Hi Mike

So as we discussed, I suspect NWU went from 4.x to 5.x; the sos shows the prior version as ceph-common-14.2.11-147.el8cp.x86_64. Meaning we will need a hotfix container image, as it's a regression that took down production when they upgraded.

Thanks so much
Laurence
Hello

From the SOS they are on 16.2.10-138.el8cp.noarch. That came from the SOS of the iscsigw machine.

Regards
Laurence
Operational Business Justification for the Hotfix:

The customer upgraded from Ceph 4 to 16.2.10-138.el8cp and started experiencing segfaults in TCMU. This takes their whole production cluster down. This has already been triaged in this BZ and a fix in the code has been provided.

Detailed description of the issue including BZ and case number:
BZ is this one, case is 03470814.
Thu Mar 23 2023 Thomas Serlin <tserlin> 1.5.4-5
- tcmur_device: fix racy between reopening devices and reporting events (rhbz#2178838)

What is the current impact to customer?
The customer is a University, is in production, and cannot serve their customers. It's a regression, as they were working fine before upgrading, and they cannot downgrade.

Does an upstream fix exist? If yes, what release?
Yes, see above.

Is there a root cause for the issue from development documented in the BZ?
Yes, in the published build.
Hello Bipin, email sent.

Can I make the hotfix available to the customer now?

Regards
Laurence
Many thanks for all the work and help here by all.

When it's ready I will ask the Ceph support folks how to make the container available for the customer. This is not something I have had to do before, and it's only recently that I have started helping here in Ceph.

Regards
Laurence
Hello

I had asked for a hotfix here and it was also denied by Bipin, based on the fix being included in a soon-to-be-released set of fixes. The problem we have in my case is that the customer keeps crashing since upgrading, so it's a clear regression. If this takes weeks before the customer can get it, they are going to freak out.

So for me, I think getting a fix, even if it's early and has not gone through all the QA, is better than waiting. I don't know what BMW is getting, but the fix here is very small and confined to just TCMU.

Regards
Laurence
Hello

We have two cases. The NWU case (the case I wanted an accelerated fix for) is a production workload for a University and the one most urgently needing a test package; they have production running for students. The case this BZ was originally logged for is the one reproducing the issue with a fio test.

Getting a test package for NWU will be great.

Regards
Laurence
Hello Bipin

The customer confirmed they can wait for the official release. They also confirmed this is a production ceph configuration.

Thanks
Laurence
@Xiubli, @tserlin, please let us know the fixed-in build versions for ceph and tcmu.
We have run the below tests to verify the BZ and no issues were seen during the tests:

- iSCSI tests using two gateway nodes with max LUN configurations, configuring multipath and running concurrent FIOs against random LUNs
- iSCSI tests using four gateway nodes with max LUN configurations, configuring multipath and running concurrent FIOs against random LUNs
- Failover condition verification when multipath is configured for 2 or more gateways with max LUN configuration
- Random IOs using FIO: write, write restart, and parallel reads across random LUNs
- Multipath failover performed while IOs are in progress

Below are the ceph and tcmu versions:

[root@ceph-tcmu-issue-gyvrcs-node5 /]# rpm -qa | grep tcmu*
tcmu-runner-1.5.4-7.el8cp.x86_64
libtcmu-1.5.4-7.el8cp.x86_64
[root@ceph-tcmu-issue-gyvrcs-node1-installer cephuser]# ceph version
ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)
[root@ceph-tcmu-issue-gyvrcs-node1-installer cephuser]#
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 5.3 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:1732
See KCS #7008129, (https://access.redhat.com/solutions/7008129)
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days