Bug 2178838 - CEPH 5.1 iscsi gateway tcmu-runner tasks are segfaulting and client connections are lost (ceph-base-16.2.7-112.el8cp.x86_64)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: iSCSI
Version: 5.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 5.3z2
Assignee: Xiubo Li
QA Contact: Preethi
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On:
Blocks: 2185621
 
Reported: 2023-03-15 20:54 UTC by loberman
Modified: 2023-09-19 04:34 UTC
CC: 17 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
.`use-after-free` crash no longer occurs
Previously, a race condition between reopening a device and reporting events to the Ceph server caused a _use-after-free_ crash: a newly fired `event_work` could access the device while the device was being closed. With this fix, a new timed-out IO callback is implemented to ensure that all in-flight IOs finish either before or after the `event_work` is flushed, and the crashes no longer occur.
Clone Of:
Environment:
Last Closed: 2023-04-11 20:07:59 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github open-iscsi tcmu-runner pull 695 0 None open tcmur_device: fix racy between reopening devices and reporting events 2023-03-20 03:00:45 UTC
Red Hat Issue Tracker RHCEPH-6263 0 None None None 2023-03-15 20:56:51 UTC
Red Hat Knowledge Base (Solution) 7008129 0 None None None 2023-04-17 16:24:43 UTC
Red Hat Product Errata RHBA-2023:1732 0 None None None 2023-04-11 20:08:18 UTC

Description loberman 2023-03-15 20:54:30 UTC
Description of problem:
Customer has ceph-*-16.2.7-112.el8cp.x86_64 (Ceph 5.1) installed.
They are using the Ceph iSCSI gateway to serve out LUNs for their RHV hosts.
To get this working they have a non-RHV test host mapping LUNs.

Running a fio test ends up losing connections, and the tcmu-runner processes segfault.

Version-Release number of selected component (if applicable):
ceph-*-16.2.7-112.el8cp.x86_64 (Ceph 5.1)

How reproducible:
Has happened multiple times

Steps to Reproduce:
1. Discover and map iSCSI LUNs for the test client from the iSCSI gateway.
2. Run a fio test to generate load.
3. The tcmu-runner processes segfault.

Actual results:
We lose connectivity when the processes segfault

Expected results:
If this is a timing issue, we might expect connection issues due to timeouts, but we should not get segfaults.

Additional info:
I have been struggling to get a useful backtrace from the core files. There is no debuginfo for tcmu-runner, and I built from source using rpmbuild but still could not get a build with symbols to use.

I tried on an actual Ceph node and on the system where I built the package

The logs seen so far:

2023-03-15 14:47:11.081 7:cmdproc-uio0 [DEBUG] handle_unmap_internal:414 rbd/rhv-1.rhv-lun-1: Parameter list 0, start lba: 16353280, end lba: 16386047, nlbas: 32768
2023-03-15 14:47:11.081 7:cmdproc-uio0 [DEBUG] align_and_split_unmap:332 rbd/rhv-1.rhv-lun-1: OPTIMAL UNMAP GRANULARITY: 32768, UNMAP GRANULARITY ALIGNMENT mask: 0, lbas: 32768
2023-03-15 14:47:11.081 7:cmdproc-uio0 [DEBUG] align_and_split_unmap:356 rbd/rhv-1.rhv-lun-1: The first split: start lba: 16353280, end lba: 16386047, lbas: 32768
2023-03-15 14:47:11.081 7:cmdproc-uio0 [DEBUG] align_and_split_unmap:361 rbd/rhv-1.rhv-lun-1: The last split: start lba: 16353280, end lba: 16386047, lbas: 32768
2023-03-15 14:47:11.081 7:cmdproc-uio0 [DEBUG] align_and_split_unmap:362 rbd/rhv-1.rhv-lun-1: There are totally 1 splits
2023-03-15 14:47:11.095 7:cmdproc-uio0 [DEBUG] handle_writesame_check:671 rbd/rhv-1.rhv-lun-1: Start lba: 8194096, number of lba: 20480, last lba: 8214575
2023-03-15 14:47:11.096 7:cmdproc-uio0 [DEBUG] tcmu_rbd_aio_writesame:1600 rbd/rhv-1.rhv-lun-1: Start write same off:4195377152, len:10485760
2023-03-15 14:47:55.861 7:cmdproc-uio0 [DEBUG] tcmu_rbd_has_lock:710 rbd/rhv-1.rhv-lun-1: Is owner
2023-03-15 14:47:55.865 7:cmdproc-uio0 [DEBUG] tcmu_rbd_has_lock:710 rbd/rhv-1.rhv-lun-1: Is owner
2023-03-15 14:50:21.755 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:21.755 7:io_context_pool [ERROR] __tcmu_notify_conn_lost:214 rbd/rhv-1.rhv-lun-1: Handler connection lost (lock state 5)
2023-03-15 14:50:21.757 7:ework-thread [DEBUG] tgt_port_grp_recovery_work_fn:224: Disabling iscsi/iqn.2001-07.com.ceph:rhv-lun/tpgt_1.
2023-03-15 14:50:21.782 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:21.782 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:21.782 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:21.866 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:22.055 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:22.056 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:22.056 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:22.056 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:22.566 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:23.189 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:23.189 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:23.189 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:23.189 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:23.190 7:io_context_pool [ERROR] tcmu_rbd_handle_timedout_cmd:1267 rbd/rhv-1.rhv-lun-1: Timing out cmd.
2023-03-15 14:50:23.191 7:ework-thread [INFO] tgt_port_grp_recovery_work_fn:246: Disabled iscsi/iqn.2001-07.com.ceph:rhv-lun/tpgt_1.
2023-03-15 14:50:23.191 7:ework-thread [DEBUG] tcmu_dev_flush_ring:385 rbd/rhv-1.rhv-lun-1: waiting for ring to clear
2023-03-15 14:50:23.191 7:ework-thread [DEBUG] tcmu_dev_flush_ring:388 rbd/rhv-1.rhv-lun-1: ring clear
2023-03-15 14:50:23.191 7:ework-thread [DEBUG] __tcmu_reopen_dev:77 rbd/rhv-1.rhv-lun-1: Waiting for outstanding commands to complete
2023-03-15 14:50:23.191 7:ework-thread [DEBUG] __tcmu_reopen_dev:85 rbd/rhv-1.rhv-lun-1: Closing device.
2023-03-15 14:50:23.203 7:ework-thread [INFO] tcmu_rbd_close:1240 rbd/rhv-1.rhv-lun-1: appended blocklist entry: {192.168.56.49:0/4051956005}
2023-03-15 14:50:23.203 7:ework-thread [DEBUG] __tcmu_reopen_dev:95 rbd/rhv-1.rhv-lun-1: Opening device. Attempt 0
2023-03-15 14:50:23.203 7:ework-thread [DEBUG] tcmu_rbd_open:1088 rbd/rhv-1.rhv-lun-1: tcmu_rbd_open config rbd/rhv-1/rhv-lun-1;osd_op_timeout=30;id=iscsi.iscsi.ceph-iscsi-1.tuzhcv block size 512 num lbas 4294967296.


The efforts to get a proper stack dump

warning: Could not load shared library symbols for /lib64/liblttng-ust-tracepoint.so.0.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/tcmu-runner'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f858c2b60f4 in librados::IoCtx::aio_write(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, librados::AioCompletion*, ceph::buffer::list const&, unsigned long, unsigned long) [clone .cold.1354] () from /lib64/librados.so.2
[Current thread is 1 (Thread 0x7f8571fe5700 (LWP 1953))]
Missing separate debuginfos, use: yum debuginfo-install tcmu-runner-1.5.4-4.el8.x86_64


And then on an actual ceph node it's different:

warning: .dynamic section for "/lib64/librados.so.2" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "/usr/lib64/ceph/libceph-common.so.2" is not at the expected address (wrong library or version mismatch?)
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/tcmu-runner'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f3fa0bbe0f4 in ceph::async::detail::CompletionImpl<boost::asio::io_context::executor_type, librados::v14_2_0::RadosClient::mon_command_async(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, ceph::buffer::v15_2_0::list const&, ceph::buffer::v15_2_0::list*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, Context*)::{lambda(boost::system::error_code, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, ceph::buffer::v15_2_0::list&&)#1}, void, boost::system::error_code, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list>::~CompletionImpl() () from /lib64/librados.so.2
[Current thread is 1 (Thread 0x7f3f8b8f7700 (LWP 885))]
Missing separate debuginfos, use: yum debuginfo-install tcmu-runner-1.5.4-4.el8cp.x86_64

I will make the core files available on my people page in the next update

Regards
Laurence Oberman

Comment 2 loberman 2023-03-16 13:34:08 UTC
Hello Ilya

Yep, I asked for the messages logs yesterday and they only just provided them.
I will update the bug as soon as I extract what you are looking for.

I am uploading the cores to my people page

Comment 3 loberman 2023-03-16 13:46:21 UTC
My people page has the current core files

https://people.redhat.com/loberman/customer/.bz2178838/

I will now find the segfault messages

Comment 4 loberman 2023-03-16 14:00:43 UTC
Hello Ilya

Here we go

messages-20230219:Feb 16 05:46:31 ceph-iscsi-2 kernel: ework-thread[2702117]: segfault at b0 ip 00007f709f30c0f4 sp 00007f70778239b8 error 4 in librados.so.2.0.0[7f709f25a000+1a4000]
messages-20230219:Feb 18 01:12:47 ceph-iscsi-2 kernel: ework-thread[3374310]: segfault at b0 ip 00007fb1c6a1c0f4 sp 00007fb1a9f479b8 error 4 in librados.so.2.0.0[7fb1c696a000+1a4000]
messages-20230226:Feb 21 04:46:50 ceph-iscsi-2 kernel: ework-thread[3960676]: segfault at b0 ip 00007f3e15a3d0f4 sp 00007f3dedf549b8 error 4 in librados.so.2.0.0[7f3e1598b000+1a4000]
messages-20230305:Mar  3 11:10:49 ceph-iscsi-2 kernel: ework-thread[1440959]: segfault at b0 ip 00007fb19dd200f4 sp 00007fb18ba5c9b8 error 4 in librados.so.2.0.0[7fb19dc6e000+1a4000]

Best Regards
Laurence

Comment 5 Raimund Sacherer 2023-03-16 14:49:14 UTC
(In reply to loberman from comment #4)
> Hello Illya
> 
> Here we go
> 
> messages-20230219:Feb 16 05:46:31 ceph-iscsi-2 kernel:
> ework-thread[2702117]: segfault at b0 ip 00007f709f30c0f4 sp
> 00007f70778239b8 error 4 in librados.so.2.0.0[7f709f25a000+1a4000]
> messages-20230219:Feb 18 01:12:47 ceph-iscsi-2 kernel:

Hi, just for completeness, for the host ceph-iscsi-1, the segfault line is:
Mar 15 16:18:01 ceph-iscsi-1 kernel: ework-thread[938865]: segfault at b0 ip 00007f858c2b60f4 sp 00007f8571fe49b8 error 4 in librados.so.2.0.0[7f858c204000+1a4000]

This should correspond with the coredump collected for this host yesterday. 

We also have the debug 4 tcmu-runner.log, which I am attaching to this BZ. Please note there is a difference, probably in the timezone configuration, between the host and the container: the container tcmu-runner.log shows the time one hour earlier.

Comment 7 loberman 2023-03-16 15:37:48 UTC
Always at the same offset into librados.so.2.0.0 (the +1a4000 in the kernel messages is the size of the library mapping).

Of course the start user-mode address changes per run, e.g.:
7f709f25a000

Comment 9 loberman 2023-03-16 18:54:26 UTC
Thanks Ilya,

I am still trying to get a useful backtrace.

Brad, I did rebuild tcmu-runner to see if I could fudge a way to get all the symbols, but it's C++ not C, so that in itself was a challenge: like where to add the g++ -g option to get the binary to include symbols, and how to include the library symbols too.

The mapping size is always +1a4000 and the fault is always at the same offset within it:

Feb 16 05:46:31 ceph-iscsi-2 kernel: ework-thread[2702117]: segfault at b0 ip 00007f709f30c0f4 sp 00007f70778239b8 error 4 in librados.so.2.0.0[7f709f25a000+1a4000]
Feb 18 01:12:47 ceph-iscsi-2 kernel: ework-thread[3374310]: segfault at b0 ip 00007fb1c6a1c0f4 sp 00007fb1a9f479b8 error 4 in librados.so.2.0.0[7fb1c696a000+1a4000]
Feb 21 04:46:50 ceph-iscsi-2 kernel: ework-thread[3960676]: segfault at b0 ip 00007f3e15a3d0f4 sp 00007f3dedf549b8 error 4 in librados.so.2.0.0[7f3e1598b000+1a4000]
Mar  3 11:10:49 ceph-iscsi-2 kernel: ework-thread[1440959]: segfault at b0 ip 00007fb19dd200f4 sp 00007fb18ba5c9b8 error 4 in librados.so.2.0.0[7fb19dc6e000+1a4000]
Mar 15 16:18:01 ceph-iscsi-1 kernel: ework-thread[938865]:  segfault at b0 ip 00007f858c2b60f4 sp 00007f8571fe49b8 error 4 in librados.so.2.0.0[7f858c204000+1a4000]
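A quick, illustrative check of this observation (assuming the usual reading of the kernel message format, `segfault at ADDR ip IP sp SP error E in NAME[BASE+SIZE]`, where BASE and SIZE describe the library mapping): subtracting each mapping base from the faulting instruction pointer shows that the stable quantity is the offset into librados, which is what lets the same crash site be identified across reboots.

```python
# (ip, base) pairs taken from the segfault lines above.
crashes = [
    (0x7f709f30c0f4, 0x7f709f25a000),
    (0x7fb1c6a1c0f4, 0x7fb1c696a000),
    (0x7f3e15a3d0f4, 0x7f3e1598b000),
    (0x7fb19dd200f4, 0x7fb19dc6e000),
    (0x7f858c2b60f4, 0x7f858c204000),
]
# ip - base is the offset of the faulting instruction within librados.
offsets = {hex(ip - base) for ip, base in crashes}
print(offsets)  # every crash faults at the same library offset
```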

static void *tcmur_work_fn(void *data)
{
        struct private *p = data;

        tcmu_set_thread_name("ework-thread", NULL);

        p->work_fn(p->data);  **** Kicks off work here

        pthread_mutex_lock(&p->work->lock);
        if (--p->work->refcnt == 0)
                pthread_cond_signal(&p->work->cond);
        pthread_mutex_unlock(&p->work->lock);

        free(p);
        return NULL;
}

Looking at what part of this calls functions from librados

If I look at librados DebugInfo for the symbols and try see if I can track down either
 librados::IoCtx::aio_write(
or
 ceph::async::detail::CompletionImpl<boost::asio::io_context::executor_type


0000000000069860 T rados_aio_write
0000000000049618 t rados_aio_write.cold.1587
0000000000069100 T rados_aio_write_full
000000000004952c t rados_aio_write_full.cold.1581
0000000000067210 T rados_aio_write_op_operate
00000000000490fb t rados_aio_write_op_operate.cold.1559
0000000000061a20 T rados_aio_writesame
0000000000048480 t rados_aio_writesame.cold.1490

And
000000000004718a t _ZN8librados5IoCtx9aio_writeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPNS_13AioCompletionERKN4ceph6buffer4listEmm.cold.1354


The efforts to get a proper stack dump

warning: Could not load shared library symbols for /lib64/liblttng-ust-tracepoint.so.0.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/tcmu-runner'.
Program terminated with signal SIGSEGV, Segmentation fault.

#0  0x00007f858c2b60f4 in librados::IoCtx::aio_write(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, librados::AioCompletion*, ceph::buffer::list const&, unsigned long, unsigned long) [clone .cold.1354] () from /lib64/librados.so.2
[Current thread is 1 (Thread 0x7f8571fe5700 (LWP 1953))]
Missing separate debuginfos, use: yum debuginfo-install tcmu-runner-1.5.4-4.el8.x86_64  ***** Not available


And then on an actual ceph node it's different:

warning: .dynamic section for "/lib64/librados.so.2" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "/usr/lib64/ceph/libceph-common.so.2" is not at the expected address (wrong library or version mismatch?)
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/tcmu-runner'.
Program terminated with signal SIGSEGV, Segmentation fault.

#0  0x00007f3fa0bbe0f4 in ceph::async::detail::CompletionImpl<boost::asio::io_context::executor_type, librados::v14_2_0::RadosClient::mon_command_async(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, ceph::buffer::v15_2_0::list const&, ceph::buffer::v15_2_0::list*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, Context*)::{lambda(boost::system::error_code, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, ceph::buffer::v15_2_0::list&&)#1}, void, boost::system::error_code, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list>::~CompletionImpl() () from /lib64/librados.so.2
[Current thread is 1 (Thread 0x7f3f8b8f7700 (LWP 885))]
Missing separate debuginfos, use: yum debuginfo-install tcmu-runner-1.5.4-4.el8cp.x86_64

Comment 15 loberman 2023-03-17 12:21:00 UTC
Hello Xiubo

Can I get an update on what the workaround would be here if we decide not to fix this?

Customer crashes reliably during testing. 
We have asked the customer to reconfigure in a supported configuration such that the LUN is only actively served from one gateway at a time.
We are hoping that will prevent this from happening and will keep you updated.

Thank you
Laurence

Comment 16 loberman 2023-03-17 12:26:24 UTC
@bhubbard++ Excellent work Sir
Many Thanks for all your efforts

Sincerely
Laurence

Comment 19 loberman 2023-03-17 14:39:13 UTC
@raimund
How I built it this morning to include debuginfo (I had to add a lot of dependencies):

I used the matching src.rpm package from access.redhat.com into rpmbuild

Installed it on my RHEL8 test system 

Ran rpmbuild -bp SPECS/tcmu-runner.spec to make sure I had all the dependencies
Then tested the build with rpmbuild -bb SPECS/tcmu-runner.spec

Afterwards:
cd /home/loberman/rpmbuild/BUILD/tcmu-runner-1.5.4

In CMakeLists.txt I added/modified these lines:

set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -Werror -Wall -Wdeclaration-after-statement -std=c99 -DCMAKE_BUILD_TYPE=Debug ")
set(CMAKE_BUILD_TYPE Debug)

Ran make.

It builds fine but, as Brad and I already showed, the stock package build does not have symbols.

[loberman@kernel-build-rhel8 tcmu-runner-1.5.4]$ make


[  2%] Building C object CMakeFiles/handler_rbd.dir/rbd.c.o
[  4%] Linking C shared library handler_rbd.so
[  4%] Built target handler_rbd
[  6%] Generating tcmuhandler-generated.c, tcmuhandler-generated.h
[  9%] Building C object CMakeFiles/tcmu.dir/strlcpy.c.o
[ 11%] Building C object CMakeFiles/tcmu.dir/configfs.c.o
[ 13%] Building C object CMakeFiles/tcmu.dir/api.c.o
[ 16%] Building C object CMakeFiles/tcmu.dir/libtcmu.c.o
[ 18%] Building C object CMakeFiles/tcmu.dir/libtcmu-register.c.o
[ 20%] Building C object CMakeFiles/tcmu.dir/tcmuhandler-generated.c.o
[ 23%] Building C object CMakeFiles/tcmu.dir/libtcmu_log.c.o
[ 25%] Building C object CMakeFiles/tcmu.dir/libtcmu_config.c.o
[ 27%] Building C object CMakeFiles/tcmu.dir/libtcmu_time.c.o
[ 30%] Linking C shared library libtcmu.so
[ 30%] Built target tcmu
[ 32%] Building C object CMakeFiles/consumer.dir/scsi.c.o
[ 34%] Building C object CMakeFiles/consumer.dir/consumer.c.o
[ 37%] Linking C executable consumer
[ 37%] Built target consumer
[ 39%] Building C object CMakeFiles/tcmu-runner.dir/tcmur_work.c.o
[ 41%] Building C object CMakeFiles/tcmu-runner.dir/tcmur_cmd_handler.c.o
[ 44%] Building C object CMakeFiles/tcmu-runner.dir/tcmur_aio.c.o
[ 46%] Building C object CMakeFiles/tcmu-runner.dir/tcmur_device.c.o
[ 48%] Building C object CMakeFiles/tcmu-runner.dir/target.c.o
[ 51%] Building C object CMakeFiles/tcmu-runner.dir/alua.c.o
[ 53%] Building C object CMakeFiles/tcmu-runner.dir/scsi.c.o
[ 55%] Building C object CMakeFiles/tcmu-runner.dir/main.c.o
[ 58%] Building C object CMakeFiles/tcmu-runner.dir/tcmuhandler-generated.c.o
[ 60%] Linking C executable tcmu-runner
[ 62%] Built target tcmu-runner
[ 65%] Building C object CMakeFiles/tcmu_static.dir/strlcpy.c.o
[ 67%] Building C object CMakeFiles/tcmu_static.dir/configfs.c.o
[ 69%] Building C object CMakeFiles/tcmu_static.dir/api.c.o
[ 72%] Building C object CMakeFiles/tcmu_static.dir/libtcmu.c.o
[ 74%] Building C object CMakeFiles/tcmu_static.dir/libtcmu-register.c.o
[ 76%] Building C object CMakeFiles/tcmu_static.dir/tcmuhandler-generated.c.o
[ 79%] Building C object CMakeFiles/tcmu_static.dir/libtcmu_log.c.o
[ 81%] Building C object CMakeFiles/tcmu_static.dir/libtcmu_config.c.o
[ 83%] Building C object CMakeFiles/tcmu_static.dir/libtcmu_time.c.o
[ 86%] Linking C static library libtcmu_static.a
[ 88%] Built target tcmu_static
[ 90%] Building C object CMakeFiles/tcmu-synthesizer.dir/scsi.c.o
[ 93%] Building C object CMakeFiles/tcmu-synthesizer.dir/tcmu-synthesizer.c.o
[ 95%] Linking C executable tcmu-synthesizer
[ 95%] Built target tcmu-synthesizer
[ 97%] Building C object CMakeFiles/handler_file.dir/file_example.c.o
[100%] Linking C shared library handler_file.so
[100%] Built target handler_file

And now I have the debuginfo included

[loberman@kernel-build-rhel8 tcmu-runner-1.5.4]$ file tcmu-runner
tcmu-runner: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=cfc2af2a3a90dd47807b0616359d3e777d34e46f, with debug_info, not stripped, too many notes (256)

Now I get what looks like a sane stack dump:

[root@kernel-build-rhel8 ~]#  gdb /tmp/tcmu-runner core.tcmu-runner.0.868f4a2ddc7943acbd90ce05e41421e1.915693.1678893481000000
Reading symbols from /tmp/tcmu-runner...done.

warning: Corrupted shared library list: 0x0 != 0xffa0e90000000468

warning: .dynamic section for "/lib64/libtcmalloc.so.4" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "/lib64/libunwind.so.8" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "/lib64/librbd.so.1" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "/lib64/librados.so.2" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "/usr/lib64/ceph/libceph-common.so.2" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "/lib64/libibverbs.so.1" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "/lib64/librdmacm.so.1" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "/lib64/libnl-route-3.so.200" is not at the expected address (wrong library or version mismatch?)

warning: Could not load shared library symbols for /lib64/liblttng-ust-tracepoint.so.0.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /lib64/libuuid.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/1a/e6dc88d59b2619224ff5bc3050227b53139f72.debug
Core was generated by `/usr/bin/tcmu-runner'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f858c2b60f4 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_dispose (this=0xffffffffffffffa0) at /usr/include/c++/8/bits/basic_string.h:189

warning: Source file is more recent than executable.
189	#endif
[Current thread is 1 (Thread 0x7f8571fe5700 (LWP 1953))]

(gdb) bt
#0  0x00007f858c2b60f4 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_dispose() (this=0xffffffffffffffa0)
    at /usr/include/c++/8/bits/basic_string.h:189
#1  0x00007f858c2b60f4 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string()
    (this=0xffffffffffffffa0, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/basic_string.h:657
#2  0x00007f858c2b60f4 in object_t::~object_t() (this=0xffffffffffffffa0, __in_chrg=<optimized out>)
    at /usr/src/debug/ceph-12.2.7-9.el8.x86_64/src/include/object.h:32
#3  0x00007f858c2b60f4 in librados::IoCtx::aio_write(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, librados::AioCompletion*, ceph::buffer::list const&, unsigned long, unsigned long)
    (this=<optimized out>, oid=..., c=<optimized out>, bl=..., len=140211321291616, off=<optimized out>)
    at /usr/src/debug/ceph-12.2.7-9.el8.x86_64/src/librados/librados.cc:1908
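Reading the addresses in this frame (my interpretation, not something the debugger itself reports): `this=0xffffffffffffffa0` is -0x60 as a signed 64-bit value, and the kernel log said `segfault at b0`. Both are tiny, member-style offsets around NULL, which fits dereferencing an invalid or freed object pointer, i.e. the use-after-free the fix addresses.

```python
# Illustrative pointer arithmetic on the reported crash addresses.
this_ptr = 0xffffffffffffffa0   # 'this' in frame #0 of the backtrace
fault_addr = 0xb0               # "segfault at b0" from the kernel log

signed_this = this_ptr - (1 << 64)   # interpret as a signed 64-bit value
print(hex(signed_this))              # a small negative offset from NULL
print(hex(fault_addr))               # likewise a small near-NULL address
```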

Comment 49 loberman 2023-03-28 12:46:23 UTC
Hello

We had another customer upgrade and they are now seeing similar segfaults in librados.
Case # is 03470814.
Is it possible for them to run the older container for the iSCSI gateway to avoid this for now in a mixed cluster?

Regards
Laurence

Comment 51 loberman 2023-03-28 15:07:04 UTC
Hi Mike

So as we discussed, I suspect NWU went from 4.x to 5.x; the sos shows the prior version as ceph-common-14.2.11-147.el8cp.x86_64.
Meaning we will need a hotfix container image, as it's a regression that took down production when they upgraded.

Thanks so much
Laurence

Comment 54 loberman 2023-03-28 16:32:28 UTC
Hello

From the SOS they are on 16.2.10-138.el8cp.noarch.
That came from the SOS of the iSCSI gateway machine.

Regards
Laurence

Comment 55 loberman 2023-03-28 16:46:15 UTC
Operational Business Justification for the Hotfix.

Customer upgraded from Ceph 4 to 16.2.10-138.el8cp and started experiencing segfaults in TCMU.
This takes their whole production cluster down.
This has already been triaged in this BZ and a fix in the code has been provided.

Detailed description of the issue including BZ and case number.

BZ is this one, case is 03470814

* Thu Mar 23 2023 Thomas Serlin <tserlin> 1.5.4-5
- tcmur_device: fix racy between reopening devices and reporting events (rhbz#2178838)

What is the current impact to the customer?
The customer, a university, is in production and cannot serve its users.
It's a regression: they were working fine before upgrading, and they cannot downgrade.

Does an upstream fix exist? If yes, what release?
Yes, see above

Is there a root cause for the issue from development documented in the BZ?
Yes, in the published build

Comment 60 loberman 2023-03-29 12:17:48 UTC
Hello Bipin, email sent.

Can I make the hotfix available to the customer now?

Regards
Laurence

Comment 62 loberman 2023-03-29 14:31:25 UTC
Many thanks to all for the work and help here.

When it's ready I will ask the Ceph support folks how to make the container available for the customer.
This is not something I have had to do before, and it's only recently that I have started helping here in Ceph.

Regards
Laurence

Comment 67 loberman 2023-03-29 20:35:39 UTC
Hello

I had asked for a hotfix and it was also denied by Bipin, based on the fix being included in a soon-to-be-released set of fixes.

The problem we have in my case is that the customer keeps crashing since upgrading, so it's a clear regression.
If this takes weeks before the customer can get it, they are going to freak out.

So for me, getting a fix, even if it's early and has not gone through all the QA, is better than waiting.
I don't know what BMW is getting, but the fix here is very small and confined to just TCMU.

Regards
Laurence

Comment 70 loberman 2023-03-30 11:15:15 UTC
Hello
We have two cases.
The NWU case (the one I wanted an accelerated fix for) is a production workload for a university and the one most urgently needing a test package.
They have production running for students.

The case this BZ was originally logged for is the one reproducing the issue with a fio test.

Getting a test package for NWU will be great.

Regards
Laurence

Comment 73 loberman 2023-03-31 10:10:11 UTC
Hello Bipin

The customer confirmed they can wait for the official release.
They did also confirm this is a production ceph configuration.

Thanks
Laurence

Comment 75 Preethi 2023-04-03 06:25:38 UTC
@Xiubli, @tserlin, please let us know the fixed-in build versions for ceph and tcmu.

Comment 79 Preethi 2023-04-04 17:03:30 UTC
We have run the below tests to verify the BZ and no issues were seen during the tests:

- iSCSI tests using two gateway nodes with max LUN configurations, configuring multipath and running concurrent FIOs against random LUNs
- iSCSI tests using four gateway nodes with max LUN configurations, configuring multipath and running concurrent FIOs against random LUNs
- Failover condition verification when multipath is configured for 2 or more gateways with a max LUN configuration
- Random IOs using FIO (writes, write restarts, and parallel reads across random LUNs); while IOs are in progress, perform multipath failover

Ceph and tcmu versions below:
[root@ceph-tcmu-issue-gyvrcs-node5 /]# rpm -qa | grep tcmu*
tcmu-runner-1.5.4-7.el8cp.x86_64
libtcmu-1.5.4-7.el8cp.x86_64

[root@ceph-tcmu-issue-gyvrcs-node1-installer cephuser]# ceph version
ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)
[root@ceph-tcmu-issue-gyvrcs-node1-installer cephuser]#

Comment 83 errata-xmlrpc 2023-04-11 20:07:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.3 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:1732

Comment 84 Manny 2023-04-17 16:26:02 UTC
See KCS #7008129, (https://access.redhat.com/solutions/7008129)

Comment 85 Red Hat Bugzilla 2023-09-19 04:34:31 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

