Bug 2190288
Summary: | CEPH version 16.2.10-160.el8.x86_64: librados segfault in ework-thread librados.so.2.0. | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | loberman <loberman> |
Component: | iSCSI | Assignee: | Xiubo Li <xiubli> |
Status: | CLOSED ERRATA | QA Contact: | Preethi <pnataraj> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 5.3 | CC: | ceph-eng-bugs, cephqe-warriors, idryomov, mcaldeir, rsachere, tserlin, vereddy, vumrao, xiubli |
Target Milestone: | --- | ||
Target Release: | 5.3z3 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | tcmu-runner-1.5.4-8.el8cp | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2023-05-23 00:19:11 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
loberman
2023-04-27 18:54:57 UTC
Raimund noticed this. It was at the top of my screen and I missed it. Makes sense why we cannot open the core file. We will check with the customer.

```
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/tcmu-runner...Reading symbols from /usr/lib/debug/usr/bin/tcmu-runner-1.5.4-7.el8.x86_64.debug...done.
done.
BFD: warning: /home/loberman/core.tcmu-runner.0.7b56bb3deb88430fa96e13c4fc7ad1ed.1152774.1682583301000000 is truncated: expected core file size >= 4912242688, found: 2147483648
```

We are still waiting on a valid core from the customer. The first one was truncated.

Regards
Laurence

I am not sure I have the backtrace right yet (using my test node to debug), but to me it looks like the cluster pointer is NULL:

```
(gdb) up
#1  0x00007f7d3bcece7e in rados_service_update_status () from /lib64/librados.so.2
(gdb) up
#2  0x00007f7d3cb3546e in tcmu_rbd_service_status_update (dev=0x41a7600, has_lock=false)
    at /usr/src/debug/tcmu-runner-1.5.4-7.el8.x86_64/rbd.c:137
137             ret = rados_service_update_status(state->cluster, status_buf);
(gdb) p state->cluster
$1 = (rados_t) 0x0
```

```c
static int tcmu_rbd_service_status_update(struct tcmu_device *dev,
					  bool has_lock)
{
	struct tcmu_rbd_state *state = tcmur_dev_get_private(dev);
	struct tcmur_device *rdev = tcmu_dev_get_private(dev);
	char *status_buf = NULL;
	int ret;

	ret = asprintf(&status_buf,
		       "%s%c%s%c%s%c%"PRIu64"%c%s%c%"PRIu64"%c%s%c%"PRIu64"%c",
		       "lock_owner", '\0', has_lock ? "true" : "false", '\0',
		       "lock_lost_cnt", '\0', rdev->lock_lost_cnt, '\0',
		       "conn_lost_cnt", '\0', rdev->conn_lost_cnt, '\0',
		       "cmd_timed_out_cnt", '\0', rdev->cmd_timed_out_cnt, '\0');
	if (ret < 0) {
		tcmu_dev_err(dev, "Could not allocate status buf. Service will not be updated.\n");
		return ret;
	}

	ret = rados_service_update_status(state->cluster, status_buf);
	if (ret < 0) {
		tcmu_dev_err(dev, "Could not update service status. (Err %d)\n", ret);
	}

	free(status_buf);
	return ret;
}
```

tcmur_device: skip reporting events if the device is closed

After the rbd device is blocklisted it will lose the connection and at the same time all of its IOs will time out, while tcmu-runner reports a 'lock_lost_cnt' event to the ceph cluster. If the device is reopened during this window it will cause a use-after-free bug.

Notes
When reopening the device it will hold the 'rdev->rdev_lock' lock and then clear the 'TCMUR_DEV_FLAG_IS_OPEN' flag. We can check this flag and just skip reporting events if the device is closed.
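As a rough illustration of the fix described in the notes above (skip reporting events once the device is no longer open), the idea is to check the open flag under 'rdev->rdev_lock' before touching the cluster handle. The following is a minimal, self-contained sketch; the struct layout, flag value, and helper names are simplified stand-ins for illustration, not the actual tcmu-runner code.

```c
/*
 * Minimal sketch of the guard described in the notes above: skip the status
 * update when the device is no longer open. The struct layout, flag value and
 * helper below are simplified stand-ins, not the real tcmu-runner types.
 */
#include <pthread.h>
#include <stdio.h>

#define TCMUR_DEV_FLAG_IS_OPEN (1 << 0)	/* assumed bit value, for illustration only */

struct tcmur_device {
	pthread_mutex_t rdev_lock;	/* held while the device is being (re)opened */
	unsigned int flags;		/* IS_OPEN is cleared before the reopen */
};

/* Stand-in for the real rados_service_update_status() call. */
static int report_status(const char *status_buf)
{
	printf("reporting: %s\n", status_buf);
	return 0;
}

/*
 * Only report while the device is still open, so the handler never touches a
 * cluster handle that the reopen/close path has already freed.
 */
static int service_status_update(struct tcmur_device *rdev, const char *status_buf)
{
	int ret;

	pthread_mutex_lock(&rdev->rdev_lock);
	if (!(rdev->flags & TCMUR_DEV_FLAG_IS_OPEN)) {
		/* Device closed or mid-reopen: skip the event. */
		pthread_mutex_unlock(&rdev->rdev_lock);
		return 0;
	}
	ret = report_status(status_buf);
	pthread_mutex_unlock(&rdev->rdev_lock);

	return ret;
}

int main(void)
{
	struct tcmur_device rdev = { .flags = TCMUR_DEV_FLAG_IS_OPEN };

	pthread_mutex_init(&rdev.rdev_lock, NULL);

	service_status_update(&rdev, "lock_owner=false");	/* reported */

	rdev.flags &= ~TCMUR_DEV_FLAG_IS_OPEN;			/* device closed */
	service_status_update(&rdev, "lock_lost_cnt=1");	/* skipped */

	pthread_mutex_destroy(&rdev.rdev_lock);
	return 0;
}
```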
So null means the device is closed?

Need to understand why we were blocklisted though. That should only happen if we have access to the same LUN from two gateways at the same time.

@Raimund, let's get the logs also available so we can check the blocklisting.

Chatted with Raimund.

Hello @rsachere and @loberman,

FYI, I'd be cautious about using the Block List as anything of value. There was a site down and a Block List entry was thought to be part of the cause, and it turned out to have nothing to do with it. Raimund, ping HKlein if you want to know more.

Best regards,
Manny

(In reply to loberman from comment #14)
> So null means the device is closed?

No, it is the Rados network connection entity that is freed after the device is closed.

> Need to understand why we were blocklisted though. That should only happen
> if we have access to the same LUN from two gateways at the same time.
>
> @Raimund, let's get the logs also available so we can check the blocklisting.

Yeah, for tcmu-runner there will always be only one active LUN, so when another gateway tries to access the ceph Rados, that gateway will try to acquire the rbd exclusive lock from Rados and Rados will then blocklist the current lock owner. Please make sure the multipath configuration on the initiator side is correct; currently ceph-iscsi cannot do real active/active, so the customer should make sure the path choice policy is ALUA.

Thanks.
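As background to the exclusive-lock behaviour described above, a second client acquiring the RBD exclusive lock on the same image is, per that comment, what leads Rados to blocklist the current lock owner. A minimal standalone sketch using the public librados/librbd C API might look like the following; the client, pool, and image names are placeholders, not values from this bug.

```c
/*
 * Illustration only: request the RBD exclusive lock from a second client.
 * Per the comment above, this handoff is what causes the cluster to blocklist
 * the previous lock owner (e.g. the other iSCSI gateway).
 * Client ("admin"), pool ("rbd") and image ("test-image") are placeholders.
 */
#include <rados/librados.h>
#include <rbd/librbd.h>
#include <stdio.h>

int main(void)
{
	rados_t cluster;
	rados_ioctx_t ioctx;
	rbd_image_t image;
	int ret;

	if (rados_create(&cluster, "admin") < 0)
		return 1;
	rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
	if (rados_connect(cluster) < 0) {
		rados_shutdown(cluster);
		return 1;
	}

	if (rados_ioctx_create(cluster, "rbd", &ioctx) < 0) {
		rados_shutdown(cluster);
		return 1;
	}

	if (rbd_open(ioctx, "test-image", &image, NULL) == 0) {
		/* Request the exclusive lock from this client. */
		ret = rbd_lock_acquire(image, RBD_LOCK_MODE_EXCLUSIVE);
		if (ret == 0)
			printf("exclusive lock acquired\n");
		else
			fprintf(stderr, "rbd_lock_acquire failed: %d\n", ret);
		rbd_close(image);
	}

	rados_ioctx_destroy(ioctx);
	rados_shutdown(cluster);
	return 0;
}
```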
@Xiubo Li, request you to provide the complete test steps to verify the BZ.

Hi Xiubo Li,

The following are the test steps performed for verifying the BZ. No issue was seen during the test.

1. Set up a ceph-iscsi cluster, which should include 2 gateways, and enable the debug logs in the tcmu-runner services on these two gateways.
2. On the initiator side, use fio or another test tool to write raw IOs to the two '/dev/sd{x}' scsi devices directly at the same time.
3. Run the test in step 2 for minutes or longer to see whether the crash reproduces.
4. If you hit any crash again, please provide the debug logs.

NOTE -
1) IOs against the same image via different gateway paths, i.e. /dev/sda and /dev/sdb, were run for a few hours; the issue was not seen.
2) Ran FIO against random LUNs for 5-6 hours and no issue was seen.

FYI, debug logging was enabled and there was no segfault/crash or any error noticed in the tcmu logs.

ceph version and tcmu snippet:

```
[root@ceph-amk-upgrade-53-ga-s8g9v2-node1-installer cephuser]# rpm -qa | grep tcmu*
[root@ceph-amk-upgrade-53-ga-s8g9v2-node1-installer cephuser]# ceph version
ceph version 16.2.10-171.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)

[root@ceph-amk-upgrade-53-ga-s8g9v2-node5 /]# rpm -qa | grep tcmu*
tcmu-runner-1.5.4-8.el8cp.x86_64
libtcmu-1.5.4-8.el8cp.x86_64
```

FIO workload used -

```
fio --name=test-1 --numjobs=1 --rw=write/randwrite --bs=4/8/32k --iodepth=8 --fsync=32 --time_based --group_reporting --ioengine=libaio --filename=/dev/sda
```

(In reply to Preethi from comment #46)

Nice work, Preethi. Thanks very much for your tests.

- Xiubo

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 5.3 Bug Fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3259