Bug 1400424 - tgtd segfault
Summary: tgtd segfault
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: scsi-target-utils
Version: 6.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Andy Grover
QA Contact: Martin Hoyer
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-12-01 08:09 UTC by nikhil kshirsagar
Modified: 2020-01-17 16:17 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-13 17:59:33 UTC
Target Upstream Version:



Description nikhil kshirsagar 2016-12-01 08:09:46 UTC
Description of problem:


Nov 30 08:16:30 i-8D377A5F kernel: tgtd[2168]: segfault at 1af9088 ip 00000000004067bb sp 00007fffc2283f10 error 6 in tgtd (deleted)[400000+3e000]
Nov 30 08:16:31 i-8D377A5F abrt[28913]: File '/usr/sbin/tgtd' seems to be deleted
Nov 30 08:16:31 i-8D377A5F abrtd: Directory 'ccpp-2016-11-30-08:16:31-2168' creation detected
Nov 30 08:16:31 i-8D377A5F abrt[28913]: Saved core dump of pid 2168 (/usr/sbin/tgtd) to /var/spool/abrt/ccpp-2016-11-30-08:16:31-2168 (168697856 bytes)


Version-Release number of selected component (if applicable):
scsi-target-utils-1.0.24-10.el6.x86_64


How reproducible:
Unknown. The customer only noticed that tgtd's status was "tgtd dead but subsys locked" after a client failed to log in to the SCSI server.




Additional info:
[tgt@dhcp-192-23 ccpp-2016-11-30-08:16:31-2168]$ gdb coredump /usr/bin/tgtd
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-75.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
"/home/tgt/01749253/var/spool/abrt/ccpp-2016-11-30-08:16:31-2168/coredump" is a core file.
Please specify an executable to debug.
/usr/bin/tgtd: No such file or directory.
(gdb) quit
[tgt@dhcp-192-23 ccpp-2016-11-30-08:16:31-2168]$ 
[tgt@dhcp-192-23 ccpp-2016-11-30-08:16:31-2168]$ gdb /usr/sbin/tgtd coredump 
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-75.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/tgtd...Reading symbols from /usr/lib/debug/usr/sbin/tgtd.debug...done.
done.
[New Thread 2168]
[New Thread 2382]
[New Thread 2373]
[New Thread 2384]
[New Thread 2376]
[New Thread 2377]
[New Thread 2372]
[New Thread 2386]
[New Thread 2374]
[New Thread 2378]
[New Thread 2381]
[New Thread 2385]
[New Thread 2379]
[New Thread 2383]
[New Thread 2380]
[New Thread 2375]
[New Thread 2387]
Missing separate debuginfo for 
Try: yum --enablerepo='*-debug*' install /usr/lib/debug/.build-id/81/a81be2e44c93640adedb62adc93a47f4a09dd1
Reading symbols from /lib64/libaio.so.1.0.1...Reading symbols from /usr/lib/debug/lib64/libaio.so.1.0.1.debug...done.
done.
Loaded symbols for /lib64/libaio.so.1.0.1
Reading symbols from /usr/lib64/libibverbs.so.1.0.0...Reading symbols from /usr/lib/debug/usr/lib64/libibverbs.so.1.0.0.debug...done.
done.
Loaded symbols for /usr/lib64/libibverbs.so.1.0.0
Reading symbols from /usr/lib64/librdmacm.so.1.0.0...Reading symbols from /usr/lib/debug/usr/lib64/librdmacm.so.1.0.0.debug...done.
done.
Loaded symbols for /usr/lib64/librdmacm.so.1.0.0
Reading symbols from /lib64/libpthread-2.12.so...Reading symbols from /usr/lib/debug/lib64/libpthread-2.12.so.debug...done.
[Thread debugging using libthread_db enabled]
done.
Loaded symbols for /lib64/libpthread-2.12.so
Reading symbols from /lib64/libc-2.12.so...Reading symbols from /usr/lib/debug/lib64/libc-2.12.so.debug...done.
done.
Loaded symbols for /lib64/libc-2.12.so
Reading symbols from /lib64/libdl-2.12.so...Reading symbols from /usr/lib/debug/lib64/libdl-2.12.so.debug...done.
done.
Loaded symbols for /lib64/libdl-2.12.so
Reading symbols from /lib64/ld-2.12.so...Reading symbols from /usr/lib/debug/lib64/ld-2.12.so.debug...done.
done.
Loaded symbols for /lib64/ld-2.12.so
Core was generated by `tgtd'.
Program terminated with signal 11, Segmentation fault.
#0  __list_del (conn=0x1ab8328) at ./list.h:82
82		next->prev = prev;
(gdb) where
#0  __list_del (conn=0x1ab8328) at ./list.h:82
#1  list_del (conn=0x1ab8328) at ./list.h:88
#2  iscsi_task_tx_start (conn=0x1ab8328) at iscsi/iscsid.c:1916
#3  iscsi_tx_handler (conn=0x1ab8328) at iscsi/iscsid.c:2191
#4  0x000000000040e3f8 in iscsi_tcp_event_handler (fd=<value optimized out>, events=<value optimized out>, data=0x1ab8328) at iscsi/iscsi_tcp.c:164
#5  0x000000000041a0f9 in event_loop () at tgtd.c:411
#6  0x000000000041a7aa in main (argc=<value optimized out>, argv=<value optimized out>) at tgtd.c:583
(gdb) 
#0  __list_del (conn=0x1ab8328) at ./list.h:82
#1  list_del (conn=0x1ab8328) at ./list.h:88
#2  iscsi_task_tx_start (conn=0x1ab8328) at iscsi/iscsid.c:1916
#3  iscsi_tx_handler (conn=0x1ab8328) at iscsi/iscsid.c:2191
#4  0x000000000040e3f8 in iscsi_tcp_event_handler (fd=<value optimized out>, events=<value optimized out>, data=0x1ab8328) at iscsi/iscsi_tcp.c:164
#5  0x000000000041a0f9 in event_loop () at tgtd.c:411
#6  0x000000000041a7aa in main (argc=<value optimized out>, argv=<value optimized out>) at tgtd.c:583
(gdb) f 3
#3  iscsi_tx_handler (conn=0x1ab8328) at iscsi/iscsid.c:2191
2191			ret = iscsi_task_tx_start(conn);
(gdb) l
2186			ddigest = p[ISCSI_PARAM_DATADGST_EN].val & DIGEST_CRC32C;
2187		} else
2188			hdigest = ddigest = 0;
2189	
2190		if (conn->state == STATE_SCSI && !conn->tx_task) {
2191			ret = iscsi_task_tx_start(conn);
2192			if (ret)
2193				goto out;
2194		}
2195	

(gdb) p conn
$1 = (struct iscsi_connection *) 0x1ab8328
(gdb) p *conn
$2 = {state = 12, closed = 0, rx_iostate = 1, tx_iostate = 13, refcount = 1, clist = {next = 0x1ab7950, prev = 0x1ab7950}, session = 0x1ab7910, tid = 1, session_param = {{state = 0, val = 8192}, {state = 2, val = 8192}, {state = 2, 
      val = 1}, {state = 2, val = 1}, {state = 2, val = 1}, {state = 2, val = 1}, {state = 2, val = 1}, {state = 2, val = 65536}, {state = 2, val = 262144}, {state = 2, val = 1}, {state = 2, val = 1}, {state = 2, val = 0}, {state = 0, 
      val = 0}, {state = 0, val = 0}, {state = 2, val = 2}, {state = 2, val = 20}, {state = 0, val = 2048}, {state = 0, val = 2048}, {state = 2, val = 1}, {state = 0, val = 0}, {state = 0, val = 262144}, {state = 0, val = 262144}, {
      state = 0, val = 0}}, initiator = 0x1ab78c0 "iqn.1991-05.com.microsoft:langchaobackup.intranet.gzaic.gov.cn", isid = "@\000\001\067\000", tsih = 143, cid = 1, session_type = 0, auth_method = 0, stat_sn = 3, exp_stat_sn = 2, 
  cmd_sn = 0, exp_cmd_sn = 4523092, max_cmd_sn = 4523093, req = {bhs = {opcode = 1 '\001', flags = 193 '\301', rsvd2 = "\000", hlength = 0 '\000', dlength = "\000\000", lun = "\000\001\000\000\000\000\000", itt = 1409565952, ttt = 1024, 
      statsn = 1409565952, exp_statsn = 33554432, max_statsn = 136, other = "\000\006d\361sp\000\000\002\000\000"}, ahs = 0x1ab38a0, ahssize = 0, data = 0x1afa000, datasize = 0}, req_buffer = 0x1ab38a0, rsp = {bhs = {opcode = 0 '\000', 
      flags = 0 '\000', rsvd2 = "\000", hlength = 0 '\000', dlength = "\000\000", lun = "\000\000\000\000\000\000\000", itt = 0, ttt = 0, statsn = 0, exp_statsn = 0, max_statsn = 0, other = '\000' <repeats 11 times>}, ahs = 0x0, 
    ahssize = 0, data = 0x0, datasize = 0}, rsp_buffer = 0x1ab58b0, rsp_buffer_size = 8192, rx_buffer = 0x1ab8448 "\001\301", tx_buffer = 0x1ab84a0 "", rx_size = 48, tx_size = 48, ttt = 0, text_datasize = 0, text_rsp_buffer = 0x0, 
  rx_task = 0x0, tx_task = 0x0, tx_clist = {next = 0x1aaeba8, prev = 0x1ab88f0}, task_list = {next = 0x1ab8548, prev = 0x1ab8548}, rx_digest = "\000\000\000", tx_digest = "\000\000\000", auth_state = 0, auth = {chap = {digest_alg = 0, 
      id = 0, challenge_size = 0, challenge = 0x0}}, tp = 0x63e7e0, stats = {txdata_octets = 264140, rxdata_octets = 612, noptx_pdus = 0, scsicmd_pdus = 1, tmfcmd_pdus = 0, login_pdus = 0, text_pdus = 0, dataout_pdus = 0, 
    logout_pdus = 0, snack_pdus = 0, noprx_pdus = 0, scsirsp_pdus = 1, tmfrsp_pdus = 0, textrsp_pdus = 0, datain_pdus = 32, logoutrsp_pdus = 0, r2t_pdus = 0, async_pdus = 0, rjt_pdus = 0, digest_err = 0, timeout_err = 0, 
    custom_length = 0, custom = 0x1ab8328}}
(gdb) f 2
#2  iscsi_task_tx_start (conn=0x1ab8328) at iscsi/iscsid.c:1916
1916		list_del(&task->c_list);
(gdb) l
1911		dprintf("found a task %" PRIx64 " %u %u %u\n", task->tag,
1912			ntohl(((struct iscsi_cmd *) (&task->req))->data_length),
1913			task->offset,
1914			task->r2t_count);
1915	
1916		list_del(&task->c_list);
1917	
1918		switch (task->req.opcode & ISCSI_OPCODE_MASK) {
1919		case ISCSI_OP_SCSI_CMD:
1920			err = iscsi_scsi_cmd_tx_start(task);
(gdb) p task
$3 = (struct iscsi_task *) 0x1aaeb28
(gdb) p *task
$4 = {req = {opcode = 46 '.', flags = 105 'i', rsvd2 = "nt", hlength = 114 'r', dlength = "ane", lun = "t.gzaic.", itt = 779513703, ttt = 28259, statsn = 0, exp_statsn = 0, max_statsn = 417, 
    other = "\000\000\000\000P\360\252\001\000\000\000"}, rsp = {opcode = 216 '\330', flags = 254 '\376', rsvd2 = "8", <incomplete sequence \351>, hlength = 58 ':', dlength = "\000\000", lun = "\000\000\000\000\000\000\000", itt = 0, 
    ttt = 0, statsn = 0, exp_statsn = 0, max_statsn = 27978912, other = "\000\000\000\000\300i\252\001\000\000\000"}, tag = 39969447615725632, conn = 0x1aaeb90, c_hlist = {next = 0x1aaeb90, prev = 0x1}, c_list = {next = 0x1af9080, 
    prev = 0x1ab8538}, c_siblings = {next = 0x1aaebb8, prev = 0x1aaebb8}, flags = 4523092, result = 8192, len = 2, offset = 8192, r2t_count = 2, unsol_count = 1, exp_r2tsn = 2, ahs = 0x200000001, data = 0x200000001, scmd = {
    c_target = 0x200000001, c_hlist = {next = 0x200000001, prev = 0x200010000}, qlist = {next = 0x200040000, prev = 0x200000001}, dev_id = 8589934593, dev = 0x0, state = 0, data_dir = DATA_NONE, in_sdb = {resid = 2, length = 2, 
      buffer = 20}, out_sdb = {resid = 2048, length = 0, buffer = 8589936640}, cmd_itn_id = 1, offset = 0, tl = 262144, scb = 0x40000 <Address 0x40000 out of bounds>, scb_len = 0, lun = "\000\000\000\000\200\060\253\001", attribute = 0, 
    tag = 336, result = 80, mreq = 0x1aaeaf0, 
    sense_buffer = "-05.com.microsoft:langchaobackup.intranet.gzaic.gov.cn\000\000\360\001\000\000\000\000\000\000`", '\000' <repeats 15 times>, "]\032>X\000\000\000\000\000\355\252\001\000\000\000\000\000\355\252\001\000\000\000\000pH\252\001", '\000' <repeats 28 times>, "0\355\252\001\000\000\000\000\060\355\252\001\000\000\000\000\200\060\253\001\000\000\000\000a\000\000\000\000\000\000\000P\177\253\001\000\000\000\000P\360\252\001", '\000' <repeats 12 times>, "h\355\252\001\000\000\000\000h\355\252\001", '\000' <repeats 20 times>, "!\000\000\000\000\000\000\000\350\376\070\351:\000\000\000\020o\252\001\000\000\000\000\260\002\000", sense_len = 0, bs_list = {next = 0x2b0, prev = 0xc101}, 
    it_nexus = 0x100, itn_lu_info = 0x400d6d74400}, extdata = 0x1aaeb28}
(gdb) bt
#0  __list_del (conn=0x1ab8328) at ./list.h:82
#1  list_del (conn=0x1ab8328) at ./list.h:88
#2  iscsi_task_tx_start (conn=0x1ab8328) at iscsi/iscsid.c:1916
#3  iscsi_tx_handler (conn=0x1ab8328) at iscsi/iscsid.c:2191
#4  0x000000000040e3f8 in iscsi_tcp_event_handler (fd=<value optimized out>, events=<value optimized out>, data=0x1ab8328) at iscsi/iscsi_tcp.c:164
#5  0x000000000041a0f9 in event_loop () at tgtd.c:411
#6  0x000000000041a7aa in main (argc=<value optimized out>, argv=<value optimized out>) at tgtd.c:583
(gdb) f 1
#1  list_del (conn=0x1ab8328) at ./list.h:88
88		__list_del(entry->prev, entry->next);
(gdb) l
83		prev->next = next;
84	}
85	
86	static inline void list_del(struct list_head *entry)
87	{
88		__list_del(entry->prev, entry->next);
89		entry->next = entry->prev = NULL;
90	}
91	
92	static inline void list_del_init(struct list_head *entry)
(gdb) p *entry
$5 = {next = 0x1af9080, prev = 0x1ab8538}

(gdb) p *0x1af9080
Cannot access memory at address 0x1af9080 <------ invalid address
(gdb) p *0x1ab8538
$6 = 27978664
(gdb) 
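
The numbers above are consistent with each other, which can be checked with a little arithmetic. The sketch below assumes 8-byte pointers (x86_64), so `prev` sits 8 bytes into `struct list_head`. The store `next->prev = prev` with `next = 0x1af9080` then writes to 0x1af9088, which is the address in the kernel "segfault at" line. Error code 6 decodes to a user-mode write to a page that is not present. The value 27978664 printed for `*0x1ab8538` is 0x1aaeba8, the same as `conn->tx_clist.next`, so 0x1ab8538 appears to be `&conn->tx_clist` and the `prev` side of the link looks intact; only `next` is bad. The helper names (`list_del_write_addr`, `decode_fault`, `check_report_values`) are made up for this illustration and are not part of tgt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same two-pointer layout as struct list_head in list.h. */
struct list_head {
    struct list_head *next, *prev;
};

/* Address that "next->prev = prev" writes to, given the value of next. */
static uintptr_t list_del_write_addr(uintptr_t next)
{
    return next + offsetof(struct list_head, prev);
}

/* x86 page-fault error code bits, as printed after "error" in the kernel log. */
struct fault_bits {
    int protection; /* bit 0: 0 = page not present, 1 = protection violation */
    int write;      /* bit 1: 1 = the access was a write */
    int user;       /* bit 2: 1 = the access came from user mode */
};

static struct fault_bits decode_fault(unsigned int code)
{
    struct fault_bits b;

    b.protection = code & 1;
    b.write = (code >> 1) & 1;
    b.user = (code >> 2) & 1;
    return b;
}

/* Checks the values from this report; assumes 8-byte pointers (x86_64). */
static void check_report_values(void)
{
    struct fault_bits b = decode_fault(6);

    assert(list_del_write_addr(0x1af9080) == 0x1af9088); /* kernel "segfault at" */
    assert(!b.protection && b.write && b.user);          /* "error 6" */
    assert(27978664 == 0x1aaeba8);                       /* $6 == conn->tx_clist.next */
}
```

Calling `check_report_values()` from a small `main()` passes all three assertions on an x86_64 build.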

The code in frame 2 appears to be where the issue lies:


1900	static int iscsi_task_tx_start(struct iscsi_connection *conn)
1901	{
1902		struct iscsi_task *task;
1903		int is_rsp, err = 0;
1904	
(gdb) 
1905		if (list_empty(&conn->tx_clist))
1906			goto nodata;
1907	
1908		conn_write_pdu(conn);
1909	
1910		task = list_first_entry(&conn->tx_clist, struct iscsi_task, c_list);            <---- what is this c_list we pass as last argument? Is it a global defined somewhere else?
1911		dprintf("found a task %" PRIx64 " %u %u %u\n", task->tag,
1912			ntohl(((struct iscsi_cmd *) (&task->req))->data_length),
1913			task->offset,
1914			task->r2t_count);
(gdb) 
1915	
1916		list_del(&task->c_list); <--- task->c_list.next is invalid
1917	

80	static inline void __list_del(struct list_head * prev, struct list_head * next)
81	{
82		next->prev = prev; <--- crashes here
83		prev->next = next;
84	}



Looking at list.h, list_first_entry is a macro that expands to list_entry, which in turn expands to container_of, defined as follows:


#define container_of(ptr, type, member) ({                      \
        const typeof( ((type *)0)->member ) *__mptr = (ptr);    \
        (type *)( (char *)__mptr - offsetof(type,member) );})


#define list_first_entry(ptr, type, member) \
        list_entry((ptr)->next, type, member)


#define list_entry(ptr, type, member) \
        container_of(ptr, type, member)

typeof is explained here

https://gcc.gnu.org/onlinedocs/gcc/Typeof.html
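
To illustrate what these macros do with their last argument: `c_list` is not a variable or a global, it is the name of the `struct list_head` member embedded in the containing struct. `container_of` subtracts that member's offset from the node pointer to get back to the enclosing object. The sketch below is a minimal stand-alone illustration, not tgt code. `struct demo_task` is a hypothetical stand-in for `struct iscsi_task`, padded so that `c_list` sits at offset 0x80, which matches the dump above (`conn->tx_clist.next` is 0x1aaeba8 and `task` is 0x1aaeb28). The `container_of` here is a simplified variant that drops the typeof type check so it builds as plain ISO C; `first_task` is likewise a made-up helper name.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Simplified container_of: the same pointer arithmetic as list.h, without
 * the typeof-based type check, so it builds as plain ISO C. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) \
    container_of(ptr, type, member)
#define list_first_entry(ptr, type, member) \
    list_entry((ptr)->next, type, member)

/* Hypothetical stand-in for struct iscsi_task. */
struct demo_task {
    char before[0x80];          /* placeholder for the fields ahead of c_list */
    struct list_head c_list;    /* node linked into the connection's tx_clist */
    int id;
};

static void list_init(struct list_head *head)
{
    head->next = head->prev = head;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Returns the task whose c_list member is the first node on tx_clist. */
static struct demo_task *first_task(struct list_head *tx_clist)
{
    assert(tx_clist->next != tx_clist); /* caller must check for an empty list */
    return list_first_entry(tx_clist, struct demo_task, c_list);
}
```

The macros only do pointer arithmetic on whatever `tx_clist->next` holds. If that node, or the memory around it, has been overwritten, they still return a pointer, and the failure only shows up later when the links are dereferenced.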

I wonder whether these macros are all fine and we simply have memory corruption elsewhere in the code that leaves an invalid pointer in task->c_list.next?

Could we ask them to run tgtd the same way they normally do, but under valgrind, to spot potential memory corruption? Or does someone have a better idea for debugging this macro:


#define container_of(ptr, type, member) ({                      \
        const typeof( ((type *)0)->member ) *__mptr = (ptr);    \
        (type *)( (char *)__mptr - offsetof(type,member) );})

I doubt there is a bug in this macro, so it seems more likely that memory corruption is leaving task->c_list.next pointing at invalid memory.

The customer is not prepared to run valgrind against their reproducer, so I guess the best option is for Engineering to run valgrind on this version of tgtd and check for anything that would explain the corruption.
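
As a complement to valgrind, a debug build could also verify the neighbouring links before unlinking a node, similar in spirit to the Linux kernel's CONFIG_DEBUG_LIST checks. The sketch below is a hypothetical helper, `list_del_checked`, that is not part of tgt. It refuses to unlink a node that was already deleted or whose neighbours do not point back at it. It would not have prevented this particular crash, because reading through a pointer to unmapped memory such as 0x1af9080 still faults, but it can catch stale or double-deleted nodes closer to where they first go wrong.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
    head->next = head->prev = head;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Hypothetical debug variant of list_del(): unlinks entry and returns 0 only
 * if both neighbours point back at it; otherwise returns -1 and leaves the
 * list untouched. */
static int list_del_checked(struct list_head *entry)
{
    struct list_head *prev, *next;

    assert(entry != NULL);
    prev = entry->prev;
    next = entry->next;

    if (prev == NULL || next == NULL)
        return -1;      /* already unlinked by an earlier list_del() */
    if (prev->next != entry || next->prev != entry)
        return -1;      /* neighbours disagree: the links are corrupt */

    next->prev = prev;
    prev->next = next;
    entry->next = entry->prev = NULL;
    return 0;
}
```

A caller could log the task tag and abort the connection on a -1 return instead of continuing with a damaged list.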

Comment 3 Chris Williams 2017-06-13 17:59:33 UTC
Red Hat Enterprise Linux 6 transitioned to the Production 3 Phase on May 10, 2017.  During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.
 
The official life cycle policy can be reviewed here:
 
http://redhat.com/rhel/lifecycle
 
This issue does not appear to meet the inclusion criteria for the Production Phase 3 and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification.  Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:
 
https://access.redhat.com

