677703 – [RHEL5.5] Panic in iscsi_sw_tcp_data_ready()

Bug 677703 - [RHEL5.5] Panic in iscsi_sw_tcp_data_ready()

Summary: [RHEL5.5] Panic in iscsi_sw_tcp_data_ready()

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.5
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Mike Christie
QA Contact:	Storage QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	703056
TreeView+	depends on / blocked

Reported:	2011-02-15 16:05 UTC by Gary Smith
Modified:	2018-11-14 16:55 UTC (History)
CC List:	11 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Running a reboot test on an iSCSI root host resulted in kernel panic. When the iscsi_tcp module is destroying a connection it grabs the sk_callback_lock and clears the sk_user_data/conn pointer to signal that the callback functions should not execute the operation. However, some functions were not grabbing the lock, causing a NULL pointer kernel panic when iscsi_sw_tcp_conn_restore_callbacks was called and, consequently, one of the callbacks was called. With this update, the underlying source code has been modified to address this issue, and kernel panic no longer occurs.
Clone Of:
Environment:
Last Closed:	2011-07-21 09:27:08 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
access sk user data under callback lock (3.24 KB, patch) 2011-03-10 22:57 UTC, Mike Christie	no flags	Details \| Diff
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2011:1065	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.7 kernel security and bug fix update	2011-07-21 09:21:37 UTC

Description Gary Smith 2011-02-15 16:05:08 UTC

Description of problem:

The following panic is seen in a reboot test on an iSCSI root host:

Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP:
[<ffffffff880f6e06>] :iscsi_tcp:iscsi_sw_tcp_data_ready+0x19/0x5c
PGD 23d345067 PUD 23f5e1067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /class/iscsi_transport/tcp/caps
CPU 0
Modules linked in: be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i cxgb3 dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg shpchp i2c_i801 i2c_core pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage mptsas mptscsih scsi_transport_sas mptbase igb 8021q dca iscsi_ibft iscsi_tcp libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 3777, comm: S08ip6tables Not tainted 2.6.18-194.11.1.el5 #1
RIP: 0010:[<ffffffff880f6e06>] [<ffffffff880f6e06>] :iscsi_tcp:iscsi_sw_tcp_data_ready+0x19/0x5c
RSP: 0018:ffffffff80446c10 EFLAGS: 00010282
RAX: 0000000000000001 RBX: ffff81013ea1d880 RCX: 0000000000000001
RDX: 00000000fffc9b50 RSI: 0000000000000000 RDI: ffff81013ea1d9c0
RBP: ffff81013ea1d880 R08: 0000000000000003 R09: 00000000fffc9b50
R10: 000000000002fa57 R11: 00000000000005a8 R12: 0000000000000000
R13: 0000000000000020 R14: 0000000000000000 R15: ffff81013ee85034
FS: 00002b3674f6cdd0(0000) GS:ffffffff803ca000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000023e6dd000 CR4: 00000000000006e0
Process S08ip6tables (pid: 3777, threadinfo ffff81023cbc4000, task ffff81023f65e100)
Stack: 0000000000000082 0000000000000001 ffff81013ea1d880 ffff81013e8afe80
0000000000000001 ffff81013ea1d880 ffff81013e8afe80 ffffffff8001bcc5
ffff81013ee8e6d8 000000003ea1d880 ffff81013e8afed8 ffff81013ea1d880
Call Trace:
<IRQ> [<ffffffff8001bcc5>] tcp_rcv_established+0x63a/0x925
[<ffffffff8003b494>] tcp_v4_do_rcv+0x2a/0x2fa
[<ffffffff800a09e1>] autoremove_wake_function+0x9/0x2e
[<ffffffff800272ae>] tcp_v4_rcv+0x9e0/0xa49
[<ffffffff80034766>] ip_local_deliver+0x19d/0x263
[<ffffffff800358d4>] ip_rcv+0x539/0x57c
[<ffffffff80020a3b>] netif_receive_skb+0x470/0x49f
[<ffffffff8022eeb8>] napi_gro_flush+0x1e/0x39
[<ffffffff8813033c>] :igb:igb_poll+0x6ae/0x9b6
[<ffffffff8000c88c>] net_rx_action+0xac/0x1e0
[<ffffffff800123b4>] __do_softirq+0x89/0x133
[<ffffffff8005e2fc>] call_softirq+0x1c/0x28
[<ffffffff8006cb8e>] do_softirq+0x2c/0x85
[<ffffffff8006ca16>] do_IRQ+0xec/0xf5
[<ffffffff8005d615>] ret_from_intr+0x0/0xa
<EOI> [<ffffffff8001fbcb>] copy_process+0xbeb/0x15eb
[<ffffffff8001fd43>] copy_process+0xd63/0x15eb
[<ffffffff800311bb>] do_fork+0x69/0x1c1
[<ffffffff800988b7>] recalc_sigpending+0xe/0x25
[<ffffffff8005d116>] system_call+0x7e/0x83
[<ffffffff8005d427>] ptregscall_common+0x67/0xac


Code: 49 8b 6c 24 08 e8 5a dd f6 f7 48 89 e6 48 c7 c2 82 6b 0f 88
RIP [<ffffffff880f6e06>] :iscsi_tcp:iscsi_sw_tcp_data_ready+0x19/0x5c
RSP <ffffffff80446c10>
CR2: 0000000000000008


How reproducable:
ca. 1/50 reboots

Steps to reproduce:
Install system on iSCSI root, start reboot test


Additional info:

Quick analysis from Martin Wilck @ FJ:

103 static void iscsi_sw_tcp_data_ready(struct sock *sk, int flag)
104 {
105 struct iscsi_conn *conn = sk->sk_user_data;
106 struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
107 read_descriptor_t rd_desc;
108
109 read_lock(&sk->sk_callback_lock);

The error occurs on line 106 because conn = sk->sk_user_data is NULL.
Glancing at the code, this looks like a race condition problem, as sk_user_data is protected by the sk_callback_lock, but that lock is taken in this function only after dereferencing sk_user_data.

Martin suggests that what would be needed is to take the lock immediately, and have some fallback code in the case (conn == NULL). Note that similar code appears in other functions in iscsi_tcp.c, that code might need fixing, too. 

RHEL6 and the latest upstream kernel also seem to have the same code.

Comment 1 RHEL Program Management 2011-02-15 18:39:40 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 7 Mike Christie 2011-03-10 22:57:16 UTC

Created attachment 483588 [details]
access sk user data under callback lock

Hey Gary,

The attached patch should fix the problem. I cannot replicate the problem here so I need FJ to test.

I made a i686 and x86_64 kernels here
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3171660
with the patch merged. Could you give to your contacts for verification?

Thanks

Comment 9 Martin Wilck 2011-03-15 12:18:23 UTC

Jusat encountered this problem again on RHEL5.6 after ~1400 reboots.

Comment 10 Martin Wilck 2011-03-15 12:28:09 UTC

Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP: 
 [<ffffffff880f7df6>] :iscsi_tcp:iscsi_sw_tcp_data_ready+0x19/0x5c
PGD 0 
Oops: 0000 [1] SMP 
last sysfs file: /class/iscsi_transport/tcp/caps
CPU 0 
Modules linked in: be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i c
xgb3 dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp
 parport sg shpchp tpm_tis tpm i7core_edac tpm_bios i2c_i801 pcspkr edac_mc i2c_core dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snap
shot dm_zero dm_mirror dm_log dm_mod usb_storage lpfc scsi_transport_fc ahci libata ixgbe 8021q dca iscsi_ibft iscsi_tcp libiscsi_tcp libiscs
i2 scsi_transport_iscsi2 scsi_transport_iscsi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Not tainted 2.6.18-243.el5.bz647824.2.1fts #1
RIP: 0010:[<ffffffff880f7df6>]  [<ffffffff880f7df6>] :iscsi_tcp:iscsi_sw_tcp_data_ready+0x19/0x5c
RSP: 0018:ffffffff804a5c20  EFLAGS: 00010282
RAX: 0000000000000001 RBX: ffff81047ee89980 RCX: 00000000fffc7e11
RDX: 00000000fffc7e11 RSI: 0000000000000000 RDI: ffff81047ee89ac0
RBP: ffff81047ee89980 R08: 00000000c149640f R09: 00000000fffc7e11
R10: 000000000002fa57 R11: 00000000000004b0 R12: 0000000000000000
R13: 0000000000000020 R14: ffff81010fe6e0a2 R15: ffff81047f3fee18
FS:  0000000000000000(0000) GS:ffffffff80426000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000000201000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff80456000, task ffffffff80311b60)
Stack:  ffff81087e82e300 ffff81047fc20ac0 000000004a4a2e30 ffff81047ee89980
 ffff81047ee89980 ffff81047ee89980 ffff81047f3fedc0 ffffffff8001bf16
 0000000000000000 ffff81047ee89980 ffff81087e82e300 ffff81047f3fedc0
Call Trace:
 <IRQ>  [<ffffffff8001bf16>] tcp_rcv_established+0x5dc/0x8b9
 [<ffffffff8003b87f>] tcp_v4_do_rcv+0x2a/0x2fa
 [<ffffffff80027466>] tcp_v4_rcv+0x9f9/0xa4d
 [<ffffffff80027466>] tcp_v4_rcv+0x9f9/0xa4d
 [<ffffffff80034acb>] ip_local_deliver+0x19d/0x263
 [<ffffffff80034acb>] ip_local_deliver+0x19d/0x263
 [<ffffffff80035c27>] ip_rcv+0x539/0x57c
 [<ffffffff80020ba2>] netif_receive_skb+0x470/0x49f
 [<ffffffff8812efed>] :ixgbe:ixgbe_clean_rx_irq+0x537/0x7a8
 [<ffffffff8813528f>] :ixgbe:ixgbe_clean_rxtx_many+0x10c/0x223
 [<ffffffff8000ca35>] net_rx_action+0xac/0x1b3
 [<ffffffff8001253c>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006d5f5>] do_softirq+0x2c/0x7d
 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff801a13ab>] acpi_processor_idle_simple+0x1af/0x31c
 [<ffffffff801a1368>] acpi_processor_idle_simple+0x16c/0x31c
 [<ffffffff801a11fc>] acpi_processor_idle_simple+0x0/0x31c
 [<ffffffff800494f8>] cpu_idle+0x95/0xb8
 [<ffffffff80461807>] start_kernel+0x220/0x225
 [<ffffffff8046122f>] _sinittext+0x22f/0x236


Code: 49 8b 6c 24 08 e8 2a cd f6 f7 48 89 e6 48 c7 c2 72 7b 0f 88 
RIP  [<ffffffff880f7df6>] :iscsi_tcp:iscsi_sw_tcp_data_ready+0x19/0x5c
 RSP <ffffffff804a5c20>
CR2: 0000000000000008
 <0>Kernel panic - not syncing: Fatal exception

Comment 11 Martin Wilck 2011-03-15 13:54:37 UTC

Weird ... I can't read comment #7 here although I received an email with its contents. However I can confirm that attachment #483588 [details] looks promising.

Comment 12 Gary Smith 2011-03-15 14:37:42 UTC

> Weird ... I can't read comment #7 here although I received an email with its
> contents. However I can confirm that attachment #483588 [details] looks promising.

Sorry Martin, my bad, I made comment#7 private out of habit because it was directed to me internally and contained an internal link. Packages now uploaded to support case.

Comment 13 Martin Wilck 2011-04-18 08:22:58 UTC

Verified - Gary's test kernel fixes the problem.

Comment 19 Jarod Wilson 2011-05-13 22:19:19 UTC

Patch(es) available in kernel-2.6.18-261.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 22 Martin Prpič 2011-07-12 11:54:33 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Running a reboot test on an iSCSI root host resulted in kernel panic. When the iscsi_tcp module is destroying a connection it grabs the sk_callback_lock and clears the sk_user_data/conn pointer to signal that the callback functions should not execute the operation. However, some functions were not grabbing the lock, causing a NULL pointer kernel panic when iscsi_sw_tcp_conn_restore_callbacks was called and, consequently, one of the callbacks was called. With this update, the underlying source code has been modified to address this issue, and kernel panic no longer occurs.

Comment 23 Martin Wilck 2011-07-20 14:00:23 UTC

We saw a very similar panic situation on RHEL6.1 lately (see bug #718786). The RHEL6.1 kernel (and upstream) don't seem to contain anything similar to the patch applied here ... Is it possible that the same fix is needed in 6.1 (and upstream), too?

Comment 24 Mike Christie 2011-07-21 03:59:57 UTC

I just sent the fix upstream a couple weeks ago so I do not think it is in Linus's tree. It is in the scsi maintainer's tree. I also just sent it for 6.2.

Comment 25 errata-xmlrpc 2011-07-21 09:27:08 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html

Note You need to log in before you can comment on or make changes to this bug.