Description of problem: The following panic is seen in a reboot test on an iSCSI root host: Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP: [<ffffffff880f6e06>] :iscsi_tcp:iscsi_sw_tcp_data_ready+0x19/0x5c PGD 23d345067 PUD 23f5e1067 PMD 0 Oops: 0000 [1] SMP last sysfs file: /class/iscsi_transport/tcp/caps CPU 0 Modules linked in: be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i cxgb3 dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg shpchp i2c_i801 i2c_core pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage mptsas mptscsih scsi_transport_sas mptbase igb 8021q dca iscsi_ibft iscsi_tcp libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 3777, comm: S08ip6tables Not tainted 2.6.18-194.11.1.el5 #1 RIP: 0010:[<ffffffff880f6e06>] [<ffffffff880f6e06>] :iscsi_tcp:iscsi_sw_tcp_data_ready+0x19/0x5c RSP: 0018:ffffffff80446c10 EFLAGS: 00010282 RAX: 0000000000000001 RBX: ffff81013ea1d880 RCX: 0000000000000001 RDX: 00000000fffc9b50 RSI: 0000000000000000 RDI: ffff81013ea1d9c0 RBP: ffff81013ea1d880 R08: 0000000000000003 R09: 00000000fffc9b50 R10: 000000000002fa57 R11: 00000000000005a8 R12: 0000000000000000 R13: 0000000000000020 R14: 0000000000000000 R15: ffff81013ee85034 FS: 00002b3674f6cdd0(0000) GS:ffffffff803ca000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000008 CR3: 000000023e6dd000 CR4: 00000000000006e0 Process S08ip6tables (pid: 3777, threadinfo ffff81023cbc4000, task ffff81023f65e100) Stack: 0000000000000082 0000000000000001 ffff81013ea1d880 ffff81013e8afe80 0000000000000001 ffff81013ea1d880 ffff81013e8afe80 ffffffff8001bcc5 ffff81013ee8e6d8 000000003ea1d880 ffff81013e8afed8 ffff81013ea1d880 Call Trace: <IRQ> [<ffffffff8001bcc5>] tcp_rcv_established+0x63a/0x925 [<ffffffff8003b494>] tcp_v4_do_rcv+0x2a/0x2fa [<ffffffff800a09e1>] autoremove_wake_function+0x9/0x2e [<ffffffff800272ae>] tcp_v4_rcv+0x9e0/0xa49 [<ffffffff80034766>] ip_local_deliver+0x19d/0x263 [<ffffffff800358d4>] ip_rcv+0x539/0x57c [<ffffffff80020a3b>] netif_receive_skb+0x470/0x49f [<ffffffff8022eeb8>] napi_gro_flush+0x1e/0x39 [<ffffffff8813033c>] :igb:igb_poll+0x6ae/0x9b6 [<ffffffff8000c88c>] net_rx_action+0xac/0x1e0 [<ffffffff800123b4>] __do_softirq+0x89/0x133 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28 [<ffffffff8006cb8e>] do_softirq+0x2c/0x85 [<ffffffff8006ca16>] do_IRQ+0xec/0xf5 [<ffffffff8005d615>] ret_from_intr+0x0/0xa <EOI> [<ffffffff8001fbcb>] copy_process+0xbeb/0x15eb [<ffffffff8001fd43>] copy_process+0xd63/0x15eb [<ffffffff800311bb>] do_fork+0x69/0x1c1 [<ffffffff800988b7>] recalc_sigpending+0xe/0x25 [<ffffffff8005d116>] system_call+0x7e/0x83 [<ffffffff8005d427>] ptregscall_common+0x67/0xac Code: 49 8b 6c 24 08 e8 5a dd f6 f7 48 89 e6 48 c7 c2 82 6b 0f 88 RIP [<ffffffff880f6e06>] :iscsi_tcp:iscsi_sw_tcp_data_ready+0x19/0x5c RSP <ffffffff80446c10> CR2: 0000000000000008 How reproducable: ca. 1/50 reboots Steps to reproduce: Install system on iSCSI root, start reboot test Additional info: Quick analysis from Martin Wilck @ FJ: 103 static void iscsi_sw_tcp_data_ready(struct sock *sk, int flag) 104 { 105 struct iscsi_conn *conn = sk->sk_user_data; 106 struct iscsi_tcp_conn *tcp_conn = conn->dd_data; 107 read_descriptor_t rd_desc; 108 109 read_lock(&sk->sk_callback_lock); The error occurs on line 106 because conn = sk->sk_user_data is NULL. Glancing at the code, this looks like a race condition problem, as sk_user_data is protected by the sk_callback_lock, but that lock is taken in this function only after dereferencing sk_user_data. Martin suggests that what would be needed is to take the lock immediately, and have some fallback code in the case (conn == NULL). Note that similar code appears in other functions in iscsi_tcp.c, that code might need fixing, too. RHEL6 and the latest upstream kernel also seem to have the same code.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 483588 [details] access sk user data under callback lock Hey Gary, The attached patch should fix the problem. I cannot replicate the problem here so I need FJ to test. I made a i686 and x86_64 kernels here http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3171660 with the patch merged. Could you give to your contacts for verification? Thanks
Jusat encountered this problem again on RHEL5.6 after ~1400 reboots.
Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP: [<ffffffff880f7df6>] :iscsi_tcp:iscsi_sw_tcp_data_ready+0x19/0x5c PGD 0 Oops: 0000 [1] SMP last sysfs file: /class/iscsi_transport/tcp/caps CPU 0 Modules linked in: be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i c xgb3 dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg shpchp tpm_tis tpm i7core_edac tpm_bios i2c_i801 pcspkr edac_mc i2c_core dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snap shot dm_zero dm_mirror dm_log dm_mod usb_storage lpfc scsi_transport_fc ahci libata ixgbe 8021q dca iscsi_ibft iscsi_tcp libiscsi_tcp libiscs i2 scsi_transport_iscsi2 scsi_transport_iscsi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 0, comm: swapper Not tainted 2.6.18-243.el5.bz647824.2.1fts #1 RIP: 0010:[<ffffffff880f7df6>] [<ffffffff880f7df6>] :iscsi_tcp:iscsi_sw_tcp_data_ready+0x19/0x5c RSP: 0018:ffffffff804a5c20 EFLAGS: 00010282 RAX: 0000000000000001 RBX: ffff81047ee89980 RCX: 00000000fffc7e11 RDX: 00000000fffc7e11 RSI: 0000000000000000 RDI: ffff81047ee89ac0 RBP: ffff81047ee89980 R08: 00000000c149640f R09: 00000000fffc7e11 R10: 000000000002fa57 R11: 00000000000004b0 R12: 0000000000000000 R13: 0000000000000020 R14: ffff81010fe6e0a2 R15: ffff81047f3fee18 FS: 0000000000000000(0000) GS:ffffffff80426000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000008 CR3: 0000000000201000 CR4: 00000000000006e0 Process swapper (pid: 0, threadinfo ffffffff80456000, task ffffffff80311b60) Stack: ffff81087e82e300 ffff81047fc20ac0 000000004a4a2e30 ffff81047ee89980 ffff81047ee89980 ffff81047ee89980 ffff81047f3fedc0 ffffffff8001bf16 0000000000000000 ffff81047ee89980 ffff81087e82e300 ffff81047f3fedc0 Call Trace: <IRQ> [<ffffffff8001bf16>] tcp_rcv_established+0x5dc/0x8b9 [<ffffffff8003b87f>] tcp_v4_do_rcv+0x2a/0x2fa [<ffffffff80027466>] tcp_v4_rcv+0x9f9/0xa4d [<ffffffff80027466>] tcp_v4_rcv+0x9f9/0xa4d [<ffffffff80034acb>] ip_local_deliver+0x19d/0x263 [<ffffffff80034acb>] ip_local_deliver+0x19d/0x263 [<ffffffff80035c27>] ip_rcv+0x539/0x57c [<ffffffff80020ba2>] netif_receive_skb+0x470/0x49f [<ffffffff8812efed>] :ixgbe:ixgbe_clean_rx_irq+0x537/0x7a8 [<ffffffff8813528f>] :ixgbe:ixgbe_clean_rxtx_many+0x10c/0x223 [<ffffffff8000ca35>] net_rx_action+0xac/0x1b3 [<ffffffff8001253c>] __do_softirq+0x89/0x133 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28 [<ffffffff8006d5f5>] do_softirq+0x2c/0x7d [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c <EOI> [<ffffffff801a13ab>] acpi_processor_idle_simple+0x1af/0x31c [<ffffffff801a1368>] acpi_processor_idle_simple+0x16c/0x31c [<ffffffff801a11fc>] acpi_processor_idle_simple+0x0/0x31c [<ffffffff800494f8>] cpu_idle+0x95/0xb8 [<ffffffff80461807>] start_kernel+0x220/0x225 [<ffffffff8046122f>] _sinittext+0x22f/0x236 Code: 49 8b 6c 24 08 e8 2a cd f6 f7 48 89 e6 48 c7 c2 72 7b 0f 88 RIP [<ffffffff880f7df6>] :iscsi_tcp:iscsi_sw_tcp_data_ready+0x19/0x5c RSP <ffffffff804a5c20> CR2: 0000000000000008 <0>Kernel panic - not syncing: Fatal exception
Weird ... I can't read comment #7 here although I received an email with its contents. However I can confirm that attachment #483588 [details] looks promising.
> Weird ... I can't read comment #7 here although I received an email with its > contents. However I can confirm that attachment #483588 [details] looks promising. Sorry Martin, my bad, I made comment#7 private out of habit because it was directed to me internally and contained an internal link. Packages now uploaded to support case.
Verified - Gary's test kernel fixes the problem.
Patch(es) available in kernel-2.6.18-261.el5 You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Running a reboot test on an iSCSI root host resulted in kernel panic. When the iscsi_tcp module is destroying a connection it grabs the sk_callback_lock and clears the sk_user_data/conn pointer to signal that the callback functions should not execute the operation. However, some functions were not grabbing the lock, causing a NULL pointer kernel panic when iscsi_sw_tcp_conn_restore_callbacks was called and, consequently, one of the callbacks was called. With this update, the underlying source code has been modified to address this issue, and kernel panic no longer occurs.
We saw a very similar panic situation on RHEL6.1 lately (see bug #718786). The RHEL6.1 kernel (and upstream) don't seem to contain anything similar to the patch applied here ... Is it possible that the same fix is needed in 6.1 (and upstream), too?
I just sent the fix upstream a couple weeks ago so I do not think it is in Linus's tree. It is in the scsi maintainer's tree. I also just sent it for 6.2.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-1065.html