Bug 1497152

Summary: systool causes panic on 2.6.32-696.6.3.el6.x86_64 using be2iscsi
Product: Red Hat Enterprise Linux 6 Reporter: nikhil kshirsagar <nkshirsa>
Component: kernel Assignee: Maurizio Lombardi <mlombard>
kernel sub component: Storage Drivers QA Contact: Martin Hoyer <mhoyer>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: dhoward, djeffery, jmagrini, mhoyer, mlombard, mthacker, toneata
Version: 6.7 Keywords: ZStream
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: kernel-2.6.32-722.el6 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1507512 (view as bug list) Environment:
Last Closed: 2018-06-19 04:59:19 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1507512    

Description nikhil kshirsagar 2017-09-29 10:29:41 UTC
Description of problem:

systool, and other diagnostic tools that use systool (such as emcgrab), cause a kernel panic.

Version-Release number of selected component (if applicable):
2.6.32-696.6.3.el6.x86_64 kernel



Additional info:

The behaviour the customer reports is the same as described in https://access.redhat.com/solutions/2832161
The only difference is that their machines use Emulex FCoE adapters, which appear to be deprecated in RHEL 7. Here are the details of our vmcore analysis; their lspci output shows the deprecated boards mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1312460

----------------------------------------------------------------------------------------------------------------

   KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-696.10.3.el6.x86_64/vmlinux
    DUMPFILE: /cores/retrace/tasks/235823940/crash/vmcore  [PARTIAL DUMP]
        CPUS: 48
        DATE: Thu Sep 28 03:45:40 2017
      UPTIME: 00:06:53
LOAD AVERAGE: 4.83, 3.85, 1.97
       TASKS: 2837
    NODENAME: ultc1r01.osasunet
     RELEASE: 2.6.32-696.10.3.el6.x86_64
     VERSION: #1 SMP Thu Sep 21 12:12:50 EDT 2017
     MACHINE: x86_64  (2600 Mhz)
      MEMORY: 256 GB
       PANIC: "general protection fault: 0000 [#1] SMP "
         PID: 42300
     COMMAND: "systool"
        TASK: ffff88232d0c8040  [THREAD_INFO: ffff88232d0b8000]
         CPU: 39
       STATE: TASK_RUNNING (PANIC)

 bt
PID: 42300  TASK: ffff88232d0c8040  CPU: 39  COMMAND: "systool"
 #0 [ffff88232d0bba90] machine_kexec at ffffffff8103fdbb
 #1 [ffff88232d0bbaf0] crash_kexec at ffffffff810d1f02
 #2 [ffff88232d0bbbc0] oops_end at ffffffff8154f060
 #3 [ffff88232d0bbbf0] die at ffffffff8101101b
 #4 [ffff88232d0bbc20] do_general_protection at ffffffff8154eb42
 #5 [ffff88232d0bbc50] general_protection at ffffffff8154e2b5
    [exception RIP: strnlen+9]
    RIP: ffffffff812a3379  RSP: ffff88232d0bbd08  RFLAGS: 00010286
    RAX: ffffffff817c6126  RBX: ffff88333d812000  RCX: 0000000000000002
    RDX: 7665642073696854  RSI: ffffffffffffffff  RDI: 7665642073696854
    RBP: ffff88232d0bbd08   R8: 0000000000000073   R9: ffff8808332d43b0
    R10: 0000000000000000  R11: 0000000000000246  R12: ffff88333d811013
    R13: 7665642073696854  R14: 00000000ffffffff  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffff88232d0bbd10] string at ffffffff812a47a0
 #7 [ffff88232d0bbd50] vsnprintf at ffffffff812a6258
 #8 [ffff88232d0bbdf0] snprintf at ffffffff812a66c4
 #9 [ffff88232d0bbe50] beiscsi_adap_family_disp at ffffffffa05b1a79 [be2iscsi]
#10 [ffff88232d0bbe60] dev_attr_show at ffffffff8137f907
#11 [ffff88232d0bbe90] sysfs_read_file at ffffffff81217551
#12 [ffff88232d0bbef0] vfs_read at ffffffff8119a4f5
#13 [ffff88232d0bbf30] sys_read at ffffffff8119a841
#14 [ffff88232d0bbf80] system_call_fastpath at ffffffff8100b0d2
    RIP: 000000321a4db770  RSP: 00007ffeac4b4940  RFLAGS: 00010216
    RAX: 0000000000000000  RBX: ffffffff8100b0d2  RCX: 0000000000020000
    RDX: 0000000000001000  RSI: 0000000002200330  RDI: 0000000000000004
    RBP: 0000000002200330   R8: 0000000000000008   R9: 0000000000260000
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000004
    R13: 0000000000001000  R14: 00000000022001d0  R15: 0000000000001000
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

The HBA driver version is shown below:

cat sos_commands/kernel/modinfo_tpm_tpm_tis_libata_efivars_tcp_cubic_kernel_printk_kgdb_spurious_pstore_dynamic_debug_pcie_aspm_pci_hotplug_pciehp_acpiphp_intel_idle_acpi_pci_slot_processor_thermal_acpi_memhotplug_battery_keyboard_vt_8250_kgdboc_kgdbts_scsi_mod_pcmcia_core_pcmci | grep -i lpfc
filename:       /lib/modules/2.6.32-696.6.3.el6.x86_64/kernel/drivers/scsi/lpfc/lpfc.ko

The following call trace is also logged:

Modules linked in: oracleacfs(P)(U) oracleadvm(P)(U) oracleoks(P)(U) mptctl mptbase autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand powernow_k8 freq_table mperf 8021q garp stp llc bonding dm_service_time dm_round_robin dm_multipath microcode e1000e ptp pps_core hpilo hpwdt be2iscsi iscsi_boot_sysfs libiscsi scsi_transport_iscsi serio_raw fam15h_power k10temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 power_meter acpi_ipmi ipmi_si ipmi_msghandler sg be2net shpchp ext4 jbd2 mbcache sd_mod hpsa lpfc scsi_transport_fc scsi_tgt crc_t10dif ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 42300, comm: systool Tainted: P           -- ------------    2.6.32-696.10.3.el6.x86_64 #1 HP ProLiant BL685c G7
RIP: 0010:[<ffffffff812a3379>]  [<ffffffff812a3379>] strnlen+0x9/0x40
RSP: 0018:ffff88232d0bbd08  EFLAGS: 00010286
RAX: ffffffff817c6126 RBX: ffff88333d812000 RCX: 0000000000000002
RDX: 7665642073696854 RSI: ffffffffffffffff RDI: 7665642073696854
RBP: ffff88232d0bbd08 R08: 0000000000000073 R09: ffff8808332d43b0
R10: 0000000000000000 R11: 0000000000000246 R12: ffff88333d811013
R13: 7665642073696854 R14: 00000000ffffffff R15: 0000000000000000
FS:  00007f3123212700(0000) GS:ffff88305c6c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000002201188 CR3: 000000333bf69000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process systool (pid: 42300, threadinfo ffff88232d0b8000, task ffff88232d0c8040)
Stack:
 ffff88232d0bbd48 ffffffff812a47a0 ffffffff81217040 ffff88333d811013
<d> ffffffffa05b8ab1 ffffffffa05b8aaf ffff88232d0bbdf8 ffff88333d812000
<d> ffff88232d0bbde8 ffffffff812a6258 0000000000000004 0000000affffffff
Call Trace:
 [<ffffffff812a47a0>] string+0x40/0x100
 [<ffffffff81217040>] ? sysfs_open_file+0x0/0x290
 [<ffffffff812a6258>] vsnprintf+0x218/0x5e0
 [<ffffffff812a66c4>] snprintf+0x34/0x40
 [<ffffffffa05b1a79>] beiscsi_adap_family_disp+0x49/0xc0 [be2iscsi]
 [<ffffffff8137f907>] dev_attr_show+0x27/0x50
 [<ffffffff8113bc2e>] ? __get_free_pages+0xe/0x50
 [<ffffffff81217551>] sysfs_read_file+0x111/0x200
 [<ffffffff8119a4f5>] vfs_read+0xb5/0x1a0
 [<ffffffff8119b2a6>] ? fget_light_pos+0x16/0x50
 [<ffffffff8119a841>] sys_read+0x51/0xb0
 [<ffffffff810ee47e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Code: 66 90 48 83 c2 01 80 3a 00 75 f7 48 89 d0 48 29 f8 c9 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 89 e5 74 2e <80> 3f 00 74 29 48 83 ee 01 48 89 f8 eb 12 66 0f 1f 84 00 00 00 
RIP  [<ffffffff812a3379>] strnlen+0x9/0x40
 RSP <ffff88232d0bbd08>

The following 3 modules taint the kernel:

NAME        TAINTS
oracleoks   P(U)
oracleadvm  P(U)
oracleacfs  P(U)

I found the following related details:

https://access.redhat.com/solutions/2537611

https://bugzilla.redhat.com/show_bug.cgi?id=1312460

Comment 3 David Jeffery 2017-09-29 15:22:14 UTC
The crash is caused by the be2iscsi driver using a mismatched extern declaration in be_mgmt.c for the beiscsi_obsolete_adapter_msg string.  The string is defined in be_main.c as "char const beiscsi_obsolete_adapter_msg[]", but be_mgmt.c declares it as "extern char const *beiscsi_obsolete_adapter_msg".

While an array can be used where a pointer is expected, declaring the same variable as an array in one file and as a pointer in another breaks how gcc handles it.  When the code in be_mgmt.c uses beiscsi_obsolete_adapter_msg as a pointer, it loads the first bytes of the string itself and passes them as the pointer value.  The kernel then crashes when strnlen dereferences those string bytes as an address.


The extern declaration in be_mgmt.c should be the same as how the variable is declared in be_main.c.  A simple patch like this should fix the issue:


--- linux-2.6.32-696.15.1.el6.x86_64/drivers/scsi/be2iscsi/be_mgmt.c 2017-09-26 06:52:08.000000000 -0400
+++ linux-2.6.32-696.15.1.el6.x86_64/drivers/scsi/be2iscsi/be_mgmt.c 2017-09-29 11:15:43.256589278 -0400
@@ -24,7 +24,7 @@
 #include "be_iscsi.h"
 #include "be_main.h"
 
-extern char const *beiscsi_obsolete_adapter_msg;
+extern char const beiscsi_obsolete_adapter_msg[];
 
 /* UE Status Low CSR */
 static const char * const desc_ue_status_low[] = {

Comment 7 nikhil kshirsagar 2017-10-04 03:36:10 UTC
This bug was introduced in kernel-2.6.32-642.el6 (RHEL 6.8 GA)

The exact commit was f628daf643edeadf2957921f77992df0f5aa61c3


commit f628daf643edeadf2957921f77992df0f5aa61c3
Author: Maurizio Lombardi <mlombard>
Date:   Wed Mar 16 10:01:49 2016 -0400

    [scsi] be2iscsi: Add warning message for unsupported adapter
    
    Message-id: <56E92F0D.5060802>
    Patchwork-id: 138554
    O-Subject: Re: [RHEL6.8 e-stor PATCH V2 09/10] be2iscsi : Add warning message for unsupported adapter
    Bugzilla: 1253016
    RH-Acked-by: Rob Evers <revers>
    RH-Acked-by: Tomas Henzl <thenzl>
    
    RHEL-only
    
    Add a warning message to indicate obsolete
    BE2 Adapter Family devices
    
    Partially base on this patch:
    http://www.spinics.net/lists/linux-scsi/msg87182.html
    
    Signed-off-by: Maurizio Lombardi <mlombard>
    Signed-off-by: Aristeu Rozanski <aris>

diff --git a/drivers/scsi/be2iscsi/be_main.c b/drivers/scsi/be2iscsi/be_main.c
index 9045129..d40b553 100644
--- a/drivers/scsi/be2iscsi/be_main.c
+++ b/drivers/scsi/be2iscsi/be_main.c
@@ -208,6 +208,14 @@ static char const *cqe_desc[] = {
        "CXN_KILLED_IMM_DATA_RCVD"
 };
 
+char const beiscsi_obsolete_adapter_msg[] = {
+       "This device is not recommended for new deployments. It continues to be\n"
+       "supported in this RHEL release, but it is likely to be removed in the\n"
+       "next major release. Driver updates and fixes for this device will be\n"
+       "limited to critical issues.  Please contact your device's hardware\n"
+       "vendor for additional information.\n"
+};
+
 static int beiscsi_slave_configure(struct scsi_device *sdev)
 {
        blk_queue_max_segment_size(sdev->request_queue, 65536);
@@ -5714,6 +5722,7 @@ static int __devinit beiscsi_dev_probe(struct pci_dev *pcidev,
        case OC_DEVICE_ID2:
                phba->generation = BE_GEN2;
                phba->iotask_fn = beiscsi_iotask;
+               dev_warn(&pcidev->dev, beiscsi_obsolete_adapter_msg);
                break;
        case BE_DEVICE_ID2:
        case OC_DEVICE_ID3:
diff --git a/drivers/scsi/be2iscsi/be_mgmt.c b/drivers/scsi/be2iscsi/be_mgmt.c
index 2d128f8..dbb679a 100644
--- a/drivers/scsi/be2iscsi/be_mgmt.c
+++ b/drivers/scsi/be2iscsi/be_mgmt.c
@@ -24,6 +24,8 @@
 #include "be_iscsi.h"
 #include "be_main.h"
 
+extern char const *beiscsi_obsolete_adapter_msg; <--------
+
 /* UE Status Low CSR */
 static const char * const desc_ue_status_low[] = {
        "CEV",
@@ -1536,7 +1538,8 @@ beiscsi_adap_family_disp(struct device *dev, struct device_attribute *attr,
        case BE_DEVICE_ID1:
        case OC_DEVICE_ID1:
        case OC_DEVICE_ID2:
-               return snprintf(buf, PAGE_SIZE, "BE2 Adapter Family\n");
+               return snprintf(buf, PAGE_SIZE, "%s\n%s", "BE2 Adapter Family",
+                               beiscsi_obsolete_adapter_msg);
                break;
        case BE_DEVICE_ID2:
        case OC_DEVICE_ID3:

Can this be fixed in main, and also provided as a hotfix for the customer (who is using kernel-2.6.32-696.10.3.el6)?

We already provided a test kernel with the fix, and it works fine.

Comment 13 Phillip Lougher 2017-10-13 20:30:09 UTC
Patch(es) committed on kernel repository and kernel is undergoing testing

Comment 16 Phillip Lougher 2017-10-18 19:38:50 UTC
Patch(es) available on kernel-2.6.32-722.el6

Comment 21 Martin Hoyer 2018-04-25 14:12:33 UTC
Tested on a system with an OCe10102-IM adapter using kernel-2.6.32-749.el6.x86_64

Comment 23 errata-xmlrpc 2018-06-19 04:59:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1854