Bug 637194
| Summary: | [Qlogic 5.6 bug] qlcnic: fix kernel NULL pointer dereference __qlcnic_shutdown+0xe/0x8a | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | PaulB <pbunyan> | ||||||
| Component: | kernel | Assignee: | Chad Dupuis (Cavium) <cdupuis> | ||||||
| Status: | CLOSED ERRATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> | ||||||
| Severity: | urgent | Docs Contact: | |||||||
| Priority: | high | ||||||||
| Version: | 5.6 | CC: | amit.salecha, andriusb, atodorov, bdonahue, bpicco, cward, dmach, GR-Linux-NIC-Dev, jburke, jwilson, peterm, rajesh.borundia | ||||||
| Target Milestone: | rc | Keywords: | OtherQA | ||||||
| Target Release: | 5.6 | ||||||||
| Hardware: | All | ||||||||
| OS: | Linux | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2011-01-13 21:22:59 UTC | Type: | --- | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
Description
PaulB
2010-09-24 14:57:41 UTC
(In reply to comment #0)
> Description of problem:
> System PANICS during install RHEL5.6-Server-20100921.n_nfs-x86_64
>
> Version-Release number of selected component (if applicable):
> RHEL5.6-Server-20100921.n_nfs-x86_64
>
> Actual results:
> Unable to handle kernel NULL pointer dereference at 0000000000000040 RIP:
> [<ffffffff88293569>] :qlcnic:__qlcnic_shutdown+0xe/0x8a
> PGD 237e58067 PUD 237e57067 PMD 0
> Oops: 0000 [1] SMP
> last sysfs file:
> /devices/pci0000:00/0000:00:1d.0/usb1/1-0:1.0/bAlternateSetting
> CPU 0
> Modules linked in: sha256 aes_generic dm_crypt dm_emc dm_round_robin
> dm_multipath scsi_dh dm_snapshot dm_mirror dm_zero xfs lock_nolock gfs2 ext3
> jbd ext4 crc16 jbd2 msdos dm_raid45 dm_message dm_mem_cache dm_region_hash
> dm_log dm_mod raid456 xor raid10 raid1 raid0 qla2xxx ata_piix libata cciss
> qla4xxx scsi_transport_fc netxen_nic qlcnic ehci_hcd uhci_hcd iscsi_ibft
> iscsi_tcp libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi
> sr_mod sd_mod scsi_mod ide_cd cdrom ipv6 xfrm_nalgo crypto_api squashfs pcspkr
> edd loop nfs nfs_acl fscache lockd sunrpc vfat fat cramfs
> Pid: 1, comm: init Not tainted 2.6.18-222.el5 #1
> RIP: 0010:[<ffffffff88293569>] [<ffffffff88293569>]
> :qlcnic:__qlcnic_shutdown+0xe/0x8a
> RSP: 0018:ffff810237f97e08 EFLAGS: 00010282
> RAX: ffffffff882935e5 RBX: 0000000000000000 RCX: ffffffff8020d600
> RDX: ffff8101379c4800 RSI: 0000000000000246 RDI: ffff8101379c4800
> RBP: 0000000028121969 R08: ffff810237d5e810 R09: 0000000000000004
> R10: ffff810237f97c78 R11: ffffffff882935e5 R12: ffff8101379c4800
> R13: 0000000000000008 R14: 0000000000000004 R15: 0000000000000000
> FS: 00000000158c8850(0063) GS:ffffffff80423000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000040 CR3: 0000000237e54000 CR4: 00000000000006e0
> Process init (pid: 1, threadinfo ffff810237f96000, task ffff8101044ef7a0)
> Stack: ffff8101379c4800 0000000028121969 00000000fee1dead ffffffff882935ee
> ffff810237bb6870 ffffffff801ceb56 0000000000000000 ffffffff8009cc14
> 0000000001234567 ffffffff8009cd9e ffffffffffffffff ffff810237f97ee8
> Call Trace:
> [<ffffffff882935ee>] :qlcnic:qlcnic_shutdown+0x9/0x18
> [<ffffffff801ceb56>] device_shutdown+0x56/0x88
> [<ffffffff8009cc14>] kernel_restart+0x9/0x46
> [<ffffffff8009cd9e>] sys_reboot+0x146/0x1c7
> [<ffffffff8003af15>] hrtimer_try_to_cancel+0x4a/0x53
> [<ffffffff8005a44a>] hrtimer_cancel+0xc/0x16
> [<ffffffff80063ce5>] do_nanosleep+0x47/0x70
> [<ffffffff8005a337>] hrtimer_nanosleep+0x58/0x118
> [<ffffffff800a4530>] hrtimer_wakeup+0x0/0x22
> [<ffffffff8001dde9>] sigprocmask+0xb7/0xdb
> [<ffffffff80054cae>] sys_nanosleep+0x4c/0x62
> [<ffffffff8005d116>] system_call+0x7e/0x83
>
>
> Code: 48 8b 6b 40 48 89 ef e8 0f 10 fa f7 48 89 df e8 98 ff ff ff
> RIP [<ffffffff88293569>] :qlcnic:__qlcnic_shutdown+0xe/0x8a
> RSP <ffff810237f97e08>
> CR2: 0000000000000040
> <0>Kernel panic - not syncing: Fatal exception
>
>
> Expected results:
> System should successfully install.
>
> Additional info:
> see next comment
>
> -pbunyan

A couple of questions:

1. Did this occur during a network install where the QLogic 82XX card was being used as the install device?
2. Where in the install does this occur? The stack trace here seems to indicate that we're cleaning up so I assume this occurs when the install is trying to reboot after the installation completed?

Thanks.

A couple of questions:

1. Did this occur during a network install where the QLogic 82XX card was being used as the install device?
Answer - From inventory all I can tell is the default install interface is using the netxen_nic driver. This looks to be an issue with the qlcnic driver.

2. Where in the install does this occur?
Answer - Looking at the log it happens at the end of the installation while rebooting.

----------------<snip>------------------
sending kill signals...done
disabling swap...
    /dev/mapper/VolGroup00-LogVol01
unmounting filesystems...
    /mnt/runtime done
    disabling /dev/loop0
    /proc/bus/usb done
    /proc done
    /dev/pts done
    /sys done
    /tmp/ramfs done
    /mnt/source done
    /selinux done
    /mnt/sysimage/boot done
    /mnt/sysimage/sys done
    /mnt/sysimage/proc/bus/usb done
    /mnt/sysimage/proc done
    /mnt/sysimage/selinux done
    /mnt/sysimage/dev done
    /mnt/sysimage done
rebooting system
Unable to handle kernel NULL pointer dereference at 0000000000000040 RIP:
.
.
.
----------------</snip>------------------

Created attachment 449830 [details]
null pointer in shutdown
Attaching a patch based on the dump. The private data is unavailable, and dereferencing it causes the NULL pointer exception.
Also, please give us the exact steps to reproduce the problem, i.e. the OS and installation procedure.
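The attached patch itself is not reproduced in this report, so the following is only a user-space sketch of the kind of guard the comment above implies. It assumes the `Code:` bytes in the oops begin at the faulting instruction; if so, `48 8b 6b 40` decodes to `mov 0x40(%rbx),%rbp`, which with RBX = 0 matches the fault address in CR2 (0x40), i.e. a read of a member at offset 0x40 through a NULL private-data pointer. All type and function names below (`fake_adapter`, `fake_pci_dev`, `guarded_shutdown`) are hypothetical stand-ins, not the real qlcnic or PCI definitions.

```c
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; NOT the real qlcnic structures. */
struct fake_netdev {
    int detached;
};

struct fake_adapter {
    char pad[0x40];              /* filler so the next member sits at offset 0x40 */
    struct fake_netdev *netdev;  /* assumed to be the member the faulting load reads */
};

struct fake_pci_dev {
    void *driver_data;           /* models what pci_get_drvdata() would hand back */
};

/*
 * Shutdown path with a NULL check on the private data.
 * Returns 0 when shutdown work was done, -1 when there was nothing to do.
 */
static int guarded_shutdown(struct fake_pci_dev *pdev)
{
    struct fake_adapter *adapter = pdev->driver_data;

    if (adapter == NULL)
        return -1;               /* probe never set private data: skip the dereference */

    adapter->netdev->detached = 1;
    return 0;
}
```

Calling `guarded_shutdown()` on a device whose `driver_data` is NULL returns early instead of reading offset 0x40 from address 0. A guard of this kind only hides the symptom; it does not explain why shutdown is invoked for a device without private data.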
Is there a way to add this patch to a test build of the RHEL 5 ISO? It would seem the only way to verify this patch would be to retest the installation.

Any update on this?

Chad, are you using this bugzilla to post the fix?

(In reply to comment #7)
> Chad, are you using this bugzilla to post the fix?

Yes, bz562723 was originally for posting the patches necessary to add qlcnic to RHEL 5.6 so it would make more sense to track this point fix with this bz.

Created attachment 460834 [details]
qlcnic: Fix missing error codes

We were able to reproduce the issue and believe this fix upstream:

http://kerneltrap.org/mailarchive/linux-netdev/2010/8/27/6283992

fixes this issue. Essentially, there were some errors for which we were not returning the correct error value from probe, which makes the PCI layer falsely assume that probe succeeded, including falsely populating the PCI driver data. We have tested this on a local setup by error injection. After failing probe and returning a positive error value, when we reboot the system we see the shutdown panic. If the error value is negative then shutdown works fine.

(In reply to comment #14)
> Created attachment 460834 [details]
> qlcnic: Fix missing error codes
>
> We were able to reproduce the issue and believe this fix upstream:
>
> http://kerneltrap.org/mailarchive/linux-netdev/2010/8/27/6283992
>
> fixes this issue. Essentially there were some errors that we were not
> returning the correct error value from probe which makes the PCI layer falsely
> assume that probe succeeded including falsely populating the PCI driver data.
> We have tested this on a local setup by error injection. After failing probe
> and returning a positive error value when we reboot the system we see the
> shutdown panic. If the error value is negative then shutdown works fine.

Oh. Thanks, you saved me expending worthless energy. This patch is in V2 of Sucheta's patch for bz#562921. It's not in the R5.6 qlcnic driver.
So it explains why R6.1 didn't have an issue last night. I was about to pursue but you saved me the effort.
thanx, bob

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

in kernel-2.6.18-233.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Reminder! There should be a fix present for this BZ in snapshot 3 -- unless otherwise noted in a previous comment.

Please test and update this BZ with test results as soon as possible.

Hello,
this is still present with snap #4 on hp-ml370g6-01.rhts.eng.bos.redhat.com. Moving back to assigned.

sending kill signals...done
disabling swap...
    /dev/mapper/VolGroup00-LogVol01
unmounting filesystems...
    /mnt/runtime done
    disabling /dev/loop0
    /proc/bus/usb done
    /proc done
    /dev/pts done
    /sys done
    /tmp/ramfs done
    /selinux done
    /mnt/sysimage/boot done
    /mnt/sysimage/sys done
    /mnt/sysimage/proc/bus/usb done
    /mnt/sysimage/proc done
    /mnt/sysimage/selinux done
    /mnt/sysimage/dev done
    /mnt/sysimage done
rebooting system
Unable to handle kernel NULL pointer dereference at 0000000000000040 RIP:
 [<ffffffff88293569>] :qlcnic:__qlcnic_shutdown+0xe/0x8a
PGD 237e58067 PUD 237e5c067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:1d.0/usb1/1-0:1.0/bAlternateSetting
CPU 0
Modules linked in: sha256 aes_generic dm_crypt dm_emc dm_round_robin
dm_multipath scsi_dh dm_snapshot dm_mirror dm_zero xfs lock_nolock gfs2 ext3
jbd ext4 crc16 jbd2 msdos dm_raid45 dm_message dm_mem_cache dm_region_hash
dm_log dm_mod raid456 xor raid10 raid1 raid0 qla2xxx ata_piix libata cciss
qla4xxx scsi_transport_fc netxen_nic qlcnic ehci_hcd uhci_hcd iscsi_ibft
iscsi_tcp libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi
sr_mod sd_mod scsi_mod ide_cd cdrom ipv6 xfrm_nalgo crypto_api squashfs pcspkr
edd loop nfs nfs_acl fscache lockd sunrpc vfat fat cramfs
Pid: 1, comm: init Not tainted 2.6.18-225.el5 #1
RIP: 0010:[<ffffffff88293569>] [<ffffffff88293569>] :qlcnic:__qlcnic_shutdown+0xe/0x8a
RSP: 0018:ffff810237f97e08 EFLAGS: 00010282
RAX: ffffffff882935e5 RBX: 0000000000000000 RCX: ffffffff8020dcfc
RDX: ffff81013793e800 RSI: 0000000000000246 RDI: ffff81013793e800
RBP: 0000000028121969 R08: ffff8101375b7810 R09: 0000000000000004
R10: ffff810237f97c78 R11: ffffffff882935e5 R12: ffff81013793e800
R13: 0000000000000008 R14: 0000000000000004 R15: 0000000000000000
FS: 0000000012271850(0063) GS:ffffffff80424000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000040 CR3: 0000000237e54000 CR4: 00000000000006e0
Process init (pid: 1, threadinfo ffff810237f96000, task ffff8101044ef7a0)
Stack: ffff81013793e800 0000000028121969 00000000fee1dead ffffffff882935ee
 ffff810237baf870 ffffffff801cf253 0000000000000000 ffffffff8009cf42
 0000000001234567 ffffffff8009d0cc ffffffffffffffff ffff810237f97ee8
Call Trace:
 [<ffffffff882935ee>] :qlcnic:qlcnic_shutdown+0x9/0x18
 [<ffffffff801cf253>] device_shutdown+0x56/0x88
 [<ffffffff8009cf42>] kernel_restart+0x9/0x46
 [<ffffffff8009d0cc>] sys_reboot+0x146/0x1c7
 [<ffffffff8003af19>] hrtimer_try_to_cancel+0x4a/0x53
 [<ffffffff8005a453>] hrtimer_cancel+0xc/0x16
 [<ffffffff80063ce5>] do_nanosleep+0x47/0x70
 [<ffffffff8005a340>] hrtimer_nanosleep+0x58/0x118
 [<ffffffff800a484f>] hrtimer_wakeup+0x0/0x22
 [<ffffffff8001dde0>] sigprocmask+0xb7/0xdb
 [<ffffffff80054cf5>] sys_nanosleep+0x4c/0x62
 [<ffffffff8005d116>] system_call+0x7e/0x83

Code: 48 8b 6b 40 48 89 ef e8 0b 17 fa f7 48 89 df e8 98 ff ff ff
RIP [<ffffffff88293569>] :qlcnic:__qlcnic_shutdown+0xe/0x8a
 RSP <ffff810237f97e08>
CR2: 0000000000000040
 <0>Kernel panic - not syncing: Fatal exception

What is the kernel version of snap #4?

It's in the traceback: 2.6.18-225.el5

According to Comment #20, this fix is available in kernel-2.6.18-233.el5. Shouldn't you be trying a newer kernel version?

(In reply to comment #20)
> in kernel-2.6.18-233.el5
> You can download this test kernel (or newer) from
> http://people.redhat.com/jwilson/el5
>
> Detailed testing feedback is always welcomed.

Jarod,
how come this kernel is not pulled into snap #4 and the devel whiteboard says "Snapshot 3"? Can you please make sure that the package will appear in the next snapshot and update the whiteboard accordingly? In the meantime I'll revert the status back to ON_QA.

Sorry, not my department. I build the kernel, tag it in brew, put a note in bugzilla when the patches have been committed, when they're available for download, and add the build to the errata. Getting it into a compose is rel-eng's domain.

This fix will be available in Snapshot 5.

An advisory has been issued which should help the problem described in this bug report.
This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html
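
The root cause described in the "qlcnic: Fix missing error codes" comment can be illustrated with a small user-space model. This is a sketch under stated assumptions, not kernel code: it assumes the bus core treats any non-negative probe return as success (the rule `ret >= 0` below is an assumption drawn from the comment's description), and every name in it (`fake_pci_dev`, `fake_core_probe`, `probe_fails_positive`, `probe_fails_negative`, `shutdown_would_deref_null`) is hypothetical.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a PCI device as the bus core sees it. */
struct fake_pci_dev {
    int bound;       /* nonzero once the core believes a driver is attached */
    void *drvdata;   /* private data; left NULL when probe bails out early */
};

typedef int (*probe_fn)(struct fake_pci_dev *);

/* Assumed core rule: any return value >= 0 counts as a successful probe. */
static int fake_core_probe(struct fake_pci_dev *dev, probe_fn probe)
{
    int ret = probe(dev);

    if (ret >= 0) {
        dev->bound = 1;  /* device is now eligible for the shutdown callback */
        return 0;
    }
    dev->bound = 0;
    return ret;
}

/* Buggy error path: probe fails but reports a positive value. */
static int probe_fails_positive(struct fake_pci_dev *dev)
{
    dev->drvdata = NULL;
    return 1;
}

/* Fixed error path: probe fails and reports a negative errno. */
static int probe_fails_negative(struct fake_pci_dev *dev)
{
    dev->drvdata = NULL;
    return -EIO;
}

/* True when shutdown would run for this device with no private data. */
static int shutdown_would_deref_null(const struct fake_pci_dev *dev)
{
    return dev->bound && dev->drvdata == NULL;
}
```

In this model the positive return leaves the device marked as bound with NULL private data, which is the state in which the shutdown callback panics at reboot; the negative return leaves it unbound, so shutdown is never called for it.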