Bug 104520
Summary: | SMP Kernel hang on shutdown with Intel SRCZCR Raid Controller | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 3 | Reporter: | Jason Sauve <jasonsauve77> |
Component: | kernel | Assignee: | Doug Ledford <dledford> |
Status: | CLOSED ERRATA | QA Contact: | |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 3.0 | CC: | alietss, andreas.aretz, bbrock, bruce.grove, cgadd, chun.ming.li, coughlan, danielk, dledford, jneedle, keldon, ltroan, petrides, rf, rknepper, tao, t.koenig, wusel+rhbug |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2004-12-03 01:41:42 UTC | Type: | --- |
Bug Blocks: | 106472 | ||
Attachments: |
Description
Jason Sauve
2003-09-16 17:36:03 UTC
In addition to this, when you execute "cat /proc/scsi/gdth/0" it also hangs.
Have you tried this on the kernel available via RHN?
The kernel I downloaded (2.4.21-1.1931.2.423.entsmp) came from up2date (RHN). I don't know whether it matters or not, but the system is running dual Xeon 2.4GHz CPUs with hyperthreading. Just out of curiosity I disabled hyperthreading in the BIOS (motherboard is Intel 7501WV2) and the problem still occurs.
Just installed kernel-smp-2.4.21-2.EL from RHN. Problem is still unresolved. Unable to issue shutdown/halt/reboot successfully.
I found more info. The problem I'm encountering with RHEL 3.0 AS beta1 and beta2 is the same problem that was found with Red Hat 7.2/7.3/8.0, as is well documented on Intel's website (see "Red Hat Linux* 8.0 segmentation fault with an Intel® RAID controller installed" within the document ftp://download.intel.com/support/motherboards/server/srczcr/tested_hwos.pdf).
Excerpt:
-------------------
When using the normal installation of Red Hat Linux* 8.0 with the 2.4.18-14 kernel and an Intel RAID controller installed, the following issue is seen:
4. A shutdown command results in a segmentation fault.
5. It is not possible to use some tools such as storcon.
6. Accessing the proc file system (via cat /proc/scsi/gdth/#, where "#" stands for the controller number) also results in a segmentation fault.
This issue occurs only when using Red Hat kernel version 2.4.18-14 installed with SMP support, and it is not server board or RAID controller specific.
-------------------
Using Taroon's Beta 1 kernel, you get segmentation faults. With the Beta 2 kernel you get a hang with it saying "Starting timer : 0 0". These problems are only reproducible on the SMP version of Taroon's kernel, and by default it uses ext3 (w/journaling) as it draws similarities to RHBA-2002-292. The Intel SE7501WV2 board has dual Intel 1GB NICs. Hope this helps shed some light on the problem.
Ok guys, I've made progress on this bug.
I just downloaded kernel source 2.4.22 from www.kernel.org and compiled it using the Red Hat config file kernel-2.4.21-i686-smp.config. After compiling successfully with SMP, booting the system, and issuing a shutdown or reboot command, all is well. The system shuts down as expected, and performing 'cat /proc/scsi/gdth/0' works as well (output is attached). It would seem that the gdth driver in the 2.4.21 kernel is flawed. Maybe it has something to do with the "gdth register failure path" entry in the Changelog section "Summary of changes from v2.4.22-pre2 to v2.4.22-pre3". It would seem that booting with the 2.4.22 kernel breaks XFree86 and hyperthreading, amongst who knows what else. But I'd bet that AS 3.0 Beta wasn't made to work with that kernel version anyhow.
I would really appreciate it if someone could give an answer as to "if and when" a fix will be made accessible in RPM format from RHN. I don't want an unresolved bug to deter my purchasing a 5 year support license for AS 3.0 when it is released. We're very eager to install the final *working* product on our servers and are anxiously awaiting its release.
I don't have hardware here to reproduce. Can you get me the output of alt-scroll_lock, shift-scroll_lock, and ctrl-scroll_lock after the Starting timer: 1 1 message is displayed?
Reassigning to me since there is a good chance this might be iorl patch related.
Created attachment 94894 [details]
Output of kernel sysrq
*** Bug 106328 has been marked as a duplicate of this bug. ***
How can I obtain taroon-rc to test whether or not this may have already been fixed? I've been dead in the water with no word for a couple of weeks now on this one.
FROM ISSUE TRACKER Event posted 10-20-2003 12:11pm by chunli with duration of 0.00
Tested with RHEL3 RC3 kernel 2.4.21-4 and got the same failure. The OS hangs when trying to access the RAID controller.
Can you boot this machine with the option nmi_watchdog=1 on the kernel command line and then attempt to reboot? When it locks up it will eventually print out an oops report that should give a traceback to see where we are spinning and waiting (it sounds like somewhere we are trying to take a double lock, but I'm not totally convinced of that; this should let me know).
Created attachment 95602 [details]
Kernel boot log with nmi_watchdog=1
Doug,
Booting with nmi_watchdog=1 resulted in a failure, so I've attached the kernel boot log (as you will see, it fails on CPU0 when testing). When I tried to access the SRCZCR controller via /proc/scsi/gdth/0 it simply hung again, but the kernel didn't OOPS. The kernel only OOPS'd on the original taroon-beta2 kernel. All later ones just hung in a deadlock-type state with the message "Starting timer: 0 0". One other thing I've noticed is that when running 'top', after a while it reports CPU timing problems; maybe this is the root cause of the SCSI problems. In the kernel boot log you will notice there are 4 CPUs. There are actually two CPUs, but hyperthreading is enabled. Disabling hyperthreading in the BIOS still does not resolve the problem. Let me know how I can be of further assistance. If you'd like to contact me directly by phone please let me know, as it's not a problem since I know you guys don't have the same hardware in house.
Good news! After giving up on the possibility of nmi_watchdog=1 working (after waiting about 5 minutes after initiating a shutdown), I left the machine alone for a couple of hours. When I came back to it, to my surprise the watchdog information was there! I ran it again to get the results a second time and timed how long it took: approx. 50 minutes for the watchdog trace to show up on the console. Attached is your output!
Created attachment 95607 [details]
kernel nmi_watchdog=1 log
Yeah, I saw a patch get posted internally to fix the watchdog timeout problem (I think if you use nmi_watchdog=2 it might make the timeout happen in 30 seconds or so like it's supposed to). Our next kernel release won't have that problem.
Can you try booting the machine without the serial console enabled and see if it still won't reboot? (The netdump log is, umm, weird... I'm gonna have to disassemble the module to see why it's showing up the way it is.)
Created attachment 95614 [details]
oops from kernel-2.4.21-1.1931.2.399.entsmp
Doing a normal boot and shutdown without the serial console attached produces
the same bug as I originally detected without the aid of a serial console. I
tried it again just to be sure.
I decided to boot with the original taroon-beta2 kernel which OOPSes
immediately on shutdown (since this is the only known kernel to my knowledge
that will OOPS without the aid of nmi_watchdog kernel param, all the later
kernels just hang on shutdown/reboot or accessing /proc/scsi/gdth/0).
Please let me know if this output is of better assistance to you. If you want,
I can provide ctrl+alt+shift scroll lock output from
kernel-2.4.21-1.1931.2.399.entsmp as well.
PS: I am guessing that a prior thread comment was a post from Chun Li at Intel
confirming that he was able to reproduce the error on the same or similar
hardware?
Created attachment 95615 [details]
Proposed fix
I think this will actually solve the problem entirely. If you could apply this
to kernel sources, test it, and let me know the results I would appreciate it.
I'm ecstatic! I applied the patch to the kernel source (kernel-2.4.21-3.EL.smp) and the problem is fixed! I will leave you to set the ticket to resolved and to add any additional comments. At the same time I'm actually quite surprised; this is a bug that seems to keep resurfacing with each release of Red Hat all the way back to 7.3 (Bugzilla #66867, #66867). Now all I would like to know is when it will be made available in RPM format for RHEL AS/ES 3.0; we're ready to make a purchase. Is it possible that a fix can be provided for the installation media? I don't know if this is something that would cause any amount of data loss if I were to install with the non-patched kernel and then apply the updated kernel RPM afterwards. I would appreciate it if someone could email me back directly regarding my questions. Thanks a bunch! Great work.
PS: The "Starting timer: 0 0" message still appears when booting, shutting down, and when issuing a cat /proc/scsi/gdth/0 command. Not sure what this is and whether or not it should be showing up. Doesn't seem to impact anything other than the console display.
FYI: The other time this resurfaced was Bugzilla Bug #72207.
It will be available in RPM format with our first kernel update (it has definitely made the cutoff deadline for that). As far as what to do between now and when that comes out, you could install the system and rebuild the kernel RPM with this patch added and use that until the next kernel is released. For the most part, all the flush routine does is make sure there is no latent write data in the controller's RAM cache before powering down. However, since the cache will get written out eventually even without the flush, you are still pretty safe.
If the machine hangs on shutdown, no more writes will go to the device, so all you have to do is let it sit for a few moments (enough time to make sure the controller isn't writing to the disks any longer), then hit the reset button; the data will have been written out and things should be fine. The starting timer message is just informational and will be printed out any time scsi_get_host_dev() is called (the gdth flush routine calls this function). FWIW, the bugs that have happened have actually been different bugs; they've just all been bugs in scsi_get_host_dev and scsi_free_host_dev, which are both rarely used functions (only a couple of drivers actually use them and the core scsi layer doesn't use them at all), so bugs sometimes accidentally creep in.
*** Bug 109639 has been marked as a duplicate of this bug. ***
*** Bug 109652 has been marked as a duplicate of this bug. ***
This problem is seen on the Sun Microsystems V60/65x's.
I've noticed the same behaviour (hang on shutdown, crash on accessing /proc/scsi/gdth/0) with a GDT 6523RS. When will this patch be included in an errata kernel?
Quoting RHBA-2003:308:
|Fixes
|amd64 has siginificant bug in 32 bit emulation
I'm running this (2.4.21-4.0.1.ELsmp), which, according to "up2date -u", is the latest.
System info:
CPU0: Intel Pentium III (Coppermine) stepping 06
CPU1: Intel Pentium III (Coppermine) stepping 06
Total of 2 processors activated (3991.14 BogoMIPS).
scsi0 : GDT6523RS
Problem is still there, system becomes unusable on "cat /proc/scsi/gdth/0".
We have the same problem. Could you please provide appropriate information for recompiling the AS30 kernel? The AS30 support (which we are paying for) is not able to send a "todo-list" on how an AS30 kernel is to be rebuilt. Please check Service Request Detail, Service Request Number 266421, or briefly contact Steffen Mann.
The 4.0.1.EL kernel is an errata kernel. There is a very significant difference between an errata kernel and an update kernel.
We release errata kernels whenever we have to for security reasons, and we only release update kernels on an occasional basis. The errata kernel in question (4.0.1.EL) does not have the fix for this included (errata kernels get the security updates only, not a bunch of other stuff, so that we can get them through QA quicker). The next update kernel (which is getting ready to go into our QA phase) has the fix applied. It should be out soon. In the meantime I'll try and build some replacement scsi_mod.o modules and attach them to this bug report to take care of the problem. These modules will be built against a 4.0.1.EL kernel, so that's what you'll need to have installed to use them.
*** Bug 111201 has been marked as a duplicate of this bug. ***
*** Bug 111887 has been marked as a duplicate of this bug. ***
Please attach a todo-list for an AS 3.0 recompile of the kernel.
Referring to Comment #33: Doug, what is the "security update" of 4.0.1.EL? The errata states the previous kernel "limited the amount of virtual address space [...] to an unnecessary degree. [...] These updated kernel packages significantly raise this address space limit." The fixing of the do_brk() bug was not given as a reason for releasing that errata kernel, initially ... So, following your argumentation, 4.0.1.EL should either not have been issued then (x86-64 users would just have to wait for their fix, as do GDTH users for theirs) or at least the x86-64 address space issue must not have been included in it. I don't want to start a discussion here (did that already with RH in IT, Issue #29695); I just want to point out that, at least from my point of view, your policy on issuing or not issuing an errata kernel seems to be biased on market share or something, but not on solving production issues in general. From my point of view, that's a wrong approach.
Having to wait for the next quarterly update to happen in order to get this fixed -- or run unsupported kernels until then -- for me does not match RHEL's claim of being an "Enterprise" level OS. That kind of problem, until now, I only expected from volunteer-driven projects. YMMV, -kai (forced to run UP or custom kernels)
Doug, any word on the replacement scsi_mod.o module?
Doug, how long do we have to wait for a kernel update fixing the GDTH problem? This problem should be well known to Red Hat, because it has been detected in some legacy distributions of Red Hat too. I think one can make a mistake, but one has to learn from it, so that the mistake isn't made again :-)) !! As long as this problem is not fixed, we can't sell any server together with RHEL3 and have to inform our customers not to use RHEL3 !!! A patch RPM installing the correct scsi_mod.o module, if it were available soon (!!), would be an acceptable solution until the update kernel is available !! Rgds Andreas
Created attachment 96682 [details]
Compressed cpio archive of all the i686 modules needed to solve the problem
The attached cpio archive contains files that should solve your problems. Here
are the directions for installing the files:
1. Save the file into /tmp
2. As root do:
a. cd /tmp
b. zcat modules.cpio.gz-i686 | cpio -ivd
c. cd /lib/modules
d. for i in 2.4.21-4.0.1*; do cp /tmp/lib/modules/${i}/kernel/drivers/scsi/* ${i}/kernel/drivers/scsi; done
e. cd /boot
f. for i in initrd-2.4.21-4.0.1*; do VERSION=`basename $i .img`; mkinitrd -v -f $i $VERSION; done
That's it. Reboot into your normal 4.0.1 kernel and it should be working
properly now.
Note: I don't have this hardware in my home lab (which is where I am since I'm
already on Christmas vacation) so this is untested. I manually verified that
the symbol versions didn't change, but I haven't actually booted these modules
up. Saving a copy of your original initrd images under a different name and
adding a new line to your /etc/grub.conf file that boots the same kernel but
with the saved initrd images would be wise until these modules have been
verified. Modules for the i686 UP, i686 SMP, i686 hugemem, and i386 BOOT
kernels are included in this package. I didn't bother with Athlon modules
since I haven't seen anyone request a fix for this on an Athlon machine, nor on
any machines other than x86 (such as x86_64, ia64, or s390).
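For anyone scripting the directions above: step f hands mkinitrd the image name with only the .img suffix removed, so the version argument still carries the initrd- prefix (a later comment in this report shows mkinitrd rejecting it as "not a directory"). Below is a minimal sketch of one way around that, assuming the images follow the initrd-<version>.img naming. The function names are made up for illustration and are not part of the attached archive. The loop only prints a backup command and a rebuild command for each image, so nothing is changed until you review the output and run it yourself as root.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the attached archive.

# Turn an initrd image name into the kernel version mkinitrd expects,
# e.g. initrd-2.4.21-4.0.1.ELsmp.img -> 2.4.21-4.0.1.ELsmp
initrd_version() {
    name=$(basename "$1" .img)   # drop any directory and the .img suffix
    echo "${name#initrd-}"       # drop the initrd- prefix
}

# Print a backup command and a rebuild command for each image given.
# Nothing is executed here; this is a dry run by design.
plan_initrd_rebuild() {
    for img in "$@"; do
        echo "cp -p $img $img.orig"
        echo "mkinitrd -v -f $img $(initrd_version "$img")"
    done
}

# Example (run from /boot):
#   plan_initrd_rebuild initrd-2.4.21-4.0.1*.img
```

The saved .orig images are the copies the note above suggests keeping for a fallback grub.conf entry.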
Hello all, and merry Christmas. Hey Doug, I tested your modules and they work great, at least for me (for my hardware details see bug #111201). I can reboot and shut down our servers ok. Doug, this is the second time you save my a... in four years of Red Hat use; in the past it was with Intel 440GX boards. Thanks, and keep playing with SCSI hardware. Ahh, just one doubt: when the new update kernel sees the light, if I do an rpm update, is this going to overwrite those files??? Bye all and Happy New Year
The next kernel already has the same patch in it that I used here. It shouldn't have any problems at all.
We tried to install the cpio archive without any success. That means the box is still hanging on reboot. Hardware used: Dell SC1600 + Intel SRCU32 adapter.
Doug, the kernel version for which you compiled the scsi i686 modules is 2.4.21-4.0.1* (2.4.21-4.0.1ELsmp for example). The kernel version you will get after installation of RHEL 3 AS is 2.4.21-4.* (2.4.21-4.ELsmp for example) !!!! Due to this issue you can't use your precompiled modules for the generic RHEL 3 kernel versions !!! Would you please compile the suitable ones and attach them to this bugzilla report? Thanks and best regards, Andreas
When I tried to run the "fix" commands, I got an error on step "F":
#for i in initrd-2.4.21-4.0.1*; do VERSION=`basename $i .img`; mkinitrd -v -f $i $VERSION; done
/lib/modules/initrd-2.4.21-4.0.1.EL is not a directory
/lib/modules/initrd-2.4.21-4.0.1.ELsmp is not a directory
I manually ran mkinitrd like:
#mkinitrd -v -f initrd-2.4.21-4.0.1.ELsmp.img 2.4.21-4.0.1.ELsmp
and it worked ok. I'll be rebooting the machine in a few hours, so I'll know then if the fix works for me.
Finally got around to a reboot last night, no problems whatsoever. For the record, my hardware is a 7501WV2, 3GHz Xeons, 6 gigs RAM, SRCZCR RAID with 6 120G drives.
To Andreas Aretz: The current released Red Hat kernel version is 2.4.21-4.0.1.EL.
You'll need to update your system to the latest official Red Hat kernel. Once a kernel has been retired by a new kernel release we don't keep making changes to it or for it.
Doug, got it! But kernel update 2.4.21-4.0.1.EL is no longer available. The current release is 2.4.21-4.0.2.EL. Would you please compile the suitable patch for this release? Thanks and regards, Andreas
It must be a very bad joke -- this bug is still not fixed in the SECOND errata kernel of RHEL v3 (RHSA-2003:416, kernel-smp-2.4.21-4.0.2.EL), according to the ChangeLog. Thanks for pissing me off again, Red Hat.
It was clearly stated above that it would be in the next UPDATE kernel, not in the errata kernel. The posted cpio archive does fix the problem.
Created attachment 96856 [details]
Modules for the 2.4.21-4.0.2.EL kernel
This module archive is the same as the last one except compiled against the
2.4.21-4.0.2.EL kernel sources. It should solve the problem on the current
kernel. And, to answer the issue raised, the 4.0.2 kernel is a security errata
not an update kernel. The security errata kernels are fast tracked through the
system in order to get them in the hands of users quicker. In order to keep
the time needed to QA security errata kernels down, we are *not* allowed to
shove a bunch of non-security related patches into those kernels. The next
update kernel, which contains this patch, will be released as soon as it passes
our rather rigorous update QA process. You can also tell the difference
between security updates and regular updates by looking at the kernel version
number. In this case, the initial kernel version was 2.4.21-4.EL. The next
update kernel will be at least 2.4.21-5.EL. Any point releases, such as 4.0.1
and 4.0.2, are built from the 4.EL base source code + just the minimal updates
for the errata. If the kernel source has been updated with a full gamut of
patches, then it will have the major Red Hat release number incremented.
Knowing that information, you can then gather that the first Red Hat kernel to
contain this patch will have a minimum number of 2.4.21-5.EL.
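The numbering rule described above can be checked mechanically. Here is a rough sketch, assuming release strings of the form 2.4.21-<release>.EL<flavor> as quoted in this report. The function names are hypothetical, and the threshold of 5 comes only from the explanation above, not from any Red Hat tool.

```shell
#!/bin/sh
# Hypothetical helpers based only on the numbering rule described above.

# Print the leading number of the Red Hat release part,
# e.g. 2.4.21-4.0.2.ELsmp -> 4, 2.4.21-9.ELsmp -> 9
kernel_release_major() {
    rel=${1#*-}          # strip the upstream version up to the first "-"
    echo "${rel%%.*}"    # keep everything before the first "."
}

# Succeed if the kernel is at least 2.4.21-5.EL, i.e. an update kernel
# that, per the comment above, should carry the gdth shutdown fix.
has_gdth_fix() {
    [ "$(kernel_release_major "$1")" -ge 5 ]
}

# Example:
#   has_gdth_fix "$(uname -r)" && echo "update kernel, fix expected"
```

Point releases such as 4.0.1 and 4.0.2 still report 4 here, which matches the statement that they are built from the 4.EL base plus errata changes only.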
Doug, thank you for your work on this matter; I do appreciate this, don't get me wrong. But I have to correct you in terms of the state of 2.4.21-4.0.1.EL -- that one came with RHBA-2003:308 and only "accidentally" fixed a security issue (do_brk()); the purpose of RHBA-2003:308 was to fix x86_64 issues. Therefore my company as a paying customer, and as it seems others as well, expect at least the next issued RHSA kernel to include any service-disrupting bug fixes -- that's standard for RHEL 2.1 (see RHSA-2003:408) and should still be standard for RHEL 3.x. The policy for RHEL 3.x, that is, to fix bugs only four times a year with Quarterly Updates unless it's security related, has to be considered a major step backwards, and since Red Hat is providing support only for Red Hat-backed kernels, it's actually voiding any reason to go the RHEL route. What if QU1 fixes GDT and breaks QLC? Will I have to wait until QU2 to get a Red Hat supported kernel which supports systems with GDT as well as systems with QLC? That's my concern, given that a fix for this GDT issue has been known for 3 months now but is STILL not available via RHN. Since rebooting SMP systems with any GDT SCSI HBA to an errata (read: one with security fixes) kernel is not possible (someone has to press RESET locally somehow), not releasing the fix via RHN (i.e. via an errata kernel) becomes a pain in the ass for at least some people.
Any chance of a release of the patch, so admins can update their own kernels from kernel-source RPMs when errata fixes come out?
The patch is already in this bug report. Just check the attached file list. As far as errata kernels go, barring some emergency security update in the very near future, the next kernel update will already have the fix in it.
I just finished loading EL 3.0 update 1 using kernel 2.4.21-9.ELsmp. I'm seeing this type of problem on my Dell PE 6400. So with this update it doesn't appear fixed, or at least not all the way. Any help?
To Daren Grant (grant.csc.mil): The patch to solve the particular problem in this bug report is in fact in the 2.4.21-9.EL kernel and I'm positive it solves the problem it attempted to solve. If you are still having problems, then you need to open a different bug report with a description of your hardware and exactly what the problem is, including any error messages from the kernel, etc.
*** Bug 105032 has been marked as a duplicate of this bug. ***
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2004-017.html