From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5) Gecko/20031007
Description of problem:
The system hangs at the unmount step during a shutdown or reboot.
Version-Release number of selected component (if applicable):
How reproducible:
Sometimes
Steps to Reproduce:
1. Boot Fedora Core 1.
2. Select shutdown or reboot from the graphical login screen.
Actual Results:
System hangs at "unmounting partitions" and never shuts down (or at least not within 10 minutes).
Expected Results:
Shutdown or reboot.
Additional info:
Using the supplied SMP kernel on a hyper-threading Intel processor. Parallel ATA drives and CD-ROM drives. The only kernel option is ide-scsi. Additional weirdness: intermittent failure to remove the firewall rule for nntp during shutdown, and inability to stop the console mouse services.
If you turn on sysrq, and do 'sysrq-t', or 'sysrq-p', what does it look like is happening?
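For anyone else following along, here is a minimal sketch of checking and enabling magic SysRq, assuming a kernel built with SysRq support and a root shell. The helper name `sysrq_status` is hypothetical and only for illustration; `/proc/sys/kernel/sysrq` is the usual control file.

```shell
#!/bin/sh
# sysrq_status [FILE]: report whether magic SysRq is enabled.
# FILE defaults to the usual control file; it is a parameter only so the
# helper can be tried against an ordinary file. (Hypothetical helper name.)
sysrq_status() {
    ctl="${1:-/proc/sys/kernel/sysrq}"
    if [ ! -r "$ctl" ]; then
        echo "unavailable"
        return 1
    fi
    if [ "$(cat "$ctl")" = "0" ]; then
        echo "disabled"
    else
        echo "enabled"
    fi
}

# To enable until the next reboot (as root):
#   echo 1 > /proc/sys/kernel/sysrq
# To make it persistent, set "kernel.sysrq = 1" in /etc/sysctl.conf.
# Then, on the hung console, Alt+SysRq+T dumps the task list and
# Alt+SysRq+P dumps the registers of the current CPU.
```

Running `sysrq_status` with no argument before a shutdown confirms the keys will be available; a serial console helps to capture the output if the screen scrolls.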
I have activated sysrq, but now I cannot reproduce the problem (many boots, plus some file and partition manipulation while running).
It happened again with sysrq activated, but the system was so hung that sysrq did not do anything.
If it's that hung, it implies a kernel issue.
I'm having exactly the same problem on a HyperThreading P4 using the SMP kernel. Haven't tried extensively, but it doesn't look like it's happening when I use the non-SMP kernel. With me, it generally occurs only after the system has been running for more than a short while -- if I shutdown immediately after rebooting the system from the hang, it shuts down cleanly every time. The issue occurs whether I'm logged in at console or via SSH.
I'm getting the same problem when using the SMP kernel on my system. Hangs on unmount on reboot after the system has been up for a while.
I'm getting the same on a hyperthreading P4 using the SMP kernel. My partitions are RAID 1. When the mirrors are active and synced I always get a hang, but if I reboot/poweroff during the resync of a partition it goes well! It seems that disk activity (at least md based) prevents the hang.
Don't want to jump the gun, but since upgrading to the new kernel, kernel-smp-2.4.22-1.2129.nptl, I haven't run into this problem -- at least not yet (a dozen cycles or so since upgrade). I also appended the apm=power-off kernel boot-time argument, but doubt that made a difference since unmounting filesystems is a couple of steps ahead of that process in shutdown scripts. I do get a complaint that halting APM has FAILED during shutdown, but that seems safe to ignore since the daemon isn't actually started. Be interested to hear if new kernel fixes the unmounting issue for others.
The problem appears with kernel-smp-2.4.22-1.2129.nptl too :( I did a reboot after the system was running idle for a time (~ 40 minutes) and then it hangs on umount. sysrq-t showed umount.
Sorry. Did jump the gun. I have since encountered the same problem with kernel-smp-2.4.22-1.2129.nptl as well. I'm not sure why it's been less frequent since the kernel switch -- could be just luck -- but the system was moved into a more active role with more disk usage around the same time ... so maybe that plays a role.
I'm having an identical problem on my P4 (HT) with the SMP kernel. I just discovered that disabling the yum service solves this problem for me. I've seen that yum, started as a service, sometimes hangs, and then the system stops at the unmount step.
I have a dual PIII-Xeon running Fedora Core 1 with kernel-smp-2.4.22-1.2135.nptl and periodically encounter the same problem. I am running several servers on this machine: smb(samba), httpd, mysqld, slapd, and a rmserver (helix). Does anyone know if disabling the autofs service will help? I just disabled it tonight after lockup during shutdown so I will report back if I continue to have the same problem. I've never had the yum service enabled, and I have still encountered this problem so I'm doubtful that the yum service is by itself the source of the problem.
Same here on several dual P4 machines running FC1 with SMP kernels 2.4.22-1.2135.nptl and 2.4.22-1.2140.nptl in a small LAN. System shutdown freezes (at about a 70% rate) at one of the following three points:
- Stopping automount
- Unmounting file systems
- Sending all processes the TERM signal (even here, although only once)
I guess that the suggestion of brtoone (comment #12) was right: I changed 'autofs' to 'am-utils' (i.e. automount to amd) and it helped instantly! I have now done a dozen or so reboots and no problem has been encountered. So I would say this is an autofs/SMP-kernel issue. Workaround: switch to the 'amd' automounter.
Hmmm. I'm not sure autofs is the whole problem -- I don't run an automount daemon on the P4 HT system that exhibits this problem with the SMP kernel. Still, there're certainly indications that networked filesystem mounts aren't playing nice with the SMP kernel. See: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=109497
What's the common factor here? Some of you don't seem to be using autofs, so I'm not convinced that's the problem. Do you all have NFS/SMB mounts? Are things still broken with the latest errata kernel?
For some time now I have been unable to reproduce the problem. I don't have NFS/SMB mounts, although I use an ACPI-enabled kernel.
We also see this on a dual Xeon system using NFS with autofs. Kernel is 2.4.22-1.2149.nptlsmp. It usually hangs after having written that it stopped automount OK.
For some time after turning off autofs, I was unable to reproduce the problem. Then I re-enabled autofs, thinking that it might be necessary for my USB flash card reader. Then I attempted to shut down, forgetting to stop autofs first. And then wham-o, my system hung on the unmounting filesystems step of the shutdown. It could have been just dumb luck, or something related to the card reader, but it does strike me as suspicious that the first time autofs had been re-enabled, the system locked up during shutdown. I will report back if it happens to me when autofs is not running. I was running the 2.4.22-1.2154.nptlsmp kernel when the lockup happened.
I have the same problem: umount hangs (all IDE and SATA disks). It seems to appear only on halt. I don't have automounts or NFS mounts. My filesystems (ext3 and reiserfs) are on top of LVM on software RAID 1 and 5. My system is a P4 with HT and ICH5 (some disks on a Promise 20267). AFAIR it has appeared with every Fedora kernel version since 2115 (last hang with 2140; I didn't shut down 2149, since it takes a fair amount of time to rebuild the RAID arrays).
I did the work and wrote down the output of sysrq-t/p. It hangs after:
umount -v -f /var /usr/src/tuxbox
...
/dev/raid5vg/varlv umounted
/var is ext3; /usr/src/tuxbox is reiserfs.
sysrq-p
Pid/TGid: 41/41, comm: kjournald
EIP: 0060:[<c01211dc>] CPU: 0
EIP is at .text.lock.sched [kernel] 0xcd (2.4.22-1.2140nptlsmp)
EFLAGS: 00000286 Not tainted
EAX: df070000 EBX: 00000000 ECX: 00000000 EDX: 00000000
ESI: 00000000 EDI: df070000 EBP: df071e3c DS: 0068 ES: 0068 FS: 0000 GS: 0000
CR0: 8005003b CR1: ? CR2: 09a444f4 CR3: 07d84000 CR4: 000006d0
Call Trace:
015110e __wait_on_buffer [kernel] 0x6e df071e40
e0b82db3 journal_commit_transaction [jbd] 0x2c3 df071e68
c0107b3f __switch_to [kernel] 0x16f df071f1c
c010f13c schedule [kernel] 0x3fc df071f3c
e0b8628a kjournald [jbd] 0x16a df071fb8
e0b86100 commit_timeout [jbd] 0x0 df071fd4
e0b86120 kjournald [jbd] 0x0 df071fe4
c01074bd kernel_thread_helper [kernel] 0x5 df071ff0
sysrq-t
umount R C03EBF80 9631 9029 (NOTLB)
Call Trace:
c010be21 do_IRQ [kernel] 0xd1 cb7b9ed4
c0152382 invalidate_inode_buffers [kernel] 0xd2 cb7b9f10
c01699af invalidate_list [kernel] 0x3f cb7b9f2c
e0d83df0 reiserfs_fs_type [reiserfs] 0x0 cb7b9f48
c0169a6d invalidate_inodes [kernel] 0x4d cb7b9f4c
e0d83da0 reiserfs_sops [reiserfs] 0x0 cb7b9f6c
c0156e32 kill_super [kernel] 0xe2 cb7b9f70
c016caff sys_umount [kernel] 0x3f cb7b9f8c
c0109b27 system_call [kernel] 0x33 cb7b9fc0
Created attachment 97148
Serial console trace of umount hang
Here is an example of the hang with an ext3 fs, with NFS not running or started. Attached is the entire serial console trace.
umount R C0370D38 2397 2033 (NOTLB)
Call Trace:
[<c010be4b>] do_IRQ [kernel] 0xfb (0xd0855eb0)
[<c010e9a8>] call_do_IRQ [kernel] 0x5 (0xd0855ed4)
[<c0151da2>] invalidate_bdev [kernel] 0xb2 (0xd0855f00)
[<c0151eee>] __invalidate_buffers [kernel] 0x2e (0xd0855f34)
[<f8854134>] ext3_put_super [ext3] 0xf4 (0xd0855f48)
[<f885aca0>] ext3_sops [ext3] 0x0 (0xd0855f68)
[<f885acf0>] ext3_fs_type [ext3] 0x0 (0xd0855f6c)
[<c0156ea6>] kill_super [kernel] 0x156 (0xd0855f70)
[<c016caff>] sys_umount [kernel] 0x3f (0xd0855f8c)
[<c013c07b>] sys_munmap [kernel] 0x4b (0xd0855fa4)
[<c0109b27>] system_call [kernel] 0x33 (0xd0855fc0
Contrary to what I wrote in comment #13, all Fedora SMP kernels (up to 2149, both i686 and athlon) hang the system, and not only when shutting down. I was suffering repeated crashes of a dual-Xeon 3.06 machine running some CPU-intensive tasks (plus some network I/O, but not very much). Also, using the 'autofs' or 'amd' automounter does not change much. All those symptoms disappeared immediately when I installed the latest Red Hat 9 SMP kernel (kernel-smp-2.4.20-28.9). I'm just not 100% sure whether such a change could break something in the system (e.g. through the lack of "nptl"); any comments would be welcome. I am a bit disappointed. It seems that Fedora Core 1 Linux is just NOT WORKING on SMP machines. For a full, non-beta distribution, it does not look very attractive. And I cannot see much attention to the problem from the FC team. We do not even know whether this is a generic problem with the 2.4.22 kernel, with "nptl", or with some other modifications or add-ons introduced in FC. regards, Michal.
Nice to see I'm not the only one seeing this. Dual 1GHz Pentium-IIIs on an Abit VP6. 30GB boot/scratch drive on hda; 120GB drives on hde, hdg, hdi, hdk (hdg is actually 180GB because it was cheapest). hde-hdk have parallel partitioning schemes, with multiple RAID devices spanning the disks. Just converted the large RAID devices from JFS to ReiserFS over the weekend in the hopes the problem was JFS-related. autofs is off.
Bad news. I'm getting repeated system hangs on "stopping automount" (autofs) on a SINGLE CPU PIV machine (HT disabled) running FC1 updated to the 2.4.22-1.2149.nptl (non-SMP) kernel. Well, I'm close to (sadly) concluding that FEDORA just sucks. regards, Michal.
Same problem here; Fedora Core 1 on PIV with HT enabled and 2149 SMP kernel (I noticed this problem also on 2115 SMP). Disabling autofs did not solve this. I have no network mounts of any type (no SMB, NFS, AFS, etc.). I _do_ have NTFS mounts (my WinXP disks) and use reiserfs and ext3 for my Linux partitions. The problem is not the crash itself; I can powerdown the computer with the power switch ;) The problem is more that reiserfs (but also ext3 sometimes) tends to screw up the changes in files on recovery. Alsa for instance saves sound card volumes to a config file. After a crash I regularly find the contents of a pid or log file in it. :( (ok, losing a volume setting or log file is not that serious but I really don't want to lose an important paper or source file) I will try to see if not having the NTFS mounts has any effect and report back later.
This happened to me also. It took a while to figure it out. I'm running a dual PIII 450 on an old 440GX motherboard, a 2940U2W Adaptec SCSI card and FC1-testing, fully updated, although it doesn't make any difference whether you are running "testing" or "base" right off of the CDs. It only occurs when running the nptlsmp kernels and NFS, not the plain uni-processor (nptl) kernels, NFS or not. If you're not running SMP kernels, ignore this; it's not the same problem. What I found was that if there is just one system/directory listed in the /etc/exports file that does not have its internet address listed in the /etc/hosts file, the system will hang. I was lazy and copied the files from other machines on the network without paying attention to the inconsistencies. The tip-off to the problem shows up in the /var/log/messages file: "exportfs: foo has non-inet addr". And that's all that shows up in the logs. I'm convinced that this also causes the mysterious hangs at undetermined intervals when just running without rebooting. It happened to me twice, but I reboot frequently on my "testing" machine. Ultimately, if you let this go on over several reboots you'll corrupt the root file system. I initially attributed the problem to an old SCSI drive that I was using on my "testing" machine, but it still occurred after installing a new drive. Interestingly, the last time I got to the point of having to do a forced fsck, I rebooted, appended autofs=off to the command line and booted up cleanly. Moreover, if you turn off/stop autofs in "System-Services", the hangs do not occur.
Well, it is not quite as Bob writes in #26. All the machines on which I got into trouble with SMP FC1 kernels have /etc/exports files with machine names that are ALL LISTED in /etc/hosts. And still, the system hung often. Also, when I changed "automount" to "amd", although it seemed to have helped at first, it eventually started to crash as well. Another comment: just turning OFF the automounting system in a local network using NFS is NOT A SOLUTION.
I'm having the same problem on a P4 2.6 (512 KB cache, 800 MHz FSB) on an Asus P4P800 motherboard, running the latest FC1 kernel (2.4.22-1.2149.nptlsmp). Most of the time it just sits at "unmounting filesystems". I tried disabling autofs and replacing it with am-utils and it helped a bit: before doing this, 2 out of 4 shutdowns would hang on unmounting filesystems; afterwards I tried 6 reboots with only 1 hang. So it's better than nothing, but it's not really a fix. I'll try to compile a vanilla kernel and see if I have any more hangs, though I don't know if it will cause any problems with FC1 (since this new kernel won't be using NPTL). BTW, I have no shares of any kind, and I only mount a fat32 partition (and I changed the entry for it in fstab to noauto just in case that fat32 partition is causing the problem). Right now I have disabled am-utils and nfslock to see if the problem is just NFS-related. I'll write back if I find out anything.
Bug just cropped up tonight on a quad Xeon Dell PowerVault. SMB shares and NFS shares; the box uses Dell PERC 4 RAID. Boot is RAID 1 and there are 2 RAID 5 volumes of > 500 Gigs each. The initial load from CD was updated with yum to the latest 2149smp kernel; then, on reboot, it hung for about 15 minutes and finally shut down. No NTFS file systems mounted, ext3 only. After the system came up, everything appears normal. Checking now for possible corrupt files. Will post back if anything is discovered.
I have done a dozen or so reboots/halts after my first post (comment #25) and have had no crashes. On all occasions I manually umounted my NTFS partitions first. I noticed that each time fam was watching a directory on one of the NTFS partitions. Before I was able to umount that partition I had to kill fam first. Normally this should be taken care of by the shutdown script and it shouldn't be a problem. But since no one seems to have a clue yet why these hangs occur, I might mention it anyway ;) I also upgraded to Fedora rawhide over the weekend. I will try some reboots with the new (2.6.1) kernel and the old (2.4.22) kernel to check whether it is kernel related or related to certain version/software package combinations. (So far the 2.6.1 kernel did not let me down; no crashes! At least not from the kernel.)
I have an interesting new twist to this. I am running stock 2149smp, no autofs etc., with LVM on aic7899 SCSI 160, on a single-processor hyperthreaded Dell 2650. The machine hangs at Unmounting Filesystems. I dd'd the drive with dd bs=10M if=/dev/sda of=/dev/sdb, then took the duplicated drive and put it in an identical 2650 (different box). No hangs on that box after several days and many tries, but it always hangs on the original disk (which is obviously an identical install). I'm trying to get more information, but was wondering if someone else could try dd'ing a drive in a similar way and see what happens.
Created attachment 97461
sysrq output
I think I am seeing the same issue. Attached is the sysrq output (the host was hung and responded to pings and sysrq only).
Hi, I use 2.4.22-1.2149smp on an i848P+ICH5 chipset with a P4 (HT enabled). This happens to me 70% of the time after 20-45 minutes of uptime using Mozilla, or after heavy disk work (updatedb). It does not seem to happen with the UP kernel. At the moment I am trying to compile a vanilla SMP 2.4.24 bz taken from kernel.org and I'll let you know the outcome. - Regards - Paolo
Hi: after disabling autofs, nfslock, automount (and all NFS-related daemons) and disabling automounting of my fat32 partition, all was working fine for a week. Then last night, after 16 hours of uptime, I tried to shut down and it hung on unmounting file systems again. So I just got a vanilla 2.4.24 kernel from kernel.org and I'm compiling it now. I'll run it for a few days and post back if I get any news. Zaid
Created attachment 97583
Another sysrq output
I'm also hitting this bug. System: Dell PowerEdge 1750 (megaraid), single Xeon CPU w/HT, LVM ext3 filesystems only, no network filesystems. The SMP kernel hangs; the UP kernel does not. Attached is the output from SysRq-P and SysRq-T (slightly corrupted due to a bad serial console setup).
The latest FC1 kernel update (2166) claims to "fix NPTL SMP hang". It is worth trying to see whether it fixes "our" problem. I'll do it tomorrow and let you know. Michal.
I am glad to see that I am not the only one having this problem. My P4 (HT) sometimes hangs with the FC1 SMP kernel. I am not running any automounters, and it happens with all SMP kernels prior to 2.4.22-2166. For me, it happens while the system is running (i.e., not during shutdown), especially when I run many CPU-heavy computational jobs. I haven't experienced the hangs with 2.4.22-2166, but this is probably because I haven't tested it with heavy jobs. I just upgraded to 2.4.22-2174 and will come back and report if I have another hang. In addition, version 2.4.22-2166 of the SMP kernel at ATRPMS (http://atrpms.physik.fu-berlin.de/) is claimed to fix the SMP hang problem. I haven't tested that. Perhaps the Red Hat guys could have a look at the source? My experience with FC1 is better than with RH9, but this hang problem bites me. Also, the Intel 82801EB AC97 on-board sound card doesn't work. There seem to be many posts on the web reporting that problem, but sadly no clear solution has been proposed (except the recommendation to use ALSA). Cheers, JY
Bad news :( I've given the 2.4.22-2174 SMP kernel a try on a dual Xeon 3.06 machine running in a LAN, with 'autofs' ON. At first it seemed to be fine, but at about the 4th consecutive reboot it hung again on "Stopping automount". Same on the 6th reboot. I gave up and went back to the RH9 2.4.20-28.9 SMP kernel. I sadly conclude that all those FC1 2.4.22-XXXX NPTL kernels just SUCK. The "NPTL SMP hang" fix proudly announced by the Fedora team at the release of the 2166 kernels must have been something else :( regards, Michal.
Created attachment 97934
SysRq from a hung 2.4.22-1.2174.nptlsmp
Attached is another SysRq output, this time for 2.4.22-1.2174.nptlsmp. Also included are the boot messages for the 4-processor Pentium III. I previously posted a similar report for an earlier kernel (2166) on bug #109497.
I get the system to hang with a script (see the later attachment) that repeatedly does a mount/umount. On this run the system hung after 247 mount/umounts.
This is the grub entry used to boot the kernel:
title Fedora Core (2.4.22-1.2174.nptlsmp)
        root (hd0,0)
        kernel /vmlinuz-2.4.22-1.2174.nptlsmp ro root=LABEL=/ console=tty0 console=ttyS0,9600n81 panic=60 nmi_watchdog=1
        initrd /initrd-2.4.22-1.2174.nptlsmp.img
Created attachment 97935
mount/umount script to hang system
Attached is a script to trigger the SMP hang. It creates a small filesystem in /tmp and repeatedly does a loopback mount/umount until you abort it (^C) or the system hangs.
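For readers without access to the attachment, the idea can be sketched roughly as follows. This is NOT the attached script, only a hypothetical reconstruction from the description above: the image path, mount point, image size, and the names `run` and `stress_umount` are all assumptions. It defaults to a dry run that only prints the commands; a real run needs root and may hang an affected machine.

```shell
#!/bin/sh
# Hypothetical reconstruction of a loopback mount/umount stress loop.
# All paths and names below are illustrative assumptions.
IMG="${IMG:-/tmp/umount-test.img}"   # small filesystem image
MNT="${MNT:-/tmp/umount-test.mnt}"   # mount point
DRY_RUN="${DRY_RUN:-1}"              # 1 = only print commands (safe default)

# Execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# stress_umount [COUNT]: create a 4 MB ext2 image, then mount and unmount
# it COUNT times (COUNT=0, the default, loops until interrupted).
stress_umount() {
    count="${1:-0}"
    run dd if=/dev/zero of="$IMG" bs=1k count=4096 || return 1
    run mke2fs -q -F "$IMG" || return 1
    run mkdir -p "$MNT" || return 1
    n=0
    while [ "$count" -eq 0 ] || [ "$n" -lt "$count" ]; do
        run mount -o loop "$IMG" "$MNT" || return 1
        run umount "$MNT" || return 1
        n=$((n + 1))
        echo "cycle $n"
    done
}
```

Calling `stress_umount 3` prints the commands for three cycles. Setting `DRY_RUN=0` before calling it as root performs the real mounts; the last "cycle N" line on the console then shows how far it got before a hang.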
Got it! I played around with self-compiled kernels based on the kernel source provided by Fedora (kernel-source-2.4.22-1.2174.nptl) using the following commands:
cd /usr/src/linux-2.4.22-1.2174.nptl
make clean; make mrproper
cp /boot/config-2.4.22-1.2174.nptlsmp .config
make xconfig
make dep; make bzImage; make modules; make modules_install
cp System.map /boot/System.map-2.4.22-1.2174.nptlcustom
cp .config /boot/config-2.4.22-1.2174.nptlcustom
cp arch/i386/boot/bzImage /boot/vmlinuz-2.4.22-1.2174.nptlcustom
cd /boot
mkinitrd -f initrd-2.4.22-1.2174.nptlcustom.img 2.4.22-1.2174.nptlcustom
Of course one has to add the new kernel to grub.conf. Simply saving the config in xconfig without any changes generates a crashing kernel, as expected (I used Norman Gaywood's script to trigger it). Removing 'low latency scheduling' in 'Processor type and features' leads to a stable kernel! I ran the script for hours (millions of mount/umounts).
Valentin, You have provided another data point, but it's not a solution to the problem. Sure by removing 'low latency scheduling' the bug may not get triggered, but the bug is still there. Also, the script to trigger the problem with mount/umount is still only a trigger to the problem. It does not point to the problem directly. The script probably does not even work for some people. I saw on one of the Fedora mailing lists that Dave Jones has only been able to trigger the problem once. Obviously the problem is a difficult one and I suspect that we will be using one of the many work-arounds, including yours, until FC2 comes out.
Hi, I can confirm the usability of Valentin's fix for SMP kernels. I had never used Norman's script to hang a system. All my FC1 SMP machines, both dual Xeon and dual AthlonMP, just used to hang during normal shutdown, with a "success coefficient" of about 70%. Also, I was suffering from system hangs when running even moderately NFS-intensive jobs (I checked both the autofs and amd automounters). Now, with Valentin's fix, I have spent a whole night rebooting the machine every 5 minutes and it survived without any problem. I also gave Norman's script a try: again no problems during 100000 cycles. BTW, does anybody out there know what 'low latency scheduling' does? There is no help available for that item in 'xconfig'. regards, Michal.
This bug has been killing me on various SMP servers; they always hang on shutdown and randomly hang every couple of days. I have a backtrace from a machine hung at shutdown, but it is substantially similar to other backtraces posted here. (It is written on paper; I will transcribe it here if it will help.) Since I have many machines to maintain, I hacked the kernel spec file a bit to turn off CONFIG_LOLAT and rebuilt the kernel packages. My packages are available upon request. You can also easily build your own; install the kernel SRPM and apply the following patch to the .spec file:
--- kernel-2.4.spec 2004-02-18 12:51:46.000000000 -0600
+++ kernel-2.4.spec-uh 2004-04-06 13:43:15.000000000 -0500
@@ -899,6 +899,9 @@
 cp -fv $RPM_SOURCE_DIR/kernel-%{kversion}-athlon*.config configs
 cp -fv $RPM_SOURCE_DIR/kernel-%{kversion}-x86_64*.config configs
 
+# XXX Local change: turn off low latency scheduling
+perl -spi -e 's/^(CONFIG_LOLAT).*/# $1 is not set/' configs/*
+
 # make sure the kernel has the sublevel we know it has...
 perl -p -i -e "s/^SUBLEVEL.*/SUBLEVEL = %{sublevel}/" Makefile
Then edit the %define release line to add some text identifying your custom packages; I changed "nptl" to "uh_nptl". Then edit the %define build* options near the beginning of the file to match what you want to build and run
rpmbuild -ba kernel-2.4.spec --target=i686
(or athlon if that's what you're running). Wait a good long while and you should get some RPMs. Packages built this way work for me; I can stick them in my local yum repo and send them out to all of my servers. They reboot fine and survive the previously posted tests. Has anyone experimented with leaving CONFIG_LOLAT on but also turning on CONFIG_LOLAT_SYSCTL and then controlling it that way? It should enable an option somewhere in /proc to turn it off, but I can't figure out where from reading the patch (linux-2.4.20-akpm-lowlatency.patch in the kernel SRPM).
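To double-check that a rebuilt kernel really has low latency scheduling turned off, something like the following sketch may help. The function names `lolat_enabled` and `disable_lolat` are hypothetical; the only things taken from this thread are the CONFIG_LOLAT option name and the /boot/config-<version> files. Note that this sed expression is slightly narrower than the perl one-liner in the patch: it rewrites only the exact CONFIG_LOLAT= line and leaves CONFIG_LOLAT_SYSCTL alone.

```shell
#!/bin/sh
# lolat_enabled FILE: succeed if the kernel config FILE has CONFIG_LOLAT=y.
# (Hypothetical helper name.)
lolat_enabled() {
    grep -q '^CONFIG_LOLAT=y' "$1"
}

# disable_lolat FILE: print FILE to stdout with the CONFIG_LOLAT line
# rewritten in the usual "is not set" form. The input file is not modified.
disable_lolat() {
    sed 's/^\(CONFIG_LOLAT\)=.*/# \1 is not set/' "$1"
}

# Example usage on a running system (the path pattern is an assumption
# based on the /boot/config-* files mentioned earlier in this thread):
#   if lolat_enabled "/boot/config-$(uname -r)"; then
#       echo "low latency scheduling is still enabled"
#   fi
```

Checking the installed config this way is cheaper than waiting for the next hang to find out that the wrong package got installed.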
FWIW, I've turned off the lowlat patch in CVS. I've got a bunch of other stuff pending merging, but I'll push out an update kernel in the next week or so. Thanks for being patient and hunting this down.
I just noticed on my uni-processor kernel that umount hung a session. Top reported that umount was using 99% of the CPU. I went into another session and repeated the same umount command, which succeeded, and also freed the hung session. It appears the problem may not be SMP specific, just more serious on SMP kernels.
This problem is resolved in a more recent version of the low-latency patch from Andrew Morton (for 2.4.25, for example). See this web page: http://www.zip.com.au/~akpm/linux/schedlat.html Essentially, for this particular issue in invalidate_bdev(), the code has been modified so that it only attempts to call schedule() 10 times within invalidate_bdev() before reverting to the original code. I guess this doesn't solve the underlying issue, but you won't have your hanging umount anymore.
Should be fixed in the 2188 and later kernels. Please re-open if it recurs.