Bug 109962 - Shutdown hangs on unmounting partitions step
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: kernel
Version: 1
Hardware: i686 Linux
Priority: medium
Severity: high
Assigned To: Dave Jones

Reported: 2003-11-13 09:39 EST by fifo942zipzip
Modified: 2015-01-04 17:03 EST
Last Closed: 2004-05-03 18:26:32 EDT


Attachments:
Serial console trace of umount hang (12.87 KB, text/plain)
2004-01-21 11:44 EST, Steve Dickson
sysrq output (43.93 KB, text/plain)
2004-02-04 04:50 EST, Michael Young
Another sysrq output (4.66 KB, text/plain)
2004-02-11 09:37 EST, Stu Tomlinson
SysRq from a hung 2.4.22-1.2174.nptlsmp (52.94 KB, text/plain)
2004-02-22 19:34 EST, Norman Gaywood
mount/umount script to hang system (448 bytes, text/plain)
2004-02-22 19:39 EST, Norman Gaywood

Description fifo942zipzip 2003-11-13 09:39:51 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5) Gecko/20031007

Description of problem:
System will hang at the unmount-step during a shutdown or reboot.

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. Boot Fedora Core one.
2. Select shutdown or reboot from graphical login screen.

    

Actual Results:  System hangs at "unmounting partitions" and never
shuts down (or at least not within 10 minutes).

Expected Results:  shutdown or reboot.

Additional info:

Using the supplied SMP kernel on a hyper-threading Intel processor.
Parallel ATA drives and CD-ROM drives.  Only kernel option is ide-scsi.
Additional weirdness is intermittent failure to remove the firewall rule
for nntp during shutdown and inability to stop console mouse services.
Comment 1 Bill Nottingham 2003-11-13 11:34:51 EST
If you turn on sysrq, and do 'sysrq-t', or 'sysrq-p', what does it
look like is happening?
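
For reference, a minimal sketch of checking that magic SysRq is switched on before trying sysrq-t or sysrq-p. It assumes the usual /proc/sys/kernel/sysrq control file; the SYSRQ_FILE variable and the sysrq_status function are names made up for this illustration, not part of any tool.

```shell
# Sketch: report whether magic SysRq is enabled, and note how to turn it on.
# SYSRQ_FILE is a hypothetical override so the check can be tried on a copy.
SYSRQ_FILE=${SYSRQ_FILE:-/proc/sys/kernel/sysrq}

sysrq_status() {
    # 0 means disabled; any non-zero value enables at least some functions.
    if [ ! -r "$SYSRQ_FILE" ]; then
        echo "unknown"
    elif [ "$(cat "$SYSRQ_FILE")" = "0" ]; then
        echo "disabled"
    else
        echo "enabled"
    fi
}

sysrq_status

# To enable until the next reboot (as root):
#   echo 1 > /proc/sys/kernel/sysrq
# To keep it enabled across reboots, set this in /etc/sysctl.conf:
#   kernel.sysrq = 1
# On the console, Alt+SysRq+T then dumps all tasks and Alt+SysRq+P dumps
# the registers of the current CPU; the output goes to the kernel log.
```

If the machine is too hung to answer the keyboard, a serial console is usually the only way to capture whatever output does appear.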
Comment 2 fifo942zipzip 2003-11-14 10:25:39 EST
I have activated sysrq, but now I can not reproduce the problem (many
boots, did some file manip and partition manip while running).
Comment 3 fifo942zipzip 2003-11-16 19:05:54 EST
It happened again with sysrq activated, but the thing was so-hung that
sysrq did not do anything.
Comment 4 Bill Nottingham 2003-11-17 14:27:02 EST
If it's that hung, it implies a kernel issue.
Comment 5 MGeiger 2003-11-20 04:54:45 EST
I'm having exactly the same problem on a HyperThreading P4 using the
SMP kernel. Haven't tried extensively, but it doesn't look like it's
happening when I use the non-SMP kernel. 
With me, it generally occurs only after the system has been running
for more than a short while -- if I shutdown immediately after
rebooting the system from the hang, it shuts down cleanly every time.
The issue occurs whether I'm logged in at console or via SSH.
Comment 6 Matthew 2003-11-21 14:30:30 EST
I'm getting the same problem when using the SMP kernel on my system. 
Hangs on unmount on reboot after the system has been up for a while.
Comment 7 Valentin Guggiana 2003-12-07 09:44:55 EST
I'm getting the same on a hyperthreading P4 using the SMP kernel:
My partitions are RAID 1. When the mirrors are active and synced
I always get a hang but if I reboot/poweroff during the resync of
a partition it goes well! It seems that disk activity (at least md
based) prevents the hang.
Comment 8 MGeiger 2003-12-08 05:33:50 EST
Don't want to jump the gun, but since upgrading to the new kernel,
kernel-smp-2.4.22-1.2129.nptl, I haven't run into this problem -- at
least not yet (a dozen cycles or so since upgrade).
I also appended the apm=power-off kernel boot-time argument, but doubt
that made a difference since unmounting filesystems is a couple of
steps ahead of that process in shutdown scripts. I do get a complaint
that halting APM has FAILED during shutdown, but that seems safe to
ignore since the daemon isn't actually started.
Be interested to hear if new kernel fixes the unmounting issue for others.
Comment 9 Valentin Guggiana 2003-12-08 09:08:05 EST
The problem appears with kernel-smp-2.4.22-1.2129.nptl too :(
I did a reboot after the system was running idle for a time
(~ 40 minutes) and then it hangs on umount. sysrq-t showed umount.
Comment 10 MGeiger 2003-12-13 02:03:34 EST
Sorry. Did jump the gun. I have since encountered the same problem
with kernel-smp-2.4.22-1.2129.nptl as well. I'm not sure why it's been
less frequent since the kernel switch -- could be just luck -- but the
system was moved into a more active role with more disk usage around
the same time ... so maybe that plays a role.
Comment 11 Marek Kassur 2003-12-20 13:59:54 EST
I'm having an identical problem on my P4 (HT) with the SMP kernel. I just
discovered that disabling the yum service solves this problem for me. I've
seen yum, started as a service, hang sometimes, and then the system stopped
at the unmount step.
Comment 12 Need Real Name 2003-12-27 05:32:01 EST
I have a dual PIII-Xeon running Fedora Core 1 with
kernel-smp-2.4.22-1.2135.nptl and periodically encounter the same
problem. I am running several servers on this machine: smb(samba),
httpd, mysqld, slapd, and a rmserver (helix). Does anyone know if
disabling the autofs service will help? I just disabled it tonight
after lockup during shutdown so I will report back if I continue to
have the same problem. I've never had the yum service enabled, and I
have still encountered this problem so I'm doubtful that the yum
service is by itself the source of the problem.
Comment 13 Michal Szymanski 2004-01-09 20:52:16 EST
Same here on several dual P4 machines running FC1 with SMP kernels
2.4.22-1.2135.nptl and 2.4.22-1.2140.nptl in a small LAN. System
shutdown freezes (at about a 70% rate) at one of the following three points:
- Stopping automount
- Unmounting file systems
- Sending all processes TERM signal  (even here, although only once)

I guess that the suggestion of brtoone@ucdavis.edu (comment #12) was
right: I changed 'autofs' to 'am-utils' (i.e. automount to amd) and it
helped instantly! I'm now after a dozen or so reboots and no problem
has been encountered.

So I would say this is an autofs-SMP-kernel issue.
Workaround: switch to 'amd' automounter.
Comment 14 MGeiger 2004-01-11 03:19:28 EST
Hmmm. I'm not sure autofs is the whole problem -- I don't run an
automount daemon on the P4 HT system that exhibits this problem with
the  SMP kernel.

Still, there're certainly indications that networked filesystem mounts
aren't playing nice with the SMP kernel. See: 
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=109497
Comment 15 Dave Jones 2004-01-13 22:59:27 EST
What's the common factor here? Some of you don't seem to be using
autofs, so I'm not convinced that's the problem. Do you all have
NFS/SMB mounts?

Are things still broken with the latest errata kernel?
Comment 16 Marek Kassur 2004-01-14 15:16:31 EST
For some time now I've been unable to reproduce the problem.
I don't have NFS/SMB mounts, although I use an ACPI-enabled kernel.
Comment 17 Roderick Johnstone 2004-01-15 09:01:38 EST
We also see this on a Dual Xeon system using nfs with autofs.

Kernel is 2.4.22-1.2149.nptlsmp

Usually hangs having written that it stopped automount ok
Comment 18 Need Real Name 2004-01-17 01:07:50 EST
For some time after turning off autofs, I was unable to reproduce the
problem. Then I reenabled autofs, thinking that it might be necessary
for my USB flash card reader. Then I attempted to shutdown forgetting
to first stop autofs. And then wham-o my system hung on the unmounting
filesystems step of the shutdown. It could have been just dumb luck,
or something related to the card reader, but it does strike me as
suspicious that the first time autofs had been re-enabled, the system
locked up during shutdown. I will report back if it happens to me
when autofs is not running. I was running the 2.4.22-1.2154.nptlsmp
kernel when the lockup happened.
Comment 19 Ronny Buchmann 2004-01-17 16:31:06 EST
I've the same problem, umount hangs (all ide+sata disks). It seems to
only appear in halt. I don't have automounts or nfs mounts.

My filesystems (ext3 and reiserfs) are on top of LVM on software-RAID1
and 5.

My system is a P4 with HT and ICH5 (some disks on Promise 20267).

It appeared AFAIR with every Fedora-kernel version since 2115 (last
hang with 2140, didn't shutdown 2149, since it takes a fair amount of
time to rebuild the RAID arrays)
Comment 20 Ronny Buchmann 2004-01-17 17:56:47 EST
I did the work and wrote the output of sysrq-t/p down

it hangs after:
umount -v -f /var /usr/src/tuxbox ...
/dev/raid5vg/varlv umounted

/var is ext3
/usr/src/tuxbox is reiserfs

sysrq-p
 
Pid/TGid: 41/41, comm: kjournald
EIP: 0060:[<c01211dc>] CPU: 0
EIP is at .text.lock.sched [kernel] 0xcd (2.4.22-1.2140nptlsmp)
 EFLAGS: 00000286 Not tainted
EAX: df070000 EBX: 00000000 ECX: 00000000 EDX: 00000000
ESI: 00000000 EDI: df070000 EBP: df071e3c DS: 0068 ES: 0068 FS: 0000
GS: 0000
CR0: 8005003b CR1: ?        CR2: 09a444f4 CR3: 07d84000 CR4: 000006d0
Call Trace: 015110e __wait_on_buffer [kernel] 0x6e df071e40
e0b82db3 journal_commit_transaction [jbd] 0x2c3 df071e68
c0107b3f __switch_to [kernel] 0x16f df071f1c
c010f13c schedule [kernel] 0x3fc df071f3c
e0b8628a kjournald [jbd] 0x16a df071fb8
e0b86100 commit_timeout [jbd] 0x0 df071fd4
e0b86120 kjournald [jbd] 0x0 df071fe4
c01074bd kernel_thread_helper [kernel] 0x5 df071ff0
 
sysrq-t
 
umount R C03EBF80 9631 9029 (NOTLB)
Call Trace: c010be21 do_IRQ [kernel] 0xd1 cb7b9ed4
c0152382 invalidate_inode_buffers [kernel] 0xd2 cb7b9f10
c01699af invalidate_list [kernel] 0x3f cb7b9f2c
e0d83df0 reiserfs_fs_type [reiserfs] 0x0 cb7b9f48
c0169a6d invalidate_inodes [kernel] 0x4d cb7b9f4c
e0d83da0 reiserfs_sops [reiserfs] 0x0 cb7b9f6c
c0156e32 kill_super [kernel] 0xe2 cb7b9f70
c016caff sys_umount [kernel] 0x3f cb7b9f8c
c0109b27 system_call [kernel] 0x33 cb7b9fc0
Comment 21 Steve Dickson 2004-01-21 11:44:15 EST
Created attachment 97148 [details]
Serial console trace of umount hang

Here is example of the hang with an ext3 fs 
with NFS not running or started. Attached is 
the entire serial console trace.

umount	      R C0370D38  2397	 2033			  (NOTLB)
Call Trace:   [<c010be4b>] do_IRQ [kernel] 0xfb (0xd0855eb0)
[<c010e9a8>] call_do_IRQ [kernel] 0x5 (0xd0855ed4)
[<c0151da2>] invalidate_bdev [kernel] 0xb2 (0xd0855f00)
[<c0151eee>] __invalidate_buffers [kernel] 0x2e (0xd0855f34)
[<f8854134>] ext3_put_super [ext3] 0xf4 (0xd0855f48)
[<f885aca0>] ext3_sops [ext3] 0x0 (0xd0855f68)
[<f885acf0>] ext3_fs_type [ext3] 0x0 (0xd0855f6c)
[<c0156ea6>] kill_super [kernel] 0x156 (0xd0855f70)
[<c016caff>] sys_umount [kernel] 0x3f (0xd0855f8c)
[<c013c07b>] sys_munmap [kernel] 0x4b (0xd0855fa4)
[<c0109b27>] system_call [kernel] 0x33 (0xd0855fc0)
Comment 22 Michal Szymanski 2004-01-26 03:47:17 EST
Contrary to what I wrote in comment #13, all Fedora SMP kernels (up to
2149, both i686 and athlon) hang the system, and not only when
shutting down. I was suffering repeated crashes of a dual-Xeon 3.06
machine running some CPU intensive (plus some network I/O, but not
very much) tasks. Also, using 'autofs' or 'amd' automounters does not
change much.

All those symptoms disappeared immediately when I installed the latest
RedHat 9 SMP kernel (kernel-smp-2.4.20-28.9). I'm only not 100% sure
if such a change would not break something in the system (like e.g.
the lack of "nptl") - any comments would be welcome.

I am a bit disappointed. It seems that Fedora Core 1 linux is just NOT
WORKING on SMP machines. As for a full, not-beta distribution, it does
not look very attractive. And I cannot see much attention to the
problem from the FC team. We even do not know whether this is a
generic problem with the 2.4.22 kernel, "nptl" or some other
modifications or add-ons introduced in FC.

regards, Michal.
Comment 23 Ian Pilcher 2004-01-27 00:21:01 EST
Nice to see I'm not the only one seeing this.  Dual 1GHz Pentium-IIIs
on an Abit VP6.  30GB boot/scratch drive on hda; 120GB drives on hde,
hdg, hdi, hdk (hdg is actually 180GB because it was cheapest). 
hde-hdk have parallel partitioning schemes, with multiple RAID devices
spanning the disks.  Just converted the large RAID devices from JFS to
ReiserFS over the weekend in the hopes the problem was JFS-related. 
autofs is off.
Comment 24 Michal Szymanski 2004-01-28 06:15:43 EST
Bad news. I'm getting repeated system hangs on "stopping automount"
(autofs) on a SINGLE CPU PIV machine (HT disabled) running FC1 updated
to 2.4.22-1.2149.nptl (non-SMP) kernel.

Well, I'm close to (sadly) concluding that FEDORA just sucks.

regards, Michal.
Comment 25 Alexander Brinkman 2004-01-28 13:41:30 EST
Same problem here; Fedora Core 1 on PIV with HT enabled and 2149 SMP
kernel (I noticed this problem also on 2115 SMP).

Disabling autofs did not solve this. I have no network mounts of any
type (no SMB, NFS, AFS, etc.). I _do_ have NTFS mounts (my WinXP
disks) and use reiserfs and ext3 for my Linux partitions.

The problem is not the crash itself; I can powerdown the computer with
the power switch ;) The problem is more that reiserfs (but also ext3
sometimes) tends to screw up the changes in files on recovery. Alsa
for instance saves sound card volumes to a config file. After a crash
I regularly find the contents of a pid or log file in it. :(
(ok, losing a volume setting or log file is not that serious but I
really don't want to lose an important paper or source file)

I will try to see if not having the NTFS mounts has any effect and
report back later.
Comment 26 Bob Jones 2004-01-30 02:33:33 EST
This happened to me also. 'Took a while to figure it out. I'm running 
a dual PIII 450 on an old 440GX motherboard, 2940U2W Adaptec SCSI 
card and FC1-testing, fully updated. Although it doesn't make any 
difference if you are running "testing" or "base" right off of the 
CDs. It only occurs when running the nptlsmp kernels and NFS, not the 
plain uni-processor (nptl) kernels - NFS or not. If you're not 
running smp kernels ignore this. It's not the same problem.

What I found was that, if there is just one system/directory listed 
in the /etc/exports file that does not have its internet address 
listed in the /etc/hosts file, the system will hang. I was lazy and 
copied the files from other machines on the network without paying 
attention to the inconsistencies. The tip-off to the problem shows up 
in the /var/log/messages file: "exportfs: foo has non-inet addr". And 
that's all that shows up in the logs. I'm convinced that this also 
causes the mysterious hangs at undetermined intervals when just 
running without re-booting. It happened to me twice - but I re-boot 
frequently on my "testing" machine. Ultimately, if you let this go on 
after several reboots you'll corrupt the root file system. I 
initially attributed the problem to an old SCSI drive that I was 
using on my "testing" machine - but it still occurred after 
installing a new drive. Interestingly, the last time I got to the 
point of having to do a forced fsck I rebooted and appended 
autofs=off to the command line and booted up cleanly. Moreover, if 
you turn off/stop autofs in "System-Services", the hangs do not 
occur.
Comment 27 Michal Szymanski 2004-01-30 02:58:38 EST
Well, it is not quite as Bob writes in #26. All the machines on which
I got into troubles with SMP FC1 kernels have /etc/exports files with
machine names that are ALL LISTED in /etc/hosts. And still, the system
hung often. Also, when I changed "automount" to "amd", although it
seemed to have helped at first, eventually it also started to crash.

Another comment: just turning OFF automounting system in a local
network using NFS is NOT A SOLUTION.
Comment 28 Zaid D. 2004-01-31 14:14:03 EST
I'm having the same problem on a P4 2.6 (512 cache, 800 MHz FSB) with an Asus
P4P800 motherboard, running the latest FC1 kernel
(2.4.22-1.2149.nptlsmp). Most of the time it just sits at "unmounting
filesystems". I tried disabling autofs and replacing it with am-utils,
and it helped a bit: before doing this, 2 out of 4 shutdowns would
hang on unmounting filesystems; afterwards I tried 6 reboots with
only 1 hang. So it's better than nothing, but it's not really a fix. I'll
try to compile a vanilla kernel and see if I have any more hangs, though
I don't know if it will cause any problems with FC1 (since this new
kernel won't be using NPTL).
BTW, I have no shares of any kind, and I only mount a FAT32 partition
(and I changed the entry for it in fstab to noauto just in case that
FAT32 partition is causing the problem). Right now I have disabled
am-utils and nfslock to see if the problem is just NFS-oriented. I'll
write back if I find out anything.
Comment 29 Barton Fisk 2004-01-31 21:58:34 EST
Bug just cropped up tonight running on a quad Xeon Dell Powervault.
SMB shares and NFS shares, box uses Dell PERC 4 RAID, boot is RAID 1
and there are 2 RAID 5 volumes, > 500 GB each. Initial load from CD
was updated with yum to the latest 2149smp kernel, then on reboot it hung
for about 15 minutes, then finally shut down. No NTFS file systems
mounted, ext3 only. After the system came up, everything appears normal.
Checking now for possible corrupt files. Will post back if anything is
discovered.

Comment 30 Alexander Brinkman 2004-02-02 14:12:05 EST
I have done a dozen or so reboots/halts after my first post (comment
#25) and have had no crashes. On all occasions I manually umounted my
NTFS partitions first. I noticed that fam each time was watching a
directory on one of the NTFS partitions. Before I was able to umount
that partition I had to kill fam first.
Normally this should be taken care of by the shutdown script and it
shouldn't be a problem. But since no one seems to have a clue yet why
these hangs occur I might mention it anyways ;)

I also upgraded to fedora rawhide over the weekend. I will try some
reboots with new (2.6.1) kernel and old (2.4.22) kernel to check if it
is kernel related or is related to certain version/software package
combinations. (so far the 2.6.1 kernel did not let me down; no
crashes! (at least not from the kernel))
Comment 31 Nate Thompson 2004-02-03 15:04:48 EST
I have an interesting new twist to this.  I am running 2149smp stock,
no autofs, etc with LVM on aic7899 scsi 160 Single Proc Hyperthreaded
Dell 2650.  The machine hangs at Unmounting Filesystems.  dd'd the
drive with dd bs=10M if=/dev/sda of=/dev/sdb, then took the duplicated
drive and put it in identical 2650 (different box).  No hangs on that
box after several days and many tries, but always hangs on the
original disk (which is an identical install obviously)  Trying to get
more information but was wondering if someone else could try dd'ing a
drive in a similar way and see what happens
Comment 32 Michael Young 2004-02-04 04:50:42 EST
Created attachment 97461 [details]
sysrq output

I think I am seeing the same issue. Attached is the sysrq output (host was
hung, responds to pings and sysrq only)
Comment 33 Paolo Prandini 2004-02-07 03:52:16 EST
Hi, I use 2.4.22-1.2149smp on an i848P+ICH5 chipset with a P4 (HT enabled).
This happens to me 70% of the time when uptime is 20-45 minutes, using
Mozilla or after heavy disk work (updatedb).
It does not seem to happen with the UP kernel.
At the moment I am trying to compile a vanilla SMP 2.4.24 bz taken
from kernel.org and I'll let you know the outcomes.

- Regards
- Paolo

Comment 34 Zaid D. 2004-02-10 01:33:41 EST
Hi:
After disabling autofs, nfslock, automount (and all NFS-related
daemons) and disabling automounting of my FAT32 partition, all was
working fine for a week until last night: after 16 hours of uptime, I
tried to shut down and it hung on unmounting file systems again. So
I just got a vanilla 2.4.24 kernel from kernel.org and I'm compiling
it now. I'll run it for a few days and post back if I get any news.
Zaid
Comment 35 Stu Tomlinson 2004-02-11 09:37:16 EST
Created attachment 97583 [details]
Another sysrq output

I'm also hitting this bug.

System: Dell PowerEdge 1750 (megaraid), Single Xeon CPU w/HT, LVM ext3
filesystems only, no network filesystems.
SMP kernel hangs, UP kernel does not hang.

Attached is the output from SysRq-P and SysRq-T (slightly corrupted due to bad
serial console setup).
Comment 36 Michal Szymanski 2004-02-12 10:45:44 EST
The latest FC1 kernel update (2166) claims to "fix NPTL SMP hang".
Worth trying to see whether it fixes "our" problem. I'll do it tomorrow and let
you know. 
Michal.
Comment 37 JY XU 2004-02-19 06:32:48 EST
I am glad to see that I am not the only one having this problem.

My P4 (HT) sometimes hangs with the FC1 SMP kernel. I am not running
any automounters, and it happens with all SMP kernels prior to
2.4.22-2166. For me, it happens while the system is running (i.e., not
during its shutdown), especially when I run many CPU-heavy
computational jobs.

I haven't experienced the hangups with 2.4.22-2166, but this is
probably because I haven't tested it with heavy jobs.

I just upgraded to 2.4.22-2174; I will come back and report if
I have another hang.

In addition, the version 2.4.22-2166 of the SMP kernel at ATRPMS
(http://atrpms.physik.fu-berlin.de/) is claimed to fix the SMP hang
problem. I haven't tested that. Perhaps the Red Hat guys could have a look
at the source?

My experience with FC1 is better than with RH9. But this hang problem bites
me. Also, the Intel 82801EB AC97 on-board sound card doesn't work. There
seem to be many posts on the web reporting the problem, but sadly no clear
solution has been proposed (except a recommendation to use ALSA).

Cheers,

JY
Comment 38 Michal Szymanski 2004-02-20 05:17:09 EST
Bad news :(

I've given the 2.4.22-2174 SMP kernel a try on a dual Xeon 3.06 machine
running in a LAN, with 'autofs' ON. At first it seemed to be fine, but
on about the 4th consecutive reboot it hung again on "Stopping automount".
Same on the 6th reboot. I gave up, going back to the RH9 2.4.20-28.9 smp kernel.

I sadly conclude that all those FC1 2.4.22-XXXX NPTL kernels just SUCK.

The "NPTL SMP hang" fix proudly announced by Fedora team at the
release of 2166 kernels must have been something else :(

regards, Michal.
Comment 39 Norman Gaywood 2004-02-22 19:34:43 EST
Created attachment 97934 [details]
SysRq from a hung 2.4.22-1.2174.nptlsmp

Attached is another SysRq output, this time for 2.4.22-1.2174.nptlsmp. Also
included are the boot messages for the 4 processor Pentium III.

I previously posted a similar report for an earlier kernel (2166) on bug
#109497

I get the system to hang with a script, see later attachment, that repeatedly
does a mount/umount. On this run the system hung after 247 mount/umounts.

This is the grub entry used to boot the kernel:

title Fedora Core (2.4.22-1.2174.nptlsmp)
	root (hd0,0)
	kernel /vmlinuz-2.4.22-1.2174.nptlsmp ro root=LABEL=/ console=tty0 console=ttyS0,9600n81 panic=60 nmi_watchdog=1
	initrd /initrd-2.4.22-1.2174.nptlsmp.img
Comment 40 Norman Gaywood 2004-02-22 19:39:55 EST
Created attachment 97935 [details]
mount/umount script to hang system

Attached is a script to trigger the SMP hang. It creates a small filesystem in
/tmp and repeatedly does a loop back mount/umount until you abort (^C) it or
the system hangs.
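
The attachment itself is not reproduced in this thread. As a rough, hypothetical sketch of the approach described (a small filesystem image in /tmp, loop-mounted and unmounted over and over), something like the following could be used. The file names, the 8 MB image size, the RUN, IMG, MNT and COUNT variables and both function names are assumptions for illustration, not taken from the attached script. By default it only prints the commands it would run.

```shell
# Hypothetical reconstruction, NOT the attached script.
# RUN defaults to "echo" (dry run); set RUN to the empty string to execute
# the commands for real, which needs root.
RUN=${RUN-echo}
IMG=${IMG:-/tmp/umount-test.img}   # assumed image file name
MNT=${MNT:-/tmp/umount-test.mnt}   # assumed mount point
COUNT=${COUNT:-0}                  # 0 = loop until ^C or until the system hangs

setup_fs() {
    # Create an 8 MB file, put an ext2 filesystem on it, make the mount point.
    $RUN dd if=/dev/zero of="$IMG" bs=1024 count=8192
    $RUN mke2fs -F -q "$IMG"
    $RUN mkdir -p "$MNT"
}

mount_cycle() {
    # Repeatedly loop-mount and unmount the image, counting the cycles.
    i=0
    while [ "$COUNT" -eq 0 ] || [ "$i" -lt "$COUNT" ]; do
        i=$((i + 1))
        $RUN mount -o loop "$IMG" "$MNT" || return 1
        $RUN umount "$MNT" || return 1
        echo "cycle $i done"
    done
}
```

After sourcing this as root, `RUN=; setup_fs && mount_cycle` would run it for real; only do that on a machine you are prepared to power-cycle. The cycle counter gives a number comparable to the "hung after 247 mount/umounts" figure in comment 39.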
Comment 41 Valentin Guggiana 2004-03-22 08:29:54 EST
Got it!

I played around with self-compiled kernels based on the kernel source
provided by Fedora (kernel-source-2.4.22-1.2174.nptl) using the
following commands:

cd /usr/src/linux-2.4.22-1.2174.nptl
make clean; make mrproper
cp /boot/config-2.4.22-1.2174.nptlsmp .config
make xconfig
make dep; make bzImage; make modules; make modules_install
cp System.map /boot/System.map-2.4.22-1.2174.nptlcustom
cp .config /boot/config-2.4.22-1.2174.nptlcustom
cp arch/i386/boot/bzImage /boot/vmlinuz-2.4.22-1.2174.nptlcustom
cd /boot
mkinitrd -f initrd-2.4.22-1.2174.nptlcustom.img 2.4.22-1.2174.nptlcustom

Of course one has to add the new kernel to grub.conf.

Simply saving the config in xconfig without any changes generates
a crashing kernel, as expected (I used Norman Gaywood's script to
trigger it).

Removing 'low latency scheduling' in 'Processor type and features'
leads to a stable kernel! I ran the script for hours (millions
of mount/umounts).
Comment 42 Norman Gaywood 2004-03-22 17:54:19 EST
Valentin,

You have provided another data point, but it's not a solution to the
problem. Sure by removing 'low latency scheduling' the bug may not get
triggered, but the bug is still there.

Also, the script to trigger the problem with mount/umount is still
only a trigger to the problem. It does not point to the problem
directly. The  script probably does not even work for some people. I
saw on one of the Fedora mailing lists that Dave Jones has only been
able to trigger the problem once.

Obviously the problem is a difficult one and I suspect that we will be
using one of the many work-arounds, including yours, until FC2 comes out.
Comment 43 Michal Szymanski 2004-03-23 04:33:10 EST
Hi,

I can confirm the usability of Valentin's fix to SMP kernels. I never
used Norman's script to hang a system. All my FC1 SMP machines, both
dual Xeon and dual AthlonMP, just used to hang during normal shutdown,
with the "success coefficient" of about 70%. Also, I was suffering
from system hangs when running even moderately NFS-intensive jobs (I
checked both autofs and amd automounters).

Now with Valentin's fix, I have spent a whole night rebooting the
machine every 5 minutes and it survived w/o any problem. I also gave
Norman's script a try - again no problems during 100000 cycles.

BTW, does anybody out there know what the 'low latency scheduling' does?
There is no help available for that item in 'xconfig'.

regards, Michal.
Comment 44 Jason Tibbitts 2004-04-06 16:13:24 EDT
This bug has been killing me on various SMP servers; they always hang
on shutdown and randomly hang every couple of days.  I have a
backtrace from a machine hung at shutdown; but it is substantially
similar to other backtraces posted here.  (It is written on paper; I
will transcribe it here if it will help.)

Since I have many machines to maintain, I hacked the kernel spec file
a bit to turn off CONFIG_LOLAT and rebuilt the kernel packages.  My
packages are available upon request.  You can also easily build your
own; install the kernel SRPM and apply the following patch to the
.spec file:

--- kernel-2.4.spec     2004-02-18 12:51:46.000000000 -0600
+++ kernel-2.4.spec-uh  2004-04-06 13:43:15.000000000 -0500
@@ -899,6 +899,9 @@
 cp -fv $RPM_SOURCE_DIR/kernel-%{kversion}-athlon*.config configs
 cp -fv $RPM_SOURCE_DIR/kernel-%{kversion}-x86_64*.config configs

+# XXX Local change: turn off low latency scheduling
+perl -spi -e 's/^(CONFIG_LOLAT).*/# $1 is not set/' configs/*
+
 # make sure the kernel has the sublevel we know it has...
 perl -p -i -e "s/^SUBLEVEL.*/SUBLEVEL = %{sublevel}/" Makefile

Then edit the %define release line to add some text identifying your
custom packages; I changed "nptl" to "uh_nptl".  Then edit the %define
build* options near the beginning of the file to match what you want
to build and do rpmbuild -ba kernel-2.4.spec --target=i686 (or athlon
if that's what you're running).  Wait a good long while and you should
get some RPMS.  Packages built this way work for me; I can stick them
in my local yum repo and send them out to all of my servers.  They
reboot fine and survive the previously posted tests.

Has anyone experimented with leaving CONFIG_LOLAT on but also turning
on CONFIG_LOLAT_SYSCTL and then controlling it that way?  It should
enable an option somewhere in /proc to turn it off, but I can't figure
out where from reading the patch (linux-2.4.20-akpm-lowlatency.patch
in the kernel SRPM).
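
To check whether a given kernel config still has low latency scheduling switched on, and to see what the substitution in the patch above does to a config file, a small sketch follows. The lolat_state function name and the sample file are made up for illustration, and sed is used here as a stand-in for the perl one-liner.

```shell
# Sketch: report whether a kernel config has CONFIG_LOLAT enabled.
lolat_state() {
    # $1: path to a kernel config, e.g. /boot/config-$(uname -r)
    if grep -q '^CONFIG_LOLAT=y' "$1"; then
        echo "enabled"
    elif grep -q '^# CONFIG_LOLAT is not set' "$1"; then
        echo "disabled"
    else
        echo "absent"
    fi
}

# Demonstrate on a throwaway sample rather than a real config.
sample=$(mktemp)
printf 'CONFIG_SMP=y\nCONFIG_LOLAT=y\n' > "$sample"
lolat_state "$sample"        # prints "enabled"

# sed equivalent of the perl substitution used in the spec file patch
sed 's/^\(CONFIG_LOLAT\).*/# \1 is not set/' "$sample" > "$sample.new"
lolat_state "$sample.new"    # prints "disabled"
rm -f "$sample" "$sample.new"
```

On an installed system, `lolat_state /boot/config-$(uname -r)` reports the state of the running kernel's config, assuming the config file was installed under /boot as in comment 41.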
Comment 45 Dave Jones 2004-04-07 14:52:05 EDT
FWIW, I've turned off the lowlat patch in CVS.
I've got a bunch of other stuff pending merging, but I'll push out an
update kernel in the next week or so.

Thanks for being patient and hunting this down.
Comment 46 Joe Handley 2004-04-08 19:15:01 EDT
I just noticed on my uni-processor kernel that umount hung a session.
 Top reported that umount was using 99% of the CPU.  I went into
another session and repeated the same umount command, which succeeded,
and also freed the hung session.  It appears the problem may not be
SMP specific, just more serious on SMP kernels.
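
For spotting a spinning umount like this from another session without watching top, a small sketch. The spinning_umounts function name is made up for illustration; it just filters ps output for umount processes in the runnable (R) state, the state shown in the sysrq-t traces earlier in this bug.

```shell
# Sketch: list umount processes that are runnable (state R), with their %CPU.
spinning_umounts() {
    # Expects "PID STAT %CPU COMMAND" columns on stdin, as printed by
    # ps -eo pid,stat,pcpu,comm; prints "PID %CPU" for each match.
    awk '$4 == "umount" && $2 ~ /^R/ { print $1, $3 }'
}

ps -eo pid,stat,pcpu,comm | spinning_umounts
```

A umount that stays in this list with high CPU use for more than a few seconds is probably stuck in the same loop.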
Comment 47 John Muir 2004-04-15 11:29:59 EDT
This problem is resolved in a more recent version of the low-latency
patch from Andrew Morton (for 2.4.25 for example). See this web page:
http://www.zip.com.au/~akpm/linux/schedlat.html

Essentially, for this particular issue in invalidate_bdev(), the code
has been modified such that it only attempts to call schedule() 10
times within invalidate_bdev() before reverting to the original code.

I guess this doesn't solve the underlying issue, but you won't have your
hanging umount anymore.
Comment 48 Alan Cox 2004-05-03 18:26:32 EDT
Should be fixed in the 2188 and later kernels. Please re-open if it
recurs.
