Description of problem:
System hang due to deadlock while running multiple instances of fdisk.

Version-Release number of selected component (if applicable):
AS 3.0

How reproducible:
1. Run the following two scripts (test1.sh and test2.sh) simultaneously.

# cat test1.sh
while [ 0 ]
do
fdisk /dev/sdk < ./rp.txt
done

# cat test2.sh
while [ 0 ]
do
fdisk -l /dev/sdk
done

# cat rp.txt
o
w

Steps to Reproduce:
Run the above scripts simultaneously.

Actual results:
System hangs after some time.

Expected results:
System does not hang.

Additional info:
Root cause analysis:
The problem is with the "fdisk -l" command. Its stack trace is as follows. The second call in the stack trace is wrong; it is actually the call to do_open().

0xf6c4befc 0xf882da32 [sd_mod]sd_open+0xb2 (0xf6cbfb00, 0xf71ff500, 0x0,
0xf6c4bf20 0xc01697cc ioctl_by_bdev+0x22c (0xf6cbfb00, 0xf71ff500,0x8000,
0xf6c4bf58 0xc015f7f2 dentry_open+0x1c2
0xf6c4bf74 0xc015f628 filp_open+0x68
0xf6c4bfac 0xc015fa53 sys_open+0x53
0xf6c4bfc4 0xc041606d no_timing+0x7

--------------

do_open() calls lock_kernel() and then calls sd_open() while still holding the kernel lock. Since the device is busy (because of the other fdisk operation), sd_open() enters the following loop:

while (rscsi_disks[target].device->busy) {
    barrier();
    cpu_relax();
}

cpu_relax() does not actually release the CPU; it is just a NOP operation. So the above code is just a busy loop. The important point is that it spins while holding the kernel lock.

-------------

The other fdisk process, which is responsible for setting device->busy to zero, is trying to lock a page. Since the page was already locked by someone else, it called schedule(). In schedule(), that process is trying to reacquire the kernel lock. Since the first fdisk operation (the "fdisk -l" process) already holds the kernel lock and is busy-looping, the system is deadlocked.
Stack trace for the 2nd fdisk operation:

0xe3abbd4c 0xc01257b9 .text.lock.sched+0xb4
0xe3abbd4c 0xc0123611 schedule+0x361 (0xe3aba000)
0xe3abbd94 0xc012490a io_schedule+0x2a (0xc1b4d934, 0x1, 0xe3aba000,
0xe3abbda0 0xc0146319 __lock_page+0x89 (0xc1b4d934)
0xe3abbdd8 0xc014636c lock_page+0x1c (0xe3b2a944, 0x0, 0xc0168c70, 0x0)
0xe3abbde0 0xc0149005 read_cache_page+0x55
0xe3abbe04 0xc0194f7d read_dev_sector+0x4d (0xc36e6e80, 0x0, 0xe3abbe38,
0xe3abbe24 0xc0195a1d handle_ide_mess+0x2d (0xc36e6e80, 0x8, 0x0,
0xe3abbe50 0xc0195c3f msdos_partition+0x6f
0xe3abbf84 0xc0169a8e blkdev_ioctl+0x3e (0xf17ca880, 0xf65cc900, 0x125f,

The solution is to make the wait in sd_open() actually release the CPU, by calling schedule_timeout() where cpu_relax() is called now, so that the kernel lock is released.
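For anyone reviewing the analysis, the difference between the two wait strategies can be illustrated with a small userspace model. This is only a sketch and not the posted patch or kernel code: simulate(), enum wait_policy, struct model, and the tick-based "scheduler" are hypothetical names invented for the illustration. The one assumption carried over from the analysis is that a task which sleeps gives up the kernel lock, while a task that spins keeps it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum wait_policy { WAIT_SPIN, WAIT_SLEEP };
enum { NOBODY = -1, OPENER = 0, REREADER = 1 };

struct model {
    int  lock_owner;   /* who holds the big lock, or NOBODY */
    bool device_busy;  /* stands in for rscsi_disks[target].device->busy */
};

/*
 * Step two tasks for up to max_ticks ticks:
 *   OPENER   models "fdisk -l": it holds the big lock and waits for busy == 0.
 *   REREADER models the other fdisk: it must retake the big lock before it
 *            can finish and clear busy.
 * Returns the tick at which OPENER leaves its wait loop, or -1 if it never
 * does within max_ticks (the hang described in this report).
 */
int simulate(enum wait_policy policy, int max_ticks)
{
    struct model m = { .lock_owner = OPENER, .device_busy = true };

    for (int tick = 0; tick < max_ticks; tick++) {
        /* The re-reader only holds the lock inside its own step. */
        assert(m.lock_owner != REREADER);

        /* OPENER wakes up (if it slept) and retakes the big lock. */
        if (m.lock_owner == NOBODY)
            m.lock_owner = OPENER;

        if (!m.device_busy)
            return tick;              /* wait loop exits, open proceeds */

        if (policy == WAIT_SLEEP)
            m.lock_owner = NOBODY;    /* sleeping gives up the big lock */
        /* WAIT_SPIN: keep spinning with the lock still held */

        /* REREADER makes progress only if it can take the big lock. */
        if (m.lock_owner == NOBODY) {
            m.lock_owner = REREADER;
            m.device_busy = false;    /* finishes and clears busy */
            m.lock_owner = NOBODY;
        }
    }
    return -1;
}
```

In this model, simulate(WAIT_SPIN, 1000) returns -1 because the re-reader never gets the lock, and simulate(WAIT_SLEEP, 1000) returns 1 because the opener sleeps for one tick, the re-reader clears the busy flag, and the opener then proceeds. The model ignores timing and real scheduling. It only shows why spinning with the lock held cannot make progress.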
Patch posted for review on 16-Jun-2005.
A fix for this problem has just been committed to the RHEL3 U6 patch pool this evening (in kernel version 2.4.21-32.9.EL).
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-663.html