Bug 119451

Summary: System can hang while running multiple instances of fdisk
Product: Red Hat Enterprise Linux 3 Reporter: sureshbabu <sureshb>
Component: kernel Assignee: Larry Woodman <lwoodman>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.0 CC: peterm, petrides, riel, sheryl.sage, tao
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHSA-2005-663 Doc Type: Bug Fix
Last Closed: 2005-09-28 14:22:04 UTC Type: ---
Bug Blocks: 156320    

Description sureshbabu 2004-03-30 17:02:08 UTC
Description of problem:

The system hangs due to a deadlock while running multiple instances
of fdisk.

Version-Release number of selected component (if applicable):
AS 3.0

How reproducible:

 1. Run the following two scripts (test1.sh and test2.sh)
    simultaneously.

# cat test1.sh
 while [ 0 ]
 do
    fdisk /dev/sdk < ./rp.txt
 done

# cat test2.sh
 while [ 0 ]
 do
    fdisk -l /dev/sdk
 done

# cat rp.txt
 o
 w

Steps to Reproduce:

 Run the above scripts simultaneously.
  
Actual results:

The system hangs after some time.

Expected results:

 No system hang.

Additional info:

 Root cause analysis:

    The problem is with the "fdisk -l" command. Its stack trace
    is shown below.
    The second entry in the trace (ioctl_by_bdev+0x22c) is wrong;
    it is actually the call to "do_open()".

    0xf6c4befc 0xf882da32 [sd_mod]sd_open+0xb2 (0xf6cbfb00,
    0xf71ff500, 0x0,
    0xf6c4bf20 0xc01697cc ioctl_by_bdev+0x22c (0xf6cbfb00, 
    0xf71ff500,0x8000,
    0xf6c4bf58 0xc015f7f2 dentry_open+0x1c2
    0xf6c4bf74 0xc015f628 filp_open+0x68
    0xf6c4bfac 0xc015fa53 sys_open+0x53
    0xf6c4bfc4 0xc041606d no_timing+0x7

    --------------
    do_open() calls "lock_kernel()" and then calls sd_open(),
    i.e. with the kernel lock held.

    Since the device is busy (because of the other fdisk operation),
    it enters the following loop in sd_open().

           while (rscsi_disks[target].device->busy) {
                   barrier();
                   cpu_relax();
           }

    "cpu_relax()" does not actually release the CPU; it is
    effectively a NOP. So the above code is just a busy loop.
    The important point is that it spins while holding the kernel
    lock.
    -------------
    The other fdisk process, which is responsible for setting
    device->busy to zero, is trying to lock a page. Since the page
    was already locked by someone else, it called schedule(). In
    schedule(), that process tries to reacquire the kernel lock.
    Since the first fdisk operation ("fdisk -l") already holds the
    kernel lock and is busy-waiting, the system is deadlocked.

    Stack trace for the second fdisk operation:

    0xe3abbd4c 0xc01257b9 .text.lock.sched+0xb4
    0xe3abbd4c 0xc0123611 schedule+0x361 (0xe3aba000)
    0xe3abbd94 0xc012490a io_schedule+0x2a (0xc1b4d934, 0x1, 
    0xe3aba000,
    0xe3abbda0 0xc0146319 __lock_page+0x89 (0xc1b4d934)
    0xe3abbdd8 0xc014636c lock_page+0x1c (0xe3b2a944, 0x0,
    0xc0168c70, 0x0)
    0xe3abbde0 0xc0149005 read_cache_page+0x55
    0xe3abbe04 0xc0194f7d read_dev_sector+0x4d (0xc36e6e80, 0x0,
    0xe3abbe38,
    0xe3abbe24 0xc0195a1d handle_ide_mess+0x2d (0xc36e6e80, 0x8, 0x0,
    0xe3abbe50 0xc0195c3f msdos_partition+0x6f
    0xe3abbf84 0xc0169a8e blkdev_ioctl+0x3e (0xf17ca880, 0xf65cc900,
    0x125f,

    The solution is to have cpu_relax() actually release the CPU
    by calling schedule_timeout(), so that the kernel lock is
    released.
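
    The deadlock and the proposed fix can be illustrated with a
    small user-space model. This is only a sketch, not kernel code
    and not the patch that was eventually committed: the function
    simulate(), the task names OPENER and WRITER, and the MAX_TICKS
    limit are all hypothetical. It assumes that a task which sleeps
    gives up the kernel lock and takes it again when it resumes,
    and it treats "still waiting after MAX_TICKS steps" as a hang.

```c
#include <stdbool.h>

#define MAX_TICKS 1000   /* hypothetical limit: still waiting after this = hang */
#define NO_OWNER  (-1)

enum { OPENER = 0, WRITER = 1 };   /* hypothetical ids for the two fdisk tasks */

/*
 * Step both tasks in turn.
 * Returns the tick at which the opener gets past its wait loop,
 * or -1 if it is still waiting after MAX_TICKS (the modelled hang).
 */
int simulate(bool sleep_while_waiting)
{
    int lock_owner = OPENER;    /* opener entered the wait with the lock held */
    bool device_busy = true;    /* the other fdisk has marked the device busy */

    for (int tick = 0; tick < MAX_TICKS; tick++) {
        /* Opener's turn. */
        if (lock_owner == NO_OWNER)
            lock_owner = OPENER;          /* resumed: lock taken again */
        if (lock_owner == OPENER) {
            if (!device_busy)
                return tick;              /* wait loop exits */
            if (sleep_while_waiting)
                lock_owner = NO_OWNER;    /* sleeping gives up the lock */
            /* else: keep spinning with the lock still held */
        }

        /* Writer's turn: it needs the lock before it can finish. */
        if (lock_owner == NO_OWNER) {
            lock_owner = WRITER;
            device_busy = false;          /* work done, device no longer busy */
            lock_owner = NO_OWNER;
        }
    }
    return -1;
}
```

    In this model simulate(false), the busy loop, returns -1
    because the writer never gets the lock, while simulate(true),
    the sleeping variant, returns 1 because the writer runs as soon
    as the opener sleeps.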

Comment 1 Ernie Petrides 2005-06-17 00:11:09 UTC
Patch posted for review on 16-Jun-2005.

Comment 2 Ernie Petrides 2005-06-17 22:59:02 UTC
A fix for this problem has just been committed to the RHEL3 U6
patch pool this evening (in kernel version 2.4.21-32.9.EL).


Comment 7 Red Hat Bugzilla 2005-09-28 14:22:05 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-663.html