Bug 198892 - kernel deadlock on reading /proc/meminfo on 4 CPU's at the same time
Status: CLOSED DUPLICATE of bug 174990
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: i686 Linux
Priority: medium
Severity: medium
Assigned To: Kernel Maintainer List
QA Contact: Brian Brock
Depends On:
Blocks: 181409
Reported: 2006-07-14 09:50 EDT by Ivan Szanto
Modified: 2007-11-30 17:07 EST (History)
Doc Type: Bug Fix
Last Closed: 2006-07-14 14:28:17 EDT


Attachments
ps (7.22 KB, application/octet-stream)
2006-07-14 09:50 EDT, Ivan Szanto
ps-t (15.86 KB, application/octet-stream)
2006-07-14 09:51 EDT, Ivan Szanto
bt-a (4.92 KB, application/octet-stream)
2006-07-14 09:51 EDT, Ivan Szanto
ftp_chk (163 bytes, application/octet-stream)
2006-07-14 09:52 EDT, Ivan Szanto
pop3_chk (165 bytes, application/octet-stream)
2006-07-14 09:53 EDT, Ivan Szanto
squid_chk (165 bytes, application/octet-stream)
2006-07-14 09:54 EDT, Ivan Szanto

Description Ivan Szanto 2006-07-14 09:50:01 EDT
Description of problem:

A customer reported recurring hangs on an LVS director (Red Hat Cluster Suite).
During one hang the customer was able to produce a crash dump by pressing
Alt-SysRq-c.

The machine has 4 CPU cores (2 dual-core CPUs):

  SYSTEM MAP: kernel/boot/System.map-2.6.9-34.0.1.ELsmp
DEBUG KERNEL: /usr/lib/debug/lib/modules/2.6.9-34.0.1.ELsmp/vmlinux (2.6.9-34.0.1.ELsmp)
    DUMPFILE: cluster2/vmcore
        CPUS: 4
        DATE: Sun Jul  9 18:48:38 2006
      UPTIME: 9 days, 09:00:04
LOAD AVERAGE: 43.99, 43.97, 43.91
       TASKS: 116
    NODENAME: cluster2
     RELEASE: 2.6.9-34.0.1.ELsmp
     VERSION: #1 SMP Wed May 17 17:05:24 EDT 2006
     MACHINE: i686  (2794 Mhz)
      MEMORY: 2 GB
       PANIC: "Oops: 0002 [#1]" (check log for details)
         PID: 20928
     COMMAND: "ftp_chk"
        TASK: f35d87b0  [THREAD_INFO: f377b000]
         CPU: 0
       STATE: TASK_RUNNING (SYSRQ)

Crash dump analysis revealed that the last four processes of the process list
were in a deadlock situation (see attachment "ps").

These four processes were "Service Monitoring Scripts" written using the
following sample script from the documentation of Red Hat Cluster Suite
(see
http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/s1-piranha-virtservs.html
)

---------------
#!/bin/sh

TEST=`dig -t soa example.com @$1 | grep -c dns.example.com`

if [ $TEST != "1" ]; then
	echo "OK"
else
	echo "FAIL"
fi
--------------

I am going to attach a copy of the actual scripts running at the time of the
deadlock. These scripts should have finished within a second, but they had
already been running for about 3 hours when the user pressed Alt-SysRq-c (see
attachment "ps-t").

Each of the four processes was running on a different CPU, and each of them
was waiting on a spinlock in the read system call on the file /proc/meminfo (see
attachment "bt-a"). Together they hung the machine, because no CPU was
left free to do any work.

This looks like a kernel deadlock.

Version-Release number of selected component (if applicable):

2.6.9-34.0.1.ELsmp

How reproducible:

the system hung regularly, at intervals of one to two weeks

Steps to Reproduce:
1. install RH Cluster suite with 2 LVS nodes and 3 real servers
2. set it up to run the supplied scripts to monitor the named services on all
three real servers
3. wait for a week or two
  
Actual results:

system hung

Expected results:

system goes on working normally

Additional info:

a possible workaround is to use monitoring scripts that avoid reading /proc/meminfo
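
One way to check whether a given monitoring script is affected is to trace it and look for /proc/meminfo in the trace. The following is only a minimal sketch: the helper name reads_meminfo and the trace file names are hypothetical, and it assumes strace is installed on the director.

```shell
#!/bin/sh
# Hypothetical helper: report whether an strace log shows /proc/meminfo
# being opened. The log is assumed to come from something like
#   strace -f -o ftp_chk.trace ./ftp_chk <real-server-ip>
reads_meminfo() {
    # strace prints path arguments in double quotes, so match the quoted path
    if grep -q '"/proc/meminfo"' "$1"; then
        echo "reads /proc/meminfo"
    else
        echo "clean"
    fi
}
```

Running reads_meminfo on a trace of each of ftp_chk, pop3_chk and squid_chk would show which commands in the scripts need to be replaced for the workaround.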
Comment 1 Ivan Szanto 2006-07-14 09:50:01 EDT
Created attachment 132432
ps
Comment 2 Ivan Szanto 2006-07-14 09:51:17 EDT
Created attachment 132433
ps-t
Comment 3 Ivan Szanto 2006-07-14 09:51:58 EDT
Created attachment 132434
bt-a
Comment 4 Ivan Szanto 2006-07-14 09:52:37 EDT
Created attachment 132435
ftp_chk
Comment 5 Ivan Szanto 2006-07-14 09:53:07 EDT
Created attachment 132436
pop3_chk
Comment 6 Ivan Szanto 2006-07-14 09:54:12 EDT
Created attachment 132437
squid_chk
Comment 7 Jason Baron 2006-07-14 14:13:56 EDT
Can you post the core file somewhere where I can download it? Thanks.
Comment 8 Jason Baron 2006-07-14 14:23:47 EDT
Never mind, I found the cause of this. If you examine the CPU 1 backtrace you see:

PID: 20923  TASK: eee4a6b0  CPU: 1   COMMAND: "squid_chk"
 #0 [c03cbf1c] smp_call_function_interrupt at c0116cab
 #1 [c03cbf24] call_function_interrupt at c02d30a9
    EAX: c038ff00  EBX: c038ff00  ECX: c0326f00  EDX: 00000000  EBP: c03cbfcc
    DS:  007b      ESI: f8e943c1  ES:  007b      EDI: ffffffff
    CS:  0060      EIP: c02d1232  ERR: fffffffb  EFLAGS: 00000286
 #2 [c03cbf58] _spin_lock at c02d1232
 #3 [c03cbf60] nr_blockdev_pages at c0160c76
 #4 [c03cbf68] si_meminfo at c0143d21
 #5 [c03cbf70] update_defense_level at f8e941dd
 #6 [c03cbfc0] defense_timer_handler at f8e943c1
 #7 [c03cbfc4] run_timer_softirq at c012a2cf
 #8 [c03cbfe8] __do_softirq at c0126752
--- <soft IRQ> ---
 #0 [ec58fde8] do_softirq at c01080f4
 #1 [ec58fdf0] smp_apic_timer_interrupt at c0117483
 #2 [ec58fdf8] apic_timer_interrupt at c02d30c9
 #3 [ec58fe34] si_meminfo at c0143d21
 #4 [ec58fe3c] meminfo_read_proc at c0189753
 #5 [ec58ff50] proc_file_read at c0187c2c
 #6 [ec58ff88] vfs_read at c015a41a
 #7 [ec58ffa4] sys_read at c015a62b
 #8 [ec58ffc0] system_call at c02d2688


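The spinlock is held in process context (inside si_meminfo, called from
meminfo_read_proc) when a timer interrupt arrives on the same CPU. The softirq
it runs (defense_timer_handler -> update_defense_level -> si_meminfo ->
nr_blockdev_pages) then tries to acquire the same spinlock. It can never
succeed, because the lock holder cannot run again until the softirq returns.
Thus CPU 1 hangs first. Eventually all other CPUs end up hanging in the same
code path. I'll post a patch to fix this. Thanks for the great debug information.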
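
The self-deadlock described above can be imitated in user space. What follows is only a loose analogy, not kernel code: flock(1) from util-linux stands in for the spinlock, and the function names handler and demo are made up for illustration. The "process context" takes an exclusive lock on one file descriptor, and the "interrupt handler" opens the same file again and tries to lock it while the first lock is still held. The sketch uses a non-blocking attempt so that it terminates; a kernel spinlock has no such fallback and simply spins.

```shell
#!/bin/sh
# Loose user-space analogy only: flock(1) stands in for the kernel spinlock.

# Stand-in for the timer softirq, which needs the lock that the code it
# interrupted is already holding.
handler() {
    # Opening the file again yields a separate open file description,
    # so this lock attempt conflicts with the lock held on fd 9.
    if ( exec 8>"$1" && flock -n 8 ); then
        echo "handler: lock acquired"
    else
        echo "handler: lock busy (the kernel would spin here forever)"
    fi
}

demo() {
    lockfile=$(mktemp) || return 1
    exec 9>"$lockfile"
    flock 9              # "process context" takes the lock
    handler "$lockfile"  # "interrupt" arrives while the lock is still held
    flock -u 9           # the lock is released
    handler "$lockfile"  # the same handler now gets the lock
    exec 9>&-
    rm -f "$lockfile"
}

demo
```

On a Linux system with util-linux, the first handler call should report the lock as busy and the second should acquire it. In kernel code the usual remedy for this class of bug is to keep the interrupting context from running on the local CPU while the lock is held (for example with spin_lock_bh or spin_lock_irqsave), or to avoid taking the lock from softirq context at all. Which approach the actual fix takes is not stated in this report.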
Comment 9 Jason Baron 2006-07-14 14:28:17 EDT
Actually, now that I look at this some more, we've already fixed this for U4.
Test kernels are available at: http://people.redhat.com/~jbaron/rhel4/

*** This bug has been marked as a duplicate of 174990 ***
