Bug 198892

| Summary: | kernel deadlock on reading /proc/meminfo on 4 CPU's at the same time |
|---|---|
| Product: | Red Hat Enterprise Linux 4 |
| Reporter: | Ivan Szanto <szivan> |
| Component: | kernel |
| Assignee: | Kernel Maintainer List <kernel-maint> |
| Status: | CLOSED DUPLICATE |
| QA Contact: | Brian Brock <bbrock> |
| Severity: | medium |
| Docs Contact: | |
| Priority: | medium |
| Version: | 4.0 |
| Target Milestone: | --- |
| Target Release: | --- |
| Hardware: | i686 |
| OS: | Linux |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | Bug Fix |
| Doc Text: | |
| Story Points: | --- |
| Clone Of: | |
| Environment: | |
| Last Closed: | 2006-07-14 18:28:17 UTC |
| Type: | --- |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | |
| Bug Blocks: | 181409 |
| Attachments: | |

Created attachment 132432: ps
Created attachment 132433: ps-t
Created attachment 132434: bt-a
Created attachment 132435: ftp_chk
Created attachment 132436: pop3_chk
Created attachment 132437: squid_chk

Can you post the core file somewhere where I can download it? Thanks.

Never mind, I found the cause of this. If you examine the CPU 1 backtrace you see:

PID: 20923 TASK: eee4a6b0 CPU: 1 COMMAND: "squid_chk"
#0 [c03cbf1c] smp_call_function_interrupt at c0116cab
#1 [c03cbf24] call_function_interrupt at c02d30a9
EAX: c038ff00 EBX: c038ff00 ECX: c0326f00 EDX: 00000000 EBP: c03cbfcc
DS: 007b ESI: f8e943c1 ES: 007b EDI: ffffffff
CS: 0060 EIP: c02d1232 ERR: fffffffb EFLAGS: 00000286
#2 [c03cbf58] _spin_lock at c02d1232
#3 [c03cbf60] nr_blockdev_pages at c0160c76
#4 [c03cbf68] si_meminfo at c0143d21
#5 [c03cbf70] update_defense_level at f8e941dd
#6 [c03cbfc0] defense_timer_handler at f8e943c1
#7 [c03cbfc4] run_timer_softirq at c012a2cf
#8 [c03cbfe8] __do_softirq at c0126752
--- <soft IRQ> ---
#0 [ec58fde8] do_softirq at c01080f4
#1 [ec58fdf0] smp_apic_timer_interrupt at c0117483
#2 [ec58fdf8] apic_timer_interrupt at c02d30c9
#3 [ec58fe34] si_meminfo at c0143d21
#4 [ec58fe3c] meminfo_read_proc at c0189753
#5 [ec58ff50] proc_file_read at c0187c2c
#6 [ec58ff88] vfs_read at c015a41a
#7 [ec58ffa4] sys_read at c015a62b
#8 [ec58ffc0] system_call at c02d2688

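The spinlock is held in process context (si_meminfo, called from meminfo_read_proc). Then a timer interrupt arrives on the same CPU, and its softirq path (defense_timer_handler, update_defense_level, si_meminfo, nr_blockdev_pages) tries to acquire the same spinlock. The softirq spins forever because the process context it interrupted cannot run to release the lock, so CPU 1 hangs first. Eventually all other CPUs end up hanging in the same code path, spinning on the lock that CPU 1 never releases. I'll post a patch to fix this. Thanks for the great debug information.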
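
The pattern described above can be sketched in user space. This is only a toy model, not kernel code: `lock_sim`, `sim_trylock`, `sim_unlock`, `softirq_sim` and `read_meminfo_sim` are hypothetical names invented for illustration, and a try-lock stands in for `spin_lock()` so that the example reports the problem instead of hanging.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for the kernel spinlock (hypothetical, user-space only). */
static atomic_flag lock_sim = ATOMIC_FLAG_INIT;

/* Returns true if the lock was free and is now held by the caller. */
static bool sim_trylock(void) { return !atomic_flag_test_and_set(&lock_sim); }
static void sim_unlock(void) { atomic_flag_clear(&lock_sim); }

/* Simulated softirq handler that needs the same lock.
 * A real spin_lock() would spin forever where this returns false. */
static bool softirq_sim(void)
{
    if (!sim_trylock())
        return false;           /* lock held by the context we interrupted */
    sim_unlock();
    return true;
}

/* Simulated process-context reader.
 * irq_while_locked == true : the "interrupt" runs on the same CPU while
 *                            the lock is held (the deadlock case).
 * irq_while_locked == false: the handler only runs after the unlock, which
 *                            is the effect of keeping softirqs off while
 *                            the lock is held.
 * Returns whether the handler could take the lock. */
static bool read_meminfo_sim(bool irq_while_locked)
{
    bool handler_ok = false;
    bool got = sim_trylock();
    assert(got);                /* nobody else holds the lock in this model */
    (void)got;

    if (irq_while_locked) {
        handler_ok = softirq_sim();
        sim_unlock();
    } else {
        sim_unlock();
        handler_ok = softirq_sim();
    }
    return handler_ok;
}
```

Calling `read_meminfo_sim(true)` models the CPU 1 backtrace: the handler finds the lock taken by the very context it interrupted. Calling `read_meminfo_sim(false)` models deferring the handler until the lock is released, which is what kernel primitives such as `spin_lock_bh()` arrange. Whether the actual fix for this bug took that form is not stated in this report.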

Actually, now that I look at this some more, we've already fixed this for U4. Test kernels are available at: http://people.redhat.com/~jbaron/rhel4/

*** This bug has been marked as a duplicate of 174990 ***

Description of problem:
The customer reported recurring hangs on an LVS director (Red Hat Cluster Suite). During one hang he was able to produce a crash dump by pressing Alt-SysRq-c. The computer has 4 CPU cores (2 dual-core CPUs):

SYSTEM MAP: kernel/boot/System.map-2.6.9-34.0.1.ELsmp
DEBUG KERNEL: /usr/lib/debug/lib/modules/2.6.9-34.0.1.ELsmp/vmlinux (2.6.9-34.0.1.ELsmp)
DUMPFILE: cluster2/vmcore
CPUS: 4
DATE: Sun Jul 9 18:48:38 2006
UPTIME: 9 days, 09:00:04
LOAD AVERAGE: 43.99, 43.97, 43.91
TASKS: 116
NODENAME: cluster2
RELEASE: 2.6.9-34.0.1.ELsmp
VERSION: #1 SMP Wed May 17 17:05:24 EDT 2006
MACHINE: i686 (2794 Mhz)
MEMORY: 2 GB
PANIC: "Oops: 0002 [#1]" (check log for details)
PID: 20928
COMMAND: "ftp_chk"
TASK: f35d87b0 [THREAD_INFO: f377b000]
CPU: 0
STATE: TASK_RUNNING (SYSRQ)

Crash dump analysis revealed that the last four processes in the process list were deadlocked (see attachment "ps"). These four processes were "Service Monitoring Scripts" based on the following sample script from the Red Hat Cluster Suite documentation (see http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/s1-piranha-virtservs.html ):

---------------
#!/bin/sh
TEST=`dig -t soa example.com @$1 | grep -c dns.example.com`
if [ $TEST != "1" ]; then
    echo "OK"
else
    echo "FAIL"
fi

--------------

I am going to attach a copy of the actual scripts that were running at the time of the deadlock. These scripts should have finished within a second, but they had already been running for about 3 hours when the user pressed Alt-SysRq-c (see attachment "ps-t").

Each of the four processes was running on its own CPU, and each was waiting on a spinlock inside the read system call on the file /proc/meminfo (see attachment "bt-a"). Together they hung the machine, because no CPU was left free to do any work. This looks like a kernel deadlock.

Version-Release number of selected component (if applicable):
2.6.9-34.0.1.ELsmp

How reproducible:
The system hang occurred regularly, at intervals of one to two weeks.

Steps to Reproduce:
1. Install Red Hat Cluster Suite with 2 LVS nodes and 3 real servers.
2. Set it up to run the supplied scripts to monitor the named services on all three real servers.
3. Wait for a week or two.

Actual results:
The system hung.

Expected results:
The system goes on working normally.

Additional info:
A possible workaround is to use monitoring scripts that avoid reading /proc/meminfo.
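
Since the steps to reproduce amount to waiting, a load generator that reads /proc/meminfo from several processes at once might narrow the window. The sketch below is an untested assumption, not something taken from this report: `read_file_once` and `hammer_file` are hypothetical helper names, and the hang would presumably still need the IPVS timer softirq to fire on a CPU that is in the middle of such a read. On an affected kernel it may hang the machine, so it belongs on a test system only.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Read the whole file once; returns bytes read, or -1 on error. */
static long read_file_once(const char *path)
{
    char buf[4096];
    long total = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;
    close(fd);
    return n < 0 ? -1 : total;
}

/* Fork nproc children that each read `path` iters times in a loop.
 * Returns the number of children that finished every read successfully. */
static int hammer_file(const char *path, int nproc, int iters)
{
    int started = 0, ok = 0;

    assert(nproc >= 0 && iters >= 0);
    for (int i = 0; i < nproc; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;                      /* could not start more children */
        if (pid == 0) {                 /* child: do the reads and exit */
            for (int j = 0; j < iters; j++)
                if (read_file_once(path) < 0)
                    _exit(1);
            _exit(0);
        }
        started++;
    }
    for (int i = 0; i < started; i++) { /* parent: collect exit statuses */
        int status;
        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```

A call such as `hammer_file("/proc/meminfo", 4, 1000000)` would start one reader per CPU on the machine described here. The process and iteration counts are arbitrary illustrative values.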