Bug 144680

Summary: My server freeze (Only ping and SysRq respond)
Product: Red Hat Enterprise Linux 3
Reporter: Carlos Antonio Gomez <cgomez>
Component: kernel
Assignee: Larry Woodman <lwoodman>
Status: CLOSED WONTFIX
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 3.0
CC: petrides, raimondi, riel
Hardware: i686
OS: Linux
Doc Type: Bug Fix
Last Closed: 2007-10-19 19:09:34 UTC
Attachments (description, flags):
SysRQ outputs: none
Same old, same old: none
Sysrq outputs: none

Description Carlos Antonio Gomez 2005-01-10 17:05:20 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET 
CLR 1.1.4322)

Description of problem:
Hi:

I have Red Hat Enterprise Linux AS release 3,
kernel 2.4.21-27.0.1.EL on an i686, and my server hangs at varying
intervals, always with the same symptoms:
 - all daemons die and only the kernel stays up (ping, SysRq, ...)

This problem happens (98% of the time) when I send an email to the
mailing list (the server runs postfix + mailman and all members are local).

This problem is described in Bug # 122077.

Regards  


Version-Release number of selected component (if applicable):


How reproducible:
Didn't try


Additional info:

Comment 1 Larry Woodman 2005-01-10 18:05:43 UTC
Carlos, can you get me AltSysrq-M, AltSysrq-T and AltSysrq-W outputs
when the system hangs?  I can't reproduce the problem inside Red Hat.

Thanks, Larry Woodman


Comment 2 Carlos Antonio Gomez 2005-01-10 19:17:20 UTC
Hi Larry:
I'm very sorry, I will have to wait for it to happen again.
Nevertheless, you can use the AltSysrq-M, AltSysrq-T and AltSysrq-W
outputs posted in bug # 122077, Additional Comment #83, and the top
screenshot posted in Additional Comment #84.

At that time I was using kernel 2.4.21-26.EL from
http://people.redhat.com/~lwoodman/RHEL3/


Comment 3 Carlos Antonio Gomez 2005-02-03 21:47:01 UTC
Hi Larry:

My server froze once again.
I have attached the SysRQ outputs.

 Regards

Comment 4 Carlos Antonio Gomez 2005-02-03 21:55:56 UTC
Created attachment 110628
SysRQ  outputs

Comment 5 Larry Woodman 2005-02-07 16:53:13 UTC
1.) The problem is that one process is doing a quotactl when memory
is very low: it downs dqopt->dqio_sem, then does a read which
eventually calls __alloc_pages(), which blocks in wakeup_kswapd.

[<c0148e51>] wakeup_kswapd [kernel] 0xf1 (0xca94dc10)
[<c014abb0>] __alloc_pages [kernel] 0xf0 (0xca94dc64)
[<c011d6c5>] schedule [kernel] 0x125 (0xca94dc84)
[<c013d6af>] do_generic_file_read [kernel] 0x5df (0xca94dcac)
[<c013dd85>] generic_file_new_read [kernel] 0xc5 (0xca94dcec)
[<c013dbe0>] file_read_actor [kernel] 0x0 (0xca94dcfc)
[<c013d406>] do_generic_file_read [kernel] 0x336 (0xca94dd04)
[<c013deaf>] generic_file_read [kernel] 0x2f (0xca94dd38)
[<c0175734>] read_blk [kernel] 0x74 (0xca94dd50)
[<c0176495>] find_block_dqentry [kernel] 0x55 (0xca94dd7c)
[<c0175734>] read_blk [kernel] 0x74 (0xca94dda8)
[<c0176691>] find_tree_dqentry [kernel] 0xd1 (0xca94ddd4)
[<c017665d>] find_tree_dqentry [kernel] 0x9d (0xca94ddf4)
[<c017665d>] find_tree_dqentry [kernel] 0x9d (0xca94de14)
[<c017665d>] find_tree_dqentry [kernel] 0x9d (0xca94de34)
[<c01766f1>] v2_read_dquot [kernel] 0x41 (0xca94de54)
[<c0172e59>] read_dqblk [kernel] 0x59 (0xca94deb0)
[<c0173668>] dqget [kernel] 0x128 (0xca94dec8)
[<c0174e73>] vfs_get_dqblk [kernel] 0x23 (0xca94dee8)
[<c0171ed4>] do_quotactl [kernel] 0x354 (0xca94defc)
[<c0161f6c>] __user_walk [kernel] 0x5c (0xca94df44)
[<c0161135>] path_release [kernel] 0x15 (0xca94df54)
[<c0171b79>] resolve_dev [kernel] 0x79 (0xca94df60)
[<c015d6fc>] sys_stat64 [kernel] 0x5c (0xca94df84)
[<c0171fe2>] sys_quotactl [kernel] 0xa2 (0xca94df98)


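2.) kswapd wakes up and eventually calls prune_icache(), which ends up
trying to down dqopt->dqio_sem via dqput().  That semaphore is still
held by the quotactl process from 1.), which is itself waiting on
kswapd, so at this point the system is deadlocked.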

[<c010abca>] __down [kernel] 0x5a (0xc25e5ee4)
[<c010ad24>] __down_failed [kernel] 0x8 (0xc25e5f18)
[<c0175333>] .text.lock.dquot [kernel] 0x19 (0xc25e5f28)
[<c0173410>] dqput [kernel] 0x50 (0xc25e5f3c)
[<c01741d6>] dquot_drop [kernel] 0x46 (0xc25e5f48)
[<c016cab7>] clear_inode [kernel] 0x67 (0xc25e5f58)
[<c016cb6c>] dispose_list [kernel] 0x3c (0xc25e5f64)
[<c016cde5>] prune_icache [kernel] 0x75 (0xc25e5f7c)
[<c016d004>] shrink_icache_memory [kernel] 0x24 (0xc25e5fa0)
[<c0148a28>] do_try_to_free_pages_kswapd [kernel] 0x168 (0xc25e5fac)
[<c0148bd8>] kswapd [kernel] 0x68 (0xc25e5fd0)
[<c0148b70>] kswapd [kernel] 0x0 (0xc25e5fe4)
[<c010945d>] kernel_thread_helper [kernel] 0x5 (0xc25e5ff0)


At this point I am trying to figure out a way around this deadlock.


Larry Woodman
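
The lock cycle described in comment 5 can be illustrated with a minimal user-space sketch, assuming two Python threads stand in for the quotactl caller and kswapd. Everything here is hypothetical scaffolding (simulate, hold_sem_during_alloc, the two events); only the name dqio_sem mirrors the traces. Timeouts replace the real indefinite block so the demo terminates. The hold_sem_during_alloc=False variant is just one illustrative way to break the cycle, not the fix being worked on, which this thread does not describe.

```python
import threading


def simulate(hold_sem_during_alloc=True, timeout=0.2):
    """Return True if the two paths get stuck on each other (seen via timeouts)."""
    dqio_sem = threading.Semaphore(1)   # stands in for dqopt->dqio_sem
    wake_kswapd = threading.Event()     # "memory is low, please reclaim"
    reclaim_done = threading.Event()    # set once reclaim has made progress
    result = {}

    def quotactl_path():
        dqio_sem.acquire()              # like down(&dqopt->dqio_sem)
        if not hold_sem_during_alloc:
            dqio_sem.release()          # hypothetical: drop it before allocating
        wake_kswapd.set()               # the read needs a page, so wake reclaim
        # Wait for reclaim; in the problem case the semaphore is still held.
        # Waits longer than kswapd's attempt so the outcome is not a race.
        result["allocation_ok"] = reclaim_done.wait(2 * timeout)
        if hold_sem_during_alloc:
            dqio_sem.release()

    def kswapd_path():
        wake_kswapd.wait()
        # Pruning an inode drops its dquot, which needs the same semaphore.
        got = dqio_sem.acquire(timeout=timeout)
        result["reclaim_ok"] = got
        if got:
            dqio_sem.release()
            reclaim_done.set()

    threads = [
        threading.Thread(target=quotactl_path),
        threading.Thread(target=kswapd_path),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return not (result["allocation_ok"] and result["reclaim_ok"])


if __name__ == "__main__":
    print("semaphore held across allocation:", simulate(True))
    print("semaphore dropped before allocation:", simulate(False))
```

In the real kernel neither side has a timeout, which is why the machine stays hung while ping and SysRq, which do not need these paths, still respond.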



Comment 6 Carlos Antonio Gomez 2005-02-10 16:41:28 UTC
Hi Larry

I hope there is a solution to this problem

Regards and good luck 

Comment 7 Albert Graham 2005-02-16 13:01:45 UTC
Created attachment 111125
Same old, same old

Comment 8 Albert Graham 2005-02-16 13:02:42 UTC
I'm sure I have the same problem here. I have an IPMI card installed,
so I was able to log in and copy/paste a few screens as follows:

Oops: 0000
nfsd nfs lockd sunrpc e1000 bonding microcode ext3 jbd dpt_i2o diskdumplib sd_mod scsi_mod
CPU:    0
EIP:    0060:[<c0169810>]    Not tainted
EFLAGS: 00010246

EIP is at try_to_free_buffers [kernel] 0x30 (2.4.21-27.0.2.ELsmp/i686)
eax: c03a8b80   ebx: aecea3a8   ecx: 00000000   edx: 00000000
esi: c3fb9fe8   edi: aecea3a8   ebp: c3fb9fe8   esp: cbaabf68
ds: 0068   es: 0068   ss: 0068
Process kswapd (pid: 11, stackpage=cbaab000)
Stack: 000001d0 c3fba004 c3fb9fe8 00000011 c03a7080 c015604c c3fb9fe8 00000000
       c03a5d80 c03a8248 c1a59668 c03a6f40 0000002e c03a7080 0001658d 00000001
       00000040 c0156c24 c03a7080 00000040 00000000 00000d6f 00000000 0000e494
Call Trace:   [<c015604c>] rebalance_laundry_zone [kernel] 0x46c (0xcbaabf7c)
[<c0156c24>] do_try_to_free_pages_kswapd [kernel] 0x204 (0xcbaabfac)
[<c0156d38>] kswapd [kernel] 0x68 (0xcbaabfd0)
[<c0156cd0>] kswapd [kernel] 0x0 (0xcbaabfe4)
[<c01095ad>] kernel_thread_helper [kernel] 0x5 (0xcbaabff0)

Code: 8b 53 1c 8b 43 14 83 e2 06 09 d0 0f 85 b9 00 00 00 8b 5b 2c

Kernel panic: Fatal exception

And here is another from an identical machine.


Oops: 0002                                                           
          bs/jav
nfsd nfs lockd sunrpc e1000 bonding microcode ext3 jbd dpt_i2o diskdumplib sd_mod scsi_mod
CPU:    2
EIP:    0060:[<c017dbdd>]    Not tainted
EFLAGS: 00010206

EIP is at prune_dcache [kernel] 0x3d (2.4.21-27.0.2.ELsmp/i686)
eax: c03a9458   ebx: dec2a118   ecx: e760d300   edx: 00730073
esi: dec2a100   edi: debf7580   ebp: 000117a4   esp: cbaabf88
ds: 0068   es: 0068   ss: 0068
Process kswapd (pid: 11, stackpage=cbaab000)
Stack: e760d300 dec2a180 c03a4a80 0000808d 00000000 00000040 c017e0f8 00017d94
       00000000 c0156b70 00000006 000001d0 00000014 00010ead 00000000 0000576e
       00000000 00000000 c0156d38 000001d0 00000001 00000040 00000068 c0156cd0
Call Trace:   [<c017e0f8>] shrink_dcache_memory [kernel] 0x68 (0xcbaabfa0)
[<c0156b70>] do_try_to_free_pages_kswapd [kernel] 0x150 (0xcbaabfac)
[<c0156d38>] kswapd [kernel] 0x68 (0xcbaabfd0)
[<c0156cd0>] kswapd [kernel] 0x0 (0xcbaabfe4)
[<c01095ad>] kernel_thread_helper [kernel] 0x5 (0xcbaabff0)

Code: 89 02 89 5b 04 89 1b 8b 46 54 a9 08 00 00 00 0f 85 4b 01 00

Kernel panic: Fatal exception

I've also attached another one (jpg) (above)

I get at least one of these a day; the only solution is to revert to
RHAS U2.

RHAS U3+U4 are way too unstable for production.


Comment 9 Larry Woodman 2005-02-16 17:12:46 UTC
Carlos and Albert, none of the OOPS you are hitting appear to be
related to the system hang that Carlos reported.  That hang is a known
interaction between kswapd and disk quotas and I am working on a fix
for that problem.  

The OOPSes that have been reported here are list corruption that I
am pretty sure has been fixed in the latest pre-RHEL3-U5 kernel.  Can
you verify this for me?  It can be located here:

>>>http://people.redhat.com/~lwoodman/RHEL3/

Larry Woodman


Comment 10 Carlos Antonio Gomez 2005-02-21 16:08:10 UTC
Hi Larry

Do you need more information?
Why is the bug in the NEEDINFO state?

Regards 
  Carlos A.

Comment 12 Carlos Antonio Gomez 2005-03-21 17:42:23 UTC
Hi Larry

My server hangs three or four times every week!
Have you fixed the bug?

Regards 

Comment 13 Carlos Antonio Gomez 2005-05-12 15:26:33 UTC
Created attachment 114298
Sysrq outputs

Hi Larry:

My server froze once again.
I have attached the SysRQ outputs.

 Regards

Comment 14 Larry Woodman 2005-05-12 15:44:13 UTC
Carlos, your system only has 256MB of RAM, you have exhausted your swap
space (Free swap: 0kB), and your system should OOM kill one or more processes
to lighten the load.  Basically this system is overloaded.  What workload are
you running?  Can you add more RAM?  While we do officially support 256MB, the
system will perform poorly, especially if you have 2 or more CPUs that want
that memory.  Finally, can you get me "uname -a" output so I can see the exact
kernel version?

Thanks, Larry Woodman
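
The swap check mentioned above can be sketched as a small parser, assuming the AltSysrq-M text has been saved somewhere readable. The function name free_swap_kb is hypothetical, and the only format assumed is the "Free swap: 0kB" fragment quoted in this comment; the rest of the SysRq-M layout is not relied on.

```python
import re

# Matches the fragment quoted in the comment above, e.g. "Free swap: 0kB".
_FREE_SWAP = re.compile(r"Free swap:\s*(\d+)\s*kB")


def free_swap_kb(sysrq_m_text):
    """Return the free swap in kB found in the text, or None if absent."""
    match = _FREE_SWAP.search(sysrq_m_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    sample = "Free swap: 0kB"  # the one fragment taken from this thread
    kb = free_swap_kb(sample)
    if kb == 0:
        print("swap is exhausted")
```

A result of 0 only says that swap is full at the moment of the dump; it does not by itself show which processes are using the memory.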



Comment 15 Larry Woodman 2005-05-12 15:48:37 UTC
Sorry, I meant to say 512MB of ram.

Larry


Comment 16 Carlos Antonio Gomez 2005-05-12 16:25:25 UTC
Hi Larry:
My server has 512 MB of RAM and 1 GB of swap; perhaps I can increase to 1 GB
of RAM and 2 GB of RAM.
Most used:
  It is a mail server (postfix + mailman, 1200 users) and 99% of the list
users are local.  Quotas are enabled and it runs HORDE as the web client.
Little used:
vsftp, samba, secondary DNS

The server hangs whether mail is sent to small or large lists.

uname output
uname output
Linux ceis.cujae.edu.cu 2.4.21-27.0.2.EL #1 Wed Jan 19 02:20:34 GMT 2005 i686 
i686 i386 GNU/Linux

Sorry Larry, I don't speak English, but I am trying to communicate.

regards 
   Carlos A

Comment 18 Carlos Antonio Gomez 2005-05-12 16:28:52 UTC
sorry I meant to say 2 GB  of SWAP.


Comment 19 Carlos Antonio Gomez 2005-05-12 16:38:27 UTC
I had Red Hat 7.3 with 256 MB of RAM and a 1.7 GHz CPU (currently 2.8 GHz)
with the same number of users and services, and it worked perfectly.  The
problem began when I migrated.

Comment 20 Larry Woodman 2005-05-12 17:02:56 UTC
Carlos, your system needs more RAM or swap.  The total anonymous memory is
exceeding RAM+swap; that's why the system is hanging/OOM killing.

Larry
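
The RAM+swap arithmetic can be written out as a rough sketch. The 512 MB, 1 GB and 2 GB figures are the ones Carlos gives in this thread; the 1600 MB anonymous-memory demand is a made-up number for illustration only, and the helper names are hypothetical. The model deliberately ignores kernel memory and page cache.

```python
def memory_headroom_mb(ram_mb, swap_mb, anon_mb):
    """RAM plus swap minus anonymous memory demand, in MB.

    A value at or below zero means anonymous memory no longer fits,
    which is the situation described in the comment above.
    """
    return ram_mb + swap_mb - anon_mb


def is_exhausted(ram_mb, swap_mb, anon_mb):
    """True when anonymous memory demand meets or exceeds RAM + swap."""
    return memory_headroom_mb(ram_mb, swap_mb, anon_mb) <= 0


if __name__ == "__main__":
    anon_demand = 1600  # hypothetical figure, not taken from the thread
    # Current machine: 512 MB RAM + 1 GB swap.
    print(memory_headroom_mb(512, 1024, anon_demand))
    # Proposed upgrade: 1 GB RAM + 2 GB swap.
    print(memory_headroom_mb(1024, 2048, anon_demand))
```

Adding swap raises the ceiling in this arithmetic just as RAM does, but a system that lives in swap will still be slow.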


Comment 21 Carlos Antonio Gomez 2005-05-12 18:20:54 UTC
Larry, but shouldn't the kernel prevent the system from dying?
Why did it work perfectly for me with 7.3?
And what about what you described in "Comment #5 From Larry Woodman (lwoodman)
on 2005-02-07 11:53 EST [reply]" in this bug?

Thanks and regards

Carlos

Comment 22 Albert Graham 2005-05-20 10:32:44 UTC
Larry, can you cross ref bug
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=158075 with this bug please.

Thank you.

Albert Graham.


Comment 23 Larry Woodman 2006-12-08 13:46:29 UTC
Is this bug still occurring after you increase the system size in terms of
memory and swap space?  This system simply used all RAM and swap space, so it
OOM killed a process.

Larry Woodman


Comment 24 Carlos Antonio Gomez 2006-12-08 22:06:05 UTC
Hi Larry
	
At this time I don't have any RAM to add, and the system hangs three or four
times every day. Can't I increase the swap space?

I have Linux ceis.cujae.edu.cu 2.4.21-27.0.2.EL #1 Wed Jan 19 02:20:34 GMT 2005
i686 i686 i386 GNU/Linux


Comment 25 RHEL Program Management 2007-10-19 19:09:34 UTC
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet that criteria, it is now being closed.
 
For more information of the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.