Bug 597855 - MD error / random reboots
Summary: MD error / random reboots
Keywords:
Status: CLOSED DUPLICATE of bug 573106
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 514490
 
Reported: 2010-05-30 19:02 UTC by David Kovalsky
Modified: 2014-03-31 23:45 UTC
CC: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-04-01 13:31:01 UTC
Target Upstream Version:
Embargoed:



Description David Kovalsky 2010-05-30 19:02:41 UTC
My machine started rebooting by itself frequently with the 2.6.18-194.3.1.el5xen kernel.

What seems to trigger the issue is starting a few (~6-8) virtual machines at the same time while an array is rebuilding/verifying shortly after reboot.

array config:
md104 : active raid5 sdd3[0] sdc3[1] sdb6[2] sda9[3]
      11999808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

During the VM bootup the array is syncing very slowly (~600KB/s).
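
For reference, the resync throttle can be checked like this (just a quick sketch; the values in the comments are the usual kernel defaults, not taken from this box):

cat /proc/sys/dev/raid/speed_limit_min    # typically 1000 KB/s per device
cat /proc/sys/dev/raid/speed_limit_max    # typically 200000 KB/s
sysctl -w dev.raid.speed_limit_min=10000  # temporarily raise the floor, to rule out simple throttling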

Some thoughts:
 - I've never seen this before upgrading to 5.5
 - the machine is otherwise physically healthy. Memory test passes, cooling is OK. I recently changed the HDD (SATA) cables, but the issue still occurs.


The only interesting thing in the logs is this:

May 30 20:48:50 ns1 kernel: INFO: task md104_resync:491 blocked for more than 120 seconds.
May 30 20:48:50 ns1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 30 20:48:50 ns1 kernel: md104_resync  D ffff88000100a460     0   491     11           494   490 (L-TLB)
May 30 20:48:50 ns1 kernel:  ffff8801da5fbd70  0000000000000246  ffff8801da5fbcb0  0000000000000000
May 30 20:48:50 ns1 kernel:  0000000000000009  ffff8801da5e77e0  ffff88000002b7a0  000000000008169c
May 30 20:48:50 ns1 kernel:  ffff8801da5e79c8  0000000000000000
May 30 20:48:50 ns1 kernel: Call Trace:
May 30 20:48:50 ns1 kernel:  [<ffffffff80264eaf>] __kprobes_text_start+0x317/0x438
May 30 20:48:50 ns1 kernel:  [<ffffffff8029c309>] keventd_create_kthread+0x0/0xc4
May 30 20:48:50 ns1 kernel:  [<ffffffff8040b4b8>] md_do_sync+0x1d8/0x833
May 30 20:48:50 ns1 kernel:  [<ffffffff80288749>] dequeue_task+0x18/0x37
May 30 20:48:50 ns1 kernel:  [<ffffffff80288790>] deactivate_task+0x28/0x5f
May 30 20:48:50 ns1 kernel:  [<ffffffff8026ef31>] monotonic_clock+0x35/0x7b
May 30 20:48:51 ns1 kernel:  [<ffffffff80262dd3>] thread_return+0x6c/0x113
May 30 20:48:51 ns1 kernel:  [<ffffffff80248d8c>] try_to_wake_up+0x392/0x3a4
May 30 20:48:51 ns1 kernel:  [<ffffffff8029c521>] autoremove_wake_function+0x0/0x2e
May 30 20:48:51 ns1 kernel:  [<ffffffff8029c309>] keventd_create_kthread+0x0/0xc4
May 30 20:48:51 ns1 kernel:  [<ffffffff8040be8c>] md_thread+0xf8/0x10e
May 30 20:48:51 ns1 kernel:  [<ffffffff8029c309>] keventd_create_kthread+0x0/0xc4
May 30 20:48:51 ns1 kernel:  [<ffffffff8040bd94>] md_thread+0x0/0x10e
May 30 20:48:51 ns1 kernel:  [<ffffffff80233b0f>] kthread+0xfe/0x132
May 30 20:48:51 ns1 kernel:  [<ffffffff80260b2c>] child_rip+0xa/0x12
May 30 20:48:51 ns1 kernel:  [<ffffffff8029c309>] keventd_create_kthread+0x0/0xc4
May 30 20:48:51 ns1 kernel:  [<ffffffff80233a11>] kthread+0x0/0x132
May 30 20:48:51 ns1 kernel:  [<ffffffff80260b22>] child_rip+0x0/0x12
May 30 20:48:51 ns1 kernel:

Comment 1 Andrew Jones 2010-06-01 11:24:10 UTC
(In reply to comment #0)
> Some thoughts:
>  - I've never seen this before upgrading to 5.5

Can you please try 5.4, or any other earlier release? I don't know of any patch in 5.5 that could be responsible, but it should be a quick and possibly informative exercise to try.

Thanks,
Drew

Comment 2 David Kovalsky 2010-06-01 12:38:53 UTC
Maybe it's not a regression from 5.4.

Recently one of the hard drives started showing a much higher temperature than usual. I suspect it is failing, so perhaps HW issues (inability to read/write a block within 120s) appearing now are triggering the condition.
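
For the record, this is roughly how the drive temperature and bad-sector counters are being checked (smartctl from smartmontools; /dev/sda is just an example device here):

smartctl -H /dev/sda                                            # overall health self-assessment
smartctl -A /dev/sda | egrep -i 'temperature|realloc|pending'   # temperature and reallocated/pending sector counts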

I checked the logs: I installed kernel-xen-2.6.18-194.3.1.el5.x86_64 on May 15th, and that is when the tracebacks started appearing in /var/log/messages. Up to then the machine had been running kernel-xen-2.6.18-164.15.1.el5.x86_64 since March 19th with no issues.

Running a downgraded kernel is not easy, since this is a semi-production system. I've scheduled an outage for Thursday night and will see what I can find.

Comment 3 David Kovalsky 2010-06-07 21:44:12 UTC
I've tried running the old kernel for a couple of hours, but the messages didn't pop up. That likely doesn't prove anything: the MD errors are triggered pretty randomly, and this time a rebuild of a dirty array wasn't necessary.

Comment 4 David Kovalsky 2010-12-20 14:29:45 UTC
And it's back again. Different machine, this time a kernel on bare metal (no Xen). The message is repeated several times, but only for md101_resync.

kernel-2.6.18-194.26.1.el5

This started appearing after updating to the above kernel; there's no mention of it in the logs from the time the machine was running kernel-2.6.18-194.17.1.el5.

If there are any logs / command outputs I can take, now is the time. Let me know what information I can add to make the issue easier for you guys to fix (see the end of this comment for what I plan to grab on the next occurrence).

/proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md100 : active raid5 sdd1[2] sdc1[1] sdb1[0]
      585937280 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md101 : active raid5 sdd2[2] sdc2[1] sdb2[0]
      585937280 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md0 : active raid1 sda1[0] hdb1[1]
      513984 blocks [2/2] [UU]

md1 : active raid1 sda2[0] hdb2[1](W)
      30716160 blocks [2/2] [UU]
unused devices: <none>

/var/log/messages:

Dec 19 00:48:09 bigbang kernel: INFO: task md101_resync:30403 blocked for more than 120 seconds.
Dec 19 00:48:09 bigbang kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 19 00:48:09 bigbang kernel: md101_resync  D ffff8101e9e55080     0 30403     87               30402 (L-TLB)
Dec 19 00:48:09 bigbang kernel:  ffff8101f4471d70 0000000000000046 e16bc1a617ae2752 ffff81021fa0a40c
Dec 19 00:48:09 bigbang kernel:  ffff81021fa0000c 000000000000000a ffff8100a6abe7e0 ffff8101e9e55080
Dec 19 00:48:09 bigbang kernel:  0000500ce5c9b391 0000000000002469 ffff8100a6abe9c8 00000000c6048baf
Dec 19 00:48:09 bigbang kernel: Call Trace:
Dec 19 00:48:09 bigbang kernel:  [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
Dec 19 00:48:09 bigbang kernel:  [<ffffffff8021af2b>] md_do_sync+0x1d8/0x833
Dec 19 00:48:09 bigbang kernel:  [<ffffffff8008ca47>] enqueue_task+0x41/0x56
Dec 19 00:48:09 bigbang kernel:  [<ffffffff8008cab2>] __activate_task+0x56/0x6d
Dec 19 00:48:09 bigbang kernel:  [<ffffffff8008c897>] dequeue_task+0x18/0x37
Dec 19 00:48:09 bigbang kernel:  [<ffffffff80062ff8>] thread_return+0x62/0xfe
Dec 19 00:48:09 bigbang kernel:  [<ffffffff800a0b16>] autoremove_wake_function+0x0/0x2e
Dec 19 00:48:09 bigbang kernel:  [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
Dec 19 00:48:09 bigbang kernel:  [<ffffffff8021b8ff>] md_thread+0xf8/0x10e
Dec 19 00:48:09 bigbang kernel:  [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
Dec 19 00:48:09 bigbang kernel:  [<ffffffff8021b807>] md_thread+0x0/0x10e
Dec 19 00:48:09 bigbang kernel:  [<ffffffff8003290a>] kthread+0xfe/0x132
Dec 19 00:48:09 bigbang kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Dec 19 00:48:09 bigbang kernel:  [<ffffffff800a08fe>] keventd_create_kthread+0x0/0xc4
Dec 19 00:48:09 bigbang kernel:  [<ffffffff8003280c>] kthread+0x0/0x132
Dec 19 00:48:09 bigbang kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Dec 19 00:48:09 bigbang kernel:
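
Roughly the set of outputs I plan to collect the next time it happens (assuming mdadm and smartmontools are available); let me know if anything else would help:

cat /proc/mdstat            # array state and resync progress
mdadm --detail /dev/md101   # per-device state of the affected array
dmesg                       # kernel ring buffer around the hang
smartctl -a /dev/sdb        # and the same for sdc, sdd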

Comment 5 David Kovalsky 2011-01-24 16:04:32 UTC
Hi, any updates? 

I can reproduce this with every raid-check run (weekly). The disks don't have any reallocated or pending sectors.

kernel 2.6.18-194.32.1.el5
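
For reference, the check can also be kicked off by hand, which is roughly what the weekly raid-check cron job does (device name taken from the mdstat output above; the cron script path may differ between releases):

echo check > /sys/block/md101/md/sync_action   # start a check pass on the affected array
cat /sys/block/md101/md/sync_action            # shows "check" while it runs
cat /proc/mdstat                               # check progress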

Comment 6 Paolo Bonzini 2011-04-01 13:31:01 UTC
Not Xen related.

*** This bug has been marked as a duplicate of bug 573106 ***

