Bug 249563

Summary: Deadlocks on high disk I/O
Product: [Fedora] Fedora Reporter: Matthias Hensler <adv>
Component: kernelAssignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: medium    
Version: 7CC: bruno, bugzilla.redhat.com, chris.brown, davej, mishu, rrauenza, sputhenp
Target Milestone: ---Keywords: Reopened
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: 2.6.24 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-02-03 19:44:57 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Processlist showing the bug
none
Calltrace from echo t >/proc/sysrq-trigger
none
Patch to fix the issue
none
vmstat and memifo data from while processes were stuck none

Description Matthias Hensler 2007-07-25 15:24:57 UTC
Description of problem:

Two different servers (with different hardware and HDD-controllers) running a
similar setup (although not identical) are hanging nearly every morning (always
the time when high disk I/O occurs due to backups).

Version-Release number of selected component (if applicable):

The problem exists since late Fedora Core 5, 6 and 7 (as far as I remember the
problems started around april 2007). The latest updates were always installed
(including kernels), without difference.

The problem got worse in the last weeks. Earlier we had a deadlock situation
once or twice a week, at the moment the deadlock nearly occurs daily.

Since the hardware is different on both servers I suspect a software problem
here and no hardware problem. From the current investigations it looks like a
kernel problem, although all stuck processes are running in userspace.

How reproducible:

That is the difficult part. So far I think the problem was narrowed down to our
mailsystem, which is the most identical part on both servers.

It is running like this:

Server1: Postfix -> Amavisd (including Spamassassin) -> Cyrus-Imapd (delivering
via lmtp)

Server2: Postfix > Amavisd -> Procmail -> Spamassassin -> Cyrus-Imapd
(delivering via yrus-deliver)

Both Postfix instances use MySQL for their mailaliases.

There are different partitions (all on LVM) for the serverprocesses, e.g. single
partitions for the imap-spool, the /var/lib/imap directory and the postfix queue.

Each partition as a corresponding partition on the second disk and is mirrored
by rsync every 4 hours.

Steps to Reproduce:
1. Wait for high disk I/O
2.
3.
  
Actual results:

The system runs into a deadlock, but is still responsive (see below).

Expected results:

No deadlock should occur.

Additional info:

If a SSH session is open to the server it is still possible to lookup the
processtable, kill some processes (eventually resulting in breaking the
deadlock), trigger sysrq, run strace, etc.

However, trying to log into that sever in such a situation results in a timeout.

I am happy to provide debugging info if pointed to what would be helpful.

I will attach two files. The first shows a processlist when the problem exists.
Please note that a lot of processes are stuck (e.g. crond, httpd, etc).

The second one is a callstacktrace for all processes (resulting from "echo t >
/proc/sysrq-trigger").

As far as I know each of such a deadlock situation had high I/O load on
/var/spool/imap (it is always this partition). At the moment it is often
triggered by rsync mirroring that partition to its corresponding partition on
the second disk, but not always (sometimes it is triggered by a backup to a
ftp-server, but again the same partition).

All partitions use ext3 as filesystem.

The fstab for the partitions in question looks like that:

/dev/hd2/varspoolimap               /var/spool/imap                 ext3   
async,auto,data=ordered,exec,noacl,noatime,nodev,nosuid,nouser,rw     1 2

/dev/hd1/varspoolimap               /mnt/mirror/varspoolimap        ext3   
async,auto,data=ordered,exec,noacl,noatime,nodev,nosuid,nouser,rw     1 2

Looking at the calltrace it seems all processes are waiting for some I/O. I am
able to kill still running processes (before I made the attached calltrace I
killed all cron instances).

If the right process is killed the deadlock will resolv. However, it is not the
same process that needs to be killed, which is why a suspect a deadlock in the
kernel, either ext3 or scheduler (CFQ) related.

I would like to add that an open ssh session is still very responsive, so it is
not just a high load causing the server to slow done. The server gets stuck and
the deadlock never resolvs alone (at least not waiting several hours). It is
either needed to kill processes or hardreset the server (in case no ssh session
was open)

Lets look at an example of a stuck process (here rsync, which is marked as
uninterruptable (D)):

Call Trace:
 [<c042ffcb>] lock_timer_base+0x19/0x35
 [<c060bb55>] schedule_timeout+0x70/0x8f
 [<c042fd37>] process_timeout+0x0/0x5
 [<c060bb50>] schedule_timeout+0x6b/0x8f
 [<c060b67c>] io_schedule_timeout+0x39/0x5d
 [<c0465eea>] congestion_wait+0x50/0x64
 [<c0438539>] autoremove_wake_function+0x0/0x35
 [<c04620e2>] balance_dirty_pages_ratelimited_nr+0x148/0x193
 [<c045e7fd>] generic_file_buffered_write+0x4c7/0x5d3
 [<c05a5c5f>] sock_wfree+0x21/0x36
 [<c05a742f>] __kfree_skb+0xb5/0x109
 [<c042cca5>] current_fs_time+0x41/0x46
 [<c045ed89>] __generic_file_aio_write_nolock+0x480/0x4df
 [<c045ee3d>] generic_file_aio_write+0x55/0xb4
 [<f888ffb8>] ext3_file_write+0x24/0x90 [ext3]
 [<c047a31c>] do_sync_write+0xc7/0x10a
 [<c0438539>] autoremove_wake_function+0x0/0x35
 [<c047a255>] do_sync_write+0x0/0x10a
 [<c047ab6e>] vfs_write+0xa8/0x15a
 [<c047b18b>] sys_write+0x41/0x67
 [<c0404f8e>] syscall_call+0x7/0xb

Every single stuck process (look at the full attached trace) does some writing
-> ext3_file_write -> ... -> balance_dirty_pages_ratelimited_nr -> ... ->
congestion_wait -> ... -> schedule_timeout -> lock_timer_base

Looking at LKML there were some problems there, and I guess that our problem is
also running in some deadlock here. This morning I booted one server with the
anticipatory scheduler to sort our problems with CFQ.

As written I am able to present more debug information. Please tell me where to
look. When logged in I am still able to do "lsof" on all processes, and "strace"
at least on the running/waiting processes (doing strace on an uninterruptable
"D"-process is no good, having strace hanging as well and no chance to abort).

Both servers currently running the recent F7 update kernel:
Linux zoidberg.canadatux.org 2.6.22.1-27.fc7 #1 SMP Tue Jul 17 17:13:26 EDT 2007
i686 athlon i386 GNU/Linux

But the problem existed with 2.6.21 and 2.6.20 (maybe 2.6.19, not so sure) as well.

I suspect that the sysrq-call trace is the most interessting part for debugging
it down. Since it is big (500K) I will attach it as seperate file instead of
copy and pasting it here.
The kernel is not tainted and no special modules are loaded (except standard
modules as device-mapper, iptables, adm1026 and the network and disk-controller
drivers).

Comment 1 Matthias Hensler 2007-07-25 15:24:58 UTC
Created attachment 159939 [details]
Processlist showing the bug

Comment 2 Matthias Hensler 2007-07-25 15:27:55 UTC
Created attachment 159940 [details]
Calltrace from echo t >/proc/sysrq-trigger

This is the full calltrace (after killing the crond processes shown in the
processlisting) resulting from "echo t > /proc/sysrq-trigger"

Comment 3 Chuck Ebbert 2007-07-25 19:09:58 UTC
Try booting with kernel options

  nohz=off highres=off

to disable the new timer code.

Also try 2.6.22.1-33

Comment 4 Matthias Hensler 2007-07-27 07:30:25 UTC
(In reply to comment #3)
> Try booting with kernel options
> 
>   nohz=off highres=off
> 
> to disable the new timer code.
> 
> Also try 2.6.22.1-33

OK, I will do. However I do not think that it is related to the new timer code,
since the problem was already there with a lot older kernel.

In the meantime I have mounted all partitions with default values and had no
deadlock so far (have to watch it further).

Might it be possible that "noatime" is causing the problem? The CFQ scheduler
might expect more dirty pages from doing high I/O like rsync. If noatime is set
there won't be any, at least not on the partition reading from.


Comment 5 Matthias Hensler 2007-08-01 08:49:24 UTC
> OK, I will do. However I do not think that it is related to the new timer
> code, since the problem was already there with a lot older kernel.

I would like to give an update on this as I think that I found the root cause of
the problem.

So far we have not changed anything (still running 2.6.22.1-27 with the new
timer code). The only thing that was changed were the mount options.

Both systems now run stable for over a week, so the problem is not longer existent.

Old fstab:
/dev/hd2/varspoolimap               /var/spool/imap                 ext3   
async,auto,data=ordered,exec,noacl,noatime,nodev,nosuid,nouser,rw     1 2

New fstab:
/dev/hd2/varspoolimap           /var/spool/imap                 ext3    defaults
       1 2

Since "defaults" implies rw,suid,dev,exec,auto,nouser,async the only differences
here are:

data=ordered    (but that is a ext3-default, so also no difference)
noacl           (should no make a difference, SELinux is off, no security
contextes are used here)
noatime         (which I suspect is the rootcause)
nodev           (should not affect rsync)
nosuid          (likewise should not affect rsync)

I am pretty sure that mounting our imap-spool with noatime seems to cause
deadlocks on high I/O load. So there is mostlikely some bug within the kernel
and the schedular. At least since 2.6.19.

I provided the stacktrace of such a situation with the original bugreport.
Anything more I could provide to narrow the issue down?

Comment 6 Matthias Hensler 2007-08-09 10:03:16 UTC
(In reply to comment #5)
> Anything more I could provide to narrow the issue down?

In the meantime I recompiled 2.6.22.1-27.fc7 with just the twoline patch from
Richard Kennedy (http://lkml.org/lkml/2007/8/2/89) which seems to have fixed
this issue for now.

I am still watching the systems closely to be sure that the patch indeed has the
indended effect.

Regards,
Matthias

Comment 7 Matthias Hensler 2007-08-09 10:05:05 UTC
Created attachment 160963 [details]
Patch to fix the issue

Posted by Richard Kennedy on LKML: http://lkml.org/lkml/2007/8/2/89

Comment 8 Christopher Brown 2007-09-20 12:26:06 UTC
Hello,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

I am CC'ing myself to this bug and will try and assist you in resolving it if I can.

There hasn't been much activity on this bug for a while. Could you tell me if
you are still having problems with the latest kernel? Could you also test 2.6.23
from rawhide as this should have the fix indicated in comment #7.

If the problem no longer exists then please close this bug or I'll do so in a
few days if there is no additional information lodged.

Cheers
Chris

Comment 9 Matthias Hensler 2007-09-20 12:54:39 UTC
> There hasn't been much activity on this bug for a while. Could you tell me if
> you are still having problems with the latest kernel?

Yes, the bug is still present.

> Could you also test 2.6.23
> from rawhide as this should have the fix indicated in comment #7.

The last activity was my mail from august 9th to LKML (see
http://lkml.org/lkml/2007/8/9/346). There was no further answer to that.

The patch from comment #7 has resolved the issue (the affected system has now a
uptime of nearly 50 days). However, that patch is neither present upstream
(checked 2.6.23-rc7), nor in rawhide (checked kernel-2.6.23-0.189.rc6.git8.fc8).

So, the problem still exists.

Regards,
Matthias

Comment 10 Christopher Brown 2007-09-20 13:46:58 UTC
Okay, will re-assign and hopefully get this included soon. Chuck?

Comment 11 Bruno Wolff III 2007-09-22 17:34:23 UTC
This sounds very similar to a problem I have been seeing with FC5 and F7. I am
holding one machine (that I can't afford to have hangs happen on every couple of
days) at 2.6.19-1.2288.2.4.fc5 and it is working fine.

Comment 12 Christopher Brown 2007-09-23 17:16:49 UTC
(In reply to comment #11)
> This sounds very similar to a problem I have been seeing with FC5 and F7. I am
> holding one machine (that I can't afford to have hangs happen on every couple of
> days) at 2.6.19-1.2288.2.4.fc5 and it is working fine.

If you can post some details to confirm the dupe that would be helpful - similar
output to Matthias' for example.

Comment 13 Bruno Wolff III 2007-09-24 04:36:30 UTC
I submitted 235043 which refers to this issue and probably has some of what you
want.
The two systems I saw the problem on (though one I kept at FC5 and never tried
F7) have only 512MB of memory and use ext3 file systems with journal=data and
write barriers running on top of software raid 1 on pata disks.


Comment 14 Simon Klempert 2007-10-24 12:21:02 UTC
Hi, any news here? I'm running the system with the problem mentioned above
together with Matthias Hensler.

We currently using 2.6.22.1-27.fc7
(last ChangeLog entry * Tue Jul 17 2007 John W. Linville <linville>)
with patch form comment #7, noatime option enabled and haven't faced the problem
again. Is this patch or another patch trying to fix this problem included in
RawHide kernel kernel-2.6.23.1-30.fc8. If yes, we would offer to try it on our
system.

Comment 15 Chuck Ebbert 2007-10-24 20:55:34 UTC
We just go round and round on linux-kernel and nobody seems to have a safe
short-term fix. Supposedly 2.6.25 will have the real fix but it's hundreds of
lines of code changes.

Comment 16 Chuck Ebbert 2007-10-24 20:59:38 UTC
FWIW this patch should be much safer:

http://lkml.org/lkml/diff/2007/9/20/415/1


Comment 17 Simon Klempert 2007-10-30 19:55:11 UTC
Is anybody else using patch from comment #16 ?

Two days ago I've compiled
kernel-2.6.22.9-91.bz249563.fc7
including the fix from comment #16

Yesterday the server got slow and somebody from first level support rebooted the
server without further investigating the situation. Today the server got stuck a
bit (means some static html pages from webmail frontend still worked, but login
(ssh and serial) timed ou)t. Unfortunately I had no open session to investigate
it further and I had to do hard reset. For now I have reverted back to
kernel-2.6.22.1-27.bz249563.fc7 with fix from comment #7.

Comment 18 Bruno Wolff III 2007-10-31 03:59:13 UTC
I didn't try any of the patches, but I did collect some vmstat and meminfo data
and post a pointer to it on lmkl yesterday. I don't know if the right people
have looked at it, but during the thread someone mentioned they were interested
in seeing that data.

Comment 19 Bruno Wolff III 2007-11-02 18:27:47 UTC
Created attachment 246931 [details]
vmstat and memifo data from while processes were stuck

This is the mixed vmstat and meminfo data I posted a link to previously. Since
it didn't get any attention on lmkl, I wanted to put a copy somewhere where
anyone who might want to look at it later can find it.

Comment 20 Simon Klempert 2008-01-13 14:47:20 UTC
Any news regarding this problem?

I'm in the process to update to FC8, but I don't want to take the risk that I'm
stuck without a running kernel. Was this problem addressed in FC8 kernel?

Comment 21 Bruno Wolff III 2008-01-14 01:55:32 UTC
If you are already using the latest FC5 or later without problem, then you
probably won't run into what I am seeing. I have seen the problem happening
continuously since late FC5 through current F8. In fact I just rebooted the
machine where this problem is triggered to clear things up again about an hour ago.


Comment 22 Rich Rauenzahn 2008-01-15 21:52:20 UTC
I may also be having the same deadlock.  I had /var/spool/squid mounted noatime
(just changed it after reading this bug report)

Symptoms were 
   Login prompt at console lets you type something in, password prompt doesn't
come up
   System is pingable
   ssh attempts timeout

Wasn't able to do any sysrq logging as the person who resets the machine doesn't
 have enough CS background.  And I've never had an active login when it has
happened.  It happens every 3-7 days.
   
The system is an older 650MHZ PIII with RAID1 and only 512MB of RAM.  So it all
sounds very similar to reports above.  I wouldn't call my disk I/O unusually
high, though.

Currently running FC8 with 2.6.23.



Comment 23 Simon Klempert 2008-01-19 17:02:38 UTC
I'm not running plain kernel. I'm running kernel 2.6.22.1-27.bz249563.fc7
This is local compiled 2.6.22.1-27 with additional patch from comment #7

Without the patch I'm facing the problem described in this bug report. I need to
know if the problem is fixed in FC8 kernel, since in comment #17 I've tried a
updated FC7 kernel with the patch from comment #7 and faced the problem again.

Comment 24 Chuck Ebbert 2008-01-21 02:04:26 UTC
The problem should be fixed properly in 2.6.24.

Comment 25 Christopher Brown 2008-01-21 16:26:52 UTC
Setting to NEEDINFO then - could all reporters test with a 2.6.24-based kernel
and report their results. Thanks in advance.

Comment 26 Simon Klempert 2008-01-23 10:51:01 UTC
OK, will do so in a few days and post the result here.

Comment 27 Rich Rauenzahn 2008-01-23 17:02:03 UTC
I turned off noatime, rebooted, and have had the hang/lockup since then with
2.6.23 on FC8.  



Comment 28 Bruno Wolff III 2008-01-28 17:42:01 UTC
I have upgraded the machine I have been having problems on to rawhide and am
using a 2.6.24 kernel. So far the problem hasn't reoccurred, but it is still too
early for me to say the problem has been fixed.

Comment 29 Bruno Wolff III 2008-01-30 18:06:26 UTC
I have had 2.6.24-2.fc9 running continuously for 3.5 days, which considering
that I have been doing significant yum updates in that time is longer that I
would have expected to go without a lockup under the old kernels. I probably
need to go a few more days to be really sure, but if things still look good on
Friday I will try an upgrade of my system I have held on FC5 because of this
problem.

Comment 30 Bruno Wolff III 2008-02-03 16:59:32 UTC
After a week I rebooted the machine to start using a kernel update. It hadn't
made a week in a long time (about 6 months), so I think there is a very good
chance the problem is fixed in the 2.6.24 kernel.

Comment 31 Christopher Brown 2008-02-03 19:44:57 UTC
Okay, I'm closing to reflect that then. As the original reporter hasn't
commented in a while and there's plenty of evidence to believe things are
resolved. Please re-open if the problem re-occurs.

Cheers
Chris

Comment 32 Rich Rauenzahn 2008-03-20 16:08:25 UTC
I'm still getting deadlocks of some sort.  

ssh half responds (sock connects, no communication..), syslog will stop logging.
   System is FC8, with a 2.6.24 kernel (can't look at the subversion right now
since the system is locked up and at another location.)  sysrq keys work for
unmounting, killing, and rebooting, which is the SOP when the system locks up
like this, which is about once a week.

How do people get the state of the system when this happens?  

Comment 33 Chuck Ebbert 2008-03-21 00:01:21 UTC
(In reply to comment #32)
> 
> How do people get the state of the system when this happens?  

sysrq-t, but you've got to have some way to capture it.

Comment 34 Rich Rauenzahn 2008-03-21 21:14:05 UTC
So I've set up a silly scheme.. might work.. I've read that for some people, an
active login still works mostly.

So, I ssh in and while there, every 60 seconds, echo 't' to the sysrq /proc
file.  Meanwhile, I use klogctl to read from the kernel ring buffer to stdout...

Hopefully while in this state, existing sockets remain working, sshd's and
klogctl remains working...

The output gets logged on the host performing the ssh.

We'll see if it works....

If it doesn't I guess I'd need to boot with a serial port configured as console
and take a dump with a null modem and laptop? I don't see any other way.