Description of problem: Hi. This has been happening for quite a while, but I found some meaningful info (hopefully) just recently. The problem seems quite strange. I keep my thunmerbird mailboxes on an ntfs partition, so that I can share the emails easily when I dual boot. There are 2 POP3 and one IMAP accounts set up. Sometimes when thunderbird finishes fetching emails from either of POP3 ones, cpu usage goes 100 % and top shows it's mount.ntfs-3g. A quick look into /var/log/messages reveals the following: Oct 9 18:25:56 localhost ntfs-3g[1601]: ntfs_attr_pread: na=0x66c180 b=0x66d160 pos=2138112 count=-1023473: Bad argument Oct 9 18:26:38 localhost ntfs-3g[1601]:last message repeated 113551 times That's more than a hundred thousand times in less than a minute. Nice. Version-Release number of selected component (if applicable): 2:1.1004-1.fc8 How reproducible: most of the times Steps to Reproduce: 1. set up your thunderbird mail storage folders on an ntfs partition 2. fetch emails 3. wait until it breaks Actual results: cpu usage goes sky high Expected results: cpu usage stays normal Additional info: This has been working OK for ages with vfat, and since I bought this laptop in January with ntfs-3g. The breakage began roughly about the same time when I started to use the rawhide kernel with f7. I thought it's just a peculiar setup, but now I'm rawhide-only and the issue persists.
Did you convert VFAT to NTFS? It sometimes corrupts the filesystem. Please run 'chkdsk DRIVE: /f' on Windows and reboot into Windows twice. Can you still reproduce the problem on Linux afterwards? If yes then please send your /var/log/messages* files here or to szaka. Thanks.
This laptop has been ntfs-3g from the beginning. IIRC I did run chkdsk, but I'll do that again just to be sure.
I ran chkdsk and then rebooted into linux. The thunderbird issue persisted. I'm attaching the logs for further investigation. Watch out, they are *huge* when unzipped.
Created attachment 224991 [details] /var/log/messages
Created attachment 225001 [details] /var/log/messages-20071007
Created attachment 225011 [details] /var/log/messages.1
Created attachment 225021 [details] /var/log/messages.2
Created attachment 225031 [details] /var/log/messages.3
Created attachment 225041 [details] /var/log/messages.4
Thanks a lot. Can you compile/install the driver if I send you patches?
Sure. I'll just add them to rpm.
Could you please do an strace -p <PID_of_thunderbird> what it is doing? If it infinitely passes the wrong read(2) parameters to the kernel and ntfs-3g returns the valid EINVAL error, what your logs indicate, then this is a thunderbird bug. It could be also 64-bit related rawhide kernel or FUSE problem. I couldn't find anyhing related to your unique problem in ntfs-3g yet which could result falling into an infinite loop internally. If it's not a thunderbird bug then could you please run 'ltrace -p <pid_of_ntfs-3g>' when thunderbird is hanging and send the output? Thanks.
It can be a kernel problem - as I said before, it started when I installed rawhide kernel under Fedora 7. It was pre-2.6.23, while F-7 2.6.22 worked fine. I'll strace/ptrace it as soon as it happens again.
Then it indeed looks to be a pre-2.6.23 kernel problem (I remember some issues, but they are not related to your problem). So, the problem is not reproducible? Please also note, that I have asked the outputs of strace and ltrace. The later tracks the ntfs-3g library calls.
strace and ltrace, of course. That was my mistake. As for reproducibility, it does not happen every time, but quite often.
Btw, I couldn't find the ntfs-3g version and mount options in the logs which are always logged. The logs are continuous, the info is still not there which means that apparently there is some reliability problem with the rawhide system logging as well.
strace of thunderbird-bin process spams the console with the following: read(43, 0x1408ac0, 1024) = -1 EINVAL (Invalid argument) ltrace of ntfs-3g process throws out loads of these: ntfs_attr_open(0x64f940, 128, 0x60964c, 0, 0x65f4f0) = 0x65f4f0 ntfs_attr_pread(0x65f4f0, 880640, -419862, 0x64e930, 0x65f900 <unfinished ...> ntfs_log_handler_syslog(0x3ad641fcf0, 0x3ad641eb2c, 757, 256, 0) = 82 <... ntfs_attr_pread resumed> ) = -1 __errno_location() = 0x2aaaaaad1348 ntfs_attr_close(0x65f4f0, 0x65f7b0, -1, 0x65f7a0, 0x65f7b0) = 56097 ntfs_inode_close(0x64f940, 0x65f4f0, 0x65f540, 0x65f4e0, 0x65f4f0) = 0 free(0x64e3e0) = <void> __strdup(0x613140, 0x7fff47846fc8, 0x7fff47846fc0, 880640, 0x7fff478470c0) = 0x64e3e0 ntfs_pathname_to_inode(0x60e3e0, 0, 0x64e3e0, 0x786f626e492f6d, 0x542f61747a636f70) = 0x64f940 I mount ntfs-3g with default options.
Thanks. Could you please send more info from the context? At least 100 lines in both cases unless they are absolutely indentical. i.e. any single character. So far I can see that there seems to be a problem at least in thunderbird which apparently repeats the failing call instead of reporting the error to the user. There may be further problems in the kernel and I also found a problem in ntfs-3g which may not be the root cause of the error but it may flood the log needlessly. Soon I'm sending you an ntfs-3g patch, so we can get more detail about what's really going. Thanks again.
OK. IIRC the lines were exactly the same, but I'll redirect the output to the text file next time just to be sure. Will report back ASAP.
CCing thunderbird owner.
The ntfs-3g patch is at http://ntfs-3g.org/download/bogus-offset-read-fix.diff It seems ntfs-3g is called with wrong parameters (kernel or thunderbird problem) what the driver can't fix but the log flooding shouldn't happen anymore unless there is a real ntfs-3g problem as well. Please apply it to ntfs-3g and test the problem. All recent ntfs-3g version should be ok. Thanks.
Will try the patch in a moment. In the meantime, attaching more strace/ltrace lines. Looks like these are really all the same.
Created attachment 226621 [details] strace of thunderbird process
Created attachment 226631 [details] ltrace of ntfs-3g process
Looks like the flood problem is indeed fixed, I remounted the volume, ran into the issue and nothing at all got written to /var/log/messages. The CPU hogging problem persists, though.
Thunderbird wants to read over the end of a file and never gives up. It's not clear if this is a pure thunderbird or kernel problem too. Ntfs-3g logged this error which caused the log flooding. The log flooding shouldn't happen anymore with the patch, so your problem will be a thunderbird hang what the thunderbird should be able to fix by just stopping readin when it doesn't makes sense anymore. It's also possible that the kernel missed to send a setattr() to ntfs-3g and that's why the files size is not the one thunderbird expectes. There is indeed kernel work recently on this area, so maybe a kernel bug was indeed introduced.
Let's CC the kernel folks as well then
Hi Julian, Could you please reproduce the problem again with this ntfs-3g debug patch to get more information about the nature of the problem? http://ntfs-3g.org/download/bogus-offset-read-debug.diff This will flood again the log but with useful information. Please do these steps: apply the ntfs-3g patch mount -t ntfs-3g -o debug <device> <mountpoint> &> fuse.log reproduce the problem umount <mountpoint> Then please send the fuse.log file and the ntfs-3g outputs from /var/log/messages. One line is enough in the latter case if they are the same, probably that's the case. Then you can revert to the old ntfs-3g patch, which will be included in the next ntfs-3g release, so logs don't get flooded in case of similar softwares errors. Thanks.
Created attachment 228341 [details] fuse.log file This time nothing got written to /var/log/messages, fuse.log was flooded instead. Attached. It's approx 25 MiB unzipped, I left it baking for a bit too long.
If nothing was written to /var/log/messages then the old patch was applied, or not the patched driver was installed or run. The output from /var/log/messages is crucial. Thanks.
Sorry, yesterday I was too tired and now still sleeping. You did everything right, the ntfs-3g log also goes to fuse.log when the debug mount option is used. I have all the needed info. What is the size of /Julian/poczta/Thunderbird/pop3.poczta.onet.pl/Inbox? Thunderbird tries to read at offset 712704 but it's size is only 364696.
Could you please also start 'strace -f -p <pid_of_thunderbird> -o thunderbird.log' before the problem happens and send the log? This way we maybe able to confirm that all the problems are in thunderbird or there is an additional one in the kernel which is probably in FUSE if no one can reproduce it on any other file systems. What is your exact kernel version? Any pointer to somewhere what patches the kernel exactly has?
The size is 364696 bytes, exactly as you reported. The kernel I am running currently is 2.6.23-5.fc8.x86_64. For the patches it uses, take a look into fedora cvs [1]. Look in the spec file to check which ones are actually applied. I'll try to get the full strace as soon as possible. [1] http://cvs.fedoraproject.org/viewcvs/rpms/kernel/devel/
One thing came to my mind: the size of the file is is 364696 bytes, but that's without a couple of emails that were received and then deleted during the bug capture. So maybe it is some kind of race condition, which tries to read content from disk before actually writing it? This could explain the non-reproducibility.
More: after receiving these emails Inbox should be bigger than 364696.
The Inbox size is always under 400 kB. The reads are behind 700 kB. Apparently thunderbird does this fd = open(file, ...) read(fd, buf, bufsize); which should arrive to ntfs-3 as READ bufsize at offset 0 but it arrives as READ bufsize at offset 712704. So, something corrupts or doesn't reset the ofset value. It could be a thinderbird multi-threading issue, or kernel, or FUSE. I couldn't find anything in the log which would have ordered ntfs-3g to increase the file size to above 712704.
Julian, could you please change the Component field in the bug report from ntfs-3g to thunderbird? I can't find anything which would indicate that this would be an ntfs-3g problem. There is at least one thunderbird problem (no error handling) and there is another one somewhere (file descriptor offset corruption).
OK, I managed to obtain full strace of thunderbird.
Created attachment 228651 [details] Full thunderbird strace, from start to lockup
Looks like the whence argument of the second lseek() below was supposed to be SEEK_SET, not SEEK_CUR. This seems to be a pure thunderbird problem. Thanks a lot Julian. 3022 open("/mnt/windows/Julian/poczta/Thunderbird/pop.gmail.com/Inbox", O_RDONLY) = 43 3022 lseek(43, 229086, SEEK_CUR) = 229086 3022 read(43, "From - Tue Oct 16 13:58:27 2007\n"..., 1024) = 1024 3022 read(43, "108.195.44 as permitted sender) "..., 1024) = 1024 3022 lseek(43, 231133, SEEK_CUR) = 462267 3022 read(43, 0x147c380, 1024) = -1 EINVAL (Invalid argument) 3022 read(43, 0x147c380, 1024) = -1 EINVAL (Invalid argument) 3022 read(43, 0x147c380, 1024) = -1 EINVAL (Invalid argument) 3022 read(43, 0x147c380, 1024) = -1 EINVAL (Invalid argument) 3022 read(43, 0x147c380, 1024) = -1 EINVAL (Invalid argument) 3022 read(43, 0x147c380, 1024) = -1 EINVAL (Invalid argument) 3022 read(43, 0x147c380, 1024) = -1 EINVAL (Invalid argument) 3022 read(43, 0x147c380, 1024) = -1 EINVAL (Invalid argument) [ ... forever ... ]
(In reply to comment #40) > Looks like the whence argument of the second lseek() below was supposed to be > SEEK_SET, not SEEK_CUR. This seems to be a pure thunderbird problem. Thanks a > lot Julian. > No problem. I am interested in a fix too, after all :)
Not to be too pedantic about it, but why is this only seen on the ntfs-3g filesystem? (or is it?)
I keep my mailbox on an ntfs partition. I could move it to ext3 and check, but is that really necessary?
Here is another recent thunderbird hang: https://bugzilla.redhat.com/show_bug.cgi?id=302791 Moreover thunderbird may do some wrong assumption and behaves differently as usual. There were plenty such problems in the past for example made by Firefox, etc. This can be especially true if they still share the broken code.
Not necessary I suppose, was just curious. It would possibly help absolutely verify that this is a pure tbird bug, not related to the filesystem. But, up to you :)
(In reply to comment #45) > Not necessary I suppose, was just curious. It would possibly help absolutely > verify that this is a pure tbird bug, not related to the filesystem. But, up to > you :) It's pretty clear from the logs that this is a pure thunderbird bug.
Given the peculiarity of the issue, I'd prefer to leave the things as they are for now. I wouldn't like lose the ability to reproduce the behaviour. It's tough enough already. Let's see what callion will say.
The problem is getting reported from non-rawhide distros too. Please fix thunderburd, it should be very simple, it was pointed out exactly where the problem is in comment #40. Thanks.
Any updates on this?
Thanks to Tullio Andreatta for pointing out that ntfs-3g shouldn't return EINVAL but 0 in such cases (the core of the driver is highly paranoid to catch corruptions and other errors but this case should be indeed handled differently in an upper layer). So we have several problems in both softwares. I'll post an ntfs-3g patch for testing ASAP.
Well, this seems to be a kernel issue too in FUSE. Apparently the problem in ntfs-3g was hidden until recently because the FUSE kernel module handled these cases correctly (it's not really true that FUSE drivers are user space drivers because often the syscalls are done purely by the FUSE kernel module -- that's also why none can see any difference sometimes in performance compared to in-kernel drivers). But this may not be true anymore for this case, though I suspect thunderbird multithreading may have something to do with the issue. I'll discuss the issue with Miklos Szeredi.
The ntfs-3g fix is against release 1.1030: http://ntfs-3g.org/download/read-over-file-end-fix.diff Please let me know how it works. Thanks.
Hi Julian, I'm intereted in pursuing the kernel side of this bug, if you can help with that. Two things that may shed some light on the issue: - trying a kernel based on 2.6.24-rc1, which has a fuse bug fixed, which might be related to this - doing a 'strace -f -v -x -s 256 ...' on the ntfs-3g process You should obviously do this with an unpatched ntfs-3g, so hitting the bug is visible. Thanks
Are you saying that Szaka's patch only hides the bug? I don't want to get overly enthusiastic, but since I installed it a few hours ago thunderbird never locked up. As for the 2.6.24-based kernel, I'll have to wait until it hits rawhide. As for the strace, I can downgrade ntfs-3g and get it.
Yes, ntfs-3g should never have got a request that was over the file size. This can only happen if ntfs-3g is reporting the file size incorrectly to fuse, or if there's some bug in fuse. There was a related bug that could make fuse beleive the file size is shorter than it actually is, hence preveting reads up to the end of file. So I'm looking for something similar.
Created attachment 245801 [details] strace of mount.ntfs-3g That's just the lockup. If you need more context, let me know.
> Yes, ntfs-3g should never have got a request that was over the file size. > This can only happen if ntfs-3g is reporting the file size incorrectly to > fuse, or if there's some bug in fuse. The problem started with kernel upgrades and if ntfs-3g reported wrong sizes then people would report it with older kernels too, I guess. > There was a related bug that could make fuse beleive the file size is > shorter than it actually is, hence preveting reads up to the end of file. > So I'm looking for something similar. For me it seems that FUSE thinks that the file is actually larger than it is in real. Maybe because of the 2nd lseek. I don't know, I can't reproduce the problem with my kernels. Btw Julian, could you try the FUSE kernel module from the FUSE package, not the one in the Fedora kernel?
Thanks, more context would be useful. If you start 'less strace-mount.ntfs-3g.log' and search for the string 'writev' then you will notice when the pattern starts to repeat. That's where the lockup starts, and basically the interesting part is before that.
The Gentoo kernel having the same problem is 2.6.23-rc8-mm2.
Created attachment 245901 [details] strace of mount.ntfs-3g This one should have more info as I attached to the process before starting thunderbird.
Created attachment 246861 [details] debug patch for fuse I didn't find the reason for the bug in the traces. So as a next step, we need to look for the problem inside the kernel. Attaching a patch that adds some debugging code to the fuse kernel module. Can you please try this? The simplest is probably to grab the fuse source package from http://switch.dl.sourceforge.net/sourceforge/fuse/fuse-2.7.1.tar.gz after unpacking apply the patch: patch fuse-2.7.1/kernel/file.c < fuse_read_debug.patch compile the kernel module: cd fuse-2.7.1/kernel; ./configure --enable-kernel-module; make then replace the kernel module: rmmod fuse; insmod fuse-2.7.1/kernel/fuse.ko You will probably need to install a kernel-source or kernel-headers package or something similar, before compiling the module. The debug output should be in /var/log/messages, and I only need the first stack trace after the hang (if there is a stack trace at all), as all of them will probably be the same. Thanks.
Nov 8 21:15:20 localhost ntfs-3g[21577]: Version 1.1030 Nov 8 21:15:20 localhost ntfs-3g[21577]: Mounted /dev/sda1 (Read-Write, label "windows", NTFS 3.1) Nov 8 21:15:20 localhost ntfs-3g[21577]: Cmdline options: rw Nov 8 21:15:20 localhost ntfs-3g[21577]: Mount options: noatime,rw,silent,allow_other,nonempty,fsname=/dev/sda1,blkdev,blksize=4096 Nov 8 21:15:23 localhost kernel: fuse: read beyond EOF Nov 8 21:15:23 localhost kernel: Nov 8 21:15:23 localhost kernel: Call Trace: Nov 8 21:15:23 localhost kernel: [<ffffffff88992a26>] :fuse:fuse_read_fill+0x58/0xb0 Nov 8 21:15:23 localhost kernel: [<ffffffff889920ea>] :fuse:fuse_readdir+0xbf/0x1df Nov 8 21:15:23 localhost kernel: [<ffffffff810a769c>] filldir+0x0/0xb7 Nov 8 21:15:23 localhost kernel: [<ffffffff810a769c>] filldir+0x0/0xb7 Nov 8 21:15:23 localhost kernel: [<ffffffff810a77ca>] vfs_readdir+0x77/0xa9 Nov 8 21:15:23 localhost kernel: [<ffffffff810a7a1c>] sys_getdents+0x75/0xbd Nov 8 21:15:23 localhost kernel: [<ffffffff8100bce1>] tracesys+0x71/0xda Nov 8 21:15:23 localhost kernel: [<ffffffff8100bd45>] tracesys+0xd5/0xda Nov 8 21:15:23 localhost kernel: Nov 8 21:15:24 localhost kernel: fuse: read beyond EOF Nov 8 21:15:24 localhost kernel: Nov 8 21:15:24 localhost kernel: Call Trace: Nov 8 21:15:24 localhost kernel: [<ffffffff88992a26>] :fuse:fuse_read_fill+0x58/0xb0 Nov 8 21:15:24 localhost kernel: [<ffffffff88992b32>] :fuse:fuse_send_read+0x23/0x39 Nov 8 21:15:24 localhost kernel: [<ffffffff889935a0>] :fuse:fuse_readpage+0x86/0xc3 Nov 8 21:15:24 localhost kernel: [<ffffffff810732c0>] do_generic_mapping_read+0x250/0x3f7 Nov 8 21:15:24 localhost kernel: [<ffffffff81072956>] file_read_actor+0x0/0x18a Nov 8 21:15:24 localhost kernel: [<ffffffff81074972>] generic_file_aio_read+0x11d/0x160 Nov 8 21:15:24 localhost kernel: [<ffffffff8109b70c>] do_sync_read+0xc9/0x10c Nov 8 21:15:24 localhost kernel: [<ffffffff889931a4>] :fuse:fuse_open_common+0x141/0x152 Nov 8 21:15:24 localhost kernel: [<ffffffff810493c1>] autoremove_wake_function+0x0/0x2e Nov 8 21:15:24 localhost kernel: [<ffffffff8109a26d>] do_filp_open+0x2a/0x38 Nov 8 21:15:24 localhost kernel: [<ffffffff810f7924>] selinux_file_permission+0x57/0x117 Nov 8 21:15:24 localhost kernel: [<ffffffff8109bf83>] vfs_read+0xcb/0x173 Nov 8 21:15:24 localhost kernel: [<ffffffff8109c364>] sys_read+0x45/0x6e Nov 8 21:15:24 localhost kernel: [<ffffffff8100bd45>] tracesys+0xd5/0xda Nov 8 21:15:24 localhost kernel: Then the second trace repeats 2 more times.
Found the culprit, it's this commit, which moves the i_size check after ->readpage(): http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a32ea1e1f925399e0d81ca3f7394a44a6dafa12c It can be fixed in fuse by checking the size in ->readpage(). I'll submit a fix. Maybe this also qualifies for -stable. I think this bug can now be closed. Thanks for the help.
After more thought fuse should not be changed, except to document that a read may be performed beyond EOF, and the fs should return zero in that case. Such reads could happen anyway (even with 2.6.22 or earlier kernels), if a truncate races with a read, since these are not serialized. As an optimization, the check for read beyond EOF could be added, but that may just hide bugs, so I think it's better to let the filesystem always deal with this situation.
Same scenario same result. I've bitten by this bug.
(In reply to comment #65) > Same scenario same result. I've bitten by this bug. The file system part of the problem was fixed in NTFS-3G 1.1104 release: http://ntfs-3g.org/releases.html FreeBSD had the same issue and they also confirmed that the fix works. The current Fedora NTFS-3G package is newer, 1.1120. I suggest, just update. Thank you Miklos for the great analyzes.
Well, not exactly the same scenario. Fedora 8 -stable. kernel-2.6.23.1-49.fc8.i686.rpm ntfs-3g-1.1030-1.fc8.i386.rpm thunderbird-2.0.0.9-1.fc8.i386.rpm Update to (test) ntfs-3g-1.1104-1.fc8.i386.rpm Seems ok. No hang in several tries.