Red Hat Bugzilla – Bug 601621
[abrt] crash in openoffice.org-calc-1:3.2.0-12.24.fc13: oslDoCopyFile->write: SIGBUS on copying from successfully mapped input file
Last modified: 2010-08-18 21:22:08 EDT
abrt 1.1.1 detected a crash.
Attached file: backtrace
cmdline: /usr/lib64/openoffice.org3/program/scalc.bin -calc /home/wittig/gleit/gleitzeit.ods
reason: Process /usr/lib64/openoffice.org3/program/scalc.bin was killed by signal 7 (SIGBUS)
release: Fedora release 13 (Goddard)
Trying to open a .ods spread sheet on a remote Windows server, filesystem mounted via CIFS.
oocalc crashes while opening the file.
If I copy that file to an ext4 filesystem on a local disc, opening works fine.
Created attachment 422136 [details]
void* pSourceFile = mmap( 0, nSourceSize, PROT_READ, MAP_SHARED, SourceFileFD, 0 );
if ( pSourceFile != MAP_FAILED )
{
    nWritten = write( DestFileFD, pSourceFile, nSourceSize ); /* crashes here */
    nRemains -= nWritten;
    munmap( (char*)pSourceFile, nSourceSize );
}
We've seen these before. mmap works, but a read from the successfully mmapped file then dies horribly.
Not exactly sure where this belongs: Samba itself, or the kernel side.
Can you provide some details about how to reproduce this? What mount options are you using on this cifs mount? Does this occur every time you try to do this?
> Can you provide some details about how to reproduce this? What mount options
> are you using on this cifs mount? Does this occur every time you try to do
> this?
Here's the relevant line from /etc/fstab:
//windowsserver/directory /localdirectory cifs rw,credentials=/protected-file,uid=wittig,gid=wittig,iocharset=iso8859-1,file_mode=0644,dir_mode=0755 0 0
And yes, I tried to open that file 10 times just for testing purposes. Every time, oocalc crashed and ABRT woke up (but I didn't let ABRT create new bugzilla tickets for these tests). Looks like the symptoms are always the same.
How to reproduce this bug?
Just invoke ooffice from the shell with the CIFS-Filename as an argument, like this:
$ ooffice /path/to/cifs-file
/usr/lib64/openoffice.org3/program/soffice: line 127: 20130 Bus error (core dumped) "$sd_prog/$sd_binary" "$@"
However, the ABRT bug report is only generated if I first open a local .ods file like this:
$ oocalc localsheet.ods
and then type Control-O and select that CIFS file in the file dialogue. Apart from this, there's no difference in the symptoms.
Btw., that bug seems to be a little older. It has been present at least since the beginning of the year under fc12 (ooo-3.1.1). I just didn't let ABRT report it. :-)
Created attachment 422166 [details]
gcc -o copydemo copydemo.c
I wonder if this testcase from a previous very similar problem helps reproduce it. i.e.
gcc -o copydemo copydemo.c
copydemo /path/to/file/on/cifs/mount /tmp/destfile
Does that work successfully?
(In reply to comment #7)
> I wonder if this testcase from a previous very similar problem helps reproduce
> it. i.e.
> gcc -o copydemo copydemo.c
> copydemo /path/to/file/on/cifs/mount /tmp/destfile
> does that work successfully ?
Yes, it works fine here with exactly the file that crashes ooo-3.2.0.
*** Bug 606323 has been marked as a duplicate of this bug. ***
*** Bug 611375 has been marked as a duplicate of this bug. ***
*** Bug 614772 has been marked as a duplicate of this bug. ***
OS Release: Fedora release 13 (Goddard)
How to reproduce
1. Double click saved impress document - Impress crashes before launched
Filed similar report earlier. Can't launch impress. Have to restart before Impress will run
(In reply to comment #12)
> Package: openoffice.org-impress-1:3.2.0-12.25.fc13
> Architecture: x86_64
> OS Release: Fedora release 13 (Goddard)
> How to reproduce
> 1. Double click saved impress document - Impress crashes before launched
> Filed similar report earlier. Can't launch impress. Have to restart before
> Impress will run
I ran yum reinstall openoffice*, with the same result. Finally I ran rm -rf ~/.openoffice, and that appears to have fixed my problem.
*** Bug 615292 has been marked as a duplicate of this bug. ***
I've tried to reproduce this but haven't been able to so far.
Could you strace the copydemo program while it's failing against one of those files? I want to verify that it's falling down on the write().
Does this fail against other files on this share or is it only particular ones?
What sort of windows server are you mounting here?
What may be helpful is to turn up cifsFYI while reproducing this. That may give me some indication of what's going wrong here. See this page for info on how to do that:
The crashes occur often but inconsistently on all files on a share. However, I never managed to get a crash with an empty .ODS file in a new and otherwise empty directory. Once a file starts to provoke a crash it typically continues to do so; however, I sometimes also observed the opposite (files that crashed before could suddenly be opened).
AFAIK it is a Windows Server 2003 share, but I could validate this if necessary.
Created attachment 434164 [details]
dmesg log of crash with CIFS debugging enabled
I don't have the permission to add it as "external bug", but I think that this here might refer to the same issue:
Yes, that's the same issue. There we're suggesting a traditional simple read/write loop instead of the cunning single write() whose arguments are the file length and the pointer returned from a successful mmap of the input file. Even if that approach is taken, the (apparent) bug here would remain, though it would no longer affect OOo.
fs/cifs/file.c: CIFS VFS: leaving cifs_open (xid = 2057739) rc = 0
fs/cifs/file.c: CIFS VFS: in cifs_file_mmap as Xid: 2057741 with uid: 500
fs/cifs/file.c: CIFS VFS: leaving cifs_file_mmap (xid = 2057741) rc = 0
fs/cifs/file.c: CIFS VFS: in cifs_readpage as Xid: 2057742 with uid: 500
fs/cifs/file.c: readpage ffffea00001e9d60 at offset 16384 0x4000
fs/cifs/file.c: CIFS VFS: in cifs_read as Xid: 2057743 with uid: 500
fs/cifs/cifssmb.c: Reading 4096 bytes on fid 32873
fs/cifs/transport.c: For smb_command 46
fs/cifs/transport.c: Sending smb: total_len 63
fs/cifs/connect.c: rfc1002 length 0x27
Status code returned 0xc0000054 NT_STATUS_FILE_LOCK_CONFLICT
fs/cifs/netmisc.c: Mapping smb error code 33 to POSIX err -13
fs/cifs/misc.c: Null buffer passed to cifs_small_buf_release
CIFS VFS: Send error in read = -13
fs/cifs/file.c: CIFS VFS: leaving cifs_read (xid = 2057743) rc = -13
fs/cifs/file.c: CIFS VFS: leaving cifs_readpage (xid = 2057742) rc = -13
I suspect the problem is above. cifs uses the generic mmap routines, and those just call down to the filesystem for reads when pages need to be faulted in. When there's an error on read, I believe the kernel will send a SIGBUS.
I'm not sure what we can really do in this situation. We have no choice but to return an error when the read can't be satisfied. We also have no good way to know that any or part of the range is locked at mmap time. Even if we could, we have no way to prevent someone from placing locks on the file later (aside from placing a lock ourselves which could cause other situations to fail).
I don't think changing the code to use a traditional read/write copy in a loop would really help much here. It would prevent the SIGBUS, but it'll probably error out too.
The question here is whether the file is locked via a different filehandle on the same client that's trying to do the mmaped read here, or if it's locked by a different client or on the server itself. I don't see any evidence of locking calls in the dmesg output, but I can't rule that out.
If the file is being locked by this particular client, then one possible workaround is to just skip sending lock calls to the server at all by mounting the share with '-o nolock'. That won't help however if the file is locked by a different client entirely.
*** Bug 619606 has been marked as a duplicate of this bug. ***
OS Release: Fedora release 13 (Goddard)
How to reproduce
1. open ods on a samba share
As I said before, I don't think there's much we can reasonably do here. A SIGBUS is just what the kernel will send when there's an error faulting in a page, and we have no choice but to return an error when there's an error reading in the page.
There are a couple of things that OO could do to mitigate this. One is to switch to a more typical read/write loop and handle errors appropriately on read. Another would be to place a lock on the file prior to reading from the mmap and have the program deal with lock conflicts.
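To make the second suggestion a bit more concrete, a rough sketch of taking a lock before reading might look like the following. It uses plain POSIX fcntl() locks; whether such a request is actually forwarded to the Windows server depends on the cifs mount options, which is not verified here. The names try_read_lock and unlock_file are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to take a shared (read) lock on the whole file without blocking.
 * Returns 0 on success, 1 if a conflicting lock is held elsewhere,
 * -1 on any other error (errno is set by fcntl). */
int try_read_lock(int fd)
{
    assert(fd >= 0);            /* caller must pass an open descriptor */

    struct flock fl = {0};
    fl.l_type = F_RDLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "up to end of file" */

    if (fcntl(fd, F_SETLK, &fl) == 0)
        return 0;
    if (errno == EACCES || errno == EAGAIN)
        return 1;               /* lock conflict: let the program handle it */
    return -1;
}

/* Release a lock taken with try_read_lock. Returns 0 on success. */
int unlock_file(int fd)
{
    struct flock fl = {0};
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;

    return fcntl(fd, F_SETLK, &fl) == 0 ? 0 : -1;
}
```

The idea would be for the application to call try_read_lock() before touching the mapping and to report a conflict to the user instead of proceeding, then call unlock_file() once the copy is done.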
At this point, I'm going to set this back to an oo bug, but I'll stay on the cc list and can try to assist in coming up with a solution.
Ok, I do understand that this is not a kernel issue. However, I'm still puzzled whether this is an OO problem (using functions without doing the necessary checks before) or a Samba problem (not implementing these functions correctly). If there is any additional information that I could provide to sort this out, please let me know.
I really don't have any opinions on the responsibilities for this issue, but I'm growing desperate to see it getting fixed. For me this thing has become a real issue, causing users to keep local copies of shared documents (since they cannot work with the ones on the SMB share anymore) leading to more and more consistency/versioning problems. OpenOffice (see Issue 106591) says that they could just provide a "workaround" for someone else's problem and pushed the target to 3.4 (which is another 6 months).
Well, I wouldn't say "it's not a kernel issue". The situation is a bit more subtle...
The issue is really that Windows isn't 100% POSIX compliant, particularly not when it comes to file locking. Windows locking is always mandatory -- if you lock a file for write, then another thread won't be able to read that file until it's unlocked. This is in contrast to Linux and other unix-y OSes, which mostly implement advisory locks. Yes, it's possible to do mandatory locking on Linux too, but it's fairly uncommon.
So, the problem arises there -- at least in the case of the cifsFYI info above. The file is locked by something (another process? another client?) so when we go to read it, the server returns an error. CIFS has no choice but to return an error. Ok, that's not 100% true -- we could block indefinitely and retry the read, but that'll cause other problems. Returning an error is the best we can do, I think.
In any case...the issue is really that the core CIFS protocol isn't and can never be 100% POSIX compliant. We do the best we can, but we're really constrained by the protocol and server implementations.
I think it's prudent to have OO avoid using mmap for this. It's not really needed if all they're doing is copying the file, and avoiding it will avoid a SIGBUS on a read error. If the problems are all related to file locking, that'll probably just trade the SIGBUS for an application level error, but I think that's still preferable.
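As a sketch of what that could look like (not the actual OOo patch, just a plain illustration under my own naming), a classic read/write copy loop in C turns a failed read into an ordinary error return instead of a signal. The function name copy_fd and the buffer size are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Copy everything from src_fd to dst_fd using read()/write().
 * Returns 0 on success, -1 on error with errno set by the failing call. */
int copy_fd(int src_fd, int dst_fd)
{
    char buf[65536];            /* arbitrary buffer size */

    for (;;) {
        ssize_t n = read(src_fd, buf, sizeof buf);
        if (n == 0)
            return 0;           /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry the read */
            return -1;          /* e.g. EACCES from a lock conflict */
        }

        /* write() may accept fewer bytes than requested, so loop. */
        char *p = buf;
        while (n > 0) {
            ssize_t w = write(dst_fd, p, (size_t)n);
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            assert(w <= n);     /* write never reports more than asked */
            p += w;
            n -= w;
        }
    }
}
```

With this shape, a read error like the -13 in the cifsFYI log above would presumably surface as copy_fd() returning -1 with errno set to EACCES, which the application can report to the user.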
I'll roll out a classic read/write update soonish, maybe Monday. Though in my naive worldview I'd prefer that mmap on a cifs filesystem simply failed rather than give me a result which can't be relied upon.
openoffice.org-3.2.0-12.29.fc13 has been submitted as an update for Fedora 13.
openoffice.org-3.3.0-3.2.fc14 has been submitted as an update for Fedora 14.
openoffice.org-3.2.0-12.29.fc13 has been pushed to the Fedora 13 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
su -c 'yum --enablerepo=updates-testing update openoffice.org'
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/openoffice.org-3.2.0-12.29.fc13
openoffice.org-3.2.0-12.29.fc13 has been pushed to the Fedora 13 stable repository. If problems still persist, please make note of it in this bug report.
openoffice.org-3.3.0-3.2.fc14 has been pushed to the Fedora 14 stable repository. If problems still persist, please make note of it in this bug report.