601621 – [abrt] crash in openoffice.org-calc-1:3.2.0-12.24.fc13: oslDoCopyFile->write: SIGBUS on copying from successfully mapped input file

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	openoffice.org
Sub Component:
Version:	13
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Caolan McNamara
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:	abrt_hash:a7f25fe94a1a8bb06e3a3821c7d...
Duplicates (5):	606323 611375 614772 615292 619606
Depends On:
Blocks:

Reported:	2010-06-08 10:42 UTC by Georg Wittig
Modified:	2010-08-19 01:22 UTC
CC List:	16 users
Fixed In Version:	openoffice.org-3.3.0-3.2.fc14
Clone Of:
Environment:
Last Closed:	2010-08-13 21:24:18 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
File: backtrace (71.91 KB, text/plain) 2010-06-08 10:42 UTC, Georg Wittig
gcc -o copydemo copydemo.c (2.91 KB, text/plain) 2010-06-08 12:32 UTC, Caolan McNamara
dmesg log of crash with CIFS debugging enabled (42.40 KB, text/plain) 2010-07-24 16:27 UTC, Patrick Oltmann

Links
System	ID	Private	Priority	Status	Summary	Last Updated
OpenOffice.org	106591	0	None	None	None	Never

Description Georg Wittig 2010-06-08 10:42:46 UTC

abrt 1.1.1 detected a crash.

architecture: x86_64
Attached file: backtrace
cmdline: /usr/lib64/openoffice.org3/program/scalc.bin -calc /home/wittig/gleit/gleitzeit.ods
component: openoffice.org
crash_function: write
executable: /usr/lib64/openoffice.org3/program/scalc.bin
global_uuid: a7f25fe94a1a8bb06e3a3821c7d9edf88f5b7dcd
kernel: 2.6.33.5-112.fc13.x86_64
package: openoffice.org-calc-1:3.2.0-12.24.fc13
rating: 4
reason: Process /usr/lib64/openoffice.org3/program/scalc.bin was killed by signal 7 (SIGBUS)
release: Fedora release 13 (Goddard)

comment
-----
Trying to open a .ods spreadsheet on a remote Windows server, on a filesystem mounted via CIFS.
oocalc crashes while opening the file.

If I copy that file to an ext4 filesystem on a local disc, opening works fine.

Comment 1 Georg Wittig 2010-06-08 10:42:48 UTC

Created attachment 422136
File: backtrace

Comment 2 Caolan McNamara 2010-06-08 10:56:44 UTC

    void* pSourceFile = mmap( 0, nSourceSize, PROT_READ, MAP_SHARED, SourceFileFD, 0 );
    if ( pSourceFile != MAP_FAILED )
    {
        nWritten = write( DestFileFD, pSourceFile, nSourceSize ); /*here*/
        nRemains -= nWritten;
        munmap( (char*)pSourceFile, nSourceSize );
    }

We've seen these before. mmap works, but a read from the successfully mmaped file then dies horribly.

Not exactly sure where this belongs: Samba itself or the kernel side.

Comment 3 Jeff Layton 2010-06-08 11:24:34 UTC

Probably kernel...

Comment 4 Jeff Layton 2010-06-08 11:34:50 UTC

Can you provide some details about how to reproduce this? What mount options are you using on this cifs mount? Does this occur every time you try to do this?

Comment 5 Georg Wittig 2010-06-08 12:11:11 UTC

> Can you provide some details about how to reproduce this? What mount options
> are you using on this cifs mount? Does this occur every time you try to do
> this?    


Here's the relevant line from /etc/fstab:

//windowsserver/directory /localdirectory cifs rw,credentials=/protected-file,uid=wittig,gid=wittig,iocharset=iso8859-1,file_mode=0644,dir_mode=0755 0 0

And yes, I tried to open that file 10 times just for testing purposes. Every time oocalc crashed and ABRT woke up (but I didn't let ABRT create new bugzilla tickets for these tests). Looks like the symptoms are always the same.

How to reproduce this bug?
Just invoke ooffice from the shell with the CIFS filename as an argument, like this:

$ ooffice /path/to/cifs-file
/usr/lib64/openoffice.org3/program/soffice: line 127: 20130 Bus error               (core dumped) "$sd_prog/$sd_binary" "$@"
$ 

However, the ABRT bug report is only generated if I first open a local .ods file like this:

$ oocalc localsheet.ods

and then type Control-O and select that CIFS file in the file dialogue. Apart from this, there's no difference in the symptoms.

Comment 6 Georg Wittig 2010-06-08 12:20:10 UTC

Btw., that bug seems to be a little older. It has been present at least since the beginning of the year under fc12 (ooo-3.1.1). I just didn't let ABRT report it.  :-)

Comment 7 Caolan McNamara 2010-06-08 12:32:07 UTC

Created attachment 422166
gcc -o copydemo copydemo.c

I wonder if this testcase from a previous very similar problem helps reproduce it. i.e. 

gcc -o copydemo copydemo.c

copydemo /path/to/file/on/cifs/mount /tmp/destfile

does that work successfully ?

Comment 8 Georg Wittig 2010-06-08 12:43:53 UTC

(In reply to comment #7)

> I wonder if this testcase from a previous very similar problem helps reproduce
> it. i.e. 
> 
> gcc -o copydemo copydemo.c
> 
> copydemo /path/to/file/on/cifs/mount /tmp/destfile
> 
> does that work successfully ?    


Yes, it works fine here with exactly the file that crashes ooo-3.2.0.

Comment 9 Caolan McNamara 2010-06-21 12:59:34 UTC

*** Bug 606323 has been marked as a duplicate of this bug. ***

Comment 10 Caolan McNamara 2010-07-05 08:01:24 UTC

*** Bug 611375 has been marked as a duplicate of this bug. ***

Comment 11 Caolan McNamara 2010-07-15 08:58:17 UTC

*** Bug 614772 has been marked as a duplicate of this bug. ***

Comment 12 Ian Roberts 2010-07-15 09:05:33 UTC

Package: openoffice.org-impress-1:3.2.0-12.25.fc13
Architecture: x86_64
OS Release: Fedora release 13 (Goddard)


How to reproduce
-----
1. Double click saved impress document - Impress crashes before launched


Comment
-----
Filed similar report earlier. Can't launch impress.  Have to restart before Impress will run

Comment 13 Ian Roberts 2010-07-15 09:55:22 UTC

(In reply to comment #12)
> Package: openoffice.org-impress-1:3.2.0-12.25.fc13
> Architecture: x86_64
> OS Release: Fedora release 13 (Goddard)
> 
> 
> How to reproduce
> -----
> 1. Double click saved impress document - Impress crashes before launched
> 
> 
> Comment
> -----
> Filed similar report earlier. Can't launch impress.  Have to restart before
> Impress will run    

Update:

I ran yum reinstall openoffice*, with the same result. Finally I ran rm -rf ~/.openoffice, and that appears to have fixed my problem.

Comment 14 Caolan McNamara 2010-07-16 12:13:47 UTC

*** Bug 615292 has been marked as a duplicate of this bug. ***

Comment 15 Jeff Layton 2010-07-24 10:26:21 UTC

I've tried to reproduce this but haven't been able to so far.

Some questions:

Could you strace the copydemo program while it's failing against one of those files? I want to verify that it's falling down on the write().

Does this fail against other files on this share or is it only particular ones?

What sort of Windows server are you mounting here?

What may be helpful is to turn up cifsFYI while reproducing this. That may give me some indication of what's going wrong here. See this page for info on how to do that:

    http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting

Comment 16 Patrick Oltmann 2010-07-24 16:22:54 UTC

The crashes occur often but inconsistently on all files on a share. However, I never managed to get a crash with an empty .ODS file in a new and otherwise empty directory. Once a file starts to provoke a crash it typically continues to do so, although I have sometimes observed the opposite (files that crashed before could suddenly be opened).

AFAIK it is a Windows Server 2003 share, but I could validate this if necessary.

Comment 17 Patrick Oltmann 2010-07-24 16:27:27 UTC

Created attachment 434164
dmesg log of crash with CIFS debugging enabled

Version information:
openoffice.org-calc-3.2.0-12.25.fc13.x86_64
samba-3.5.4-62.fc13.x86_64
kernel-2.6.33.6-147.fc13.x86_64

Comment 18 Patrick Oltmann 2010-07-24 16:34:42 UTC

I don't have permission to add it as an "external bug", but I think that this might refer to the same issue:
http://qa.openoffice.org/issues/show_bug.cgi?id=106591

Comment 19 Caolan McNamara 2010-07-24 18:48:22 UTC

Yes, that's the same issue. There we're suggesting just using a traditional simple read/write loop instead of the cunning write with arguments of file length and the pointer returned from a successful mmap of the input file. Even if that approach is taken, the (apparent) bug here would remain, though it would no longer affect OOo.
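
For reference, the read/write approach could be sketched roughly as below. This is only an illustration under assumptions, not the actual OOo patch: the name copy_fd_loop and the 64 KiB buffer are made up for the example. The point is that a failing read() comes back as -1 with errno set, which the caller can report, rather than as a signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not the OOo code): copy everything from src_fd to
 * dst_fd with plain read()/write(). Returns 0 on success, or -1 with errno
 * set, so a failed read becomes an error code instead of a SIGBUS. */
int copy_fd_loop(int src_fd, int dst_fd)
{
    char buf[65536];

    assert(src_fd >= 0 && dst_fd >= 0);

    for (;;) {
        ssize_t n_read = read(src_fd, buf, sizeof buf);
        if (n_read == 0)
            return 0;               /* end of file */
        if (n_read < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal, retry */
            return -1;              /* e.g. EACCES after a lock conflict */
        }

        /* write() may accept fewer bytes than asked, so loop per chunk. */
        ssize_t off = 0;
        while (off < n_read) {
            ssize_t n_written = write(dst_fd, buf + off,
                                      (size_t)(n_read - off));
            if (n_written < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += n_written;
        }
    }
}
```

A caller would open both files, call copy_fd_loop(src_fd, dst_fd), and on -1 turn errno into an ordinary "could not read file" error.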

Comment 20 Jeff Layton 2010-07-26 12:09:13 UTC

-------------------------[snip]--------------------------
 fs/cifs/file.c: CIFS VFS: leaving cifs_open (xid = 2057739) rc = 0
 fs/cifs/file.c: CIFS VFS: in cifs_file_mmap as Xid: 2057741 with uid: 500
 fs/cifs/file.c: CIFS VFS: leaving cifs_file_mmap (xid = 2057741) rc = 0
 fs/cifs/file.c: CIFS VFS: in cifs_readpage as Xid: 2057742 with uid: 500
 fs/cifs/file.c: readpage ffffea00001e9d60 at offset 16384 0x4000

 fs/cifs/file.c: CIFS VFS: in cifs_read as Xid: 2057743 with uid: 500
 fs/cifs/cifssmb.c: Reading 4096 bytes on fid 32873
 fs/cifs/transport.c: For smb_command 46
 fs/cifs/transport.c: Sending smb:  total_len 63
 fs/cifs/connect.c: rfc1002 length 0x27
Status code returned 0xc0000054 NT_STATUS_FILE_LOCK_CONFLICT
 fs/cifs/netmisc.c: Mapping smb error code 33 to POSIX err -13
 fs/cifs/misc.c: Null buffer passed to cifs_small_buf_release
 CIFS VFS: Send error in read = -13
 fs/cifs/file.c: CIFS VFS: leaving cifs_read (xid = 2057743) rc = -13
 fs/cifs/file.c: CIFS VFS: leaving cifs_readpage (xid = 2057742) rc = -13
-------------------------[snip]--------------------------

I suspect the problem is above. cifs uses the generic mmap routines, and those just call down to the filesystem for reads when pages need to be faulted in. When there's an error on read, I believe the kernel will send a SIGBUS.
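
The SIGBUS behaviour can be seen on a local filesystem as well. The sketch below does not involve CIFS at all: as a stand-in for the failed server read, it truncates the file after a successful mmap so that the page can no longer be faulted in. The function name read_faults_with_sigbus and the temp file template are hypothetical, and the signal handler exists only so the demo can report the fault instead of dying.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_env;

static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(fault_env, 1);       /* jump back out of the faulting read */
}

/* Hypothetical demo: map one page of a temp file and read its first byte.
 * If truncate_first is non-zero the file is cut to length 0 after mmap()
 * succeeded, simulating a page that cannot be read in.
 * Returns 1 if the read raised SIGBUS, 0 if it worked, -1 on setup errors. */
int read_faults_with_sigbus(int truncate_first)
{
    char path[] = "/tmp/sigbusdemoXXXXXX";
    long page = sysconf(_SC_PAGESIZE);
    int faulted = -1;

    assert(page > 0);
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                   /* file lives on until fd is closed */
    if (ftruncate(fd, page) != 0) {
        close(fd);
        return -1;
    }

    const volatile char *map =
        mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    if (!truncate_first || ftruncate(fd, 0) == 0) {
        struct sigaction sa, old;
        sa.sa_handler = on_sigbus;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGBUS, &sa, &old);

        if (sigsetjmp(fault_env, 1) == 0) {
            char c = map[0];        /* page fault happens here */
            (void)c;
            faulted = 0;
        } else {
            faulted = 1;            /* we got here via the SIGBUS handler */
        }
        sigaction(SIGBUS, &old, NULL);
    }

    munmap((void *)map, (size_t)page);
    close(fd);
    return faulted;
}
```

On Linux I would expect read_faults_with_sigbus(0) to return 0 and read_faults_with_sigbus(1) to return 1. Note that mmap() succeeds in both cases; the failure only shows up when the page is touched.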

I'm not sure what we can really do in this situation. We have no choice but to return an error when the read can't be satisfied. We also have no good way to know that all or part of the range is locked at mmap time. Even if we could, we have no way to prevent someone from placing locks on the file later (aside from placing a lock ourselves, which could cause other situations to fail).

I don't think changing the code to use a traditional read/write copy in a loop would really help much here. It would prevent the SIGBUS, but it'll probably error out too.

The question here is whether the file is locked via a different filehandle on the same client that's trying to do the mmaped read here, or if it's locked by a different client or on the server itself. I don't see any evidence of locking calls in the dmesg output, but I can't rule that out.

If the file is being locked by this particular client, then one possible workaround is to just skip sending lock calls to the server at all by mounting the share with '-o nolock'. That won't help however if the file is locked by a different client entirely.

Comment 21 David Tardon 2010-07-30 04:26:31 UTC

*** Bug 619606 has been marked as a duplicate of this bug. ***

Comment 22 Paulo Lopes 2010-08-02 16:54:06 UTC

Package: openoffice.org-calc-1:3.2.0-12.25.fc13
Architecture: x86_64
OS Release: Fedora release 13 (Goddard)


How to reproduce
-----
1. open ods on a samba share

Comment 23 Jeff Layton 2010-08-05 15:32:53 UTC

As I said before, I don't think there's much we can reasonably do here. A SIGBUS is just what the kernel will send when there's an error faulting in a page, and we have no choice but to return an error when there's an error reading in the page.

There are a couple of things that OO could do to mitigate this. One is to switch to a more typical read/write loop and handle errors appropriately on read. Another would be to place a lock on the file prior to reading from the mmap and have the program deal with lock conflicts.
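
The second option could look something like the following sketch: take a non-blocking shared lock with fcntl() before touching the file and treat a conflict as an ordinary error. The helpers try_read_lock and unlock_file are hypothetical, not existing OOo functions, and whether the lock request actually reaches the server depends on the mount options (with '-o nolock', as mentioned in comment 20, it would not).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: try to take a shared (read) lock on the whole file
 * without blocking. fd must be open for reading.
 * Returns 0 if the lock was granted, 1 if somebody else holds a
 * conflicting lock, -1 on any other error (errno is set). */
int try_read_lock(int fd)
{
    struct flock fl = {0};

    assert(fd >= 0);
    fl.l_type = F_RDLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                   /* 0 means "to the end of the file" */

    if (fcntl(fd, F_SETLK, &fl) == 0)
        return 0;
    if (errno == EACCES || errno == EAGAIN)
        return 1;                   /* lock conflict, report it to the user */
    return -1;
}

/* Hypothetical helper: drop the lock taken by try_read_lock(). */
int unlock_file(int fd)
{
    struct flock fl = {0};

    assert(fd >= 0);
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    return fcntl(fd, F_SETLK, &fl) == 0 ? 0 : -1;
}
```

The program would only go on to read the file when try_read_lock() returns 0, and would show a "file is locked by another user" style message on 1.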

At this point, I'm going to set this back to an oo bug, but I'll stay on the cc list and can try to assist in coming up with a solution.

Comment 24 Patrick Oltmann 2010-08-06 11:55:14 UTC

Ok, I do understand that this is not a kernel issue. However, I'm still puzzled whether this is an OO problem (using functions without doing the necessary checks before) or a Samba problem (not implementing these functions correctly). If there is any additional information that I could provide to sort this out, please let me know.

I really don't have any opinions on the responsibilities for this issue, but I'm growing desperate to see it getting fixed. For me this thing has become a real issue, causing users to keep local copies of shared documents (since they cannot work with the ones on the SMB share anymore) leading to more and more consistency/versioning problems. OpenOffice (see Issue 106591) says that they could just provide a "workaround" for someone else's problem and pushed the target to 3.4 (which is another 6 months).

Comment 25 Jeff Layton 2010-08-06 12:19:20 UTC

Well, I wouldn't say "it's not a kernel issue". The situation is a bit more subtle...

The issue is really that Windows isn't 100% POSIX compliant, particularly not when it comes to file locking. Windows locking is always mandatory: if you lock a file for write, then another thread won't be able to read that file until it's unlocked. This is in contrast to Linux and other Unix-like OSes, which mostly implement advisory locks. Yes, it's possible to do mandatory locking on Linux too, but it's fairly uncommon.
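
To illustrate the advisory side: in the sketch below one process holds an exclusive fcntl() lock on a file while another reads it with a plain read call, which on a typical local Linux filesystem still succeeds because nothing forces the reader to check the lock. The helper read_while_write_locked is a hypothetical demo function, and it assumes fd was opened O_RDWR on a filesystem without mandatory locking enabled.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: a child process takes an exclusive lock on the whole
 * file, then the parent reads from offset 0 anyway.
 * Returns the number of bytes the parent read, or -1 if setup failed. */
ssize_t read_while_write_locked(int fd, char *buf, size_t len)
{
    int ready[2], done[2];

    assert(fd >= 0 && buf != NULL);
    if (pipe(ready) != 0)
        return -1;
    if (pipe(done) != 0) {
        close(ready[0]); close(ready[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(ready[0]); close(ready[1]);
        close(done[0]); close(done[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: lock the file, tell the parent, then wait for EOF. */
        struct flock fl = {0};
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;     /* l_start = l_len = 0: whole file */
        char ok = (fcntl(fd, F_SETLK, &fl) == 0) ? 'y' : 'n';
        close(ready[0]);
        close(done[1]);
        if (write(ready[1], &ok, 1) != 1)
            _exit(1);
        if (read(done[0], &ok, 1) < 0)
            _exit(1);
        _exit(0);                   /* exiting releases the lock */
    }

    /* Parent: wait until the child holds the lock, then read regardless. */
    close(ready[1]);
    close(done[0]);
    char ok = 'n';
    ssize_t n = -1;
    if (read(ready[0], &ok, 1) == 1 && ok == 'y')
        n = pread(fd, buf, len, 0); /* advisory lock does not block this */
    close(done[1]);                 /* EOF on the pipe lets the child exit */
    close(ready[0]);
    waitpid(pid, NULL, 0);
    return n;
}
```

Against a Windows server the equivalent read is the one that came back with NT_STATUS_FILE_LOCK_CONFLICT in the cifsFYI output above.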

So, the problem arises there -- at least in the case of the cifsFYI info above. The file is locked by something (another process? another client?) so when we go to read it, the server returns an error. CIFS has no choice but to return an error. Ok, that's not 100% true -- we could block indefinitely and retry the read, but that'll cause other problems. Returning an error is the best we can do, I think.

In any case... the issue is really that the core CIFS protocol isn't and can never be 100% POSIX compliant. We do the best we can, but we're really constrained by the protocol and server implementations.

I think it's prudent to have OO avoid using mmap for this. It's not really needed if all they're doing is copying the file, and avoiding it will avoid a SIGBUS on a read error. If the problems are all related to file locking, that'll probably just trade the SIGBUS for an application level error, but I think that's still preferable.

Comment 26 Caolan McNamara 2010-08-06 13:41:25 UTC

I'll roll out a classic read/write update soonish, maybe Monday. Though in my naive worldview I'd prefer that mmap on a cifs filesystem simply failed rather than give me a result which can't be relied upon.

Comment 27 Fedora Update System 2010-08-09 13:18:45 UTC

openoffice.org-3.2.0-12.29.fc13 has been submitted as an update for Fedora 13.
http://admin.fedoraproject.org/updates/openoffice.org-3.2.0-12.29.fc13

Comment 28 Fedora Update System 2010-08-10 07:23:34 UTC

openoffice.org-3.3.0-3.2.fc14 has been submitted as an update for Fedora 14.
http://admin.fedoraproject.org/updates/openoffice.org-3.3.0-3.2.fc14

Comment 29 Fedora Update System 2010-08-10 21:44:32 UTC

openoffice.org-3.2.0-12.29.fc13 has been pushed to the Fedora 13 testing repository.  If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
su -c 'yum --enablerepo=updates-testing update openoffice.org'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/openoffice.org-3.2.0-12.29.fc13

Comment 30 Fedora Update System 2010-08-13 21:23:42 UTC

openoffice.org-3.2.0-12.29.fc13 has been pushed to the Fedora 13 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 31 Fedora Update System 2010-08-19 01:21:52 UTC

openoffice.org-3.3.0-3.2.fc14 has been pushed to the Fedora 14 stable repository.  If problems still persist, please make note of it in this bug report.
