Bug 1150082 - downloads get stuck in splice() after upgrade to FF31
Summary: downloads get stuck in splice() after upgrade to FF31
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: firefox
Version: 5.11
Hardware: i386
OS: All
medium
medium
Target Milestone: rc
: ---
Assignee: Martin Stransky
QA Contact: Desktop QE
URL:
Whiteboard:
: 1143982 1154840 (view as bug list)
Depends On:
Blocks: 1162195
TreeView+ depends on / blocked
 
Reported: 2014-10-07 12:09 UTC by Pavel Kankovsky
Modified: 2019-11-14 06:31 UTC (History)
18 users (show)

Fixed In Version: firefox-31.2.0-5.el5_11
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1162195 (view as bug list)
Environment:
Last Closed: 2014-12-16 14:32:37 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
splice() test (929 bytes, text/plain)
2014-10-07 12:09 UTC, Pavel Kankovsky
no flags Details
Script to patch omni.ja (1.05 KB, application/x-shellscript)
2014-10-20 13:27 UTC, Pavel Kankovsky
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 1228583 0 None None None Never

Description Pavel Kankovsky 2014-10-07 12:09:06 UTC
Created attachment 944552 [details]
splice() test

Description of problem:
Firefox 31 attempts to download a file on RHEL5 (including implicit downloads such as addon installs and upgrades) do often get stuck if the desired final location is on other filesystem than /tmp.

Version-Release number of selected component (if applicable):
31.1.0-6.el5

How reproducible:
It might take several attempts before the problem occurs.

Steps to Reproduce:
1. Prepare a RHEL5 installation with separate /tmp filesystem.
2. Start FF31 and download one or more larger files that take 30 seconds or more to download. Try changing preferences to make FF ask for the destination and wait a few seconds before you click "Save".

Actual results:
The file is saved as /tmp/[random].part but it is never moved to the final location. The download appears in the list as unfinished ("Unknown time remaining") and it cannot be cancelled (only paused but you cannot resume it later).

History menu gets stuck too (does neither highlight nor open) sometimes.

When FF exits, it complains about unfinished downloads, spews out error messages to stderr about getting stuck and commits suicide. Here is a sample (slightly redacted):

[small pause here]
WARNING: A phase completion condition is taking too long to complete. Condition: OS.File: flush I/O queued before profile-before-change Phase: profile-before-change State: (none)
WARNING: A phase completion condition is taking too long to complete. Condition: CrashMonitor: Writing notifications to file after receiving profile-before-change Phase: profile-before-change State: (none)
[much longer pause here]
ERROR: At least one completion condition failed to complete within a reasonable amount of time. Causing a crash to ensure that we do not leave the user with an unresponsive process draining resources. Conditions: [{"name":"OS.File: flush I/O queued before profile-before-change","state" {"launched":true,"shutdown":false,"worker":true,"pendingReset":false,"latestSent":["Fri Oct 03 2014 14:05:14 GMT+0200 (CEST)","move",[{"string":"/tmp/8HgdO3g9.rpm.part"},{"string":"/home/user/Downloads/somefilename"},null]],"latestReceived":null,"messagesSent":0,"messagesReceived":18,"messagesQueued":29,"DEBUG":false}},{"name":"CrashMonitor: Writing notifications to file after receiving profile-before-change","state":"(none)"}] Phase: profile-before-change
WARNING: No crash reporter available
[16991] ###!!! ABORT: file resource://gre/modules/AsyncShutdown.jsm, line 431
[16991] ###!!! ABORT: file resource://gre/modules/AsyncShutdown.jsm, line 431
Segmentation fault

Expected results:
FF finishes download and saves the downloaded file to the desired location.

Additional info:
It turns out one of FF threads gets stuck in OS.File.copy when it calls splice(). Here is an excerpt from strace:

17543 open("/tmp/qYAY9GJi.rpm.part", O_RDONLY|O_APPEND <unfinished ...>
17543 <... open resumed> )              = 59
...
17543 rename("/tmp/qYAY9GJi.rpm.part", "/home/peak/user/somefilename" <unfinished ...>
17543 <... rename resumed> )            = -1 EXDEV (Invalid cross-device link)
...
17543 open("/home/user/Downloads/somefilename", O_WRONLY|O_CREAT|O_TRUNC, 0600 <unfinished ...>
17543 <... open resumed> )              = 60
...
17543 pipe( <unfinished ...>
17543 <... pipe resumed> [61, 62])      = 0
...
17543 splice(0x3b, 0, 0x3e, 0, 0x20000, 0 <unfinished ...>

The last syscall does never finish. It is told to stuff 2^17 bytes into a pipe buffer but the buffer has space only for 2^16 bytes and splice() gets stuck waiting for someone to clear the buffer. splice() in newer (EL6+) kernels can somehow recover and return when the buffer fills up. (See the attached test program.)

The problem disappeared when I modified toolkit/components/osfile/osfile.jsm not to call splice() (I changed "if (UnixFile.splice) {" to "if (0 && UnixFile.splice) {".)

PS: Some but not all of the symptoms of this bug are similar to the symptoms of #1143982.

Comment 1 Tru Huynh 2014-10-09 16:14:42 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1045340#c36 is propably the same issue.

Comment 2 richard rigby 2014-10-10 13:03:51 UTC
just to add, i rebuilt the firefox 31/el5 packages, adding the suggested patch:

#---

--- mozilla-esr31/toolkit/components/osfile/modules/osfile_unix_front.jsm       2014-08-25 14:17:27.000000000 +0100
+++ mozilla-esr31.patched/toolkit/components/osfile/modules/osfile_unix_front.jsm       2014-10-08 18:11:37.000000000 +0100
@@ -574,7 +574,7 @@
        };

        // Fortunately, under Linux, that pumping function can be optimized.
-       if (UnixFile.splice) {
+       if (0 && UnixFile.splice) {
          const BUFSIZE = 1 << 17;

          // An implementation of |pump| using |splice| (for Linux/Android)

#---

and have been using it for a couple of days (as have a few others) and downloading files seems to be working again.

thanks,

richard

Comment 3 Michael Lampe 2014-10-10 20:47:33 UTC
Good catch!

I'm actually using seamonkey on el5 and since some time I experienced the following bugs:

- hangs on quit

- no downloads possible anymore after some time of usage

- hangs on installing or updating addons

- hangs when moving larger amounts of mail between folders

This is now all cured by disabling this splice optimization as above.

So, my guess is that this will also fix a couple of other open bugs with FF in el5.

Comment 4 Akemi Yagi 2014-10-10 22:34:30 UTC
(In reply to Michael Lampe from comment #3)

> So, my guess is that this will also fix a couple of other open bugs with FF
> in el5.

In fact, it looks as if the patch also fixes the shutdown issue reported in:

https://bugzilla.redhat.com/show_bug.cgi?id=1045340

Comment 5 Michael Lampe 2014-10-10 22:59:12 UTC
I'm wondering if Thunderbird is also affected. It uses this file, too. And from my experience with Seamonkey, I'd say yes.

I'm also wondering if this is a known documented change in splice behaviour (it works as is in el6) or perhaps a (Redhat?) kernel bug.

Comment 6 Simon Matter 2014-10-13 06:12:06 UTC
There is another issue with firefox-31.1.0-6.el5 which may also be related:
When asked to save the current session, or by sending SIGUSR1, FF can not save its current state and on restart, it loads the last saved session which never gets updated.

Comment 7 Robert Heller 2014-10-13 20:20:32 UTC
It also affect downloading and installing larger plugins/extensions (like firebug or Firefox 2, the theme reloaded). Smaller plugins/extensions seem to download and install OK.

Comment 8 Akemi Yagi 2014-10-13 20:27:46 UTC
If you'd like to test the patched version of firefox, unsigned binaries are available. Please see this note in the CentOS bug tracker :

http://bugs.centos.org/view.php?id=7683#c21124

Comment 9 Simon Matter 2014-10-14 12:54:08 UTC
(In reply to Akemi Yagi from comment #8)
> If you'd like to test the patched version of firefox, unsigned binaries are
> available. Please see this note in the CentOS bug tracker :
> 
> http://bugs.centos.org/view.php?id=7683#c21124

Hi Akemi, I can confirm that your CentOS build with disabled splice() function works very well for me.

Session saving works, downloading works and I have also not seen the hanging shutdown of firefox.

Comment 10 Simon Matter 2014-10-14 14:06:22 UTC
While looking at splice.c in the EL5 kernel I'm wondering if it is really correct or should be changed like this:

--- linux-2.6.18.4/fs/splice.c.orig     2014-10-14 15:27:16.000000000 +0200
+++ linux-2.6.18.4/fs/splice.c  2014-10-14 15:56:52.000000000 +0200
@@ -73,8 +73,8 @@
                 */
                wait_on_page_writeback(page);
 
-               if (PagePrivate(page)
-                   && try_to_release_page(page, GFP_KERNEL))
+               if (PagePrivate(page) &&
+                   !try_to_release_page(page, GFP_KERNEL))
                        goto out_unlock;
 
                /*

Comment 11 Akemi Yagi 2014-10-14 15:52:46 UTC
(In reply to Simon Matter from comment #9)
 
> > http://bugs.centos.org/view.php?id=7683#c21124
> 
> Hi Akemi, I can confirm that your CentOS build with disabled splice()
> function works very well for me.

Just to be precise... the patched firefox was built and offered by a CentOS dev, Tru Huynh. :)

> Session saving works, downloading works and I have also not seen the hanging
> shutdown of firefox.

Glad to hear that you were able to confirm the fix.

Comment 12 Akemi Yagi 2014-10-14 15:55:03 UTC
(In reply to Simon Matter from comment #10)
> While looking at splice.c in the EL5 kernel I'm wondering if it is really
> correct or should be changed like this:
> 
> --- linux-2.6.18.4/fs/splice.c.orig     2014-10-14 15:27:16.000000000 +0200
> +++ linux-2.6.18.4/fs/splice.c  2014-10-14 15:56:52.000000000 +0200
> @@ -73,8 +73,8 @@
>                  */
>                 wait_on_page_writeback(page);
>  
> -               if (PagePrivate(page)
> -                   && try_to_release_page(page, GFP_KERNEL))
> +               if (PagePrivate(page) &&
> +                   !try_to_release_page(page, GFP_KERNEL))
>                         goto out_unlock;
>  
>                 /*

I can see EL6 kernels have this corrected as shown above. Maybe it's worth building the EL5 kernel with this patch and try it.

Comment 13 Michael Lampe 2014-10-14 18:12:43 UTC
In reply to Simon Matter from comment #10:

It seems RH has applied a broken version of this upstream patch: https://gitlab.com/teobaluta/opw/commit/ca39d651d17df49b6d11f851d56c0ce0ce01ea1a

Comment 14 Simon Matter 2014-10-14 18:20:38 UTC
I've built and tested a kernel with above patch but it didn't make any difference, firefox still didn't work right as the firefox build without using splice() does.

I still have the feeling that splice() support in the EL5 kernel is somehow broken. Did anyone manage to run the affected firefox build on a EL6 based system?

Comment 15 Pavel Kankovsky 2014-10-14 19:50:54 UTC
(In reply to Tru Huynh from comment #1)
(In reply to Akemi Yagi from comment #4)

Issue described in comment 36 of BZ #1045340 is probably the same as this bug. But #1045340 itself is probably a different problem. See my comment there
https://bugzilla.redhat.com/show_bug.cgi?id=1045340#c38

(In reply to Michael Lampe from comment #3)
(In reply to Simon Matter from comment #6)
(In reply to Robert Heller from comment #7)

FF seems to be quite fond of moving files around and all hell broke loose when it hung up in splice() therefore I am not suprised all sorts of problems (some seemingly unrelated) have been cured when the code calling splice() was disabled.

(In reply to Simon Matter from comment #14)

EL5 implementation of splice() is probably buggy as observed in comment #10 but the bug does not seem to affect the FF's particular pattern of use. Have a look at the attached testing program ("splice() test"). It simulates the offending code in FF31 and it runs fine when you tell it to use buffer size <= 2^16 (or do nonblocking splice() calls).

I did not run the EL5 build of FF31 on EL6+ but EL6 build of FF31 & the test program run without problems with the default buffer size i.e. 2^17. But it does not really transfer 2^17-byte chunks. splice() always returns when it transfers 2^16 bytes and fills the pipe buffer.

To be honest, I do not think the observed behaviour of splice() in EL5 is incorrect: if you do a blocking call and tell it to stuff 2^17 bytes into a buffer whose capacity is 2^16 then it is no surprise you have to wait until anyone reads the data and frees the buffer. It is FF's fault it does not make a non-blocking call when it handles both ends of the pipe in the same thread. YMMV.

Comment 16 Simon Matter 2014-10-15 20:02:14 UTC
I've just tested the new updated firefox-31.2.0-3.el5 and it has the same issue :(

Comment 17 Pavel Kankovsky 2014-10-20 13:27:16 UTC
Created attachment 948555 [details]
Script to patch omni.ja

This script modifies /usr/lib/firefox/omni.ja to get rid of the troublesome splice() call (using patch in comment #2).
It makes downloads work but it may print wierd unzip warnings and other confusing messages, needs quite a lot of space in /tmp, and the final result is probably suboptimal because the magic applied during the creation of the original omni.ja is lost.
Use at your own risk.
The original omni.ja is preserved as /usr/lib/firefox/omni.ja.orig.
Tested with firefox-31.2.0-3.el5.

Comment 18 RHEL Program Management 2014-10-23 13:43:12 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 19 Akemi Yagi 2014-10-23 15:17:30 UTC
(In reply to RHEL Product and Program Management from comment #18)
> This request was not resolved in time for the current release.
> Red Hat invites you to ask your support representative to
> propose this request, if still desired, for consideration in
> the next release of Red Hat Enterprise Linux.

Hmmm I'm afraid there is no such thing like "the next release of Red Hat Enterprise Linux". This is RHEL-5. The fix must be done in the current release.

Comment 24 Martin Stransky 2014-11-10 12:37:48 UTC
Let's use the splice patch while there isn't any risk and users can benefit from it.

Comment 26 Martin Stransky 2014-11-10 12:44:56 UTC
*** Bug 1143982 has been marked as a duplicate of this bug. ***

Comment 27 Akemi Yagi 2014-12-05 00:28:03 UTC
firefox-31.3.0-4 is out. Within my limited test I confirm the problem has been fixed.

Comment 28 Andrew Riell 2014-12-05 00:35:42 UTC
I received verification from my customer.  The latest version of firefox fixes the issue.  Thank you.

P.S. This BZ could be closed.

Comment 29 Pavel Kankovsky 2014-12-08 22:41:47 UTC
firefox-31.3.0-4.el5 seems to have fixed the problem.

Comment 30 Martin Stransky 2014-12-16 14:31:57 UTC
*** Bug 1154840 has been marked as a duplicate of this bug. ***

Comment 31 Michael Lampe 2015-05-14 00:55:36 UTC
Dropping the splice patch for el5's FF38 has opened this can of worms again.

I have filed a new bug report for this: https://bugzilla.redhat.com/show_bug.cgi?id=1221368


Note You need to log in before you can comment on or make changes to this bug.