203496 – checkouts form subversion via apache fail with "Could not read chunk size" error

Bug 203496 - checkouts form subversion via apache fail with "Could not read chunk size" error

Summary: checkouts form subversion via apache fail with "Could not read chunk size" error

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	httpd
Sub Component:
Version:	5
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	low
Target Milestone:	---
Assignee:	Joe Orton
QA Contact:
Docs Contact:
URL:
Whiteboard:	bzcl34nup
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2006-08-22 02:46 UTC by Mark Slemko
Modified:	2008-05-06 16:15 UTC (History)
CC List:	1 user (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2008-05-06 16:15:24 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
debug output from failures running subversion (28.88 KB, text/plain) 2006-09-05 23:22 UTC, Mark Slemko	no flags	Details
backtrace for pid 11731 (6.91 KB, text/plain) 2006-09-13 21:18 UTC, Mark Slemko	no flags	Details
additional backtrace (5.59 KB, text/plain) 2006-09-14 15:54 UTC, Mark Slemko	no flags	Details
backtrace I'm getting all the time (1.24 KB, text/plain) 2006-09-29 19:05 UTC, Mark Slemko	no flags	Details
core dump with the 'rv' error found (1.20 MB, application/octet-stream) 2006-10-02 17:11 UTC, Mark Slemko	no flags	Details
Another core dump (1.36 MB, application/octet-stream) 2006-10-25 17:39 UTC, Mark Slemko	no flags	Details
View All

Description Mark Slemko 2006-08-22 02:46:18 UTC

Description of problem:
there is a problem with the combination of subversion and apache where the
checkout or update of a project fails with an error when large volumes of data
are transferred. Basically, an error like this occurs:
REPORT request failed on '/svn/srt/!svn/vcc/default'
REPORT of '/svn/srt/!svn/vcc/default': Could not read chunk size: connection was
closed


Version-Release number of selected component (if applicable):
Apache/2.2.2
svn, version 1.3.2

How reproducible:
somewhat reproducable.

Steps to Reproduce:
when a bunch of large files (multi MB) are checked into a repository and then
you proceed to get them using a subversion client via checkout or update (slower
network connection exacerbates the problem, i.e. wireless), it will regularly
fail. This happens on clients for windows, mac, and linux (fc5) in the same
manner, so the server side is the suspect here. Apache with mod_dav and
subversion are the server side components.

Actual results:
svn: REPORT request failed on '/svn/srt/!svn/vcc/default'
svn: REPORT of '/svn/srt/!svn/vcc/default': Could not read chunk size:
connection was closed by server. 
(other similar error messages appeared with the 'Could not read chunk size' message)

Expected results:
finished the checkout/update without errors

Additional info:
This is described on the tigris site as well here:
http://subversion.tigris.org/faq.html#tiger-apr-0.9.6

Suggested solution:
(quoted from the tigris faq above for completeness)
Currently, the best solution is to rebuild Apache and Subversion from their
respective sources, setting the following environment variable before running
configure for Apache:
   setenv ac_cv_func_poll no
or in Bourne shell syntax, like this:
   ac_cv_func_poll=no; export ac_cv_func_poll
If you built APR / APRUTIL separately (i.e., you did not use the ones that come
as part of the Apache tarball), you must set that environment variable before
running configure for APR, as this is where the problem lies.

Comment 1 Joe Orton 2006-08-29 08:14:01 UTC

A connection truncation like this is generally caused by the server doing
something wrong.  Do you have a reliable reproduction case for the error?

The error_log will likely contain a mod_dav error (or a segfault) around where
the report was truncated.  Can you provide error_log details?

Comment 2 Mark Slemko 2006-08-29 21:21:13 UTC

I am able to reproduce this on our internal server quite easily.

I ran an update on my project which takes several minutes to do, and it caused
the error 2 times during 2 sequential updates. I show both error messages below.
This is interesting, too, as it indicates php as a source of the error/failure
(although I don't know why php would be involked for subversion update). [I was
the only person using the server at the time.]

from the error_log:
PHP Fatal error:  Maximum execution time of 30 seconds exceeded in Unknown on line 0
[Tue Aug 29 14:37:59 2006] [notice] child pid 20552 exit signal Segmentation
fault (11)
PHP Fatal error:  Maximum execution time of 30 seconds exceeded in Unknown on line 0
[Tue Aug 29 14:39:08 2006] [notice] child pid 4570 exit signal Segmentation
fault (11)

I suspect I could increase the php timeout to hide the problem. ;-)

Comment 3 Joe Orton 2006-08-30 08:51:56 UTC

That's pretty odd.  Can you capture a backtrace?  Please do:

  yum install httpd-debuginfo subversion-debuginfo php-debuginfo
  echo CoreDumpDirectory /tmp > /etc/httpd/conf.d/coredump.conf
  service httpd restart
  <... trigger segfault...>
  gdb /usr/sbin/httpd /tmp/core.<pid>

Comment 4 Mark Slemko 2006-08-31 22:09:44 UTC

I installed:
yum install httpd-devel subversion-devel php-devel
and followed the rest of your instructions, however, it seems to have changed
its frequency of occurance. (haven't yet had it fail), so the *devel libraries
have affected this problem (in a positive way?). 
If/when it does fail again I will follow up on this.

Comment 5 Joe Orton 2006-09-01 06:44:39 UTC

It's the *debuginfo* packages not -devel - these will simply give a better
backtrace when you can get a core dump so will aid diagnosis, not fix the issue.

Comment 6 Mark Slemko 2006-09-05 23:22:06 UTC

Created attachment 135615 [details]
debug output from failures running subversion

This is the requested debug information when the system fails during a
subversion update, commit or other functions.

Comment 7 Joe Orton 2006-09-06 08:32:55 UTC

Thanks very much, this is very useful.  It seems like the PHP alarm signal is
not getting reset after a previous request in this child.

Can you check your access_log and see what PHP requests had been processed in
the seconds/minutes prior up to the segfault?  And give me an idea of what these
PHP scripts are doing?  Are there any abnormal PHP errors logged in error_log
prior to the "Maximum execution time" error?

Comment 8 Mark Slemko 2006-09-06 15:55:58 UTC

You're welcome.
There aren't many errors and it is hard to know if they are related. I see these
type of errors before some of the failures - this pair occurred 5 minutes before
pid=21010 error (from the error log):

[Tue Sep 05 10:53:49 2006] [error] [client 192.168.1.100] Provider encountered
an error while streaming a REPORT response.  [500, #0]
[Tue Sep 05 10:53:49 2006] [error] [client 192.168.1.100] A failure occurred
while driving the update report editor  [500, #190004]

The PHP code that was called prior was 6 minutes earlier (from the access_log).

192.168.1.1 - name [05/Sep/2006:10:51:37 -0600] "GET
/bugs/view_all_bug_page.php?page_number=1 HTTP/1.1" 200 27162
192.168.1.1 - name [05/Sep/2006:10:51:38 -0600] "GET
/bugs/images/mantis_logo_button.gif HTTP/1.1" 304 -

and the activitity just before the crash was like this:

192.168.1.1 - - [05/Sep/2006:10:58:48 -0600] "PROPFIND
/svn/srt/branches/version-0.6 HTTP/1.1" 401 491
192.168.1.1 - name [05/Sep/2006:10:58:48 -0600] "PROPFIND
/svn/srt/branches/version-0.6 HTTP/1.1" 207 714
192.168.1.1 - name [05/Sep/2006:10:58:48 -0600] "PROPFIND
/svn/srt/!svn/vcc/default HTTP/1.1" 207 399
192.168.1.1 - name [05/Sep/2006:10:58:48 -0600] "PROPFIND /svn/srt/!svn/bln/3387
HTTP/1.1" 207 456
192.168.1.1 - name [05/Sep/2006:10:58:48 -0600] "PROPFIND
/svn/srt/branches/version-0.6 HTTP/1.1" 207 714
192.168.1.1 - name [05/Sep/2006:10:58:48 -0600] "REPORT
/svn/srt/!svn/vcc/default HTTP/1.1" 200 260372
192.168.1.100 - - [05/Sep/2006:10:58:50 -0600] "PROPFIND
/svn/srt/branches/version-0.6 HTTP/1.1" 401 479
192.168.1.100 - name2 [05/Sep/2006:10:58:50 -0600] "PROPFIND
/svn/srt/branches/version-0.6 HTTP/1.1" 207 714
192.168.1.100 - name2 [05/Sep/2006:10:58:50 -0600] "PROPFIND
/svn/srt/!svn/vcc/default HTTP/1.1" 207 399
192.168.1.100 - name2 [05/Sep/2006:10:58:50 -0600] "PROPFIND
/svn/srt/!svn/bln/3387 HTTP/1.1" 207 456
192.168.1.100 - name2 [05/Sep/2006:10:58:50 -0600] "PROPFIND
/svn/srt/branches/version-0.6 HTTP/1.1" 207 714

Comment 9 Joe Orton 2006-09-06 16:47:12 UTC

Thanks.  Reproducing this will be very difficult.  Would you be willing to try
running this diagnostic module for a while to help track this down further?

http://people.apache.org/~jorton/mod_traptimer.c

To use: download the C file, and run "apxs -cia mod_traptimer.c" then restart httpd.

This will abort() a child process if PHP leaves its timer set after the request
has been completed, which might help narrow the issue down.

With this loaded you should get errors like;

  "profiling timer still set for /blah"

in the error_log, and more core dumps.

Comment 10 Mark Slemko 2006-09-13 21:18:30 UTC

Created attachment 136212 [details]
backtrace for pid 11731

Got an error, and this is it:

PHP Fatal error:  Maximum execution time of 30 seconds exceeded in Unknown on
line 0
[Wed Sep 13 15:01:51 2006] [notice] child pid 11731 exit signal Segmentation
fault (11), possible coredump in /tmp

debug output from failure attached

Comment 11 Mark Slemko 2006-09-14 15:54:45 UTC

Created attachment 136275 [details]
additional backtrace

an additional backtrace (for pid 11732)

I'm not seeing the additional information regarding the errors in the logs.
Should I be doing something more than just restarting apache?

Comment 12 Joe Orton 2006-09-18 09:50:15 UTC

No - can you just check you have an uncommented "LoadModule" line for
mod_traptimer in the httpd.conf?

Comment 13 Mark Slemko 2006-09-29 19:05:15 UTC

Created attachment 137422 [details]
backtrace I'm getting all the time

Hi,
I was waiting for the original problem to surface and it seems that it is being
interfered with by the "mod_traptimer" code. 

What happens is that I am getting a large number of core dumps with the
"mod_traptimer" activated (It's generating an error related to an "rv"
variable.) I haven't witnessed the originally reported problem but it kind-of
makes sense if the current httpd thread is being closed early and regularly.

I don't quite see why the rv is not being accessed. Perhaps you have some
insight into this. :-)

Comment 14 Joe Orton 2006-10-02 09:21:47 UTC

That's mod_traptimer catching exactly the bad state we're looking for!

Can you bzip2-up and attach the core file from one of those segfaults?

(the "rv" thing is just gdb being confused, ignore that)

Comment 15 Mark Slemko 2006-10-02 17:11:27 UTC

Created attachment 137568 [details]
core dump with the 'rv' error found

Hi,
you're insight is quite a bit better! ;-) Cool...
I've attached a core dump example. (I have many and it is rather significant in
size: 1.2MB)

I hope this is the bees knees!

Comment 16 Mark Slemko 2006-10-25 17:39:43 UTC

Created attachment 139369 [details]
Another core dump

An additional core dump that might help in sorting through this.
Let me know if you need anything.

Comment 17 Mark Slemko 2006-11-29 18:08:34 UTC

Updated from FC5 to FC6 
- problem has been resolved.

Comment 18 Mark Slemko 2006-12-02 02:52:44 UTC

Oops, I replied too soon.
core dumps are still happening in FC6 - just didn't see anything until larger
xfers I guess.

Comment 19 Bug Zapper 2008-04-04 03:34:32 UTC

Fedora apologizes that these issues have not been resolved yet. We're
sorry it's taken so long for your bug to be properly triaged and acted
on. We appreciate the time you took to report this issue and want to
make sure no important bugs slip through the cracks.

If you're currently running a version of Fedora Core between 1 and 6,
please note that Fedora no longer maintains these releases. We strongly
encourage you to upgrade to a current Fedora release. In order to
refocus our efforts as a project we are flagging all of the open bugs
for releases which are no longer maintained and closing them.
http://fedoraproject.org/wiki/LifeCycle/EOL

If this bug is still open against Fedora Core 1 through 6, thirty days
from now, it will be closed 'WONTFIX'. If you can reporduce this bug in
the latest Fedora version, please change to the respective version. If
you are unable to do this, please add a comment to this bug requesting
the change.

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we are following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

And if you'd like to join the bug triage team to help make things
better, check out http://fedoraproject.org/wiki/BugZappers

Comment 20 Bug Zapper 2008-05-06 16:15:22 UTC

This bug is open for a Fedora version that is no longer maintained and
will not be fixed by Fedora. Therefore we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen thus bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Note You need to log in before you can comment on or make changes to this bug.