Bug 650255

Summary: Curl does not handle FTP server timeout gracefully
Product: [Fedora] Fedora Reporter: Simon <number.cruncher>
Component: curlAssignee: Kamil Dudka <kdudka>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: low    
Version: 14CC: daniel, kdudka, paul
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: curl-7.20.1-5.fc13 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 651592 (view as bug list) Environment:
Last Closed: 2010-11-28 20:41:34 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 651592, 720253    
Attachments:
Description Flags
a testing client
none
output of the testing client
none
Client output connecting to VSFTPD server
none
proposed fix
none
ext2 floppy image with latest libcurl (x86_64)
none
FTP timeout during anaconda with curl patch
none
follow-up fix, solution proposed by Daniel Stenberg
none
ext2 floppy image with latest libcurl (x86_64), including attachment #460042
none
Network trace following FTP timeout
none
minimal example
none
Simple client 460052 output
none
Simple client 460052 client output with patched libcurl none

Description Simon 2010-11-05 15:56:17 UTC
Description of problem:
If a Curl connection to an FTP server remains idle for a long time, the server may send a "421 Timeout". This should prompt Curl to drop the connection and re-connect for any subsequent transfers.

At the moment, Curl seems to ignore the 421 and gets confused when later trying to send request to the previous command channel.


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. If possible, adjust timeout for FTP server. E.g vsftpd has an idle_session_timeout. Make this small e.g. 60 seconds.

2. Start a Fedora 14 installation via an FTP server. Wait at the package selection page for longer than the timeout.

3.
  
Actual results:
Installer fails to download filelists.xml.gz due to timeout problem.

Expected results:
Curl re-connects after server error.


Additional info:

Comment 1 Kamil Dudka 2010-11-09 13:20:26 UTC
I was playing with that and it seems to work in most cases.  I set up a testing FTP server (proftpd) with idle timeout set to 10:

# grep Idle /etc/proftpd.conf
TimeoutIdle 10

Then I prepared a testing client to repeat the issue.  It performs two FTP downloads using a specified delay between them.  If the delay is 8 or 12, it works fine.  If the delay is 10, it does not.  Am I observing the same issue as you with Anaconda?

Comment 2 Kamil Dudka 2010-11-09 13:22:41 UTC
Created attachment 459098 [details]
a testing client

Comment 3 Kamil Dudka 2010-11-09 13:30:22 UTC
Created attachment 459099 [details]
output of the testing client

taken as;

$ curl -sJO 'https://bugzilla.redhat.com/attachment.cgi?id=459098'
$ sh bz650255.c 2>&1 | tee bz650255.log

Comment 4 Kamil Dudka 2010-11-09 13:32:27 UTC
Comment on attachment 459098 [details]
a testing client

Oops, one typo in there, but it still works:

s/exho/echo/

Comment 5 Kamil Dudka 2010-11-09 13:42:58 UTC
Comment on attachment 459099 [details]
output of the testing client

The following part of the log is interesting for us:

> * Re-using existing connection! (#0) with host localhost
> * Connected to localhost (127.0.0.1) port 21 (#0)
> * Request has same path as previous transfer
> > EPSV
> * Connect data stream passively
> < 421 Idle timeout (10 seconds): closing control connection
> * disabling EPSV usage
> > PASV
> * Recv failure: Connection reset by peer
> * Closing connection #0
> * Failure when receiving data from the peer
> + exit 56

Comment 6 Simon 2010-11-09 13:48:04 UTC
Created attachment 459106 [details]
Client output connecting to VSFTPD server

Output when re-connecting before timeout.

Comment 7 Kamil Dudka 2010-11-09 14:07:20 UTC
Could you please try to adjust the timeout, such that it reflects your FTP server's timeout?

diff --git a/bz650255.c b/bz650255.c
index ca11705..89de371 100644
--- a/bz650255.c
+++ b/bz650255.c
@@ -1,12 +1,12 @@
 #if 0
 gcc -pedantic -Wall -Wextra -O0 -g bz650255.c -lcurl || exit $?
-./a.out 8  || exit $?
-./a.out 12 || exit $?
+./a.out 58 || exit $?
+./a.out 62 || exit $?
 echo
 echo === still OK ===
-exho
+echo
 set -x
-./a.out 10
+./a.out 60
 exit $?
 #endif

Comment 8 Kamil Dudka 2010-11-09 16:11:06 UTC
Created attachment 459170 [details]
proposed fix

Comment 9 Kamil Dudka 2010-11-09 16:19:41 UTC
The attached fix at least helps to return a more appropriate error code in case of the timeout on server's side.  I am not sure if libcurl itself should try to reconnect in that case...

Daniel, what do you think about the issue?

Is the behavior somehow specified in libcurl's API?

Comment 10 Simon 2010-11-09 16:38:04 UTC
I am having a hard time reproducing this with your client code. Perhaps the anaconda code is taking a different code path? Can't fault the above fix, though!

Comment 11 Kamil Dudka 2010-11-09 16:46:42 UTC
Well, let's fix the issue with proftpd and my reproducer for now.  Then we can retest with Anaconda.

Based on your description, I suspect that Anaconda holds a libcurl handle across all the user interaction during the package selection.  To be honest, I don't know if we are supposed to handle such cases transparently within libcurl and silently reconnect on server's idle timeout ... but the response in case of my reproducer looks wrong to me.

Comment 12 Daniel Stenberg 2010-11-09 17:01:00 UTC
(In reply to comment #9)
> The attached fix at least helps to return a more appropriate error code in case
> of the timeout on server's side.  I am not sure if libcurl itself should try to
> reconnect in that case...
> 
> Daniel, what do you think about the issue?
> 
> Is the behavior somehow specified in libcurl's API?

I think that's a fair approach. 421 is documented in RFC959 to be exactly "we're closing down the control connection" and it can come at any time so it makes sense to put the logic in the common response reader function.

I don't think libcurl needs to/should reconnect by itself.

Please go ahead and commit this upstream as well!

Comment 13 Kamil Dudka 2010-11-09 17:24:51 UTC
Daniel, thanks for looking at the issue.  I've just pushed the patch upstream:

https://github.com/bagder/curl/commit/12b2412

Anaconda developers have IMO two choices:

1) avoid holding the libcurl handle during packages selection

2) pick up the error code and reconnect on their own responsibility eventually

Comment 14 Kamil Dudka 2010-11-09 17:33:16 UTC
fixed in curl-7.21.2-4.fc15

Comment 15 Chris Lumens 2010-11-09 18:12:13 UTC
Such a change would need to be made somewhere farther down in the pile, like either in yum or python-urlgrabber.  anaconda does not directly work with libcurl in the package downloading portion, aside from resetting the name server cache stuff.

Comment 16 Simon 2010-11-10 08:58:43 UTC
What really frustrated me was the offer to "Retry" the original FTP retrieval with no success. libcurl wasn't actually sending a new, clean request. See https://bugzilla.redhat.com/show_bug.cgi?id=649347#c9

Perhaps this latest patch would fix this behaviour, i.e. timeout cause initial retrieval failure, but subsequent "Retry" in anaconda would then succeed. This would be an acceptable behaviour.

Is there any chance of a new updates image for libcurl, that I could test against my F14 install?

Comment 17 Daniel Stenberg 2010-11-10 09:07:35 UTC
If that kind of behavior is _not_ fixed with the patch mentioned in here, then there's a separate problem that needs to be dealt with as well...

Comment 18 Kamil Dudka 2010-11-10 09:51:54 UTC
Created attachment 459386 [details]
ext2 floppy image with latest libcurl (x86_64)

(In reply to comment #16)
> Is there any chance of a new updates image for libcurl, that I could test
> against my F14 install?

Sure thing, testing is appreciated!

Comment 19 Simon 2010-11-10 14:16:13 UTC
Created attachment 459453 [details]
FTP timeout during anaconda with curl patch

Still getting exactly the same observed behaviour as originally.

The network trace shows the server sending 421 followed by a TCP connection closure (FIN) which is duly ACK'd by the client TCP stack.

When the curl lib is invoked later, it sends a QUIT to the server *on a port which had previously ACK'd closure* (50816 src port in the full trace, but not shown in the FTP Request: QUIT packet).

The server tries to send a response to QUIT, but gets a connection reset.

I'm no TCP expert; but perhaps the application layer (libcurl) should just close the socket upon a 421 and not try to send new data.

Comment 20 Simon 2010-11-10 14:17:28 UTC
I should add that the "Retry" fails (hence multiple connections with no request in https://bugzilla.redhat.com/attachment.cgi?id=459453 )

Comment 21 Daniel Stenberg 2010-11-10 14:40:51 UTC
I've just added test case 1120 to the curl test suite upstreanm but it doesn't show that behavior...

Comment 22 Kamil Dudka 2010-11-12 13:03:30 UTC
Created attachment 460042 [details]
follow-up fix, solution proposed by Daniel Stenberg

Thank you for the testing.  I've attached yet another patch on top of that, which may fix it.  Unfortunately I wasn't successful in reproducing the bug myself with proftpd.  A new updates image for Anaconda will be available shortly.

Comment 23 Kamil Dudka 2010-11-12 14:06:02 UTC
Created attachment 460052 [details]
ext2 floppy image with latest libcurl (x86_64), including attachment #460042 [details]

Comment 24 Simon 2010-11-12 16:32:35 UTC
Created attachment 460092 [details]
Network trace following FTP timeout

Still no luck. Different behaviour (no QUIT sent), but still no successful retry. Is this an anaconda problem, i.e. what gets called when the GUI Retry button is pressed?

Comment 25 Kamil Dudka 2010-11-12 16:41:53 UTC
(In reply to comment #24)
> Still no luck. Different behaviour (no QUIT sent), but still no successful
> retry. Is this an anaconda problem, i.e. what gets called when the GUI Retry
> button is pressed?

Chris, could you please answer ^^^?

Comment 26 Chris Lumens 2010-11-12 16:59:22 UTC
There are several places where there's a Retry button for fetching metadata, and in all of them we just jump right back into yum.  That is, there's nothing going on in anaconda here.  Let's look at them all:

    def doGroupSetup(self, anaconda):
        while True:
            try:
                self.ayum.doGroupSetup()
            except (GroupsError, NoSuchGroup, RepoError) as e:
                buttons = [_("_Exit installer"), _("_Retry")]
                log.error("Unable to read group information: %s" % e)
            else:
                break # success

            rc = anaconda.intf.messageWindow(_("Error"),
                                        _("Unable to read group information "
                                          "from repositories.  This is "
                                          "a problem with the generation "
                                          "of your install tree."),
                                        type="custom", custom_icon="error",
                                        custom_buttons = buttons)
            if rc == 0:
                sys.exit(0)
            else:
                self.ayum._setGroups(None)
                continue

There, self.ayum.doGroupSetup is just the superclass method in yum.

            while True:
                try:
                    fn(repo)
                    if callback:
                        callback.disconnect()
                except RepoError, e:
                    if callback:
                        callback.disconnect()
                    buttons = [_("_Exit installer"), _("Edit"), _("_Retry")]
                else:
                    break # success

                if anaconda.ksdata:
                    buttons.append(_("_Continue"))

                if not fatalerrors:
                    raise RepoError, e

                rc = anaconda.intf.messageWindow(_("Error"),
                                   _("Unable to read package metadata. This may be "
                                     "due to a missing repodata directory.  Please "
                                     "ensure that your install tree has been "
                                     "correctly generated.\n\n%s" % e),
                                     type="custom", custom_icon="error",
                                     custom_buttons=buttons)
                if rc == 0:
                    # abort
                    sys.exit(0)
                elif rc == 1:
                    # edit
                    anaconda.intf.editRepoWindow(repo)
                    break
                elif rc == 2:
                    # retry, but only if button is present
                    continue
                else:
                    # continue, but only if button is present
                    self.ayum.repos.delete(repo.id)
                    break

Here, fn is one of self.ayum.doRepoSetup or self.ayum.doSackSetup.  So on Retry, we just jump back and call whichever of those functions we're processing again.  And again, those are just the superclass's methods in yum.

        while True:
            try:
                (code, msgs) = self.ayum.buildTransaction()
                 ...
            except RepoError, e:
                # FIXME: would be nice to be able to recover here
                rc = anaconda.intf.messageWindow(_("Error"),
                               _("Unable to read package metadata. This may be "
                                 "due to a missing repodata directory.  Please "
                                 "ensure that your install tree has been "
                                 "correctly generated.\n\n%s" % e),
                                 type="custom", custom_icon="error",
                                 custom_buttons=[_("_Exit installer"), _("_Retry")])
                dscb.pop()

                if rc == 0:
                    sys.exit(0)
                else:
                    continue

And this last one does the same thing - we just jump back and retry calling the superclass's method again.  anaconda doesn't actually do anything directly for downloading.  It's all in yum or urlgrabber.

Comment 27 Kamil Dudka 2010-11-12 17:26:32 UTC
Created attachment 460111 [details]
minimal example

Simon, could you please try this minimal example against your server and attach its output?  You need to either set the timeout <16s, or increase the delay within the test.

Comment 28 seth vidal 2010-11-12 18:29:07 UTC
okay it's not anaconda. The trick is that anaconda is using the header-retrieving code in yum and nothing else does that anymore.


so it is definitely not code that's in anaconda -it's just code that only anaconda ends up using.

Comment 29 Kamil Dudka 2010-11-12 18:36:17 UTC
Seth, how are you confirming that the observed behavior (wrongly handled session idle timeout) is related to the "header-retrieving code"?

Comment 30 Simon 2010-11-15 11:23:24 UTC
Created attachment 460518 [details]
Simple client 460052 output

F14 libcurl behaviour following timeout (libcurl-7.21.0-5.fc14.x86_64)

Comment 31 Kamil Dudka 2010-11-15 12:00:39 UTC
F14 libcurl does not contain the fix.  Please try libcurl from the updates image:

# curl -JO 'https://bugzilla.redhat.com/attachment.cgi?id=460052'
# mkdir updates
# mount -o loop bz650255.img updates/
# curl -o bz650255.c 'https://bugzilla.redhat.com/attachment.cgi?id=460111'
# LD_PRELOAD=$PWD/updates/libcurl.so sh bz650255.c
# umount updates/

Comment 32 Simon 2010-11-15 13:59:11 UTC
Created attachment 460542 [details]
Simple client 460052 client output with patched libcurl

No discernible change. Seems to work fine here.

Comment 33 Kamil Dudka 2010-11-15 14:18:32 UTC
Comment on attachment 460542 [details]
Simple client 460052 client output with patched libcurl

>* Connection #0 seems to be dead!
>* Closing connection #0
>* About to connect() to ftp port 21 (#0)
>*   Trying 192.168.10.121... * connected
>* Connected to ftp (192.168.10.121) port 21 (#0)
>< 220-

The difference is at this point ^^^.  In contrast to F-14 libcurl, there is no QUIT sent.

Unfortunately it means that my test-case was not reflecting the Anaconda problem.

Comment 34 Kamil Dudka 2010-11-19 13:23:05 UTC
a more generic version of attachment #460042 [details] proposed upstream:

http://thread.gmane.org/gmane.comp.web.curl.library/29855

Comment 35 Fedora Update System 2010-11-26 14:31:31 UTC
curl-7.21.0-6.fc14 has been submitted as an update for Fedora 14.
https://admin.fedoraproject.org/updates/curl-7.21.0-6.fc14

Comment 36 Fedora Update System 2010-11-26 15:10:34 UTC
curl-7.20.1-5.fc13 has been submitted as an update for Fedora 13.
https://admin.fedoraproject.org/updates/curl-7.20.1-5.fc13

Comment 37 Fedora Update System 2010-11-26 21:04:52 UTC
curl-7.21.0-6.fc14 has been pushed to the Fedora 14 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update curl'.  You can provide feedback for this update here: https://admin.fedoraproject.org/updates/curl-7.21.0-6.fc14

Comment 38 Fedora Update System 2010-11-28 20:41:07 UTC
curl-7.21.0-6.fc14 has been pushed to the Fedora 14 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 39 Fedora Update System 2010-12-03 20:38:41 UTC
curl-7.20.1-5.fc13 has been pushed to the Fedora 13 stable repository.  If problems still persist, please make note of it in this bug report.