Bug 1204825
| Summary: | [RFE] urlgrabber.urlread() is not thread-safe |
|---|---|
| Product: | Red Hat Enterprise Linux 7 |
| Component: | python-urlgrabber |
| Version: | 7.1 |
| Status: | CLOSED ERRATA |
| Reporter: | Adrian Reber <adrian> |
| Assignee: | Michal Domonkos <mdomonko> |
| QA Contact: | Jan Kepler <jkejda> |
| CC: | adrian, bnater, egolov, kevin, mbanas, mdomonko, vmukhame |
| Severity: | unspecified |
| Priority: | unspecified |
| Target Milestone: | rc |
| Target Release: | --- |
| Keywords: | FutureFeature, Patch |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Fixed In Version: | python-urlgrabber-3.10-9.el7 |
| Type: | Bug |
| Last Closed: | 2018-10-30 10:15:54 UTC |
| Bug Blocks: | 1112660, 1549618 |
Description
Adrian Reber
2015-03-23 14:56:06 UTC
(In reply to Adrian Reber from comment #0)

> Description of problem:
> On one of the Fedora infrastructure machines (mm-crawler01.stg.phx2.fedoraproject.org) the new MirrorManager implementation is installed. To check whether all Fedora mirrors are up to date, it periodically scans the existing mirrors. It does this using rsync, http, and ftp.

Where can we get the source code of the crawler in question?

> This new crawler is now multithreaded instead of forking a crawler for each host. Unfortunately, this new design seems to trigger a segfault in libcurl.

Please make sure that libcurl is used in the supported way; that is, never share the same handle in multiple threads: http://curl.haxx.se/docs/faq.html#Is_libcurl_thread_safe

> Unfortunately I have no debuginfo packages installed. I have the core file and can distribute it if you need.

Please do.

LD_PRELOAD-ing the libcurl from Fedora 21 resolves this bug, by the way. Something between the curl version in RHEL 7.1 and F21 seems to fix this problem.

Source code: https://github.com/fedora-infra/mirrormanager2/blob/master/utility/mm2_crawler

I will send you a link to the core file, as I am not sure what kind of information is stored in it.

Thank you for providing the backtrace. It crashes in Curl_retry_request():
1851│ if(/* workaround for broken TLS servers */ data->state.ssl_connect_retry ||
1852│ ((data->req.bytecount +
1853│ data->req.headerbytecount == 0) &&
1854│ conn->bits.reuse &&
1855│ !data->set.opt_no_body &&
1856│ data->set.rtspreq != RTSPREQ_RECEIVE)) {
1857│ /* We got no data, we attempted to re-use a connection and yet we want a
1858│ "body". This might happen if the connection was left alive when we were
1859│ done using it before, but that was closed when we wanted to read from
1860│ it again. Bad luck. Retry the same request on a fresh connect! */
1861│ infof(conn->data, "Connection died, retrying a fresh connect\n");
1862│ *url = strdup(conn->data->change.url);
1863│ if(!*url)
1864│ return CURLE_OUT_OF_MEMORY;
1865│
1866│ conn->bits.close = TRUE; /* close this connection */
1867│ conn->bits.retry = TRUE; /* mark this as a connection we're about
1868│ to retry. Marking it this way should
1869│ prevent i.e HTTP transfers to return
1870│ error just because nothing has been
1871│ transferred! */
1872│
1873│
1874├> if((conn->handler->protocol&CURLPROTO_HTTP) &&
1875│ data->state.proto.http->writebytecount)
1876│ return Curl_readrewind(conn);
1877│ }
1878│ return CURLE_OK;
1879│ }
At line 1875 of transfer.c, 'data->state.proto.http' is NULL, and essentially all of the content of *data appears to have been zeroed out.
The function was called from multi_runsingle():
1437│ k = &data->req;
1438│
1439│ if(!(k->keepon & KEEP_RECV)) {
1440│ /* We're done receiving */
1441│ easy->easy_conn->readchannel_inuse = FALSE;
1442│ }
1443│
1444│ if(!(k->keepon & KEEP_SEND)) {
1445│ /* We're done sending */
1446│ easy->easy_conn->writechannel_inuse = FALSE;
1447│ }
1448│
1449│ if(done || (easy->result == CURLE_RECV_ERROR)) {
1450│ /* If CURLE_RECV_ERROR happens early enough, we assume it was a race
1451│ * condition and the server closed the re-used connection exactly when
1452│ * we wanted to use it, so figure out if that is indeed the case.
1453│ */
1454├> CURLcode ret = Curl_retry_request(easy->easy_conn, &newurl);
1455│ if(!ret)
1456│ retry = (newurl)?TRUE:FALSE;
1457│
1458│ if(retry) {
1459│ /* if we are to retry, set the result to OK and consider the
1460│ request as done */
1461│ easy->result = CURLE_OK;
1462│ done = TRUE;
1463│ }
1464│ }
At line 1454 of multi.c, done is TRUE and easy->result is CURLE_OK.
The backtrace looks like this:
#0 Curl_retry_request (conn=0x7f7ed471bd30, ...) at transfer.c:1874
#1 multi_runsingle (multi=0x7f7f0400cea0, ...) at multi.c:1454
#2 curl_multi_perform (multi_handle=0x7f7f0400cea0, ...) at multi.c:1713
#3 curl_easy_perform (easy=0x1be7630) at easy.c:482
#4 do_curl_perform (self=0x1b9f8c0) at src/pycurl.c:1019
#5 call_function (...) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4082
There are 105 other threads executing (or blocked in) waitpid(), sem_wait(), recv(), connect(), epoll_wait(), or mprotect().
I see no obvious bug fixes related to this issue in the upstream git repository. The code causing the SIGSEGV has been changed, but the change is not documented as a bug fix (and at first glance does not look like it would fix the bug):
https://github.com/bagder/curl/commit/e79535bc
Instead of LD_PRELOAD-ing, I rebuilt curl from the Fedora master branch on the epel7 branch, and that also works: http://koji.fedoraproject.org/koji/taskinfo?taskID=9391121

curl from RHEL 7.1 still fails every time, after about 90 seconds. The cronjob starts at 00:00 and 12:00:

[Tue Mar 31 00:01:40 2015] mm2_crawler[4358]: segfault at 30 ip 00007f58062e2557 sp 00007f576a7f1250 error 4 in libcurl.so.4.3.0[7f58062b9000+63000]
[Tue Mar 31 12:01:20 2015] mm2_crawler[9945]: segfault at 30 ip 00007fc9c1db5557 sp 00007fc9aa7f9250 error 4 in libcurl.so.4.3.0[7fc9c1d8c000+63000]
[Wed Apr 1 00:01:33 2015] mm2_crawler[10428]: segfault at 30 ip 00007f22b7ede557 sp 00007f22adf84250 error 4 in libcurl.so.4.3.0[7f22b7eb5000+63000]

If you have a self-contained script that decides whether a version of curl is "good" or "bad" in a finite amount of time, we can use git-bisect to find the upstream commit that turns the "bad" behavior into the "good" behavior.

I have now been running bisect for the last three days, and unfortunately it is not always easy to decide whether a build fails or not. This will probably take a couple more days. I now have additional segfaults which seem to be in libc:

[Fri Apr 3 18:42:33 2015] python[6141]: segfault at 0 ip 00007f0a3810c811 sp 00007f098bff4238 error 4 in libc-2.17.so[7f0a38086000+1b6000]
[Fri Apr 3 20:07:42 2015] python[20690]: segfault at 0 ip 00007f801287e811 sp 00007f7f5afda238 error 4 in libc-2.17.so[7f80127f8000+1b6000]

So I am not sure what this means for the curl problem I am bisecting.

(In reply to Adrian Reber from comment #7)
> I have now been running bisect for the last three days, and unfortunately it is not always easy to decide whether a build fails or not. This will probably take a couple more days.

Thank you for analyzing it further! Wishing you good luck with git-bisect!

> I now have additional segfaults which seem to be in libc:
>
> [Fri Apr 3 18:42:33 2015] python[6141]: segfault at 0 ip 00007f0a3810c811 sp 00007f098bff4238 error 4 in libc-2.17.so[7f0a38086000+1b6000]
> [Fri Apr 3 20:07:42 2015] python[20690]: segfault at 0 ip 00007f801287e811 sp 00007f7f5afda238 error 4 in libc-2.17.so[7f80127f8000+1b6000]
>
> So I am not sure what this means for the curl problem I am bisecting.

The above lines do not say much about the actual cause. If there is a memory corruption inside libcurl, it can also result in a SIGSEGV in libc and abnormal termination of the python process. However, we would need more info to confirm such a hypothesis.

I have finished a bisect run where I treated all segfaults (libc and libcurl) the same, and if a build did not segfault I ran the crawler up to 5 times. Eventually I got a segfault with every curl version I tested. So this error has not been fixed, even though the libcurl.so from Fedora 21 does not crash; it is probably just very hard to trigger. I bisected curl between 7.29 and yesterday's git. Sometimes it takes a couple of hours to trigger the segfault. I do not know how to debug this further.

Thank you for debugging this! Did you run git-bisect on Fedora or the upstream git repository? If upstream, are you sure that the same configure options and compiler flags were used as with rpmbuild?

I was running git bisect on the upstream repository.
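To make the good/bad test concrete: git bisect run can drive exactly this kind of check, classifying each commit by the exit status of a script. A minimal sketch of such a helper, assuming the reproducer is saved as crash_test.py (a hypothetical name) and mirroring the up-to-five-runs procedure described above:

```python
# bisect_run.py -- sketch of a "git bisect run" helper; crash_test.py
# is a hypothetical name for the reproducer attached to this bug.
import signal
import subprocess
import sys

for _ in range(5):                          # re-run up to 5 times
    rc = subprocess.call(['python', 'crash_test.py'])
    if rc == -signal.SIGSEGV:               # child died with SIGSEGV
        sys.exit(1)                         # tell git bisect: "bad"
sys.exit(0)                                 # never crashed: "good"
```

After marking one good and one bad commit, it would be invoked as `git bisect run python bisect_run.py`.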
The following is the output of git bisect log:

# bad: [992a731116fc9134d4a0acf17fe10219917ecb30] test142[23]: verify that an empty file is stored on success
# good: [bf633a584dcbb0f80273ba856b7198ad1e395315] vms: config-vms.h is removed, no use trying to distribute it
git bisect start 'master' 'bf633a584dcbb0f80273ba856b7198ad1e395315'
# good: [75e996f29f1855d47299cf29f96507cd78d5aff1] tool: Moved --progress-bar to the global config
git bisect good 75e996f29f1855d47299cf29f96507cd78d5aff1
# good: [2e121fda355ecc94e155d27adbd21525aa60fdba] sasl_sspi: Fixed typo in comment
git bisect good 2e121fda355ecc94e155d27adbd21525aa60fdba
# good: [62a018762e081a679534a19c3b41fcf597de64ec] sockfilt.c: Replace 100ms sleep with thread throttle
git bisect good 62a018762e081a679534a19c3b41fcf597de64ec
# good: [0daf1ef7299dcd4755a75d6b9342739da6be7244] curl.1: clarify that -X is used for all requests
git bisect good 0daf1ef7299dcd4755a75d6b9342739da6be7244
# good: [76afe14584a1e6377663bbf6f0587981a686b615] CURLOPT_STDERR.3: added an example
git bisect good 76afe14584a1e6377663bbf6f0587981a686b615
# good: [e438a9e2f022819c1023b6fa9daf4b37bf0b8111] CURLOPT_PATH_AS_IS.3: add type 'long' to prototype
git bisect good e438a9e2f022819c1023b6fa9daf4b37bf0b8111
# good: [ae3c985060d7d5fd7a72d7dcb0b6b176f3c972b2] tool_operate: only set SSL options if SSL is enabled
git bisect good ae3c985060d7d5fd7a72d7dcb0b6b176f3c972b2
# good: [f203edc544e1fb902fbc950e47d04e1505c594de] cyassl: Set minimum protocol version before CTX callback
git bisect good f203edc544e1fb902fbc950e47d04e1505c594de
# good: [f2a0b2164a1cdeaa806debbc3d0b46cfe04976e9] checksrc.bat: quotes to support an SRC_DIR with spaces
git bisect good f2a0b2164a1cdeaa806debbc3d0b46cfe04976e9
# good: [c3101ae287fcfc420bdd816f1eaf39c8dc9b242b] x509asn1: Silence x64 loss-of-data warning on RSA key length assignment
git bisect good c3101ae287fcfc420bdd816f1eaf39c8dc9b242b
# good: [f251417d85d232605ca86e9562a64500c67ccdee] src/tool_cb_wrt: separate fnc for output file creation
git bisect good f251417d85d232605ca86e9562a64500c67ccdee
# good: [261a0fedcf1545440190965311a1554d7695b6c0] src/tool_operate: create output file on successful download
git bisect good 261a0fedcf1545440190965311a1554d7695b6c0
# first bad commit: [992a731116fc9134d4a0acf17fe10219917ecb30] test142[23]: verify that an empty file is stored on success

I used the following configure options on RHEL 7 to build curl from git:

../configure --disable-static --enable-symbol-hiding --enable-ipv6 --enable-ldaps --enable-manual --enable-threaded-resolver --with-ca-bundle=/etc/pki/tls/certs/ca-bundle.crt --with-gssapi --with-libidn --with-libmetalink --with-libssh2 --without-ssl --with-nss --disable-sspi

Those were the options I copied from the Fedora rawhide curl spec file. I have logfiles from all builds and test runs if you need to look at the configure or make output.

If all commits were good (meaning a crash occurred), then the output of git-bisect says nothing. Now the question is why it works with the F21 libcurl. Does it really work? If so, I would suggest checking out the curl-7_37_0 tag in the upstream git repository, building it, and checking whether it suffers from this bug. If it does, we need to check the downstream patches.

Also, are you sure that you had all _build_ dependencies of curl installed when you built the upstream curl from sources? Note that libmetalink-devel is not available on RHEL, but that would likely make no difference regarding this bug.

I am pretty sure I had all dependencies installed.
libmetalink-devel was from EPEL: libmetalink-0.1.2-4.el7.x86_64

This is the result of configure:

curl version:     7.36.0-DEV
Host setup:       x86_64-unknown-linux-gnu
Install prefix:   /usr/local
Compiler:         gcc
SSL support:      enabled (NSS)
SSH support:      enabled (libSSH2)
zlib support:     enabled
GSSAPI support:   enabled (MIT/Heimdal)
SPNEGO support:   no (--with-spnego)
TLS-SRP support:  no (--enable-tls-srp)
resolver:         POSIX threaded
ipv6 support:     enabled
IDN support:      enabled
Build libcurl:    Shared=yes, Static=no
Built-in manual:  enabled
--libcurl option: enabled (--disable-libcurl-option)
Verbose errors:   enabled (--disable-verbose)
SSPI support:     no (--enable-sspi)
ca cert bundle:   /etc/pki/tls/certs/ca-bundle.crt
ca cert path:     no
LDAP support:     enabled (OpenLDAP)
LDAPS support:    enabled
RTSP support:     enabled
RTMP support:     no (--with-librtmp)
metalink support: enabled
HTTP2 support:    disabled (--with-nghttp2)
Protocols:        DICT FILE FTP FTPS GOPHER HTTP HTTPS IMAP IMAPS LDAP LDAPS POP3 POP3S RTSP SCP SFTP SMTP SMTPS TELNET TFTP

I am now running my test against the curl-7_37_0 tag to see if I can reproduce the error.

(In reply to Adrian Reber from comment #14) Looks good. Thanks for the confirmation!

It took some tries, but I was able to crash it using the build from the curl-7_37_0 tag. I was now also able to trigger a segfault in the Fedora 21 version of curl, so this error seems to exist in all versions of curl. With some versions it segfaults after a few minutes; other versions take a couple of hours.

My assumption that this was a bug in libcurl was based on your statement that the bug does not occur with a different version of libcurl. Now that you say all versions of libcurl suffer from it, it looks rather like a bug somewhere higher in the application stack. Other people use libcurl in multi-threaded applications with no issues. In any case, a minimal example would be nice...

I was able to create a "simple" version of the crash. It does not need a database, only a list of URLs as input. I will attach the python script and the URL list to this bug later. It does not really look like a bug in curl.
Most of the time, the crawler only does HEAD requests using Python's httplib. Only for repomd.xml files does it switch to urlgrabber, downloading the file to calculate its hash:
s = urlgrabber.urlread(url)
sha256 = hashlib.sha256(s).hexdigest()
Without this part it does not crash; with it, it segfaults as described above. It should be easy to replace this code with something that does not use urlgrabber. Maybe urlgrabber is not thread-safe. Thanks for your help.
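For illustration, a minimal sketch of a urlgrabber-free replacement for the two lines above, using only the Python 2 standard library (the function name is made up, not taken from the crawler). urllib2 keeps no shared global handle, so concurrent calls from worker threads do not step on each other:

```python
# Sketch: replace urlgrabber.urlread() with urllib2, which opens a
# fresh connection per call and shares no state between threads.
import hashlib
import urllib2

def sha256_of_url(url, timeout=30):
    s = urllib2.urlopen(url, timeout=timeout).read()
    return hashlib.sha256(s).hexdigest()
```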
Created attachment 1013122 [details]
crash_test script
Created attachment 1013123 [details]
input file
Thanks for the test script! I can confirm that it indeed crashes from time to time, but it is difficult to debug. For example, the last backtrace did not go through libcurl at all:

#0  0x0000003611073c8b in list_dealloc () from /lib64/libpython2.7.so.1.0
#1  0x000000361108004b in dict_dealloc () from /lib64/libpython2.7.so.1.0
#2  0x00000036110584a3 in instance_dealloc () from /lib64/libpython2.7.so.1.0
#3  0x000000361108004b in dict_dealloc () from /lib64/libpython2.7.so.1.0
#4  0x00000036110584a3 in instance_dealloc () from /lib64/libpython2.7.so.1.0
#5  0x00000036110ddace in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#6  0x00000036110e3400 in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#7  0x000000361106f5dc in function_call () from /lib64/libpython2.7.so.1.0
#8  0x000000361104a903 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#9  0x00000036110dc4c7 in PyEval_CallObjectWithKeywords () from /lib64/libpython2.7.so.1.0
#10 0x00000036110d83a8 in builtin_map () from /lib64/libpython2.7.so.1.0
#11 0x00000036110e21ff in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#12 0x00000036110e3400 in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#13 0x000000361106f6bd in function_call () from /lib64/libpython2.7.so.1.0
#14 0x000000361104a903 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#15 0x00000036110e0050 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#16 0x00000036110e3400 in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#17 0x000000361106f6bd in function_call () from /lib64/libpython2.7.so.1.0
#18 0x000000361104a903 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#19 0x00000036110e0050 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#20 0x00000036110e1be6 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#21 0x00000036110e1be6 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#22 0x00000036110e3400 in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#23 0x000000361106f5dc in function_call () from /lib64/libpython2.7.so.1.0
#24 0x000000361104a903 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#25 0x0000003611059815 in instancemethod_call () from /lib64/libpython2.7.so.1.0
#26 0x000000361104a903 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#27 0x00000036110dc4c7 in PyEval_CallObjectWithKeywords () from /lib64/libpython2.7.so.1.0
#28 0x0000003611112062 in t_bootstrap () from /lib64/libpython2.7.so.1.0
#29 0x000000360f00752a in start_thread (arg=0x7ffeca7fc700) at pthread_create.c:310
#30 0x000000360e50022d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Would it be possible to modify it such that it uses solely libcurl for network I/O?

Created attachment 1013256 [details]
crash_test using pycurl
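As the next comment explains, the updated script issues its HEAD requests through pycurl instead of httplib. A minimal sketch of such a request, assuming one fresh easy handle per call (the function name is illustrative, not taken from the attachment):

```python
# Sketch: a HEAD request via pycurl; creating a new handle per call
# means no handle is ever shared across threads.
import pycurl

def head_status(url):
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.NOBODY, True)          # HEAD: headers only
    c.setopt(pycurl.FOLLOWLOCATION, True)  # follow mirror redirects
    try:
        c.perform()
        return c.getinfo(pycurl.RESPONSE_CODE)
    finally:
        c.close()
```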
I attached a new version which uses pycurl to make the HEAD requests. It still crashes when using urlgrabber (which is also curl-based, I think) to download the repo file for checksumming.

This took me quite a lot of time to debug, because I was looking for the cause in a completely wrong place -- bisecting libcurl up to the commit where it started to crash: https://github.com/bagder/curl/commit/058fb335

The minimal example was pretty helpful. I modified it to print the exceptions falling out of urlgrabber and saw that setopt() is being called _during_ the execution of perform(), which was caught by pycurl. It reminded me of the FAQ entry I referred to in comment #2 -- curl handles are not allowed to be shared by multiple threads. Finally, I looked into the urlgrabber sources and saw this:

_curl_cache = pycurl.Curl() # make one and reuse it over and over and over
[...]
def _do_open(self):
    self.curl_obj = _curl_cache
    self.curl_obj.reset() # reset all old settings away, just in case

The above code clearly violates the requirement if urlgrabber.urlread() is called from multiple threads. After patching urlgrabber like this:

--- a/urlgrabber/grabber.py
+++ b/urlgrabber/grabber.py
@@ -1592,7 +1592,7 @@ class PyCurlFileObject(object):
             raise err
 
     def _do_open(self):
-        self.curl_obj = _curl_cache
+        self.curl_obj = _pycurl.Curl()
         self.curl_obj.reset() # reset all old settings away, just in case
         # setup any ranges
         self._set_opts()

... your example started to work reliably (and independently of the curl version). I would suggest simply dropping urlgrabber from your code; it is rarely useful these days.

Thanks for your analysis. The MirrorManager2 crawler has been rewritten to use urllib2 instead of urlgrabber. We are happy that we found the reason and that the crash is understood.

*** Bug 1178659 has been marked as a duplicate of this bug. ***

Created attachment 1223392 [details]
downstream patch
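As a side note on the design choice: the one-line patch above trades away handle reuse entirely. A hypothetical alternative (not the shipped downstream patch) that keeps reuse while still never sharing a handle across threads is a thread-local cache, sketched here under that assumption:

```python
# Hypothetical alternative to the patch above, not the downstream fix:
# cache one pycurl handle per thread via threading.local().
import threading
import pycurl

_tls = threading.local()

def get_curl():
    curl = getattr(_tls, 'curl', None)
    if curl is None:
        curl = _tls.curl = pycurl.Curl()  # first use in this thread
    curl.reset()  # drop settings left over from the previous request
    return curl
```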
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:3130