1160603 – urllib.parse.urlsplit() and http:// isn't quite right

Bug 1160603 - urllib.parse.urlsplit() and http:// isn't quite right

Summary: urllib.parse.urlsplit() and http:// isn't quite right

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	python3
Sub Component:
Version:	22
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Assignee:	Charalampos Stratakis
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2014-11-05 08:55 UTC by Alexander Todorov
Modified:	2016-07-19 20:52 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2016-07-19 20:52:11 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Python	22891	0	None	None	None	Never

Description Alexander Todorov 2014-11-05 08:55:49 UTC

Description of problem:

In the urllib.parse (or urlparse on Python 2.X) module there is this function:

    157 def urlsplit(url, scheme='', allow_fragments=True):
    158     """Parse a URL into 5 components:
    159     <scheme>://<netloc>/<path>?<query>#<fragment>
    160     Return a 5-tuple: (scheme, netloc, path, query, fragment).
    161     Note that we don't break the components up in smaller bits
    162     (e.g. netloc is a single string) and we don't expand % escapes."""
    163     allow_fragments = bool(allow_fragments)
    164     key = url, scheme, allow_fragments, type(url), type(scheme)
    165     cached = _parse_cache.get(key, None)
    166     if cached:
    167         return cached
    168     if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth
    169         clear_cache()
    170     netloc = query = fragment = ''
    171     i = url.find(':')
    172     if i > 0:
    173         if url[:i] == 'http': # optimize the common case
    174             scheme = url[:i].lower()
    175             url = url[i+1:]
    176             if url[:2] == '//':
    177                 netloc, url = _splitnetloc(url, 2)
    178             if allow_fragments and '#' in url:
    179                 url, fragment = url.split('#', 1)
    180             if '?' in url:
    181                 url, query = url.split('?', 1)
    182             v = SplitResult(scheme, netloc, url, query, fragment)
    183             _parse_cache[key] = v
    184             return v
    185         for c in url[:i]:
    186             if c not in scheme_chars:
    187                 break
    188         else:
    189             scheme, url = url[:i].lower(), url[i+1:]
    190 
    191     if url[:2] == '//':
    192         netloc, url = _splitnetloc(url, 2)
    193     if allow_fragments and '#' in url:
    194         url, fragment = url.split('#', 1)
    195     if '?' in url:
    196         url, query = url.split('?', 1)
    197     v = SplitResult(scheme, netloc, url, query, fragment)
    198     _parse_cache[key] = v
    199     return v



There is an issue here (or a few of them) as follows:

* if url[:1] is already lowercase (equals "http") (line 173) then .lower() on  line 174 is reduntant:
174    scheme = url[:i].lower() # <--- no need for .lower() b/c value is "http"


* OTOH line 173 could refactor the condition and match URLs where the scheme is uppercase. For example
    173         if url[:i].lower() == 'http': # optimize the common case

* The code as is returns the same results (as far as I've tested it) for both:
urlsplit("http://github.com/atodorov/repo.git?param=value#myfragment")
urlsplit("HTTP://github.com/atodorov/repo.git?param=value#myfragment")
urlsplit("HTtP://github.com/atodorov/repo.git?param=value#myfragment")

but the last 2 invocations also go through lines 185-199


* Lines 174-184 are essentially the same as lines 189-199. The only optimization I can see is avoiding the for loop around lines 185-187 which checks for valid characters in the URL scheme and executes only a few loops b/c scheme names are quite short usually.


My personal vote goes for removal of lines 173-184. 


Version-Release number of selected component (if applicable):

This is present in both Python 3 and Python 2 on all versions I have access to:

python3-libs-3.4.1-16.fc21.x86_64.rpm
python-libs-2.7.8-5.fc21.x86_64.rpm
python-libs-2.7.5-16.el7.x86_64.rpm
python-libs-2.6.6-52.el6.x86_64


Lets resolve in Rawhide first and then I'll clone it for other versions/products.

Comment 1 Bohuslav "Slavek" Kabrda 2014-11-14 11:24:19 UTC

Just to make sure I understand this correctly, this is not about the function being broken. It's about refactoring/removing duplicated code/improving performance, is that right? If so, would you consider opening an upstream bug? I think you pretty much outlined the solution here and I wouldn't want to get credit for your solution (although I can do that if you insist).
I suggest you create upstream patch and link it here. Once it's accepted, I'll be happy to backport it to our downstream repos.

Comment 2 Jaroslav Reznik 2015-03-03 16:28:12 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 22 development cycle.
Changing version to '22'.

More information and reason for this action is here:
https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora22

Comment 3 Fedora Admin XMLRPC Client 2015-05-12 12:03:02 UTC

This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 4 Fedora Admin XMLRPC Client 2016-01-29 13:07:45 UTC

This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 5 Fedora End Of Life 2016-07-19 20:52:11 UTC

Fedora 22 changed to end-of-life (EOL) status on 2016-07-19. Fedora 22 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Note You need to log in before you can comment on or make changes to this bug.