Bug 1254327

Summary: dnf makecache doesn't appear to have a network timeout
Product: Fedora
Component: dnf
Version: 21
Status: CLOSED EOL
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: John Sullivan <jsrhbz>
Assignee: Packaging Maintenance Team <packaging-team-maint>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: jsilhan, jsrhbz, jzeleny, mluscon, packaging-team-maint, pnemade, tim.lauridsen, vmukhame
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-12-02 15:00:00 UTC

Description John Sullivan 2015-08-17 17:47:21 UTC
dnf-0.6.4-5.fc21.noarch

Attempting a manual "dnf update" gave the following message and then stalled:

Waiting for process with pid 20324 to finish.

That process was the periodic makecache:

  PID TTY      STAT   TIME COMMAND
20324 ?        SNs    0:25 /usr/bin/python -OO /usr/bin/dnf -v makecache timer

stracing it:

Process 20324 attached
select(41, [38 39 40], [], [], {0, 32573}) = 0 (Timeout)
poll([{fd=38, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
poll([{fd=39, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
poll([{fd=40, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
select(41, [38 39 40], [], [], {1, 0})  = 0 (Timeout)
poll([{fd=38, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
poll([{fd=39, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
poll([{fd=40, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
select(41, [38 39 40], [], [], {1, 0})  = 0 (Timeout)
[...]

Those fds are:

lrwx------ 1 root root 64 Aug 15 03:33 40 -> socket:[12259455]
lrwx------ 1 root root 64 Aug 15 03:33 39 -> socket:[12259454]
lrwx------ 1 root root 64 Aug 15 03:33 38 -> socket:[12259453]

tcp6       0      0 <...>:44827 2001:67c:1220:e100::93e5:390:80 ESTABLISHED 0          12259455   20324/python        
tcp6       0      0 <...>:44825 2001:67c:1220:e100::93e5:390:80 ESTABLISHED 0          12259453   20324/python        
tcp6       0      0 <...>:44826 2001:67c:1220:e100::93e5:390:80 ESTABLISHED 0          12259454   20324/python       

The remote host resolves to mirror.cis.vutbr.cz.

This was at 2015-08-17 18:10, so it looks like it's been in that state for over 2.5 days. Given that it's reading from port 80, I guess it's sent an HTTP request and is waiting for a response.

Either the remote server is just sitting there doing nothing, or the remote end has gone away due to some network glitch and we never received a FIN or RST. In the latter case a receiver can hang indefinitely unless it implements some sort of timeout of its own: in the absence of SO_KEEPALIVE a TCP receiver doesn't *require* any packets to arrive, unlike a TCP sender, which must receive a timely ACK or it will abort the connection, so a receiver that missed the FIN/RST just waits forever.

(Or enables SO_KEEPALIVE on the socket, though the default timeout on that is still a little over 2 hours, which is a bit high for this kind of thing.)
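
To make the point concrete, here's a rough standalone Python sketch of both defences, an application-level read timeout and keepalive probes shorter than the system default. This is an illustration only, not dnf's actual download code, and the host name is a placeholder:

# Illustration only (not dnf's code): two ways a client can avoid blocking
# forever on a peer that silently vanished after the request was sent.
import socket

HOST, PORT = "mirror.example.org", 80   # placeholder mirror

# create_connection() also leaves the timeout set on the returned socket,
# so recv() below raises socket.timeout after 30s of silence instead of
# blocking indefinitely.
sock = socket.create_connection((HOST, PORT), timeout=30)

# Alternatively (or additionally), enable keepalive with probes much
# shorter than the ~2 hour system default, so a vanished peer is noticed
# after roughly 60 + 5*10 seconds (Linux-specific socket options).
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # failed probes before reset

request = "GET / HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n" % HOST
sock.sendall(request.encode("ascii"))
try:
    print(sock.recv(4096))
except socket.timeout:
    print("no response within 30s -- giving up instead of hanging")
finally:
    sock.close()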

I recovered this with "systemctl stop dnf-makecache.service", but I'd expect some reasonable kind of timeout to deal with duff mirrors and network glitches alike.

(Err, just noticed as well: why poll() each fd individually when the select() on all of them just timed out? The answer's really not likely to have changed...)

Comment 1 Michal Luscon 2015-09-01 14:35:12 UTC
This issue has been fixed in the F22 version of dnf. Would you like to request a backport?

Comment 2 John Sullivan 2015-10-12 15:44:16 UTC
Dunno. This machine may be on F21 for a while yet, so notionally yet.

OTOH I've effectively bypassed the cache by setting metadata_expire=1, so it updates when I ask it to. I may also disable dnf-makecache completely to avoid pointless background updates. Even with it running, now that I know what to look for I know how to recover (it hasn't happened to me again yet, but other people on F21 may appreciate not bumping into this at all).
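
Concretely, that workaround plus the recovery above look something like this, as a sketch: metadata_expire goes in /etc/dnf/dnf.conf under [main], and the periodic run is driven by dnf-makecache.timer, which triggers dnf-makecache.service.

# /etc/dnf/dnf.conf -- treat cached metadata as stale almost immediately,
# so "dnf update" refreshes it on demand instead of trusting makecache
[main]
metadata_expire=1

# recover from a hung makecache run, and stop the periodic job for good
systemctl stop dnf-makecache.service
systemctl disable dnf-makecache.timer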

My bigger issue atm is bug 1252106.

Comment 3 John Sullivan 2015-10-12 15:45:55 UTC
s/notionally yet/notionally yes/

Comment 4 Fedora End Of Life 2015-11-04 10:12:30 UTC
This message is a reminder that Fedora 21 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 21. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '21'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 21 reaches its end of life. If you would still
like to see this bug fixed and are able to reproduce it against a later
version of Fedora, you are encouraged to change the 'version' to a later
Fedora version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 5 Fedora End Of Life 2015-12-02 15:00:09 UTC
Fedora 21 changed to end-of-life (EOL) status on 2015-12-01. Fedora 21 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.