Bug 998113

Summary: [locking] dnf makecache stuck, blocks operation
Product: [Fedora] Fedora Reporter: Karel Volný <kvolny>
Component: dnfAssignee: Packaging Maintenance Team <packaging-team-maint>
Status: CLOSED NOTABUG QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 19CC: akozumpl, jsilhan, packaging-team-maint, pnemade
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-09-09 09:04:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Karel Volný 2013-08-17 11:32:29 UTC
Description of problem:
I cannot run dnf upgrade as it says metadata-cache is locked.

Version-Release number of selected component (if applicable):
dnf-0.3.10-1.giteb9dddb.fc19.noarch

How reproducible:
depending on the trigger condition (network timeout?)

Steps to Reproduce:
1. get `dnf -v makecache timer` command (cronjob) stuck
2. dnf upgrade

Actual results:
[root@kvolny ~]# dnf upgrade
metadata-cache already locked by 3575
  The application with PID 3575 is: dnf
    Paměť  :  26 M RSS (430 MB VSZ)
    Spuštěn: Fri Aug 16 03:45:07 2013 - před 1 day(s) 9:10:26
    State  : Spí


Expected results:
(upgrade goes as usual)


Additional info:
trying strace, it shows timeouts repeating:

[root@kvolny ~]# strace -y -p 3575
Process 3575 attached
select(13, [12<socket:[44973]>], [], [], {0, 276892}) = 0 (Timeout)
poll([{fd=12<socket:[44973]>, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
select(13, [12<socket:[44973]>], [], [], {1, 0}) = 0 (Timeout)
poll([{fd=12<socket:[44973]>, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
...


from dnf.log:

Aug 16 02:45:07 INFO Metadata cache refreshed recently.
Aug 16 03:45:07 INFO --- logging initialized ---
Aug 16 03:45:07 DEBUG Config time: 0.006
Aug 16 03:45:07 DEBUG dnf version: 0.3.10
Aug 16 03:45:07 SUBDEBUG COMMAND: dnf -v makecache timer 
Aug 16 03:45:07 SUBDEBUG Installroot: /
Aug 16 03:45:07 SUBDEBUG Ext Commands: timer
Aug 16 03:45:07 DEBUG Vytváření cache souborů pro všechna metadata.
Aug 16 03:45:07 DEBUG rpmfusion-nonfree-debuginfo: will expire after 378557 seconds.
Aug 16 03:45:07 DEBUG fedora: metadata will expire after 567 seconds and will be refreshed now
Aug 16 03:45:07 DEBUG rpmfusion-nonfree-updates-debuginfo: metadata will expire after 568 seconds and will be refreshed now
Aug 16 03:45:07 DEBUG rpmfusion-free-updates: metadata will expire after 568 seconds and will be refreshed now
Aug 16 03:45:07 DEBUG adobe-linux-x86_64: metadata will expire after 568 seconds and will be refreshed now
Aug 16 03:45:07 DEBUG rpmfusion-nonfree-updates: metadata will expire after 569 seconds and will be refreshed now
Aug 16 03:45:07 DEBUG rpmfusion-free: metadata will expire after 569 seconds and will be refreshed now
Aug 16 03:45:07 DEBUG rhpkg: will expire after 120658 seconds.
Aug 16 03:45:07 DEBUG updates-debuginfo: metadata will expire after 573 seconds and will be refreshed now
Aug 16 03:45:07 DEBUG fedora-debuginfo: will expire after 378559 seconds.
Aug 16 03:45:07 DEBUG updates: metadata will expire after 584 seconds and will be refreshed now
Aug 16 03:45:07 DEBUG rpmfusion-nonfree: metadata will expire after 588 seconds and will be refreshed now
Aug 16 03:45:07 DEBUG rpmfusion-free-debuginfo: will expire after 378559 seconds.
Aug 16 03:45:07 DEBUG rpmfusion-free-updates-debuginfo: metadata will expire after 589 seconds and will be refreshed now
Aug 16 03:45:07 DEBUG qa-tools: will expire after 120659 seconds.
Aug 16 03:45:07 DEBUG beaker-client: will expire after 120660 seconds.
Aug 16 03:45:07 DEBUG not found deltainfo for: RPM Fusion for Fedora 19 - Nonfree - Debug
Aug 17 12:55:33 INFO --- logging initialized ---
Aug 17 12:55:33 DEBUG Config time: 0.007
Aug 17 12:55:33 DEBUG dnf version: 0.3.10
Aug 17 12:55:33 SUBDEBUG COMMAND: dnf upgrade 
Aug 17 12:55:33 SUBDEBUG Installroot: /
Aug 17 12:55:33 SUBDEBUG Ext Commands: 
Aug 17 12:55:33 CRITICAL metadata-cache already locked by 3575
Aug 17 12:55:33 CRITICAL   The application with PID 3575 is: dnf
Aug 17 12:55:33 CRITICAL     Paměť  :  26 M RSS (430 MB VSZ)
Aug 17 12:55:33 CRITICAL     Spuštěn: Fri Aug 16 03:45:07 2013 - před 1 day(s) 9:10:26
Aug 17 12:55:33 CRITICAL     State  : Spí


after killing the process, everything goes ok

Comment 1 Ales Kozumplik 2013-08-19 07:03:19 UTC
This is closely related to the bug 922441. I can only wonder what causes the hang in the makecache process: it is probably a network communication hang but to honestly blame libreport we'd need to catch it red-handed (hopefully this [1] will help it in the future).

Leaving this open for now, hoping to reproduce.

[1] https://github.com/akozumpl/dnf/commit/d46d2ff51eb274daa73cbbab28a6c8508776b9ba

Comment 2 Karel Volný 2013-08-20 14:02:08 UTC
(In reply to Ales Kozumplik from comment #1)
> This is closely related to the bug 922441.

looking at that ... would that be extremely hard (or undesired) to implement some inter process communication so that the program run by the user would be able to get the status of the one running in background, show the background downloads progress, offer to interrupt the background process, etc.?

> I can only wonder what causes the hang in the makecache process: it is probably
> a network communication hang but to honestly blame libreport we'd need to catch
> it red-handed (hopefully this [1] will help it in the future).

ugh, what does libreport has to do with this?

Comment 3 Ales Kozumplik 2013-08-20 14:19:36 UTC
(In reply to Karel Volný from comment #2)
> looking at that ... would that be extremely hard (or undesired) to implement
> some inter process communication so that the program run by the user would
> be able to get the status of the one running in background, show the
> background downloads progress, offer to interrupt the background process,
> etc.?

It's a possibility given that both processes belong to the same user.

> 
> > I can only wonder what causes the hang in the makecache process: it is probably
> > a network communication hang but to honestly blame libreport we'd need to catch
> > it red-handed (hopefully this [1] will help it in the future).
> 
> ugh, what does libreport has to do with this?

sorry, I meant 'librepo'.

Comment 4 Karel Volný 2013-08-20 14:57:10 UTC
(In reply to Ales Kozumplik from comment #3)
> (In reply to Karel Volný from comment #2)
> > looking at that ... would that be extremely hard (or undesired) to implement
> > some inter process communication so that the program run by the user would
> > be able to get the status of the one running in background, show the
> > background downloads progress, offer to interrupt the background process,
> > etc.?
> 
> It's a possibility given that both processes belong to the same user.

well, except for read only operations where you don't need to acquire any locks anyways, does dnf allow other user than root (or someone using sudo to become root)? - I guess "same user" rule would cover 100% cases here ...

...
> sorry, I meant 'librepo'.

:-)

Comment 5 Ales Kozumplik 2013-08-21 07:05:22 UTC
> well, except for read only operations where you don't need to acquire any
> locks anyways, does dnf allow other user than root (or someone using sudo to
> become root)? - I guess "same user" rule would cover 100% cases here ...
> 

yes, DNF runs under the normal user, it just will fail to do any actual RPM transaction of course.

about the read-only locking, you are right, this is one thing I'm considering, please see bug 980227 comment 2.

Comment 6 Karel Volný 2013-08-29 14:35:47 UTC
(In reply to Ales Kozumplik from comment #5)
> > well, except for read only operations where you don't need to acquire any
> > locks anyways, does dnf allow other user than root (or someone using sudo to
> > become root)? - I guess "same user" rule would cover 100% cases here ...
> > 
> 
> yes, DNF runs under the normal user, it just will fail to do any actual RPM
> transaction of course.

I thought it gains root for cache updates ...

> about the read-only locking, you are right, this is one thing I'm
> considering, please see bug 980227 comment 2.

uh, that seems far in the future ... can something be done about the makecache problem with the current implementatino?

Comment 7 Ales Kozumplik 2014-09-09 09:04:06 UTC
Closing this. DNF's metadata cache is heavily built on the idea that there is only one local copy of the repository metadata. If a background process that creates this metadata is stuck it can either be because of:

a) a persistent network issue. In that case terminating the background process and running DNF from the command line will result in a stuck operation again. Also, since this has been reported we have tweaked the timeouting values for librepo so eventually all such DNF processes finish or fail with a network error.

b) a transient network error. In that case the user best knows what to do and can kill the background process. Again, should not  be necessary.

c) an actual bug in the DNF/librepo networking code that prevents DNF from terminating successfully or with an error---such should be reported separately and properly debugged.

For possibilities of running multiple DNF with multiple caches (using different installroots) see bug 1124316.

For a request of blocking DNF executions see bug 1049028.