Bug 1470352 - dnf --refresh update broken since 26 update
dnf --refresh update broken since 26 update
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: nss (Show other bugs)
26
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: Daiki Ueno
Fedora Extras Quality Assurance
:
: 1471466 1472954 (view as bug list)
Depends On: 1460051
Blocks:
  Show dependency treegraph
 
Reported: 2017-07-12 15:32 EDT by Kostya Vasilyev
Modified: 2017-08-11 05:53 EDT (History)
23 users (show)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-11 05:53:48 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
strace (3.00 MB, text/plain)
2017-07-12 15:50 EDT, Kostya Vasilyev
no flags Details
dnf.librepo.log (867.22 KB, application/x-gzip)
2017-07-14 08:18 EDT, Kostya Vasilyev
no flags Details
last successful update transaction (8.55 KB, text/plain)
2017-07-15 11:30 EDT, Alexander Ploumistos
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Mozilla Foundation 1381784 None None None 2017-07-18 06:39 EDT

  None (edit)
Description Kostya Vasilyev 2017-07-12 15:32:34 EDT
Description of problem:

Just updated from F25 to F26 using dnf system-upgrade.

Now when I run "sudo dnf --refresh update", it just hangs after (seemingly) fetching repo info.

Version-Release number of selected component (if applicable):

dnf 2.5.1-1.fc26

How reproducible:

Tried 3-5 times, killing dnf between attempts. Same every time.

Steps to Reproduce:

1. Fedora 26, fresh upgrade from F25
2. "sudo dnf --refresh update".
3.

Actual results:

"sudo dnf --refresh update" apparently done with refreshing latest info, then just hangs.

Ctrl+C doesn't abort, have to use kill.

Expected results:

"sudo dnf --refresh update" refreshing latest info from remote repos, and then (probably) doing nothing, should be no updates at this point, and quitting to shell.

Additional info:

CPU usage in htop - nothing.
Internet connection - perfect.

Attaching debug logs (nothing interesting, it seems).
Comment 1 Kostya Vasilyev 2017-07-12 15:37:24 EDT
Not sure if this may be the cause, but found this just now:

$ sudo tail -f /var/log/dnf.librepo.log
22:24:48 lr_handle_perform: Downloading/Locating yum repo
22:24:48 lr_yum_download_remote: Downloading/Copying repo..
22:24:48 lr_yum_download_repomd: Downloading repomd.xml via mirrorlist
22:24:48 lr_download: Target: repodata/repomd.xml (-)
22:24:48 lr_prepare_lrmirrors: Preparing internal mirror list for handle id: 0x56472177ccd0
22:24:48 lr_prepare_lrmirrors: Mirror: https://download.sublimetext.com/rpm/dev/x86_64
22:24:48 select_next_target: Selecting mirror for: repodata/repomd.xml
22:24:48 prepare_next_transfer: URL: https://download.sublimetext.com/rpm/dev/x86_64/repodata/repomd.xml
22:24:48 add_librepo_xattr: Cannot set xattr user.Librepo.DownloadInProgress (fd: 35): Operation not supported
22:24:48 lr_download: Downloading started

My file systems are ext4, not sure about xattr support:

- /etc/fstab doesn't have user_xattr for /
- findmnt isn't showing xattr either

So maybe this is harmless and unrelated, but...

---

Oh and despite the "Downloading started" message, it didn't... DNF is locked up and can't even be aborted by Ctrl+C (as I already mentioned above).
Comment 2 Kostya Vasilyev 2017-07-12 15:50 EDT
Created attachment 1297196 [details]
strace

Last few lines...

The last one was the same on the several times it locked up -- futex(... WAIT ...)

stat("/etc/pki/nssdb/cert9.db-journal", 0x7ffc19e49ff0) = -1 ENOENT (No such file or directory)
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
pread64(11, "\0\0\0\2\0\0\0\t\0\0\0\0\0\0\0\0", 16, 24) = 16
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
stat("/etc/pki/nssdb/cert9.db-wal", 0x7ffc19e49ff0) = -1 ENOENT (No such file or directory)
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
fcntl(11, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
fcntl(11, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0
fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
stat("/etc/pki/nssdb/cert9.db-journal", 0x7ffc19e49ff0) = -1 ENOENT (No such file or directory)
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
pread64(11, "\0\0\0\2\0\0\0\t\0\0\0\0\0\0\0\0", 16, 24) = 16
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
stat("/etc/pki/nssdb/cert9.db-wal", 0x7ffc19e49ff0) = -1 ENOENT (No such file or directory)
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
futex(0x55c7543cd900, FUTEX_WAIT_PRIVATE, 2, NULL)
Comment 3 kartochka22 2017-07-14 05:45:08 EDT
same here, happens 3-4 times, some time "dnf upgrade" succeed, some stuck no idea why (network is perfect), librepo log exact like posted above
Comment 4 Igor Gnatenko 2017-07-14 05:46:33 EDT
can you attach full dnf.librepo.log?
Comment 5 Kostya Vasilyev 2017-07-14 08:18 EDT
Created attachment 1298293 [details]
dnf.librepo.log

dnf.librepo.log as requested
Comment 6 Igor Gnatenko 2017-07-14 08:49:17 EDT
it seems like it is timeouting on mirror.yandex.ru, could it be the case?
Comment 7 Kostya Vasilyev 2017-07-14 13:11:31 EDT
Um, I've been using mirror.yandex.ru for a couple of years now, and not seen anything like this...

Anything's possible I suppose (higher load due to the F26 release), however:

- There is are no progress messages at the point where it's stuck

- There are no error messages

- Can't be aborted with Ctrl+C

This kind of makes it look like it's not a network issue, and more like a concurrency bug (?).

I just left it stuck for several minutes (again) just to be sure:

1 - It was once again stuck at:

futex(0x562bfe5e7510, FUTEX_WAIT_PRIVATE, 2, NULL

2 - The list of network connections was:

$ sudo ss -nap | egrep "dnf"
tcp    CLOSE-WAIT 348    0      192.168.88.11:50818              158.69.195.211:443                 users:(("dnf",pid=15251,fd=30))

which is:

211.195.69.158.in-addr.arpa. 41671 IN	PTR	mirrorlist05.ovh.rpmfusion.net.

I'm not super familiar with what CLOSE-WAIT means.

3 - After about 5-7 minutes, it was still stuck, nothing new in strace

4 - Still could not be aborted with Ctrl+C

---

console output was:

$ sudo strace dnf -v --refresh update 2> dnf.strace.log
Loaded plugins: builddep, config-manager, copr, debug, debuginfo-install, download, generate_completion_cache, needs-restarting, playground, repoclosure, repograph, repomanage, reposync, system-upgrade
DNF version: 2.5.1
cachedir: /var/cache/dnf
reviving: 'fedora-cisco-openh264' can be revived - repomd matches.
not found deltainfo for: Fedora 26 openh264 (From Cisco) - x86_64
not found updateinfo for: Fedora 26 openh264 (From Cisco) - x86_64
fedora-cisco-openh264: using metadata from Fri 24 Mar 2017 08:58:22 PM MSK.
reviving: 'updates-testing' can be revived - repomd matches.
updates-testing: using metadata from Fri 14 Jul 2017 02:23:44 AM MSK.
reviving: failed for 'updates', mismatched repomd.
Fedora 26 - x86_64 - Updates                                                                                              9.3 MB/s | 4.3 MB     00:00    
updates: using metadata from Fri 14 Jul 2017 03:48:03 PM MSK.
reviving: 'fedora' can be revived - repomd matches.
not found deltainfo for: Fedora 26 - x86_64
not found updateinfo for: Fedora 26 - x86_64
fedora: using metadata from Wed 05 Jul 2017 11:31:38 PM MSK.
reviving: 'google-chrome' can be revived - repomd matches.
not found deltainfo for: google-chrome
not found updateinfo for: google-chrome
google-chrome: using metadata from Wed 12 Jul 2017 09:05:54 PM MSK.
reviving: 'home_snwh_paper' can be revived - repomd matches.
not found deltainfo for: Paper Project (Fedora_25)
not found updateinfo for: Paper Project (Fedora_25)
home_snwh_paper: using metadata from Fri 10 Mar 2017 06:11:23 AM MSK.
reviving: 'rpmfusion-free-updates' can be revived - metalink checksums match.
not found deltainfo for: RPM Fusion for Fedora 26 - Free - Updates
not found updateinfo for: RPM Fusion for Fedora 26 - Free - Updates
rpmfusion-free-updates: using metadata from Thu 13 Jul 2017 04:16:07 PM MSK.
reviving: 'rpmfusion-free' can be revived - metalink checksums match.
not found deltainfo for: RPM Fusion for Fedora 26 - Free
not found updateinfo for: RPM Fusion for Fedora 26 - Free
rpmfusion-free: using metadata from Mon 10 Jul 2017 03:06:27 PM MSK.
reviving: failed for 'rpmfusion-nonfree-updates', mismatched sha256 sum.

---

But let's say I'm wrong and it is a network issue.

- Shouldn't it time out after a reasonable time limit (15 - 30 seconds - a minute?) with an error message?

- Shouldn't it be possible to kill it with Ctrl+C?
Comment 8 Kostya Vasilyev 2017-07-14 13:28:27 EDT
Tried disabling rpmfusion repos, like this:

dnf -v --refresh --disablerepo rpmfusion-free --disablerepo rpmfusion-free-updates --disablerepo rpmfusion-nonfree --disablerepo rpmfusion-nonfree-updates update

Makes it somewhat harder to reproduce, but still happens say 1 run out of 3-5.

Stacktrace:

#0  0x00007fa005874fad in __lll_lock_wait () at /lib64/libpthread.so.0
#1  0x00007fa00586df44 in pthread_mutex_lock () at /lib64/libpthread.so.0
#2  0x00007f9fff96a1b9 in PR_Lock () at /lib64/libnspr4.so
#3  0x00007fa000231cb5 in nssSlot_IsTokenPresent () at /lib64/libnss3.so
#4  0x00007fa000231ee6 in nssSlot_GetToken () at /lib64/libnss3.so
#5  0x00007fa00022c011 in nssTrustDomain_FindCertificateByIssuerAndSerialNumber () at /lib64/libnss3.so
#6  0x00007fa00022c150 in nssTrustDomain_FindCertificateByEncodedCertificate () at /lib64/libnss3.so
#7  0x00007fa00022737b in CERT_NewTempCertificate () at /lib64/libnss3.so
#8  0x00007fa000724ef3 in ssl3_CompleteHandleCertificate () at /lib64/libssl3.so
#9  0x00007fa0007275b6 in ssl3_HandleHandshakeMessage () at /lib64/libssl3.so
#10 0x00007fa00072ad0a in ssl3_HandleRecord () at /lib64/libssl3.so
#11 0x00007fa00072ca00 in ssl3_GatherCompleteHandshake () at /lib64/libssl3.so
#12 0x00007fa000732e69 in SSL_ForceHandshake () at /lib64/libssl3.so
#13 0x00007fa001bbdef5 in nss_connect_common () at /lib64/libcurl.so.4
#14 0x00007fa001bba550 in Curl_ssl_connect_nonblocking () at /lib64/libcurl.so.4
#15 0x00007fa001b6efd2 in https_connecting () at /lib64/libcurl.so.4
#16 0x00007fa001b95f36 in multi_runsingle () at /lib64/libcurl.so.4
#17 0x00007fa001b96fb3 in curl_multi_perform () at /lib64/libcurl.so.4
#18 0x00007fa0020196fe in lr_download () at /lib64/librepo.so.0
#19 0x00007fa002019a69 in lr_download_target () at /lib64/librepo.so.0
#20 0x00007fa002019b16 in lr_download_url () at /lib64/librepo.so.0
#21 0x00007fa002028117 in lr_yum_perform () at /lib64/librepo.so.0
#22 0x00007fa00201ebb9 in lr_handle_perform () at /lib64/librepo.so.0
#23 0x00007fa002238c92 in py_perform () at /usr/lib64/python3.6/site-packages/librepo/_librepo.so

----

Looks like it's trying to complete an SSL handshake - and is stuck waiting for "something", infinitely, trying to read the system "CA database".

This matches what I'm seeing in strace output, just a bit above the futex, which is accessing a SQLite database presumably containing the certificates:

stat("/etc/pki/nssdb/cert9.db-wal", 0x7ffe2a4be660) = -1 ENOENT (No such file or directory)
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
fcntl(11, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
fcntl(11, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0
fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
stat("/etc/pki/nssdb/cert9.db-journal", 0x7ffe2a4be670) = -1 ENOENT (No such file or directory)
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
pread64(11, "\0\0\0\2\0\0\0\t\0\0\0\0\0\0\0\0", 16, 24) = 16
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
stat("/etc/pki/nssdb/cert9.db-wal", 0x7ffe2a4be670) = -1 ENOENT (No such file or directory)
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
fcntl(11, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
fcntl(11, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0
fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
stat("/etc/pki/nssdb/cert9.db-journal", 0x7ffe2a4be3c0) = -1 ENOENT (No such file or directory)

---

/etc/pki/nssdb/cert9.db does exist, it's non-empty, has a schema, a single table called nssPublic, and no data in it.

There also is a zero-sized /etc/pki/cert9.db (???)
Comment 9 Alexander Ploumistos 2017-07-15 07:27:24 EDT
I've been seeing this as well. It started sometime yesterday (14/07) when I noticed that the make cache operation that runs on a timer was stuck. I am still on F25. So far, running a "dnf clean all" before each transaction seems to do the trick.
Comment 10 Kostya Vasilyev 2017-07-15 09:44:38 EDT
(In reply to Alexander Ploumistos from comment #9)
> I've been seeing this as well. It started sometime yesterday (14/07) when I
> noticed that the make cache operation that runs on a timer was stuck. I am
> still on F25. So far, running a "dnf clean all" before each transaction
> seems to do the trick.

Wonder what updates you got recently?

In particular, any of the libraries from my stack trace above:

nss
librepo
libcurl

?

For me on Fedora 26:

nss-3.31.0-1.0.fc26.x86_64
librepo-1.7.20-3.fc26.x86_64
libcurl-7.53.1-7.fc26.x86_64
Comment 11 Alexander Ploumistos 2017-07-15 11:30 EDT
Created attachment 1298944 [details]
last successful update transaction

(In reply to Kostya Vasilyev from comment #10)
> Wonder what updates you got recently?

I checked my dnf history and I am attaching the last transaction before the problem appeared. 

> In particular, any of the libraries from my stack trace above:
> 
> nss
> librepo
> libcurl
> 
> ?
> 
> For me on Fedora 26:
> 
> nss-3.31.0-1.0.fc26.x86_64
> librepo-1.7.20-3.fc26.x86_64
> libcurl-7.53.1-7.fc26.x86_64

It seems that we only have nss in common (nss-3.31.0-1.0.fc25.x86_64). The other two are:
librepo-1.7.18-3.fc25.x86_64
libcurl-7.51.0-7.fc25.x86_64

I haven't had anything crashing yet, just dnf waiting indefinitely and it responds to SIGTERM immediately.
Comment 12 kartochka22 2017-07-15 22:49:32 EDT
Definitely "dnf clean all" works this for me.
Comment 13 Steven Haigh 2017-07-16 07:12:21 EDT
I'm having the same problems with:

# rpm -qa | grep -E '^(librepo-|nss-|libcurl-)' | sort
libcurl-7.53.1-7.fc26.x86_64
librepo-1.7.20-3.fc26.x86_64
nss-3.31.0-1.0.fc26.x86_64
Comment 14 Christian Dersch 2017-07-16 07:15:49 EDT
Same here (also same versions Steven Haigh reported).
Comment 15 Steven Haigh 2017-07-16 07:19:14 EDT
Downgrading to the NSS version here fixed the problem in F26 for me:
https://koji.fedoraproject.org/koji/buildinfo?buildID=888916

Version there is: nss-3.30.2-1.1.fc26

After downgrading to this version, I can confirm dnf works successfully again.
Comment 16 Steven Haigh 2017-07-16 07:20:39 EDT
Leave karma here for F26:
https://bodhi.fedoraproject.org/updates/FEDORA-2017-244f799ac9
Comment 17 David H. Gutteridge 2017-07-17 19:33:32 EDT
I'm seeing this on F25 as well. I was running nss-3.31.0-1.0.fc25, like Alexander Ploumistos. Downgrading to nss-3.30.2-1.1.fc25.x86_64 resolved the problem for me.
Comment 18 Steven Haigh 2017-07-17 19:45:27 EDT
Added Daiki Ueno to this report as its his build (and this is filed under dnf, not nss so may be missed)
Comment 19 Freddy Willemsen 2017-07-18 05:58:16 EDT
Same here, running fc25, the only viable workaround for me is to downgrade to nss-3.30.
Comment 20 Fedora Update System 2017-07-18 12:34:42 EDT
nspr-4.15.0-1.fc26 nss-3.31.0-1.1.fc26 nss-softokn-3.31.0-1.0.fc26 nss-util-3.31.0-1.0.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-244f799ac9
Comment 21 Steven Haigh 2017-07-19 03:50:23 EDT
Can confirm that updating to nss-3.31.0-1.1.fc26 fixes the issue. Have left +1 karma for the bodhi build.
Comment 22 Igor Gnatenko 2017-07-19 07:17:34 EDT
*** Bug 1471466 has been marked as a duplicate of this bug. ***
Comment 23 Jonathan Lebon 2017-07-19 12:47:53 EDT
*** Bug 1472954 has been marked as a duplicate of this bug. ***
Comment 24 Fedora Update System 2017-07-19 20:24:24 EDT
nspr-4.15.0-1.fc26, nss-3.31.0-1.1.fc26, nss-softokn-3.31.0-1.0.fc26, nss-util-3.31.0-1.0.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-244f799ac9
Comment 25 Fedora Update System 2017-07-26 14:21:49 EDT
nspr-4.15.0-1.fc26, nss-3.31.0-1.1.fc26, nss-softokn-3.31.0-1.0.fc26, nss-util-3.31.0-1.0.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.