Bug 1470352

Summary:

dnf --refresh update broken since 26 update

Product:

[Fedora] Fedora

Reporter:

Kostya Vasilyev <kmansoft>

Component:

nss

Assignee:

Daiki Ueno <dueno>

Status:

CLOSED ERRATA

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

CC:

alex.ploumistos, alexus_m, dhgutteridge, dmach, dueno, elio.maldonado.batiz, fabrice, freddy, georgmueller, jlebon, kartochka378, kdudka, kengert, kmansoft, lars.s.jensen, lupinix.fedora, mhatina, mrunge, netwiz, packaging-team-maint, rpm-software-management, vmukhame, yann

Target Milestone:

---

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2017-08-11 09:53:48 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

1460051

Bug Blocks:

Attachments:

Description	Flags
strace	none
dnf.librepo.log	none
last successful update transaction	none

Description Kostya Vasilyev 2017-07-12 19:32:34 UTC

Description of problem:

Just updated from F25 to F26 using dnf system-upgrade.

Now when I run "sudo dnf --refresh update", it just hangs after (seemingly) fetching repo info.

Version-Release number of selected component (if applicable):

dnf 2.5.1-1.fc26

How reproducible:

Tried 3-5 times, killing dnf between attempts. Same every time.

Steps to Reproduce:

1. Fedora 26, fresh upgrade from F25
2. "sudo dnf --refresh update".
3.

Actual results:

"sudo dnf --refresh update" apparently done with refreshing latest info, then just hangs.

Ctrl+C doesn't abort, have to use kill.

Expected results:

"sudo dnf --refresh update" refreshing latest info from remote repos, and then (probably) doing nothing, should be no updates at this point, and quitting to shell.

Additional info:

CPU usage in htop - nothing.
Internet connection - perfect.

Attaching debug logs (nothing interesting, it seems).

Comment 1 Kostya Vasilyev 2017-07-12 19:37:24 UTC

Not sure if this may be the cause, but found this just now:

$ sudo tail -f /var/log/dnf.librepo.log
22:24:48 lr_handle_perform: Downloading/Locating yum repo
22:24:48 lr_yum_download_remote: Downloading/Copying repo..
22:24:48 lr_yum_download_repomd: Downloading repomd.xml via mirrorlist
22:24:48 lr_download: Target: repodata/repomd.xml (-)
22:24:48 lr_prepare_lrmirrors: Preparing internal mirror list for handle id: 0x56472177ccd0
22:24:48 lr_prepare_lrmirrors: Mirror: https://download.sublimetext.com/rpm/dev/x86_64
22:24:48 select_next_target: Selecting mirror for: repodata/repomd.xml
22:24:48 prepare_next_transfer: URL: https://download.sublimetext.com/rpm/dev/x86_64/repodata/repomd.xml
22:24:48 add_librepo_xattr: Cannot set xattr user.Librepo.DownloadInProgress (fd: 35): Operation not supported
22:24:48 lr_download: Downloading started

My file systems are ext4, not sure about xattr support:

- /etc/fstab doesn't have user_xattr for /
- findmnt isn't showing xattr either

So maybe this is harmless and unrelated, but...

---

Oh and despite the "Downloading started" message, it didn't... DNF is locked up and can't even be aborted by Ctrl+C (as I already mentioned above).

Comment 2 Kostya Vasilyev 2017-07-12 19:50:51 UTC

Created attachment 1297196 [details]
strace

Last few lines...

The last one was the same on the several times it locked up -- futex(... WAIT ...)

stat("/etc/pki/nssdb/cert9.db-journal", 0x7ffc19e49ff0) = -1 ENOENT (No such file or directory)
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
pread64(11, "\0\0\0\2\0\0\0\t\0\0\0\0\0\0\0\0", 16, 24) = 16
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
stat("/etc/pki/nssdb/cert9.db-wal", 0x7ffc19e49ff0) = -1 ENOENT (No such file or directory)
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
fcntl(11, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
fcntl(11, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0
fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
stat("/etc/pki/nssdb/cert9.db-journal", 0x7ffc19e49ff0) = -1 ENOENT (No such file or directory)
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
pread64(11, "\0\0\0\2\0\0\0\t\0\0\0\0\0\0\0\0", 16, 24) = 16
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
stat("/etc/pki/nssdb/cert9.db-wal", 0x7ffc19e49ff0) = -1 ENOENT (No such file or directory)
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
futex(0x55c7543cd900, FUTEX_WAIT_PRIVATE, 2, NULL)

Comment 3 kartochka378 2017-07-14 09:45:08 UTC

same here, happens 3-4 times, some time "dnf upgrade" succeed, some stuck no idea why (network is perfect), librepo log exact like posted above

Comment 4 Igor Gnatenko 2017-07-14 09:46:33 UTC

can you attach full dnf.librepo.log?

Comment 5 Kostya Vasilyev 2017-07-14 12:18:53 UTC

Created attachment 1298293 [details]
dnf.librepo.log

dnf.librepo.log as requested

Comment 6 Igor Gnatenko 2017-07-14 12:49:17 UTC

it seems like it is timeouting on mirror.yandex.ru, could it be the case?

Comment 7 Kostya Vasilyev 2017-07-14 17:11:31 UTC

Um, I've been using mirror.yandex.ru for a couple of years now, and not seen anything like this...

Anything's possible I suppose (higher load due to the F26 release), however:

- There is are no progress messages at the point where it's stuck

- There are no error messages

- Can't be aborted with Ctrl+C

This kind of makes it look like it's not a network issue, and more like a concurrency bug (?).

I just left it stuck for several minutes (again) just to be sure:

1 - It was once again stuck at:

futex(0x562bfe5e7510, FUTEX_WAIT_PRIVATE, 2, NULL

2 - The list of network connections was:

$ sudo ss -nap | egrep "dnf"
tcp    CLOSE-WAIT 348    0      192.168.88.11:50818              158.69.195.211:443                 users:(("dnf",pid=15251,fd=30))

which is:

211.195.69.158.in-addr.arpa. 41671 IN	PTR	mirrorlist05.ovh.rpmfusion.net.

I'm not super familiar with what CLOSE-WAIT means.

3 - After about 5-7 minutes, it was still stuck, nothing new in strace

4 - Still could not be aborted with Ctrl+C

---

console output was:

$ sudo strace dnf -v --refresh update 2> dnf.strace.log
Loaded plugins: builddep, config-manager, copr, debug, debuginfo-install, download, generate_completion_cache, needs-restarting, playground, repoclosure, repograph, repomanage, reposync, system-upgrade
DNF version: 2.5.1
cachedir: /var/cache/dnf
reviving: 'fedora-cisco-openh264' can be revived - repomd matches.
not found deltainfo for: Fedora 26 openh264 (From Cisco) - x86_64
not found updateinfo for: Fedora 26 openh264 (From Cisco) - x86_64
fedora-cisco-openh264: using metadata from Fri 24 Mar 2017 08:58:22 PM MSK.
reviving: 'updates-testing' can be revived - repomd matches.
updates-testing: using metadata from Fri 14 Jul 2017 02:23:44 AM MSK.
reviving: failed for 'updates', mismatched repomd.
Fedora 26 - x86_64 - Updates                                                                                              9.3 MB/s | 4.3 MB     00:00    
updates: using metadata from Fri 14 Jul 2017 03:48:03 PM MSK.
reviving: 'fedora' can be revived - repomd matches.
not found deltainfo for: Fedora 26 - x86_64
not found updateinfo for: Fedora 26 - x86_64
fedora: using metadata from Wed 05 Jul 2017 11:31:38 PM MSK.
reviving: 'google-chrome' can be revived - repomd matches.
not found deltainfo for: google-chrome
not found updateinfo for: google-chrome
google-chrome: using metadata from Wed 12 Jul 2017 09:05:54 PM MSK.
reviving: 'home_snwh_paper' can be revived - repomd matches.
not found deltainfo for: Paper Project (Fedora_25)
not found updateinfo for: Paper Project (Fedora_25)
home_snwh_paper: using metadata from Fri 10 Mar 2017 06:11:23 AM MSK.
reviving: 'rpmfusion-free-updates' can be revived - metalink checksums match.
not found deltainfo for: RPM Fusion for Fedora 26 - Free - Updates
not found updateinfo for: RPM Fusion for Fedora 26 - Free - Updates
rpmfusion-free-updates: using metadata from Thu 13 Jul 2017 04:16:07 PM MSK.
reviving: 'rpmfusion-free' can be revived - metalink checksums match.
not found deltainfo for: RPM Fusion for Fedora 26 - Free
not found updateinfo for: RPM Fusion for Fedora 26 - Free
rpmfusion-free: using metadata from Mon 10 Jul 2017 03:06:27 PM MSK.
reviving: failed for 'rpmfusion-nonfree-updates', mismatched sha256 sum.

---

But let's say I'm wrong and it is a network issue.

- Shouldn't it time out after a reasonable time limit (15 - 30 seconds - a minute?) with an error message?

- Shouldn't it be possible to kill it with Ctrl+C?

Comment 8 Kostya Vasilyev 2017-07-14 17:28:27 UTC

Tried disabling rpmfusion repos, like this:

dnf -v --refresh --disablerepo rpmfusion-free --disablerepo rpmfusion-free-updates --disablerepo rpmfusion-nonfree --disablerepo rpmfusion-nonfree-updates update

Makes it somewhat harder to reproduce, but still happens say 1 run out of 3-5.

Stacktrace:

#0  0x00007fa005874fad in __lll_lock_wait () at /lib64/libpthread.so.0
#1  0x00007fa00586df44 in pthread_mutex_lock () at /lib64/libpthread.so.0
#2  0x00007f9fff96a1b9 in PR_Lock () at /lib64/libnspr4.so
#3  0x00007fa000231cb5 in nssSlot_IsTokenPresent () at /lib64/libnss3.so
#4  0x00007fa000231ee6 in nssSlot_GetToken () at /lib64/libnss3.so
#5  0x00007fa00022c011 in nssTrustDomain_FindCertificateByIssuerAndSerialNumber () at /lib64/libnss3.so
#6  0x00007fa00022c150 in nssTrustDomain_FindCertificateByEncodedCertificate () at /lib64/libnss3.so
#7  0x00007fa00022737b in CERT_NewTempCertificate () at /lib64/libnss3.so
#8  0x00007fa000724ef3 in ssl3_CompleteHandleCertificate () at /lib64/libssl3.so
#9  0x00007fa0007275b6 in ssl3_HandleHandshakeMessage () at /lib64/libssl3.so
#10 0x00007fa00072ad0a in ssl3_HandleRecord () at /lib64/libssl3.so
#11 0x00007fa00072ca00 in ssl3_GatherCompleteHandshake () at /lib64/libssl3.so
#12 0x00007fa000732e69 in SSL_ForceHandshake () at /lib64/libssl3.so
#13 0x00007fa001bbdef5 in nss_connect_common () at /lib64/libcurl.so.4
#14 0x00007fa001bba550 in Curl_ssl_connect_nonblocking () at /lib64/libcurl.so.4
#15 0x00007fa001b6efd2 in https_connecting () at /lib64/libcurl.so.4
#16 0x00007fa001b95f36 in multi_runsingle () at /lib64/libcurl.so.4
#17 0x00007fa001b96fb3 in curl_multi_perform () at /lib64/libcurl.so.4
#18 0x00007fa0020196fe in lr_download () at /lib64/librepo.so.0
#19 0x00007fa002019a69 in lr_download_target () at /lib64/librepo.so.0
#20 0x00007fa002019b16 in lr_download_url () at /lib64/librepo.so.0
#21 0x00007fa002028117 in lr_yum_perform () at /lib64/librepo.so.0
#22 0x00007fa00201ebb9 in lr_handle_perform () at /lib64/librepo.so.0
#23 0x00007fa002238c92 in py_perform () at /usr/lib64/python3.6/site-packages/librepo/_librepo.so

----

Looks like it's trying to complete an SSL handshake - and is stuck waiting for "something", infinitely, trying to read the system "CA database".

This matches what I'm seeing in strace output, just a bit above the futex, which is accessing a SQLite database presumably containing the certificates:

stat("/etc/pki/nssdb/cert9.db-wal", 0x7ffe2a4be660) = -1 ENOENT (No such file or directory)
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
fcntl(11, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
fcntl(11, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0
fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
stat("/etc/pki/nssdb/cert9.db-journal", 0x7ffe2a4be670) = -1 ENOENT (No such file or directory)
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
pread64(11, "\0\0\0\2\0\0\0\t\0\0\0\0\0\0\0\0", 16, 24) = 16
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
stat("/etc/pki/nssdb/cert9.db-wal", 0x7ffe2a4be670) = -1 ENOENT (No such file or directory)
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
fcntl(11, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
fcntl(11, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0
fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
stat("/etc/pki/nssdb/cert9.db-journal", 0x7ffe2a4be3c0) = -1 ENOENT (No such file or directory)

---

/etc/pki/nssdb/cert9.db does exist, it's non-empty, has a schema, a single table called nssPublic, and no data in it.

There also is a zero-sized /etc/pki/cert9.db (???)

Comment 9 Alexander Ploumistos 2017-07-15 11:27:24 UTC

I've been seeing this as well. It started sometime yesterday (14/07) when I noticed that the make cache operation that runs on a timer was stuck. I am still on F25. So far, running a "dnf clean all" before each transaction seems to do the trick.

Comment 10 Kostya Vasilyev 2017-07-15 13:44:38 UTC

(In reply to Alexander Ploumistos from comment #9)
> I've been seeing this as well. It started sometime yesterday (14/07) when I
> noticed that the make cache operation that runs on a timer was stuck. I am
> still on F25. So far, running a "dnf clean all" before each transaction
> seems to do the trick.

Wonder what updates you got recently?

In particular, any of the libraries from my stack trace above:

nss
librepo
libcurl

?

For me on Fedora 26:

nss-3.31.0-1.0.fc26.x86_64
librepo-1.7.20-3.fc26.x86_64
libcurl-7.53.1-7.fc26.x86_64

Comment 11 Alexander Ploumistos 2017-07-15 15:30:17 UTC

Created attachment 1298944 [details]
last successful update transaction

(In reply to Kostya Vasilyev from comment #10)
> Wonder what updates you got recently?

I checked my dnf history and I am attaching the last transaction before the problem appeared. 

> In particular, any of the libraries from my stack trace above:
> 
> nss
> librepo
> libcurl
> 
> ?
> 
> For me on Fedora 26:
> 
> nss-3.31.0-1.0.fc26.x86_64
> librepo-1.7.20-3.fc26.x86_64
> libcurl-7.53.1-7.fc26.x86_64

It seems that we only have nss in common (nss-3.31.0-1.0.fc25.x86_64). The other two are:
librepo-1.7.18-3.fc25.x86_64
libcurl-7.51.0-7.fc25.x86_64

I haven't had anything crashing yet, just dnf waiting indefinitely and it responds to SIGTERM immediately.

Comment 12 kartochka378 2017-07-16 02:49:32 UTC

Definitely "dnf clean all" works this for me.

Comment 13 Steven Haigh 2017-07-16 11:12:21 UTC

I'm having the same problems with:

# rpm -qa | grep -E '^(librepo-|nss-|libcurl-)' | sort
libcurl-7.53.1-7.fc26.x86_64
librepo-1.7.20-3.fc26.x86_64
nss-3.31.0-1.0.fc26.x86_64

Comment 14 Christian Dersch 2017-07-16 11:15:49 UTC

Same here (also same versions Steven Haigh reported).

Comment 15 Steven Haigh 2017-07-16 11:19:14 UTC

Downgrading to the NSS version here fixed the problem in F26 for me:
https://koji.fedoraproject.org/koji/buildinfo?buildID=888916

Version there is: nss-3.30.2-1.1.fc26

After downgrading to this version, I can confirm dnf works successfully again.

Comment 16 Steven Haigh 2017-07-16 11:20:39 UTC

Leave karma here for F26:
https://bodhi.fedoraproject.org/updates/FEDORA-2017-244f799ac9

Comment 17 David H. Gutteridge 2017-07-17 23:33:32 UTC

I'm seeing this on F25 as well. I was running nss-3.31.0-1.0.fc25, like Alexander Ploumistos. Downgrading to nss-3.30.2-1.1.fc25.x86_64 resolved the problem for me.

Comment 18 Steven Haigh 2017-07-17 23:45:27 UTC

Added Daiki Ueno to this report as its his build (and this is filed under dnf, not nss so may be missed)

Comment 19 Freddy Willemsen 2017-07-18 09:58:16 UTC

Same here, running fc25, the only viable workaround for me is to downgrade to nss-3.30.

Comment 20 Fedora Update System 2017-07-18 16:34:42 UTC

nspr-4.15.0-1.fc26 nss-3.31.0-1.1.fc26 nss-softokn-3.31.0-1.0.fc26 nss-util-3.31.0-1.0.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-244f799ac9

Comment 21 Steven Haigh 2017-07-19 07:50:23 UTC

Can confirm that updating to nss-3.31.0-1.1.fc26 fixes the issue. Have left +1 karma for the bodhi build.

Comment 22 Igor Gnatenko 2017-07-19 11:17:34 UTC

*** Bug 1471466 has been marked as a duplicate of this bug. ***

Comment 23 Jonathan Lebon 2017-07-19 16:47:53 UTC

*** Bug 1472954 has been marked as a duplicate of this bug. ***

Comment 24 Fedora Update System 2017-07-20 00:24:24 UTC

nspr-4.15.0-1.fc26, nss-3.31.0-1.1.fc26, nss-softokn-3.31.0-1.0.fc26, nss-util-3.31.0-1.0.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-244f799ac9

Comment 25 Fedora Update System 2017-07-26 18:21:49 UTC

nspr-4.15.0-1.fc26, nss-3.31.0-1.1.fc26, nss-softokn-3.31.0-1.0.fc26, nss-util-3.31.0-1.0.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report.