1741931 – metalink file download should re-try with some delay

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	librepo
Sub Component:
Version:	30
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	unspecified
Target Milestone:	---
Assignee:	Jaroslav Mracek
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1758383
Depends On:	1756400
Blocks:

Reported:	2019-08-16 13:14 UTC by Pavel Raiskup
Modified:	2020-05-27 05:06 UTC
CC List:	8 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2019-10-19 15:13:16 UTC
Type:	Bug
Embargoed:
Dependent Products:

Description Pavel Raiskup 2019-08-16 13:14:30 UTC

From:
https://copr-be.cloud.fedoraproject.org/results/thm/lxc3.0/fedora-30-aarch64/01006608-lua-lxc/chroot_scan/var/lib/mock/1006608-fedora-30-aarch64-1565957042.659271/root/var/log/dnf.log

2019-08-16T12:01:31Z DEBUG fedora: using metadata from Wed 24 Oct 2018 10:20:15 PM UTC.
2019-08-16T12:01:32Z DEBUG error: Status code: 503 for https://mirrors.fedoraproject.org/metalink?repo=updates-released-f29&arch=x86_64 (https://mirrors.fedoraproject.org/metalink?repo=updates-released-f29&arch=x86_64).
2019-08-16T12:01:32Z DEBUG error: Status code: 503 for https://mirrors.fedoraproject.org/metalink?repo=updates-released-f29&arch=x86_64 (https://mirrors.fedoraproject.org/metalink?repo=updates-released-f29&arch=x86_64).
2019-08-16T12:01:33Z DEBUG error: Status code: 503 for https://mirrors.fedoraproject.org/metalink?repo=updates-released-f29&arch=x86_64 (https://mirrors.fedoraproject.org/metalink?repo=updates-released-f29&arch=x86_64).
2019-08-16T12:01:33Z DEBUG error: Status code: 503 for https://mirrors.fedoraproject.org/metalink?repo=updates-released-f29&arch=x86_64 (https://mirrors.fedoraproject.org/metalink?repo=updates-released-f29&arch=x86_64).
2019-08-16T12:01:33Z DEBUG Cannot download 'https://mirrors.fedoraproject.org/metalink?repo=updates-released-f29&arch=x86_64': Cannot prepare internal mirrorlist: Status code: 503 for https://mirrors.fedoraproject.org/metalink?repo=updates-released-f29&arch=x86_64.
2019-08-16T12:01:33Z ERROR Failed to download metadata for repo 'updates'

It seems we really do retry now, but we retry too fast.  Would it be possible
to insert a sleep() or similar there?  And can we ensure that name
resolution is done again on each retry?  With the default configuration,
separate requests to mirrors.fedoraproject.org should end up on different IP
addresses.

Comment 1 Pavel Raiskup 2019-08-16 13:15:50 UTC

I'm thinking of this as a follow-up to:
https://github.com/rpm-software-management/librepo/pull/158

Comment 2 Lukáš Hrázký 2019-08-19 09:51:28 UTC

Pavel, what exactly do you expect the sleep to help with? I'm not convinced it will be good for anything, however, it surely is going to slow down the whole metadata download process.

We should certainly look into the DNS resolution to query different servers though.

Comment 3 Pavel Raiskup 2019-08-19 10:44:43 UTC

(In reply to Lukáš Hrázký from comment #2)
> Pavel, what exactly do you expect the sleep to help with? I'm not convinced
> it will be good for anything, however, it surely is going to slow down the
> whole metadata download process.

I'm not sure it would help, but I sort of expected that error 503 is
temporary in nature, so that - if we tried a bit later - there would be a much
higher chance that the same mirror would be working again.  Dunno.

Still, for those urls/metalinks which _are not_ backed by a DNS pool of
alternative addresses, I'd expect the sleep would help a lot.

> We should certainly look into the DNS resolution to query different servers
> though.

This one should solve our (copr) issues by itself, I guess (copr
uses official mirrorlists, which have alternative DNS addresses).  Thank you!

Comment 4 Michal Domonkos 2019-08-19 11:45:10 UTC

A possibly related issue once fixed in yum:

https://github.com/rpm-software-management/yum/commit/a7d50db151a2bfef09b3004c7afae5e1eed651e3

Comment 5 Michal Domonkos 2019-08-19 11:52:16 UTC

There's also a (very long) discussion in the related yum bug that could shed some light on how MirrorManager works:

https://bugzilla.redhat.com/show_bug.cgi?id=1520454

Comment 6 Lukáš Hrázký 2019-08-20 08:52:02 UTC

Thanks, Michal, but the bug as well as the PR is about HTTP redirects, not DNS balancing. I think both mechanisms need to be properly supported.

Comment 7 Michal Domonkos 2019-08-20 08:59:17 UTC

You're right, Lukas.  I think I considered leveraging the DNS A/AAAA list mechanism as well when dealing with the bug, but I ended up just adding support for MirrorManager's internal round-robin mechanism that works on top of HTTP redirects, as you say.  Never mind then (there still might be some useful insights in the bug comments though).

Comment 8 Michal Domonkos 2019-08-20 09:06:33 UTC

More specifically, I seem to have verified (Comment 12) that curl does support DNS balancing (A/AAAA records):

<snip>
4) curl: if the resolved IP fails, try another one from the A/AAAA list returned for the hostname
</snip>
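
For illustration only, the A/AAAA fallback described in item 4 can be sketched roughly as below. This is not curl's code; connect_any and its resolve/connect parameters are hypothetical names, and the sketch only assumes the standard socket module. Note that this kind of fallback reacts to connection failures; a server that answers with HTTP 503 has accepted the connection, so it would not trigger it.

```python
import socket


def connect_any(host, port, resolve=socket.getaddrinfo,
                connect=socket.create_connection, timeout=5.0):
    """Try each address returned for host until one accepts a connection."""
    errors = []
    # getaddrinfo yields one entry per resolved A/AAAA address.
    for _family, _type, _proto, _canon, sockaddr in resolve(
            host, port, type=socket.SOCK_STREAM):
        try:
            # sockaddr[:2] is (ip, port) for both IPv4 and IPv6 entries.
            return connect(sockaddr[:2], timeout=timeout)
        except OSError as exc:
            # Remember the failure and fall through to the next address.
            errors.append((sockaddr[0], exc))
    raise OSError(f"all addresses for {host} failed: {errors}")
```

Usage would be something like connect_any("mirrors.fedoraproject.org", 443), which returns a connected socket or raises OSError once every address has failed.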

Comment 9 Pavel Raiskup 2019-08-29 11:41:03 UTC

So it means that urlgrabber (which is not used by dnf) behaves correctly, right?

Can I work around this somehow?  Today we again had a series of build failures
because of this.  Or can we help somehow to get this fixed soon?

Comment 10 Michal Domonkos 2019-08-29 12:22:47 UTC

urlgrabber itself does employ a simple mechanism to rotate the available mirror URLs (in a randomized fashion), but those have to be passed to it via the API from yum (such as parsed from a metalink.xml or mirrorlist.txt file).  What I was referring to was a DNS round-robin mechanism that happens at the curl level.

Comment 11 Michal Domonkos 2019-08-29 12:31:03 UTC

There are actually 3 layers of mirror handling in a typical yum->urlgrabber->curl scenario; at the highest level, there's yum fetching and parsing a metalink.xml or mirrorlist.txt file from the repository server.  After obtaining a list of mirrors, it passes them down to urlgrabber which tries them one by one until it succeeds.  And finally, curl looks at the list of addresses returned by the DNS server for a particular URL and does something similar (but I'm not familiar with this part that much).
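
A rough sketch of the two upper layers, for illustration only: parse a mirrorlist-style text into URLs, then try them one by one until a fetch succeeds. This is not yum or urlgrabber code; parse_mirrorlist, fetch_first_working and the fetch callback are hypothetical names, and the sketch assumes a mirrorlist with one URL per line and '#' comment lines. The fetch callback stands in for the lowest (curl) layer and is expected to raise OSError on failure.

```python
def parse_mirrorlist(text):
    """Return mirror base URLs from mirrorlist-style text (one URL per line)."""
    return [line.strip() for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]


def fetch_first_working(mirrors, path, fetch):
    """Try each mirror in order; return (mirror, data) from the first success."""
    failures = []
    for base in mirrors:
        url = base.rstrip("/") + "/" + path.lstrip("/")
        try:
            return base, fetch(url)
        except OSError as exc:
            # This mirror failed; record it and move on to the next one.
            failures.append((url, exc))
    raise OSError(f"no mirror could serve {path}: {failures}")
```

The point of the sketch is that each layer only sees failures of the layer below it: the mirror loop moves on when fetch gives up, whatever fetch did internally with the resolved addresses.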

Comment 12 Michal Domonkos 2019-08-29 12:34:48 UTC

(In reply to Michal Domonkos from comment #11)
> for a particular URL and does something similar

s/URL/IP/

Comment 13 Michal Domonkos 2019-08-29 12:35:08 UTC

(In reply to Michal Domonkos from comment #11)
> for a particular URL and does something similar

s/URL/hostname/

Comment 14 Pavel Raiskup 2019-08-29 13:21:55 UTC

(In reply to Michal Domonkos from comment #11)
> And finally, curl
> looks at the list of addresses returned by the DNS server for a particular
> URL and does something similar (but I'm not familiar with this part that
> much).

I doubt this is what is happening on curl level in case of librepo (and dnf),
because that would mean that all the hosts in
`$ host mirrors.fedoraproject.org` are dead sometimes.  At least not for the
error 503 (immediate failure of server).

So speaking of the 503 errors, can we turn some round-robin mechanism on?

Comment 15 Jaroslav Mracek 2019-09-06 06:43:25 UTC

I created a patch (https://github.com/rpm-software-management/librepo/pull/167) that adds a sleep step after all mirrors were tried. It mostly applies when only one URL is available (metalink, baseurl).
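
The idea of sleeping only after the whole list has been tried can be sketched as follows. This is an illustration, not the code from the pull request (librepo is written in C); download_with_rounds, rounds, delay and the fetch callback are hypothetical names, and fetch is assumed to raise OSError on failure.

```python
import time


def download_with_rounds(urls, fetch, rounds=3, delay=1.0, sleep=time.sleep):
    """Try every URL once per round; pause between rounds, not between URLs."""
    last_error = None
    for attempt in range(rounds):
        for url in urls:
            try:
                return fetch(url)
            except OSError as exc:
                last_error = exc
        # Wait only after the whole list failed, and not after the last round.
        if attempt < rounds - 1:
            sleep(delay)
    raise OSError(f"all {rounds} rounds failed") from last_error
```

With a long mirror list the delay is rarely reached, because some mirror usually answers within the first round. With a single URL (a metalink or a baseurl) every failure ends a round, so the delay applies between every attempt.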

Comment 16 Jaroslav Mracek 2019-09-18 07:12:45 UTC

I still think that the issue has not been solved properly.

Comment 17 Jaroslav Mracek 2019-09-20 13:33:09 UTC

I created an improved version of the first patch: https://github.com/rpm-software-management/librepo/pull/169. I am still working on improving the logging.

Comment 18 Jaroslav Mracek 2019-09-27 13:31:29 UTC

It looks like even a delay will not resolve the issue. I suspect that the issue is caused by a dead IP retrieved from DNS.

I tried using an incorrect metalink (https://mirrors.fedoraproject.org/metalinks?repo=updates-released-f30&arch=x86_64) for testing (metalink was replaced by metalinks). During a single run I am nearly unable to force curl to use a different IP than the first one in the list.

The problem is with the CURL multi handle (https://curl.haxx.se/libcurl/c/libcurl-multi.html), where setting CURLOPT_DNS_CACHE_TIMEOUT or CURLOPT_DNS_SHUFFLE_ADDRESSES through curl_easy_setopt(CURL *handle, CURLoption option, parameter) has no effect on DNS.

What worked was the patch https://github.com/rpm-software-management/librepo/pull/159/commits/ac80f6c26ebbf358f68eb62e31306c22597dbbdc.

Comment 19 Jaroslav Mracek 2019-10-07 12:52:24 UTC

*** Bug 1758383 has been marked as a duplicate of this bug. ***

Comment 20 Jaroslav Mracek 2019-10-19 15:13:16 UTC

The required patches were backported to F30.

Comment 21 Pavel Raiskup 2020-05-27 05:06:24 UTC

I'm not reopening because that's not our priority now (in copr we
re-try on a higher level anyway, which was initially a work-around for other
issues).

Just FYI, note that we probably rejected the idea of re-trying the
same URL (with delay), but per discussion with OpenSUSE users that's
exactly how zypper and OpenSUSE mirroring works [1] -- they
retry the same URLs through a redirector, and when the redirector
recognizes that some mirror is temporarily down, it redirects the same
URL request to a different mirror automatically next time.  But the client
would have to re-try (librepo doesn't seem to, judging from my attempts on F31).

[1] https://github.com/rpm-software-management/mock/issues/553
