Bug 163465 - Make yum feel snappier by caching repodata files
Status: CLOSED RAWHIDE
Product: Fedora
Classification: Fedora
Component: yum
Version: rawhide
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assignee: Jeremy Katz
 
Reported: 2005-07-17 17:08 UTC by Sigge Kotliar
Modified: 2014-01-21 22:52 UTC

Doc Type: Enhancement
Last Closed: 2005-09-21 18:06:29 UTC


Description Sigge Kotliar 2005-07-17 17:08:25 UTC
Hi! I apologise in advance for this entry being very long, but I hope it can be
worth the read.

As it stands, every call to yum downloads one small file per repo; this file is
presumably used to check whether the repo has changed since the last run. If it
has, yum downloads the full repo listing; if not, it reuses the old one.
Although these files are only about 1 kB each, every one of them is a separate
HTTP request to a separate server. On my machine this takes approximately one
second per repo. I have six repos, giving me a bit more than six seconds to wait
before any real work is done.
In many cases, as I will show below, this check for updated repos can be avoided
most of the time, making yum feel a lot faster. Even today, with all the great
speed improvements yum has had, some people still complain that yum is "slow".
By slow they don't mean slow at calculating dependencies, but slow in the time
from issuing a command to it being executed, largely because of these
unnecessary calls to HTTP servers.
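
Each repo publishes its metadata index at <baseurl>/repodata/repomd.xml, and
most of the per-repo wait is the connection itself rather than the ~1 kB
transfer. A rough way to see the per-repo cost (the mirror URLs below are
made-up placeholders, not real repo configuration, and this is not yum code):

    import time
    import urllib.request

    # Made-up mirror URLs, purely for illustration; each repo serves its
    # metadata index at <baseurl>/repodata/repomd.xml.
    repos = {
        'base':    'http://example.org/fedora/4/i386/os/repodata/repomd.xml',
        'updates': 'http://example.org/fedora/4/i386/updates/repodata/repomd.xml',
    }

    for name, url in repos.items():
        start = time.time()
        body = urllib.request.urlopen(url).read()   # ~1 kB body, one full HTTP round trip
        print('%-8s %5d bytes in %.2fs' % (name, len(body), time.time() - start))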

Some use cases:
1) The user performs a yum check-update. The servers get checked and the
updates are listed. The user then decides to update packages a, b, and c. When
he/she issues "yum update a b c", the HTTP requests for the repodata files are
sent again, causing a couple of seconds of unnecessary wait.
2) The user wants to remove package foo and issues "yum remove foo". It's only
a removal; there is no need to fetch the latest repo data, just calculate
dependencies from what is already stored. I personally always use the "-C"
option to skip fetching the repo data.
3) The user wants to install package foo, issues "yum install foo", the repos
are updated, everything is fine. But if the user then wants to install bar, the
repos are updated again. Why? Nothing has changed; isn't it safe to assume the
repos are still current?
4) The user wants info on package foo and issues "yum info foo". He needs to
wait six unnecessary seconds before the info is shown. If he then decides to
read the info on package bar, it takes another six seconds.
5) The "base" repo never changes. Ever. Yet it gets "pinged" every time I run
"yum anything".


My suggestions on how to improve:
a) Whenever the repodata files (repomd.xml) are fetched, store the time of the
fetch and don't download a newer copy for X period of time. (X perhaps needs
some discussion; I'd say 12 hrs or so.) A rough sketch of this check follows
this list.
b) Actions that don't necessarily need interaction with the servers, like yum
remove and yum info, should run with something equivalent to the -C option and
not connect to any server unless it is needed.
c) A command line option like "--force-update" is created to always force yum
to fetch the latest repodata files. This is for "power users" who want the
newest stuff, but it is not the default behaviour, as most users want a quick
response.
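
A minimal sketch of the expiry check in suggestion (a). This is illustrative
only, not yum's actual code; the cache path and the 12-hour window are
assumptions:

    import os
    import time

    # The "X period of time" above, in seconds (assumed default of 12 hours).
    METADATA_EXPIRE = 12 * 60 * 60

    def repomd_is_fresh(repoid, cachedir='/var/cache/yum'):
        """Return True if the cached repomd.xml for this repo is recent enough
        that the per-repo HTTP request can be skipped entirely."""
        path = os.path.join(cachedir, repoid, 'repomd.xml')   # assumed cache layout
        try:
            age = time.time() - os.path.getmtime(path)
        except OSError:
            return False                 # no cached copy yet: must download
        return age < METADATA_EXPIRE

A "--force-update" style option, as in suggestion (c), would simply bypass this
check and always download.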

Benefits of the change I propose:
* Yum feels snappier. People won't be afraid of using yum for simple tasks like
installing a downloaded rpm package, removing a package without dependencies,
or reading package info.
* A large number of configured repos doesn't bog yum down as much; no more one
extra second of waiting for every additional repo.
* People will use rpm -Uvh and rpm -e less, and yum install/yum remove more,
now that yum doesn't make them wait a couple of seconds on every command.
Perhaps regular users will never need to run rpm commands at all anymore, using
yum for all rpm-related tasks instead.
* Less strain on servers. Probably not that big a difference, but still.
* Yum frontends become much, much faster. Right now, tools like yumex (and
perhaps pup in the future) have to 1) do a check-update to get the list,
2) perform the action, and 3) do another check-update to refresh the list. That
means three seconds wasted PER REPO, which adds up quickly. I'm sure usability
people have figures on how long is too long, but this surely must be too long.

Cons:
* Users will sometimes work from slightly outdated repodata when they use yum.
However, this is already the case with mirrors not being in perfect sync all
the time, and it should not matter to the average user.

None of this should be too big a hassle for you yum developers to implement,
and I'm sure the benefits will outweigh the work. And wouldn't it be great to
brag about "yum processing time cut in half" or something?

Comment 1 Seth Vidal 2005-07-22 15:47:55 UTC
I think the recommended implementation would end up with extremely frustrated
users. We'd be better off getting the last-changed info from the repository
server and comparing it to the repomd.xml we have on disk - but even then we
would still need to contact the repository.

Comment 2 Sigge Kotliar 2005-07-22 21:18:47 UTC
Well, I can agree that it will be a slight frustration, but I think that the
current frustration of waiting ~5 seconds every time is worse.

Apt (yes, the other packaging system, sorry for bringing it up) needs an
apt-get update before you run any command in order to get the latest data. I'm
not suggesting the same system, but frustrated users could do a "yum
check-update" to get the absolute latest packages.

Also, as the system works now with mirrors, you can quite often get two
different versions of the "yum check-update" list if you do it twice. I've had
cases with three different versions on different mirrors. So the frustration you
talk about is already there. Now we'll at least lessen it.

Just getting the last-changed header would, I think, be an improvement, but an
extremely small one. Downloading 1 kB of data takes virtually no time, even on
a modem; it's the connecting that takes time.

Just my 2 cents, but I really believe this is the way to go.

Comment 3 Jeremy Katz 2005-09-21 18:06:29 UTC
The two-stage process used by apt is a good way of ensuring that users don't
actually get updates (since they forget to do the first part or other such
things).  Getting the repodata every time is really the only way to be assured
of consistency.

Comment 4 Sigge Kotliar 2005-09-21 21:34:31 UTC
Jeremy: with all due respect, I don't think you read my suggestion properly.
I'm not suggesting an apt-like two-stage process. What I'm suggesting is that
if a repodata file was fetched less than, say, an hour ago, yum doesn't get a
new one. No users would miss any updates, because the sync difference between
different mirrors is already bigger than it would be with this one-hour caching
of repodata files.
It would also help tremendously for everyone doing consecutive requests, i.e.
yum install app-a; yum install app-b. I don't see why two separate rounds of
connections are needed.
With commands like yum search, yum info and yum remove, not even one is needed
unless the repodata file is terribly out of date, since the names of the rpms
on the server rarely change, only the versions.

Please review this again; I really do believe in this.

Comment 5 Seth Vidal 2005-11-07 03:15:39 UTC
So we figured out a way to do this w/o going crazy.

Refiling this as rawhide - it will hit rawhide in a future yum release.

Not quite how you wanted it done, but I think it will cover the problem.


Comment 6 Sigge Kotliar 2005-11-08 14:23:20 UTC
Just downloaded and tried out yum-2.4.0-9. Is this the version you are talking
about?

Remove operations are now really speedy; on this point I'm really happy.
Doing "yum install something; yum install somethingelse" however still pings
each repo twice, once per command. The same goes for search. On my system the
difference was about 6 seconds between "yum search verylongquery" and "yum -C
search verylongquery". Of course, with the total time spent being 20 or 26
seconds this may not seem like much, but for modem users, or people with bad
connections, it would be even greater.
Perhaps this is something that can be further improved with the new plugin
architecture?
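
One way a plugin could approximate suggestion (b) is to flip yum into
cache-only mode before repository setup for read-only commands. The sketch
below is hypothetical, not the patch discussed in this bug; it assumes the
prereposetup slot and a conduit exposing getCmdLine() and getConf(), as
documented for later yum releases, and that setting conf.cache mirrors the -C
switch:

    # Hypothetical /usr/lib/yum-plugins/metadatacache.py -- a sketch only.
    from yum.plugins import TYPE_CORE

    requires_api_version = '2.1'
    plugin_type = (TYPE_CORE,)

    # Commands assumed safe to answer from local metadata (illustrative policy).
    CACHE_ONLY_COMMANDS = ('remove', 'erase', 'info', 'search')

    def prereposetup_hook(conduit):
        # Assumed API: getCmdLine() returns (options, list of command words).
        opts, commands = conduit.getCmdLine()
        if commands and commands[0] in CACHE_ONLY_COMMANDS:
            conduit.info(2, 'metadatacache: using local metadata only')
            # Assumption: this mirrors what the -C command line switch does.
            conduit.getConf().cache = 1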

Comment 7 Seth Vidal 2005-11-08 19:06:07 UTC
The version in rawhide does not have that patch.

Comment 8 Sigge Kotliar 2005-11-08 23:37:20 UTC
Oh, OK then =)
I'll wait for the next rawhide release of yum and get back to you then.

Comment 9 Seth Vidal 2005-11-08 23:38:33 UTC
wait until yum 2.4.1 hits rawhide.

