Bug 664356 - RFE: using git to store and fetch metadata
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: yum
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Seth Vidal
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-12-20 04:40 UTC by Amit Shah
Modified: 2014-01-21 23:17 UTC (History)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-01-04 16:17:07 UTC
Type: ---
Embargoed:



Description Amit Shah 2010-12-20 04:40:22 UTC
Description of problem:

Here's an idea for speeding up the metadata downloads for yum operations:  use git to store and fetch metadata instead of downloading the metadata files.

The metadata is text (sqlite), and can be very well stored and diffed by git.

Not all fedora mirrors have a git server, so using a canonical git server for yum metadata and continuing to use the existing mirrors for rpms can be the default setting.

Since git is much faster at downloading deltas, the metadata update step that's performed every day (by default) will be much faster, especially for people with slow network connectivity, who already benefit from deltarpms.

Comment 1 seth vidal 2010-12-20 15:41:32 UTC
A single site for the metadata would be a mess - not only b/c no one site could stand up to the load but also b/c it's a single point of failure (SPOF).

However, why would mirrors need to run a git server?

git clone --depth=2 http://mirror/someplace/ should work.

I suspect ftp wouldn't work.

Something would need to be done to keep the server master from growing w/o bound b/c of every change being saved into the git repo.
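
A minimal Python sketch of the shallow clone suggested above, assuming git is installed on the client. The helper names and the destination directory are hypothetical; only the `git clone --depth` and `git fetch --depth` options come from git itself.

```python
import subprocess


def shallow_clone_cmd(url, dest, depth=2):
    """Build the git command for a history-limited clone of the metadata."""
    return ["git", "clone", "--depth={}".format(depth), url, dest]


def shallow_update_cmd(depth=2):
    """Build the command that refreshes an existing shallow clone."""
    return ["git", "fetch", "--depth={}".format(depth), "origin"]


def run(cmd, cwd=None):
    """Run a git command; raises CalledProcessError if git exits non-zero."""
    subprocess.run(cmd, cwd=cwd, check=True)


if __name__ == "__main__":
    # The URL is the placeholder from the comment above, not a real mirror.
    print(" ".join(shallow_clone_cmd("http://mirror/someplace/", "repodata-git")))
```

This only builds and runs the commands. Shallow fetches generally need a server that speaks git's smart protocol, so a purely static HTTP mirror may not honor `--depth`.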

Comment 2 Amit Shah 2010-12-21 08:30:26 UTC
(In reply to comment #1)
> A single site for the metadata would be a mess  - not only b/c no one site
> could stand up to the load but also b/c it's a spof.
> 
> However why would mirrors need to run a git server.

Because the git protocol is efficient.

> git clone --depth=2 http://mirror/someplace/ should work.

Yes, would work too, but would put some extra load on the server.

> I suspect ftp wouldn't work.
> 
> Something would need to be done to keep the server master from growing w/o
> bound b/c of every change being saved into the git repo.

We could have a new git repo for each release, and not have to worry about growing git repo size.

Comment 3 Tim Lauridsen 2010-12-31 08:45:40 UTC
(In reply to comment #2)
> (In reply to comment #1)
> > A single site for the metadata would be a mess  - not only b/c no one site
> > could stand up to the load but also b/c it's a spof.
> > 
> > However why would mirrors need to run a git server.
> 
> Because the git protocol is efficient.

Running git on mirrors is not an option IMO. There are a lot of mirrors out there,
and requiring them to run git to host a Fedora mirror will reduce that number a lot.

> 
> > git clone --depth=2 http://mirror/someplace/ should work.
> 
> Yes, would work too, but would put some extra load on the server.
> 
> > I suspect ftp wouldn't work.
> > 
> > Something would need to be done to keep the server master from growing w/o
> > bound b/c of every change being saved into the git repo.
> 
> We could have a new git repo for each release, and not have to worry about
> growing git repo size.

There are a lot of metadata changes in a Fedora release, so I'm not sure that one repo for each release is enough to reduce the size of the git repo.

I think it is a good idea to have some kind of delta metadata, but I'm not convinced that git is the way to go. I think there must be a simpler way to do it, one that doesn't make yum's requirements explode or make the mirror setup more complex.

I think it is too early to start planning a feature before any kind of technical testing is done to prove that the benefit is worth the effort.

Comment 4 Amit Shah 2011-01-04 13:28:16 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > A single site for the metadata would be a mess  - not only b/c no one site
> > > could stand up to the load but also b/c it's a spof.
> > > 
> > > However why would mirrors need to run a git server.
> > 
> > Because the git protocol is efficient.
> 
> Running git on mirrors is not an option IMO, there are a lot of mirror out
> there
> and needing them to run git to host a Fedora mirror, will reduce that number a
> lot.

It shouldn't be a requirement, just a priority.  Also, the closest git mirror could be ranked higher than the closest http mirror.
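
As a rough sketch of that ranking idea: prefer any git-capable mirror, then fall back to plain HTTP mirrors, ordering each group by measured latency. The `Mirror` type, the `git://` URL test and the example hosts are all hypothetical, and this is not how yum's mirror selection actually works.

```python
from collections import namedtuple

# Hypothetical record: a mirror URL plus a measured round-trip time.
Mirror = namedtuple("Mirror", ["url", "latency_ms"])


def rank_mirrors(mirrors):
    """Sort git mirrors ahead of other mirrors, nearest first in each group."""
    def key(mirror):
        is_git = mirror.url.startswith("git://")
        # Tuples sort element by element: protocol group, then latency.
        return (0 if is_git else 1, mirror.latency_ms)
    return sorted(mirrors, key=key)


if __name__ == "__main__":
    candidates = [
        Mirror("http://a.example.org/fedora", 10),
        Mirror("git://b.example.org/fedora-md", 80),
        Mirror("git://c.example.org/fedora-md", 30),
    ]
    for mirror in rank_mirrors(candidates):
        print(mirror.url)
```

Because git is only a preference here, a client with no reachable git mirror still ends up with the nearest HTTP mirror.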

It's also difficult to say that a single Fedora metadata git server would have to serve lots of data without actually seeing how much data gets served and how efficient git is.

> > > git clone --depth=2 http://mirror/someplace/ should work.
> > 
> > Yes, would work too, but would put some extra load on the server.
> > 
> > > I suspect ftp wouldn't work.
> > > 
> > > Something would need to be done to keep the server master from growing w/o
> > > bound b/c of every change being saved into the git repo.
> > 
> > We could have a new git repo for each release, and not have to worry about
> > growing git repo size.
> 
> There is a lot of metadata changes in a Fedora release, so I'm not sure that
> one repo for each release is enough to reduce the size of the git repo.

Lots of changes are fine.  What matters is the size of the diff between successive metadata updates.  If someone can point me to a place which hosts all metadata updates in, say, the Fedora 13 release, I can create a git tree and see how much it grows.

I suspect it's not going to be much.
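
The experiment described above could be scripted roughly as follows. This is a sketch under stated assumptions: each metadata snapshot is a flat directory of files, git is on the PATH, and the helper names and the committer identity are made up for illustration.

```python
import os
import shutil
import subprocess


def dir_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def run_git(repo, *args):
    """Run git inside repo with a throwaway identity so commits succeed."""
    cmd = ["git", "-C", repo,
           "-c", "user.name=metadata-test",
           "-c", "user.email=metadata-test@example.org"]
    subprocess.run(cmd + list(args), check=True)


def measure_growth(repo, snapshot_dirs):
    """Commit each snapshot in order; return the .git size after each one."""
    os.makedirs(repo, exist_ok=True)
    run_git(repo, "init", "-q")
    sizes = []
    for snap in snapshot_dirs:
        # Replace the working tree with this snapshot (flat layout assumed).
        for name in os.listdir(repo):
            if name != ".git":
                os.remove(os.path.join(repo, name))
        for name in os.listdir(snap):
            shutil.copy(os.path.join(snap, name), repo)
        run_git(repo, "add", "-A")
        run_git(repo, "commit", "-q", "--allow-empty", "-m", "snapshot")
        # Repack so the measured size reflects git's delta compression.
        run_git(repo, "gc", "-q")
        sizes.append(dir_size(os.path.join(repo, ".git")))
    return sizes
```

Comparing the returned sizes with the summed size of the raw snapshots would show how much git's packing actually saves on this kind of data.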

> I think it is a good idea to have some kind of delta metadata, but I'm not
> convinced that git is the way to go, I think there must be a simpler way to do,
> there don't make the requirement to yum explode and makes to mirror setup more
> complex.

Using the right tool for the right job should be enough of an incentive.  Git fits the bill perfectly.

> I think it is to early to start making planing for a feature before any kind om
> technical testing is done to prove that the benefit is worth the effort.

Of course, that's how development is done anyway.

Comment 5 James Antill 2011-01-04 16:17:07 UTC
> Using the right tool for the right job should be enough of an incentive.  Git
> fits the bill perfectly.

 Yes, we assume you think this, but you have given no reasons why ... and we are far from convinced. Yes, git has delta code ... but so do lots of other things. Yes, we want delta metadata just as we wanted delta rpms ... but git brings with it a huge amount of extra baggage, and AFAIK nobody is (ab)using git in anything like the way this would.

 How does git handle an rsync running in the background?
 How many public mirrors would be willing to run the git code?
 How many private mirrors?
 Would this ever appear in Spacewalk?
 What happens when you move from updates/13 to updates/14 ... this kind of thing just doesn't happen in normal git usage.
 What happens with rawhide ... does the repo. just become 666 GB? Yes, the clients might be able to use --depth (and that will be _interesting_ I bet), but any mirrors can't.

 You say "The metadata is text (sqlite)", which confuses me. One of our problems atm. is that doing "rolling deltas" on the XML is "kind of easy" as it's just additions/subtractions of entries ... but on the sqlite side it's much more complex (really what we need is an sqlite specific differ ... but it'd need to be version agnostic and produce byte for byte identical results).
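
To make the "additions/subtractions of entries" idea concrete, here is a small row-level differ over a simplified, assumed `packages` table keyed by `pkgId`. The three-column schema and the function names are illustrative only, not the real yum schema.

```python
import sqlite3


def load_rows(conn):
    """Return {pkgId: (name, version)} for every row in packages."""
    cur = conn.execute("SELECT pkgId, name, version FROM packages")
    return {row[0]: (row[1], row[2]) for row in cur}


def diff_packages(old_conn, new_conn):
    """Compute the entries to remove from and add to the old database."""
    old, new = load_rows(old_conn), load_rows(new_conn)
    removed = sorted(k for k in old if k not in new)
    added = sorted((k,) + new[k] for k in new if k not in old)
    return removed, added


def apply_delta(conn, removed, added):
    """Apply a delta produced by diff_packages to an old database."""
    conn.executemany("DELETE FROM packages WHERE pkgId = ?",
                     [(k,) for k in removed])
    conn.executemany(
        "INSERT INTO packages (pkgId, name, version) VALUES (?, ?, ?)", added)
    conn.commit()
```

After applying the delta the old database holds the same rows as the new one, but the file on disk is not guaranteed to match byte for byte, which is exactly the harder requirement raised above.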

 Personally using git for this seems like a giant hack, with almost no upside. What we need is:

1. Some way to diff the sqlite (and have the size match the size of the changes to the packages).

2. Some way to store the diffs over multiple runs, probably need at least a week ... size _might_ be a problem here (that's one of the problems still with drpms).

3. Some metadata/protocol/whatever for the client to get the deltas.

...from what I understand git doesn't actually help with #1, because it doesn't have a good binary delta generator (and, as I said, likely we need something that knows about sqlite specifically).
 Git doesn't really help with #2, because it stores everything forever, which I don't think is viable ... and I'm pretty sure that's not easy to fix.
 git does help with #3, but that's the easiest problem ... and git's solution is not very good (it's very complex, and I doubt there are any good APIs for talking to git from python).
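
For comparison, requirements #2 and #3 can be sketched without git. The sketch assumes a linear history in which each metadata snapshot is identified by a checksum and has at most one successor. The `Delta` record, the one-week default and the function names are hypothetical.

```python
from collections import namedtuple

# src/dst are snapshot checksums; size is in bytes; created is a Unix time.
Delta = namedtuple("Delta", ["src", "dst", "size", "created"])

WEEK = 7 * 24 * 3600


def prune(deltas, now, max_age=WEEK):
    """Requirement 2: keep only deltas inside the retention window."""
    return [d for d in deltas if now - d.created <= max_age]


def plan_update(deltas, have, want, full_size):
    """Requirement 3: pick the deltas leading from `have` to `want`.

    Returns the deltas in order, or None when the client should download
    the full metadata (chain broken, or deltas no smaller than the file).
    """
    by_src = {d.src: d for d in deltas}
    chain, total, current, seen = [], 0, have, set()
    while current != want:
        step = by_src.get(current)
        if step is None or current in seen:
            return None  # no delta for this snapshot, or a cycle
        seen.add(current)
        chain.append(step)
        total += step.size
        if total >= full_size:
            return None  # cheaper to fetch the whole file
        current = step.dst
    return chain
```

A client whose snapshot is older than the retention window simply gets `None` and falls back to the full download, so the server side stays bounded.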

 In many ways "rsync" is a _much_ better fit, given that it "solves" #2 and #3 ... it _might_ even be able to do #1, at least some of the time. It requires the mirrors run rsync (although we have that info. already). It also has the giant API problem (which is why nobody uses it in code).
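
A very simplified illustration of the rsync-style idea for #1: hash fixed-size blocks of the old file, then describe the new file as copies of known blocks plus literal bytes. Real rsync pairs a cheap rolling checksum with a strong hash, whereas this sketch rehashes every window with SHA-256 for clarity. The function names are made up.

```python
import hashlib


def signature(old, block_size):
    """Map the hash of each fixed-size block of old to its block index."""
    sig = {}
    for start in range(0, len(old), block_size):
        digest = hashlib.sha256(old[start:start + block_size]).digest()
        sig.setdefault(digest, start // block_size)
    return sig


def make_block_delta(sig, new, block_size):
    """Encode new as ("copy", block_index) and ("data", bytes) operations."""
    ops, literal, pos = [], bytearray(), 0
    while pos < len(new):
        block = new[pos:pos + block_size]
        index = sig.get(hashlib.sha256(block).digest())
        if index is None:
            # No known block starts here: emit one literal byte and slide on.
            literal.append(new[pos])
            pos += 1
            continue
        if literal:
            ops.append(("data", bytes(literal)))
            literal = bytearray()
        ops.append(("copy", index))
        pos += len(block)
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def apply_block_delta(old, ops, block_size):
    """Rebuild the new file from the old file plus the operations."""
    out = bytearray()
    for kind, arg in ops:
        if kind == "copy":
            out += old[arg * block_size:(arg + 1) * block_size]
        else:
            out += arg
    return bytes(out)
```

How well this works on sqlite files depends on how much of the file is rewritten between runs, which fits the "at least some of the time" caveat above.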

