Currently beaker-repo-update is implemented as a kind of copy-paste of the original /usr/bin/reposync from Yum, with a bunch of extra behaviour. But Yum is EOL, and the code has a lot of problematic behaviour (see bug 1619969).

We should rewrite it to use either:
* librepo + libdnf/libsolv, assuming its API gives us what we need, and assuming that by the time we tackle this RFE we are no longer targeting RHEL6; or
* requests + our own minimal repodata parser using lxml.

Existing behaviour that must be preserved:
* only try to sync OS majors that have existing trees in Beaker
* grab repodata from under the given URL and fetch the *latest* version of each package for each OS major
* write the package to disk in /var/www/beaker/harness
* if any package changed, regenerate repodata using createrepo_c
* if there is no upstream harness repo for a given OS major, print a non-scary warning (for example Atomic, RHVH4, etc.)

Some sorely needed improvements that the current version can't do:
* re-use the package on disk only if it matches the expected checksum
* verify the checksum (against the repodata checksum) after downloading a package
* ignore OS majors that are entirely absent, but *don't* ignore other errors like write failures or checksum failures. These should be an immediate hard error with a good message, so that the error is not lost amongst the spew as subsequent repos get downloaded
* write out packages with the proper atomic rename dance, so that an interrupted download does not leave an incomplete package
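To make the checksum and atomic-rename items above concrete, here is a minimal sketch using only the Python standard library. The function names, the `.tmp-` prefix, the 0644 mode, and the default `sha256` algorithm are all assumptions for illustration and not existing Beaker code. The download itself is left out: `chunks` is any iterable of bytes, for example what an HTTP client would yield while streaming.

```python
import hashlib
import os
import tempfile


def file_matches_checksum(path, expected_hex, algo="sha256"):
    """Return True only if path exists and its digest equals expected_hex."""
    if not os.path.exists(path):
        return False
    digest = hashlib.new(algo)
    with open(path, "rb") as f:
        # Read in 1 MiB blocks so large packages are not loaded into memory.
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest() == expected_hex


def write_package_atomically(dest_path, chunks, expected_hex, algo="sha256"):
    """Write chunks to a temp file, verify the checksum, then rename into place.

    Raises ValueError on a checksum mismatch. In every failure case the
    temp file is removed and any existing file at dest_path is left alone.
    """
    dest_dir = os.path.dirname(dest_path) or "."
    digest = hashlib.new(algo)
    # The temp file must live in the destination directory so that the
    # final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                digest.update(chunk)
                tmp.write(chunk)
            tmp.flush()
            os.fsync(tmp.fileno())
        actual_hex = digest.hexdigest()
        if actual_hex != expected_hex:
            raise ValueError("checksum mismatch for %s: expected %s, got %s"
                             % (dest_path, expected_hex, actual_hex))
        # mkstemp creates the file with mode 0600; 0644 is an assumption
        # about what the web server needs in order to serve it.
        os.chmod(tmp_path, 0o644)
        # On POSIX, rename atomically replaces any existing destination.
        os.rename(tmp_path, dest_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A caller would first try `file_matches_checksum()` on the existing package and only download when it returns False. Letting the `ValueError` propagate rather than catching it per repo is what would give the immediate hard error asked for above.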
Another improvement to add to the list:
* don't use any cached repodata, or only use cached repodata after first fetching repomd.xml and ensuring that the repodata we are reusing matches the checksum listed there for the current repodata.

In bug 1619969 I noticed that in certain error conditions, Yum falls back to using a local cache of the repodata left behind in /var/tmp/yum-*, but that is never what we want.
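As a rough illustration of the repomd.xml check described above, the sketch below reads the checksum entries out of a freshly fetched repomd.xml and compares a cached metadata file against them. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, and the function names are hypothetical. It assumes the usual repomd.xml layout, with `data` elements in the `http://linux.duke.edu/metadata/repo` namespace that each carry a `checksum` and a `location`.

```python
import hashlib
import xml.etree.ElementTree as ET

REPO_NS = "{http://linux.duke.edu/metadata/repo}"

# Older repos label SHA-1 checksums as "sha", which hashlib does not accept.
ALGO_ALIASES = {"sha": "sha1"}


def expected_checksums(repomd_xml):
    """Map each metadata type in repomd.xml to (algorithm, hex digest, href).

    repomd_xml is the raw bytes of a freshly downloaded repomd.xml.
    """
    root = ET.fromstring(repomd_xml)
    result = {}
    for data in root.findall(REPO_NS + "data"):
        checksum = data.find(REPO_NS + "checksum")
        location = data.find(REPO_NS + "location")
        if checksum is None or location is None:
            continue
        result[data.get("type")] = (
            checksum.get("type"),
            (checksum.text or "").strip(),
            location.get("href"),
        )
    return result


def cached_copy_is_current(cached_bytes, algo, expected_hex):
    """Return True if the cached metadata file hashes to the expected digest."""
    algo = ALGO_ALIASES.get(algo, algo)
    return hashlib.new(algo, cached_bytes).hexdigest() == expected_hex
```

The checksum compared here is the one for the file as stored in the repo (typically compressed), not the open-checksum of the decompressed content. Anything that fails the check would simply be downloaded again rather than trusted.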
I've implemented some of the above improvements within the existing Yum-based command for bug 1619969, but it would still be nicer if we could clean it up and simplify it by avoiding the Yum APIs, which are very difficult to use correctly.
There is another issue I noticed while working on this. I don't think we have ever hit it, so it is purely theoretical, but it would be good to fix it in this new implementation.

If you interrupt beaker-repo-update while it's running createrepo, or if createrepo fails for some reason, and you then re-run beaker-repo-update, it won't actually run createrepo again. That's because, as an optimisation, it avoids running createrepo if it hasn't downloaded any new packages in a repo.

I am not sure what the best way is to keep that optimisation while avoiding this problem when createrepo fails. Some ideas come to mind, like checking the mtime of the repodata directory, or writing a .dirty marker file whenever a new package is downloaded and only removing the marker file after createrepo has completed successfully.
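The marker-file idea could look roughly like this. It is only a sketch: the `.dirty` filename comes from the suggestion above, but the function names and the extra "repodata directory is missing" check are assumptions. The `run` parameter exists purely so the command runner can be swapped out in tests.

```python
import os
import subprocess

DIRTY_MARKER = ".dirty"


def mark_dirty(repo_dir):
    """Record that repo_dir has packages not yet covered by its repodata."""
    with open(os.path.join(repo_dir, DIRTY_MARKER), "w"):
        pass


def needs_createrepo(repo_dir):
    """True if the marker exists or the repo has no repodata directory yet."""
    return (os.path.exists(os.path.join(repo_dir, DIRTY_MARKER))
            or not os.path.isdir(os.path.join(repo_dir, "repodata")))


def regenerate_if_dirty(repo_dir, run=subprocess.check_call):
    """Run createrepo_c when needed; clear the marker only on success.

    Returns True if createrepo_c was run. If run() raises, the marker is
    left in place so that the next invocation tries again.
    """
    if not needs_createrepo(repo_dir):
        return False
    run(["createrepo_c", repo_dir])
    marker = os.path.join(repo_dir, DIRTY_MARKER)
    if os.path.exists(marker):
        os.unlink(marker)
    return True
```

For this to be safe against interruption, `mark_dirty()` would need to be called before a new package is renamed into place, so there is never a moment where a new package is on disk without the marker.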