Bug 606862 - /var/www/beaker/logs directory structure questionable
Summary: /var/www/beaker/logs directory structure questionable
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Beaker
Classification: Retired
Component: scheduler
Version: 0.5
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Assignee: Dan Callaghan
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 632609
 
Reported: 2010-06-22 16:03 UTC by Matt Brodeur
Modified: 2011-09-28 15:34 UTC (History)
CC List: 4 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2010-09-17 02:19:55 UTC
Embargoed:


Attachments
migrate_logs.py (946 bytes, text/x-python), 2010-09-01 03:53 UTC, Dan Callaghan, no flags
migrate_logs.py (1.35 KB, text/x-python), 2010-09-08 02:05 UTC, Dan Callaghan, no flags
migrate_logs.py (1.42 KB, text/x-python), 2010-09-08 04:27 UTC, Dan Callaghan, no flags
migrate_logs.py (2.34 KB, text/x-python), 2010-09-09 02:10 UTC, Dan Callaghan, no flags

Description Matt Brodeur 2010-06-22 16:03:31 UTC
Description of problem:
The test logs uploaded to the scheduler are stored in directories apparently named like this:
/var/www/beaker/logs/[YEAR]/[LAST-2-DIGITS-OF-JOBID]/[JOBID]/[RECIPESET?]/[RECIPE?]

The choice of directory name between the year and the job ID is what concerns me. There are two issues here. This structure makes it nearly impossible to archive results on anything other than a year boundary, and it will also have scaling issues once we have hundreds of thousands of job IDs: each YYYY/XX directory will then hold tens of thousands of dentries.

Was this intentional?  Could we insert another level, such as month, to make archival easier?


Version-Release number of selected component (if applicable):
0.5.44-0

Comment 1 Kevin Baker 2010-06-22 16:28:05 UTC
How about this structure

/var/www/beaker/logs/[YYYY]/[MM]/[XX]/[JOBID]

Where XX is a counter that increments after Y job IDs have been written to the directory.

For example, assuming XX starts at 00 and Y == 100:

/var/www/beaker/logs/2010/06/00/[JOBIDS 0..99]
/var/www/beaker/logs/2010/06/01/[JOBIDS 100..199]
/var/www/beaker/logs/2010/06/02/[JOBIDS 200..299]

This would permit archiving on a month boundary and give control over how many JOBID directories can be created per XX.

Comment 2 Bill Peck 2010-08-31 18:07:22 UTC
Dan, 

Targeting 0.5.57 which will be scheduled to be deployed Sep 15th.

Let's go with the structure Kevin recommended.

We will need a script, which the admins can run once, to move all the existing data into the new directory structure.

Comment 3 Bill Peck 2010-08-31 18:13:23 UTC
For simplicity, I think I would just do the following for XX:

XX = JOBID / Y
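
For illustration, a minimal sketch of how a job's log directory could be computed under this scheme (assuming Y == 100 and that the month comes from the job's creation date; the helper name is hypothetical, not anything in the beaker code):

import os

Y = 100  # job IDs per XX bucket

def new_log_dir(job_id, created, root="/var/www/beaker/logs"):
    # [YYYY]/[MM]/[XX]/[JOBID], with XX = JOBID / Y (integer division)
    xx = job_id // Y
    return os.path.join(root,
                        "%04d" % created.year,
                        "%02d" % created.month,
                        "%02d" % xx,
                        str(job_id))

# Example: new_log_dir(12345, datetime.date(2010, 6, 22))
#          -> '/var/www/beaker/logs/2010/06/123/12345'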

Comment 4 Dan Callaghan 2010-09-01 03:48:38 UTC
Pushed branch bz606862 for inclusion in 0.5.57:
http://git.fedorahosted.org/git/?p=beaker.git;a=commitdiff;h=079e125b15f9db753306ced99246c2add3c84107

Comment 5 Dan Callaghan 2010-09-01 03:53:00 UTC
Created attachment 442318 [details]
migrate_logs.py

We will need to run a one-off migration script (attached) during the outage, to move all log directories into their new locations. We should check for the presence of any uncaught log files after running this script.

We'll need to test this out on beaker-stage during release prep; I don't think there is anywhere else we can try it out.

Comment 6 Marian Csontos 2010-09-01 05:30:44 UTC
Re: the migration script:
IMO there is potential for conflicts between the old and new schemas. We may end up moving the same file multiple times, ending up in a completely wrong directory.

Proposed solution:
Either do not move into the same top-level directory, or use a different schema, e.g. [YYYY-MM]/[XX]/[JOBID].

Suggestion: use hard links instead of moving:
This way it would be easier to verify that each job has exactly the same files in the new and old schemas.
It would also make it safer to run the script on production data, giving us an easy rollback option:

- run the migration script
- check dir structure
  - if something is wrong: call developer. stop update.
- update beaker
- sanity checks
  - if something is wrong: ...
- remove old data

Comment 7 Bill Peck 2010-09-07 16:22:01 UTC
The migration script needs to work while the system is running.  So after the upgrade, the old log data will not be accessible until the migration script runs.  It should also sort the logs to migrate based on time or test IDs, so that we move the most recent logs first.

Comment 8 Kevin Baker 2010-09-07 18:24:48 UTC
(In reply to comment #7)
> The migration script needs to work while the system is running.  So after the
> upgrade the old log data will not be accessible until the migration script
> runs.  It should also sort the logs to migrate based on the time or test id's. 
> That way we will move the most recent logs first.

Rough outline

1) outage starts. beaker services stopped
2) migration script is started
3) code is upgraded
4) beaker services restarted

Question: if the migration hasn't completed by step 4, how will beaker respond to requests for not-yet-migrated directories?

Comment 9 Dan Callaghan 2010-09-07 23:05:33 UTC
I was expecting that we would run the migration script to completion during the outage. Is there some reason we want to start beaker up before the migration is complete? Is it because we are expecting the migration to take a long time? It will only be writing approx. 16000 links (and a bunch of parent directories) to the filesystem, I would not expect this to take very long at all, or am I wrong?

Comment 10 Dan Callaghan 2010-09-07 23:08:51 UTC
(In reply to comment #8)
> Question:  if the migration hasn't completed by the step 4 how will beaker
> respond to requests for not-yet-migrated-directories?

If we start the new version of beaker before the migration is complete, it will link users to their task logs under the new location which will then give a 404 if clicked (since the logs don't exist under the new path on the filesystem). I think this is pretty harmless: once the migration script is complete all the links will start working again.

Comment 11 Kevin Baker 2010-09-07 23:27:38 UTC
(In reply to comment #9)
> I was expecting that we would run the migration script to completion during the
> outage. Is there some reason we want to start beaker up before the migration is
> complete? Is it because we are expecting the migration to take a long time? It
> will only be writing approx. 16000 links (and a bunch of parent directories) to
> the filesystem, I would not expect this to take very long at all, or am I
> wrong?

My impression from talking with Matt Brodeur is that it will take hours to run.

Comment 12 Kevin Baker 2010-09-07 23:28:41 UTC
(In reply to comment #10)
> (In reply to comment #8)
> > Question:  if the migration hasn't completed by the step 4 how will beaker
> > respond to requests for not-yet-migrated-directories?
> 
> If we start the new version of beaker before the migration is complete, it will
> link users to their task logs under the new location which will then give a 404
> if clicked (since the logs don't exist under the new path on the filesystem). I
> think this is pretty harmless: once the migration script is complete all the
> links will start working again.

I think that is acceptable. We'll inform all users before hand and migrate the most recent logs first. That should be good enough.

Comment 13 Dan Callaghan 2010-09-07 23:46:11 UTC
(In reply to comment #11)
> My impression from talking with Matt Brodeur is that it will take hours to run.

My Thinkpad can write 16000 links (in an analogous structure) to ext3 in about 300ms. :-) Of course it's not a very accurate test, but I think the script will take seconds, not hours.

Maybe Matt was thinking that we were going to copy the contents of all the log files? We will definitely be avoiding that because of how inefficient it would be.

Comment 14 Dan Callaghan 2010-09-08 00:57:27 UTC
Oops, I was forgetting that hard links to directories are forbidden. So there will be substantially more than 16000 links to write. This also makes the migration script much more complicated. :-(

I still think that "hours" for the runtime is a bit pessimistic though.

Comment 15 Kevin Baker 2010-09-08 01:24:38 UTC
(In reply to comment #14)
> Oops, I was forgetting that hard links to directories are forbidden. So there
> will be substantially more than 16000 links to write. This also makes the
> migration script much more complicated. :-(
> 
> I still think that "hours" for the runtime is a bit pessimistic though.

I almost spat out my scotch when I read that. Brodeur IS pessimism incarnate. 

It's not running on ext3 IIRC; it's running on GFS1 on RHEL4, which is noted for poor performance with many small files.

Comment 16 Dan Callaghan 2010-09-08 02:05:48 UTC
Created attachment 445811 [details]
migrate_logs.py

Comment 17 Dan Callaghan 2010-09-08 02:13:36 UTC
New version of migration script attached. Per mcsontos' suggestion, this version creates a duplicate of each job's log directory using cp -al with the revised directory structure under a separate root (/var/www/beaker/migrated-logs). This lets us check the sanity of the migrated directories (e.g. by comparing `find -type f | wc -l`). Then once we're happy we could remove /var/www/beaker/logs and move the migrated-logs directory into its place.

bpeck is going to test out the script using a copy of the log data from production. This will give us an idea of how it will perform. I've estimated that it will need to create approx 17 million hard links (and a whole bunch of parent directories). I'm still hopeful that it will run in a decent amount of time, so that we can do the entire migration during the outage and avoid the risk and complexity of running it while beaker is up.
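
Roughly, the approach described above amounts to something like the following (a sketch only, not the attached migrate_logs.py; the helper names and the use of cp -al via subprocess are assumptions based on this comment):

import os
import subprocess

OLD_ROOT = "/var/www/beaker/logs"
NEW_ROOT = "/var/www/beaker/migrated-logs"

def migrate_job_dir(old_dir, new_dir):
    # Hard-link the job's log tree into the new layout; the original files
    # are left untouched, which keeps the rollback option open.
    parent = os.path.dirname(new_dir)
    if not os.path.isdir(parent):
        os.makedirs(parent)
    subprocess.check_call(["cp", "-al", old_dir, new_dir])

def count_files(root):
    # Equivalent of `find <root> -type f | wc -l`, for the sanity check.
    return sum(len(files) for _, _, files in os.walk(root))

# Once every job has been migrated, compare file counts before removing
# the old tree: count_files(OLD_ROOT) == count_files(NEW_ROOT)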

Comment 18 Matt Brodeur 2010-09-08 03:38:16 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > Oops, I was forgetting that hard links to directories are forbidden. So there
> > will be substantially more than 16000 links to write. This also makes the
> > migration script much more complicated. :-(
> > 
> > I still think that "hours" for the runtime is a bit pessimistic though.
> 
> I almost spat out my scotch when I read that. Brodeur IS pessimism incarnate. 

# time (find /var/www/beaker/logs -type f | wc)
2926370 2926370 226308046

real    125m19.147s
user    0m4.876s
sys     0m21.789s


3 million files, just over two hours *just*to*count*them*.



> It's not running on ext3 IIRC, it's running on GFS1 on RHEL4 which is noted for
> poor performance on many small files.

As we covered earlier, the entire beaker logs repository is still on ext3.  We can't split out older results until after this migration.

Comment 19 Dan Callaghan 2010-09-08 04:27:12 UTC
Created attachment 445839 [details]
migrate_logs.py

Revised migration script which will do newest logs first.

Per discussion with bpeck, jobs which were started before the outage might attempt to append to their logs once we start beaker back up. That means we need to at least wait until the script finishes copying the logs for jobs which might still be running; otherwise beaker would attempt to append to logs which haven't been moved into their new location yet.

Comment 20 Dan Callaghan 2010-09-09 02:10:28 UTC
Created attachment 446123 [details]
migrate_logs.py

Attaching updated migration script. This version selects all currently running jobs and moves them first. It's also more verbose in reporting its current status, and has some more robust error handling.
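
The ordering could look roughly like this (again just a sketch, not the attached script; the job list and running flag are assumed inputs):

def migration_order(jobs):
    # jobs: list of (job_id, is_running) tuples.
    # Running jobs go first so beaker can append to their logs as soon as
    # possible after the restart; everything else is done newest-first.
    running = [job_id for job_id, is_running in jobs if is_running]
    finished = [job_id for job_id, is_running in jobs if not is_running]
    return sorted(running, reverse=True) + sorted(finished, reverse=True)

# Example: migration_order([(101, False), (205, True), (150, False)])
#          -> [205, 150, 101]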

I've tried this out on beaker-stage; it seems to work well there.

Comment 22 Bill Peck 2010-09-14 14:25:08 UTC
Testing complete on sun-v40z-01.rhts.eng.bos.redhat.com. 

mbrodeur copied the logs from production and restored a sanitized version of the db.  

It took 2 hours and 18 minutes to do the conversion, but it only took about 5 minutes to move the running recipe logs.

