Bug 1322060

Summary: beaker_expire_distros broken ?
Product: [Retired] Beaker Reporter: MikeBoswell <mboswell>
Component: lab controller Assignee: beaker-dev-list
Status: CLOSED NOTABUG QA Contact: tools-bugs <tools-bugs>
Severity: unspecified Docs Contact:
Priority: high    
Version: 22 CC: dcallagh, drohwer, mjia, rjoost
Target Milestone: future_maint Keywords: Patch
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-04-04 03:44:23 UTC Type: Bug

Comment 2 matt jia 2016-03-30 06:43:00 UTC
On Gerrit:

  http://gerrit.beaker-project.org/#/c/4770/

With this patch, beaker-expire-distros can handle HTTP error 403.

Comment 4 matt jia 2016-03-30 23:43:06 UTC
We could use --ignore-errors to skip those 403 or NFS server errors. Then beaker-expire-distros would not stop, but the downside is that the affected distros will not be expired until the errors are fixed.

Comment 5 Dan Callaghan 2016-03-31 07:27:35 UTC
beaker-expire-distros was working correctly here, as far as I can see.

There were two separate problems. Some stuff under /fedmsg/dumpdata/ was returning 403 Forbidden. Presumably that's because something got messed up on the server (wrong filesystem perms?) and so Apache could no longer read the directory it was configured to serve up.

There's no way for beaker-expire-distros to know whether 403 means the distro is gone, or the server config is broken. It's actually more likely to be the latter. We intentionally designed beaker-expire-distros to err on the side of *not* removing a distro when the URL is in error. It only considers 404 (and 410) to mean that the distro is deleted; any other response code is just treated as an error.

It used to do the opposite (it assumed any error response meant the distro was deleted), but we kept hitting situations where a temporary problem with a lab mirror would cause every distro to be removed from that lab in Beaker, and then everyone would complain because their jobs wouldn't run until the admins re-imported all the distros.

So in this case beaker-expire-distros was seeing the 403, it was reporting it in its output, and then it was *not* expiring the distro under the assumption that the admin will fix it up.
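To make the rule above concrete, here is a minimal sketch of the classification in Python. It is not the actual beaker-expire-distros code: the names TreeStatus, classify_response and should_expire are made up for illustration, and treating 2xx as "present" is an assumption (the comment above only says that 404 and 410 mean deleted and that other responses are errors). It also assumes the status code is the final one, after any redirects have been followed.

```python
from enum import Enum


class TreeStatus(Enum):
    """Hypothetical outcome of probing a distro tree URL."""
    PRESENT = "present"
    DELETED = "deleted"
    ERROR = "error"


def classify_response(status_code):
    """Map an HTTP status code to a tree status.

    Only 404 and 410 are taken as proof that the tree is gone.
    2xx is assumed to mean the tree is still there. Everything else
    (403, 500, ...) is an error that tells us nothing about the tree.
    """
    if status_code in (404, 410):
        return TreeStatus.DELETED
    if 200 <= status_code < 300:
        return TreeStatus.PRESENT
    return TreeStatus.ERROR


def should_expire(status_code):
    """Expire only when the server positively says the tree is deleted."""
    return classify_response(status_code) is TreeStatus.DELETED
```

With this shape, a 403 is reported as an error and the distro stays in place until somebody fixes the server.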

The second problem was that /net/<oldfiler> did not exist. This is kind of similar: beaker-expire-distros does a sanity check to make sure that /net/<filer> exists, because if it doesn't then it means the filer is offline or unmounted or has changed names or something like that. Again, it reports the error and then errs on the side of *not* expiring all the distros, for the same reason.
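The /net sanity check could look roughly like the sketch below. Again this is only an illustration, not the real implementation: filer_root and check_nfs_tree are hypothetical names, the string results are made up, and the paths in the usage example are invented. The `exists` parameter is there only so the logic can be exercised without a real automount.

```python
import os
import posixpath


def filer_root(tree_path):
    """Return '/net/<filer>' for a path under /net, otherwise None."""
    parts = posixpath.normpath(tree_path).split("/")
    # '/net/filer/a/b' splits into ['', 'net', 'filer', 'a', 'b']
    if len(parts) >= 3 and parts[0] == "" and parts[1] == "net":
        return "/".join(parts[:3])
    return None


def check_nfs_tree(tree_path, exists=os.path.exists):
    """Return 'present', 'deleted' or 'error' for a tree under /net.

    A missing tree only counts as deleted when the filer root itself
    is visible. If /net/<filer> is missing, the filer may be offline,
    unmounted or renamed, so we report an error and expire nothing.
    """
    root = filer_root(tree_path)
    if root is None or not exists(root):
        return "error"
    if exists(tree_path):
        return "present"
    return "deleted"
```

For example, if only /net/filer1 is mounted, a missing path under /net/filer1 comes back as "deleted", while any path under /net/oldfiler comes back as "error".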

It seems like the only bug here is that beaker-expire-distros was trying to tell us about these problems but the error message was mixed in with a huge pile of spew that just went into a cron mail and nobody ever saw it. So we could improve how beaker-expire-distros reports these problems and what its output looks like -- but I'm not sure what a better approach would be.
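One possible direction, purely as a sketch and not something beaker-expire-distros does today as far as this report shows: collect the errors while scanning and print a short grouped summary at the top of the output, so the cron mail starts with the part that needs attention. The function name summarize_errors, the (tree, reason) pair format and the message wording are all assumptions.

```python
from collections import defaultdict


def summarize_errors(errors):
    """Build a short summary from (tree, reason) pairs.

    Returns an empty string when there were no errors, so the caller
    can skip the summary entirely on a clean run.
    """
    by_reason = defaultdict(list)
    for tree, reason in errors:
        by_reason[reason].append(tree)
    if not by_reason:
        return ""
    total = sum(len(trees) for trees in by_reason.values())
    lines = ["%d distro tree(s) NOT expired because of errors:" % total]
    for reason in sorted(by_reason):
        trees = by_reason[reason]
        # Show the count and one example per reason instead of every tree.
        lines.append("  %s: %d tree(s), e.g. %s" % (reason, len(trees), trees[0]))
    return "\n".join(lines)
```

A few lines like "HTTP 403: 2 tree(s), e.g. ..." at the top are harder to miss than the same information scattered through the per-tree output.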

In an ideal world Beaker would not be scraping around in /net and over HTTP to figure out when these trees are deleted, which is the point of the PDC integration we are planning...

Comment 6 MikeBoswell 2016-04-01 20:37:21 UTC
(In reply to Dan Callaghan from comment #5)

Thank you, Dan. It seems that after we unstuck fedmsg, things got moving again. Totally agree with the explanation around exiting for a 403.

Comment 7 Dan Callaghan 2016-04-04 03:44:23 UTC
I'll close this as NOTABUG but if you have any suggestions about how to improve beaker-expire-distros' error messages in the short term, let us know and we can implement them.