Red Hat Bugzilla – Bug 636574
[Beaker] System fails to reboot from webUI if distro profile has been removed.
Last modified: 2012-09-13 10:56:31 EDT
Beaker will be getting this exception in the save_system call to cobbler at bkr/server/model.py:1326 in the power method.
It wouldn't happen when Beaker provisions a system, since in that case we will have set a (hopefully) valid profile just before the call to power(). But when someone uses the manual power controls on the system page, we never bother updating the profile and so cobbler will barf if it's still set to an old profile which has been cleaned out. Same deal when Beaker powers off a system after the user returns it (bug 634247).
I'm disinclined to catch the exception inside the power method and just reset the profile somehow, because if we called power() as part of a provision we really do want to know if the profile we are using has disappeared. On the other hand, if we're just using cobbler for power control then I don't think it matters what profile we are using.
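To make the distinction above concrete, here is a minimal sketch of what "tolerate a missing profile only for manual power control" could look like. It is not Beaker's code: `save_for_power`, the `save` and `reset_profile` callables, and the `provisioning` flag are hypothetical names, and it uses Python 3's `xmlrpc.client` (the successor of the `xmlrpclib` module) purely for illustration.

```python
import xmlrpc.client

# Substring of the cobbler fault text reported in this bug.
MISSING_PROFILE_MARKER = "references a missing profile"


def save_for_power(save, reset_profile, provisioning):
    """Call save(); tolerate a missing-profile fault only for manual power control.

    save          -- hypothetical callable that performs the save_system call
    reset_profile -- hypothetical callable that points the system at some valid profile
    provisioning  -- True when power() is being called as part of a provision
    """
    try:
        save()
    except xmlrpc.client.Fault as fault:
        # During a provision we really do want to know the profile is gone,
        # and any unrelated fault should always propagate.
        if provisioning or MISSING_PROFILE_MARKER not in fault.faultString:
            raise
        # Manual power control: the profile is irrelevant, so reset and retry once.
        reset_profile()
        save()
```

The retry is deliberately limited to one attempt, so a second failure still surfaces to the caller.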
I think maybe the right fix here is to fix cobbler so that it removes its record of the system when the distro and its profile are removed.
In my testing with the cobbler CLI (which is just a wrapper for the XMLRPC API I think?) this does actually work properly -- if I create a system, referencing a profile, referencing a distro, then delete that distro, all three are removed. So I need to figure out how the old distros are being removed from cobbler on our lab controllers and why it isn't cleaning up system records properly. Maybe it's a bug in a newer version of cobbler? If so, I think we're better off getting it fixed in cobbler rather than adding code to Beaker to work around it.
Thanks for investigating this. Ultimately this is a bug in cobbler, but I don't have a lot of hope for getting it fixed there.
Any other thoughts on how to fix it then?
I should read all your comments before adding mine. :-)
If it works on your local version of cobbler then it should be fixable.
What version of cobbler are you using? Can you try it on lab-devel or lab-stage?
I've got cobbler-126.96.36.199-3.el5 on my VM, which is slightly older than what we have on beaker-devel. It could also be something to do with our usage of cobbler in the labs. I will look into this further today to try and pin down exactly why the system records aren't being removed.
I think this might actually be a race of sorts. So far the only way I've succeeded in reproducing it is as follows:
>>> import xmlrpclib
>>> c = xmlrpclib.ServerProxy('http://localhost/cobbler_api')
>>> tok = c.login('testing', 'testing')
>>> s = c.new_system(tok)
>>> c.modify_system(s, 'name', 'testing.example.com', tok)
>>> c.modify_system(s, 'profile', 'Fedora11-x86_64', tok)
>>> c.save_system(s, tok)
>>> # in another window: cobbler distro remove --name=Fedora11-x86_64
>>> # now cobbler list shows the distro, profile, and system have been removed
>>> c.modify_system(s, 'power_address', 'notexist.example.com', tok)
>>> # at this point, cobbler has somehow resurrected the deleted system record
>>> c.save_system(s, tok)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib64/python2.4/xmlrpclib.py", line 1096, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.4/xmlrpclib.py", line 1383, in __request
  File "/usr/lib64/python2.4/xmlrpclib.py", line 1147, in request
    return self._parse_response(h.getfile(), sock)
  File "/usr/lib64/python2.4/xmlrpclib.py", line 1286, in _parse_response
  File "/usr/lib64/python2.4/xmlrpclib.py", line 744, in close
xmlrpclib.Fault: <Fault 1: "cobbler.cexceptions.CX:'system testing.example.com references a missing profile Fedora11-x86_64'">
But I would expect this race to be really hard to hit, and yet we've seen it at least three times recently (bug 634247, this bug, and it happened to me the other week). So I must be missing something here still.
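For anyone who wants to spot this particular failure in logs or in calling code, the fault text in the traceback can be picked apart with a regular expression. This is only a sketch: `parse_missing_profile` is a hypothetical helper, the pattern is derived solely from the one message quoted in this bug, and it uses Python 3's `xmlrpc.client.Fault` rather than the Python 2.4 `xmlrpclib` shown in the transcript.

```python
import re
import xmlrpc.client

# Pattern based on the single fault message quoted in this bug; other cobbler
# versions may word it differently.
_MISSING_PROFILE = re.compile(
    r"system (\S+) references a missing profile ([^\s'\"]+)"
)


def parse_missing_profile(fault):
    """Return (system_name, profile_name) if the fault is a missing-profile error, else None."""
    match = _MISSING_PROFILE.search(fault.faultString)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    example = xmlrpc.client.Fault(
        1,
        "cobbler.cexceptions.CX:'system testing.example.com "
        "references a missing profile Fedora11-x86_64'",
    )
    print(parse_missing_profile(example))
```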
Maybe the race window is not so small as I thought. If my theory is right then it would happen if, at the moment of deleting a distro, there is some system which had used that distro and was in the middle of being powered on or provisioned. Given how busy Beaker can be, maybe this is not so unlikely?
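The theory can be illustrated with a toy in-memory model. To be clear, this is not cobbler's implementation and makes no claim about how cobbler stores handles: `ToyServer` and everything in it are made up. It only assumes that an open edit handle keeps its own copy of the system's fields, and that removing a profile cascades to saved records but not to open handles.

```python
class MissingProfile(Exception):
    """Raised when a system is saved against a profile that no longer exists."""


class ToyServer:
    """Toy stand-in for a handle-based API; NOT cobbler's real implementation."""

    def __init__(self, profiles):
        self.profiles = set(profiles)
        self.systems = {}   # saved records, keyed by system name
        self.handles = {}   # in-progress edits, keyed by handle
        self._next = 0

    def new_system(self):
        self._next += 1
        handle = "handle-%d" % self._next
        self.handles[handle] = {}
        return handle

    def modify_system(self, handle, key, value):
        self.handles[handle][key] = value

    def save_system(self, handle):
        record = dict(self.handles[handle])
        if record["profile"] not in self.profiles:
            raise MissingProfile(record["name"])
        self.systems[record["name"]] = record

    def remove_profile(self, name):
        self.profiles.discard(name)
        # Assumption: the cascade reaches saved records only; open handles
        # keep their stale copy of the profile name.
        self.systems = {
            n: r for n, r in self.systems.items() if r["profile"] != name
        }


def demo():
    """Replay the sequence from the transcript against the toy model."""
    server = ToyServer(["Fedora11-x86_64"])
    handle = server.new_system()
    server.modify_system(handle, "name", "testing.example.com")
    server.modify_system(handle, "profile", "Fedora11-x86_64")
    server.save_system(handle)

    # "In another window": the profile goes away and the saved record with it.
    server.remove_profile("Fedora11-x86_64")
    removed = "testing.example.com" not in server.systems

    # The stale handle is still usable, and saving it now fails.
    server.modify_system(handle, "power_address", "notexist.example.com")
    try:
        server.save_system(handle)
    except MissingProfile:
        return removed, True
    return removed, False
```

In this model the window is as long as any handle stays open across a profile removal, which is consistent with the guess that a busy Beaker instance could hit it more often than expected.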
What happens is that developers run a reserve task and it gets installed with a nightly tree (I think the nightly trees are only kept for 7 days, but that is a question for rel-eng). If they happen to install a nightly that is due to expire in the next 24 hours and keep the system reserved for several more days, then once that distro is deleted they can no longer use the webUI to reboot.
Look for the oldest nightly tree; it will have a .n extension (RHEL5.6-Server-20100927.n). That tree will be the next one to expire. Run a reserve task/provision and wait for that tree to be deleted. Then try clicking the reboot button. You will see the error.
Jeff, I tried reproducing on beaker-stage using those steps, but it didn't trigger the bug. The cobbler system record was removed (as expected) and when I tried to reboot the system Beaker recreated the system record with a dummy profile (as expected). bpeck pointed out that the cobbler version in production is different though, so maybe it behaves differently.
I'm dropping this from 0.5.59 because it's low priority and (so far as I know) users don't hit this very often.
As a workaround you should be able to provision the machine with a new distro (assuming you've manually reserved it). After that, power controls should work again.
Has this issue been resolved? I have not seen it in quite a while.
We haven't seen it happening lately.
This bug will finally be solved in Beaker 0.9.0 with the removal of Cobbler.
*** Bug 752330 has been marked as a duplicate of this bug. ***
Beaker 0.9.0 has been released.