Description of problem:
The configd is a Python console that retrieves a QMF object from wallaby and holds onto it while the daemon runs. Periodically, the configd calls obj.update() before performing operations. At some point the update call throws an exception, and every subsequent call to update throws an exception as well.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Can you provide more detail on the failure? What was the text of the exception raised? Was there a timeout? Did the underlying object on the agent disappear? Does this symptom coincide with any loss of connectivity (agent or console)?
Please note that calling object.update() causes a query request to be sent to the agent. The call blocks while waiting for a response. There are two legitimate error cases here:
1) The agent disconnects and does not respond, causing a timeout.
2) The object is deleted on the agent and the response contains no data. In this case, an exception is raised with "Underlying object no longer exists".
Your application code needs to handle these cases because they may occur in the normal course of operation.
I believe the cause of this was the agent deleting an object that a console was holding onto. I found a reproducer for the error case by reloading the wallaby database, which causes the object held by the condor_configd (a console) to be deleted. When the configd goes to update, it throws an exception. I can't say for sure this is the cause of the original issue, but it is likely this case or something similar.
In the scenario we reproduced, we received an exception with "Underlying object no longer exists", and the configd now properly handles the case.
The original error had no useful data to diagnose the issue, so I augmented the configd to print more useful information should it occur again.
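For illustration only, the kind of extra diagnostics meant here could look something like the following Python sketch. It is not the actual configd change; the logger name, the function update_with_diagnostics(), and its description argument are hypothetical. It records the exception type, its message, and the traceback using the standard logging module.

```python
import logging

# Hypothetical logger name; the real configd logging setup may differ.
log = logging.getLogger("configd_sketch")


def update_with_diagnostics(obj, description):
    """Call obj.update() and log full details if it fails.

    Returns True on success and False on failure, so the caller can
    decide how to recover.
    """
    try:
        obj.update()
        return True
    except Exception as exc:
        # log.exception() logs at ERROR level and appends the traceback.
        log.exception(
            "update() failed for %s: %s: %s",
            description,
            type(exc).__name__,
            exc,
        )
        return False
```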