Red Hat Bugzilla – Bug 626353
recipe stuck in waiting without watchdog kicking in
Last modified: 2011-03-24 09:17:26 EDT
This happened due to the LC misbehaving. The log on the machine says:
> 2010-08-13 16:33:52,771 ... <Fault 1: 'xmlrpclib.ProtocolError:<ProtocolError for beaker.engineering.redhat.com/client/: -1 >'>
The harness tried to carry on regardless, but that does not work.
Later, when trying to stop the task, the Scheduler says:
> <Fault 1: "bkr.server.bexceptions.BX:'recipe task 279973 was never started'">
The task is then scheduled to run again as the next one, with Waiting status.
- do not rely on server only to decide which task to run
- task_start and task_end need to be more robust
- report any failed XML-RPC calls on the console
Note: make sure not to get into an infinite loop! We do not want a single broken RPC to break the whole job.
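A minimal sketch of how the last point and the note could fit together: retry a failing call a bounded number of times and report each failure on the console. The name call_with_retry and all of its parameters are hypothetical and not taken from the harness; the attempt limit is an assumed way of ruling out the infinite loop.

```python
import sys
import time


def call_with_retry(name, func, args=(), max_attempts=5, delay=1.0,
                    console=sys.stderr, sleep=time.sleep):
    """Call func(*args), retrying on failure at most max_attempts times.

    Every failure is written to `console`. Returns (True, result) on
    success and (False, last_error) once the attempts are used up, so a
    single broken RPC can neither loop forever nor abort the caller.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return True, func(*args)
        except Exception as exc:  # e.g. an XML-RPC Fault or ProtocolError
            last_error = exc
            console.write("RPC %s failed (attempt %d/%d): %s\n"
                          % (name, attempt, max_attempts, exc))
            if attempt < max_attempts:
                sleep(delay)  # back off before the next attempt
    return False, last_error
```

With an XML-RPC proxy this might be used as call_with_retry("task_start", proxy.task_start, (task_id,)), where proxy and task_id are placeholders. Returning a flag instead of raising lets the caller decide whether the rest of the job can continue.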
Bill, I would like your opinion. I have repeating implemented, but...
Are XML-RPC failures as seen in Comment 2 ever expected?
If a call has correct parameters and the network is fine, shall I simply repeat the call until it passes and let the EWD kill the job if it does not?
Thinking about it, the harness should not repeat the real calls but use a ping-like call instead: Bug 636093
 The first one. The second one is a consequence I am trying to get rid of.
 Let's suppose it does - otherwise something is broken already and the task/recipe will be broken anyway.
 For cases of broken network setup I filed Bug 636080.
Created attachment 449887 [details]
repeating-proxy: per-call repeating
Created attachment 449888 [details]
repeat task_start - until it pass
Though it would work for task_start, this will require more sophisticated
1. if the call fails, use a ping call (as in Bug 636093) to determine whether the network is broken, and wait until service is restored. Repeat the ping until:
1.1 the original call succeeds: go on with the next calls.
1.2 the ping succeeds: retry the original call (2)
2. before repeating the call, try to get the original call's status:
2.1 e.g. for task_start/task_end use task_info
2.2 for task_result/upload_file repeat the call
3. the call must finish in finite time in the worst case and this must be
3.1 try to push new result
3.2 message on console if the result fails
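Steps 1 and 2 above could be sketched roughly as follows. This is an illustration under assumptions, not the harness code: robust_call, ping and already_done are hypothetical names, ping stands in for the call proposed in Bug 636093, and already_done stands in for a status check such as task_info. The deadline is one assumed reading of the "finite time" requirement in step 3; pushing a new result and printing a console message on failure are left to the caller.

```python
import time


def robust_call(call, ping, already_done=None, deadline=60.0, delay=1.0,
                clock=time.time, sleep=time.sleep):
    """Run call(); on failure wait for ping(), check status, then retry.

    call         -- zero-argument callable performing the real RPC
    ping         -- zero-argument callable that raises while service is down
    already_done -- optional zero-argument callable returning True if the
                    original call took effect on the server after all
    Returns True once the call is known to have succeeded, and False if
    the deadline passes first.
    """
    give_up_at = clock() + deadline
    while True:
        try:
            call()
            return True
        except Exception:
            pass  # fall through to recovery
        # Step 1: ping until the service answers or time runs out.
        while True:
            if clock() >= give_up_at:
                return False
            try:
                ping()
                break
            except Exception:
                sleep(delay)
        # Step 2: the failed call may still have reached the server.
        if already_done is not None:
            try:
                if already_done():
                    return True
            except Exception:
                pass  # status unknown; fall back to repeating the call
        sleep(delay)  # avoid hammering the server between retries
```

For task_start or task_end the caller would pass a status check as already_done; for task_result or upload_file it would be left as None so the call is simply repeated.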
Thanks to Bill for kicking me off.
As this is not a high-priority bug fix and I won't be able to provide and test the golden-grail solution for 0.5.58, I am pushing this ahead too.
...and once more.
Ping - Is this still an issue?
Risk at least.
I haven't seen this issue for a long time. Won't mind closing this as fixed. Thanks.
closing, if seen again please re-open.