This happened due to the LC misbehaving. The log on the machine says:

> 2010-08-13 16:33:52,771 ... <Fault 1: 'xmlrpclib.ProtocolError:<ProtocolError for beaker.engineering.redhat.com/client/: -1 >'>

The harness tried to carry on regardless, but that does not work. Later, when trying to stop the task, the Scheduler says:

> <Fault 1: "bkr.server.bexceptions.BX:'recipe task 279973 was never started'">

The task is then scheduled to run again as the next one, with Waiting status.

Resolution:
- do not rely on the server alone to decide which task to run
- task_start and task_end need to be more robust
- report any failed XML-RPCs on the console

Note: make sure not to get into an infinite loop! We do not want a single broken RPC to break the whole job.
Bill, I seek your opinion: I have the repeating implemented, but...

Are XML-RPC failures as seen in Comment 2 ever expected? [1]

If a call has correct parameters [2] and the network is fine, shall I simply repeat the call until it passes and let EWD kill the job if it does not? [3]

Thinking about it, the harness should not repeat the real calls, but use a ping-like call instead: Bug 636093

[1] The first one. The second one is a consequence I am trying to get rid of.
[2] Let's suppose it does - otherwise something is broken already and the task/recipe will be broken anyway.
[3] For cases of broken network setup I filed Bug 636080.
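For illustration only, the "simply repeat the call until it passes" option could look roughly like the sketch below. It is not the code from the attachments. The name `repeat_call` and its parameters are made up for this example, and it is written against Python 3 rather than the `xmlrpclib` seen in the log. The attempt limit is there so that a single broken RPC cannot loop forever, and every failure is printed so it shows up on the console.

```python
import sys
import time


def repeat_call(call, attempts=5, delay=1.0, sleep=time.sleep, log=sys.stderr):
    """Run call() until it succeeds or the attempts are used up.

    `call` is any zero-argument callable wrapping one RPC (hypothetical;
    e.g. a lambda around a proxy method). Every failure is written to
    `log`; the last error is re-raised once the attempts are exhausted.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception as exc:  # e.g. a protocol error or a server fault
            last_error = exc
            print("RPC attempt %d/%d failed: %r" % (attempt, attempts, exc),
                  file=log)
            if attempt < attempts:
                sleep(delay)
    raise last_error
```

Note that this repeats the real call, which is only safe when a repeated call does no harm on the server side.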
Created attachment 449887 [details] repeating-proxy: per-call repeating
Created attachment 449888 [details] repeat task_start - until it pass
Though it would work for task_start, this will require a more sophisticated approach:

1. if the call fails, use a ping call (as in Bug 636093) to determine whether the net is broken, and wait until service is restored. Repeat the ping until:
   1.1 the original call succeeds: go on with the next calls.
   1.2 the ping succeeds: retry the original call (2)
2. before repeating the call, try to get the original call's status:
   2.1 e.g. for task_start/task_end use task_info
   2.2 for task_result/upload_file repeat the call
3. the call must finish in finite time in the worst case, and this must be reported:
   3.1 try to push a new result
   3.2 message on console if the result fails

Thanks to Bill for kicking me off.
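A rough sketch of how the steps above might fit together, under stated assumptions: `robust_call`, `ping`, `already_done` and `report` are hypothetical names, not the harness's actual API. `ping` stands in for the ping-like call from Bug 636093, and `already_done` for a status query such as task_info. The clock and sleep functions are parameters only so that the loop can be exercised without real waiting.

```python
import time


def robust_call(call, ping, already_done=None, timeout=60.0, delay=1.0,
                clock=time.monotonic, sleep=time.sleep, report=print):
    """Retry call() with ping-based recovery, bounded by `timeout` seconds."""
    deadline = clock() + timeout
    while True:
        try:
            return call()
        except Exception as exc:
            report("RPC failed: %r" % (exc,))
        # Step 1: ping until the service answers again, or give up (step 3).
        while True:
            if clock() >= deadline:
                report("giving up: RPC did not succeed in time")
                raise TimeoutError("RPC did not succeed within %s s" % timeout)
            sleep(delay)
            try:
                ping()
            except Exception:
                continue  # service still down, keep pinging
            break
        # Step 2: the failed call may have taken effect anyway; check first.
        try:
            done = already_done is not None and already_done()
        except Exception:
            done = False  # status unknown, fall through and repeat the call
        if done:
            return None
```

For task_start/task_end one would pass a status check as `already_done`. For task_result/upload_file it would be left as `None` so that the call is simply repeated.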
As this is not a high-priority bug fix, and I won't be able to provide and test the golden-grail solution for 0.5.58, I am pushing this ahead too.
...and once more.
Ping - Is this still an issue?
Risk at least.
I haven't seen this issue for a long time. Won't mind closing this as fixed. Thanks.
closing, if seen again please re-open.