Bug 626353 - recipe stuck in waiting without watchdog kicking in
Product: Beaker
Classification: Community
Component: beah
Hardware: All Linux
Priority: medium  Severity: medium
Target Milestone: future_maint
Assigned To: Marian Csontos
Depends On:
Blocks: 632609
Reported: 2010-08-23 06:11 EDT by Ales Zelinka
Modified: 2011-03-24 09:17 EDT (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2011-03-24 09:17:26 EDT

Attachments
repeating-proxy: per-call repeating (9.29 KB, patch)
2010-09-27 09:22 EDT, Marian Csontos
repeat task_start - until it pass (2.07 KB, patch)
2010-09-27 09:22 EDT, Marian Csontos

Comment 2 Marian Csontos 2010-08-25 01:34:47 EDT
This happened because the LC was misbehaving. The log on the machine says:

> 2010-08-13 16:33:52,771 ... <Fault 1: 'xmlrpclib.ProtocolError:<ProtocolError for beaker.engineering.redhat.com/client/: -1 >'>

The harness tried to carry on despite the failure, but that does not work.
Later, when it tries to stop the task, the Scheduler says:

>  <Fault 1: "bkr.server.bexceptions.BX:'recipe task 279973 was never started'">

The task is then scheduled to run again as the next one, with Waiting status.

- do not rely on the server alone to decide which task to run
- task_start and task_end need to be more robust
- report any failed XML-RPC calls on the console

Note: And make sure not to get into an infinite loop! We do not want a single broken RPC breaking the whole job.
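
A minimal sketch of what a more robust call could look like, assuming Python 3's xmlrpc.client in place of the xmlrpclib seen in the log. The helper name call_with_retry, its parameters and the choice of which errors count as transient are illustrative assumptions, not beah's actual code. Every failure is reported and the number of attempts is bounded, so one broken RPC cannot loop forever:

```python
import sys
import time
import xmlrpc.client

# Errors treated as transient in this sketch (an assumption, not beah's policy).
RETRYABLE = (xmlrpc.client.ProtocolError, xmlrpc.client.Fault, OSError)


def call_with_retry(call, max_attempts=5, delay=1.0,
                    sleep=time.sleep, log=sys.stderr):
    """Invoke call() up to max_attempts times; re-raise the last error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RETRYABLE as exc:
            # Report every failed RPC so it is visible on the console.
            print("RPC failed (attempt %d/%d): %r"
                  % (attempt, max_attempts, exc), file=log)
            if attempt == max_attempts:
                raise  # bounded: never loop forever on a broken RPC
            sleep(delay)
```

A caller would wrap the real call in a function, for example call_with_retry(lambda: proxy.task_start(task_id)), where proxy and task_start stand in for whatever the harness actually uses.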
Comment 3 Marian Csontos 2010-09-21 10:07:40 EDT
Bill, I seek your opinion: I have repeating implemented but...

Are XML-RPC failures as seen in Comment 2 ever expected? [1]

If a call has correct parameters [2] and the network is fine, shall I simply repeat the call until it passes and let EWD kill the job if it does not? [3]

Thinking about it, the harness should not repeat the real calls but use a ping-like call instead: Bug 636093

[1] The first one. The second one is a consequence I am trying to get rid of.

[2] Let's suppose it does - otherwise something is broken already and the task/recipe will be broken anyway.

[3] For cases of broken network setup I filed Bug 636080.
Comment 4 Marian Csontos 2010-09-27 09:22:02 EDT
Created attachment 449887
repeating-proxy: per-call repeating
Comment 5 Marian Csontos 2010-09-27 09:22:51 EDT
Created attachment 449888
repeat task_start - until it pass
Comment 6 Marian Csontos 2010-09-27 14:49:51 EDT
Though it would work for task_start, this will require a more sophisticated approach:

1. if the call fails, use ping call (as in Bug 636093) to determine if the net
   is broken and wait until service is restored.  Repeat ping until:
1.1 the original call succeeds: go on with next calls.
1.2 the ping succeeds: retry original call (2)
2. before repeating the call, try to get the original call's status:
2.1 e.g. for task_start/task_end use task_info
2.2 for task_result/upload_file repeat the call
3. the call must finish in finite time in the worst case and this must be
3.1 try to push new result
3.2 message on console if the result fails
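
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: robust_call, ping and already_done are hypothetical names, with ping standing in for the ping-like call of Bug 636093 and already_done for a task_info-style status check. Step 3.1 (pushing a new result) is not covered here.

```python
import sys
import time


def robust_call(call, ping, already_done=None, deadline=60.0, delay=1.0,
                clock=time.monotonic, sleep=time.sleep):
    """Run call(); on failure wait until ping() works, then retry.

    already_done is an optional callable returning True if an earlier
    attempt did take effect on the server (step 2.1); with None the call
    is simply repeated (step 2.2). Returns False once the deadline has
    passed, so the whole thing finishes in finite time (step 3).
    """
    give_up_at = clock() + deadline
    while True:
        try:
            call()
            return True
        except Exception as exc:  # deliberately broad in this sketch
            print("call failed: %r" % (exc,), file=sys.stderr)  # step 3.2
        # Step 1: ping until the service is back or time runs out.
        while True:
            if clock() >= give_up_at:
                print("giving up: deadline reached", file=sys.stderr)
                return False
            sleep(delay)
            try:
                ping()
                break
            except Exception:
                pass
        # Step 2: before repeating, check whether the call got through.
        if already_done is not None:
            try:
                if already_done():
                    return True
            except Exception:
                pass  # status unknown: fall through and repeat the call
```

The clock and sleep parameters exist only so the loop can be exercised without real waiting. In the harness the deadline would presumably be tied to whatever the watchdog allows.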

Thanks to Bill for kicking me off.
Comment 7 Marian Csontos 2010-09-27 14:55:13 EDT
As this is not a high-priority bug fix and I won't be able to provide and test the golden-grail solution for 0.5.58, I am pushing this ahead too.
Comment 8 Marian Csontos 2010-10-06 01:41:58 EDT
...and once more.
Comment 9 Bill Peck 2011-03-23 17:21:46 EDT
Ping - Is this still an issue?
Comment 10 Marian Csontos 2011-03-24 03:31:05 EDT
It is still a risk, at least.
Comment 11 Ales Zelinka 2011-03-24 07:02:22 EDT
I haven't seen this issue for a long time. Won't mind closing this as fixed. Thanks.
Comment 12 Bill Peck 2011-03-24 09:17:26 EDT
Closing; if seen again, please re-open.