Bug 626353 - recipe stuck in waiting without watchdog kicking in
Summary: recipe stuck in waiting without watchdog kicking in
Alias: None
Product: Beaker
Classification: Community
Component: beah
Version: 0.5
Hardware: All Linux
medium vote
Target Milestone: future_maint
Assignee: Marian Csontos
QA Contact:
Depends On:
Blocks: 632609
Reported: 2010-08-23 10:11 UTC by Ales Zelinka
Modified: 2011-03-24 13:17 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2011-03-24 13:17:26 UTC

Attachments
repeating-proxy: per-call repeating (9.29 KB, patch)
2010-09-27 13:22 UTC, Marian Csontos
repeat task_start - until it pass (2.07 KB, patch)
2010-09-27 13:22 UTC, Marian Csontos

Comment 2 Marian Csontos 2010-08-25 05:34:47 UTC
This happened because the LC misbehaved. The log on the machine says:

> 2010-08-13 16:33:52,771 ... <Fault 1: 'xmlrpclib.ProtocolError:<ProtocolError for beaker.engineering.redhat.com/client/: -1 >'>

The harness tried to carry on despite the failure, but that does not work.
Later, when it tries to stop the task, the Scheduler says:

>  <Fault 1: "bkr.server.bexceptions.BX:'recipe task 279973 was never started'">

The task is then scheduled to run again as the next one, with Waiting status.

- do not rely on the server alone to decide which task to run
- task_start and task_end need to be more robust
- report any failed XML-RPC calls on the console

Note: Make sure not to get into an infinite loop! We do not want a single broken RPC to break the whole job.
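
A minimal sketch of the last two points and the note above, not the harness's actual code: a wrapper that repeats a failing call a bounded number of times and reports every failure on the console. The name `call_with_retry` and all of its parameters are hypothetical.

```python
import sys
import time


def call_with_retry(call, attempts=5, delay=1.0, sleep=time.sleep,
                    console=sys.stderr):
    """Call `call()` up to `attempts` times (hypothetical helper).

    Every failure is written to `console`. Once the attempts are used up
    the last error is re-raised, so a single broken RPC cannot loop forever.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception as error:  # e.g. an XML-RPC Fault or ProtocolError
            last_error = error
            console.write("RPC failed (attempt %d of %d): %r\n"
                          % (attempt, attempts, error))
            if attempt < attempts:
                sleep(delay)  # wait before the next attempt
    raise last_error
```

A caller would pass the real RPC as a closure, for example `call_with_retry(lambda: proxy.task_start(task_id))`, where `proxy` and `task_id` are placeholders for whatever the harness uses.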

Comment 3 Marian Csontos 2010-09-21 14:07:40 UTC
Bill, I seek your opinion: I have repeating implemented but...

Are XML-RPC failures as seen in Comment 2 ever expected? [1]

If a call has correct parameters [2] and the network is fine, should I simply repeat the call until it passes and let EWD kill the job if it does not? [3]

Thinking about it, the harness should not repeat the real calls but use a ping-like call instead: Bug 636093

[1] The first one. The second one is a consequence I am trying to get rid of.

[2] Let's suppose it does - otherwise something is broken already and the task/recipe will be broken anyway.

[3] For cases of broken network setup I filed Bug 636080.

Comment 4 Marian Csontos 2010-09-27 13:22:02 UTC
Created attachment 449887
repeating-proxy: per-call repeating

Comment 5 Marian Csontos 2010-09-27 13:22:51 UTC
Created attachment 449888
repeat task_start - until it pass

Comment 6 Marian Csontos 2010-09-27 18:49:51 UTC
Though it would work for task_start, this will require a more sophisticated approach:

1. if the call fails, use ping call (as in Bug 636093) to determine if the net
   is broken and wait until service is restored.  Repeat ping until:
1.1 the original call succeeds: go on with next calls.
1.2 the ping succeeds: retry original call (2)
2. before repeating the call, try to get the original call's status:
2.1 e.g. for task_start/task_end use task_info
2.2 for task_result/upload_file repeat the call
3. the call must finish in finite time in the worst case and this must be
3.1 try to push new result
3.2 message on console if the result fails
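
The steps above could be sketched roughly as follows. This is an illustration under assumptions, not the attached patch: `robust_call`, `ping`, `already_done` and the timing parameters are hypothetical names, and `already_done` stands in for a status check such as task_info.

```python
import time


def robust_call(call, ping, already_done=None, timeout=60.0, delay=1.0,
                clock=time.monotonic, sleep=time.sleep):
    """Run `call()`; on failure wait for `ping()` to succeed, then retry.

    `already_done`, if given, is asked before each retry whether the
    original call took effect on the server after all (step 2.1). Leave it
    as None for calls that are simply repeated (step 2.2).
    Returns the call's result, or None if `already_done()` reported success.
    Raises TimeoutError once `timeout` seconds have passed (step 3).
    """
    deadline = clock() + timeout
    while True:
        try:
            return call()
        except Exception:
            pass  # fall through to the recovery steps below
        # Step 1: ping until the service answers again or time runs out.
        while True:
            if clock() >= deadline:
                raise TimeoutError("call did not succeed within %s s" % timeout)
            try:
                ping()
                break
            except Exception:
                sleep(delay)
        # Step 2: the failed call may have reached the server anyway.
        if already_done is not None and already_done():
            return None
        sleep(delay)  # avoid a busy loop when ping works but the call fails
```

The deadline is checked on every pass through the ping loop, so the function finishes in finite time even when the ping succeeds but the original call keeps failing.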

Thanks to Bill for kicking me off.

Comment 7 Marian Csontos 2010-09-27 18:55:13 UTC
As this is not a high-priority bug fix and I won't be able to provide and test the golden-grail solution for 0.5.58, I am pushing this ahead too.

Comment 8 Marian Csontos 2010-10-06 05:41:58 UTC
...and once more.

Comment 9 Bill Peck 2011-03-23 21:21:46 UTC
Ping - Is this still an issue?

Comment 10 Marian Csontos 2011-03-24 07:31:05 UTC
Risk at least.

Comment 11 Ales Zelinka 2011-03-24 11:02:22 UTC
I haven't seen this issue for a long time. Won't mind closing this as fixed. Thanks.

Comment 12 Bill Peck 2011-03-24 13:17:26 UTC
Closing; if seen again, please re-open.
