Bug 626353 - recipe stuck in waiting without watchdog kicking in
Summary: recipe stuck in waiting without watchdog kicking in
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Beaker
Classification: Community
Component: beah
Version: 0.5
Hardware: All Linux
Priority: medium
Severity: medium
Target Milestone: future_maint
Assignee: Marian Csontos
QA Contact:
URL:
Whiteboard:
Keywords:
Depends On:
Blocks: 632609
Reported: 2010-08-23 10:11 UTC by Ales Zelinka
Modified: 2011-03-24 13:17 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-03-24 13:17:26 UTC


Attachments
repeating-proxy: per-call repeating (9.29 KB, patch)
2010-09-27 13:22 UTC, Marian Csontos
repeat task_start - until it pass (2.07 KB, patch)
2010-09-27 13:22 UTC, Marian Csontos

Comment 2 Marian Csontos 2010-08-25 05:34:47 UTC
This happened because the lab controller (LC) misbehaved. The log on the machine says:

> 2010-08-13 16:33:52,771 ... <Fault 1: 'xmlrpclib.ProtocolError:<ProtocolError for beaker.engineering.redhat.com/client/: -1 >'>

The harness tried to carry on regardless, but that does not work.
Later, when it tries to stop the task, the Scheduler says:

>  <Fault 1: "bkr.server.bexceptions.BX:'recipe task 279973 was never started'">

The task is then scheduled to run again as the next one, with Waiting status.

Resolution:
- do not rely on the server alone to decide which task to run
- task_start and task_end need to be more robust
- report any failed XML-RPC calls on the console

Note: Make sure not to get into an infinite loop! We do not want a single broken RPC to break the whole job.
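
A minimal sketch of the last two points and the note, assuming nothing about the actual beah code: a failed call is reported on the console and retried, but only a bounded number of times, so a single broken RPC cannot loop forever. The name call_with_retry and all its parameters are hypothetical, chosen for illustration only.

```python
import sys
import time


def call_with_retry(call, attempts=5, delay=1.0,
                    sleep=time.sleep, console=sys.stderr):
    """Run call() up to `attempts` times, reporting each failure.

    `call` is any zero-argument callable wrapping an RPC (hypothetical).
    The last error is re-raised once the attempts are exhausted, so the
    caller never ends up in an infinite loop.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception as exc:  # a real harness would catch RPC errors only
            last_error = exc
            console.write("RPC failed (attempt %d/%d): %r\n"
                          % (attempt, attempts, exc))
            if attempt < attempts:
                sleep(delay)
    raise last_error
```

The sleep and console arguments are injectable only so the helper can be exercised without real delays or a real console.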

Comment 3 Marian Csontos 2010-09-21 14:07:40 UTC
Bill, I seek your opinion: I have repeating implemented but...

Are XML-RPC failures as seen in Comment 2 ever expected? [1]

If a call has correct parameters [2] and the network is fine, shall I simply repeat the call until it passes and let the EWD kill the job if it does not? [3]

Thinking about it, the harness should not repeat the real calls but use a ping-like call instead: Bug 636093

[1] The first one. The second one is a consequence I am trying to get rid of.

[2] Let's suppose it does - otherwise something is broken already and the task/recipe will be broken anyway.

[3] For cases of broken network setup I filed Bug 636080.

Comment 4 Marian Csontos 2010-09-27 13:22:02 UTC
Created attachment 449887 [details]
repeating-proxy: per-call repeating

Comment 5 Marian Csontos 2010-09-27 13:22:51 UTC
Created attachment 449888 [details]
repeat task_start - until it pass

Comment 6 Marian Csontos 2010-09-27 18:49:51 UTC
Though it would work for task_start, the general case will require a more
sophisticated approach:

1. if the call fails, use a ping call (as in Bug 636093) to determine whether
   the net is broken and wait until service is restored.  Repeat the ping until:
1.1 the original call succeeds: go on with the next calls.
1.2 the ping succeeds: retry the original call (see 2)
2. before repeating the call, try to get the original call's status:
2.1 e.g. for task_start/task_end use task_info
2.2 for task_result/upload_file repeat the call
3. the call must finish in finite time in the worst case, and that must be
   reported:
3.1 try to push a new result
3.2 print a message on the console if pushing the result fails
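
The steps above could be sketched roughly as follows. This is one reading of the plan, not the code from the attached patches. robust_call, call, ping and already_done are hypothetical names: ping stands for the ping-like call of Bug 636093, and already_done for a status check built on something like task_info.

```python
import sys
import time


def robust_call(call, ping, already_done=None, deadline=60.0, delay=1.0,
                clock=time.monotonic, sleep=time.sleep, console=sys.stderr):
    """Retry call() for at most `deadline` seconds.

    call         -- the original RPC as a zero-argument callable
    ping         -- cheap RPC that raises while the server is unreachable
    already_done -- optional check returning True if an earlier attempt
                    took effect on the server (step 2.1); pass None for
                    calls that are simply repeated (step 2.2)
    Returns call()'s result, or None if already_done() confirmed it.
    """
    start = clock()

    def expired():
        return clock() - start >= deadline

    while not expired():
        try:
            return call()                      # 1.1: the call succeeded
        except Exception as exc:
            console.write("RPC failed: %r\n" % (exc,))
        # Step 1: ping until the service answers or time runs out.
        while not expired():
            sleep(delay)
            try:
                ping()
                break                          # 1.2: service is back
            except Exception:
                pass
        else:
            break                              # deadline hit while pinging
        # Step 2: ask for the original call's status before repeating it.
        try:
            if already_done is not None and already_done():
                return None
        except Exception:
            pass                               # status unknown: retry the call
    # Step 3: finite time in the worst case, and the failure is reported.
    console.write("giving up after %.0f seconds\n" % deadline)
    raise TimeoutError("RPC did not succeed before the deadline")
```

Pushing a new result on failure (3.1) is left to the caller here, since it depends on which call failed.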

Thanks to Bill for kicking me off.

Comment 7 Marian Csontos 2010-09-27 18:55:13 UTC
As this is not a high-priority bug fix and I won't be able to provide and test the holy-grail solution for 0.5.58, I am pushing this one ahead too.

Comment 8 Marian Csontos 2010-10-06 05:41:58 UTC
...and once more.

Comment 9 Bill Peck 2011-03-23 21:21:46 UTC
Ping - Is this still an issue?

Comment 10 Marian Csontos 2011-03-24 07:31:05 UTC
It is still a risk, at least.

Comment 11 Ales Zelinka 2011-03-24 11:02:22 UTC
I haven't seen this issue for a long time. Won't mind closing this as fixed. Thanks.

Comment 12 Bill Peck 2011-03-24 13:17:26 UTC
Closing. If seen again, please re-open.

