Bug 626353 - recipe stuck in waiting without watchdog kicking in
Status: CLOSED CURRENTRELEASE
Product: Beaker
Classification: Community
Component: beah
Version: 0.5
Hardware: All Linux
Priority: medium
Severity: medium
Target Milestone: future_maint
Assigned To: Marian Csontos
Depends On:
Blocks: 632609
Reported: 2010-08-23 06:11 EDT by Ales Zelinka
Modified: 2011-03-24 09:17 EDT (History)

Doc Type: Bug Fix
Last Closed: 2011-03-24 09:17:26 EDT

Attachments
repeating-proxy: per-call repeating (9.29 KB, patch)
2010-09-27 09:22 EDT, Marian Csontos
repeat task_start - until it pass (2.07 KB, patch)
2010-09-27 09:22 EDT, Marian Csontos

Comment 2 Marian Csontos 2010-08-25 01:34:47 EDT
This happened due to LC misbehaving. Log on the machine says:

> 2010-08-13 16:33:52,771 ... <Fault 1: 'xmlrpclib.ProtocolError:<ProtocolError for beaker.engineering.redhat.com/client/: -1 >'>

The harness tried to get away with it, but that does not work.
Later, when trying to stop the task, the Scheduler says:

>  <Fault 1: "bkr.server.bexceptions.BX:'recipe task 279973 was never started'">

The task is then scheduled to run again as the next one, with Waiting status.

Resolution:
- do not rely on the server alone to decide which task to run
- task_start and task_end need to be more robust
- report any failed XML-RPC calls on the console

Note: Make sure not to get into an infinite loop! We do not want a single broken RPC breaking the whole job.
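
The resolution points above (a sturdier task_start/task_end, failures reported on the console, no infinite loop) could be sketched roughly as follows. This is only an illustration, not the harness code: `call_with_retries` and its parameters are hypothetical names, and it uses Python 3's `xmlrpc.client` rather than the `xmlrpclib` module seen in the log.

```python
import sys
import time
import xmlrpc.client


def call_with_retries(call, max_attempts=5, delay=1.0,
                      sleep=time.sleep, console=sys.stderr):
    """Invoke call() at most max_attempts times (hypothetical helper).

    Every failure is written to the console. Returns (True, result) on
    success, or (False, last_error) once the attempt budget is used up,
    so a single broken RPC cannot loop forever.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return True, call()
        except (xmlrpc.client.Fault, xmlrpc.client.ProtocolError, OSError) as exc:
            last_error = exc
            console.write("XML-RPC attempt %d/%d failed: %r\n"
                          % (attempt, max_attempts, exc))
            if attempt < max_attempts:
                sleep(delay)  # pause before the next attempt
    return False, last_error
```

The caller gets an explicit failure back instead of an exception or an endless loop, so it can decide whether to abort the task or carry on with the next one.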
Comment 3 Marian Csontos 2010-09-21 10:07:40 EDT
Bill, I seek your opinion: I have repeating implemented but...

Are XML-RPC failures as seen in Comment 2 ever expected? [1]

If a call has correct parameters [2] and the network is fine, shall I simply repeat the call until it passes and let EWD kill the job if it does not? [3]

Thinking about it, harness should not repeat the real calls, but instead of them use a ping-like call: Bug 636093

[1] The first one. The second one is a consequence I am trying to get rid of.

[2] Let's suppose it does - otherwise something is broken already and the task/recipe will be broken anyway.

[3] For cases of broken network setup I filed Bug 636080.
Comment 4 Marian Csontos 2010-09-27 09:22:02 EDT
Created attachment 449887 [details]
repeating-proxy: per-call repeating
Comment 5 Marian Csontos 2010-09-27 09:22:51 EDT
Created attachment 449888 [details]
repeat task_start - until it pass
Comment 6 Marian Csontos 2010-09-27 14:49:51 EDT
Though it would work for task_start, this will require a more sophisticated
approach:

1. if the call fails, use a ping call (as in Bug 636093) to determine if the net
   is broken and wait until service is restored.  Repeat ping until:
1.1 the original call succeeds: go on with next calls.
1.2 the ping succeeds: retry original call (2)
2. before repeating the call, try to get the original call's status:
2.1 e.g. for task_start/task_end use task_info
2.2 for task_result/upload_file repeat the call
3. the call must finish in finite time in the worst case and this must be
   reported:
3.1 try to push new result
3.2 message on console if the result fails

Thanks to Bill for kicking me off.
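
A minimal sketch of the scheme outlined in this comment, under stated assumptions: `robust_call`, `GaveUp`, `ping` and `check_done` are hypothetical names, with `check_done` standing in for a status query such as task_info. Failed calls and failed pings share one attempt budget, so the worst case is finite, and every failure is reported.

```python
import time


class GaveUp(Exception):
    """Raised when the shared attempt budget is exhausted (hypothetical)."""


def robust_call(call, ping, check_done=None, max_attempts=10,
                delay=1.0, sleep=time.sleep, report=print):
    """Retry call() following the ping-then-check scheme.

    call, ping and check_done are caller-supplied callables. Returns the
    result of call(), or None if check_done() says the original call had
    already taken effect. Raises GaveUp after max_attempts failures.
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        try:
            return call()
        except Exception as exc:
            report("call failed (attempt %d): %r" % (attempts, exc))
        # Step 1: ping until the service answers, using the same budget.
        while attempts < max_attempts:
            try:
                ping()
                break
            except Exception as exc:
                attempts += 1
                report("ping failed: %r" % (exc,))
                sleep(delay)
        else:
            break  # budget used up while waiting for the service
        # Step 2: before repeating, ask whether the call already took effect.
        if check_done is not None:
            try:
                if check_done():
                    return None
            except Exception as exc:
                report("status check failed: %r" % (exc,))
    # Step 3: finite worst case, and it is reported.
    report("giving up after %d attempts" % attempts)
    raise GaveUp("no success within %d attempts" % max_attempts)
```

For calls like task_result or upload_file, where the comment suggests simply repeating, `check_done` would be left as `None`.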
Comment 7 Marian Csontos 2010-09-27 14:55:13 EDT
As this is not a high-priority bug fix, and I won't be able to provide and test the golden-grail solution for 0.5.58, I am pushing this ahead too.
Comment 8 Marian Csontos 2010-10-06 01:41:58 EDT
...and once more.
Comment 9 Bill Peck 2011-03-23 17:21:46 EDT
Ping - Is this still an issue?
Comment 10 Marian Csontos 2011-03-24 03:31:05 EDT
Risk at least.
Comment 11 Ales Zelinka 2011-03-24 07:02:22 EDT
I haven't seen this issue for a long time. Won't mind closing this as fixed. Thanks.
Comment 12 Bill Peck 2011-03-24 09:17:26 EDT
Closing; if seen again, please re-open.
