Bug 626353

Summary: recipe stuck in waiting without watchdog kicking in
Product: [Retired] Beaker
Reporter: Ales Zelinka <azelinka>
Component: beah
Assignee: Marian Csontos <mcsontos>
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Version: 0.5
CC: bpeck, dcallagh, kbaker, mcsontos, rmancy
Target Milestone: future_maint   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2011-03-24 13:17:26 UTC
Bug Blocks: 632609    
Attachments:
- repeating-proxy: per-call repeating (flags: none)
- repeat task_start - until it pass (flags: none)

Comment 2 Marian Csontos 2010-08-25 05:34:47 UTC
This happened because the LC (lab controller) misbehaved. The log on the machine says:

> 2010-08-13 16:33:52,771 ... <Fault 1: 'xmlrpclib.ProtocolError:<ProtocolError for beaker.engineering.redhat.com/client/: -1 >'>

The harness tried to carry on regardless, but that does not work.
Later, when it tries to stop the task, the Scheduler says:

>  <Fault 1: "bkr.server.bexceptions.BX:'recipe task 279973 was never started'">

The task is then scheduled to run again as the next one, with Waiting status.

Resolution:
- do not rely on the server alone to decide which task to run
- task_start and task_end need to be more robust
- report any failed XML-RPC calls on the console

Note: Make sure not to get into an infinite loop! We do not want a single broken RPC to break the whole job.
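
A minimal sketch of the resolution points above (retry a failed call, report every failure on the console, never loop forever). It assumes Python 3's xmlrpc.client; the log above shows the Python 2 name xmlrpclib. The function call_with_retry and all of its parameters are hypothetical names used for illustration only; they are not part of beah.

```python
import sys
import time
import xmlrpc.client


def call_with_retry(call, attempts=5, delay=1.0,
                    sleep=time.sleep, console=sys.stderr):
    """Run call() at most `attempts` times and report each failure.

    `call` is a zero-argument callable wrapping one XML-RPC request.
    The attempt limit is what prevents an infinite loop.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except (xmlrpc.client.Fault,
                xmlrpc.client.ProtocolError,
                OSError) as exc:
            last_error = exc
            # Report every failed XML-RPC on the console.
            print("XML-RPC call failed (attempt %d/%d): %r"
                  % (attempt, attempts, exc), file=console)
            if attempt < attempts:
                sleep(delay)
    # All attempts used up: give the caller the last error.
    raise last_error
```

With a hypothetical proxy object this could be used as call_with_retry(lambda: proxy.task_start(task_id)). Note that blindly repeating task_start or task_end is only safe if the server tolerates a duplicate request.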

Comment 3 Marian Csontos 2010-09-21 14:07:40 UTC
Bill, I seek your opinion: I have repeating implemented but...

Are XML-RPC failures as seen in Comment 2 ever expected? [1]

If a call has correct parameters [2] and the network is fine, shall I simply repeat the call until it passes and let EWD kill the job if it does not? [3]

Thinking about it, the harness should not repeat the real calls but use a ping-like call instead: Bug 636093

[1] The first one. The second one is a consequence I am trying to get rid of.

[2] Let's suppose it has - otherwise something is broken already and the task/recipe will be broken anyway.

[3] For cases of a broken network setup I filed Bug 636080.

Comment 4 Marian Csontos 2010-09-27 13:22:02 UTC
Created attachment 449887
repeating-proxy: per-call repeating

Comment 5 Marian Csontos 2010-09-27 13:22:51 UTC
Created attachment 449888
repeat task_start - until it pass

Comment 6 Marian Csontos 2010-09-27 18:49:51 UTC
Though it would work for task_start, this will require a more sophisticated
approach:

1. if the call fails, use a ping call (as in Bug 636093) to determine whether
   the network is broken, and wait until service is restored.  Repeat the ping until:
1.1 the original call succeeds: go on with the next calls.
1.2 the ping succeeds: retry the original call (2)
2. before repeating the call, try to get the original call's status:
2.1 e.g. for task_start/task_end use task_info
2.2 for task_result/upload_file repeat the call
3. the call must finish in finite time even in the worst case, and this must be
   reported:
3.1 try to push a new result
3.2 print a message on the console if pushing the result fails
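
The steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the implementation in the attachments: recover_call, ping, already_done and the other parameter names are all hypothetical, the calls are passed in as plain Python callables so that no particular server API is assumed, and step 3.1 (pushing a new result) is left to the caller.

```python
import time


def recover_call(call, ping, already_done=None, deadline=60.0, delay=1.0,
                 clock=time.monotonic, sleep=time.sleep, report=print):
    """Retry `call` for at most `deadline` seconds.

    call         -- zero-argument callable performing the original RPC
    ping         -- zero-argument callable; raises while the server is down
    already_done -- optional callable returning True if the original call
                    did take effect (step 2.1, e.g. backed by task_info);
                    None means the call is simply repeated (step 2.2)
    Returns the call's result, or None if already_done() reported success.
    """
    start = clock()
    try:
        return call()
    except Exception as exc:  # a real harness would catch narrower errors
        last_error = exc
    while clock() - start < deadline:  # step 3: finish in finite time
        try:
            ping()  # step 1: wait until service is restored
        except Exception as exc:
            last_error = exc
            sleep(delay)
            continue
        try:
            # Step 2: check the original call's status before repeating it.
            if already_done is not None and already_done():
                return None
            return call()  # step 1.2: ping succeeded, retry the call
        except Exception as exc:
            last_error = exc
            sleep(delay)
    # Step 3.2: report on the console before giving up.
    report("giving up after %.0f s: %r" % (deadline, last_error))
    raise last_error
```

The status check matters for task_start and task_end: if the first request reached the server but the reply was lost, repeating it blindly could produce a second error, whereas task_result and upload_file are simply repeated.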

Thanks to Bill for kicking me off.

Comment 7 Marian Csontos 2010-09-27 18:55:13 UTC
As this is not a high-priority bug fix, and I won't be able to provide and test the golden-grail solution for 0.5.58, I am pushing this one ahead too.

Comment 8 Marian Csontos 2010-10-06 05:41:58 UTC
...and once more.

Comment 9 Bill Peck 2011-03-23 21:21:46 UTC
Ping - Is this still an issue?

Comment 10 Marian Csontos 2011-03-24 07:31:05 UTC
It is still a risk, at least.

Comment 11 Ales Zelinka 2011-03-24 11:02:22 UTC
I haven't seen this issue for a long time. Won't mind closing this as fixed. Thanks.

Comment 12 Bill Peck 2011-03-24 13:17:26 UTC
Closing. If seen again, please re-open.