626353 – recipe stuck in waiting without watchdog kicking in

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Beaker
Classification:	Retired
Component:	beah
Sub Component:
Version:	0.5
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	future_maint
Assignee:	Marian Csontos
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	632609

Reported:	2010-08-23 10:11 UTC by Ales Zelinka
Modified:	2011-03-24 13:17 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-03-24 13:17:26 UTC
Embargoed:

Attachments
repeating-proxy: per-call repeating (9.29 KB, patch) 2010-09-27 13:22 UTC, Marian Csontos	no flags
repeat task_start - until it pass (2.07 KB, patch) 2010-09-27 13:22 UTC, Marian Csontos	no flags

Comment 2 Marian Csontos 2010-08-25 05:34:47 UTC

This happened because the lab controller (LC) misbehaved. The log on the machine says:

> 2010-08-13 16:33:52,771 ... <Fault 1: 'xmlrpclib.ProtocolError:<ProtocolError for beaker.engineering.redhat.com/client/: -1 >'>

The harness tried to carry on despite the failure, but that does not work.
Later, when it tries to stop the task, the Scheduler says:

>  <Fault 1: "bkr.server.bexceptions.BX:'recipe task 279973 was never started'">

The task is then scheduled to run again as the next one, with Waiting status.

Resolution:
- do not rely on the server alone to decide which task to run
- task_start and task_end need to be more robust
- report any failed XML-RPC calls on the console

Note: make sure not to get into an infinite loop! We do not want a single broken RPC to break the whole job.
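
The resolution points above (more robust calls, failures reported on the console, no infinite loop) could be sketched as a bounded retry wrapper. This is only an illustration and not beah code: `call_with_retries`, its parameters and the default limits are hypothetical, and it uses Python 3's `xmlrpc.client` rather than the `xmlrpclib` module seen in the log.

```python
import sys
import time
import xmlrpc.client


def call_with_retries(call, attempts=5, delay=1.0, log=sys.stderr,
                      sleep=time.sleep):
    """Run call() at most `attempts` times (hypothetical helper).

    Every failure is written to `log` so it shows up on the console.
    After the last attempt the final error is re-raised, so a single
    broken RPC can never loop forever.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except (xmlrpc.client.Fault, xmlrpc.client.ProtocolError,
                OSError) as exc:
            # The log above shows the failure arriving as a Fault that
            # wraps a ProtocolError, so both are treated as retryable here.
            last_error = exc
            print("XML-RPC call failed (attempt %d/%d): %r"
                  % (attempt, attempts, exc), file=log)
            if attempt < attempts:
                sleep(delay)
    raise last_error
```

A caller would wrap the real request in a zero-argument callable, for example a lambda around whatever proxy method sends task_start; the exact proxy API is not shown in this report, so that part is left to the reader.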

Comment 3 Marian Csontos 2010-09-21 14:07:40 UTC

Bill, I seek your opinion: I have repeating implemented but...

Are XML-RPC failures as seen in Comment 2 ever expected? [1]

If a call has correct parameters [2] and the network is fine, shall I simply repeat the call until it passes and let EWD kill the job if it does not? [3]

Thinking about it, the harness should not repeat the real calls but use a ping-like call instead: Bug 636093

[1] The first one. The second one is a consequence I am trying to get rid of.

[2] Let's suppose it does - otherwise something is broken already and the task/recipe will be broken anyway.

[3] For cases of broken network setup I filed Bug 636080.

Comment 4 Marian Csontos 2010-09-27 13:22:02 UTC

Created attachment 449887
repeating-proxy: per-call repeating

Comment 5 Marian Csontos 2010-09-27 13:22:51 UTC

Created attachment 449888
repeat task_start - until it pass

Comment 6 Marian Csontos 2010-09-27 18:49:51 UTC

Though it would work for task_start, this will require a more sophisticated
approach:

1. if the call fails, use ping call (as in Bug 636093) to determine if the net
   is broken and wait until service is restored.  Repeat ping until:
1.1 the original call succeeds: go on with next calls.
1.2 the ping succeeds: retry original call (2)
2. before repeating the call, try to get the original call's status:
2.1 e.g. for task_start/task_end use task_info
2.2 for task_result/upload_file repeat the call
3. the call must finish in finite time in the worst case and this must be
   reported:
3.1 try to push new result
3.2 message on console if the result fails
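
A rough sketch of the steps above, under stated assumptions: `robust_call` and all of its parameters are hypothetical names, `ping` stands in for the ping-like call from Bug 636093, `already_done` stands in for a status check such as one built on task_info, and `report` stands in for pushing a result or printing to the console. It is not the attached patch.

```python
import time


def robust_call(call, ping, already_done=None, deadline=60.0, delay=1.0,
                clock=time.monotonic, sleep=time.sleep, report=print):
    """Retry `call` until it succeeds or `deadline` seconds pass.

    call         -- zero-argument callable doing the real RPC
    ping         -- cheap callable that raises while the service is down
    already_done -- optional callable returning True if an earlier attempt
                    did take effect on the server (assumed not to raise)
    """
    give_up_at = clock() + deadline
    while clock() < give_up_at:
        try:
            return call()
        except Exception as exc:
            report("RPC failed, will retry: %r" % (exc,))
        # Step 1: poll with ping until the service is reachable again.
        while clock() < give_up_at:
            try:
                ping()
                break
            except Exception:
                sleep(delay)
        else:
            break  # deadline passed while waiting for ping
        # Step 2: before repeating, check whether the call already took
        # effect (for calls like task_result there is no check; just repeat).
        if already_done is not None and already_done():
            return None
        sleep(delay)
    # Step 3: finish in finite time and say so.
    report("giving up after %s seconds" % (deadline,))
    raise TimeoutError("RPC did not succeed within %s seconds" % (deadline,))
```

The clock and sleep functions are passed in so that the timeout path can be exercised without real waiting; `delay` must stay positive when a fake clock is advanced only by `sleep`.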

Thanks to Bill for kicking me off.

Comment 7 Marian Csontos 2010-09-27 18:55:13 UTC

As this is not a high-priority bug fix and I won't be able to provide and test the golden-grail solution for 0.5.58, I am pushing this ahead too.

Comment 8 Marian Csontos 2010-10-06 05:41:58 UTC

...and once more.

Comment 9 Bill Peck 2011-03-23 21:21:46 UTC

Ping - Is this still an issue?

Comment 10 Marian Csontos 2011-03-24 07:31:05 UTC

It is still a risk, at least.

Comment 11 Ales Zelinka 2011-03-24 11:02:22 UTC

I haven't seen this issue for a long time. Won't mind closing this as fixed. Thanks.

Comment 12 Bill Peck 2011-03-24 13:17:26 UTC

Closing; if seen again, please re-open.
