Bug 626353

Summary:

recipe stuck in waiting without watchdog kicking in

Product:

[Retired] Beaker

Reporter:

Ales Zelinka <azelinka>

Component:

beah

Assignee:

Marian Csontos <mcsontos>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Severity:

medium

Docs Contact:

Priority:

medium

Version:

0.5

CC:

bpeck, dcallagh, kbaker, mcsontos, rmancy

Target Milestone:

future_maint

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2011-03-24 13:17:26 UTC

Bug Depends On:

Bug Blocks:

632609

Attachments:

Description	Flags
repeating-proxy: per-call repeating	none
repeat task_start - until it pass	none

Comment 2 Marian Csontos 2010-08-25 05:34:47 UTC

This happened due to LC misbehaving. Log on the machine says:

> 2010-08-13 16:33:52,771 ... <Fault 1: 'xmlrpclib.ProtocolError:<ProtocolError for beaker.engineering.redhat.com/client/: -1 >'>

The harness tried to carry on despite the error, but that does not work.
Later, when trying to stop the task, the Scheduler says:

>  <Fault 1: "bkr.server.bexceptions.BX:'recipe task 279973 was never started'">

The task is then scheduled to run again as the next one, with Waiting status.

Resolution:
- do not rely on the server alone to decide which task to run
- task_start and task_end need to be more robust
- report any failed XML-RPC calls on the console

Note: Make sure not to get into an infinite loop! We do not want a single broken RPC breaking the whole job.
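
A minimal sketch of the "more robust, but never infinite" idea from the resolution above. This is not the code from the attachments; `call_with_retries` and its parameters are hypothetical names, and it uses Python 3's `xmlrpc.client` rather than the `xmlrpclib` seen in the log. It retries a call a bounded number of times and reports every failure on the console.

```python
import sys
import time
import xmlrpc.client


def call_with_retries(call, attempts=5, delay=1.0,
                      sleep=time.sleep, console=sys.stderr):
    """Run call() at most `attempts` times (hypothetical helper).

    Every failure is written to `console`. Returns (True, result) on
    success, or (False, last_error) once the attempts are used up, so a
    single broken RPC can never loop forever.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return True, call()
        except (xmlrpc.client.Fault,
                xmlrpc.client.ProtocolError,
                OSError) as exc:
            last_error = exc
            print("XML-RPC call failed (attempt %d/%d): %r"
                  % (attempt, attempts, exc), file=console)
            if attempt < attempts:
                sleep(delay)  # back off before the next try
    return False, last_error
```

The caller decides what to do when `(False, error)` comes back, for example marking the task as failed instead of asking the server for the next task.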

Comment 3 Marian Csontos 2010-09-21 14:07:40 UTC

Bill, I seek your opinion: I have repeating implemented but...

Are XML-RPC failures as seen in Comment 2 ever expected? [1]

If a call has correct parameters [2] and the network is fine, shall I simply repeat the call until it passes and let EWD kill the job if it does not? [3]

Thinking about it, the harness should not repeat the real calls but use a ping-like call instead: Bug 636093

[1] The first one. The second one is a consequence I am trying to get rid of.

[2] Let's suppose it does - otherwise something is broken already and the task/recipe will be broken anyway.

[3] For cases of broken network setup I filed Bug 636080.

Comment 4 Marian Csontos 2010-09-27 13:22:02 UTC

Created attachment 449887 [details]
repeating-proxy: per-call repeating

Comment 5 Marian Csontos 2010-09-27 13:22:51 UTC

Created attachment 449888 [details]
repeat task_start - until it pass

Comment 6 Marian Csontos 2010-09-27 18:49:51 UTC

Though it would work for task_start, this will require a more sophisticated
approach:

1. if the call fails, use ping call (as in Bug 636093) to determine if the net
   is broken and wait until service is restored.  Repeat ping until:
1.1 the original call succeeds: go on with next calls.
1.2 the ping succeeds: retry original call (2)
2. before repeating the call, try to get the original call's status:
2.1 e.g. for task_start/task_end use task_info
2.2 for task_result/upload_file repeat the call
3. the call must finish in finite time in the worst case and this must be
   reported:
3.1 try to push new result
3.2 message on console if the result fails
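
The steps above could be sketched roughly as follows. This is only an illustration under assumptions, not the harness code: `recover_call` and all of its parameters are hypothetical, `ping` stands in for the ping-like call proposed in Bug 636093, `already_done` stands in for a status check such as task_info, and catching every `Exception` is a simplification.

```python
import time


def recover_call(call, ping, already_done=None, push_failure=None,
                 deadline=60.0, delay=1.0,
                 clock=time.monotonic, sleep=time.sleep, console=print):
    """Hypothetical sketch of steps 1-3; returns "ok" or "already-done"."""
    end = clock() + deadline
    try:
        call()
        return "ok"
    except Exception as exc:  # simplification: any error counts as failure
        console("call failed: %r" % (exc,))

    while clock() < end:
        try:
            ping()  # step 1: wait until the service answers again
            # step 2.1: did the original call take effect after all?
            if already_done is not None and already_done():
                return "already-done"
            call()  # steps 1.2 / 2.2: repeat the original call
            return "ok"
        except Exception as exc:
            console("still failing: %r" % (exc,))
            sleep(delay)

    # step 3: finite time was exceeded, and this must be reported
    if push_failure is None:
        console("giving up; no way to push a result")
    else:
        try:
            push_failure()  # step 3.1: try to push a new result
        except Exception as exc:
            # step 3.2: fall back to a message on the console
            console("could not report failure: %r" % (exc,))
    raise TimeoutError("call did not succeed within %s seconds" % deadline)
```

For task_result or upload_file (step 2.2), `already_done` would simply be left as `None` so the call is repeated once the ping succeeds.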

Thanks to Bill for kicking me off.

Comment 7 Marian Csontos 2010-09-27 18:55:13 UTC

As this is not a high-priority bug fix and I won't be able to provide and test the holy-grail solution for 0.5.58, I am pushing this one ahead too.

Comment 8 Marian Csontos 2010-10-06 05:41:58 UTC

...and once more.

Comment 9 Bill Peck 2011-03-23 21:21:46 UTC

Ping - Is this still an issue?

Comment 10 Marian Csontos 2011-03-24 07:31:05 UTC

It is still a risk, at least.

Comment 11 Ales Zelinka 2011-03-24 11:02:22 UTC

I haven't seen this issue for a long time. Won't mind closing this as fixed. Thanks.

Comment 12 Bill Peck 2011-03-24 13:17:26 UTC

Closing. If seen again, please re-open.