Bug 859365

Summary:	Jobs are killed, tasks don't finish. Error: No watchdog exists for recipe
Product:	[Retired] Beaker	Reporter:	Petr Sklenar <psklenar>
Component:	beah	Assignee:	Dan Callaghan <dcallagh>
Status:	CLOSED NOTABUG	QA Contact:
Severity:	high	Docs Contact:
Priority:	high
Version:	0.9	CC:	asaha, azelinka, dcallagh, omoris, rmancy
Target Milestone:	---	Keywords:	TestBlocker
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:
Clones:	875543		Environment:
Last Closed:	2012-09-24 22:12:59 UTC	Type:	Bug

Description Petr Sklenar 2012-09-21 10:34:07 UTC

Description of problem:
There are RHEL 7 jobs with the same tasks, about 100 tasks each.
Three times the job was killed after the third task.
Once it was killed after roughly the 30th task.

Version-Release number of selected component (if applicable):
Version - 0.9.3 

How reproducible:
Clone the jobs a few times; the point at which they are killed varies randomly.

Steps to Reproduce:
Clone the job a few times:

Killed after roughly the third task:
https://beaker.engineering.redhat.com/jobs/300537
https://beaker.engineering.redhat.com/jobs/300659
https://beaker.engineering.redhat.com/jobs/300718

This one was killed after more tasks but didn't finish either:
https://beaker.engineering.redhat.com/jobs/300920

  
Actual results:
See the end of console.log, for example:
xmlrpclib.Fault: <Fault 1: "<class 'bkr.common.bexceptions.BX'>:'No watchdog exists for recipe 639729'"> 2012-09-18 17:35:59,422 backend __on_error: ERROR 

Another:
2012-09-18 17:35:55,635 backend.twisted emit: ERROR Unhandled Error 


Expected results:
Jobs finish.

Additional info:
We can hardly test RHEL 7.

Comment 1 Dan Callaghan 2012-09-24 05:53:14 UTC

(In reply to comment #0)
> Killed after roughly the third task:
> https://beaker.engineering.redhat.com/jobs/300537
> https://beaker.engineering.redhat.com/jobs/300659
> https://beaker.engineering.redhat.com/jobs/300718

These jobs were all cancelled. The error from beah ("No watchdog exists..") is because the recipe was already terminated by that point, the system had just not been powered off quite yet.

As to why they were cancelled, or by whom, I cannot say. Are you sure you didn't accidentally cancel them yourself? :-)

If you see this happen again please report it sooner. After five days our logs are all completely rotated away, so there is no chance for me to see if something strange was going on.

> This one was killed after more tasks but didn't finish either:
> https://beaker.engineering.redhat.com/jobs/300920

The watchdog kicked in because the task exceeded its duration of 5 minutes. This isn't a Beaker bug.

Comment 2 Petr Sklenar 2012-09-24 09:14:22 UTC

(In reply to comment #1)
> If you see this happen again please report it sooner. After five days our

I am 99.9% sure these jobs weren't cancelled.
I scheduled a similar set over the weekend with similar results:
https://beaker.engineering.redhat.com/jobs/303204
https://beaker.engineering.redhat.com/jobs/303206
https://beaker.engineering.redhat.com/jobs/303223
https://beaker.engineering.redhat.com/jobs/303230

All of them finished with one of two types of error:
1.
xmlrpclib.Fault: <Fault 1: "<class 'bkr.common.bexceptions.BX'>:'No watchdog exists for recipe 645177'"> 
2.
Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ProcessDone'>: A process has ended without apparent errors: process finished with exit code 0.


------------

FYI, when I searched console.log for the string "error", I found a few Beaker issues (maybe unrelated to this bug).
The post-install scriptlet failed:

Non-fatal POSTIN scriptlet failure in rpm package beah-0.6.40-1.el7.noarch 
  Installing : beah-0.6.40-1.el7.noarch                                     8/8  
error reading information on service beah-srv: No such file or directory 
error reading information on service beah-fakelc: No such file or directory 
error reading information on service beah-beaker-backend: No such file or directory 
error reading information on service beah-fwd-backend: No such file or directory 
warning: %post(beah-0.6.40-1.el7.noarch) scriptlet failed, exit status 1 
  Verifying  : SOAPpy-0.11.6-12.el7.noarch                                  1/8 


rhts-compat is missing, or something is wrong with systemd?

ln -s '/usr/lib/systemd/system/beah-fwd-backend.service' '/etc/systemd/system/multi-user.target.wants/beah-fwd-backend.service' 
error reading information on service rhts-compat: No such file or directory

Comment 3 Ales Zelinka 2012-09-24 11:44:40 UTC

(In reply to comment #1)
> As to why they were cancelled, or by whom, I cannot say. Are you sure you
> didn't accidentally cancel them yourself? :-)
Looks like those were cancelled by beaker-jobwatch because the /distribution/install task failed:

> Broken: 4
> TJ#300537 RS:522308 (x86_64): broken #0: installation failed
> TJ#300659 RS:522477 (x86_64): broken #1: installation failed
> TJ#300718 RS:522573 (x86_64): broken #2: installation failed
> TJ#300920 RS:522936 (x86_64): broken #3: installation failed

What is the expected behavior? Should the task not fail in this case? Or should beaker-jobwatch ignore the failure and keep the job running?

Comment 4 Petr Sklenar 2012-09-24 12:10:56 UTC

(In reply to comment #2)
> I am 99.9% sure these jobs weren't cancelled.

(In reply to comment #3)
> Looks like those were cancelled by beaker-jobwatch because the
> /distribution/install task failed:

OK, so I am really sorry :( I didn't realize that our tool was cancelling jobs.

And Dan, why is there a FAIL for /distribution/install when the machine is installed? Wouldn't WARN be better?

Anyway, this bug can be closed.

Comment 5 Ales Zelinka 2012-09-24 12:30:48 UTC

FYI I've added a new command line option to beaker-jobwatch:
beaker-jobwatch --ignore-failure=install ...

With this option, beaker-jobwatch won't cancel and reschedule jobs whose /distribution/install task failed.
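
The effect of such an ignore option can be sketched roughly as below. This is not the actual beaker-jobwatch code: the function name `should_cancel`, its parameters, and the substring matching of the keyword against the task name are all assumptions made for illustration.

```python
def should_cancel(failed_tasks, ignored=()):
    """Decide whether a job should be cancelled and rescheduled.

    failed_tasks: names of tasks that failed, e.g. "/distribution/install".
    ignored: keywords (hypothetical, e.g. "install") whose matching
             task failures must not trigger a cancellation.
    """
    def is_ignored(task):
        # Assumption: a keyword matches when it occurs in the task name.
        return any(key in task for key in ignored)

    # Cancel only if at least one failure is not covered by an ignore keyword.
    return any(not is_ignored(task) for task in failed_tasks)
```

Under these assumptions, a job whose only failure is /distribution/install keeps running when "install" is ignored, while any other failed task still causes a cancellation.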

 
> And Dan, why is there a FAIL for /distribution/install when the machine is
> installed? Wouldn't WARN be better?

+1. I expect a job to be unusable when its install has failed. A WARN result for possible problems found by scanning logs seems better.

Comment 6 Dan Callaghan 2012-09-24 22:12:59 UTC

(In reply to comment #3)
> Looks like those were cancelled by beaker-jobwatch because the
> /distribution/install task failed:

Ales, can you please ensure the beaker-jobwatch script passes --msg when cancelling jobs, for example: "Cancelled by beaker-jobwatch due to..." This message will appear in the job results so it is clear what has happened.
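
A minimal sketch of how a script might build such a cancellation command, assuming it shells out to the `bkr job-cancel` client command with its `--msg` option. The helper name `build_cancel_command` and the exact message wording are hypothetical; how beaker-jobwatch actually cancels jobs is not shown in this report.

```python
def build_cancel_command(job_id, reason):
    """Build the argument list for cancelling a job with an explanation.

    job_id: numeric Beaker job ID, e.g. 300537.
    reason: short free-text reason appended to the message (illustrative).
    """
    message = "Cancelled by beaker-jobwatch due to " + reason
    # "J:<id>" is the taskspec form used for jobs elsewhere in this report.
    return ["bkr", "job-cancel", "--msg", message, "J:%d" % job_id]
```

The resulting list can be passed to `subprocess.check_call`, which avoids shell quoting problems with the message text.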

> What is the expected behavior? Should the task not fail in this case? Or
> should beaker-jobwatch ignore the failure and keep the job running?

A failure in /distribution/install does not mean the installation itself has failed (in that case the recipe will probably never start, or fail in some other weird way) but rather that post-install checks have failed. In the case of J:300537, J:300659, J:300718 it failed due to this AVC denial:

******** SElinux AVC Failures ********
[   54.234759] type=1400 audit(1348048560.060:4): avc:  denied  { create } for  pid=800 comm="systemd-tmpfile" name="user" scontext=system_u:system_r:systemd_tmpfiles_t:s0 tcontext=system_u:object_r:user_tmp_t:s0 tclass=dir

which seems like a bug in the selinux policy.
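
For anyone triaging such denials from console.log, the message can be split into its fields with a short script. The sketch below is only an illustration: the helper name `parse_avc` and the regular expressions are assumptions and handle just the simple key=value layout seen in the line above.

```python
import re

# Matches the "avc: denied { perms } for key=value ..." part of an audit line.
AVC_RE = re.compile(
    r"avc:\s+denied\s+\{\s*(?P<perms>[^}]*?)\s*\}\s+for\s+(?P<rest>.*)"
)

# The denial quoted above, copied from the console log.
SAMPLE = (
    '[   54.234759] type=1400 audit(1348048560.060:4): avc:  denied  '
    '{ create } for  pid=800 comm="systemd-tmpfile" name="user" '
    'scontext=system_u:system_r:systemd_tmpfiles_t:s0 '
    'tcontext=system_u:object_r:user_tmp_t:s0 tclass=dir'
)


def parse_avc(line):
    """Return a dict of AVC fields, or None if the line is not a denial."""
    match = AVC_RE.search(line)
    if match is None:
        return None
    fields = {"perms": match.group("perms").split()}
    # Values are either double-quoted strings or runs of non-whitespace.
    for key, value in re.findall(r'(\w+)=("[^"]*"|\S+)', match.group("rest")):
        fields[key] = value.strip('"')
    return fields
```

Calling `parse_avc(SAMPLE)` yields the denied permission (`create`), the source domain (`systemd_tmpfiles_t`) and the target type (`user_tmp_t`), which are the details an SELinux policy bug report would need.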

Comment 7 Ales Zelinka 2012-09-25 10:07:35 UTC

(In reply to comment #6)
> (In reply to comment #3)
> > Looks like those were cancelled by beaker-jobwatch because the
> > /distribution/install task failed:
> 
> Ales, can you please ensure the beaker-jobwatch script passes --msg when
> cancelling jobs, for example: "Cancelled by beaker-jobwatch due to..." This
> message will appear in the job results so it is clear what has happened.
That's a good feature. Thanks Dan.

Implemented in v1.14 (commit 8bb19397a2766f109ed4018f13ce35ee8e4a01ca, soon to be in the qa-tools-workstation package).
> 
> > What is the expected behavior? Should the task not fail in this case? Or
> > should beaker-jobwatch ignore the failure and keep the job running?
> 
> A failure in /distribution/install does not mean the installation itself has
> failed (in that case the recipe will probably never start, or fail in some
> other weird way) but rather that post-install checks have failed. In the
> case of J:300537, J:300659, J:300718 it failed due to this AVC denial:
> 
> ******** SElinux AVC Failures ********
> [   54.234759] type=1400 audit(1348048560.060:4): avc:  denied  { create }
> for  pid=800 comm="systemd-tmpfile" name="user"
> scontext=system_u:system_r:systemd_tmpfiles_t:s0
> tcontext=system_u:object_r:user_tmp_t:s0 tclass=dir
> 
> which seems like a bug in the selinux policy.

I understand that the task isn't the installation itself, but it is still confusing. I believe these post-install checks should only produce a FAIL result if the job is likely to fail (e.g. some part of the harness is missing). An AVC denial during installation? WARN is enough.

Or, if it should be on par with other tasks, just use the same mechanism: inject the /avc result into it. That would make it fail, but in a way consistent with other tasks.
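
The first proposal (FAIL only for problems likely to break the job, WARN for AVC denials) could be sketched as follows. This illustrates the suggestion only and is not how /distribution/install behaves; the function name `install_check_result` and its two inputs are hypothetical.

```python
def install_check_result(harness_problems, avc_denials):
    """Map post-install findings to a task result under the proposed policy.

    harness_problems: findings that make the job likely to fail,
                      e.g. a missing part of the harness (hypothetical input).
    avc_denials: AVC denial messages found by scanning the logs.
    """
    if harness_problems:
        return "FAIL"  # the job is unlikely to run correctly
    if avc_denials:
        return "WARN"  # suspicious, but the installed system is usable
    return "PASS"
```

With this policy, the AVC denial quoted in comment 6 would give a WARN, so a tool that reacts only to FAIL would leave the job running.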