Bug 859365
Summary: | jobs are killed, tasks doesn't finished. Error: No watchdog exists for recipe | |||
---|---|---|---|---|
Product: | [Retired] Beaker | Reporter: | Petr Sklenar <psklenar> | |
Component: | beah | Assignee: | Dan Callaghan <dcallagh> | |
Status: | CLOSED NOTABUG | QA Contact: | ||
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | 0.9 | CC: | asaha, azelinka, dcallagh, omoris, rmancy | |
Target Milestone: | --- | Keywords: | TestBlocker | |
Target Release: | --- | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 875543 | Environment: | |
Last Closed: | 2012-09-24 22:12:59 UTC | Type: | Bug | |
Description
Petr Sklenar
2012-09-21 10:34:07 UTC
(In reply to comment #0)
> killed after +- the third tasks
> https://beaker.engineering.redhat.com/jobs/300537
> https://beaker.engineering.redhat.com/jobs/300659
> https://beaker.engineering.redhat.com/jobs/300718

These jobs were all cancelled. The error from beah ("No watchdog exists..") is because the recipe was already terminated by that point; the system had just not been powered off quite yet.

As to why they were cancelled, or by whom, I cannot say. Are you sure you didn't accidentally cancel them yourself? :-)

If you see this happen again please report it sooner. After five days our logs are all completely rotated away, so there is no chance for me to see if something strange was going on.

> It was killed after more tasks but didn't finished either:
> https://beaker.engineering.redhat.com/jobs/300920

The watchdog kicked in because the task exceeded its duration of 5 minutes. This isn't a Beaker bug.

(In reply to comment #1)
> If you see this happen again please report it sooner. After five days our

These jobs haven't been cancelled for 99,9%. I scheduled a similar set over the weekend with similar results:
https://beaker.engineering.redhat.com/jobs/303204
https://beaker.engineering.redhat.com/jobs/303206
https://beaker.engineering.redhat.com/jobs/303223
https://beaker.engineering.redhat.com/jobs/303230

All of them finished with two types of error:

1, xmlrpclib.Fault: <Fault 1: "<class 'bkr.common.bexceptions.BX'>:'No watchdog exists for recipe 645177'">

2, Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ProcessDone'>: A process has ended without apparent errors: process finished with exit code 0.
------------
FYI when I looked at console.log and searched for the string "error", I could see a few beaker issues (maybe not related to this bug):

post script failed:
Non-fatal POSTIN scriptlet failure in rpm package beah-0.6.40-1.el7.noarch
Installing : beah-0.6.40-1.el7.noarch 8/8
error reading information on service beah-srv: No such file or directory
error reading information on service beah-fakelc: No such file or directory
error reading information on service beah-beaker-backend: No such file or directory
error reading information on service beah-fwd-backend: No such file or directory
warning: %post(beah-0.6.40-1.el7.noarch) scriptlet failed, exit status 1
Verifying : SOAPpy-0.11.6-12.el7.noarch 1/8

rhts-compat missing / something with systemd ??
ln -s '/usr/lib/systemd/system/beah-fwd-backend.service' '/etc/systemd/system/multi-user.target.wants/beah-fwd-backend.service'
error reading information on service rhts-compat: No such file or directory

(In reply to comment #1)
> As to why they were cancelled, or by whom, I cannot say. Are you sure you
> didn't accidentally cancel them yourself? :-)

Looks like those were canceled by beaker-jobwatch because the /distribution/install task failed:

> Broken: 4
> TJ#300537 RS:522308 (x86_64): broken #0: installation failed
> TJ#300659 RS:522477 (x86_64): broken #1: installation failed
> TJ#300718 RS:522573 (x86_64): broken #2: installation failed
> TJ#300920 RS:522936 (x86_64): broken #3: installation failed

What is the expected behavior? Should the task not fail in this case? Or should beaker-jobwatch ignore the fail and keep the job running?

(In reply to comment #2)
> These jobs haven't been cancelled for 99,9%.

(In reply to comment #3)
> Looks like those were canceled by beaker-jobwatch because the
> /distribution/install task failed:

OK, so I am really sorry :( ; I didn't realize that our tool is cancelling jobs....

And Dan, why is there FAIL for /distribution/install, when the machine is installed? Isn't WARN better?
Anyway, this bug can be closed.

FYI I've added a new command line option to beaker-jobwatch:
beaker-jobwatch --ignore-failure=install ...
won't cancel & reschedule jobs with failed /distribution/install.
> And Dan , why is there FAIL for /distribution/install, when machine is
> installed. Isn't WARN better?
+1, I expect jobs to be unusable when their install has failed. A WARN result for possible problems found by scanning logs seems better.
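
For illustration only, the decision behind an option like --ignore-failure=install could be sketched as below. This is not the beaker-jobwatch source: the function name `should_cancel`, the shape of the task results, and the mapping from the `install` alias to `/distribution/install` are all assumptions made for this sketch.

```python
# Hypothetical sketch; none of these names come from beaker-jobwatch itself.
IGNORABLE = {"install": "/distribution/install"}  # assumed alias -> task name


def should_cancel(task_results, ignore_failure=()):
    """Return True if any failed task is not covered by ignore_failure.

    task_results: iterable of (task_name, result) pairs, for example
                  ("/distribution/install", "Fail").
    ignore_failure: aliases as they might be passed to --ignore-failure.
    """
    # Translate the aliases into the task names whose failures are tolerated.
    ignored_tasks = {IGNORABLE[a] for a in ignore_failure if a in IGNORABLE}
    for task_name, result in task_results:
        if result == "Fail" and task_name not in ignored_tasks:
            return True
    return False
```

With an empty `ignore_failure`, a failed /distribution/install leads to cancellation, which matches the behaviour reported in this bug; passing `["install"]` tolerates that one task while any other failed task still triggers cancellation.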
(In reply to comment #3)
> Looks like those were canceled by beaker-jobwatch because the
> /distribution/install task failed:

Ales, can you please ensure the beaker-jobwatch script passes --msg when cancelling jobs, for example: "Cancelled by beaker-jobwatch due to..." This message will appear in the job results so it is clear what has happened.

> What is the expected behavior? Should the task not fail in this case? Or
> should beaker-jobwatch ignore the fail and keep the job running?

A failure in /distribution/install does not mean the installation itself has failed (in that case the recipe will probably never start, or fail in some other weird way) but rather that post-install checks have failed. In the case of J:300537, J:300659, J:300718 it failed due to this AVC denial:

******** SElinux AVC Failures ********
[ 54.234759] type=1400 audit(1348048560.060:4): avc: denied { create } for pid=800 comm="systemd-tmpfile" name="user" scontext=system_u:system_r:systemd_tmpfiles_t:s0 tcontext=system_u:object_r:user_tmp_t:s0 tclass=dir

which seems like a bug in the selinux policy.

(In reply to comment #6)
> (In reply to comment #3)
> > Looks like those were canceled by beaker-jobwatch because the
> > /distribution/install task failed:
>
> Ales, can you please ensure the beaker-jobwatch script passes --msg when
> cancelling jobs, for example: "Cancelled by beaker-jobwatch due to..." This
> message will appear in the job results so it is clear what has happened.

That's a good feature. Thanks Dan. Implemented in v 1.14 (commit 8bb19397a2766f109ed4018f13ce35ee8e4a01ca, soon to be in the qa-tools-workstation package).

> > What is the expected behavior? Should the task not fail in this case? Or
> > should beaker-jobwatch ignore the fail and keep the job running?
>
> A failure in /distribution/install does not mean the installation itself has
> failed (in that case the recipe will probably never start, or fail in some
> other weird way) but rather that post-install checks have failed. In the
> case of J:300537, J:300659, J:300718 it failed due to this AVC denial:
>
> ******** SElinux AVC Failures ********
> [ 54.234759] type=1400 audit(1348048560.060:4): avc: denied { create }
> for pid=800 comm="systemd-tmpfile" name="user"
> scontext=system_u:system_r:systemd_tmpfiles_t:s0
> tcontext=system_u:object_r:user_tmp_t:s0 tclass=dir
>
> which seems like a bug in the selinux policy.

I understand that the task isn't the installation itself, but it is still confusing. I believe that these post-install checks should only produce a FAIL result if the job is likely to fail (e.g. some part of the harness is missing). AVC denial during installation? WARN is enough. Or, if it is on par with other tasks, just use the same mechanism: inject the /avc result into it. That would make it fail, but fail in a way consistent with other tasks.
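
The policy suggested in the last comment (FAIL only when the job is unlikely to run, WARN for an AVC denial seen during installation) could be sketched roughly as follows. This is an illustration of the proposal, not how /distribution/install is implemented: the function `classify_post_install`, the `harness_ok` flag and the regular expression are assumptions made for this sketch.

```python
import re

# Hypothetical sketch of the proposed policy; not Beaker's actual code.
# Matches the "avc: denied { ... }" part of an audit message.
AVC_DENIAL = re.compile(r"avc:\s+denied\s+\{[^}]*\}")


def classify_post_install(harness_ok, console_lines):
    """Return "FAIL", "WARN" or "PASS" for the post-install checks.

    harness_ok: assumed flag saying whether the harness is fully installed.
    console_lines: lines of console output to scan for AVC denials.
    """
    if not harness_ok:
        return "FAIL"  # the job is unlikely to run at all
    if any(AVC_DENIAL.search(line) for line in console_lines):
        return "WARN"  # suspicious, but the recipe can still proceed
    return "PASS"
```

Under this sketch the AVC denial quoted above would yield WARN as long as the harness is present, so a tool that cancels on FAIL would leave such jobs running.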