Bug 522292
| Summary: | DAGMan/DAG submission version compatibility improvements | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Pete MacKinnon <pmackinn> | ||||||||||||||
| Component: | grid | Assignee: | Pete MacKinnon <pmackinn> | ||||||||||||||
| Status: | CLOSED ERRATA | QA Contact: | Jan Sarenik <jsarenik> | ||||||||||||||
| Severity: | medium | Docs Contact: | |||||||||||||||
| Priority: | medium | ||||||||||||||||
| Version: | 1.1.6 | CC: | iboverma, jsarenik, jthomas, lbrindle, matt, tao | ||||||||||||||
| Target Milestone: | 1.2 | ||||||||||||||||
| Target Release: | --- | ||||||||||||||||
| Hardware: | All | ||||||||||||||||
| OS: | All | ||||||||||||||||
| Whiteboard: | |||||||||||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||||||||||
| Doc Text: |
Grid enhancement
DAGMan submission version compatibility checking was improved, reducing the need for the -allowVersionMismatch option on condor_dagman.
|
Story Points: | --- | ||||||||||||||
| Clone Of: | Environment: | ||||||||||||||||
| Last Closed: | 2009-12-03 09:16:01 UTC | Type: | --- | ||||||||||||||
| Regression: | --- | Mount Type: | --- | ||||||||||||||
| Documentation: | --- | CRM: | |||||||||||||||
| Verified Versions: | Category: | --- | |||||||||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||||
| Embargoed: | |||||||||||||||||
| Bug Depends On: | |||||||||||||||||
| Bug Blocks: | 527551 | ||||||||||||||||
| Attachments: |
|
||||||||||||||||
|
Description
Pete MacKinnon
2009-09-10 01:27:30 UTC
Our focus is from 7.2 onward.
By "same", that is to say "same version level".
Initial compatibility tests:
- 5 concurrent diamond dags
- 4 nodes each
- AllowVersionMismatch set
- dagman argument changed for different versions
Results:
7.4.0-0.4 dag submit to 7.2.4 dagman - no issues
7.2.4 dag submit to 7.4.0-0.4 dagman - no issues
Other tests:
- rescue compatibility (old & new)
- tweaking dag logging to scrutinize new lazy log behavior
Will try another round using the UW DAG tests.
Created attachment 363516 [details]
OO spreadsheet with mixed-version UW dag tests
Ran the 42 UW dagman tests with baselines for 7.2.4 and 7.4.0-0.4, and submit/dagman mixes. The stork tests all failed, but I wasn't set up for stork testing.
In general, there are 2 cautionary areas:
1) gittrac #435: dagman core dump when dag has POST script and all submits fail
2) default node logs
We can call these out in our Release Notes. Also, I will provide a patch that essentially relaxes the hard -AllowVersionMismatch restriction. Possible solutions:
1) maintain a data structure that maps versions to feature compat (just for dagman). This would mean more accurate version compat checking at runtime.
2) simply log a warning if the versions are both > 7.2 and continue
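The second option can be sketched roughly as follows. This is a minimal Python illustration, not the attached patch (which works inside dagman and uses the CondorVersionInfo class). The names `MIN_TESTED`, `parse_condor_version` and `check_versions` are hypothetical, the floor is taken as 7.2.0 on the assumption that "from 7.2 onward" is inclusive, and the version string layout is the `$CondorVersion: ... $` form that dagman prints in its log.

```python
import logging
import re

# Assumption: oldest release treated as compatible ("from 7.2 onward").
MIN_TESTED = (7, 2, 0)

# Matches the leading "major.minor.sub" of a $CondorVersion: ... $ string.
_VERSION_RE = re.compile(r"\$CondorVersion:\s*(\d+)\.(\d+)\.(\d+)\b")


def parse_condor_version(version_string):
    """Return (major, minor, sub) parsed from a $CondorVersion string."""
    match = _VERSION_RE.search(version_string)
    if match is None:
        raise ValueError("not a CondorVersion string: %r" % version_string)
    return tuple(int(part) for part in match.groups())


def check_versions(submit_dag_version, dagman_version, allow_mismatch=False):
    """Return True if the run should continue, False if it should exit."""
    csd = parse_condor_version(submit_dag_version)
    dm = parse_condor_version(dagman_version)
    if csd == dm:
        return True
    if allow_mismatch:
        # Explicit override: warn and carry on regardless of versions.
        logging.warning("version mismatch allowed: %s vs. %s", csd, dm)
        return True
    if csd >= MIN_TESTED and dm >= MIN_TESTED:
        # Relaxed rule: both sides are in the tested range, so only warn.
        logging.warning("version mismatch (continuing): %s vs. %s", csd, dm)
        return True
    logging.error("version mismatch: %s vs. %s", csd, dm)
    return False
```

A check like this only helps when it lives in the dagman that is actually running. An older dagman binary still applies whatever mismatch logic it was built with.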
Created attachment 364126 [details]
Patch to add compatibility checking
These code changes make use of the CondorVersionInfo class, which assumes backward compatibility. Safe for now, but we haven't tested earlier than 7.2.4.
Built into 7.4.0-0.6
Created attachment 364478 [details]
Test scripts used for modifying args and testing UW dag tests
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
DAGMan/DAG submission version compatibility checking was improved when using -allowVersionMismatch option on condor_dagman (522292)
Release note updated.
Diffed Contents:
@@ -1 +1 @@
-DAGMan/DAG submission version compatibility checking was improved when using -allowVersionMismatch option on condor_dagman (522292)
+DAGMan/DAG submission version compatibility checking was improved, reducing the need for -allowVersionMismatch option on condor_dagman (522292)
Release note updated.
Diffed Contents:
@@ -1 +1,3 @@
+Grid enhancement
+
 DAGMan/DAG submission version compatibility checking was improved, reducing the need for -allowVersionMismatch option on condor_dagman (522292)
Release note updated.
Diffed Contents:
@@ -1,3 +1,3 @@
 Grid enhancement

-DAGMan/DAG submission version compatibility checking was improved, reducing the need for -allowVersionMismatch option on condor_dagman (522292)
+DAGMan submission version compatibility checking was improved, reducing the need for he -allowVersionMismatch option on condor_dagman.
Should all tests contained in list_dagman succeed?
I mean the original versions from condor-test-7.4.1-0.4.el5 now, not the ones modified by the scripts attached to this BZ before.
Jan,
Not sure what you mean by "list_dagman". All 42 of the upstream 7.4.0 dagman tests as of Oct 1 EXCEPT for the two stork tests (stork wasn't configured) passed for me when testing compatibility (CSD & DM at same version level). Please refer to the previously attached spreadsheet. I see that a few more dagman tests have crept into 7.4.1-0.4. I can't speak to those since I haven't had a chance to test them yet. Let me know if your test results differ.
\Pete
Just to clarify "list_dagman", I meant:
/usr/libexec/condor/test/condor_tests/list_dagman
which belongs to package condor-test
Created attachment 368210 [details]
Test script running unmodified dagman tests
Is there anything wrong with this script, or is condor buggy, given that job_dagman_large_dag.run always fails on condor-7.4.1-0.4.el5?
job_dagman_large_dag.run ran fine for me on F11 with 7.4.1-0.5. Where exactly is this list_dagman test stored?
I am running the tests together with modifications made by add_dag_args.pl now. It seems they are running fine. Expect this bug to be VERI in a few hours.
Created attachment 369211 [details]
Test scripts
Now I know why job_dagman_large_dag.run is failing on RHEL: the packages contain the file /usr/libexec/condor/test/condor_tests/create_large_dag, which should be executable (it is called from that test) but is packaged as not executable. I will ping MattF to rebuild packages if possible.
Quick fix:
chmod a+x /usr/libexec/condor/test/condor_tests/create_large_dag
Besides this one, I have not noticed any unexpected behavior during my testing, including the tests with modified arguments. I did all the tests on all available platforms, (RHEL4,RHEL5) x (i386,x86_64).
The exec bit should be set after 7.4.1-0.5.
I forgot to do 7.2 submit to 7.4 and backwards. Working on it now.
With old condor_submit_dag (from condor-7.2.2-0.9.el5.i386.rpm) submitting to normally installed condor-7.4.1-0.5.el5.i386.rpm, tests are being submitted.
With new condor_submit_dag (from condor-7.4.1-0.5.el5.i386.rpm) trying to submit dags to condor-7.2.2-0.9.el5.i386.rpm, I am getting this error and assume this behavior is expected:
----------------------------------------------------------------------------
11/16 13:19:15 Version mismatch: condor_submit_dag ($CondorVersion: 7.4.1 Nov 9 2009 BuildID: RH-7.4.1-0.5.el5 PRE-RELEASE $) vs. condor_dagman ($CondorVersion: 7.2.1 Mar 25 2009 BuildID: RH-7.2.2-0.9.el5 $)
11/16 13:19:15 **** condor_scheduniv_exec.1.0 (condor_DAGMAN) pid 3575 EXITING WITH STATUS 1
The newer CSD to older dagman behaviour is expected since it is enforced by the dagman executable (and whatever version mismatch logic it has).
Created attachment 369709 [details]
shell script
This script was used to verify the bug on supported architectures.
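For test harnesses that cannot wait for rebuilt packages, the `chmod a+x` quick fix noted for create_large_dag could also be applied programmatically before the run. The following is a small Python sketch using only the standard library. The helper name `ensure_executable` is hypothetical and is not part of any attached script.

```python
import os
import stat


def ensure_executable(path):
    """Add execute permission for user, group and other (like chmod a+x).

    Returns True if the mode was changed, False if all three execute
    bits were already set.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    exec_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    if mode & exec_bits == exec_bits:
        return False
    os.chmod(path, mode | exec_bits)
    return True
```

Calling it on /usr/libexec/condor/test/condor_tests/create_large_dag before running job_dagman_large_dag.run would mirror the manual quick fix. It needs the same privileges that `chmod` would.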
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2009-1633.html