Description of problem:
Dangling references are tracked and resolved in memory in a cumin-data instance when objects arrive "out of order". Because of this, a QMF class and any classes it references through the objId field must be handled in the same cumin-data instance (separate instances have separate memory space).

Currently, Submission objects are handled in their own instance. That instance must also handle JobServer objects, and consequently Scheduler and Submitter objects. Likewise, SysImage objects must be handled in the general messaging instance if the console is being used with messaging. If messaging is not being used, SysImage can be in its own instance.

Modify the cumin configuration file to group classes appropriately.

Version-Release number of selected component (if applicable):
cumin-0.1.4767

How reproducible:
Uncertain, can strike anytime. > 25%?

Steps to Reproduce:
1. Create a bunch of submissions in condor.
2. Turn on debug level logging in Cumin.
3. Start up Cumin.
4. As Cumin runs, check data.grid-submissions.log for messages beginning with "Deferring link to object RosemaryClass(com.redhat.grid,JobServer)".
5. If the above message is seen, the problem has occurred. The deferred reference will never be resolved.

Actual results:
There will be missing submissions in Cumin.

Expected results:
Ultimately, all deferred links should be resolved. Eventually, messages beginning with "Realized deferred link" should be seen corresponding to each deferred link.

Additional info:
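For steps 2-4 above, a minimal sketch of the check (the /etc/cumin/cumin.conf path, the exact log-level key syntax, and the "cumin" service name are assumptions on my part; adjust as needed):

# assumption: cumin.conf lives in /etc/cumin and uses a "log-level: <level>" key
sed -i 's/^log-level:.*/log-level: debug/' /etc/cumin/cumin.conf
# assumption: cumin is started through its init script
service cumin start
# watch the submissions instance log for deferred links that never get realized
grep "Deferring link to object RosemaryClass(com.redhat.grid,JobServer)" $CUMIN_HOME/log/data.grid-submissions.log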
Created attachment 502007 [details]
Changes to cumin.conf, move handling of some classes from data.grid to data.grid-submissions

The change here is to move handling of the JobServer, Submitter, and Scheduler classes into the same cumin-data instance that handles Submissions.
Some notes about this issue (how to clear, and how to recognize)

-- Clearing/workaround

Running a QMF console application such as qpid-tool or qpid-stat can actually clear the problem. When the new console app causes a republish of QMF data, a running Cumin will see updates for the submission objects. If the JobServer object already exists in the database, the receipt of the submission object updates will cause the missing field to be filled in (need to verify further that this is 100%). Note, normal updates of a submission because of a change of state, etc., may also clear up the missing field for a particular submission (not certain of this at this time).

It's important to recognize that if someone is trying to test around this BZ, running a QMF console app will likely cause the condition to disappear.

-- Recognizing

There are a few things that are helpful in recognizing that this condition has occurred, namely incomplete submission objects in the Postgres database which will not be displayed.

1) Cumin has a freshness filter on submissions which will not display submissions older than 3 days unless the running, idle, or held counts are non-zero. To avoid confusion, I recommend using a freshly installed condor (or one with all job history wiped out) to make sure that all submissions are younger than 3 days.

2) A "reschema" on the cumin database is also helpful. This will get rid of all the cumin data. For example:

#!/bin/bash -ex
cumin-admin drop-schema
cumin-admin create-schema
cumin-admin add-user guest guest

3) Change the log-level to "debug" in cumin.conf, start Cumin against a grid with new submissions, and let it run for a while (1 minute?).

4) When cumin-0.1.4767 is run without the cumin.conf file change, the first indicator of problematic submissions can be found if the following do not produce matching results (run from inside $CUMIN_HOME/log). In fact, without the fix, the second command should always return 0.

# greater than 0 indicates a problem
$ grep "Deferring link" data.grid-submissions.log | wc -l

# should always be 0 without the fix
$ grep "Realized deferred" data.grid-submissions.log | wc -l

5) Another way to recognize this problem is to check the number of records in the database with null _jobserverRef_id fields.

# this will show how many have nulls (single line)
$ psql -d cumin -U cumin -h localhost -c 'select count(1) from "com.redhat.grid"."Submission" WHERE "_jobserverRef_id" is null;'

# this will show how many do not have nulls
$ psql -d cumin -U cumin -h localhost -c 'select count(1) from "com.redhat.grid"."Submission" WHERE "_jobserverRef_id" is not null;'

# this, of course, will show the total
$ psql -d cumin -U cumin -h localhost -c 'select count(1) from "com.redhat.grid"."Submission";'

6) On the bottom of the Grid->Submissions tab, Cumin will show the total number of Submissions that it is currently displaying in italics, for example "25 of 1514". The second number is the total number of Submissions that are displayable. This should always match the "how many without nulls" result from the second psql query above. If it matches the total number of submissions, then object updates or a console startup have cleared the problem.

7) Working to get an idea of how reproducible this is....
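To save a little typing, the three queries above can be wrapped in one small script (a sketch only; database name, user, and host follow the commands above):

#!/bin/bash
# summarize broken vs. displayable vs. total Submission rows using the queries above
PSQL='psql -d cumin -U cumin -h localhost -t -A -c'
BROKEN=$($PSQL 'select count(1) from "com.redhat.grid"."Submission" WHERE "_jobserverRef_id" is null;')
GOOD=$($PSQL 'select count(1) from "com.redhat.grid"."Submission" WHERE "_jobserverRef_id" is not null;')
TOTAL=$($PSQL 'select count(1) from "com.redhat.grid"."Submission";')
echo "broken: $BROKEN  displayable: $GOOD  total: $TOTAL"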
Created attachment 502148 [details]
Looping test to show broken submissions on startup

This simple script will clear cumin's data, restart it, and look for broken submissions. It reports the number of failures and iterations (currently set to 20 iterations). This test assumes that the broker is running, postgres is running, condor is running, and condor has submissions in the queue. I submitted 10 jobs with multi-day sleep commands, just to keep the number of submissions small and constant. This was adequate for the test.
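Roughly, a loop of this shape reproduces the idea (this is only an approximation, not the attached 708495.sh; the "cumin" service name, the 60-second settle time, and the psql-based check are assumptions):

#!/bin/bash
# rough approximation only -- NOT the attached 708495.sh
ITERATIONS=20
FAILURES=0
for i in $(seq 1 $ITERATIONS); do
    service cumin stop
    # reschema: wipe out all cumin data (same commands as in the earlier comment)
    cumin-admin drop-schema
    cumin-admin create-schema
    cumin-admin add-user guest guest
    service cumin start
    sleep 60   # let cumin process QMF data for a while
    BROKEN=$(psql -d cumin -U cumin -h localhost -t -A -c 'select count(1) from "com.redhat.grid"."Submission" WHERE "_jobserverRef_id" is null;')
    if [ "$BROKEN" -gt 0 ]; then
        FAILURES=$((FAILURES + 1))
    fi
    echo "broken submissions $FAILURES times out of $i so far"
done
echo "broken submissions / total iterations: $FAILURES / $ITERATIONS"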
Created attachment 502149 [details]
This is the output of the 708495.sh script against cumin-0.1.4767, no fix

This output shows that on startup, cumin ended up with broken submissions 19 out of 20 times. This test was run on a local condor/qpidd/cumin install on a RHEL VM with a single slot. Note, the jobserver was not running; submissions were published by the scheduler. Results might be different running against a jobserver.
Created attachment 502150 [details]
This is the output of the 708495.sh script against cumin-0.1.4767 with the fix

This output shows that, with the change to the config file, the same cumin-0.1.4767 setup described in the previous comment successfully restarted 20 out of 20 times without seeing any broken submissions.
Created attachment 502151 [details]
For fun, this shows what happens when a console is run after broken submissions are found

This script runs qpid-stat -b (any console will do) and waits 30 seconds after bad submissions are discovered. This gives the republished submission data a chance to show up and be processed. This seems to be 100% effective (see attached output). Not sure if this is a very customer-friendly workaround, but it is at least interesting.
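For reference, the manual version of this workaround is just the following (a minimal sketch; qpid-stat -b is taken from the comment above and assumes the broker is local):

# any QMF console app will do; this causes a republish of QMF data
qpid-stat -b
# give the republished submission updates ~30 seconds to be processed by cumin
sleep 30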
Created attachment 502152 [details]
This is the output of the 708495_qpid_stat.sh script against cumin-0.1.4767, no fix

The results here show that broken submissions were found on startup 19 out of 20 times. However, in the case of broken submissions, qpid-stat -b was run to cause a republish event, and in each of those cases the data was corrected.
Changes to the cumin.conf file are in revision 4793 (4794 in the branch where candidates are built).

New package brewed, left tagged for devel at present.
Note, tomorrow I will document exactly what cumin is receiving from QMF when it sees a submission that ends up with a broken jobserver reference. What is in the jobserverRef_id field coming from QMF? And what exactly is the difference on a republish?
I am unable to reproduce this bug so far.
Sorry, I was not able to reproduce this manually, but when I run the attached script, after a while I see lines like:

broken submissions 4 times out of 6 so far

Next I will try with the modified configuration file.
oops, forgot to set this to modified.
With the changed config file (and cumin-0.1.4794-1.el5) I am getting:

broken submissions 0 times out of 10 so far
broken submissions / total iterations: 0 / 20
For completeness, a restatement of my understanding of the underlying behaviors, based on code inspection and observation in and out of the debugger.

First, the upshot: when might this error be seen, and how is it fixed?

1) Whenever Cumin is restarted and condor is running with submissions in the queue. It does not matter whether the Cumin database has been dropped and recreated before the restart; it can happen in both cases.

   ** Beginning with an already populated database makes the problem less likely to manifest, because Cumin spends a bunch of time deleting existing objects before creating new ones. Given that there are multiple cumin-data instances, this gives the instance handling JobServer (without the fix) more time to receive the JobServer object and create it in the database before the Submissions are created.

2) Whenever Cumin is up and running and the broker sends agent deletes for condor agents associated with JobServer and Submissions. This can happen on a restart of condor with a suitable delay after shutdown (20 seconds based on what I saw), or because of a temporary communication drop between condor and the broker.

3) There is not necessarily any data ordering problem on the plugin/QMF side. Because cumin-data runs as multiple instances (processes), each bound to a subset of QMF classes, there can never be a guarantee that objects received by different instances will be processed in queue order. Therefore, even if the plugin/QMF side of the equation guaranteed perfect order with no unresolved references on the sending side, the cumin deployment could still see the problem on the receiving side.

4) Starting up a new QMF console after the JobServer object has been seen by a cumin-data instance will fix the unresolved references in Submission objects 100% of the time.

5) Grouping the handling of all classes which reference each other into the same cumin-data instance should be guaranteed to solve this problem. In the future, it would be a good idea to move unresolved link resolution into shared memory (either as another cumin construct or through the database) to remove this restriction.

Details on how things work:

Submission objects are associated with a JobServer object in the cumin database. Any time that a cumin-data instance creates a new Submission object and there is no corresponding JobServer object already in the database, a deferred link is created in memory and the Submission will not display until the link is realized. The link will be realized when the cumin-data instance sees a new JobServer object from QMF with an object id that matches the id in the deferred link; the object id is a string based on QMF package, class, and host, so it is repeatable and well-known. Consequently, if Submissions and JobServer objects are handled in different cumin-data instances, any Submissions' deferred links to JobServer objects will never be realized.

If, however, the JobServer object already exists in the Cumin database, the string lookup based on the object id will succeed for a new Submission and no deferred link will be created. Likewise, if a Submission object has an incomplete JobServer reference, an update for the Submission object is received, and a JobServer object exists in the database at the time of the update, then the JobServer reference will be filled in, the Submission object will be written through to the database, and the broken link will be fixed indirectly. This is why starting a QMF console application resolves the problem.

Cumin-data tracks a list of known agents in memory. If Cumin sees an agent delete, it will delete any objects associated with that agent. Likewise, if Cumin sees an agent create, it will delete any objects associated with an agent of the same name that it has tracked. When Cumin is started fresh, the list of known agents in memory is empty, so for every agent seen, associated objects are removed from the database. These cases are important because they define the ways that Submissions and JobServer objects can be removed from the database, opening up the possibility of unresolved deferred links.
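One way to observe this state from the outside (a sketch only; the "com.redhat.grid"."JobServer" table name is an assumption that mirrors the Submission table naming used earlier): if JobServer rows are already present in the database while Submission rows still have null _jobserverRef_id, those deferred links are sitting in the memory of a cumin-data instance that will never see the JobServer object.

#!/bin/bash
# if JOBSERVERS > 0 and BROKEN > 0 persists, the deferred links are being held in the
# wrong cumin-data instance and will not be realized without an update or republish
PSQL='psql -d cumin -U cumin -h localhost -t -A -c'
JOBSERVERS=$($PSQL 'select count(1) from "com.redhat.grid"."JobServer";')   # table name assumed
BROKEN=$($PSQL 'select count(1) from "com.redhat.grid"."Submission" WHERE "_jobserverRef_id" is null;')
echo "JobServer rows: $JOBSERVERS   Submissions with null jobserver ref: $BROKEN"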