Description of problem: We need to reset defaults and/or provide instructions to end users on how to set the cumin vacuum interval and the postgres max_fsm_pages parameter. In overnight testing with 100+ submissions per second and around 4000 slots, we found that free space in postgres was not being managed effectively. This caused the database to "leak": more space was freed per vacuum interval than postgres could track, so postgres allocated new disk space instead of reusing freed pages. Shortening the vacuum interval to 15 minutes and increasing max_fsm_pages to 256K appears to be effective, but we're not sure there is a useful heuristic at this point. These numbers will be relative to submission/completion rates, etc.
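For reference, a minimal sketch of the postgres side of this tuning (the vacuum interval itself is a cumin configuration change, see the plan below), assuming a PostgreSQL 8.x server (max_fsm_pages no longer exists in 8.4 and later) and the stock /var/lib/pgsql/data data directory; adjust both for your installation:

  # Check the current free space map limit.
  psql -U postgres -c "SHOW max_fsm_pages;"

  # Raise the limit to roughly 256K page slots. A later entry in
  # postgresql.conf overrides an earlier one, and max_fsm_pages only
  # takes effect after a server restart.
  echo "max_fsm_pages = 262144" >> /var/lib/pgsql/data/postgresql.conf
  service postgresql restart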
*** Bug 697640 has been marked as a duplicate of this bug. ***
The plan is to address this in two ways: 1) change the "out of the box" configuration, which includes multiple cumin-data instances for medium scale and up, so that vacuuming and sample expiration run from a single thread on a 15-minute interval; 2) include a Release Note that covers setting the max_fsm_pages postgres parameter, a suggested value, and how to run a SQL command that indicates whether or not the current value is appropriate. (BZ699859)
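Something along these lines could back the Release Note's SQL check; it is only a sketch, the exact INFO/HINT wording varies by PostgreSQL version, and the database name "cumin" is an assumption:

  # A database-wide VACUUM VERBOSE ends with a free space map summary
  # reporting how many page slots are in use / required versus the
  # current limits; pre-8.4 servers also print a HINT to raise
  # max_fsm_pages when the limit is too low.
  psql -U postgres -d cumin -c "VACUUM VERBOSE;" 2>&1 | tail -n 10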
Default config file fixed in revision 4741. To test, do something like:

1) Run cumin
2) grep -l "is enabled" data.*.log data.grid.log
3) grep -l "is disabled" data.*.log data.grid-slots.log data.grid-submissions.log data.sesame.log
4) Wait 15 minutes
5) grep -l "Starting vacuum" data.*.log data.grid.log
6) grep -l "Starting expire" data.*.log data.grid.log
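The same checks wrapped into a rough script; the log directory and the expectation that data.grid is the single instance doing vacuum/expire are assumptions based on the plan above:

  #!/bin/sh
  # Adjust to wherever the cumin-data instances write their logs.
  cd /var/log/cumin || exit 1

  # Exactly one instance should report vacuum/expire as enabled
  # (expected: data.grid.log); the rest should report it disabled.
  grep -l "is enabled" data.*.log
  grep -l "is disabled" data.*.log

  # Wait out one 15-minute interval, then confirm the enabled instance
  # actually started a vacuum and an expire pass.
  sleep 900
  grep -l "Starting vacuum" data.*.log
  grep -l "Starting expire" data.*.log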