Description of problem:
[RFE] Automatic cleanup of logs table

Version-Release number of selected component (if applicable):
Satellite 6.5.2

How reproducible:
In customer environments with many content hosts and frequent puppet check-ins

Steps to Reproduce:
1. We have seen customer environments where the logs table contains a huge amount of data, sometimes consuming more than 35 GB:

su - postgres -c "psql foreman -c \"select count(*) from logs;\""
  count
----------
 539298
(1 row)

Actual results:
Degradation of Satellite performance, since postgres queries take too long to finish.

Expected results:
The logs table should be cleaned up automatically to keep the number of records at an optimal level, say 50k records.

Additional info:
Created redmine issue https://projects.theforeman.org/issues/27971 from this bug
Upstream bug assigned to lzap
Workaround is to truncate the logs table often as explained in https://access.redhat.com/solutions/4449961
QA: On a host with many reports (preferably 100k or more; import them or use foreman-rake console to generate them), run rake reports:expire and observe the performance in production.log. Check that only reports older than one week were deleted and that there are no left-overs. Please share the performance numbers in the BZ. The system will log these lines:

logger.info "Expired #{count} #{model} at rate #{rate} #{model}/sec"
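For reference when reading those log lines, the reported rate is simply records deleted divided by elapsed wall-clock time. A minimal sketch (the method name `expire_rate` is hypothetical, not Foreman code):

```ruby
require "benchmark"

# Hypothetical helper mirroring the rate reported in production.log:
# deleted record count divided by elapsed seconds.
def expire_rate(count, elapsed_seconds)
  return 0 if elapsed_seconds <= 0
  (count / elapsed_seconds).round(2)
end

# Benchmark.realtime measures the wall-clock time of the block.
elapsed = Benchmark.realtime { sleep 0.01 } # stand-in for the delete work
puts "Expired 1000 ConfigReport at rate #{expire_rate(1000, elapsed)} ConfigReport/sec"
```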
RAKE SCRIPT TO GENERATE REPORTS (warning: it is very slow):

def make_report(host = 1, logs = 100, reported_at = Time.now.utc.to_s)
  base = {
    "host" => "host#{host}.example.com",
    "reported_at" => reported_at,
    "created_at" => reported_at,
    "status" => {
      "applied" => 0, "restarted" => 0, "failed" => 1,
      "failed_restarts" => 0, "skipped" => 0, "pending" => 0
    },
    "metrics" => {
      "time" => { "config_retrieval" => 6.98906397819519, "total" => 13.8197405338287 },
      "resources" => {
        "applied" => 0, "failed" => 1, "failed_restarts" => 0, "out_of_sync" => 0,
        "restarted" => 0, "scheduled" => 67, "skipped" => 0, "total" => 68
      },
      "changes" => { "total" => 0 }
    },
    "logs" => []
  }
  (1..logs).each do |i|
    base["logs"].append({
      "log" => {
        "sources" => { "source" => "//Node[#{i}]/my_servers/minimal/time/Service[#{i}]" },
        "messages" => { "message" => "Failed to retrieve current state of resource: #{i}" },
        "level" => "err"
      }
    })
    base["logs"].append({
      "log" => {
        "sources" => { "source" => "Puppet" },
        "messages" => { "message" => "Using cached catalog" },
        "level" => "notice"
      }
    })
  end
  base
end

host_id = 1
# create current reports
(1..1000).each { ConfigReport.import(make_report(host_id)) }
# create one month old reports
(1..1000).each { ConfigReport.import(make_report(host_id, 100, (Time.now.utc - 2629743).to_s)) }
Bryan, can we get this a higher priority? There have been multiple cases where performance degradation took nontrivial time to diagnose (the cause was often hidden below false alarms) and caused painful service impact to several bigger customers - see all the attached support cases. Having a reports/cleanup script that:

- can be run (internally, e.g. in batches) outside maintenance mode (the current workaround has a concurrency problem)
- can be adjusted/configured for user scenarios like "delete any success report older than 1 day or any report older than 1 month"

will be a big win here.
Report expiration already is pretty much configurable:

Expire Reports automatically

Available conditions:
* days => number of days to keep reports (defaults to 7)
* status => status of the report (if not set, defaults to any status)
* report_type => report type (defaults to config_report); accepts either underscore or class name style
* batch_size => number of records deleted in a single SQL transaction (defaults to 1k)
* sleep_time => delay in seconds between batches (defaults to 0.2)

Example:
rake reports:expire days=7 RAILS_ENV="production" # expires all reports regardless of their status
rake reports:expire days=1 status=0 RAILS_ENV="production" # expires all non-interesting reports after one day
rake reports:expire report_type=my_report days=3 # expires all reports of type MyReport (underscore style) older than 3 days
rake reports:expire report_type=MyReport days=3 # expires all reports of type MyReport (class name style) older than 3 days

The thing is, it IS SLOW, and we cannot speed up deletion much on deployments where new reports are being pushed in (meaning: most deployments). Not only is the deletion very slow because of the huge "logs" join table, but in SQL databases the delete operation is expensive and requires table locks, and this can disrupt report uploads during that time. In other words, the way we store reports needs a major overhaul. I have described the steps here: https://community.theforeman.org/t/rfc-optimized-reports-storage/15573
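To illustrate what batch_size and sleep_time buy us, here is a minimal in-memory sketch (not the actual Foreman implementation; `expire_in_batches` and the record hashes are hypothetical) of batched expiration: delete old records in fixed-size batches with a pause between batches, so each transaction holds its lock only briefly and report uploads can interleave.

```ruby
# Sketch of batched expiration. Each iteration stands in for one bounded
# DELETE transaction; the sleep between iterations gives concurrent
# report uploads a chance to run.
def expire_in_batches(records, cutoff, batch_size: 1000, sleep_time: 0.2)
  deleted = 0
  loop do
    # Pick up to batch_size records older than the cutoff.
    batch = records.select { |r| r[:created_at] < cutoff }.first(batch_size)
    break if batch.empty?
    batch.each { |r| records.delete(r) } # stands in for one DELETE transaction
    deleted += batch.size
    sleep sleep_time if sleep_time > 0
  end
  deleted
end
```

Usage: with 25 records whose `created_at` runs 1..25, `expire_in_batches(records, 11, batch_size: 5, sleep_time: 0)` removes the 10 oldest in two batches of five and leaves the rest untouched.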