Bug 1757188

Summary: [RFE] Change the way we store reports in PostgreSQL
Product: Red Hat Satellite
Reporter: sthirugn <sthirugn>
Component: Reporting
Assignee: satellite6-bugs <satellite6-bugs>
Status: NEW
QA Contact: Lukáš Hellebrandt <lhellebr>
Severity: high
Priority: high
Version: 6.5.0
CC: ahumbe, aruzicka, bkearney, dsinglet, ehelms, fgarciad, inecas, lzap, mhulan, pmoravec, rbertolj, rjerrido, sokeeffe, sshtein, swachira
Target Milestone: Unspecified
Keywords: FieldEngineering, FutureFeature, Performance, PrioBumpField, PrioBumpGSS, PrioBumpPM, PrioBumpQA, Triaged
Target Release: Unused
Hardware: Unspecified
OS: Unspecified
Type: Bug

Description sthirugn@redhat.com 2019-09-30 19:13:43 UTC
Description of problem:
[RFE] Automatic cleanup of logs table

Version-Release number of selected component (if applicable):
Satellite 6.5.2

How reproducible:
In customer environments with many content hosts and frequent Puppet check-ins

Steps to Reproduce:
1. We have seen customer environments where the logs table contains a huge amount of data, sometimes consuming more than 35 GB:

su - postgres -c "psql foreman -c \"select count(*) from logs;\""
 count 
----------
   539298
(1 row)

Actual results:
Degradation of Satellite performance, because PostgreSQL queries take too long to finish

Expected results:
The logs table should be cleaned up automatically to keep the number of records at an optimal level - say, 50k records.

Additional info:

Comment 6 Lukas Zapletal 2019-10-01 05:54:50 UTC
Created redmine issue https://projects.theforeman.org/issues/27971 from this bug

Comment 7 Bryan Kearney 2019-10-01 08:06:32 UTC
Upstream bug assigned to lzap

Comment 8 Bryan Kearney 2019-10-01 08:06:34 UTC
Upstream bug assigned to lzap

Comment 9 sthirugn@redhat.com 2019-11-06 21:51:19 UTC
Workaround is to truncate the logs table often as explained in https://access.redhat.com/solutions/4449961

Comment 10 Lukas Zapletal 2019-12-11 11:43:25 UTC
QA: On a host with many reports (preferably 100k or more - import them or use foreman-rake console to generate them) run

rake reports:expire

and observe performance in production.log. Check that only logs older than one week were deleted and that there are no leftovers. Please share the performance numbers in the BZ. The system will log lines of this form:

logger.info "Expired #{count} #{model} at rate #{rate} #{model}/sec"

Comment 11 Lukas Zapletal 2019-12-11 11:51:57 UTC
RAKE SCRIPT TO GENERATE REPORTS (warning: it is very slow):

def make_report(host = 1, logs = 100, reported_at = Time.now.utc.to_s)
  base = {
    "host" => "host#{host}.example.com", "reported_at" => reported_at, "created_at" => reported_at,
    "status" => { "applied" => 0, "restarted" => 0, "failed" => 1, "failed_restarts" => 0, "skipped" => 0, "pending" => 0 },
    "metrics" => { "time" => { "config_retrieval" => 6.98906397819519, "total" => 13.8197405338287 }, "resources" => { "applied" => 0, "failed" => 1, "failed_restarts" => 0, "out_of_sync" => 0, "restarted" => 0, "scheduled" => 67, "skipped" => 0, "total" => 68 }, "changes" => { "total" => 0 } },
    "logs" => [],
  }
  (1..logs).each do |i|
    base["logs"].append({
      "log" => { "sources" => { "source" => "//Node[#{i}]/my_servers/minimal/time/Service[#{i}]" }, 
      "messages" => { "message" => "Failed to retrieve current state of resource: #{i}" },
      "level" => "err" }
    })
    base["logs"].append({
      "log" => { "sources" => { "source" => "Puppet" }, 
      "messages" => { "message" => "Using cached catalog" },
      "level" => "notice" }
    })
  end
  base
end

host_id = 1
# create current reports
(1..1000).each { ConfigReport.import(make_report(host_id)) }
# create one month old reports
(1..1000).each { ConfigReport.import(make_report(host_id, 100, (Time.now.utc - 2629743).to_s)) }

Comment 12 Bryan Kearney 2020-01-16 15:06:39 UTC
Upstream bug assigned to lzap

Comment 13 Bryan Kearney 2020-01-16 15:06:42 UTC
Upstream bug assigned to lzap

Comment 14 Pavel Moravec 2020-03-05 16:54:25 UTC
Bryan,
can we get this a higher priority?

There have been multiple cases where performance degradation took nontrivial time to diagnose (the cause was often hidden behind false alarms) and caused painful service impact for various bigger customers - see all the attached support cases. Having a reports cleanup script that:
- can be run (internally, e.g. in batches) outside maintenance mode (the current workaround has a concurrency problem)
- can be adjusted/configured for user scenarios like "delete any success report older than 1 day, or any report older than 1 month"

would be a big win here.

Comment 16 Lukas Zapletal 2020-03-06 06:52:36 UTC
Report expiration is already quite configurable:

  Expire Reports automatically

  Available conditions:
    * days        => number of days to keep reports (defaults to 7)
    * status      => status of the report (if not set defaults to any status)
    * report_type => report type (defaults to config_report), accepts either underscore / class name styles
    * batch_size  => number of records deleted in single SQL transaction (defaults to 1k)
    * sleep_time  => delay in seconds between batches (defaults to 0.2)

    Example:
      rake reports:expire days=7 RAILS_ENV="production" # expires all reports older than 7 days, regardless of status
      rake reports:expire days=1 status=0 RAILS_ENV="production" # expires all uninteresting (successful) reports after one day
      rake reports:expire report_type=my_report days=3 # expires all reports of type MyReport (underscore style) older than 3 days
      rake reports:expire report_type=MyReport days=3 # expires all reports of type MyReport (class name style) older than 3 days

The thing is, it IS SLOW, and we cannot speed up deletion much on deployments where new reports are being pushed in (meaning: most deployments). Not only is deletion very slow because of the huge "logs" join table, but in SQL databases the delete operation is expensive and requires a table lock, which can disrupt report uploads during that time. In other words, the way we store reports needs a major overhaul. I have described the steps here:

https://community.theforeman.org/t/rfc-optimized-reports-storage/15573