Bug 1757188 - [RFE] Change the way we store reports in PostgreSQL
Summary: [RFE] Change the way we store reports in PostgreSQL
Keywords:
Status: NEW
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Reporting
Version: 6.5.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: Lukáš Hellebrandt
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-09-30 19:13 UTC by sthirugn@redhat.com
Modified: 2024-03-28 00:35 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker SAT-1565 0 None None None 2022-04-12 14:37:08 UTC
Red Hat Knowledge Base (Solution) 4449961 0 None None None 2019-09-30 19:18:00 UTC

Description sthirugn@redhat.com 2019-09-30 19:13:43 UTC
Description of problem:
[RFE] Automatic cleanup of logs table

Version-Release number of selected component (if applicable):
Satellite 6.5.2

How reproducible:
In customer environments with many content hosts and Puppet check-ins

Steps to Reproduce:
1. We have seen customer environments where the logs table contains a very large amount of data, sometimes consuming more than 35 GB on disk.

su - postgres -c "psql foreman -c \"select count(*) from logs;\""
 count 
----------
   539298
(1 row)
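
For reference, the on-disk footprint (which is what the 35 GB figure above refers to) can be checked with a query along these lines - an illustrative sketch only; pg_total_relation_size counts the table plus its indexes and TOAST data:

su - postgres -c "psql foreman -c \"select pg_size_pretty(pg_total_relation_size('logs'));\""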

Actual results:
Degradation of Satellite performance, because PostgreSQL queries take too long to finish

Expected results:
The logs table should be cleaned up automatically to keep the number of records at an optimal level - say 50k records.

Additional info:

Comment 6 Lukas Zapletal 2019-10-01 05:54:50 UTC
Created redmine issue https://projects.theforeman.org/issues/27971 from this bug

Comment 7 Bryan Kearney 2019-10-01 08:06:32 UTC
Upstream bug assigned to lzap

Comment 8 Bryan Kearney 2019-10-01 08:06:34 UTC
Upstream bug assigned to lzap

Comment 9 sthirugn@redhat.com 2019-11-06 21:51:19 UTC
Workaround is to truncate the logs table regularly, as explained in https://access.redhat.com/solutions/4449961
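
The general shape of that workaround is a one-off truncate of the table from psql, for example (the linked knowledge base article has the authoritative steps; this one-liner is only a sketch and assumes Satellite services are stopped so nothing is writing reports at the time):

su - postgres -c "psql foreman -c \"TRUNCATE logs;\""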

Comment 10 Lukas Zapletal 2019-12-11 11:43:25 UTC
QA: On a host with many reports (100k or more preferably - import them or use foreman-rake console to generate them) run the

rake reports:expire

and observe performance in production.log. Check that only reports older than one week were deleted and that there are no leftovers. Please share the performance numbers in the BZ. The system will log lines in this format:

logger.info "Expired #{count} #{model} at rate #{rate} #{model}/sec"

Comment 11 Lukas Zapletal 2019-12-11 11:51:57 UTC
RAKE SCRIPT TO GENERATE REPORTS (warning: it is very slow):

def make_report(host = 1, logs = 100, reported_at = Time.now.utc.to_s)
  base = {
    "host" => "host#{host}.example.com", "reported_at" => reported_at, "created_at" => reported_at,
    "status" => { "applied" => 0, "restarted" => 0, "failed" => 1, "failed_restarts" => 0, "skipped" => 0, "pending" => 0 },
    "metrics" => { "time" => { "config_retrieval" => 6.98906397819519, "total" => 13.8197405338287 }, "resources" => { "applied" => 0, "failed" => 1, "failed_restarts" => 0, "out_of_sync" => 0, "restarted" => 0, "scheduled" => 67, "skipped" => 0, "total" => 68 }, "changes" => { "total" => 0 } },
    "logs" => [],
  }
  (1..logs).each do |i|
    base["logs"].append({
      "log" => { "sources" => { "source" => "//Node[#{i}]/my_servers/minimal/time/Service[#{i}]" }, 
      "messages" => { "message" => "Failed to retrieve current state of resource: #{i}" },
      "level" => "err" }
    })
    base["logs"].append({
      "log" => { "sources" => { "source" => "Puppet" }, 
      "messages" => { "message" => "Using cached catalog" },
      "level" => "notice" }
    })
  end
  base
end

host_id = 1
# create current reports
(1..1000).each { ConfigReport.import(make_report(host_id)) }
# create one month old reports
(1..1000).each { ConfigReport.import(make_report(host_id, 100, (Time.now.utc - 2629743).to_s)) }
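
One way to run the script above (as the QA note in comment 10 suggests) is to paste it into a Rails console on the Satellite, opened with:

foreman-rake console

Generating the 2000 reports this way takes quite a while, hence the warning about the script being very slow.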

Comment 12 Bryan Kearney 2020-01-16 15:06:39 UTC
Upstream bug assigned to lzap

Comment 13 Bryan Kearney 2020-01-16 15:06:42 UTC
Upstream bug assigned to lzap

Comment 14 Pavel Moravec 2020-03-05 16:54:25 UTC
Bryan,
can we get this a higher priority?

There have been multiple cases of performance degradation where identifying the cause took nontrivial time (it was often hidden behind false alarms) and which caused a painful service impact for several larger customers - see all the attached support cases. Having a reports cleanup script that:
- can be run outside maintenance mode (internally, e.g. in batches) - there is currently a concurrency problem with the workaround
- can be adjusted/configured for user scenarios like "delete any successful report older than 1 day, or any report older than 1 month"

would be a big win here.

Comment 16 Lukas Zapletal 2020-03-06 06:52:36 UTC
Report expiration is already quite configurable:

  Expire Reports automatically

  Available conditions:
    * days        => number of days to keep reports (defaults to 7)
    * status      => status of the report (if not set defaults to any status)
    * report_type => report type (defaults to config_report), accepts either underscore / class name styles
    * batch_size  => number of records deleted in single SQL transaction (defaults to 1k)
    * sleep_time  => delay in seconds between batches (defaults to 0.2)

    Example:
      rake reports:expire days=7 RAILS_ENV="production" # expires all reports regardless of their status
      rake reports:expire days=1 status=0 RAILS_ENV="production" # expires all non-interesting reports after one day
      rake reports:expire report_type=my_report days=3 # expires all reports of type MyReport (underscored style) older than 3 days.
      rake reports:expire report_type=MyReport days=3 # expires all reports of type MyReport (class name style) older than 3 days.

Thing is, it IS SLOW and we cannot speed up deletion much on deployments where new reports are being pushed in (meaning: most deployments). Not only is the deletion very slow because of the huge "logs" join table, but in SQL databases a delete operation is expensive and requires table locks, which can disrupt report uploads during that time. In other words, the way we store reports needs a major overhaul. I have described the steps here:

https://community.theforeman.org/t/rfc-optimized-reports-storage/15573
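
For illustration, the batch_size/sleep_time behaviour described above boils down to a loop roughly like the one below. This is a hedged sketch, not the actual Foreman implementation; the ConfigReport/Log model names and the report_id column are assumptions used only to show the pattern of deleting in small transactions with pauses in between:

# Hypothetical sketch of batched report expiration - NOT the real rake task.
batch_size = (ENV['batch_size'] || 1000).to_i
sleep_time = (ENV['sleep_time'] || 0.2).to_f
cutoff     = (ENV['days'] || 7).to_i.days.ago

loop do
  # Take a small slice of expired reports so locks are held only briefly.
  ids = ConfigReport.where('reported_at < ?', cutoff).limit(batch_size).pluck(:id)
  break if ids.empty?
  ConfigReport.transaction do
    Log.where(report_id: ids).delete_all      # the huge logs join table is the expensive part
    ConfigReport.where(id: ids).delete_all
  end
  sleep sleep_time                            # let concurrent report uploads get through
end

Even batched like this, deleting from a 35 GB logs table stays expensive, which is why the storage overhaul proposed in the RFC above is the longer-term fix.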

