Bug 1757188
| Summary: | [RFE] Change the way we store reports in PostgreSQL | ||
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | sthirugn <sthirugn> |
| Component: | Reporting | Assignee: | satellite6-bugs <satellite6-bugs> |
| Status: | NEW --- | QA Contact: | Lukáš Hellebrandt <lhellebr> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 6.5.0 | CC: | ahumbe, aruzicka, bkearney, dsinglet, ehelms, fgarciad, inecas, lzap, mhulan, pmoravec, rbertolj, rjerrido, sokeeffe, sshtein, swachira |
| Target Milestone: | Unspecified | Keywords: | FieldEngineering, FutureFeature, Performance, PrioBumpField, PrioBumpGSS, PrioBumpPM, PrioBumpQA, Triaged |
| Target Release: | Unused | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | Type: | Bug | |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
sthirugn@redhat.com
2019-09-30 19:13:43 UTC
Created redmine issue https://projects.theforeman.org/issues/27971 from this bug Upstream bug assigned to lzap Upstream bug assigned to lzap Workaround is to truncate the logs table often as explained in https://access.redhat.com/solutions/4449961 QA: On a host with many reports (100k or more preferably - import them or use foreman-rake console to generate them) run the
rake reports:expire
and observe performance in the production.log. Check that only logs one week old were deleted and there are no left-overs. Share the performance in the BZ please. The system will log these sentences:
logger.info "Expired #{count} #{model} at rate #{rate} #{model}/sec"
RAKE SCRIPT TO GENERATE REPORTS (warning it is very slow):
def make_report(host = 1, logs = 100, reported_at = Time.now.utc.to_s)
base = {
"host" => "host#{host}.example.com", "reported_at" => reported_at, "created_at" => reported_at,
"status" => { "applied" => 0, "restarted" => 0, "failed" => 1, "failed_restarts" => 0, "skipped" => 0, "pending" => 0 },
"metrics" => { "time" => { "config_retrieval" => 6.98906397819519, "total" => 13.8197405338287 }, "resources" => { "applied" => 0, "failed" => 1, "failed_restarts" => 0, "out_of_sync" => 0, "restarted" => 0, "scheduled" => 67, "skipped" => 0, "total" => 68 }, "changes" => { "total" => 0 } },
"logs" => [],
}
(1..logs).each do |i|
base["logs"].append({
"log" => { "sources" => { "source" => "//Node[#{i}]/my_servers/minimal/time/Service[#{i}]" },
"messages" => { "message" => "Failed to retrieve current state of resource: #{i}" },
"level" => "err" }
})
base["logs"].append({
"log" => { "sources" => { "source" => "Puppet" },
"messages" => { "message" => "Using cached catalog" },
"level" => "notice" }
})
end
base
end
host_id = 1
# create current reports
(1..1000).each { ConfigReport.import(make_report(host_id)) }
# create one month old reports
(1..1000).each { ConfigReport.import(make_report(host_id, 100, (Time.now.utc - 2629743).to_s) }
Upstream bug assigned to lzap Upstream bug assigned to lzap Bryan, can we get this a higher priority? There have been multiple cases with performance degradation that took nontrivial time to identify the cause (often hidden below false alarms) and painful service impact to various bigger customers - see all the attached support cases. Having a reports/cleanup script that: - can be run (internally e.g. in batches) outside maintenance mode (there is some concurrency problem with the workaround, now) - can be adjusted/configurable for user scenarios like "delete any success report older than 1 day or any report older than 1 month" will be a big win here. Report expiration already is pretty much configurable:
Expire Reports automatically
Available conditions:
* days => number of days to keep reports (defaults to 7)
* status => status of the report (if not set defaults to any status)
* report_type => report type (defaults to config_report), accepts either underscore / class name styles
* batch_size => number of records deleted in single SQL transaction (defaults to 1k)
* sleep_time => delay in seconds between batches (defaults to 0.2)
Example:
rake reports:expire days=7 RAILS_ENV="production" # expires all reports regardless of their status
rake reports:expire days=1 status=0 RAILS_ENV="production" # expires all non interesting reports after one day
rake reports:expire report_type=my_report days=3 # expires all reports of type MyReport (underscored style) from the last 3 days.
rake reports:expire report_type=MyReport days=3 # expires all reports of type MyReport (class name style) from the last 3 days.
Thing is, it IS SLOW and we cannot speed up deletion much on deployments where new reports are being pushed in (meaning: most deployments). Not only the deletion is very slow because of huge "logs" join table but in SQL databases, delete operation is expensive and require table lock and this can distrupt report uploads during that time. In other words, the way we store reports need a major overhaul. I have described the steps here:
https://community.theforeman.org/t/rfc-optimized-reports-storage/15573
|