Bug 1359918
| Summary: | [scale] Add partitions to ovirt-engine-history database | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-engine-dwh | Reporter: | mlehrer |
| Component: | Database | Assignee: | Shirly Radco <sradco> |
| Status: | CLOSED WONTFIX | QA Contact: | Lukas Svaty <lsvaty> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.0.0 | CC: | bugs, mlehrer, rgolan, sradco, ylavi |
| Target Milestone: | --- | Keywords: | FutureFeature, Performance |
| Target Release: | --- | Flags: | ykaul: ovirt-future? rule-engine: planning_ack? rule-engine: devel_ack? rule-engine: testing_ack? |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-09-05 13:02:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Metrics | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (mlehrer, 2016-07-25 19:06:35 UTC)
Please add details on the scale of the environment this behavior appeared on. Did you see this happening on a smaller scale?

(In reply to Shirly Radco from comment #1)
> Please add details on the scale of the environment this behavior appeared on.

I will add this as a private comment.

> Did you see this happening on smaller scale?

A smaller data set wasn't measured. The size of the data set and the sampling interval determine the number of rows inserted, which determines the physical space needed; so no, I didn't check a smaller environment, but I would assume that smaller amounts of rows added would still lead to growth, just at a slower rate than at a larger scale. If you are asking whether aggressive auto-vacuuming alone is enough on a smaller scale, I would still say no; a full vacuum is necessary to reclaim disk space.

(In reply to Shirly Radco from comment #1)
> Please add details on the scale of the environment this behavior appeared on.

Data set:
- vms_disks: 6000
- vms: 3000
- hosts: 300
- vms_interfaces: 6000
- hosts_interfaces: 3000

> Did you see this happening on smaller scale?

Did you try tuning the autovacuum? https://www.postgresql.org/docs/9.1/static/runtime-config-autovacuum.html

- autovacuum_vacuum_cost_delay (integer) - default is 20 milliseconds.
- autovacuum_vacuum_cost_limit (integer) - default is only 200 cost units altogether across all tables, which is clearly not enough.
- autovacuum_max_workers - default is three.
- autovacuum_naptime - default is one minute (1min).

Example: http://dba.stackexchange.com/questions/21068/aggressive-autovacuum-on-postgresql

Tuning these may resolve our vacuuming issue.

Shirly - any intentions of doing anything about it for 4.0.2?

I don't think so, and I don't think it's a blocker either. Please defer unless you have a safe solution for this.

I am moving this to 4.1, since this happens only in large-scale environments and there is a workaround of changing the sampling interval back to 60 seconds in the conf file.
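For reference, the parameters listed above live in `postgresql.conf`. The values below are an illustrative sketch in the direction of the linked "aggressive autovacuum" example, not settings shipped or recommended by oVirt; only the quoted defaults come from the PostgreSQL documentation, and any change would need to be validated against the actual load:

```
# postgresql.conf -- illustrative aggressive-autovacuum sketch (not oVirt defaults)
autovacuum = on                        # on by default
autovacuum_max_workers = 6             # default: 3
autovacuum_naptime = 15s               # default: 1min
autovacuum_vacuum_cost_delay = 10ms    # default: 20ms; lower lets workers do more per cycle
autovacuum_vacuum_cost_limit = 2000    # default: -1 (falls back to vacuum_cost_limit = 200)
autovacuum_vacuum_scale_factor = 0.05  # default: 0.2; trigger vacuum at 5% dead tuples
```

Note that plain (auto)vacuum only marks dead tuples as reusable; as the comments point out, returning disk space to the operating system still requires VACUUM FULL.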
Not fixing this for 4.0, while keeping a lower interval than we had in 3.6, will most probably bring us back to issues like BZ #1328709 (much faster this time, due to the new interval).

I suggest we do one of the following:
1. Find a way to fix it in 4.0.
2. Reduce the interval back to 60 sec, so at least we reduce the risk to the level we had in 3.6.

(In reply to Gil Klein from comment #7)
> Not fixing this for 4.0 and deciding on keeping a lower interval than we had
> in 3.6, will most probably bring us back to hit issues like BZ #1328709
> (much faster this time, due to the new interval).
>
> I suggest we do one of the following:
> 1. Find a way to fix it in 4.0
> 2. Reduce the interval back to 60 sec, so at least we reduce the risk to the
> level we had in 3.6

This issue can be resolved by updating the vacuum options for our large-scale users. The defaults are definitely not enough. At a scale like the one we tested, or the one in the bug you mentioned, db maintenance is a must, even with the 60-second interval. We will provide a best-practice document with our recommendations. Also, there is an option to add a conf file to change the interval back to 60 seconds.

Partitioning makes sense when the tables are very large, not necessarily when we have a lot of junk. The observation is accurate though: the rate of record removal is high, and therefore tons of garbage is piling up in the sampling tables. For example, on one production system that I track I see a relatively stable, high rate of dead tuples in the sampling tables. I sample it daily and see around 1.5 million dead records. I must say that the disk of that db is not the strongest, so perhaps a stronger setup would be quicker to clean up. Mordechai, do you have a report from one of your running setups that tracks dead rows over time? Shirly, have you seen bugs open on too much disk space taken by the db, or other bugs related to vacuuming?

Closing this for now. Will reopen if needed.
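The dead-row tracking asked about above can be done with PostgreSQL's standard statistics views. The query below is a generic monitoring sketch, not part of ovirt-engine-dwh: the `pg_stat_user_tables` view and its columns are standard PostgreSQL, and the table names it would surface (e.g. the sampling tables) depend on the actual history schema:

```
-- Dead vs. live tuples per table, plus the last (auto)vacuum time.
-- Run daily against the ovirt-engine-history database to track
-- dead-row accumulation over time, as suggested in the comments.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1)
           AS dead_pct,
       last_vacuum,
       last_autovacuum
FROM   pg_stat_user_tables
ORDER  BY n_dead_tup DESC
LIMIT  10;
```

A steadily climbing `n_dead_tup` with a stale `last_autovacuum` on the sampling tables would confirm the pattern described in the comment (autovacuum not keeping up with the removal rate).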