Bug 1856677 - postgresql restarts too much, eventually fails
Summary: postgresql restarts too much, eventually fails
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Setup.EngineCommon
Version: 4.4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ovirt-4.4.2
Target Release: 4.4.2.1
Assignee: Yedidyah Bar David
QA Contact: Pavel Novotny
URL:
Whiteboard:
Depends On:
Blocks: 1847963
Reported: 2020-07-14 08:20 UTC by Yedidyah Bar David
Modified: 2020-09-18 07:12 UTC (History)

Fixed In Version: ovirt-engine-4.4.2.1
Doc Type: Bug Fix
Doc Text:
Note to doc team: In principle this bug could also happen in earlier versions (4.3 etc.) if things are quick enough. But in 4.4, with grafana integration and backup/restore of the grafana db user as well, it is much more likely; see dependent bug. Since grafana is new in 4.4, people likely have not run into this bug so far, so feel free to mark 'requires doc text' as '-'.
Suggested doc text, in case you do want it: engine-setup and ovirt-engine-provisiondb (used by engine-backup when provisioning databases) need to restart postgresql several times, depending on the exact flow. Under certain circumstances, if these restarts happened quickly enough, they could hit systemd's default start limit of 5 starts per 10 seconds, so that postgresql failed to start again and engine-setup/engine-backup failed. With this release, 'systemctl reset-failed postgresql' is run after every restart of postgresql, which prevents hitting systemd's limit.
Clone Of:
Environment:
Last Closed: 2020-09-18 07:12:01 UTC
oVirt Team: Integration
Embargoed:
sbonazzo: ovirt-4.4?
sbonazzo: planning_ack?
sbonazzo: devel_ack+
lleistne: testing_ack+


Links
System ID: oVirt gerrit 110284 | Private: 0 | Priority: master | Status: MERGED | Summary: packaging: Reset systemd restart counter on PG restart | Last Updated: 2020-09-01 21:17:26 UTC

Description Yedidyah Bar David 2020-07-14 08:20:23 UTC
Description of problem:

In certain flows, e.g. engine+dwh+grafana restore, we restart postgresql many times. If the machine is fast enough, we hit systemd's default limit of 5 starts per 10 seconds.
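
To illustrate why a fast machine runs into the limit, here is a toy model of a "5 starts per 10 seconds" limiter. It is a simplified sketch, not systemd's actual implementation: the class name, the sliding-window bookkeeping and the helper function are assumptions made for illustration only. The reset() method stands in for what 'systemctl reset-failed' does to the start rate limit counter.

```python
from collections import deque


class StartLimiter:
    """Toy model of a 'burst starts per interval' limit (not systemd's code)."""

    def __init__(self, burst=5, interval=10.0):
        self.burst = burst
        self.interval = interval
        self.starts = deque()  # timestamps of recent successful starts

    def try_start(self, now):
        # Forget starts that are older than the interval.
        while self.starts and now - self.starts[0] >= self.interval:
            self.starts.popleft()
        if len(self.starts) >= self.burst:
            return False  # analogous to 'start-limit-hit'
        self.starts.append(now)
        return True

    def reset(self):
        # Analogous to 'systemctl reset-failed': clear the counter.
        self.starts.clear()


def count_successful_restarts(times, reset_after_each):
    """Count how many restarts at the given timestamps (seconds) succeed."""
    limiter = StartLimiter()
    succeeded = 0
    for now in times:
        if limiter.try_start(now):
            succeeded += 1
            if reset_after_each:
                limiter.reset()
    return succeeded
```

With eight restarts one second apart, this model lets only the first five through unless the counter is reset after each one; restarts spaced a few seconds apart never reach the limit, which matches the observation that only fast machines are affected.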

Version-Release number of selected component (if applicable):
Current master

How reproducible:
Always, probably, on a fast-enough machine

Steps to Reproduce:
1. Take a backup of 4.4 engine+dwh+grafana
2. Restore the backup

Actual results:
If the machine is fast enough, one of the restarts will fail, e.g.:

Jul 11 11:06:23 10-37-140-71 systemd[1]: postgresql.service: Start request repeated too quickly.
Jul 11 11:06:23 10-37-140-71 systemd[1]: postgresql.service: Failed with result 'start-limit-hit'.
Jul 11 11:06:23 10-37-140-71 systemd[1]: Failed to start PostgreSQL database server.
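
For anyone checking whether a failed restore was caused by this bug, the following is a minimal sketch that scans journal lines for the 'start-limit-hit' result shown above. The function name and the regular expression are assumptions based only on the three log lines in this report; other systemd versions may word the message differently.

```python
import re

# Matches lines such as:
# "... systemd[1]: postgresql.service: Failed with result 'start-limit-hit'."
START_LIMIT_RE = re.compile(
    r"systemd\[\d+\]: (?P<unit>\S+): Failed with result 'start-limit-hit'\."
)

# The log lines quoted in this report.
SAMPLE_LINES = [
    "Jul 11 11:06:23 10-37-140-71 systemd[1]: postgresql.service: "
    "Start request repeated too quickly.",
    "Jul 11 11:06:23 10-37-140-71 systemd[1]: postgresql.service: "
    "Failed with result 'start-limit-hit'.",
    "Jul 11 11:06:23 10-37-140-71 systemd[1]: "
    "Failed to start PostgreSQL database server.",
]


def units_hitting_start_limit(journal_lines):
    """Return the unit names that failed with 'start-limit-hit', in order."""
    units = []
    for line in journal_lines:
        match = START_LIMIT_RE.search(line)
        if match and match.group("unit") not in units:
            units.append(match.group("unit"))
    return units
```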

Expected results:
I think we want this to always succeed, and without requiring permanent changes to postgresql's configuration - so I think we want our code to call 'systemctl reset-failed postgresql' after restarting it.
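
The proposed behaviour could be sketched roughly as follows. This is an illustration of the idea only, not the code that was merged for this bug: the function name and the injectable 'run' parameter are hypothetical, and the only commands used are 'systemctl restart' and 'systemctl reset-failed'.

```python
import subprocess


def restart_and_reset(unit="postgresql", run=subprocess.run):
    """Restart a unit, then clear systemd's start rate limit counter for it.

    'run' is injectable so the function can be exercised without systemd;
    by default it is subprocess.run.
    """
    for action in ("restart", "reset-failed"):
        # check=True raises CalledProcessError if systemctl exits non-zero.
        run(["systemctl", action, unit], check=True)
```

Because the counter is cleared after every successful restart, no permanent change to the postgresql unit configuration is needed.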

Additional info:

Comment 1 Yedidyah Bar David 2020-08-18 11:59:21 UTC
Workaround:

If you try to restore and it fails due to this bug, you can configure systemd to allow more restarts:

1. Edit /usr/lib/systemd/system/postgresql.service: Under section '[Unit]', add a line:
StartLimitBurst=20

2. systemctl daemon-reload

3. Stop PostgreSQL and clean its data directory (this deletes all existing PostgreSQL data):

systemctl stop postgresql
rm -rf /var/lib/pgsql/data/*

Then try the restore again.
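
Before retrying, it may help to confirm that the edit from step 1 ended up in the intended section. The sketch below is a small, hypothetical helper (not part of oVirt or systemd) that parses unit-file text and reports which section a key was found in; it handles only plain 'key=value' lines and section headers.

```python
def find_setting(unit_text, key):
    """Return (section, value) for the last 'key=' line, or None if absent."""
    section = None
    found = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
        elif "=" in line:
            name, value = line.split("=", 1)
            if name.strip() == key:
                found = (section, value.strip())
    return found


def read_setting(path, key):
    """Convenience wrapper: read a unit file from disk and look up 'key'."""
    with open(path, encoding="utf-8") as handle:
        return find_setting(handle.read(), key)
```

For example, read_setting("/usr/lib/systemd/system/postgresql.service", "StartLimitBurst") should report the section 'Unit' and the value '20' if step 1 was applied as described.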

Comment 2 Pavel Novotny 2020-08-30 15:51:25 UTC
Verified in ovirt-engine-4.4.2.3-0.6.el8ev

Engine backup & restore with full scope succeeded, no PostgreSQL errors.

Comment 3 Sandro Bonazzola 2020-09-18 07:12:01 UTC
This bug fix is included in the oVirt 4.4.2 release, published on September 17th, 2020.

Since the problem described in this bug report should be resolved in the oVirt 4.4.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

