Bug 1569482 - Postgresql pod fails to recover automatically after OpenShift master failure [NEEDINFO]
Summary: Postgresql pod fails to recover automatically after OpenShift master failure
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Software Collections
Classification: Red Hat
Component: rh-postgresql96-container
Version: rh-postgresql96
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.6
Assignee: Petr Kubat
QA Contact: Lukáš Zachar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-19 11:29 UTC by Pili Guerra
Modified: 2021-06-10 15:53 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-05 08:55:32 UTC
Target Upstream Version:
praiskup: needinfo? (pguerra)


Attachments
Postgresql logs from original incident (389.83 KB, text/plain)
2018-04-26 09:11 UTC, Pili Guerra

Description Pili Guerra 2018-04-19 11:29:38 UTC
Description of problem:

After postgresql pods crash, they are unable to restart correctly on their own and need to be manually restored from backup. This makes postgresql unsupportable for production operations.

This was first noticed by the customer about two months ago and is not happening with any other database pods (mysql or mariadb). Whilst *some* of the postgresql pods recover *some* of the time, there is always a percentage that experiences data corruption in the userdata directory and therefore fails to recover automatically. The first time this happened, the postgres image was upgraded to latest, but the problem is still occurring with the latest version ().

From troubleshooting with the customer and looking at the logs, we suspect this could be due to the pod scaling back up before the previous one has shut down fully.

The pods are backed by persistent storage on Netapp filers (NFS).
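If the scale-up overlap suspected above is the cause, one mitigation worth testing is the Recreate deployment strategy, which only starts a new pod after the old one has terminated. A minimal sketch, assuming the database is managed by a DeploymentConfig named "postgresql" (the name is an assumption, not taken from this report):

# Switch the DC to the Recreate strategy so old and new pods never run at the same time
# ("postgresql" is an assumed name; substitute the actual DeploymentConfig)
oc patch dc/postgresql -p '{"spec":{"strategy":{"type":"Recreate"}}}'

# Verify the strategy that is now in effect
oc get dc/postgresql -o jsonpath='{.spec.strategy.type}'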

Version-Release number of selected component (if applicable):

OpenShift 3.4 - 3.6 and postgresql 9.4, 9.5 and latest

How reproducible:

Fairly easy to reproduce when there is activity on the database while it's being shut down.

Steps to Reproduce:

1.
2.
3.

Actual results:

data in userdata is corrupted and database needs to be manually restored from backup


Expected results:

Data in userdata is not corrupted and database pod recovers correctly.


Additional info:

There are 2 other cases by different customers experiencing similar issues.

Comment 1 Pili Guerra 2018-04-19 11:40:22 UTC
Submitted too soon...

Steps to Reproduce:

1. To simulate a pod crash, run docker kill against the postgresql pod's container; ideally there should be some activity on the database at the time (a rough sketch follows the log excerpt below).

You should see the following messages in the logs:

LOG:  database system was shut down at 2018-04-07 00:57:23 UTC
LOG:  invalid resource manager ID 45 at 1/47830788
LOG:  invalid primary checkpoint record
LOG:  invalid resource manager ID in secondary checkpoint record
PANIC:  could not locate a valid checkpoint record
LOG:  startup process (PID 23) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure
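A rough sketch of how step 1 might be driven, under a few assumptions not confirmed in this report: a single postgresql pod behind a DeploymentConfig named "postgresql", passwordless local psql access inside the container via the image's POSTGRESQL_DATABASE variable, and the default io.kubernetes.container.name label on the docker container.

# 1. Generate write activity inside the running pod (names and access are assumptions)
oc rsh dc/postgresql bash -c 'psql -d "$POSTGRESQL_DATABASE" -c "
  CREATE TABLE IF NOT EXISTS crash_load(id serial, payload text);
  INSERT INTO crash_load(payload)
    SELECT md5(random()::text) FROM generate_series(1, 500000);"' &

# 2. While the INSERT is still running, kill the container from the node it is scheduled on
docker kill $(docker ps -q --filter "label=io.kubernetes.container.name=postgresql")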

Comment 7 Pili Guerra 2018-04-26 09:11:38 UTC
Created attachment 1427086 [details]
Postgresql logs from original incident

Comment 12 Marc Jadoul 2018-05-18 09:45:22 UTC
Hello,

We have seen recovery happen after the pod had been in CrashLoopBackOff for a long time... without knowing what made the recovery suddenly successful.

We are wondering if the recovery is failing because the pod is not given enough time for the recovery to finish.

So I am wondering if I could increase the timeout in case of an ongoing recovery...
Marc
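If the theory in comment 12 holds (the container is restarted before WAL recovery completes), one knob to experiment with is the liveness probe timing on the DeploymentConfig, so the container gets more time before it can be killed again. A minimal sketch, assuming a DeploymentConfig named "postgresql" with the image's default probes; the name and the values are assumptions, not a confirmed fix:

# Give the container more time before the liveness probe can restart it
oc set probe dc/postgresql --liveness --initial-delay-seconds=300 --timeout-seconds=10

# Inspect the resulting probe configuration
oc get dc/postgresql -o yaml | grep -A8 livenessProbe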

