Bug 1569482

Summary: Postgresql pod fails to recover automatically after OpenShift master failure
Product: Red Hat Software Collections
Component: rh-postgresql96-container
Version: rh-postgresql96
Target Release: 3.6
Reporter: Pili Guerra <pguerra>
Assignee: Petr Kubat <pkubat>
QA Contact: Lukáš Zachar <lzachar>
CC: aos-bugs, hhorak, jokerman, marc.jadoul, mmccomas, pkubat, praiskup
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-08-05 08:55:32 UTC
Attachments:
  Postgresql logs from original incident (flags: none)

Description Pili Guerra 2018-04-19 11:29:38 UTC
Description of problem:

After a postgresql pod crashes, it is unable to restart correctly on its own and has to be manually restored from backup. This makes postgresql unsuitable for production operations.

This was first noticed by the customer about two months ago and does not happen with any other database pods (mysql or mariadb). Whilst *some* of the postgresql pods recover *some* of the time, there is always a percentage that experiences data corruption in the userdata directory and therefore fails to recover automatically. The first time this happened, the postgres image was upgraded to latest, but the problem still occurs with the latest version ().

From troubleshooting with the customer and looking at the logs, we suspect this could be due to the pod scaling back up before the previous one has shut down fully.

The pods are backed by persistent storage on Netapp filers (NFS).
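If the suspected race (a replacement pod starting while the previous one still holds the NFS-backed data directory) is confirmed, one mitigation is the `Recreate` deployment strategy, which waits for the old pod to terminate before starting a new one. The fragment below is a sketch, not taken from the customer's environment; the resource name and grace period are illustrative assumptions.

```yaml
# Hypothetical DeploymentConfig fragment -- illustrative only.
# Recreate ensures the old pod is fully terminated before a replacement
# starts, so two postgres processes never share the NFS-backed PGDATA.
apiVersion: apps.openshift.io/v1
kind: DeploymentConfig
metadata:
  name: postgresql            # assumed name
spec:
  replicas: 1
  strategy:
    type: Recreate
  template:
    spec:
      terminationGracePeriodSeconds: 60   # give postgres time for a clean shutdown
```

Note this only covers ordinary rescheduling; a hard kill of the container (as in the reproduction below) bypasses any graceful shutdown regardless of strategy.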

Version-Release number of selected component (if applicable):

OpenShift 3.4 - 3.6 and postgresql 9.4, 9.5 and latest

How reproducible:

Fairly easy to reproduce when there is activity on the database while it's being shut down.

Steps to Reproduce:

1.
2.
3.

Actual results:

Data in the userdata directory is corrupted and the database needs to be manually restored from backup.


Expected results:

Data in userdata is not corrupted and the database pod recovers correctly.


Additional info:

There are 2 other cases from different customers experiencing similar issues.

Comment 1 Pili Guerra 2018-04-19 11:40:22 UTC
Submitted too soon...

Steps to Reproduce:

1. To simulate a pod crash, `docker kill` the postgresql pod's container; there should ideally be some activity on the database at the time.

You should see the following messages in the logs:

LOG:  database system was shut down at 2018-04-07 00:57:23 UTC
LOG:  invalid resource manager ID 45 at 1/47830788
LOG:  invalid primary checkpoint record
LOG:  invalid resource manager ID in secondary checkpoint record
PANIC:  could not locate a valid checkpoint record
LOG:  startup process (PID 23) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure
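The reproduction step above could be driven along these lines. This is a sketch only: the `pgbench` load, the deployment name, and the label filter are assumptions, not taken from the customer's setup.

```shell
# Generate write activity so WAL is being flushed during the kill
# (pgbench availability in the rh-postgresql96 image is an assumption):
oc rsh dc/postgresql bash -c \
  'pgbench -i "$POSTGRESQL_DATABASE" && pgbench -T 60 "$POSTGRESQL_DATABASE"' &

# While the load runs, hard-kill the container on the node (SIGKILL,
# so postgres gets no chance to write a final checkpoint):
docker kill $(docker ps -q --filter label=io.kubernetes.container.name=postgresql)

# Watch the replacement pod; on an affected system the startup log ends in
# "PANIC: could not locate a valid checkpoint record" as shown above.
oc logs -f dc/postgresql
```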

Comment 7 Pili Guerra 2018-04-26 09:11:38 UTC
Created attachment 1427086 [details]
Postgresql logs from original incident

Comment 12 Marc Jadoul 2018-05-18 09:45:22 UTC
Hello,

We have seen recovery happen after the pod had been in CrashLoopBackOff for a long time... without knowing what made the recovery suddenly successful.

We are wondering if the recovery is failing because the pod does not leave enough time for the recovery to finish.

So I am wondering if I could increase the timeout in case of an ongoing recovery...
Marc
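Giving an in-progress WAL replay more time before the kubelet restarts the pod usually means tuning the liveness probe. The fragment below is a sketch; the probe type and all numbers are illustrative assumptions, not values from the postgresql template.

```yaml
# Hypothetical probe tuning -- values are illustrative.
livenessProbe:
  tcpSocket:
    port: 5432
  initialDelaySeconds: 300   # allow up to 5 minutes of WAL replay before the first check
  periodSeconds: 10
  failureThreshold: 6        # tolerate transient failures while recovery is still running
```

Whether this helps depends on whether the restarts are actually probe-driven; a crash loop caused by the PANIC above would not be affected by probe timing.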

Comment 21 Red Hat Bugzilla 2023-09-15 00:07:42 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days