Description of problem:
After postgresql pods crash, they are unable to restart correctly on their own and need to be manually restored from backup. This makes postgresql unsupportable for production operations.
This was first noticed by the customer about 2 months ago and is not happening with any other database pods (mysql or mariadb). Whilst *some* of the postgresql pods recover *some* of the time, there is always a percentage that experiences data corruption in the userdata directory and therefore fails to recover automatically. The first time this happened, the postgres image was upgraded to latest, but this is still happening with the latest version ().
From troubleshooting with the customer and looking at the logs, we suspect this could be due to the pod scaling back up before the previous one has shut down fully.
The pods are backed by persistent storage on Netapp filers (NFS).
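If the overlap between an old pod shutting down and a new one starting (or an unclean shutdown under load) is indeed the trigger, two mitigations we could test are forcing the DeploymentConfig to the Recreate strategy and giving postgresql a longer termination grace period. This is only a sketch; the dc name "postgresql" and the values are assumptions for illustration, not taken from the customer environment:

# Ensure the old pod is fully terminated before a replacement is started during redeployments
oc patch dc/postgresql -p '{"spec":{"strategy":{"type":"Recreate"}}}'

# Give postgresql more time to complete a clean shutdown before it is killed (value is an example)
oc patch dc/postgresql -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":120}}}}'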
Version-Release number of selected component (if applicable):
OpenShift 3.4 - 3.6 and postgresql 9.4, 9.5 and latest
How reproducible:
Fairly easy to reproduce when there is activity on the database while it's being shut down.
Steps to Reproduce:
Actual results:
Data in userdata is corrupted and the database needs to be manually restored from backup.

Expected results:
Data in userdata is not corrupted and the database pod recovers correctly.

Additional info:
There are 2 other cases by different customers experiencing similar issues.
Submitted too soon...
Steps to Reproduce:
1. To simulate a pod crash, docker kill the postgresql container on its node; there should ideally be some activity on the database at the time (a sketch of one way to drive this follows the log excerpt below).
You should see the following messages in the logs:
LOG: database system was shut down at 2018-04-07 00:57:23 UTC
LOG: invalid resource manager ID 45 at 1/47830788
LOG: invalid primary checkpoint record
LOG: invalid resource manager ID in secondary checkpoint record
PANIC: could not locate a valid checkpoint record
LOG: startup process (PID 23) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure
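For reference, a minimal sketch of how we reproduce this; the database name, user, and pgbench parameters are illustrative, not the customer's:

# Drive some write activity against the database (pgbench ships with postgresql-contrib)
pgbench -i -s 10 -h <pod-ip> -U user sampledb       # initialize test tables
pgbench -c 4 -T 300 -h <pod-ip> -U user sampledb &  # sustained load for 5 minutes

# While the load is running, kill the container uncleanly on the node hosting the pod
docker ps | grep postgresql
docker kill <container-id>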
Created attachment 1427086
Postgresql logs from original incident
We have seen some recovery happening after the pod was in CrashLoopBackOff for a long time, without knowing what made the recovery suddenly successful.
We are wondering if the recovery is failing because the pod is not given enough time for the recovery to finish.
So I am wondering if I could increase the timeout in case of ongoing recovery...
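If it is the liveness probe that interrupts a long crash recovery, one thing we could try (this is an assumption on our side; the dc name and values are only examples) is to relax the probe timings so a pod that is still replaying WAL is not treated as dead:

# Give the container more headroom before the liveness probe starts failing it (values are examples)
oc set probe dc/postgresql --liveness --open-tcp=5432 --initial-delay-seconds=300 --timeout-seconds=10

Note this would only help when recovery is genuinely still in progress; it would not help in the case above, where postgres itself exits with PANIC because it cannot locate a valid checkpoint record.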