Bug 1569482

Summary: Postgresql pod fails to recover automatically after OpenShift master failure
Product: Red Hat Software Collections
Component: rh-postgresql96-container
Version: rh-postgresql96
Target Release: 3.6
Reporter: Pili Guerra <pguerra>
Assignee: Petr Kubat <pkubat>
QA Contact: Lukáš Zachar <lzachar>
CC: aos-bugs, hhorak, jokerman, marc.jadoul, mmccomas, pkubat, praiskup
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-08-05 08:55:32 UTC
Attachments:
  Postgresql logs from original incident (flags: none)

Description Pili Guerra 2018-04-19 11:29:38 UTC
Description of problem:

After a postgresql pod crashes, it is unable to restart correctly on its own and has to be manually restored from backup. This makes postgresql unsuitable for production operations.

This was first noticed by the customer about two months ago and does not happen with any other database pods (mysql or mariadb). Whilst *some* of the postgresql pods recover *some* of the time, there is always a percentage that experiences data corruption in the userdata directory and therefore fails to recover automatically. The first time this happened, the postgres image was upgraded to latest, but the problem still occurs with the latest version ().

From troubleshooting with the customer and looking at the logs, we suspect this could be due to the pod scaling back up before the previous one has shut down fully.

The pods are backed by persistent storage on Netapp filers (NFS).
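If the suspected race (a replacement pod starting while the previous one still holds the NFS-backed data directory) is confirmed, one mitigation is the `Recreate` deployment strategy, which waits for the old pod to terminate before starting a new one. The fragment below is a sketch, not taken from the customer's environment; the resource name and grace period are illustrative assumptions.

```yaml
# Hypothetical DeploymentConfig fragment -- illustrative only.
# Recreate ensures the old pod is fully terminated before a replacement
# starts, so two postgres processes never share the NFS-backed PGDATA.
apiVersion: apps.openshift.io/v1
kind: DeploymentConfig
metadata:
  name: postgresql            # assumed name
spec:
  replicas: 1
  strategy:
    type: Recreate
  template:
    spec:
      terminationGracePeriodSeconds: 60   # give postgres time for a clean shutdown
```

Note this only covers ordinary rescheduling; a hard kill of the container (as in the reproduction below) bypasses any graceful shutdown regardless of strategy.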

Version-Release number of selected component (if applicable):

OpenShift 3.4 - 3.6 and postgresql 9.4, 9.5 and latest

How reproducible:

Fairly easy to reproduce when there is activity on the database while it's being shut down.

Steps to Reproduce:

1.
2.
3.

Actual results:

Data in the userdata directory is corrupted and the database needs to be manually restored from backup.


Expected results:

Data in userdata is not corrupted and the database pod recovers correctly.


Additional info:

There are 2 other cases from different customers experiencing similar issues.

Comment 1 Pili Guerra 2018-04-19 11:40:22 UTC
Submitted too soon...

Steps to Reproduce:

1. To simulate a pod crash, `docker kill` the postgresql pod's container; there should ideally be some activity on the database at the time.

You should see the following messages in the logs:

LOG:  database system was shut down at 2018-04-07 00:57:23 UTC
LOG:  invalid resource manager ID 45 at 1/47830788
LOG:  invalid primary checkpoint record
LOG:  invalid resource manager ID in secondary checkpoint record
PANIC:  could not locate a valid checkpoint record
LOG:  startup process (PID 23) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure
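The reproduction step above could be driven along these lines. This is a sketch only: the `pgbench` load, the deployment name, and the label filter are assumptions, not taken from the customer's setup.

```shell
# Generate write activity so WAL is being flushed during the kill
# (pgbench availability in the rh-postgresql96 image is an assumption):
oc rsh dc/postgresql bash -c \
  'pgbench -i "$POSTGRESQL_DATABASE" && pgbench -T 60 "$POSTGRESQL_DATABASE"' &

# While the load runs, hard-kill the container on the node (SIGKILL,
# so postgres gets no chance to write a final checkpoint):
docker kill $(docker ps -q --filter label=io.kubernetes.container.name=postgresql)

# Watch the replacement pod; on an affected system the startup log ends in
# "PANIC: could not locate a valid checkpoint record" as shown above.
oc logs -f dc/postgresql
```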

Comment 7 Pili Guerra 2018-04-26 09:11:38 UTC
Created attachment 1427086 [details]
Postgresql logs from original incident

Comment 12 Marc Jadoul 2018-05-18 09:45:22 UTC
Hello,

We have seen recovery happen after the pod had been in CrashLoopBackOff for a long time... without knowing what made the recovery suddenly successful.

We are wondering if the recovery is failing because the pod does not leave enough time for the recovery to finish.

So I am wondering if I could increase the timeout in case of an ongoing recovery...
Marc
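Giving an in-progress WAL replay more time before the kubelet restarts the pod usually means tuning the liveness probe. The fragment below is a sketch; the probe type and all numbers are illustrative assumptions, not values from the postgresql template.

```yaml
# Hypothetical probe tuning -- values are illustrative.
livenessProbe:
  tcpSocket:
    port: 5432
  initialDelaySeconds: 300   # allow up to 5 minutes of WAL replay before the first check
  periodSeconds: 10
  failureThreshold: 6        # tolerate transient failures while recovery is still running
```

Whether this helps depends on whether the restarts are actually probe-driven; a crash loop caused by the PANIC above would not be affected by probe timing.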

Comment 21 Red Hat Bugzilla 2023-09-15 00:07:42 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days