1282733 – [openshift3/postgresql-92-rhel7] Postgresql pod is CrashLoopBackOff if using persistent storage

Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Version:	3.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Erin Boyd
QA Contact:	Liang Xia
Duplicates (1):	1282321
Depends On:	1276326 1281671 1282945
Blocks:	1281665

Reported:	2015-11-17 09:31 UTC by wewang
Modified:	2016-05-12 16:25 UTC
CC List:	21 users
Doc Type:	Bug Fix
Clone Of:	1281671
Last Closed:	2016-05-12 16:25:20 UTC

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2016:1064	0	normal	SHIPPED_LIVE	Important: Red Hat OpenShift Enterprise 3.2 security, bug fix, and enhancement update	2016-05-12 20:19:17 UTC

Comment 1 Ben Parees 2015-11-17 13:52:16 UTC

*** Bug 1282321 has been marked as a duplicate of this bug. ***

Comment 20 Ben Parees 2015-11-19 17:45:00 UTC

hmm.  I was able to run the existing (3.0.2) image successfully using NFS:

# oc describe pod postgresql-1-y11g6
Name:                postgresql-1-y11g6
Namespace:            p1
Image(s):            registry.access.redhat.com/openshift3/postgresql-92-rhel7
Node:                ip-172-18-14-203.ec2.internal/172.18.14.203
Start Time:            Thu, 19 Nov 2015 12:28:58 -0500
Labels:                deployment=postgresql-1,deploymentconfig=postgresql,name=postgresql
Status:                Running
Reason:               
Message:           
IP:                172.17.0.2
Replication Controllers:    postgresql-1 (1/1 replicas created)
Containers:
  postgresql:
    Container ID:    docker://d875f5136b6f4f543d0d5bfb0a6fb267d1c4bc76d3f1c600b71b107056566ac2
    Image:        registry.access.redhat.com/openshift3/postgresql-92-rhel7
    Image ID:        docker://c10e6b2e643e30eaa93d8c47e6d6c545ba28494cbb6e2e2862a4cb1895f07f6e
    QoS Tier:
      memory:        BestEffort
      cpu:        BestEffort
    State:        Running
      Started:        Thu, 19 Nov 2015 12:29:17 -0500
    Ready:        True
    Restart Count:    0
    Environment Variables:
      POSTGRESQL_USER:        user7NG
      POSTGRESQL_PASSWORD:    8HjI4FKKSiBTJg2V
      POSTGRESQL_DATABASE:    sampledb
Conditions:
  Type        Status
  Ready     True
Volumes:
  postgresql-data:
    Type:    PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:    postgresql
    ReadOnly:    false
  default-token-vddwf:
    Type:    Secret (a secret that should populate this volume)
    SecretName:    default-token-vddwf


I would guess there was an NFS configuration issue in the test described in comment 17.

I created my NFS volume via:
mkdir /nfs
chown -R nfsnobody:nfsnobody /nfs
chmod 777 /nfs
echo '/nfs *(rw)' >> /etc/exports
exportfs -a
setsebool -P virt_use_nfs 1


I'm also confused, since per the log output postgres did initialize successfully; it was only after the initialization (which is performed as part of the container startup) that the permission issue occurred.


That said, it's probably not worth pursuing at the moment, let's see what happens with the updated image that Scott just built.

Comment 21 Wang Haoran 2015-11-20 03:20:26 UTC

Tested with the new images:
1. I am sure the NFS volume is created successfully: the pod does not run the PostgreSQL start command until the volume is mounted, and there is also some data under the NFS server directory.
2. I guess there may be an SCC problem, but I am not sure. Please see this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1279744

logs:
1. pod startup log:
[vagrant@ose db-templates]$ oc logs -f postgresql-1-u2o7a
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

fixing permissions on existing directory /var/lib/pgsql/data/userdata ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 32MB
creating configuration files ... ok
creating template1 database in /var/lib/pgsql/data/userdata/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok

Success. You can now start the database server using:

    postgres -D /var/lib/pgsql/data/userdata
or
    pg_ctl -D /var/lib/pgsql/data/userdata -l logfile start


WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
waiting for server to start....FATAL:  data directory "/var/lib/pgsql/data/userdata" has wrong ownership
HINT:  The server must be started by the user that owns the data directory.
pg_ctl: could not start server
Examine the log output.
.... stopped waiting
[vagrant@ose db-templates]$ oc get pod
NAME                 READY     STATUS             RESTARTS   AGE
postgresql-1-u2o7a   0/1       CrashLoopBackOff   4          1m

2. data on nfs server:
[root@MTV-NFS-Squid /]# ls -rlt /haowangpv/userdata/
total 84
drwx------. 2 1000130000 nfsnobody  4096 Nov 20 04:11 pg_serial
drwx------. 2 1000130000 nfsnobody  4096 Nov 20 04:11 pg_snapshots
drwx------. 2 1000130000 nfsnobody  4096 Nov 20 04:11 pg_twophase
drwx------. 4 1000130000 nfsnobody  4096 Nov 20 04:11 pg_multixact
drwx------. 2 1000130000 nfsnobody  4096 Nov 20 04:11 pg_tblspc
drwx------. 2 1000130000 nfsnobody  4096 Nov 20 04:11 pg_stat_tmp
-rw-------. 1 1000130000 nfsnobody     4 Nov 20 04:11 PG_VERSION
-rw-------. 1 1000130000 nfsnobody  1636 Nov 20 04:11 pg_ident.conf
drwx------. 3 1000130000 nfsnobody  4096 Nov 20 04:11 pg_xlog
drwx------. 2 1000130000 nfsnobody  4096 Nov 20 04:11 pg_clog
drwx------. 2 1000130000 nfsnobody  4096 Nov 20 04:11 pg_subtrans
drwx------. 2 1000130000 nfsnobody  4096 Nov 20 04:11 pg_notify
drwx------. 2 1000130000 nfsnobody  4096 Nov 20 04:11 global
drwx------. 5 1000130000 nfsnobody  4096 Nov 20 04:11 base
-rw-------. 1 1000130000 nfsnobody 19895 Nov 20 04:11 postgresql.conf
-rw-------. 1 1000130000 nfsnobody  4674 Nov 20 04:11 pg_hba.conf
[root@MTV-NFS-Squid /]# ls -rlt /haowangpv
total 4
drwx------. 14 1000130000 nfsnobody 4096 Nov 20 04:11 userdata

Comment 22 Ben Parees 2015-11-20 04:00:09 UTC

the bug you reference only affects emptydir volumes, so it's not relevant here.

Comment 23 Ben Parees 2015-11-20 04:01:24 UTC

Martin any theories?  the permissions in the NFS dir look correct, assuming the nss wrapper is properly faking the uid to be postgres.

Comment 24 Pavel Raiskup 2015-11-20 12:17:32 UTC

Isn't it true that we should have 'root' group ownership on the datadir?

Comment 25 Pavel Raiskup 2015-11-20 12:20:58 UTC

Anyway:
if (stat_buf.st_uid != geteuid())
   This error is shown.
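
The comparison quoted above can be reproduced from a shell inside the container. The following is only a minimal sketch of that check, not the PostgreSQL source: check_datadir_owner is a hypothetical helper name, and it assumes GNU stat is available in the image.

```shell
# Sketch: mirrors the "st_uid != geteuid()" comparison quoted above.
# check_datadir_owner is a hypothetical helper, not part of the image scripts.
check_datadir_owner() {
    dir="$1"
    # Numeric owner of the directory as seen by this process (GNU stat assumed).
    owner_uid=$(stat -c '%u' "$dir") || return 2
    # Effective uid of the calling process.
    my_uid=$(id -u)
    if [ "$owner_uid" != "$my_uid" ]; then
        echo "owner uid $owner_uid != process uid $my_uid for $dir"
        return 1
    fi
    echo "owner uid matches process uid $my_uid for $dir"
}
```

Running it as check_datadir_owner /var/lib/pgsql/data/userdata in the failing pod should show whether the uid the container sees on the volume differs from the uid it runs as.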

Comment 26 Ben Parees 2015-11-20 14:45:47 UTC

No, the data dir is a volume that's explicitly mounted with permissions that enable the container uid to access it.

Comment 27 Pavel Raiskup 2015-11-20 15:30:49 UTC

Yes.

Please look at:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/postmaster/postmaster.c;h=2b16e3e26b2f9a9822adb3bf29c6424c6b95efd1;hb=6e1d26f1f79a8ed9a8222d3f6b5ea5d4b667b23f#l1319

Then you can see that the 'geteuid()' does not return the uid of DataDir
owner.

Comment 28 Eric Paris 2015-11-23 14:25:28 UTC

What uid is the container running as? Are you trying to reuse data created by some other pod, running as some other uid? Can you make this available on a system for someone else to troubleshoot and provide access details?

Comment 31 Martin Nagy 2015-11-24 08:44:18 UTC

It all seems like an nfs uid/gid mapping problem.

I would like to get some info from a running container. Can QE create a custom image with some debugging statements included? Ideally before this exec statement: https://github.com/openshift/postgresql/blob/master/9.2/root/usr/bin/run-postgresql-slave#L33

set -x
ls -laZ /var/lib/pgsql/
ls -laZ /var/lib/pgsql/data
ls -laZ /var/lib/pgsql/data/userdata
id

Comment 32 Pavel Raiskup 2015-11-24 09:19:09 UTC

Martin, is it actually a problem to take the failed container and execute
those commands inside it, without building a special image?

Comment 34 Martin Nagy 2015-11-24 10:46:17 UTC

Pavel, while it might be possible, I wouldn't trust such an execution environment. I have created a simple image with debugging statements, Haoran, can you please test with docker.io/mnagy/qe-test-debug-postgresql-92-centos7 ?
It should fail as well, but container logs should be more illuminating.

Comment 35 Honza Horak 2015-11-24 17:12:36 UTC

The conversation shifted to problems with `/var/lib/pgsql/data/userdata`, but what about the issue with `/var/run/postgresql`? Was this solved somehow?

Comment 41 Wenjing Zheng 2015-11-25 08:50:19 UTC

(In reply to Martin Nagy from comment #34)
> Pavel, while it might be possible, I wouldn't trust such an execution
> environment. I have created a simple image with debugging statements,
> Haoran, can you please test with
> docker.io/mnagy/qe-test-debug-postgresql-92-centos7 ?
> It should fail as well, but container logs should be more illuminating.

I tested with this image and got the output below after the pod went into CrashLoopBackOff:
[root@openshift-114 ~]# docker logs e86561dcdad4
+ set -eu
+ source /usr/share/container-scripts/postgresql/common.sh
++ export POSTGRESQL_MAX_CONNECTIONS=100
++ POSTGRESQL_MAX_CONNECTIONS=100
++ export POSTGRESQL_SHARED_BUFFERS=32MB
++ POSTGRESQL_SHARED_BUFFERS=32MB
++ export POSTGRESQL_RECOVERY_FILE=/var/lib/pgsql/openshift-custom-recovery.conf
++ POSTGRESQL_RECOVERY_FILE=/var/lib/pgsql/openshift-custom-recovery.conf
++ export POSTGRESQL_CONFIG_FILE=/var/lib/pgsql/openshift-custom-postgresql.conf
++ POSTGRESQL_CONFIG_FILE=/var/lib/pgsql/openshift-custom-postgresql.conf
++ postinitdb_actions=
++ psql_identifier_regex='^[a-zA-Z_][a-zA-Z0-9_]*$'
++ psql_password_regex='^[a-zA-Z0-9_~!@#$%^&*()-=<>,.?;:|]+$'
+ set_pgdata
+ '[' -O /var/lib/pgsql/data ']'
+ '[' '!' -d /var/lib/pgsql/data/userdata ']'
+ export PGDATA=/var/lib/pgsql/data/userdata
+ PGDATA=/var/lib/pgsql/data/userdata
+ check_env_vars
+ [[ -v POSTGRESQL_USER ]]
+ [[ -v POSTGRESQL_USER ]]
+ [[ -v POSTGRESQL_PASSWORD ]]
+ [[ -v POSTGRESQL_DATABASE ]]
+ [[ user =~ ^[a-zA-Z_][a-zA-Z0-9_]*$ ]]
+ [[ SF6G2qNegVkw =~ ^[a-zA-Z0-9_~!@#$%^&*()-=<>,.?;:|]+$ ]]
+ [[ userdb =~ ^[a-zA-Z_][a-zA-Z0-9_]*$ ]]
+ '[' 4 -le 63 ']'
+ '[' 6 -le 63 ']'
+ postinitdb_actions+=,simple_db
+ '[' -v POSTGRESQL_ADMIN_PASSWORD ']'
+ [[ 8OmBE81rIs0e =~ ^[a-zA-Z0-9_~!@#$%^&*()-=<>,.?;:|]+$ ]]
+ postinitdb_actions+=,admin_pass
+ case ",$postinitdb_actions," in
+ generate_passwd_file
++ id -u
+ export USER_ID=1000140000
+ USER_ID=1000140000
++ id -g
+ export GROUP_ID=0
+ GROUP_ID=0
+ envsubst
+ export LD_PRELOAD=libnss_wrapper.so
+ LD_PRELOAD=libnss_wrapper.so
+ export NSS_WRAPPER_PASSWD=/var/lib/pgsql/passwd
+ NSS_WRAPPER_PASSWD=/var/lib/pgsql/passwd
+ export NSS_WRAPPER_GROUP=/etc/group
+ NSS_WRAPPER_GROUP=/etc/group
+ generate_postgresql_config
+ envsubst
+ '[' '!' -f /var/lib/pgsql/data/userdata/postgresql.conf ']'
+ set_passwords
+ pg_ctl -w start -o '-h '\'''\'''
waiting for server to start....FATAL:  data directory "/var/lib/pgsql/data/userdata" has wrong ownership
HINT:  The server must be started by the user that owns the data directory.
.... stopped waiting
pg_ctl: could not start server
Examine the log output.

Comment 43 Pavel Raiskup 2015-11-25 10:06:09 UTC

While we see what the USER_ID / GROUP_ID are, we still don't see what the
permissions of the data directory are.  Could you send the output of all the
commands mentioned by Martin in comment #31?

Comment 44 Pavel Raiskup 2015-11-25 10:07:37 UTC

FWIW: While 'set -x' output is useful for debugging, shouldn't we probably
make sure that even with 'set -x' the passwords are not printed out?
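
One way to do that is to switch tracing off around the lines that expand a secret. The following is only a sketch under assumptions: check_secret_var is a hypothetical helper (not taken from the image scripts), it takes the variable name rather than the value so the call itself does not leak anything, and its character check is a simplified stand-in for the real password regex.

```shell
# Sketch: validate a secret without letting "set -x" print its value.
# check_secret_var is hypothetical; pass the NAME of the variable, not its value.
check_secret_var() {
    var_name="$1"
    # Remember whether xtrace is currently on.
    case "$-" in *x*) was_tracing=1 ;; *) was_tracing=0 ;; esac
    # Turn tracing off; the braces plus redirect hide the "set +x" line itself.
    { set +x; } 2>/dev/null
    # Indirect lookup of the variable's value (empty if unset).
    eval "value=\${$var_name-}"
    rc=0
    case "$value" in
        # Reject empty values or anything outside this simplified character set.
        ''|*[!a-zA-Z0-9_]*) rc=1 ;;
    esac
    unset value
    # Restore tracing only if it was on before.
    if [ "$was_tracing" -eq 1 ]; then set -x; fi
    return "$rc"
}
```

Called as check_secret_var POSTGRESQL_PASSWORD, the trace would show the variable name but not its contents.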

Comment 45 Martin Nagy 2015-11-25 10:32:58 UTC

Pavel, those commands can't be properly executed. I've placed them in the container just before the main exec, but this problem appears before that point.

I'll update the test images.

Comment 47 Wang Haoran 2015-11-26 01:59:06 UTC

See the logs from the new test image below. I think it really is the NFS bug: all files show ownership nobody:nobody.
http://file.rdu.redhat.com/~maszulik/NFSv4%20mount%20incorrectly%20shows%20all%20files%20with%20ownership%20as%20nobody:nobody%20-%20Red%20Hat%20Customer%20Portal.html
[root@openshift-107 ~]# docker logs ade36a6d547d
=============================================
uid=1000130000 gid=0(root)
=============================================
drwxrwx---. 26  0 system_u:object_r:svirt_sandbox_file_t:s0:c10,c11 .
drwxr-xr-x.  0  0 system_u:object_r:svirt_sandbox_file_t:s0:c10,c11 ..
drwxrwx---. 26  0 system_u:object_r:svirt_sandbox_file_t:s0:c10,c11 .pki
drwxrwxrwx. 99 99 system_u:object_r:nfs_t:s0       data
=============================================
drwxrwxrwx. 99 99 system_u:object_r:nfs_t:s0       .
drwxrwx---. 26  0 system_u:object_r:svirt_sandbox_file_t:s0:c10,c11 ..
drwx------. 99 99 system_u:object_r:nfs_t:s0       userdata
=============================================
drwx------. 99 99 system_u:object_r:nfs_t:s0       .
drwxrwxrwx. 99 99 system_u:object_r:nfs_t:s0       ..
-rw-------. 99 99 system_u:object_r:nfs_t:s0       PG_VERSION
drwx------. 99 99 system_u:object_r:nfs_t:s0       base
drwx------. 99 99 system_u:object_r:nfs_t:s0       global
drwx------. 99 99 system_u:object_r:nfs_t:s0       pg_clog
-rw-------. 99 99 system_u:object_r:nfs_t:s0       pg_hba.conf
-rw-------. 99 99 system_u:object_r:nfs_t:s0       pg_ident.conf
drwx------. 99 99 system_u:object_r:nfs_t:s0       pg_multixact
drwx------. 99 99 system_u:object_r:nfs_t:s0       pg_notify
drwx------. 99 99 system_u:object_r:nfs_t:s0       pg_serial
drwx------. 99 99 system_u:object_r:nfs_t:s0       pg_snapshots
drwx------. 99 99 system_u:object_r:nfs_t:s0       pg_stat_tmp
drwx------. 99 99 system_u:object_r:nfs_t:s0       pg_subtrans
drwx------. 99 99 system_u:object_r:nfs_t:s0       pg_tblspc
drwx------. 99 99 system_u:object_r:nfs_t:s0       pg_twophase
drwx------. 99 99 system_u:object_r:nfs_t:s0       pg_xlog
-rw-------. 99 99 system_u:object_r:nfs_t:s0       postgresql.conf
=============================================
waiting for server to start....FATAL:  data directory "/var/lib/pgsql/data/userdata" has wrong ownership
HINT:  The server must be started by the user that owns the data directory.
pg_ctl: could not start server
Examine the log output.
.... stopped waiting

Comment 48 Wang Haoran 2015-11-26 04:38:39 UTC

After fixing the NFS server problem according to the doc http://file.rdu.redhat.com/~maszulik/NFSv4%20mount%20incorrectly%20shows%20all%20files%20with%20ownership%20as%20nobody:nobody%20-%20Red%20Hat%20Customer%20Portal.html

Set this option on the NFS server:
 # echo 'Y' > /sys/module/nfsd/parameters/nfs4_disable_idmapping


The pod can now start successfully with the image:
 rcm-img-docker01.build.eng.bos.redhat.com:5001/openshift3/postgresql-92-rhel7 4214043d7f89

thanks for all your help, and you can move the bug to ON_QA status I think.
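
To confirm the setting on an NFS server, the parameter written above can simply be read back. The following is a minimal sketch: nfs4_idmapping_disabled is a hypothetical helper name, the default path is the one from this comment, and the optional path argument exists only so the function can be tried against an ordinary file.

```shell
# Sketch: report the NFSv4 idmapping setting on an NFS server.
# nfs4_idmapping_disabled is a hypothetical helper name.
nfs4_idmapping_disabled() {
    # Default to the sysfs parameter used in this comment.
    param="${1:-/sys/module/nfsd/parameters/nfs4_disable_idmapping}"
    if [ ! -r "$param" ]; then
        echo "cannot read $param (is the nfsd module loaded?)" >&2
        return 2
    fi
    case "$(cat "$param")" in
        Y|y|1) echo "idmapping disabled"; return 0 ;;
        *)     echo "idmapping enabled";  return 1 ;;
    esac
}
```

A value written through /sys is a runtime setting, so it is worth re-checking after the NFS server has been rebooted.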

Comment 54 Mark Turansky 2015-12-09 16:16:17 UTC

Re-assigning to erin boyd.  She's recently overhauled all the docs with the new security functionality.

Comment 55 Ben Parees 2015-12-09 16:24:45 UTC

Since this bug is a lot to read through: at this point, what needs to be done is to document the NFS configuration requirements described here: https://bugzilla.redhat.com/show_bug.cgi?id=1282733#c48

Comment 56 Erin Boyd 2016-01-05 14:16:56 UTC

Once the NFS permissions are set up correctly and this is re-run, can someone please paste in the output from oc get scc restricted?

Is this on a shared system somewhere I can access it?

Comment 57 Erin Boyd 2016-01-05 17:52:34 UTC

Also, for version 3.1 and higher, NFS should be set up with these permissions:
https://docs.openshift.com/enterprise/3.1/install_config/persistent_storage/persistent_storage_nfs.html

Comment 58 Erin Boyd 2016-01-05 17:56:30 UTC

Lastly, are you setting the DB pw in the pod spec?

Comment 59 Scott Creeley 2016-01-05 18:12:18 UTC

(In reply to Ben Parees from comment #55)
> Since this bug is a a lot to read through:  at this point what needs to be
> done is document the NFS configuration requirements described here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1282733#c48

We will create a PR to update the Admin Guide with a new troubleshooting section that references the issue/fix from comment 48 and any others that crop up in the future.

Comment 60 wewang 2016-01-06 03:06:09 UTC

(In reply to Erin Boyd from comment #56)
> Once the NFS permissions are setup correctly and this is re-ran, can someone
> please paste in the output from oc get scc restricted?
> 
> Is this on a shared system somewhere I can access it?

Hi, here is the output from oc get scc restricted in the OSE env:
-bash-4.2# oc get scc  restricted
NAME         PRIV      CAPS      HOSTDIR   SELINUX     RUNASUSER        FSGROUP    SUPGROUP   PRIORITY
restricted   false     []        false     MustRunAs   MustRunAsRange   RunAsAny   RunAsAny   <none>

Comment 61 Erin Boyd 2016-01-06 14:25:32 UTC

Did you update your nfs mount to the way it's defined in the documentation?
What is the uid range you have created, can you also paste that in as well as the pod spec?
Thanks,
Erin

Comment 62 Scott Creeley 2016-01-06 17:29:45 UTC

Created PR https://github.com/openshift/openshift-docs/pull/1394 to provide additional information in the admin guide

Comment 63 wewang 2016-01-07 07:07:56 UTC

(In reply to Erin Boyd from comment #61)
> Did you update your nfs mount to the way it's defined in the documentation?
> What is the uid range you have created, can you also paste that in as well
> as the pod spec?
> Thanks,
> Erin

Yes, I followed the documented way to create the NFS mount. The uid and gid are pasted below:
uid=1000030000 gid=0(root) groups=0(root)
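
Since the restricted SCC uses MustRunAsRange, it can help to check whether a pod's uid falls inside the project's allocated range. The sketch below assumes the range is written as "<start>/<size>" (the form I would expect in the project's openshift.io/sa.scc.uid-range annotation); uid_in_range is a hypothetical helper and the range value in the usage note is illustrative, not taken from this environment.

```shell
# Sketch: test whether a uid lies in a range written as "<start>/<size>".
# uid_in_range is hypothetical; the range format is an assumption.
uid_in_range() {
    uid="$1"
    start="${2%/*}"   # text before the slash: first uid of the block
    size="${2#*/}"    # text after the slash: number of uids in the block
    [ "$uid" -ge "$start" ] && [ "$uid" -lt $((start + size)) ]
}
```

For example, uid_in_range 1000030000 1000030000/10000 succeeds, while a uid at or beyond start + size fails.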

Comment 64 Erin Boyd 2016-02-08 15:42:48 UTC

@wewang
Were you able to re-test this with the new settings? Please provide me an update.

Comment 65 wewang 2016-02-14 07:28:09 UTC

Re-tested in:
openshift v3.1.1.6-16-g5327e56
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2
rcm-img-docker01.build.eng.bos.redhat.com:5001/openshift3/postgresql-92-rhel7 958b5ad91919

steps:
1. # oc rsh postgresql-master-1-z6e8o
bash-4.2$ id 
uid=1000190000 gid=0(root) groups=0(root)
2. bash-4.2$ psql -h postgresql-master -d userdb -U user
Password for user user: 
psql (9.2.14)
Type "help" for help.
userdb=> CREATE TABLE tbl (col1 VARCHAR(20), col2 VARCHAR(20));
CREATE TABLE

Comment 68 errata-xmlrpc 2016-05-12 16:25:20 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064
