Bug 1593183 - after upgrade to 3.10, data in mariadb-apb pod is lost
Summary: after upgrade to 3.10, data in mariadb-apb pod is lost
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Service Broker
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.0
Assignee: Jason Montleon
QA Contact: Zihan Tang
URL:
Whiteboard:
Depends On: 1611939
Blocks:
 
Reported: 2018-06-20 08:55 UTC by Zihan Tang
Modified: 2018-10-11 07:21 UTC
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-11 07:20:54 UTC
Target Upstream Version:
Embargoed:


Links:
Red Hat Product Errata RHBA-2018:2652 (last updated 2018-10-11 07:21:18 UTC)

Description Zihan Tang 2018-06-20 08:55:23 UTC
Description of problem:
In a 3.9 HA environment, provision mediawiki plus a *sql-apb, then upgrade to 3.10. MediaWiki no longer works; it shows 'A database query error has occurred. This may indicate a bug in the software.'

Version-Release number of the following components:
openshift-ansible-3.10.1-1

How reproducible:
always
Steps to Reproduce:
1. Install an OCP 3.9 HA environment.
2. Provision mediawiki and postgresql-apb, create a binding, and add the secret to mediawiki; visiting the MediaWiki website shows 'successfully installed'.
3. Upgrade to 3.10; the upgrade job should succeed.
4. Check mediawiki and postgresql.

Actual results:
The MediaWiki web site cannot be visited; it shows 'A database query error has occurred. This may indicate a bug in the software.'


before upgrade:
  [root@ip-172-18-0-247 ~]# oc logs -f postgresql-9.6-dev-1-ljvrb
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/pgsql/data/userdata ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

Success. You can now start the database server using:

    pg_ctl -D /var/lib/pgsql/data/userdata -l logfile start


WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
waiting for server to start....LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".
 done
server started
=> sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
ALTER ROLE
waiting for server to shut down.... done
server stopped
Starting server...
LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".
------------

$ tail -f /var/lib/pgsql/data/userdata/pg_log/postgresql-Tue.log
LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  autovacuum launcher shutting down
LOG:  shutting down
LOG:  database system is shut down
LOG:  database system was shut down at 2018-06-19 08:41:07 UTC
LOG:  MultiXact member wraparound protections are now enabled
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
WARNING:  there is already a transaction in progress
========
after upgrade
# oc logs -f postgresql-9.6-dev-1-n6wpb
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/pgsql/data/userdata ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

Success. You can now start the database server using:

    pg_ctl -D /var/lib/pgsql/data/userdata -l logfile start


WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
waiting for server to start....LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".
 done
server started
=> sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
ALTER ROLE
waiting for server to shut down.... done
server stopped
Starting server...
LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".
-----------------
sh-4.2$ tail -f /var/lib/pgsql/data/userdata/pg_log/postgresql-Tue.log 
ERROR:  relation "msg_resource" does not exist at character 44
STATEMENT:  DELETE /* MessageBlobStore::clear  */ FROM "msg_resource"
ERROR:  relation "msg_resource_links" does not exist at character 44
STATEMENT:  DELETE /* MessageBlobStore::clear  */ FROM "msg_resource_links"
ERROR:  relation "page" does not exist at character 219
STATEMENT:  SELECT /* WikiPage::pageData  */  page_id,page_namespace,page_title,page_restrictions,page_counter,page_is_redirect,page_is_new,page_random,page_touched,page_links_updated,page_latest,page_len,page_content_model  FROM "page"   WHERE page_namespace = '0' AND page_title = 'Main_Page'  LIMIT 1  
ERROR:  relation "job" does not exist at character 87
STATEMENT:  SELECT /* JobQueueDB::doGetSiblingQueuesWithJobs 10.2.12.1 */  DISTINCT job_cmd  FROM "job"   WHERE job_cmd IN ('refreshLinks','refreshLinks2','htmlCacheUpdate','sendMail','enotifNotify','fixDoubleRedirect','uploadFromUrl','AssembleUploadChunks','PublishStashedFile','null')   
ERROR:  relation "page" does not exist at character 219
STATEMENT:  SELECT /* WikiPage::pageData  */  page_id,page_namespace,page_title,page_restrictions,page_counter,page_is_redirect,page_is_new,page_random,page_touched,page_links_updated,page_latest,page_len,page_content_model  FROM "page"   WHERE page_namespace = '0' AND page_title = 'Main_Page'  LIMIT 1  

Expected results:
The *sql-apb pod works as expected, and MediaWiki can be visited.

Additional info:
1. In a non-HA cluster, these pods work as expected after the upgrade.
2. If another *sql-apb is provisioned, MediaWiki also cannot be visited.

Comment 2 Zhang Cheng 2018-06-20 09:02:12 UTC
I set the target release to 3.10 since this is a critical scenario.

Comment 3 Zhang Cheng 2018-06-25 02:14:35 UTC
@John,

Please assign this bug and review whether it can be fixed in 3.10. I think the target release should be 3.10 since this is a critical scenario. Thanks.

Comment 4 Zhang Cheng 2018-06-25 02:22:53 UTC
@David & Dylan,

I added you to the CC list; could you help push this bug forward in case John is absent? Thanks.

Comment 5 Jason Montleon 2018-06-25 12:41:57 UTC
To me it looks like the database data is gone.

This looks like a pod with ephemeral storage: postgresql-9.6-dev-1-ljvrb

And it looks as though it is getting shut down at some point during the upgrade? If so I would expect the database data to disappear, and of course things wouldn't work if the data is gone.

LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  autovacuum launcher shutting down
LOG:  shutting down

I suppose we could do something in the mediawiki entrypoint script to test whether the database is gone and, if so, recreate it, but any data the user created between setup and shutdown would of course be lost.
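
For illustration only, such a check could look roughly like the sketch below. This is a hypothetical sketch, not the actual mediawiki-apb entrypoint; the environment variable names (DB_HOST, DB_USER, DB_PASSWORD, DB_NAME) and the reinstall step are assumptions.

#!/bin/bash
# Hypothetical sketch: detect a missing MediaWiki schema and re-run setup.
set -euo pipefail

# Ask PostgreSQL whether the core 'page' table still exists.
page_exists=$(PGPASSWORD="$DB_PASSWORD" psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" \
  -tAc "SELECT to_regclass('public.page') IS NOT NULL")

if [ "$page_exists" != "t" ]; then
  echo "MediaWiki tables are missing; re-running the installer (previous wiki content is gone)."
  # Re-create the schema here, e.g. via MediaWiki's maintenance/install.php.
fi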

Any chance you can confirm this is what we're seeing by using a plan with persistent storage?

Comment 6 Zihan Tang 2018-06-26 02:05:01 UTC
Jason,
> I suppose we could do something in the mediawiki entrypoint script to test if the database is gone and if so recreate it,
I agree; I didn't find any related logs in the mediawiki pod.

>but of course any data the user created in between the setup and shutdown would of course be gone.
It's not expected that the data is gone after the upgrade. I tried a non-HA upgrade: the data was not gone, and MediaWiki could read data from the DB.

>Any chance you can confirm this is what we're seeing by using a plan with persistent storage?

I tried with persistent storage and with mysql-apb and mariadb-apb; after the upgrade, MediaWiki cannot get data from any of the DBs, with the same error "A database query error has occurred. This may indicate a bug in the software", but I don't find any important logs in the mysql or mariadb pods.

Comment 7 Jason Montleon 2018-06-26 16:00:32 UTC
What is the state of the database pods after the upgrade? Are they not running? I don't see how data on persistent storage can be gone after upgrading the cluster.

Comment 8 Jason Montleon 2018-06-26 17:59:56 UTC
I'm unable to reproduce due to https://bugzilla.redhat.com/show_bug.cgi?id=1591053

I am hitting the error in comment 42 of that bug 100% of the time now if I provision APBs prior to upgrading.

Comment 9 Jason Montleon 2018-06-26 18:25:41 UTC
I just managed to get through an upgrade.

Prior I had:
[root@192 ~]# oc get pods --all-namespaces
NAMESPACE                           NAME                          READY     STATUS    RESTARTS   AGE
default                             docker-registry-1-z45kg       1/1       Running   0          13m
default                             registry-console-1-6qrpn      1/1       Running   0          12m
default                             router-1-spd57                1/1       Running   0          13m
ephemeral                           mediawiki123-2-vqwsz          1/1       Running   0          3m
ephemeral                           mysql-5.7-dev-1-vfs8b         1/1       Running   0          6m
kube-service-catalog                apiserver-qqtdb               1/1       Running   0          12m
kube-service-catalog                controller-manager-8tlng      1/1       Running   1          11m
openshift-ansible-service-broker    asb-1-wpzbv                   1/1       Running   2          11m
openshift-ansible-service-broker    asb-etcd-1-t2n59              1/1       Running   0          11m
openshift-template-service-broker   apiserver-6qgbf               1/1       Running   0          11m
openshift-web-console               webconsole-75b5bb9587-tsmf8   1/1       Running   0          12m
persistent                          mediawiki123-2-c88vx          1/1       Running   0          3m
persistent                          mysql-5.7-prod-1-sq9mf        1/1       Running   0          5m
[root@192 ~]# curl -I http://mediawiki123-persistent.apps.192.168.121.107.nip.io/index.php/Main_Page
HTTP/1.1 200 OK
Date: Tue, 26 Jun 2018 17:04:16 GMT
Server: Apache/2.4.6 (Red Hat Enterprise Linux) PHP/5.4.16
X-Powered-By: PHP/5.4.16
X-Content-Type-Options: nosniff
Content-language: en
Vary: Accept-Encoding,Cookie
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: private, must-revalidate, max-age=0
Last-Modified: Tue, 26 Jun 2018 17:01:15 GMT
Content-Type: text/html; charset=UTF-8
Set-Cookie: a3400a357d077c52f4721df29f8d2d53=d6f2eddf5a8c0758dde344f27d41f83a; path=/; HttpOnly
[root@192 ~]# curl -I http://mediawiki123-ephemeral.apps.192.168.121.107.nip.io/index.php/Main_Page
HTTP/1.1 200 OK
Date: Tue, 26 Jun 2018 17:04:23 GMT
Server: Apache/2.4.6 (Red Hat Enterprise Linux) PHP/5.4.16
X-Powered-By: PHP/5.4.16
X-Content-Type-Options: nosniff
Content-language: en
Vary: Accept-Encoding,Cookie
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: private, must-revalidate, max-age=0
Last-Modified: Tue, 26 Jun 2018 17:00:47 GMT
Content-Type: text/html; charset=UTF-8
Set-Cookie: 48906e19a2223fa2bc425345094b372e=1013d5997c816dc2e0fda3052aa655a5; path=/; HttpOnly


After Upgrade:
[root@192 ~]# oc get pods --all-namespaces
NAMESPACE                           NAME                                        READY     STATUS      RESTARTS   AGE
default                             docker-registry-2-gmprn                     1/1       Running     0          10m
default                             registry-console-2-qt5rd                    1/1       Running     0          10m
default                             router-2-879bh                              1/1       Running     0          10m
ephemeral                           mediawiki123-2-vqwsz                        1/1       Running     3          1h
ephemeral                           mysql-5.7-dev-1-vfs8b                       1/1       Running     6          1h
kube-service-catalog                apiserver-k9hxw                             1/1       Running     0          8m
kube-service-catalog                controller-manager-5c8xd                    1/1       Running     0          8m
kube-system                         master-api-192.168.121.107.nip.io           1/1       Running     1          11m
kube-system                         master-controllers-192.168.121.107.nip.io   1/1       Running     1          11m
kube-system                         master-etcd-192.168.121.107.nip.io          1/1       Running     0          1h
openshift-ansible-service-broker    asb-2-nf5qq                                 1/1       Running     0          7m
openshift-ansible-service-broker    asb-etcd-migration-pjskt                    0/1       Completed   0          8m
openshift-node                      sync-mrg7d                                  1/1       Running     0          28m
openshift-sdn                       ovs-9xvfw                                   1/1       Running     0          28m
openshift-sdn                       sdn-2kvgj                                   1/1       Running     1          28m
openshift-template-service-broker   apiserver-cqj7w                             1/1       Running     0          7m
openshift-web-console               webconsole-7b88c47974-8s5mp                 1/1       Running     0          11m
persistent                          mediawiki123-2-c88vx                        1/1       Running     3          1h
persistent                          mysql-5.7-prod-1-sq9mf                      1/1       Running     5          1h
[root@192 ~]# curl -I http://mediawiki123-persistent.apps.192.168.121.107.nip.io/index.php/Main_Page
HTTP/1.1 200 OK
Date: Tue, 26 Jun 2018 18:19:49 GMT
Server: Apache/2.4.6 (Red Hat Enterprise Linux) PHP/5.4.16
X-Powered-By: PHP/5.4.16
X-Content-Type-Options: nosniff
Content-language: en
Vary: Accept-Encoding,Cookie
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: private, must-revalidate, max-age=0
Last-Modified: Tue, 26 Jun 2018 17:55:28 GMT
Content-Type: text/html; charset=UTF-8
Set-Cookie: a3400a357d077c52f4721df29f8d2d53=eb274661ba3b6906687d3971f68c5f27; path=/; HttpOnly
[root@192 ~]# curl -I http://mediawiki123-ephemeral.apps.192.168.121.107.nip.io/index.php/Main_Page
HTTP/1.1 200 OK
Date: Tue, 26 Jun 2018 18:19:58 GMT
Server: Apache/2.4.6 (Red Hat Enterprise Linux) PHP/5.4.16
X-Powered-By: PHP/5.4.16
X-Content-Type-Options: nosniff
Content-language: en
Vary: Accept-Encoding,Cookie
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: private, must-revalidate, max-age=0
Last-Modified: Tue, 26 Jun 2018 17:55:24 GMT
Content-Type: text/html; charset=UTF-8
Set-Cookie: 48906e19a2223fa2bc425345094b372e=40e0efc628e368082e8df5f1d047af46; path=/; HttpOnly

Manually browsing to each url confirms the Main Page is accessible.

Can you please confirm the pods remain up and that no significant changes to the network/sdn or openshift_node_groups configuration is taking place that would prevent the pods from running or communicating?

Comment 10 Zihan Tang 2018-06-27 01:59:51 UTC
Jason
After the upgrade, the APB pods are running and the SDN pod is running.
I'll build an env to reproduce it.

Comment 12 Zhang Cheng 2018-06-27 08:14:12 UTC
@John,

Referring to the current test result ("after upgrade, about 60% of mediawiki Main Pages cannot be accessed"), we don't think this bug should be moved out of the 3.10 plan.

Please double confirm. Thanks.

Comment 13 Zhang Cheng 2018-06-27 08:26:18 UTC
I'm changing "target release" to 3.10 since there is a 60% reproduction rate for a lost database after upgrade. Please correct me if I am mistaken or you have other concerns. Thanks.

Comment 14 Zhang Cheng 2018-06-27 08:42:18 UTC
This bug is a critical scenario: old service instances (databases) should still work after upgrade. Without a fix, customers will have a 60% chance of losing their database after upgrade.

Comment 15 Zihan Tang 2018-06-27 09:11:19 UTC
At present, I cannot 'rsh' into any of the APB pods to gather logs because of bug https://bugzilla.redhat.com/show_bug.cgi?id=1594341.

Comment 16 Jason Montleon 2018-06-27 14:57:09 UTC
You can figure out which node the containers are running on, use docker ps to look at the name and command to determine which container relates to which pod, and use docker exec to examine them even if you can't with oc exec, etc.
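
For example (the pod name is illustrative, taken from later comments; the oc/docker commands themselves are standard):

oc get pod rhscl-mariadb-10.2-prod-1-bwtt7 -o wide      # shows which node the pod is scheduled on
# then, on that node:
docker ps | grep mariadb                                # kubelet names containers k8s_<container>_<pod>_<namespace>_...
docker exec -it <container_id> ls -l /var/lib/mysql/data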

Comment 18 Jason Montleon 2018-06-27 14:58:29 UTC
This isn't going to be resolved by mid-day today for 3.10.

We need to get to the bottom of why the database data is missing, especially in the case of prod instances.

Can you deploy 3.9, provision your apb pairs normally, verify on the prod instances that the storage is mounted in the pod and data files exist, and take note of which pods are bound to which PVCs and which PVCs are bound to which PVs (oc get pvc --all-namespaces and oc get pv).

Then run oc get pvc --all-namespaces and oc get pv again after the upgrade to confirm nothing has flip-flopped or changed for some reason, and similarly, especially for the prod instances, note which ones aren't working, whether the PVCs are mounted, and whether the data is available.
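
One simple way to capture that state for comparison (the file names are arbitrary):

oc get pvc --all-namespaces -o wide > pvc-before.txt
oc get pv -o wide > pv-before.txt
# ... run the upgrade ...
oc get pvc --all-namespaces -o wide > pvc-after.txt
oc get pv -o wide > pv-after.txt
diff pvc-before.txt pvc-after.txt; diff pv-before.txt pv-after.txt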

I'll try to run through another upgrade with multiple prod instances and do the same and see if I can reproduce it today.

Comment 19 Jason Montleon 2018-06-27 16:50:57 UTC
I created 4 prod instances of each combination of postgres, mysql, and mariadb with mediawiki and each was available after upgrade.

Could this be something to do with the GCE storage or even the failure-domain.beta.kubernetes.io/region annotations? I'm just trying to pinpoint what is different about your deployments and ours at this point. These are the most obvious things to look at to me, considering the issue is related to storage.

I'm also on a single node setup and you're on a multi node setup, but I don't think there is an obvious issue there, because testing with curl I could communicate between pods running on different nodes, and I noted that in your environment some were working across nodes while others were not.

Comment 21 Zihan Tang 2018-06-28 09:51:23 UTC
I provisioned a mariadb with the prod plan in OCP 3.9.
The data does not seem to be written to the PV.

[root@qe-zitang-39-3me-1 ~]# oc get pod 
NAME                              READY     STATUS    RESTARTS   AGE
mediawiki123-2-rpknh              1/1       Running   0          50s
rhscl-mariadb-10.2-prod-1-bwtt7   1/1       Running   0          9m

[root@qe-zitang-39-3me-1 ~]# oc rsh rhscl-mariadb-10.2-prod-1-bwtt7
sh-4.2$  ls -l /var/lib/mysql/data/
total 106564
drwx------. 2 1000180000 root     4096 Jun 28 09:43 admin
-rw-rw----. 1 1000180000 root    16384 Jun 28 09:34 aria_log.00000001
-rw-rw----. 1 1000180000 root       52 Jun 28 09:34 aria_log_control
-rw-rw----. 1 1000180000 root     2799 Jun 28 09:34 ib_buffer_pool
-rw-rw----. 1 1000180000 root  8388608 Jun 28 09:44 ib_logfile0
-rw-rw----. 1 1000180000 root  8388608 Jun 28 09:34 ib_logfile1
-rw-rw----. 1 1000180000 root 79691776 Jun 28 09:44 ibdata1
-rw-rw----. 1 1000180000 root 12582912 Jun 28 09:34 ibtmp1
-rw-rw----. 1 1000180000 root        0 Jun 28 09:34 multi-master.info
drwx------. 2 1000180000 root     4096 Jun 28 09:34 mysql
-rw-r--r--. 1 1000180000 root       14 Jun 28 09:34 mysql_upgrade_info
drwx------. 2 1000180000 root       20 Jun 28 09:34 performance_schema
-rw-rw----. 1 1000180000 root        2 Jun 28 09:34 rhscl-mariadb-10.pid
-rw-rw----. 1 1000180000 root    24576 Jun 28 09:34 tc.log
drwx------. 2 1000180000 root        6 Jun 28 09:34 test

sh-4.2$ mysql -uroot
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 9
Server version: 10.2.8-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| admin              |
| information_schema |
| mysql              |
| performance_schema |
| test               |
+--------------------+
5 rows in set (0.00 sec)

MariaDB [(none)]> use admin;
Database changed
MariaDB [admin]> show tables;
Empty set (0.00 sec)

MariaDB [admin]> show tables;
+--------------------+
| Tables_in_admin    |
+--------------------+
| archive            |
| category           |
| categorylinks      |
| change_tag         |
| externallinks      |
| filearchive        |


Check the PV:
[root@qe-zitang-39-3me-1 ~]# oc get pvc
NAME                      STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
mediawiki123-pvc          Bound     pvc-ecf61a8a-7ab6-11e8-a2d9-fa163e5a535e   1Gi        RWO            standard       6m
rhscl-mariadb-10.2-prod   Bound     pvc-5614cbfd-7ab6-11e8-a2d9-fa163e5a535e   10Gi       RWO            standard       10m

[root@qe-zitang-39-3nrr-1 ~]# mount | grep pvc-bf02e4eb-7ab0-11e8-8902-fa163ec91f46
/dev/vdb on /var/lib/origin/openshift.local.volumes/pods/d575bf32-7ab0-11e8-a2d9-fa163e5a535e/volumes/kubernetes.io~cinder/pvc-bf02e4eb-7ab0-11e8-8902-fa163ec91f46 type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
[root@qe-zitang-39-3nrr-1 ~]# cd /var/lib/origin/openshift.local.volumes/pods/d575bf32-7ab0-11e8-a2d9-fa163e5a535e/volumes/kubernetes.io~cinder/pvc-bf02e4eb-7ab0-11e8-8902-fa163ec91f46

[root@qe-zitang-39-3nrr-2 pvc-5614cbfd-7ab6-11e8-a2d9-fa163e5a535e]# ls -al
total 8
drwxrwsr-x. 3 root       1000180000  79 Jun 28 05:44 .
drwxr-x---. 3 root       root        54 Jun 28 05:34 ..
-rw-------. 1 1000180000 1000180000 104 Jun 28 05:44 .bash_history
drwxr-sr-x. 2 root       1000180000   6 Jun 28 05:34 data
-rw-------. 1 1000180000 1000180000  62 Jun 28 05:43 .mysql_history
srwxrwxrwx. 1 1000180000 1000180000   0 Jun 28 05:34 mysql.sock

Comment 22 Zihan Tang 2018-06-28 09:55:25 UTC
Adding more info to comment 21:
there are no data files in the PV.
[root@qe-zitang-39-3nrr-2 pvc-5614cbfd-7ab6-11e8-a2d9-fa163e5a535e]# ls -l data
total 0

I tried mysql and postgresql; the data in the pod is in sync with the data in the PV.

Attaching the v3.9 env:
host-8-241-78.host.centralci.eng.rdu2.redhat.com

Comment 23 Jason Montleon 2018-06-28 12:40:03 UTC
If the database name is admin you need to look in the admin dir for the data files. They are there:

oc exec -it rhscl-mariadb-10.2-prod-1-bwtt7 -- ls -l /var/lib/mysql/data/admin
total 15972
-rw-rw----. 1 1000180000 root    3618 Jun 28 09:43 archive.frm
-rw-rw----. 1 1000180000 root  147456 Jun 28 09:43 archive.ibd
-rw-rw----. 1 1000180000 root    2281 Jun 28 09:43 category.frm
-rw-rw----. 1 1000180000 root  131072 Jun 28 09:43 category.ibd
-rw-rw----. 1 1000180000 root    3364 Jun 28 09:43 categorylinks.frm

In the above you only did:
[root@qe-zitang-39-3me-1 ~]# oc rsh rhscl-mariadb-10.2-prod-1-bwtt7
sh-4.2$  ls -l /var/lib/mysql/data/
total 106564
drwx------. 2 1000180000 root     4096 Jun 28 09:43 admin
-rw-rw----. 1 1000180000 root    16384 Jun 28 09:34 aria_log.00000001

Comment 24 Zihan Tang 2018-06-29 02:16:17 UTC
Yes, there is data in the 'admin' dir in the pod,
but in the related PV the 'data' dir is empty.

Comment 25 Jason Montleon 2018-06-29 12:55:12 UTC
Were you able to confirm the pvc was mounted in the pod where this happened? It looks like the host is not up and I don't see output from mount/df to confirm that's the case.

If data is not being written to mounted PVCs, this needs to go to someone familiar with storage.

Comment 26 Zihan Tang 2018-07-02 03:27:53 UTC
In the 3.9 HA env:
mariadb prod pod dc:
[root@qe-zitang-39-3me-1 ~]# oc get dc -o yaml
apiVersion: v1
items:
- apiVersion: apps.openshift.io/v1
  kind: DeploymentConfig
  metadata:
    creationTimestamp: 2018-07-02T03:12:03Z
    generation: 1
    labels:
      app: rhscl-mariadb-apb
      service: rhscl-mariadb-10.2-prod
    name: rhscl-mariadb-10.2-prod
    namespace: maria
    resourceVersion: "15754"
    selfLink: /apis/apps.openshift.io/v1/namespaces/maria/deploymentconfigs/rhscl-mariadb-10.2-prod
    uid: b161cdc6-7da5-11e8-ba69-fa163e1ba47a
  spec:
    replicas: 1
    selector:
      app: rhscl-mariadb-apb
      service: rhscl-mariadb-10.2-prod
    strategy:
      activeDeadlineSeconds: 21600
      resources: {}
      rollingParams:
        intervalSeconds: 1
        maxSurge: 25%
        maxUnavailable: 25%
        timeoutSeconds: 600
        updatePeriodSeconds: 1
      type: Rolling
    template:
      metadata:
        creationTimestamp: null
        labels:
          app: rhscl-mariadb-apb
          service: rhscl-mariadb-10.2-prod
      spec:
        containers:
        - env:
          - name: MYSQL_ROOT_PASSWORD
            value: dddd
          - name: MYSQL_USER
            value: admin
          - name: MYSQL_PASSWORD
            value: dddd
          - name: MYSQL_DATABASE
            value: admin
          image: registry.access.redhat.com/rhscl/mariadb-102-rhel7
          imagePullPolicy: IfNotPresent
          name: rhscl-mariadb
          ports:
          - containerPort: 3306
            protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /var/lib/mysql
            name: mariadb-data
          workingDir: /
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        terminationGracePeriodSeconds: 30
        volumes:
        - name: mariadb-data
          persistentVolumeClaim:
            claimName: rhscl-mariadb-10.2-prod
    test: false
    triggers:
    - type: ConfigChange
  status:
    availableReplicas: 1
    conditions:
    - lastTransitionTime: 2018-07-02T03:12:44Z
      lastUpdateTime: 2018-07-02T03:12:44Z
      message: Deployment config has minimum availability.
      status: "True"
      type: Available
    - lastTransitionTime: 2018-07-02T03:12:07Z
      lastUpdateTime: 2018-07-02T03:12:45Z
      message: replication controller "rhscl-mariadb-10.2-prod-1" successfully rolled
        out
      reason: NewReplicationControllerAvailable
      status: "True"
      type: Progressing
    details:
      causes:
      - type: ConfigChange
      message: config change
    latestVersion: 1
    observedGeneration: 1
    readyReplicas: 1
    replicas: 1
    unavailableReplicas: 0
    updatedReplicas: 1
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

in the pod: 
[root@qe-zitang-39-3me-1 ~]# oc rsh rhscl-mariadb-10.2-prod-1-wdvlm
sh-4.2$ df 
Filesystem                                                                                         1K-blocks    Used Available Use% Mounted on
/dev/mapper/docker-253:0-41943105-1073ed4afd7683dd0e07e9c7ec1c6bf67bde79cfa74526678694e8d3528ee877  10475520  476708   9998812   5% /
tmpfs                                                                                                8133200       0   8133200   0% /dev
tmpfs                                                                                                8133200       0   8133200   0% /sys/fs/cgroup
/dev/mapper/rhel-root                                                                               18864128 1819444  17044684  10% /etc/hosts
shm                                                                                                    65536       0     65536   0% /dev/shm
/dev/vdb                                                                                            10475520   32944  10442576   1% /var/lib/mysql
tmpfs                                                                                                8133200      16   8133184   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                                                                                                8133200       0   8133200   0% /proc/scsi
tmpfs                                                                                                8133200       0   8133200   0% /sys/firmware

The dir '/var/lib/mysql' is mounted, but connecting to the PV on the node, the data dir is empty.

[root@qe-zitang-39-3nrr-1 ~]# mount | grep pvc-9d5f6d7e-7da5-11e8-ba69-fa163e1ba47a
/dev/vdb on /var/lib/origin/openshift.local.volumes/pods/b3a6a80c-7da5-11e8-b9eb-fa163e0c95cf/volumes/kubernetes.io~cinder/pvc-9d5f6d7e-7da5-11e8-ba69-fa163e1ba47a type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
[root@qe-zitang-39-3nrr-1 ~]# cd /var/lib/origin/openshift.local.volumes/pods/b3a6a80c-7da5-11e8-b9eb-fa163e0c95cf/volumes/kubernetes.io~cinder/pvc-9d5f6d7e-7da5-11e8-ba69-fa163e1ba47a
[root@qe-zitang-39-3nrr-1 pvc-9d5f6d7e-7da5-11e8-ba69-fa163e1ba47a]# ls
data  mysql.sock
[root@qe-zitang-39-3nrr-1 pvc-9d5f6d7e-7da5-11e8-ba69-fa163e1ba47a]# cd data/
[root@qe-zitang-39-3nrr-1 data]# ll
total 0

I tried the latest mariadb-apb in an OCP 3.10 env; it looks OK, the data is in sync with the PV.

[root@ip-172-18-1-156 ~]# mount | grep pvc-e0a71876-7da3-11e8-a839-0ef5ee5b5cf8
/dev/xvdbg on /var/lib/origin/openshift.local.volumes/pods/f8139aaf-7da3-11e8-a839-0ef5ee5b5cf8/volumes/kubernetes.io~aws-ebs/pvc-e0a71876-7da3-11e8-a839-0ef5ee5b5cf8 type ext4 (rw,relatime,seclabel,data=ordered)
[root@ip-172-18-1-156 ~]# cd /var/lib/origin/openshift.local.volumes/pods/f8139aaf-7da3-11e8-a839-0ef5ee5b5cf8/volumes/kubernetes.io~aws-ebs/pvc-e0a71876-7da3-11e8-a839-0ef5ee5b5cf8

[root@ip-172-18-1-156 pvc-e0a71876-7da3-11e8-a839-0ef5ee5b5cf8]# ls
data  lost+found  mysql.sock
[root@ip-172-18-1-156 pvc-e0a71876-7da3-11e8-a839-0ef5ee5b5cf8]# cd data/
[root@ip-172-18-1-156 data]# ls
admin              aria_log_control  ibdata1      ib_logfile1  mariadb-5d3b49e9-7da3-11e8-a8cb-0a580a800005-1-mq6gk.pid  mysql               performance_schema  test
aria_log.00000001  ib_buffer_pool    ib_logfile0  ibtmp1       multi-master.info                                         mysql_upgrade_info  tc.log
[root@ip-172-18-1-156 data]# cd admin/
[root@ip-172-18-1-156 admin]# ll
total 4
-rw-rw----. 1 1000140000 1000140000 65 Jul  1 23:00 db.opt

Comment 28 Zihan Tang 2018-07-10 06:03:05 UTC
I tried in a non-HA env; the data in the mariadb pod (prod plan) is lost after upgrade, so I updated the title.

Comment 29 Zihan Tang 2018-07-10 06:42:07 UTC
Jason, 
do you think it will be fixed in 3.10.0, since we reproduced it in a non-HA env?

Comment 30 Jason Montleon 2018-07-10 13:06:50 UTC
I haven't been able to reproduce it to understand what's going on. I don't know why data wouldn't be written to the underlying storage. If data isn't being written to the volume, we need to get to the bottom of why.

You said in your last comment, "the dir '/var/lib/mysql' is mounted, but connecting to the PV on the node, the data dir is empty."

Are we mounting persistent storage in the wrong place? If it's something like this why is it working in some environments? 

Is it something to do with the GCE storage you're using? I don't have access to a GCE account to even try to reproduce this. I asked whether you have tried without the failure region annotations I noted and whether it could be something to do with that feature, but haven't seen a response. I simply do not know enough about this feature or GCE storage to know if this could be the case.

In other words, please help me to understand and get to the bottom of why this is happening in your environment so we can get it to the right person, whether that's us, or someone with the appropriate knowledge about the storage backend.

Comment 31 Zihan Tang 2018-07-11 09:59:19 UTC
It seems to be a runtime issue.
As I commented in comment 26, when mariadb is provisioned in v3.9 the data in the pod is not synced to the PV, so after the upgrade the data is lost.

I tried in a cri-o env 3 times (in GCE and AWS); the data is synced to the PV.
[root@ip-172-18-13-17 ~]# cd /var/lib/origin/openshift.local.volumes/pods/d744872f-84ee-11e8-b795-0e84b937625e/volumes/kubernetes.io~aws-ebs/pvc-bf88f34d-84ee-11e8-b795-0e84b937625e
[root@ip-172-18-13-17 pvc-bf88f34d-84ee-11e8-b795-0e84b937625e]# ls
data  lost+found  mysql.sock
[root@ip-172-18-13-17 pvc-bf88f34d-84ee-11e8-b795-0e84b937625e]# ls -l
total 20
drwx--S---. 6 1000130000 1000130000  4096 Jul 11 05:44 data
drwxrwS---. 2 root       1000130000 16384 Jul 11 05:43 lost+found
srwxrwxrwx. 1 1000130000 1000130000     0 Jul 11 05:44 mysql.sock
[root@ip-172-18-13-17 pvc-bf88f34d-84ee-11e8-b795-0e84b937625e]# cd data
[root@ip-172-18-13-17 data]# ls
admin              aria_log_control  ibdata1      ib_logfile1  multi-master.info  mysql_upgrade_info  rhscl-mariadb-10.pid  test
aria_log.00000001  ib_buffer_pool    ib_logfile0  ibtmp1       mysql              performance_schema  tc.log

It seems to be a docker runtime issue.
Docker version: docker-1.13.1-58

I found a workaround for the docker runtime.
If the dc's volume mount
        volumeMounts:
        - mountPath: /var/lib/mysql
          name: mariadb-data
is patched so that mountPath is '/var/lib/mysql/data', the data in the mariadb pod is synced to the PV (example patch below).
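
For reference, that patch could be applied to an existing instance roughly like this (dc name and volumeMount index taken from the dc in comment 26; this only illustrates the workaround and is not part of the APB, and the rollout it triggers will not recover data that never reached the PV):

oc patch dc/rhscl-mariadb-10.2-prod --type=json -p \
  '[{"op":"replace","path":"/spec/template/spec/containers/0/volumeMounts/0/mountPath","value":"/var/lib/mysql/data"}]'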

# mount | grep pvc-fcad39f6-84cb-11e8-a9bd-fa163ea66cff
/dev/vdg on /var/lib/origin/openshift.local.volumes/pods/138d4c32-84cc-11e8-a9bd-fa163ea66cff/volumes/kubernetes.io~cinder/pvc-fcad39f6-84cb-11e8-a9bd-fa163ea66cff type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/vdg on /var/lib/origin/openshift.local.volumes/pods/7901d55c-84d5-11e8-a9bd-fa163ea66cff/volumes/kubernetes.io~cinder/pvc-fcad39f6-84cb-11e8-a9bd-fa163ea66cff type xfs (rw,relatime,seclabel,attr2,inode64,noquota)

[root@qe-zitang-39-2node-registry-router-1 data]# cd /var/lib/origin/openshift.local.volumes/pods/7901d55c-84d5-11e8-a9bd-fa163ea66cff/volumes/kubernetes.io~cinder/pvc-fcad39f6-84cb-11e8-a9bd-fa163ea66cff

[root@qe-zitang-39-2node-registry-router-1 pvc-fcad39f6-84cb-11e8-a9bd-fa163ea66cff]# ls
admin  aria_log.00000001  aria_log_control  data  ibdata1  ib_logfile0  ib_logfile1  multi-master.info  mysql  mysql_upgrade_info  performance_schema  rhscl-mariadb-10.pid  tc.log  test

Comment 32 Jason Montleon 2018-07-11 19:26:43 UTC
Why would data in a subdirectory of the mount not be getting written to the pvc?

In your output above:
[root@ip-172-18-13-17 data]# ls
admin              aria_log_control  ibdata1      ib_logfile1  multi-master.info  mysql_upgrade_info  rhscl-mariadb-10.pid  test
aria_log.00000001  ib_buffer_pool    ib_logfile0  ibtmp1       mysql              performance_schema  tc.log


But on the PVC it looks like you have a data directory:
[root@qe-zitang-39-2node-registry-router-1 pvc-fcad39f6-84cb-11e8-a9bd-fa163ea66cff]# ls
admin  aria_log.00000001  aria_log_control  data  ibdata1  ib_logfile0  ib_logfile1  multi-master.info  mysql  mysql_upgrade_info  performance_schema  rhscl-mariadb-10.pid  tc.log  test

Comment 33 Zihan Tang 2018-07-12 05:41:12 UTC
(In reply to Jason Montleon from comment #32)
> Why would data in a subdirectory of the mount not be getting written to the
> pvc?
It seems to be a runtime issue; if I use cri-o, the data is written to the PV even in a subdirectory (I tried in 3 envs in comment 31).

> In your output above:
> [root@ip-172-18-13-17 data]# ls
> admin              aria_log_control  ibdata1      ib_logfile1 
> multi-master.info  mysql_upgrade_info  rhscl-mariadb-10.pid  test
> aria_log.00000001  ib_buffer_pool    ib_logfile0  ibtmp1       mysql        
> performance_schema  tc.log
this is in cri-o env. 

> But on the pvc it looks like you have a data directory?:
> [root@qe-zitang-39-2node-registry-router-1
> pvc-fcad39f6-84cb-11e8-a9bd-fa163ea66cff]# ls
> admin  aria_log.00000001  aria_log_control  data  ibdata1  ib_logfile0 
> ib_logfile1  multi-master.info  mysql  mysql_upgrade_info 
> performance_schema  rhscl-mariadb-10.pid  tc.log  test
The 'data' dir was created by the old dc (mountPath: /var/lib/mysql); this output is from after I patched the dc 'mountPath' to '/var/lib/mysql/data'.

Comment 35 Zihan Tang 2018-07-13 07:27:10 UTC
I agree with moving it to 3.11 to find out the real reason. If the data is synced to the PV before the upgrade, the upgrade will succeed.

Comment 36 Jason Montleon 2018-08-03 14:24:01 UTC
Is this still a problem in 3.11?

Can you please try with at least one different type of backing storage (NFS, Host Path, etc.) to see if you can reproduce this issue across storage types?
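
If it helps, a minimal hostPath PV for such a test might look like the following (the name, path, and storageClassName are arbitrary and would need to match whatever the APB's PVC requests):

oc create -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mariadb-test-hostpath
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: standard
  hostPath:
    path: /var/lib/mariadb-test-hostpath
EOF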

When it comes down to it, the mariadb container is a standard application container with a standard PVC. Files in a sub-directory of the PVC mount not making it onto the PV does not make much sense, and I haven't been able to reproduce it.

We need to try to get to the bottom of whether this only happens with one storage backend, whether it's a misconfiguration of the PVC we're creating, etc. so we can correct the issue or get the bug to someone with the knowledge to further debug it and fix it.

Comment 37 Zihan Tang 2018-08-06 02:44:06 UTC
This is blocked by bug https://bugzilla.redhat.com/show_bug.cgi?id=1611939;
I cannot provision with the prod plan. I'll try after that bug is fixed. Thanks.

Comment 38 Zihan Tang 2018-08-09 08:28:24 UTC
Verification failed.
In v3.11 it is the same as in v3.10:
1. In a docker (1.13.1) env, mariadb data is not synced to the PV.
2. In a cri-o env, mariadb data is synced to the PV.

Comment 39 Jason Montleon 2018-08-09 12:50:54 UTC
Have you tried any other storage options, like NFS or HostPath for the PVs, with docker? As mentioned, I'm looking for your help in trying to narrow down what in your environment might be causing this, since we can't reproduce it.

Comment 41 Jason Montleon 2018-08-10 13:00:48 UTC
I wonder if the difference in behavior between the two is the Volumes set in the Dockerfile:

$ docker inspect registry.access.redhat.com/rhscl/mariadb-102-rhel7:latest | grep \"Volumes\": -A2
            "Volumes": {
                "/var/lib/mysql/data": {}
            },
--
            "Volumes": {
                "/var/lib/mysql/data": {}
            },

vs.
$ docker inspect registry.access.redhat.com/rhscl/mysql-57-rhel7:latest | grep \"Volumes\": -A2
            "Volumes": null,
            "WorkingDir": "/opt/app-root/src",
            "Entrypoint": [
--
            "Volumes": null,
            "WorkingDir": "/opt/app-root/src",
            "Entrypoint": [

From that it looks like the expected mount point is /var/lib/mysql/data, as you suggested trying. I've submitted a PR to do this:

https://github.com/ansibleplaybookbundle/mariadb-apb/pull/42
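
If that is the cause, one way to check it on an affected node (docker runtime; the container ID is a placeholder) would be to inspect the running container's mounts. If the theory holds, there should be the PVC bind mount at /var/lib/mysql plus an anonymous docker volume mounted at /var/lib/mysql/data (pointing into /var/lib/docker/volumes/...), which would shadow the data subdirectory of the PV:

docker inspect --format '{{ json .Mounts }}' <mariadb_container_id>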

Comment 42 David Zager 2018-08-10 20:45:40 UTC
Errata https://errata.devel.redhat.com/brew/save_builds/33505 updated:

openshift-enterprise-mariadb-apb-v3.11.0-0.13.0.1
openshift-enterprise-mediawiki-apb-v3.11.0-0.13.0.1
openshift-enterprise-postgresql-apb-v3.11.0-0.13.0.1
openshift-enterprise-mysql-apb-v3.11.0-0.13.0.1
openshift-enterprise-asb-container-v3.11.0-0.13.0.1

Comment 43 Zihan Tang 2018-08-13 08:09:48 UTC
Verified with mariadb-apb:v3.11.0-0.14.0.0
In both docker and cri-o envs, mariadb-apb provisioning succeeds and the data is synced to the PV.

However, this bug was reported for the v3.9 to 3.10 upgrade; if the data is synced to the PV after provisioning, like the other db-apbs, the 'data lost' issue will not exist after the upgrade.

This fix is in the v3.11 apb.

I haven't marked it as 'Verified' yet because the title describes this as a '3.9 upgrade to 3.10' issue.

Do we need to clone this bug for the 3.9 and 3.10 mariadb-apb, to avoid the issue when upgrading 3.9->3.10 and 3.10->3.11?

Comment 44 Jason Montleon 2018-08-15 12:36:27 UTC
Yes, please copy it so we can fix those versions of the apb as well. Thanks!

Comment 45 Zihan Tang 2018-08-16 08:08:39 UTC
Based on comment 43, marking this as verified.

Cloned 2 bugs, for 3.10 and 3.9:
https://bugzilla.redhat.com/show_bug.cgi?id=1617939
https://bugzilla.redhat.com/show_bug.cgi?id=1617937

Comment 46 Michael Burke 2018-09-13 20:22:06 UTC
Jason --

If this issue needs to be included in the 3.11 release notes, please enter Doc text above. I am not clear on the fix.

Michael

Comment 47 Jason Montleon 2018-09-13 20:25:00 UTC
I don't think we need a doc update.

We changed the mount point from /var/lib/mysql to /var/lib/mysql/data
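
For anyone checking a provisioned instance, the effective mount path can be read back from the dc (dc name as in comment 26; the jsonpath expression is just an example):

oc get dc rhscl-mariadb-10.2-prod \
  -o jsonpath='{.spec.template.spec.containers[0].volumeMounts[0].mountPath}'
# should print /var/lib/mysql/data with the fixed APB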

Comment 49 errata-xmlrpc 2018-10-11 07:20:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652

