Description of problem:
In a 3.9 HA env, provision mediawiki + *-sql-apb, then upgrade to 3.10. mediawiki no longer works; it shows 'A database query error has occurred. This may indicate a bug in the software.'

Version-Release number of the following components:
openshift-ansible-3.10.1-1

How reproducible:
always

Steps to Reproduce:
1. Install an OCP 3.9 HA env.
2. Provision mediawiki and postgresql-apb, create a binding, and add the secret to mediawiki; visiting the mediawiki website shows 'successfully installed'.
3. Upgrade to 3.10; the upgrade job should succeed.
4. Check mediawiki and postgresql.

Actual results:
The mediawiki web site cannot be visited; it shows 'A database query error has occurred. This may indicate a bug in the software.'

Before upgrade:

[root@ip-172-18-0-247 ~]# oc logs -f postgresql-9.6-dev-1-ljvrb
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /var/lib/pgsql/data/userdata ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
Success. You can now start the database server using:
    pg_ctl -D /var/lib/pgsql/data/userdata -l logfile start
WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
waiting for server to start....LOG: redirecting log output to logging collector process
HINT: Future log output will appear in directory "pg_log".
 done
server started
=> sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
ALTER ROLE
waiting for server to shut down.... done
server stopped
Starting server...
LOG: redirecting log output to logging collector process
HINT: Future log output will appear in directory "pg_log".

------------
$ tail -f /var/lib/pgsql/data/userdata/pg_log/postgresql-Tue.log
LOG: received fast shutdown request
LOG: aborting any active transactions
LOG: autovacuum launcher shutting down
LOG: shutting down
LOG: database system is shut down
LOG: database system was shut down at 2018-06-19 08:41:07 UTC
LOG: MultiXact member wraparound protections are now enabled
LOG: autovacuum launcher started
LOG: database system is ready to accept connections
WARNING: there is already a transaction in progress

======== after upgrade

# oc logs -f postgresql-9.6-dev-1-n6wpb
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /var/lib/pgsql/data/userdata ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
Success. You can now start the database server using:
    pg_ctl -D /var/lib/pgsql/data/userdata -l logfile start
WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
waiting for server to start....LOG: redirecting log output to logging collector process
HINT: Future log output will appear in directory "pg_log".
 done
server started
=> sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
ALTER ROLE
waiting for server to shut down.... done
server stopped
Starting server...
LOG: redirecting log output to logging collector process
HINT: Future log output will appear in directory "pg_log".

-----------------
sh-4.2$ tail -f /var/lib/pgsql/data/userdata/pg_log/postgresql-Tue.log
ERROR: relation "msg_resource" does not exist at character 44
STATEMENT: DELETE /* MessageBlobStore::clear */ FROM "msg_resource"
ERROR: relation "msg_resource_links" does not exist at character 44
STATEMENT: DELETE /* MessageBlobStore::clear */ FROM "msg_resource_links"
ERROR: relation "page" does not exist at character 219
STATEMENT: SELECT /* WikiPage::pageData */ page_id,page_namespace,page_title,page_restrictions,page_counter,page_is_redirect,page_is_new,page_random,page_touched,page_links_updated,page_latest,page_len,page_content_model FROM "page" WHERE page_namespace = '0' AND page_title = 'Main_Page' LIMIT 1
ERROR: relation "job" does not exist at character 87
STATEMENT: SELECT /* JobQueueDB::doGetSiblingQueuesWithJobs 10.2.12.1 */ DISTINCT job_cmd FROM "job" WHERE job_cmd IN ('refreshLinks','refreshLinks2','htmlCacheUpdate','sendMail','enotifNotify','fixDoubleRedirect','uploadFromUrl','AssembleUploadChunks','PublishStashedFile','null')
ERROR: relation "page" does not exist at character 219
STATEMENT: SELECT /* WikiPage::pageData */ page_id,page_namespace,page_title,page_restrictions,page_counter,page_is_redirect,page_is_new,page_random,page_touched,page_links_updated,page_latest,page_len,page_content_model FROM "page" WHERE page_namespace = '0' AND page_title = 'Main_Page' LIMIT 1

Expected results:
*sql-apb pods work as expected; mediawiki can be visited.

Additional info:
1. In a non-HA cluster, these pods work as expected after upgrade.
2. If other *sql-apbs are provisioned, mediawiki also cannot be visited.
I set the target release to 3.10 since this is a critical scenario.
@John, Please assign this bug and review whether it can be fixed in 3.10. I think the target release should be 3.10 since this is a critical scenario. Thanks.
@David & Dylan, I added you to the cc list. Could you help push this bug forward in case John is absent? Thanks.
To me it looks like the database data is gone. This looks like a pod with ephemeral storage: postgresql-9.6-dev-1-ljvrb. And it looks as though it is getting shut down at some point during the upgrade:

LOG: received fast shutdown request
LOG: aborting any active transactions
LOG: autovacuum launcher shutting down
LOG: shutting down

If so I would expect the database data to disappear, and of course things wouldn't work if the data is gone. I suppose we could do something in the mediawiki entrypoint script to test whether the database is gone and, if so, recreate it, but any data the user created between the setup and the shutdown would still be lost. Any chance you can confirm this is what we're seeing by using a plan with persistent storage?
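For reference, a minimal way to check both points (a sketch; I'm assuming the dc is named postgresql-9.6-dev to match the pod name above — adjust for your environment):

# does the dc mount a PVC or only ephemeral storage?
oc get dc postgresql-9.6-dev -o jsonpath='{.spec.template.spec.volumes}'

# are the postgres data files still present in the running pod?
oc rsh postgresql-9.6-dev-1-ljvrb ls -l /var/lib/pgsql/data/userdata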
Jason,

> I suppose we could do something in the mediawiki entrypoint script to test if the database is gone and if so recreate it,

I agree; I didn't find any related logs in the mediawiki pod.

> but of course any data the user created in between the setup and shutdown would of course be gone.

It's not expected that the data is gone after upgrade. I tried a non-HA upgrade: the data was not gone, and mediawiki could get data from the db.

> Any chance you can confirm this is what we're seeing by using a plan with persistent storage?

I tried with persistent storage and mysql-apb and mariadb-apb. After upgrade, mediawiki could not get data from any of the dbs, with the same error "A database query error has occurred. This may indicate a bug in the software.", but I didn't find any important logs in the mysql and mariadb pods.
What is the state of the database pods after the upgrade? Are they not running? I don't see how data on persistent storage can be gone after upgrading the cluster.
I'm unable to reproduce due to https://bugzilla.redhat.com/show_bug.cgi?id=1591053. I am hitting the error in comment 42 100% of the time now if I provision APBs prior to upgrading.
I just managed to get through an upgrade. Prior I had:

[root@192 ~]# oc get pods --all-namespaces
NAMESPACE                           NAME                          READY  STATUS   RESTARTS  AGE
default                             docker-registry-1-z45kg       1/1    Running  0         13m
default                             registry-console-1-6qrpn      1/1    Running  0         12m
default                             router-1-spd57                1/1    Running  0         13m
ephemeral                           mediawiki123-2-vqwsz          1/1    Running  0         3m
ephemeral                           mysql-5.7-dev-1-vfs8b         1/1    Running  0         6m
kube-service-catalog                apiserver-qqtdb               1/1    Running  0         12m
kube-service-catalog                controller-manager-8tlng     1/1    Running  1         11m
openshift-ansible-service-broker    asb-1-wpzbv                   1/1    Running  2         11m
openshift-ansible-service-broker    asb-etcd-1-t2n59              1/1    Running  0         11m
openshift-template-service-broker   apiserver-6qgbf               1/1    Running  0         11m
openshift-web-console               webconsole-75b5bb9587-tsmf8   1/1    Running  0         12m
persistent                          mediawiki123-2-c88vx          1/1    Running  0         3m
persistent                          mysql-5.7-prod-1-sq9mf        1/1    Running  0         5m

[root@192 ~]# curl -I http://mediawiki123-persistent.apps.192.168.121.107.nip.io/index.php/Main_Page
HTTP/1.1 200 OK
Date: Tue, 26 Jun 2018 17:04:16 GMT
Server: Apache/2.4.6 (Red Hat Enterprise Linux) PHP/5.4.16
X-Powered-By: PHP/5.4.16
X-Content-Type-Options: nosniff
Content-language: en
Vary: Accept-Encoding,Cookie
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: private, must-revalidate, max-age=0
Last-Modified: Tue, 26 Jun 2018 17:01:15 GMT
Content-Type: text/html; charset=UTF-8
Set-Cookie: a3400a357d077c52f4721df29f8d2d53=d6f2eddf5a8c0758dde344f27d41f83a; path=/; HttpOnly

[root@192 ~]# curl -I http://mediawiki123-ephemeral.apps.192.168.121.107.nip.io/index.php/Main_Page
HTTP/1.1 200 OK
Date: Tue, 26 Jun 2018 17:04:23 GMT
Server: Apache/2.4.6 (Red Hat Enterprise Linux) PHP/5.4.16
X-Powered-By: PHP/5.4.16
X-Content-Type-Options: nosniff
Content-language: en
Vary: Accept-Encoding,Cookie
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: private, must-revalidate, max-age=0
Last-Modified: Tue, 26 Jun 2018 17:00:47 GMT
Content-Type: text/html; charset=UTF-8
Set-Cookie: 48906e19a2223fa2bc425345094b372e=1013d5997c816dc2e0fda3052aa655a5; path=/; HttpOnly

After upgrade:

[root@192 ~]# oc get pods --all-namespaces
NAMESPACE                           NAME                                        READY  STATUS     RESTARTS  AGE
default                             docker-registry-2-gmprn                     1/1    Running    0         10m
default                             registry-console-2-qt5rd                    1/1    Running    0         10m
default                             router-2-879bh                              1/1    Running    0         10m
ephemeral                           mediawiki123-2-vqwsz                        1/1    Running    3         1h
ephemeral                           mysql-5.7-dev-1-vfs8b                       1/1    Running    6         1h
kube-service-catalog                apiserver-k9hxw                             1/1    Running    0         8m
kube-service-catalog                controller-manager-5c8xd                    1/1    Running    0         8m
kube-system                         master-api-192.168.121.107.nip.io           1/1    Running    1         11m
kube-system                         master-controllers-192.168.121.107.nip.io   1/1    Running    1         11m
kube-system                         master-etcd-192.168.121.107.nip.io          1/1    Running    0         1h
openshift-ansible-service-broker    asb-2-nf5qq                                 1/1    Running    0         7m
openshift-ansible-service-broker    asb-etcd-migration-pjskt                    0/1    Completed  0         8m
openshift-node                      sync-mrg7d                                  1/1    Running    0         28m
openshift-sdn                       ovs-9xvfw                                   1/1    Running    0         28m
openshift-sdn                       sdn-2kvgj                                   1/1    Running    1         28m
openshift-template-service-broker   apiserver-cqj7w                             1/1    Running    0         7m
openshift-web-console               webconsole-7b88c47974-8s5mp                 1/1    Running    0         11m
persistent                          mediawiki123-2-c88vx                        1/1    Running    3         1h
persistent                          mysql-5.7-prod-1-sq9mf                      1/1    Running    5         1h

[root@192 ~]# curl -I http://mediawiki123-persistent.apps.192.168.121.107.nip.io/index.php/Main_Page
HTTP/1.1 200 OK
Date: Tue, 26 Jun 2018 18:19:49 GMT
Server: Apache/2.4.6 (Red Hat Enterprise Linux) PHP/5.4.16
X-Powered-By: PHP/5.4.16
X-Content-Type-Options: nosniff
Content-language: en
Vary: Accept-Encoding,Cookie
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: private, must-revalidate, max-age=0
Last-Modified: Tue, 26 Jun 2018 17:55:28 GMT
Content-Type: text/html; charset=UTF-8
Set-Cookie: a3400a357d077c52f4721df29f8d2d53=eb274661ba3b6906687d3971f68c5f27; path=/; HttpOnly

[root@192 ~]# curl -I http://mediawiki123-ephemeral.apps.192.168.121.107.nip.io/index.php/Main_Page
HTTP/1.1 200 OK
Date: Tue, 26 Jun 2018 18:19:58 GMT
Server: Apache/2.4.6 (Red Hat Enterprise Linux) PHP/5.4.16
X-Powered-By: PHP/5.4.16
X-Content-Type-Options: nosniff
Content-language: en
Vary: Accept-Encoding,Cookie
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: private, must-revalidate, max-age=0
Last-Modified: Tue, 26 Jun 2018 17:55:24 GMT
Content-Type: text/html; charset=UTF-8
Set-Cookie: 48906e19a2223fa2bc425345094b372e=40e0efc628e368082e8df5f1d047af46; path=/; HttpOnly

Manually browsing to each url confirms the Main Page is accessible. Can you please confirm the pods remain up and that no significant changes to the network/sdn or openshift_node_groups configuration are taking place that would prevent the pods from running or communicating?
Jason, after upgrade the apb pods are running and the sdn pod is running. I'll build an env to reproduce it.
@John, Given the current test result ("after upgrade, about 60% of mediawiki Main pages cannot be accessed"), we don't think this bug should be moved out of the 3.10 plan. Please double-confirm. Thanks.
I'm changing "target release" to 3.10 since it have 60% reproduce rate for lost database after upgrade. Please correct me if I have mistake or you have other concern. Thanks.
This bug is a critical scenario: old service instances (databases) should still work after upgrade. Customers will have a 60% chance of losing their database connection after upgrade if this is not fixed.
At present, we cannot 'rsh' into the apb pods to gather logs because of bug https://bugzilla.redhat.com/show_bug.cgi?id=1594341.
You can figure out which node the containers are running on, use docker ps to look at the name and command to determine which container relates to which pod, and use docker exec to examine them even if you can't with oc exec, etc.
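For example (a sketch, run on the node hosting the pod; kubernetes container names embed the pod name, so grepping for it works — substitute your actual pod name):

# find the container backing the pod
docker ps --format '{{.ID}} {{.Names}}' | grep <pod-name>

# then inspect it directly, bypassing oc exec/rsh
docker exec -it <container-id> ls -l /var/lib/mysql/data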
This isn't going to be resolved by mid-day today for 3.10. We need to get to the bottom of why the database data is missing, especially in the case of prod instances.

Can you deploy 3.9, provision your apb pairs normally, verify on the prod instances that the storage is mounted in the pod and data files exist, and take note of which pods are bound to which PVCs and which PVCs are bound to which PVs (oc get pvc --all-namespaces and oc get pv). Then run oc get pvc --all-namespaces and oc get pv again after the upgrade to confirm nothing has flip-flopped or changed for some reason, and similarly, especially in the case of prod instances, note which ones aren't working, that the pvcs are mounted, and that the data is available.

I'll try to run through another upgrade with multiple prod instances, do the same, and see if I can reproduce it today.
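Capturing the before/after state to files makes the comparison mechanical; a sketch (file names are arbitrary):

oc get pvc --all-namespaces -o wide > pvc-before.txt
oc get pv -o wide > pv-before.txt
# ... run the upgrade ...
oc get pvc --all-namespaces -o wide > pvc-after.txt
oc get pv -o wide > pv-after.txt
diff pvc-before.txt pvc-after.txt
diff pv-before.txt pv-after.txt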
I created 4 prod instances of each combination of postgres, mysql, and mariadb with mediawiki, and each was available after upgrade.

Could this be something to do with the GCE storage, or even the failure-domain.beta.kubernetes.io/region annotations? I'm just trying to pinpoint what is different about your deployments and ours at this point. These are the most obvious things to look at to me, considering the issue is related to storage.

I'm also on a single-node setup and you're on a multi-node setup, but I don't think there is an obvious issue there because, testing with curl, I could communicate between the pods running on different nodes, and I noted some were working across nodes while others were not in your environment.
I provisioned a mariadb prod plan in OCP 3.9; the data does not seem to be written to the pv.

[root@qe-zitang-39-3me-1 ~]# oc get pod
NAME                              READY  STATUS   RESTARTS  AGE
mediawiki123-2-rpknh              1/1    Running  0         50s
rhscl-mariadb-10.2-prod-1-bwtt7   1/1    Running  0         9m

[root@qe-zitang-39-3me-1 ~]# oc rsh rhscl-mariadb-10.2-prod-1-bwtt7
sh-4.2$ ls -l /var/lib/mysql/data/
total 106564
drwx------. 2 1000180000 root     4096 Jun 28 09:43 admin
-rw-rw----. 1 1000180000 root    16384 Jun 28 09:34 aria_log.00000001
-rw-rw----. 1 1000180000 root       52 Jun 28 09:34 aria_log_control
-rw-rw----. 1 1000180000 root     2799 Jun 28 09:34 ib_buffer_pool
-rw-rw----. 1 1000180000 root  8388608 Jun 28 09:44 ib_logfile0
-rw-rw----. 1 1000180000 root  8388608 Jun 28 09:34 ib_logfile1
-rw-rw----. 1 1000180000 root 79691776 Jun 28 09:44 ibdata1
-rw-rw----. 1 1000180000 root 12582912 Jun 28 09:34 ibtmp1
-rw-rw----. 1 1000180000 root        0 Jun 28 09:34 multi-master.info
drwx------. 2 1000180000 root     4096 Jun 28 09:34 mysql
-rw-r--r--. 1 1000180000 root       14 Jun 28 09:34 mysql_upgrade_info
drwx------. 2 1000180000 root       20 Jun 28 09:34 performance_schema
-rw-rw----. 1 1000180000 root        2 Jun 28 09:34 rhscl-mariadb-10.pid
-rw-rw----. 1 1000180000 root    24576 Jun 28 09:34 tc.log
drwx------. 2 1000180000 root        6 Jun 28 09:34 test

sh-4.2$ mysql -uroot
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 9
Server version: 10.2.8-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| admin              |
| information_schema |
| mysql              |
| performance_schema |
| test               |
+--------------------+
5 rows in set (0.00 sec)

MariaDB [(none)]> use admin;
Database changed
MariaDB [admin]> show tables;
Empty set (0.00 sec)

MariaDB [admin]> show tables;
+--------------------+
| Tables_in_admin    |
+--------------------+
| archive            |
| category           |
| categorylinks      |
| change_tag         |
| externallinks      |
| filearchive        |

Check in the pv:

[root@qe-zitang-39-3me-1 ~]# oc get pvc
NAME                      STATUS  VOLUME                                    CAPACITY  ACCESS MODES  STORAGECLASS  AGE
mediawiki123-pvc          Bound   pvc-ecf61a8a-7ab6-11e8-a2d9-fa163e5a535e  1Gi       RWO           standard      6m
rhscl-mariadb-10.2-prod   Bound   pvc-5614cbfd-7ab6-11e8-a2d9-fa163e5a535e  10Gi      RWO           standard      10m

[root@qe-zitang-39-3nrr-1 ~]# mount | grep pvc-bf02e4eb-7ab0-11e8-8902-fa163ec91f46
/dev/vdb on /var/lib/origin/openshift.local.volumes/pods/d575bf32-7ab0-11e8-a2d9-fa163e5a535e/volumes/kubernetes.io~cinder/pvc-bf02e4eb-7ab0-11e8-8902-fa163ec91f46 type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
[root@qe-zitang-39-3nrr-1 ~]# cd /var/lib/origin/openshift.local.volumes/pods/d575bf32-7ab0-11e8-a2d9-fa163e5a535e/volumes/kubernetes.io~cinder/pvc-bf02e4eb-7ab0-11e8-8902-fa163ec91f46
[root@qe-zitang-39-3nrr-2 pvc-5614cbfd-7ab6-11e8-a2d9-fa163e5a535e]# ls -al
total 8
drwxrwsr-x. 3 root       1000180000  79 Jun 28 05:44 .
drwxr-x---. 3 root       root        54 Jun 28 05:34 ..
-rw-------. 1 1000180000 1000180000 104 Jun 28 05:44 .bash_history
drwxr-sr-x. 2 root       1000180000   6 Jun 28 05:34 data
-rw-------. 1 1000180000 1000180000  62 Jun 28 05:43 .mysql_history
srwxrwxrwx. 1 1000180000 1000180000   0 Jun 28 05:34 mysql.sock
Adding more info to #comment 21: there are no data files in the pv.

[root@qe-zitang-39-3nrr-2 pvc-5614cbfd-7ab6-11e8-a2d9-fa163e5a535e]# ls -l data
total 0

I tried mysql and postgresql; for those, the data in the pod is in sync with the data in the pv.

Attaching the v3.9 env: host-8-241-78.host.centralci.eng.rdu2.redhat.com
If the database name is admin you need to look in the admin dir for the data files. They are there:

oc exec -it rhscl-mariadb-10.2-prod-1-bwtt7 -- ls -l /var/lib/mysql/data/admin
total 15972
-rw-rw----. 1 1000180000 root    3618 Jun 28 09:43 archive.frm
-rw-rw----. 1 1000180000 root  147456 Jun 28 09:43 archive.ibd
-rw-rw----. 1 1000180000 root    2281 Jun 28 09:43 category.frm
-rw-rw----. 1 1000180000 root  131072 Jun 28 09:43 category.ibd
-rw-rw----. 1 1000180000 root    3364 Jun 28 09:43 categorylinks.frm

In the above you only did:

[root@qe-zitang-39-3me-1 ~]# oc rsh rhscl-mariadb-10.2-prod-1-bwtt7
sh-4.2$ ls -l /var/lib/mysql/data/
total 106564
drwx------. 2 1000180000 root  4096 Jun 28 09:43 admin
-rw-rw----. 1 1000180000 root 16384 Jun 28 09:34 aria_log.00000001
Yes, there is data in the 'admin' dir in the pod, but in the related pv the 'data' dir is empty.
Were you able to confirm the pvc was mounted in the pod where this happened? It looks like the host is not up, and I don't see output from mount/df to confirm that's the case. If data is not being written to mounted PVCs this needs to go to someone familiar with storage.
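For example (a sketch; substitute the actual pod name from the failing environment):

# is the claim attached to the pod?
oc describe pod rhscl-mariadb-10.2-prod-1-bwtt7 | grep -B1 -A2 ClaimName

# is the volume actually mounted inside the container?
oc exec rhscl-mariadb-10.2-prod-1-bwtt7 -- df -h /var/lib/mysql
oc exec rhscl-mariadb-10.2-prod-1-bwtt7 -- mount | grep mysql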
In the 3.9 HA env, the mariadb prod pod dc:

[root@qe-zitang-39-3me-1 ~]# oc get dc -o yaml
apiVersion: v1
items:
- apiVersion: apps.openshift.io/v1
  kind: DeploymentConfig
  metadata:
    creationTimestamp: 2018-07-02T03:12:03Z
    generation: 1
    labels:
      app: rhscl-mariadb-apb
      service: rhscl-mariadb-10.2-prod
    name: rhscl-mariadb-10.2-prod
    namespace: maria
    resourceVersion: "15754"
    selfLink: /apis/apps.openshift.io/v1/namespaces/maria/deploymentconfigs/rhscl-mariadb-10.2-prod
    uid: b161cdc6-7da5-11e8-ba69-fa163e1ba47a
  spec:
    replicas: 1
    selector:
      app: rhscl-mariadb-apb
      service: rhscl-mariadb-10.2-prod
    strategy:
      activeDeadlineSeconds: 21600
      resources: {}
      rollingParams:
        intervalSeconds: 1
        maxSurge: 25%
        maxUnavailable: 25%
        timeoutSeconds: 600
        updatePeriodSeconds: 1
      type: Rolling
    template:
      metadata:
        creationTimestamp: null
        labels:
          app: rhscl-mariadb-apb
          service: rhscl-mariadb-10.2-prod
      spec:
        containers:
        - env:
          - name: MYSQL_ROOT_PASSWORD
            value: dddd
          - name: MYSQL_USER
            value: admin
          - name: MYSQL_PASSWORD
            value: dddd
          - name: MYSQL_DATABASE
            value: admin
          image: registry.access.redhat.com/rhscl/mariadb-102-rhel7
          imagePullPolicy: IfNotPresent
          name: rhscl-mariadb
          ports:
          - containerPort: 3306
            protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /var/lib/mysql
            name: mariadb-data
          workingDir: /
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        terminationGracePeriodSeconds: 30
        volumes:
        - name: mariadb-data
          persistentVolumeClaim:
            claimName: rhscl-mariadb-10.2-prod
    test: false
    triggers:
    - type: ConfigChange
  status:
    availableReplicas: 1
    conditions:
    - lastTransitionTime: 2018-07-02T03:12:44Z
      lastUpdateTime: 2018-07-02T03:12:44Z
      message: Deployment config has minimum availability.
      status: "True"
      type: Available
    - lastTransitionTime: 2018-07-02T03:12:07Z
      lastUpdateTime: 2018-07-02T03:12:45Z
      message: replication controller "rhscl-mariadb-10.2-prod-1" successfully rolled out
      reason: NewReplicationControllerAvailable
      status: "True"
      type: Progressing
    details:
      causes:
      - type: ConfigChange
      message: config change
    latestVersion: 1
    observedGeneration: 1
    readyReplicas: 1
    replicas: 1
    unavailableReplicas: 0
    updatedReplicas: 1
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

In the pod:

[root@qe-zitang-39-3me-1 ~]# oc rsh rhscl-mariadb-10.2-prod-1-wdvlm
sh-4.2$ df
Filesystem                                                                                          1K-blocks     Used Available Use% Mounted on
/dev/mapper/docker-253:0-41943105-1073ed4afd7683dd0e07e9c7ec1c6bf67bde79cfa74526678694e8d3528ee877   10475520   476708   9998812   5% /
tmpfs                                                                                                 8133200        0   8133200   0% /dev
tmpfs                                                                                                 8133200        0   8133200   0% /sys/fs/cgroup
/dev/mapper/rhel-root                                                                                18864128  1819444  17044684  10% /etc/hosts
shm                                                                                                     65536        0     65536   0% /dev/shm
/dev/vdb                                                                                             10475520    32944  10442576   1% /var/lib/mysql
tmpfs                                                                                                 8133200       16   8133184   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                                                                                                 8133200        0   8133200   0% /proc/scsi
tmpfs                                                                                                 8133200        0   8133200   0% /sys/firmware

The dir '/var/lib/mysql' is mounted, but connecting to the pv on the node, the data dir is empty.
[root@qe-zitang-39-3nrr-1 ~]# mount | grep pvc-9d5f6d7e-7da5-11e8-ba69-fa163e1ba47a
/dev/vdb on /var/lib/origin/openshift.local.volumes/pods/b3a6a80c-7da5-11e8-b9eb-fa163e0c95cf/volumes/kubernetes.io~cinder/pvc-9d5f6d7e-7da5-11e8-ba69-fa163e1ba47a type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
[root@qe-zitang-39-3nrr-1 ~]# cd /var/lib/origin/openshift.local.volumes/pods/b3a6a80c-7da5-11e8-b9eb-fa163e0c95cf/volumes/kubernetes.io~cinder/pvc-9d5f6d7e-7da5-11e8-ba69-fa163e1ba47a
[root@qe-zitang-39-3nrr-1 pvc-9d5f6d7e-7da5-11e8-ba69-fa163e1ba47a]# ls
data  mysql.sock
[root@qe-zitang-39-3nrr-1 pvc-9d5f6d7e-7da5-11e8-ba69-fa163e1ba47a]# cd data/
[root@qe-zitang-39-3nrr-1 data]# ll
total 0

I tried the latest mariadb apb in an OCP 3.10 env; it looks ok, the data is in sync with the PV.

[root@ip-172-18-1-156 ~]# mount | grep pvc-e0a71876-7da3-11e8-a839-0ef5ee5b5cf8
/dev/xvdbg on /var/lib/origin/openshift.local.volumes/pods/f8139aaf-7da3-11e8-a839-0ef5ee5b5cf8/volumes/kubernetes.io~aws-ebs/pvc-e0a71876-7da3-11e8-a839-0ef5ee5b5cf8 type ext4 (rw,relatime,seclabel,data=ordered)
[root@ip-172-18-1-156 ~]# cd /var/lib/origin/openshift.local.volumes/pods/f8139aaf-7da3-11e8-a839-0ef5ee5b5cf8/volumes/kubernetes.io~aws-ebs/pvc-e0a71876-7da3-11e8-a839-0ef5ee5b5cf8
[root@ip-172-18-1-156 pvc-e0a71876-7da3-11e8-a839-0ef5ee5b5cf8]# ls
data  lost+found  mysql.sock
[root@ip-172-18-1-156 pvc-e0a71876-7da3-11e8-a839-0ef5ee5b5cf8]# cd data/
[root@ip-172-18-1-156 data]# ls
admin              aria_log_control  ibdata1      ib_logfile1  mariadb-5d3b49e9-7da3-11e8-a8cb-0a580a800005-1-mq6gk.pid  mysql               performance_schema  test
aria_log.00000001  ib_buffer_pool    ib_logfile0  ibtmp1       multi-master.info                                         mysql_upgrade_info  tc.log
[root@ip-172-18-1-156 data]# cd admin/
[root@ip-172-18-1-156 admin]# ll
total 4
-rw-rw----. 1 1000140000 1000140000 65 Jul  1 23:00 db.opt
I tried in a non-HA env; data in the mariadb pod (prod plan) is lost after upgrade, so I updated the title.
Jason, do you think it will be fixed in 3.10.0, since we reproduced it in a non-HA env?
I haven't been able to reproduce it to understand what's going on. I don't know why data wouldn't be written to the underlying storage. If data isn't being written to the volume we need to get to the bottom of why.

You said in your last comment, "The dir '/var/lib/mysql' is mounted, but connecting to the pv on the node, the data dir is empty." Are we mounting persistent storage in the wrong place? If it's something like this, why is it working in some environments? Is it something to do with the gce storage you're using? I don't have access to a GCE account to even try to reproduce this. I asked whether you have tried without the failure region annotations I noted, and if it could be something to do with that feature, but haven't seen a response. I simply do not know enough about this feature or gce storage to know if this could be the case.

In other words, please help me understand and get to the bottom of why this is happening in your environment so we can get it to the right person, whether that's us or someone with the appropriate knowledge about the storage backend.
It seems to be a runtime issue. As I commented in #comment 26, when provisioning mariadb in v3.9 the data in the pod is not synced to the pv, so after upgrade it loses the data. I tried in a cri-o env 3 times (in gce and aws); the data is synced to the pv.

[root@ip-172-18-13-17 ~]# cd /var/lib/origin/openshift.local.volumes/pods/d744872f-84ee-11e8-b795-0e84b937625e/volumes/kubernetes.io~aws-ebs/pvc-bf88f34d-84ee-11e8-b795-0e84b937625e
[root@ip-172-18-13-17 pvc-bf88f34d-84ee-11e8-b795-0e84b937625e]# ls
data  lost+found  mysql.sock
[root@ip-172-18-13-17 pvc-bf88f34d-84ee-11e8-b795-0e84b937625e]# ls -l
total 20
drwx--S---. 6 1000130000 1000130000  4096 Jul 11 05:44 data
drwxrwS---. 2 root       1000130000 16384 Jul 11 05:43 lost+found
srwxrwxrwx. 1 1000130000 1000130000     0 Jul 11 05:44 mysql.sock
[root@ip-172-18-13-17 pvc-bf88f34d-84ee-11e8-b795-0e84b937625e]# cd data
[root@ip-172-18-13-17 data]# ls
admin              aria_log_control  ibdata1      ib_logfile1  multi-master.info  mysql_upgrade_info  rhscl-mariadb-10.pid  test
aria_log.00000001  ib_buffer_pool    ib_logfile0  ibtmp1       mysql              performance_schema  tc.log

It seems to be a docker runtime issue. docker version: docker-1.13.1-58

I found a workaround for the docker runtime. The dc has:

volumeMounts:
- mountPath: /var/lib/mysql
  name: mariadb-data

If I patch mountPath to '/var/lib/mysql/data', the data in the mariadb pod syncs with the pv:

# mount | grep pvc-fcad39f6-84cb-11e8-a9bd-fa163ea66cff
/dev/vdg on /var/lib/origin/openshift.local.volumes/pods/138d4c32-84cc-11e8-a9bd-fa163ea66cff/volumes/kubernetes.io~cinder/pvc-fcad39f6-84cb-11e8-a9bd-fa163ea66cff type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/vdg on /var/lib/origin/openshift.local.volumes/pods/7901d55c-84d5-11e8-a9bd-fa163ea66cff/volumes/kubernetes.io~cinder/pvc-fcad39f6-84cb-11e8-a9bd-fa163ea66cff type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
[root@qe-zitang-39-2node-registry-router-1 data]# cd /var/lib/origin/openshift.local.volumes/pods/7901d55c-84d5-11e8-a9bd-fa163ea66cff/volumes/kubernetes.io~cinder/pvc-fcad39f6-84cb-11e8-a9bd-fa163ea66cff
[root@qe-zitang-39-2node-registry-router-1 pvc-fcad39f6-84cb-11e8-a9bd-fa163ea66cff]# ls
admin  aria_log.00000001  aria_log_control  data  ibdata1  ib_logfile0  ib_logfile1  multi-master.info  mysql  mysql_upgrade_info  performance_schema  rhscl-mariadb-10.pid  tc.log  test
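For reference, the workaround above can be applied with a JSON patch; a sketch, assuming the mariadb container and its volume mount are each the first (index 0) entry in the dc:

oc patch dc rhscl-mariadb-10.2-prod --type=json -p \
  '[{"op": "replace", "path": "/spec/template/spec/containers/0/volumeMounts/0/mountPath", "value": "/var/lib/mysql/data"}]'

Note the ConfigChange trigger rolls out a new deployment, and any data already written to the old mount point is not migrated by the patch itself.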
Why would data in a subdirectory of the mount not be getting written to the pvc?

In your output above:

[root@ip-172-18-13-17 data]# ls
admin              aria_log_control  ibdata1      ib_logfile1  multi-master.info  mysql_upgrade_info  rhscl-mariadb-10.pid  test
aria_log.00000001  ib_buffer_pool    ib_logfile0  ibtmp1       mysql              performance_schema  tc.log

But on the pvc it looks like you have a data directory?:

[root@qe-zitang-39-2node-registry-router-1 pvc-fcad39f6-84cb-11e8-a9bd-fa163ea66cff]# ls
admin  aria_log.00000001  aria_log_control  data  ibdata1  ib_logfile0  ib_logfile1  multi-master.info  mysql  mysql_upgrade_info  performance_schema  rhscl-mariadb-10.pid  tc.log  test
(In reply to Jason Montleon from comment #32)
> Why would data in a subdirectory of the mount not be getting written to the pvc?

It seems to be a runtime issue; if I use cri-o, the data is written to the pv even in a subdirectory (I tried in 3 envs in #comment 31).

> In your output above:
> [root@ip-172-18-13-17 data]# ls
> admin aria_log_control ibdata1 ib_logfile1 multi-master.info mysql_upgrade_info rhscl-mariadb-10.pid test
> aria_log.00000001 ib_buffer_pool ib_logfile0 ibtmp1 mysql performance_schema tc.log

This is in the cri-o env.

> But on the pvc it looks like you have a data directory?:
> [root@qe-zitang-39-2node-registry-router-1 pvc-fcad39f6-84cb-11e8-a9bd-fa163ea66cff]# ls
> admin aria_log.00000001 aria_log_control data ibdata1 ib_logfile0 ib_logfile1 multi-master.info mysql mysql_upgrade_info performance_schema rhscl-mariadb-10.pid tc.log test

The 'data' dir was created with the old dc (mountPath: /var/lib/mysql); this output is from after I patched the dc 'mountPath' to '/var/lib/mysql/data'.
I agree with aligning it to 3.11 to find out the real reason. If the data is synced to the pv before the upgrade, the upgrade will succeed.
Is this still a problem in 3.11? Can you please try with at least one different type of backing storage (NFS, HostPath, etc.) to see if you can reproduce this issue across storage types?

When it comes down to it, the mariadb container is a standard application container with a standard pvc. Files in a subdirectory of the pvc mount not making it onto the pv does not make much sense, and I haven't been able to reproduce it. We need to get to the bottom of whether this only happens with one storage backend, whether it's a misconfiguration of the PVC we're creating, etc., so we can correct the issue or get the bug to someone with the knowledge to further debug it and fix it.
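For example, a throwaway HostPath PV would take the cloud provider out of the picture entirely; a sketch (the name, path, and size are arbitrary, and the directory must exist on the node with ownership permissive enough for the container's random UID):

cat <<'EOF' | oc create -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mariadb-hostpath-test
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /tmp/mariadb-hostpath-test
EOF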
This is blocked by bug https://bugzilla.redhat.com/show_bug.cgi?id=1611939; I cannot provision with the prod plan. I'll try after that bug is fixed. Thanks.
Verification failed. In v3.11 it is the same as v3.10:
1. In a docker (1.13.1) env, mariadb data is not synced to the PV.
2. In a cri-o env, mariadb data is synced to the PV.
Have you tried any other storage options, like NFS or HostPath PVs, with docker? As mentioned, I'm looking for your help in trying to narrow down what in your environment might be causing this, since we can't reproduce it.
I wonder if the difference in behavior between the two is the Volumes set in the Dockerfile:

$ docker inspect registry.access.redhat.com/rhscl/mariadb-102-rhel7:latest | grep \"Volumes\": -A2
            "Volumes": {
                "/var/lib/mysql/data": {}
            },
--
            "Volumes": {
                "/var/lib/mysql/data": {}
            },

vs.

$ docker inspect registry.access.redhat.com/rhscl/mysql-57-rhel7:latest | grep \"Volumes\": -A2
            "Volumes": null,
            "WorkingDir": "/opt/app-root/src",
            "Entrypoint": [
--
            "Volumes": null,
            "WorkingDir": "/opt/app-root/src",
            "Entrypoint": [

From that it looks like the expected mount point is /var/lib/mysql/data, as you suggested trying. I've submitted a PR to do this: https://github.com/ansibleplaybookbundle/mariadb-apb/pull/42
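If the image-declared VOLUME is indeed the culprit, the effect can be observed with docker alone: with a bind mount at /var/lib/mysql, docker still creates an anonymous volume for the image's /var/lib/mysql/data, which shadows the bind mount at that path, so the data files never land on the backing directory. A standalone sketch (the host path is hypothetical; the container does not need to run successfully for the mount table to show the shadowing):

docker run -d --name mariadb-mount-test \
  -v /tmp/fake-pv:/var/lib/mysql \
  registry.access.redhat.com/rhscl/mariadb-102-rhel7

docker inspect mariadb-mount-test \
  --format '{{range .Mounts}}{{.Destination}} <- {{.Source}}{{println}}{{end}}'
# expected: /var/lib/mysql      <- /tmp/fake-pv (the bind mount)
#           /var/lib/mysql/data <- /var/lib/docker/volumes/<id>/_data (anonymous volume)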
Errata https://errata.devel.redhat.com/brew/save_builds/33505 updated:
openshift-enterprise-mariadb-apb-v3.11.0-0.13.0.1
openshift-enterprise-mediawiki-apb-v3.11.0-0.13.0.1
openshift-enterprise-postgresql-apb-v3.11.0-0.13.0.1
openshift-enterprise-mysql-apb-v3.11.0-0.13.0.1
openshift-enterprise-asb-container-v3.11.0-0.13.0.1
Verified with mariadb-apb:v3.11.0-0.14.0.0 in docker and crio envs; mariadb-apb provision succeeds and the data is synced to the pv.

But this bug was originally produced when upgrading v3.9 to 3.10; if data is synced to the pv after provision, like the other db-apbs, the 'data lost' issue will not exist after upgrade. This fix is in the v3.11 apb. I haven't marked this as 'Verified' since the title describes a '3.9 upgrade to 3.10' issue. Do we need to copy this bug for the 3.9 and 3.10 mariadb-apb, to avoid the issue when upgrading 3.9->3.10 and 3.10->3.11?
Yes, please copy it so we can fix those versions of the apb as well. Thanks!
Based on comment 43, marking as verified. Copied 2 bugs for 3.10 and 3.9:
https://bugzilla.redhat.com/show_bug.cgi?id=1617939
https://bugzilla.redhat.com/show_bug.cgi?id=1617937
Jason -- If this issue needs to be included in the 3.11 release notes, please enter Doc text above. I am not clear on the fix. Michael
I don't think we need a doc update. We changed the mount point from /var/lib/mysql to /var/lib/mysql/data
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2652