Bug 1486420 - rhosp-director: OSP12 OC deployment fails: "Error running ['docker', 'run', '--name', 'mysql_bootstrap', '--label', 'config_id=tripleo_step1', '--label', 'container_name=mysql_bootstrap', '--label', 'managed_by=paunch', '--label', 'config_data"
Summary: rhosp-director: OSP12 OC deployment fails: "Error running ['docker', 'run', '...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: beta
Target Release: 12.0 (Pike)
Assignee: Michele Baldessari
QA Contact: Alexander Chuzhoy
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-08-29 17:55 UTC by Alexander Chuzhoy
Modified: 2018-02-05 19:12 UTC
CC List: 14 users

Fixed In Version: openstack-tripleo-heat-templates-7.0.0-0.20170821194254
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-13 21:59:22 UTC
Target Upstream Version:
Embargoed:




Links:
Launchpad 1714057 (last updated 2017-08-30 17:24:23 UTC)
OpenStack gerrit 499623, MERGED: Add --wsrep-provider=none to the mysql_bootstrap container (last updated 2020-11-18 06:40:31 UTC)
Red Hat Product Errata RHEA-2017:3462, SHIPPED_LIVE: Red Hat OpenStack Platform 12.0 Enhancement Advisory (last updated 2018-02-16 01:43:25 UTC)

Description Alexander Chuzhoy 2017-08-29 17:55:52 UTC
rhosp-director: OSP12 OC deployment fails: "Error running ['docker', 'run', '--name', 'mysql_bootstrap', '--label', 'config_id=tripleo_step1', '--label', 'container_name=mysql_bootstrap', '--label', 'managed_by=paunch', '--label', 'config_data"



Environment:
instack-undercloud-7.2.1-0.20170821194210.el7ost.noarch
openstack-puppet-modules-10.0.0-0.20170712001959.0333c73.el7ost.noarch
openstack-tripleo-heat-templates-7.0.0-0.20170821194253.el7ost.noarch


Steps to reproduce:
Try to deploy the overcloud.
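
The exact deploy command is site-specific; a hypothetical minimal invocation for a containerized OSP12 deployment might look like the following (rhos12.yaml is the operator's local environment file, visible among the file names in the error output below):

openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml \
  -e ~/rhos12.yaml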


Result:
Deployment fails.

See the following error:
 \"Error running ['docker', 'run', '--name', 'mysql_bootstrap', '--label', 'config_id=tripleo_step1', '--label', 'container_name=mysql_bootstrap', '--label', 'managed_by=paunch', '--label', 'config_data={\\"environment\\": [\\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\\", \\"KOLLA_BOOTSTRAP=True\\", \\"KOLLA_KUBERNETES=True\\", \\"DB_MAX_TIMEOUT=60\\", \\"DB_CLUSTERCHECK_PASSWORD=3cJFUhAfZabtBhKE22AuBzYbm\\", \\"DB_ROOT_PASSWORD=0i2XpsHwaX\\"], \\"start_order\\": 1, \\"command\\": [\\"bash\\", \\"-ec\\", \\"if [ -e /var/lib/mysql/mysql ]; then exit 0; fi\\\\nkolla_start\\\\nmysqld_safe --skip-networking --wsrep-on=OFF &\\\\ntimeout ${DB_MAX_TIMEOUT} /bin/bash -c \\'until mysqladmin -uroot -p\\\\\\"${DB_ROOT_PASSWORD}\\\\\\" ping 2>/dev/null; do sleep 1; done\\'\\\\nmysql -uroot -p\\\\\\"${DB_ROOT_PASSWORD}\\\\\\" -e \\\\\\"CREATE USER \\'clustercheck\\'@\\'localhost\\' IDENTIFIED BY \\'${DB_CLUSTERCHECK_PASSWORD}\\';\\\\\\"\\\\nmysql -uroot -p\\\\\\"${DB_ROOT_PASSWORD}\\\\\\" -e \\\\\\"GRANT PROCESS ON container_images.yaml debug.yaml instackenv.json launch_instance.sh launch-instance-workarounds.sh overcloud_deployment_0.log overcloud_deploy.sh overcloud_deploy.sh_orig rhos12.yaml undercloud.conf undercloud_deploy.sh undercloud_install.log undercloud-passwords.conf undercloud-privkey.pem TO \\'clustercheck\\'@\\'localhost\\' WITH GRANT OPTION;\\\\\\"\\\\ntimeout ${DB_MAX_TIMEOUT} mysqladmin -uroot -p\\\\\\"${DB_ROOT_PASSWORD}\\\\\\" shutdown\\"], \\"volumes\\": [\\"/etc/hosts:/etc/hosts:ro\\", \\"/etc/localtime:/etc/localtime:ro\\", \\"/etc/puppet:/etc/puppet:ro\\", \\"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\\", \\"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\\", \\"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\\", \\"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\\", \\"/dev/log:/dev/log\\", \\"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\\", \\"/var/lib/kolla/config_files/mysql.json:/var/lib/kolla/config_files/config.json\\", \\"/var/lib/config-data/puppet-generated/mysql/:/var/lib/kolla/config_files/src:ro\\", \\"/var/lib/mysql:/var/lib/mysql\\"], \\"image\\": \\"192.168.24.1:8787/rhosp12/openstack-mariadb-docker:2017-08-28.10\\", \\"detach\\": false, \\"net\\": \\"host\\"}', '--env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS', '--env=KOLLA_BOOTSTRAP=True', '--env=KOLLA_KUBERNETES=True', '--env=DB_MAX_TIMEOUT=60', '--env=DB_CLUSTERCHECK_PASSWORD=3cJFUhAfZabtBhKE22AuBzYbm', '--env=DB_ROOT_PASSWORD=0i2XpsHwaX', '--net=host', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro', '--volume=/var/lib/kolla/config_files/mysql.json:/var/lib/kolla/config_files/config.json', '--volume=/var/lib/config-data/puppet-generated/mysql/:/var/lib/kolla/config_files/src:ro', '--volume=/var/lib/mysql:/var/lib/mysql', '192.168.24.1:8787/rhosp12/openstack-mariadb-docker:2017-08-28.10', 'bash', '-ec', 'if [ -e /var/lib/mysql/mysql ]; then exit 0; fi\\nkolla_start\\nmysqld_safe --skip-networking --wsrep-on=OFF &\\ntimeout ${DB_MAX_TIMEOUT} /bin/bash -c \\'until mysqladmin -uroot 
-p\\"${DB_ROOT_PASSWORD}\\" ping 2>/dev/null; do sleep 1; done\\'\\nmysql -uroot -p\\"${DB_ROOT_PASSWORD}\\" -e \\"CREATE USER \\'clustercheck\\'@\\'localhost\\' IDENTIFIED BY \\'${DB_CLUSTERCHECK_PASSWORD}\\';\\"\\nmysql -uroot -p\\"${DB_ROOT_PASSWORD}\\" -e \\"GRANT PROCESS ON container_images.yaml debug.yaml instackenv.json launch_instance.sh launch-instance-workarounds.sh overcloud_deployment_0.log overcloud_deploy.sh overcloud_deploy.sh_orig rhos12.yaml undercloud.conf undercloud_deploy.sh undercloud_install.log undercloud-passwords.conf undercloud-privkey.pem TO \\'clustercheck\\'@\\'localhost\\' WITH GRANT OPTION;\\"\\ntimeout ${DB_MAX_TIMEOUT} mysqladmin -uroot -p\\"${DB_ROOT_PASSWORD}\\" shutdown']. [124]\", 
 \"170829 17:43:31 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.zDoAcF' --pid-file='/var/lib/mysql/overcloud-controller-0-recover.pid'\", 
 \"170829 17:43:38 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.pJB84Z' --pid-file='/var/lib/mysql/overcloud-controller-0-recover.pid'\",

Comment 1 Alexander Chuzhoy 2017-08-29 17:57:22 UTC
The issue reproduces consistently.

Comment 3 Alexander Chuzhoy 2017-08-29 19:53:06 UTC
Note:
Applied the workaround mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1486329#c3

Comment 5 Dan Prince 2017-08-29 20:53:02 UTC
The output of the mysql_bootstrap looks normalish to me.

Was able to get access to the controller node this afternoon. The bootstrap container is definitely exiting with a bad status code: 

81a0426f82d6        192.168.24.1:8787/rhosp12/openstack-mariadb-docker:2017-08-28.10     "bash -ec 'if [ -e /v"   3 hours ago         Exited (124) 3 hours ago                       mysql_bootstrap
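
To confirm what paunch reported, the container's exit status can also be read directly (a quick check, assuming the exited container is still present):

docker inspect --format '{{.State.ExitCode}}' mysql_bootstrap    # prints 124 here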

Adding Chris Jones to see if anyone from the HA team has thoughts here too.

Comment 6 Dan Prince 2017-08-29 20:58:18 UTC
Interesting. After seeing the failure I re-ran the mysql_bootstrap container and it exited cleanly, with the same output as before. I wonder if there is a race in the setup scripts at step 1.

Comment 7 Michele Baldessari 2017-08-30 09:00:24 UTC
(In reply to Dan Prince from comment #5)
> The output of the mysql_bootstrap looks normalish to me.
> 
> Was able to get access to the controller node this afternoon. The bootstrap
> container is definitely exiting with a bad status code: 
> 
> 81a0426f82d6       
> 192.168.24.1:8787/rhosp12/openstack-mariadb-docker:2017-08-28.10     "bash
> -ec 'if [ -e /v"   3 hours ago         Exited (124) 3 hours ago             
> mysql_bootstrap
> 
> Adding Chris Jones to see if anyone from the HA team has thoughts here too.

So exit code 124 is very likely the timeout command hitting its time limit, either here https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/pacemaker/database/mysql.yaml#L172 or here https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/pacemaker/database/mysql.yaml#L175.
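
For reference, GNU coreutils timeout kills the wrapped command and exits with status 124 when the limit is reached, which matches the trailing "[124]" in the error above. A minimal illustration:

timeout 2 bash -c 'sleep 5'; echo $?    # prints 124: sleep was killed at the 2-second limit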

From a quick check, I did not see any other container starting that maps /var/lib/mysql into itself and could somehow interfere with this bootstrap operation.

I'll throw two theories in here:
A) we do hit a genuine timeout because mysqld is not up yet [A.1] (L172) or because it is not shutting down [A.2] (L175) within 60 seconds.

B) mysqld_safe is not starting at all.

If I look at the controller-0 sosreport in comment #2, I see the following (look at the journal itself, not at the os-collect-config output, because the timings there are broken):
1) mysql_bootstrap container starts
Aug 29 13:43:28 localhost journal: INFO:__main__:Writing out command to execute
Aug 29 13:43:28 localhost journal: Installing MariaDB/MySQL system tables in '/var/lib/mysql' ...
...
2) mysqld_safe seems to come up successfully
Aug 29 13:43:40 localhost journal: 170829 17:43:40 mysqld_safe WSREP: Recovered position 93cdb7a3-8ce1-11e7-b771-8aa40bcaf74e:3
Aug 29 13:43:41 localhost journal: 170829 17:43:41 mysqld_safe mysqld from pid file /var/lib/mysql/mariadb.pid ended 

So I don't think this can be [A.1] or [B]. It seems either something is stuck, or we are hitting [A.2].

It'd be grand if we could also get /var/log/containers/* so we can see what mariadb says about itself, or maybe get access to a system where this issue is seen.

Comment 8 Michele Baldessari 2017-08-30 17:15:56 UTC
Thanks again to Mike Bayer and Sasha for their time today.

The problem is that mysqld_safe exits right away; since we do not collect its exit code, we were misled into thinking it was the other commands hanging.
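
A minimal sketch of why the failure was invisible (a hypothetical stand-in, not the actual template code): a command backgrounded with & fails silently unless its status is collected with wait.

failing_daemon() { sleep 1; exit 1; }    # hypothetical stand-in for mysqld_safe aborting
failing_daemon &
pid=$!
# ...meanwhile the script moves on, and its timeout-wrapped commands hang instead...
wait "$pid"
echo "daemon exited with status $?"      # prints 1, but only because we collected it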

More specifically the problem is this:
170830 17:03:05 [Note] WSREP: Service thread queue flushed.
170830 17:03:05 [Note] WSREP: Assign initial position for certification: 3, protocol version: -1
170830 17:03:05 [Note] WSREP: wsrep_sst_grab()
170830 17:03:05 [Note] WSREP: Start replication
170830 17:03:05 [Note] WSREP: Setting initial position to 12bdc2af-8da5-11e7-bbd0-9be3013253cf:3
170830 17:03:05 [Note] WSREP: protonet asio version 0
170830 17:03:05 [Note] WSREP: Using CRC-32C for message checksums.
170830 17:03:05 [Note] WSREP: backend: asio
170830 17:03:05 [Note] WSREP: gcomm thread scheduling priority set to other:0 
170830 17:03:05 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
170830 17:03:05 [Note] WSREP: restore pc from disk failed
170830 17:03:05 [Note] WSREP: GMCast version 0
170830 17:03:05 [Note] WSREP: (16ec5c12, 'tcp://172.17.1.13:4567') listening at tcp://172.17.1.13:4567
170830 17:03:05 [Note] WSREP: (16ec5c12, 'tcp://172.17.1.13:4567') multicast: , ttl: 1
170830 17:03:05 [Note] WSREP: EVS version 0
170830 17:03:05 [Note] WSREP: gcomm: connecting to group 'galera_cluster', peer 'overcloud-controller-0.internalapi.localdomain:'
170830 17:03:05 [ERROR] WSREP: failed to open gcomm backend connection: 131: No address to connect (FATAL)
         at gcomm/src/gmcast.cpp:connect_precheck():282
170830 17:03:05 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -131 (State not recoverable)
170830 17:03:05 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1380: Failed to open channel 'galera_cluster' at 'gcomm://overcloud-controller-0.internalapi.localdomain': -131 (State not recoverable)
170830 17:03:05 [ERROR] WSREP: gcs connect failed: State not recoverable
170830 17:03:05 [ERROR] WSREP: wsrep::connect() failed: 7
170830 17:03:05 [ERROR] Aborting
 
170830 17:03:05 [Note] WSREP: Service disconnected.
170830 17:03:06 [Note] WSREP: Some threads may fail to exit.
170830 17:03:06 [Note] /usr/libexec/mysqld: Shutdown complete
 
170830 17:03:06 mysqld_safe mysqld from pid file /var/lib/mysql/mariadb.pid ended


This issue is likely not seen upstream because of mariadb/galera package differences. Downstream we have the following:
mariadb-libs-5.5.56-2.el7.x86_64
galera-25.3.16-3.el7ost.x86_64
mariadb-galera-server-5.5.42-6.el7ost.x86_64
mariadb-5.5.56-2.el7.x86_64
mariadb-galera-common-5.5.42-6.el7ost.x86_64


Basically we need to add an extra argument to mysqld_safe to make sure it does not start in galera mode. I'll link patches today.
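
Based on the review that later merged (gerrit 499623 in the Links section, "Add --wsrep-provider=none to the mysql_bootstrap container"), the bootstrap invocation changes roughly like this:

# Before: even with --wsrep-on=OFF the Galera provider still loads and, as the
# log above shows, tries to open a gcomm:// connection and aborts.
mysqld_safe --skip-networking --wsrep-on=OFF &

# After: disabling the provider entirely lets the bootstrap run standalone.
mysqld_safe --skip-networking --wsrep-on=OFF --wsrep-provider=none &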

Comment 9 Dan Prince 2017-08-31 01:33:11 UTC
I was able to test Michele's upstream patch this afternoon using infrared, and it allowed me to create an overcloud (Heat stack COMPLETE). See this patch:

https://review.openstack.org/499242

Comment 12 Michele Baldessari 2017-08-31 14:44:41 UTC
For completeness, the upstream Pike backport is here: https://review.openstack.org/#/c/499623 (I'll delete the master one from the bz links once the Pike one merges)

Comment 13 Alexander Chuzhoy 2017-09-01 14:53:08 UTC
Verified
Environment:
openstack-tripleo-heat-templates-7.0.0-0.20170821194254.el7ost.noarch


The issue doesn't reproduce.

Comment 17 errata-xmlrpc 2017-12-13 21:59:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462

