rhosp-director: OSP12 OC deployment fails: "Error running ['docker', 'run', '--name', 'mysql_bootstrap', '--label', 'config_id=tripleo_step1', '--label', 'container_name=mysql_bootstrap', '--label', 'managed_by=paunch', '--label', 'config_data"

\"Error running ['docker', 'run', '--name', 'mysql_bootstrap', '--label', 'config_id=tripleo_step1', '--label', 'container_name=mysql_bootstrap', '--label', 'managed_by=paunch', '--label', 'config_data={\\"environment\\": [\\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\\", \\"KOLLA_BOOTSTRAP=True\\", \\"KOLLA_KUBERNETES=True\\", \\"DB_MAX_TIMEOUT=60\\", \\"DB_CLUSTERCHECK_PASSWORD=3cJFUhAfZabtBhKE22AuBzYbm\\", \\"DB_ROOT_PASSWORD=0i2XpsHwaX\\"], \\"start_order\\": 1, \\"command\\": [\\"bash\\", \\"-ec\\", \\"if [ -e /var/lib/mysql/mysql ]; then exit 0; fi\\\\nkolla_start\\\\nmysqld_safe --skip-networking --wsrep-on=OFF &\\\\ntimeout ${DB_MAX_TIMEOUT} /bin/bash -c \\'until mysqladmin -uroot -p\\\\\\"${DB_ROOT_PASSWORD}\\\\\\" ping 2>/dev/null; do sleep 1; done\\'\\\\nmysql -uroot -p\\\\\\"${DB_ROOT_PASSWORD}\\\\\\" -e \\\\\\"CREATE USER \\'clustercheck\\'@\\'localhost\\' IDENTIFIED BY \\'${DB_CLUSTERCHECK_PASSWORD}\\';\\\\\\"\\\\nmysql -uroot -p\\\\\\"${DB_ROOT_PASSWORD}\\\\\\" -e \\\\\\"GRANT PROCESS ON container_images.yaml debug.yaml instackenv.json launch_instance.sh launch-instance-workarounds.sh overcloud_deployment_0.log overcloud_deploy.sh overcloud_deploy.sh_orig rhos12.yaml undercloud.conf undercloud_deploy.sh undercloud_install.log undercloud-passwords.conf undercloud-privkey.pem TO \\'clustercheck\\'@\\'localhost\\' WITH GRANT OPTION;\\\\\\"\\\\ntimeout ${DB_MAX_TIMEOUT} mysqladmin -uroot -p\\\\\\"${DB_ROOT_PASSWORD}\\\\\\" shutdown\\"], \\"volumes\\": [\\"/etc/hosts:/etc/hosts:ro\\", \\"/etc/localtime:/etc/localtime:ro\\", \\"/etc/puppet:/etc/puppet:ro\\", \\"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\\", \\"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\\", \\"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\\", \\"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\\", \\"/dev/log:/dev/log\\", \\"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\\", \\"/var/lib/kolla/config_files/mysql.json:/var/lib/kolla/config_files/config.json\\", \\"/var/lib/config-data/puppet-generated/mysql/:/var/lib/kolla/config_files/src:ro\\", \\"/var/lib/mysql:/var/lib/mysql\\"], \\"image\\": \\"192.168.24.1:8787/rhosp12/openstack-mariadb-docker:2017-08-28.10\\", \\"detach\\": false, \\"net\\": \\"host\\"}', '--env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS', '--env=KOLLA_BOOTSTRAP=True', '--env=KOLLA_KUBERNETES=True', '--env=DB_MAX_TIMEOUT=60', '--env=DB_CLUSTERCHECK_PASSWORD=3cJFUhAfZabtBhKE22AuBzYbm', '--env=DB_ROOT_PASSWORD=0i2XpsHwaX', '--net=host', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro', '--volume=/var/lib/kolla/config_files/mysql.json:/var/lib/kolla/config_files/config.json', '--volume=/var/lib/config-data/puppet-generated/mysql/:/var/lib/kolla/config_files/src:ro', '--volume=/var/lib/mysql:/var/lib/mysql',
'192.168.24.1:8787/rhosp12/openstack-mariadb-docker:2017-08-28.10', 'bash', '-ec', 'if [ -e /var/lib/mysql/mysql ]; then exit 0; fi\\nkolla_start\\nmysqld_safe --skip-networking --wsrep-on=OFF &\\ntimeout ${DB_MAX_TIMEOUT} /bin/bash -c \\'until mysqladmin -uroot -p\\"${DB_ROOT_PASSWORD}\\" ping 2>/dev/null; do sleep 1; done\\'\\nmysql -uroot -p\\"${DB_ROOT_PASSWORD}\\" -e \\"CREATE USER \\'clustercheck\\'@\\'localhost\\' IDENTIFIED BY \\'${DB_CLUSTERCHECK_PASSWORD}\\';\\"\\nmysql -uroot -p\\"${DB_ROOT_PASSWORD}\\" -e \\"GRANT PROCESS ON container_images.yaml debug.yaml instackenv.json launch_instance.sh launch-instance-workarounds.sh overcloud_deployment_0.log overcloud_deploy.sh overcloud_deploy.sh_orig rhos12.yaml undercloud.conf undercloud_deploy.sh undercloud_install.log undercloud-passwords.conf undercloud-privkey.pem TO \\'clustercheck\\'@\\'localhost\\' WITH GRANT OPTION;\\"\\ntimeout ${DB_MAX_TIMEOUT} mysqladmin -uroot -p\\"${DB_ROOT_PASSWORD}\\" shutdown']. [124]\",
\"170829 17:43:31 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.zDoAcF' --pid-file='/var/lib/mysql/overcloud-controller-0-recover.pid'\",
\"170829 17:43:38 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.pJB84Z' --pid-file='/var/lib/mysql/overcloud-controller-0-recover.pid'\",

Environment:
instack-undercloud-7.2.1-0.20170821194210.el7ost.noarch
openstack-puppet-modules-10.0.0-0.20170712001959.0333c73.el7ost.noarch
openstack-tripleo-heat-templates-7.0.0-0.20170821194253.el7ost.noarch

Steps to reproduce:
Try to deploy the overcloud.

Result:
Deployment fails with the "Error running ['docker', 'run', '--name', 'mysql_bootstrap', ...]" error quoted above; the mysql_bootstrap container exits with status 124.
\\"/dev/log:/dev/log\\", \\"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\\", \\"/var/lib/kolla/config_files/mysql.json:/var/lib/kolla/config_files/config.json\\", \\"/var/lib/config-data/puppet-generated/mysql/:/var/lib/kolla/config_files/src:ro\\", \\"/var/lib/mysql:/var/lib/mysql\\"], \\"image\\": \\"192.168.24.1:8787/rhosp12/openstack-mariadb-docker:2017-08-28.10\\", \\"detach\\": false, \\"net\\": \\"host\\"}', '--env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS', '--env=KOLLA_BOOTSTRAP=True', '--env=KOLLA_KUBERNETES=True', '--env=DB_MAX_TIMEOUT=60', '--env=DB_CLUSTERCHECK_PASSWORD=3cJFUhAfZabtBhKE22AuBzYbm', '--env=DB_ROOT_PASSWORD=0i2XpsHwaX', '--net=host', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro', '--volume=/var/lib/kolla/config_files/mysql.json:/var/lib/kolla/config_files/config.json', '--volume=/var/lib/config-data/puppet-generated/mysql/:/var/lib/kolla/config_files/src:ro', '--volume=/var/lib/mysql:/var/lib/mysql', '192.168.24.1:8787/rhosp12/openstack-mariadb-docker:2017-08-28.10', 'bash', '-ec', 'if [ -e /var/lib/mysql/mysql ]; then exit 0; fi\\nkolla_start\\nmysqld_safe --skip-networking --wsrep-on=OFF &\\ntimeout ${DB_MAX_TIMEOUT} /bin/bash -c \\'until mysqladmin -uroot -p\\"${DB_ROOT_PASSWORD}\\" ping 2>/dev/null; do sleep 1; done\\'\\nmysql -uroot -p\\"${DB_ROOT_PASSWORD}\\" -e \\"CREATE USER \\'clustercheck\\'@\\'localhost\\' IDENTIFIED BY \\'${DB_CLUSTERCHECK_PASSWORD}\\';\\"\\nmysql -uroot -p\\"${DB_ROOT_PASSWORD}\\" -e \\"GRANT PROCESS ON container_images.yaml debug.yaml instackenv.json launch_instance.sh launch-instance-workarounds.sh overcloud_deployment_0.log overcloud_deploy.sh overcloud_deploy.sh_orig rhos12.yaml undercloud.conf undercloud_deploy.sh undercloud_install.log undercloud-passwords.conf undercloud-privkey.pem TO \\'clustercheck\\'@\\'localhost\\' WITH GRANT OPTION;\\"\\ntimeout ${DB_MAX_TIMEOUT} mysqladmin -uroot -p\\"${DB_ROOT_PASSWORD}\\" shutdown']. [124]\", \"170829 17:43:31 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.zDoAcF' --pid-file='/var/lib/mysql/overcloud-controller-0-recover.pid'\", \"170829 17:43:38 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.pJB84Z' --pid-file='/var/lib/mysql/overcloud-controller-0-recover.pid'\",
The issue reproduces.
Note: Applied the workaround mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1486329#c3.
The output of the mysql_bootstrap container looks normal-ish to me.

I was able to get access to the controller node this afternoon. The bootstrap container is definitely exiting with a bad status code:

81a0426f82d6  192.168.24.1:8787/rhosp12/openstack-mariadb-docker:2017-08-28.10  "bash -ec 'if [ -e /v"  3 hours ago  Exited (124) 3 hours ago  mysql_bootstrap

Adding Chris Jones to see if anyone from the HA team has thoughts here too.
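For anyone triaging this on their own node, the exit status and container output can be confirmed with standard docker commands (a minimal sketch; the container name comes from the listing above):

    # List the container even though it has already exited
    docker ps -a --filter name=mysql_bootstrap
    # Show just the exit code (124 here, i.e. killed by timeout(1))
    docker inspect --format '{{.State.ExitCode}}' mysql_bootstrap
    # Dump the container's stdout/stderr to see the mariadb messages
    docker logs mysql_bootstrap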
Interesting. After seeing the failure I re-ran the mysql_bootstrap container and it exited cleanly, with the same output as before. I wonder whether there is a race in the setup scripts at step 1.
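For reference, re-running the exited bootstrap container with its original command and attached output can be done with the standard docker CLI (a sketch; it assumes the container from the listing in comment #5 is still present):

    # Re-start the exited container with its original command, output attached
    docker start -a mysql_bootstrap
    # Confirm the exit status of this new run
    docker inspect --format '{{.State.ExitCode}}' mysql_bootstrap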
(In reply to Dan Prince from comment #5)
> The output of the mysql_bootstrap container looks normal-ish to me.
>
> I was able to get access to the controller node this afternoon. The bootstrap
> container is definitely exiting with a bad status code:
>
> 81a0426f82d6  192.168.24.1:8787/rhosp12/openstack-mariadb-docker:2017-08-28.10
> "bash -ec 'if [ -e /v"  3 hours ago  Exited (124) 3 hours ago  mysql_bootstrap
>
> Adding Chris Jones to see if anyone from the HA team has thoughts here too.

So exit code 124 is very likely the timeout command itself hitting its deadline, either here:
https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/pacemaker/database/mysql.yaml#L172
or here:
https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/pacemaker/database/mysql.yaml#L175

From a quick check I did not see any other container that could be starting, mapping /var/lib/mysql into itself, and somehow interfering with this bootstrap operation.

I'll throw two theories in here:
A) We hit a genuine timeout, either because mysqld is not up yet [A.1] (L172) or because it does not shut down within 60 seconds [A.2] (L175).
B) mysqld_safe is not starting at all.

If I look at the controller-0 sosreport in comment #2, I see the following (look at the journal itself, not at the os-collect-config output, because the timings there are broken):

1) The mysql_bootstrap container starts:
Aug 29 13:43:28 localhost journal: INFO:__main__:Writing out command to execute
Aug 29 13:43:28 localhost journal: Installing MariaDB/MySQL system tables in '/var/lib/mysql' ...
...

2) mysqld_safe seems to come up successfully:
Aug 29 13:43:40 localhost journal: 170829 17:43:40 mysqld_safe WSREP: Recovered position 93cdb7a3-8ce1-11e7-b771-8aa40bcaf74e:3
Aug 29 13:43:41 localhost journal: 170829 17:43:41 mysqld_safe mysqld from pid file /var/lib/mysql/mariadb.pid ended

So I don't think this can be [A.1] or [B]. It seems either something is stuck or we are hitting [A.2]. It would be great if we could also get /var/log/containers/* so we can see what mariadb says about itself, or get access to a system where this issue is seen.
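For context, the bootstrap command quoted in the description boils down to the following shape (reconstructed from the 'docker run' arguments above; the file list after "GRANT PROCESS ON" in the quoted error looks like a shell glob expansion of the template's *.*, so *.* is assumed here). Exit code 124 is what timeout(1) returns when its deadline expires, which points at one of the two timeout-wrapped lines:

    #!/bin/bash -e
    # Skip bootstrap entirely if the database was already initialized
    if [ -e /var/lib/mysql/mysql ]; then exit 0; fi
    kolla_start
    mysqld_safe --skip-networking --wsrep-on=OFF &
    # [A.1] (mysql.yaml L172): wait up to DB_MAX_TIMEOUT seconds for mysqld to answer
    timeout ${DB_MAX_TIMEOUT} /bin/bash -c 'until mysqladmin -uroot -p"${DB_ROOT_PASSWORD}" ping 2>/dev/null; do sleep 1; done'
    mysql -uroot -p"${DB_ROOT_PASSWORD}" -e "CREATE USER 'clustercheck'@'localhost' IDENTIFIED BY '${DB_CLUSTERCHECK_PASSWORD}';"
    mysql -uroot -p"${DB_ROOT_PASSWORD}" -e "GRANT PROCESS ON *.* TO 'clustercheck'@'localhost' WITH GRANT OPTION;"
    # [A.2] (mysql.yaml L175): give the clean shutdown the same deadline
    timeout ${DB_MAX_TIMEOUT} mysqladmin -uroot -p"${DB_ROOT_PASSWORD}" shutdown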
Thanks again to Mike Bayer and Sasha for their time today.

The problem is that mysqld_safe exits right away, and since we do not collect the exit code from it, we were misled into thinking it was the other commands hanging. More specifically, the problem is this:

170830 17:03:05 [Note] WSREP: Service thread queue flushed.
170830 17:03:05 [Note] WSREP: Assign initial position for certification: 3, protocol version: -1
170830 17:03:05 [Note] WSREP: wsrep_sst_grab()
170830 17:03:05 [Note] WSREP: Start replication
170830 17:03:05 [Note] WSREP: Setting initial position to 12bdc2af-8da5-11e7-bbd0-9be3013253cf:3
170830 17:03:05 [Note] WSREP: protonet asio version 0
170830 17:03:05 [Note] WSREP: Using CRC-32C for message checksums.
170830 17:03:05 [Note] WSREP: backend: asio
170830 17:03:05 [Note] WSREP: gcomm thread scheduling priority set to other:0
170830 17:03:05 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
170830 17:03:05 [Note] WSREP: restore pc from disk failed
170830 17:03:05 [Note] WSREP: GMCast version 0
170830 17:03:05 [Note] WSREP: (16ec5c12, 'tcp://172.17.1.13:4567') listening at tcp://172.17.1.13:4567
170830 17:03:05 [Note] WSREP: (16ec5c12, 'tcp://172.17.1.13:4567') multicast: , ttl: 1
170830 17:03:05 [Note] WSREP: EVS version 0
170830 17:03:05 [Note] WSREP: gcomm: connecting to group 'galera_cluster', peer 'overcloud-controller-0.internalapi.localdomain:'
170830 17:03:05 [ERROR] WSREP: failed to open gcomm backend connection: 131: No address to connect (FATAL) at gcomm/src/gmcast.cpp:connect_precheck():282
170830 17:03:05 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -131 (State not recoverable)
170830 17:03:05 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1380: Failed to open channel 'galera_cluster' at 'gcomm://overcloud-controller-0.internalapi.localdomain': -131 (State not recoverable)
170830 17:03:05 [ERROR] WSREP: gcs connect failed: State not recoverable
170830 17:03:05 [ERROR] WSREP: wsrep::connect() failed: 7
170830 17:03:05 [ERROR] Aborting
170830 17:03:05 [Note] WSREP: Service disconnected.
170830 17:03:06 [Note] WSREP: Some threads may fail to exit.
170830 17:03:06 [Note] /usr/libexec/mysqld: Shutdown complete
170830 17:03:06 mysqld_safe mysqld from pid file /var/lib/mysql/mariadb.pid ended

This issue is likely not seen upstream because of mariadb/galera package differences. Downstream we have the following:
mariadb-libs-5.5.56-2.el7.x86_64
galera-25.3.16-3.el7ost.x86_64
mariadb-galera-server-5.5.42-6.el7ost.x86_64
mariadb-5.5.56-2.el7.x86_64
mariadb-galera-common-5.5.42-6.el7ost.x86_64

Basically we need to pass an extra argument to mysqld_safe to make sure it does not start in galera mode. I'll link patches today.
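Conceptually the fix is a one-line change to the mysqld_safe invocation in the bootstrap command. A minimal sketch of the idea, assuming --wsrep-provider=none is the "extra argument" meant above (the exact change is in the linked reviews):

    # Before (from mysql.yaml): --wsrep-on=OFF alone does not stop these
    # mariadb-galera packages from opening the gcomm:// connection
    mysqld_safe --skip-networking --wsrep-on=OFF &

    # After (sketch): also clear the wsrep provider so the bootstrap
    # mysqld never tries to join the galera cluster
    mysqld_safe --skip-networking --wsrep-on=OFF --wsrep-provider=none &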
I was able to test Michele's upstream patch this afternoon using infrared, and it allowed me to create an Overcloud (Heat stack COMPLETE). See this patch: https://review.openstack.org/499242
For completeness, the upstream Pike backport is here: https://review.openstack.org/#/c/499623 (I'll remove the master one from the BZ links once the Pike one merges.)
Verified.

Environment:
openstack-tripleo-heat-templates-7.0.0-0.20170821194254.el7ost.noarch

The issue doesn't reproduce.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:3462