Created attachment 1437382 [details]
Openstack long failures

Description of problem:
We are trying to deploy OSP13 with standalone BlockStorage:

parameter_defaults:
  BlockStorageCount: 1
  OvercloudBlockStorageFlavor: cinder
  ControllerCount: 1
  OvercloudControlFlavor: controller
  ComputeCount: 1
  OvercloudComputeFlavor: compute

The deployment stops at the following stage:

2018-05-16 00:57:18Z [overcloud.AllNodesDeploySteps.ControllerDeployment_Step1.0]: CREATE_FAILED Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
2018-05-16 00:57:18Z [overcloud.AllNodesDeploySteps.ControllerDeployment_Step1]: CREATE_FAILED Resource CREATE failed: Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
2018-05-16 00:57:18Z [overcloud.AllNodesDeploySteps.ControllerDeployment_Step1]: CREATE_FAILED Error: resources.ControllerDeployment_Step1.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 2
2018-05-16 00:57:19Z [overcloud.AllNodesDeploySteps]: CREATE_FAILED Resource CREATE failed: Error: resources.ControllerDeployment_Step1.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 2
2018-05-16 00:57:20Z [overcloud.AllNodesDeploySteps]: CREATE_FAILED Error: resources.AllNodesDeploySteps.resources.ControllerDeployment_Step1.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 2
2018-05-16 00:57:20Z [overcloud]: CREATE_FAILED Resource CREATE failed: Error: resources.AllNodesDeploySteps.resources.ControllerDeployment_Step1.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 2
Heat Stack create failed.
Heat Stack create failed.

 Stack overcloud CREATE_FAILED

overcloud.AllNodesDeploySteps.ControllerDeployment_Step1.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 02911b18-9498-4a63-9f52-5256f6917c54
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
            " os.utime(dst, (st.st_atime, st.st_mtime))",
            "OSError: [Errno 30] Read-only file system: '/etc/pki/ca-trust/extracted'",
            "stdout: f394156bde72e578364583b14f5c0b624836b13657609f1e690062ea24722059"
        ]
    }
        to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/6c4586d6-4a26-4e15-b995-ffaaec26acc2_playbook.retry
2018-05-16 00:57:20Z

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Deploy the undercloud on VMs
2. Deploy the overcloud with composable roles, using the following roles: Controller, Compute, BlockStorage

Actual results:
Heat Stack create failed.

Expected results:
Heat Stack created

Additional info:
I'm hitting this issue, and it's just 1 controller + 2 computes, all virtualized: "Error running ['docker', 'run', '--name', 'mysql_bootstrap', '--label', 'config_id=tripleo_step1', '--label', 'container_name=mysql_bootstrap', '--label', 'managed_by=paunch', '--label', 'config_data={\"start_order\": 1, \"image\": \"192.168.24.1:8787/rhosp13/openstack-mariadb:2018-05-25.1\", \"environment\": [\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\", \"KOLLA_BOOTSTRAP=True\", \"DB_MAX_TIMEOUT=60\", \"DB_CLUSTERCHECK_PASSWORD=g2MBvdwbrGWCecq9Tx9h2dMZT\", \"DB_ROOT_PASSWORD=DeRVs33bBu\", \"TRIPLEO_CONFIG_HASH=6fff9ad95bebaf4683dcd50016885e84\"], \"command\": [\"bash\", \"-ec\", \"if [ -e /var/lib/mysql/mysql ]; then exit 0; fi\\\\necho -e \\\\\"\\\\\\\\n[mysqld]\\\\\\\\nwsrep_provider=none\\\\\" >> /etc/my.cnf\\\\nkolla_set_configs\\\\nsudo -u mysql -E kolla_extend_start\\\\nmysqld_safe --skip-networking --wsrep-on=OFF &\\\\ntimeout ${DB_MAX_TIMEOUT} /bin/bash -c \\'until mysqladmin -uroot -p\\\\\"${DB_ROOT_PASSWORD}\\\\\" ping 2>/dev/null; do sleep 1; done\\'\\\\nmysql -uroot -p\\\\\"${DB_ROOT_PASSWORD}\\\\\" -e \\\\\"CREATE USER \\'clustercheck\\'@\\'localhost\\' IDENTIFIED BY \\'${DB_CLUSTERCHECK_PASSWORD}\\';\\\\\"\\\\nmysql -uroot -p\\\\\"${DB_ROOT_PASSWORD}\\\\\" -e \\\\\"GRANT PROCESS ON *.* TO \\'clustercheck\\'@\\'localhost\\' WITH GRANT OPTION;\\\\\"\\\\ntimeout ${DB_MAX_TIMEOUT} mysqladmin -uroot -p\\\\\"${DB_ROOT_PASSWORD}\\\\\" shutdown\"], \"user\": \"root\", \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\", \"/etc/puppet:/etc/puppet:ro\", 
\"/var/lib/kolla/config_files/mysql.json:/var/lib/kolla/config_files/config.json\", \"/var/lib/config-data/puppet-generated/mysql/:/var/lib/kolla/config_files/src:ro\", \"/var/lib/mysql:/var/lib/mysql\"], \"net\": \"host\", \"detach\": false}', '--env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS', '--env=KOLLA_BOOTSTRAP=True', '--env=DB_MAX_TIMEOUT=60', '--env=DB_CLUSTERCHECK_PASSWORD=g2MBvdwbrGWCecq9Tx9h2dMZT', '--env=DB_ROOT_PASSWORD=DeRVs33bBu', '--env=TRIPLEO_CONFIG_HASH=6fff9ad95bebaf4683dcd50016885e84', '--net=host', '--user=root', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/var/lib/kolla/config_files/mysql.json:/var/lib/kolla/config_files/config.json', '--volume=/var/lib/config-data/puppet-generated/mysql/:/var/lib/kolla/config_files/src:ro', '--volume=/var/lib/mysql:/var/lib/mysql', '192.168.24.1:8787/rhosp13/openstack-mariadb:2018-05-25.1', 'bash', '-ec', 'if [ -e /var/lib/mysql/mysql ]; then exit 0; fi\\necho -e \"\\\\n[mysqld]\\\\nwsrep_provider=none\" >> /etc/my.cnf\\nkolla_set_configs\\nsudo -u mysql -E kolla_extend_start\\nmysqld_safe --skip-networking --wsrep-on=OFF &\\ntimeout ${DB_MAX_TIMEOUT} /bin/bash -c \\'until mysqladmin -uroot -p\"${DB_ROOT_PASSWORD}\" ping 2>/dev/null; do sleep 1; done\\'\\nmysql -uroot -p\"${DB_ROOT_PASSWORD}\" -e \"CREATE USER \\'clustercheck\\'@\\'localhost\\' IDENTIFIED BY \\'${DB_CLUSTERCHECK_PASSWORD}\\';\"\\nmysql -uroot -p\"${DB_ROOT_PASSWORD}\" -e \"GRANT PROCESS ON *.* TO \\'clustercheck\\'@\\'localhost\\' WITH GRANT OPTION;\"\\ntimeout 
${DB_MAX_TIMEOUT} mysqladmin -uroot -p\"${DB_ROOT_PASSWORD}\" shutdown']. [2]", "stderr: INFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json", "INFO:__main__:Validating config file", "INFO:__main__:Kolla config strategy set to: COPY_ALWAYS", "INFO:__main__:Copying service configuration files", "INFO:__main__:Copying /dev/null to /etc/libqb/force-filesystem-sockets", "INFO:__main__:Setting permission for /etc/libqb/force-filesystem-sockets", "INFO:__main__:Deleting /etc/my.cnf.d/galera.cnf", "INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/my.cnf.d/galera.cnf to /etc/my.cnf.d/galera.cnf", "ERROR:__main__:Unexpected error:", "Traceback (most recent call last):", " File \"/usr/local/bin/kolla_set_configs\", line 411, in main", " execute_config_strategy(config)", " File \"/usr/local/bin/kolla_set_configs\", line 377, in execute_config_strategy", " copy_config(config)", " File \"/usr/local/bin/kolla_set_configs\", line 306, in copy_config", " config_file.copy()", " File \"/usr/local/bin/kolla_set_configs\", line 150, in copy", " self._merge_directories(source, dest)", " File \"/usr/local/bin/kolla_set_configs\", line 97, in _merge_directories", " os.path.join(dest, to_copy))", " File \"/usr/local/bin/kolla_set_configs\", line 92, in _merge_directories", " self._set_properties(source, dest)", " File \"/usr/local/bin/kolla_set_configs\", line 117, in _set_properties", " self._set_properties_from_file(source, dest)", " File \"/usr/local/bin/kolla_set_configs\", line 122, in _set_properties_from_file", " shutil.copystat(source, dest)", " File \"/usr/lib64/python2.7/shutil.py\", line 98, in copystat", " os.utime(dst, (st.st_atime, st.st_mtime))", "OSError: [Errno 30] Read-only file system: '/etc/pki/ca-trust/extracted'", "stdout: a90cb6fabce8e94fe42effe325a1350a9e47b134f211774acc3153a35a545a7f" ] } to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/71cc6674-d6d2-4b0f-ad9c-4a9e470b36bb_playbook.retry PLAY RECAP 
********************************************************************* localhost : ok=28 changed=13 unreachable=0 failed=1 deploy_stderr: |
The attachment listed in the BZ description and comment #1 both report the same failure, which occurs when starting the 'mysql_bootstrap' container from the openstack-mariadb container image. This is not a DF:Storage issue, so I'm reassigning to DFG:DF for further investigation.
PIDONE owns mysql
Hmmm, the error from the bug description seems to mean that when the transient container mysql_bootstrap is starting, kolla_init wants to either overwrite /etc/pki/ca-trust/extracted with some config file that would come from /var/lib/config-data/puppet-generated/mysql, or change the folder permissions. I can't think of any reason why this would happen without logs to analyze. Did you enable TLS everywhere when you deployed the overcloud? Can you provide the exact deployment command so I can try to reproduce it locally?
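One way to check the hypothesis above on an affected node is to look for entries in the puppet-generated tree whose destination lands on a path that the container bind-mounts read-only. The following is only a sketch: the function name is made up for illustration, and it assumes (based on the "Copying /var/lib/kolla/config_files/src/etc/... to /etc/..." log lines) that each entry under the source tree is copied to the same relative path under /.

```python
import os


def find_readonly_collisions(src_root, readonly_dests):
    """Return destination paths that sit on or below a read-only mount.

    src_root       -- tree that kolla_set_configs copies from, e.g.
                      /var/lib/config-data/puppet-generated/mysql
    readonly_dests -- container paths mounted with ":ro"
    Hypothetical helper, not part of kolla or TripleO.
    """
    ro = [os.path.normpath(p) for p in readonly_dests]
    collisions = []
    for dirpath, dirnames, filenames in os.walk(src_root):
        for name in dirnames + filenames:
            # Assumed mapping: <src_root>/etc/foo -> /etc/foo
            rel = os.path.relpath(os.path.join(dirpath, name), src_root)
            dest = os.path.normpath(os.path.join("/", rel))
            if any(dest == r or dest.startswith(r + os.sep) for r in ro):
                collisions.append(dest)
    return sorted(collisions)
```

Calling it with "/var/lib/config-data/puppet-generated/mysql" and the ":ro" destinations from the docker command in comment #1 (for example "/etc/pki/ca-trust/extracted") should return a non-empty list on a node that shows this failure, if the hypothesis holds.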
Jakub hit this issue earlier in a virtualized environment. We managed to fix it by making sure the *virt host machine* was NTP-synced. Note that it's not enough to provide correct NtpServers to the overcloud Heat stack. If the setup is virtual, the VMs will get their time from the hypervisor until they sync themselves via NTP. If the time is offset and there's an abrupt jump as the NTP sync takes effect, a wide variety of problems can appear. The only defense is to make sure that the virt host is NTP-synced too. As such, this is a problem with the virtualized setups we use. I reported RFEs in TripleO Quickstart and Infrared, asking that they at least validate that NTP on the virt host is synced, and refuse to run if not (or maybe even set up the NTP sync on the virt host too).
On second thought, if a bare metal machine goes into the deployment with an incorrectly set hardware clock, a similar problem could happen. It's probably worth investigating whether we can do something to make this safer, e.g. make sure the NTP sync is done early during deployment (and that we wait for its completion) before running the containerized puppet.
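To make the "validate that the clock is synced before running" idea concrete, here is a minimal sketch of an offset check using a single SNTP query. It is not what Quickstart, Infrared, or TripleO do; the function names and the 1-second tolerance are assumptions for illustration, and the offset ignores network round-trip delay, so it is only good for catching offsets of seconds or more.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_DELTA = 2208988800


def parse_transmit_time(packet):
    """Extract the server transmit timestamp (as Unix time) from an NTP reply."""
    if len(packet) < 48:
        raise ValueError("short NTP packet")
    # Bytes 40..47 hold the transmit timestamp: 32-bit seconds + 32-bit fraction.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_DELTA + fraction / 2.0 ** 32


def clock_offset(server, timeout=5.0):
    """Rough offset in seconds (server time minus local time) from one query."""
    request = b"\x1b" + 47 * b"\0"  # LI=0, VN=3, Mode=3 (client)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.settimeout(timeout)
        sock.sendto(request, (server, 123))
        reply, _ = sock.recvfrom(512)
    finally:
        sock.close()
    return parse_transmit_time(reply) - time.time()


def is_synced(offset, tolerance=1.0):
    """True if the absolute offset is within the (assumed) tolerance in seconds."""
    return abs(offset) <= tolerance
```

A pre-flight check on the virt host or on a bare metal node could then call is_synced(clock_offset("your.ntp.server")) and refuse to continue when it returns False.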
I'll take this back since it's NTP. We can do this in the deployment via host_prep_tasks for ntp to ensure we have a time sync early on in the deployment.
Thanks for taking this Alex. Yes, either host_prep_tasks, or if we want to continue using the Puppet module for NTP config for whatever reason, we could add an `exec` resource after it [1] with `tries`+`try_sleep` to wait until NTP is synced.

I looked at the code ordering a bit this afternoon, and the issue reported here should disappear if we ensure that we never enter the `docker-puppet.py` phase with unsynced NTP. The sync needs to be asserted either in host_prep_tasks or in step 1 of the puppet run on the host [2]. The docker-puppet.py phase [3] comes after both.

Just transferring my thoughts as I already spent a bit of time looking at this.

[1] https://github.com/openstack/puppet-tripleo/blob/cab0d34affeb171215e2bb288df7d478049e79cf/manifests/profile/base/time/ntp.pp#L29
[2] https://github.com/openstack/tripleo-heat-templates/blob/4286727ae70b1fa4ca6656c3f035afeac6eb2a95/common/deploy-steps-tasks.yaml#L156
[3] https://github.com/openstack/tripleo-heat-templates/blob/4286727ae70b1fa4ca6656c3f035afeac6eb2a95/common/deploy-steps-tasks.yaml#L184
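For readers unfamiliar with the `tries`+`try_sleep` pattern mentioned above, this is roughly the behaviour it gives: run a check repeatedly with a pause in between, and fail only if it never succeeds. The sketch below is a Python analogue, not Puppet code and not the eventual fix; the function name and default values are assumptions.

```python
import time


def wait_until(check, tries=20, try_sleep=3.0, sleep=time.sleep):
    """Call check() up to `tries` times, pausing `try_sleep` seconds between calls.

    Returns the attempt number on which check() first returned a true value,
    or raises RuntimeError if it never did. The `sleep` parameter exists only
    so the pause can be replaced in tests.
    """
    for attempt in range(1, tries + 1):
        if check():
            return attempt
        if attempt < tries:
            sleep(try_sleep)
    raise RuntimeError("condition not met after %d tries" % tries)
```

In the context of this bug, check would be whatever "NTP is synced" test is chosen, and the deployment would only proceed to the docker-puppet.py phase once wait_until returns.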
As we're eventually going to replace ntp with chrony for some configurations, I think the host prep task will work best for the service we're using. It's also something that we need to perform on the host as opposed to in the containers, so it makes the most sense there. The host_prep_tasks get run before any of the container items, so it'll make sure we have a synced time before we start working with the containers themselves: https://github.com/openstack/tripleo-heat-templates/blob/4286727ae70b1fa4ca6656c3f035afeac6eb2a95/common/deploy-steps.j2#L281
*** Bug 1592505 has been marked as a duplicate of this bug. ***
In my case I fixed it by setting the timezone to UTC on the hypervisor:

cp /usr/share/zoneinfo/UTC /etc/localtime

synchronizing against NTP:

ntpdate clock.redhat.com

and updating the hwclock:

hwclock --systohc

After this, redeploying the overcloud worked. Maybe the NTP sync in TripleO should happen first thing, before invoking docker-puppet at all? Thanks Damien for the super late hours debugging :)!
VERIFIED openstack-tripleo-heat-templates-9.0.1-0.20181013060907.el7ost.noarch
Please provide an update on this BZ.
Will test it today, thanks.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0045
For what it's worth, I seem to have hit this bug in Queens (openstack-tripleo-heat-templates-8.0.7-21.el7ost.noarch) when deploying with some new x86 nodes (I previously hit this on ppc64le). For me the nova containers cannot start. Here's a log snippet:

[heat-admin@compute-prod-1 ~]$ sudo docker logs nova_libvirt
...
+ sudo -E kolla_set_configs
INFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json
INFO:__main__:Validating config file
INFO:__main__:Kolla config strategy set to: COPY_ALWAYS
INFO:__main__:Copying service configuration files
INFO:__main__:Deleting /etc/libvirt/libvirtd.conf
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/libvirt/libvirtd.conf to /etc/libvirt/libvirtd.conf
INFO:__main__:Deleting /etc/libvirt/passwd.db
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/libvirt/passwd.db to /etc/libvirt/passwd.db
INFO:__main__:Deleting /etc/libvirt/qemu.conf
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/libvirt/qemu.conf to /etc/libvirt/qemu.conf
INFO:__main__:Deleting /etc/my.cnf.d/tripleo.cnf
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/my.cnf.d/tripleo.cnf to /etc/my.cnf.d/tripleo.cnf
INFO:__main__:Deleting /etc/nova/migration/authorized_keys
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/nova/migration/authorized_keys to /etc/nova/migration/authorized_keys
INFO:__main__:Deleting /etc/nova/migration/identity
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/nova/migration/identity to /etc/nova/migration/identity
INFO:__main__:Deleting /etc/nova/nova.conf
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/nova/nova.conf to /etc/nova/nova.conf
INFO:__main__:Deleting /etc/nova/secret.xml
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/nova/secret.xml to /etc/nova/secret.xml
ERROR:__main__:Unexpected error:
Traceback (most recent call last):
  File "/usr/local/bin/kolla_set_configs", line 411, in main
    execute_config_strategy(config)
  File "/usr/local/bin/kolla_set_configs", line 377, in execute_config_strategy
    copy_config(config)
  File "/usr/local/bin/kolla_set_configs", line 306, in copy_config
    config_file.copy()
  File "/usr/local/bin/kolla_set_configs", line 150, in copy
    self._merge_directories(source, dest)
  File "/usr/local/bin/kolla_set_configs", line 97, in _merge_directories
    os.path.join(dest, to_copy))
  File "/usr/local/bin/kolla_set_configs", line 97, in _merge_directories
    os.path.join(dest, to_copy))
  File "/usr/local/bin/kolla_set_configs", line 97, in _merge_directories
    os.path.join(dest, to_copy))
  File "/usr/local/bin/kolla_set_configs", line 92, in _merge_directories
    self._set_properties(source, dest)
  File "/usr/local/bin/kolla_set_configs", line 117, in _set_properties
    self._set_properties_from_file(source, dest)
  File "/usr/local/bin/kolla_set_configs", line 122, in _set_properties_from_file
    shutil.copystat(source, dest)
  File "/usr/lib64/python2.7/shutil.py", line 98, in copystat
    os.utime(dst, (st.st_atime, st.st_mtime))
OSError: [Errno 30] Read-only file system: '/etc/pki/ca-trust/extracted'

I created a dirty hack around this (at least I think I did) by adding an ntpdate and hwclock sync in a custom OS::TripleO::NodeUserData (tested by setting the hwclock on a node into the past and re-deploying).
As a workaround, when we faced a similar issue while installing the undercloud, we had to touch all files, dirs and symlinks, remove the containers, and re-run the install:

find /etc -exec touch -h {} +
docker rm -f $(docker ps -a -q)
rm -rf /var/lib/config-data/puppet-generated

openstack undercloud install --verbose
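Before touching everything under /etc, it may be useful to see which entries actually carry suspicious timestamps. The sketch below lists entries whose modification time lies in the future relative to the current clock. That future-dated files are the trigger is an assumption here, not something established in this bug, and the function name is made up for illustration. It uses lstat so that symlinks are inspected themselves, matching the `touch -h` in the workaround.

```python
import os
import time


def future_dated(root, now=None, slack=0.0):
    """Return paths under `root` whose mtime is later than `now` + `slack`.

    now   -- reference Unix time; defaults to the current system clock
    slack -- seconds of tolerance before an entry counts as future-dated
    Hypothetical diagnostic helper, not part of TripleO.
    """
    now = time.time() if now is None else now
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime  # do not follow symlinks
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if mtime > now + slack:
                hits.append(path)
    return sorted(hits)
```

Running future_dated("/etc") as root on an affected undercloud before and after the `find /etc -exec touch -h {} +` step would show whether the workaround changed anything relevant.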
(In reply to Aram Alipoor from comment #40)
> As a workaround when faced a similar issue when installing undercloud we had
> to touch all files, dirs and symlinks, remove containers and re-run the
> install:
>
>
> find /etc -exec touch -h {} +
> docker rm -f $(docker ps -a -q)
> rm -rf /var/lib/config-data/puppet-generated
>
> openstack undercloud install --verbose

Very interesting. I hit this on upstream Rocky twice now. And this does indeed fix the issue for the Undercloud. Possibly need a new BZ for this though. I'll create one if I can get a reliable reproducer.
If it's useful, this is how I worked around it, by running the following on first boot. Include -e userdata_env.yaml in the deploy command and set your NTP server variable in network-environment.yaml, e.g.

  NtpServer: 'your.ntp.server'

Contents of userdata_env.yaml:

resource_registry:
  OS::TripleO::NodeUserData: userdata_custom.yaml

Contents of userdata_custom.yaml:

parameters:
  NtpServer:
    description: NTP server to use to sync hw clock bz#1578849
    type: string
    default: pool.ntp.org

description: >
  Do stuff on first boot

resources:
  userdata:
    type: OS::Heat::MultipartMime
    properties:
      parts:
      - config: {get_resource: sync_hw_clock_config}

  sync_hw_clock_config:
    type: OS::Heat::SoftwareConfig
    properties:
      config:
        str_replace:
          template: |
            #!/bin/bash
            echo "pre" > /root/ntp.results
            echo "$NTPSERVER" >> /root/ntp.results
            date >> /root/ntp.results
            hwclock >> /root/ntp.results
            systemctl stop ntpd
            ntpdate $NTPSERVER
            hwclock --systohc
            echo "post" >> /root/ntp.results
            date >> /root/ntp.results
            hwclock >> /root/ntp.results
          params:
            $NTPSERVER: {get_param: NtpServer}

outputs:
  OS::stack_id:
    value: {get_resource: userdata}
I'm re-opening this because I have also hit this issue and am not convinced the backport is in effect. In the upstream commit, I see:

  EnablePackageInstall:
    default: 'false'
    description: Set to true to enable package installation at deploy time
    type: boolean

Therefore, unless we explicitly set this boolean to true, this patch won't take effect. Does this make sense?
Please file a new bug. Once it's been closed in errata we can't reopen this. Also the EnablePackageInstall has no effect here because the package should already be on the image. When you open a new bug, please include all the logs and reproducer information.
(In reply to Alex Schultz from comment #46)
> Please file a new bug. Once it's been closed in errata we can't reopen this.
> Also the EnablePackageInstall has no effect here because the package should
> already be on the image. When you open a new bug, please include all the
> logs and reproducer information.

OK, we've had to work around this with the firstboot fix like other reporters, so we don't have logs for this any more. The package might already be on the image, but clearly it's not being configured soon enough.
It's likely that the hardware time itself is off when the system is provisioned. The host_prep_tasks are run first in the software deployment, so from a deployment framework standpoint it's about as early as we can get. We could try adding hwclock to the host prep tasks as well. That's the only difference between the patch and what was mentioned in comment 44.
If you come here because you have docker containers stuck in 'restarting' and you have the error:

INFO:__main__:Deleting /etc/nova/secret.xml
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/nova/secret.xml to /etc/nova/secret.xml
ERROR:__main__:Unexpected error:
Traceback (most recent call last):
  File "/usr/local/bin/kolla_set_configs", line 411, in main
    execute_config_strategy(config)
  File "/usr/local/bin/kolla_set_configs", line 377, in execute_config_strategy
    copy_config(config)
  File "/usr/local/bin/kolla_set_configs", line 306, in copy_config
    config_file.copy()
  File "/usr/local/bin/kolla_set_configs", line 150, in copy
    self._merge_directories(source, dest)
  File "/usr/local/bin/kolla_set_configs", line 97, in _merge_directories
    os.path.join(dest, to_copy))
  File "/usr/local/bin/kolla_set_configs", line 97, in _merge_directories
    os.path.join(dest, to_copy))
  File "/usr/local/bin/kolla_set_configs", line 97, in _merge_directories
    os.path.join(dest, to_copy))
  File "/usr/local/bin/kolla_set_configs", line 92, in _merge_directories
    self._set_properties(source, dest)
  File "/usr/local/bin/kolla_set_configs", line 117, in _set_properties
    self._set_properties_from_file(source, dest)
  File "/usr/local/bin/kolla_set_configs", line 122, in _set_properties_from_file
    shutil.copystat(source, dest)
  File "/usr/lib64/python2.7/shutil.py", line 98, in copystat
    os.utime(dst, (st.st_atime, st.st_mtime))
OSError: [Errno 30] Read-only file system: '/etc/pki/ca-trust/extracted'

You might be facing this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1794119