+++ This bug was initially created as a clone of Bug #1698630 +++

Description of problem:
Deployment of overcloud using CI job fails. See below link to the jenkins job.

Version-Release number of selected component (if applicable):
14.0-RHEL-7/2019-04-05.1

How reproducible:
Happened twice on the same host

Steps to Reproduce:
1. Run deployment using CI job:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-14_director-rhel-virthost-3cont_2comp-ipv4-geneve-dvr/build?delay=0sec
Leave all parameters as is, only specify your host in the IR_PROVISION_HOST field.

Actual results:
Deployment fails

Expected results:
Deployment succeeds

Additional info:
The overcloud install log finishes with the following:

"Debug: Processing report from controller-0.localdomain with processor Puppet::Reports::Store",
"stderr: + STEP=5",
"+ TAGS=file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::constraint::location",
"+ CONFIG='include ::tripleo::profile::base::pacemaker;include ::tripleo::profile::pacemaker::cinder::volume_bundle'",
"+ EXTRA_ARGS='--debug --verbose'",
"+ '[' -d /tmp/puppet-etc ']'",
"+ cp -a /tmp/puppet-etc/auth.conf /tmp/puppet-etc/hiera.yaml /tmp/puppet-etc/hieradata /tmp/puppet-etc/modules /tmp/puppet-etc/puppet.conf /tmp/puppet-etc/ssl /etc/puppet",
"+ echo '{\"step\": 5}'",
"+ export FACTER_deployment_type=containers",
"+ FACTER_deployment_type=containers",
"+ set +e",
"+ puppet apply --debug --verbose --verbose --detailed-exitcodes --summarize --color=false --modulepath /etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules --tags file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::constraint::location -e 'include ::tripleo::profile::base::pacemaker;include ::tripleo::profile::pacemaker::cinder::volume_bundle'",
"Warning: Undefined variable 'uuid'; ",
"   (file & line not available)",
"Warning: Undefined variable 'deploy_config_name'; ",
"Warning: ModuleLoader: module 'cinder' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules",
"Warning: This method is deprecated, please use the stdlib validate_legacy function,",
"   with Pattern[]. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/cinder/manifests/db.pp\", 69]:[\"/etc/puppet/modules/cinder/manifests/init.pp\", 325]",
"   (at /etc/puppet/modules/stdlib/lib/puppet/functions/deprecation.rb:28:in `deprecation')",
"   with Stdlib::Compat::Hash. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/cinder/manifests/config.pp\", 38]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/cinder.pp\", 127]",
"   with Stdlib::Compat::Bool. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/cinder/manifests/volume.pp\", 44]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/cinder/volume.pp\", 117]",
"Warning: Unknown variable: 'ensure'. at /etc/puppet/modules/cinder/manifests/volume.pp:64:18",
"Warning: ModuleLoader: module 'mysql' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules",

Overcloud configuration failed.

--- Additional comment from Roman Safronov on 2019-04-10 19:36:24 UTC ---

Links to the failed deployment results:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-14_director-rhel-virthost-3cont_2comp-ipv4-geneve-dvr/112/
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-14_director-rhel-virthost-3cont_2comp-ipv4-geneve-dvr/109/
Logs are available via the Build Artifacts links on these pages.
--- Additional comment from Alex Schultz on 2019-04-10 21:52:04 UTC ---

The bug report is incorrect around the failure cause. The deployment failed on compute-1 due to a duplicate key error during nova_cellv2_discover_hosts:

fatal: [compute-1]: FAILED! => {
    "failed_when_result": true,
    "outputs.stdout_lines | default([]) | union(outputs.stderr_lines | default([]))": [
        "Error running ['docker', 'run', '--name', 'nova_cellv2_discover_hosts', '--label', 'config_id=tripleo_step5', '--label', 'container_name=nova_cellv2_discover_hosts', '--label', 'managed_by=paunch', '--label', 'config_data={\"start_order\": 0, \"command\": \"/docker-config-scripts/nova_cell_v2_discover_host.py\", \"user\": \"root\", \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro\", \"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\", \"/etc/puppet:/etc/puppet:ro\", \"/var/lib/config-data/nova_libvirt/etc/my.cnf.d/:/etc/my.cnf.d/:ro\", \"/var/lib/config-data/nova_libvirt/etc/nova/:/etc/nova/:ro\", \"/var/log/containers/nova:/var/log/nova\", \"/var/lib/docker-config-scripts/:/docker-config-scripts/\"], \"image\": \"192.168.24.1:8787/rhosp14/openstack-nova-compute:2019-03-28.1\", \"detach\": false, \"net\": \"host\"}', '--net=host', '--user=root', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/var/lib/config-data/nova_libvirt/etc/my.cnf.d/:/etc/my.cnf.d/:ro', '--volume=/var/lib/config-data/nova_libvirt/etc/nova/:/etc/nova/:ro', '--volume=/var/log/containers/nova:/var/log/nova', '--volume=/var/lib/docker-config-scripts/:/docker-config-scripts/', '192.168.24.1:8787/rhosp14/openstack-nova-compute:2019-03-28.1', '/docker-config-scripts/nova_cell_v2_discover_host.py']. [1]",
...snip...
        "DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, u\"Duplicate entry 'compute-0.localdomain' for key 'uniq_host_mappings0host'\") [SQL: u'INSERT INTO host_mappings (created_at, updated_at, cell_id, host) VALUES (%(created_at)s, %(updated_at)s, %(cell_id)s, %(host)s)'] [parameters: {'host': u'compute-0.localdomain', 'cell_id': 5, 'created_at': datetime.datetime(2019, 4, 10, 15, 20, 50, 527925), 'updated_at': None}] (Background on this error at: http://sqlalche.me/e/gkpj)",

Should we be running the discover hosts on every compute node, or is that a bootstrap-only thing?

--- Additional comment from melanie witt on 2019-04-11 21:46:22 UTC ---

(In reply to Alex Schultz from comment #2)
> The bug report is incorrect around the failure cause. The deployment failed
> on compute-1 due to a duplicate key error during nova_cellv2_discover_hosts
>
> ...snip...
>
> Should we be running the discover hosts on every compute node or is that a
> bootstrap only thing?

Thanks for including that error snippet. I did some looking around and found a couple of similar issues [1][2]. Indeed this is [unfortunately] expected behavior when running discover_hosts in parallel, because of racing database updates for the same compute host.
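The race behind the DBDuplicateEntry above can be sketched in miniature. This is an illustrative model only (SQLite instead of MySQL, a toy schema mirroring the 'uniq_host_mappings0host' unique key, not nova's actual code): each discover_hosts run effectively checks "is this host mapped?" and then inserts, so two parallel runs can both pass the check and the loser hits the unique constraint.

```python
import sqlite3

# Toy stand-in for nova's host_mappings table, with a unique key on host
# analogous to 'uniq_host_mappings0host' in the traceback above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE host_mappings (
        id INTEGER PRIMARY KEY,
        cell_id INTEGER NOT NULL,
        host TEXT NOT NULL UNIQUE
    )
""")

def map_host(cell_id, host):
    # Roughly what each discover_hosts run does for an unmapped host.
    # Two parallel runs can both decide the host is unmapped, then race
    # on the INSERT; the second one violates the unique constraint.
    conn.execute(
        "INSERT INTO host_mappings (cell_id, host) VALUES (?, ?)",
        (cell_id, host))

map_host(5, "compute-0.localdomain")       # first run wins
try:
    map_host(5, "compute-0.localdomain")   # parallel run loses the race
except sqlite3.IntegrityError as exc:
    print("second run hit the unique key:", exc)
```

The real failure is the same shape, just surfaced as pymysql's error 1062 wrapped in oslo.db's DBDuplicateEntry.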
The guidance here is to deploy all of the compute hosts and then run discover_hosts once at the end. This will make discover_hosts map all of the compute hosts to cells in one go.

For informational purposes, do you know whether nova-manage discover_hosts is being run _on_ the compute hosts (by giving compute hosts credentials to the API database in their nova.conf), or whether it is being run centrally on the same non-compute host? I ask because if it's the former, we would need a distributed lock in the nova-manage command to address the racing issue, and we unfortunately don't have distributed lock support in nova at this time. If it's the latter, we could potentially add a lock to the database access in the nova-manage command if people would find it helpful. But ideally we hope that running discover_hosts once per batch of compute hosts will be acceptable for everyone.

[1] https://bugs.launchpad.net/openstack-ansible/+bug/1752540
[2] https://github.com/bloomberg/chef-bcpc/issues/1378

--- Additional comment from Alex Schultz on 2019-04-11 22:24:57 UTC ---

Martin Schuppert or Oli Walsh would be the folks to handle the question around the specifics of what this is supposed to do. From a framework standpoint we do have a way to execute things on a single node of a group of nodes (i.e. a compute bootstrap node).

https://review.openstack.org/#/c/633230/ recently reworked the python script that gets run via this container.

--- Additional comment from melanie witt on 2019-04-11 23:18:33 UTC ---

OK, thanks. On the nova side, I'll propose a change to catch the exception and log a warning explaining the situation, and augment the command help with more information about when to run discover_hosts.
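The catch-and-warn behavior proposed above can be sketched as follows. This is a hedged illustration, not nova's implementation: SQLite stands in for the real database, and `discover_host` is a hypothetical helper, but the control flow (treat a duplicate-key error as "another run already mapped this host" and warn instead of failing) matches the proposal.

```python
import logging
import sqlite3

LOG = logging.getLogger("discover_hosts")

# Toy mapping table; the real one lives in the nova API database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE host_mappings (host TEXT PRIMARY KEY, cell_id INTEGER)")

def discover_host(host, cell_id):
    """Map a compute host to a cell; tolerate a parallel run winning the race.

    Returns True if this run created the mapping, False if it already existed.
    """
    try:
        conn.execute(
            "INSERT INTO host_mappings (host, cell_id) VALUES (?, ?)",
            (host, cell_id))
        return True
    except sqlite3.IntegrityError:
        # Instead of blowing up the deployment step, log a warning: a
        # concurrent discover_hosts run already mapped this host.
        LOG.warning("Host %s is already mapped to a cell; another "
                    "discover_hosts run likely raced with this one.", host)
        return False
```

With this shape, a raced duplicate becomes a no-op plus a warning rather than a fatal error, which is what the deployment tooling needs while it still runs discovery from multiple places.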
Separately, I had also thought about whether we should just ignore duplicates in nova for this situation, but after discussing it in #openstack-nova we were thinking it's not really a good way of running this (having parallel discover_hosts competing to map compute hosts and hammering collisions into the database), so we'd prefer to log a warning instead, with guidance on how discover_hosts should be used.

--- Additional comment from Ollie Walsh on 2019-04-12 08:49:36 UTC ---

(In reply to melanie witt from comment #3)
...
> The guidance here is to deploy all of the compute hosts and then run
> discover_hosts once at the end. This will make discover_hosts map all of the
> compute hosts to cells in one go.

So originally that's how it worked: we ran discovery once at the very end on one of the controllers. However, there were cases where that wouldn't happen, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1562082.

(In reply to Alex Schultz from comment #4)
> Martin Schuppert or Oli Walsh would be the folks to handle the question
> around the specifics of what this is supposed to do. From a framework
> standpoint we do have a way to execute things on a single node of a group of
> nodes (ie a compute bootstrap node)

It's not that simple... For each compute we need to check that it has registered before running host discovery, otherwise it races. I expect we can split the task in two though:

In docker_config step4, on every compute, we start the nova-compute container and then start a (detach=false) container to wait for its service to appear in the service list.

In docker_config step5, on the bootstrap node only, we run discovery.

We could have multiple compute roles, so there is still the potential for collisions here; we should also have a retry loop in the task that runs host discovery.

Martin - WDYT?

--- Additional comment from Martin Schuppert on 2019-04-12 10:01:11 UTC ---

(In reply to Ollie Walsh from comment #6)
> (In reply to melanie witt from comment #3)
> ...
> > The guidance here is to deploy all of the compute hosts and then run
> > discover_hosts once at the end. This will make discover_hosts map all of the
> > compute hosts to cells in one go.
>
> So originally that's how it worked: we ran discovery once at the very end on
> one of the controllers. However there were cases where that wouldn't happen
> e.g https://bugzilla.redhat.com/show_bug.cgi?id=1562082.

Yes, in the past it ran on the nova api controller host. Just for reference, it was moved with [1] to the compute service to be able to deploy compute nodes without updating the controllers, and also for the split control plane scenario where we won't have controllers in the same stack as the computes.

[1] https://review.openstack.org/#/c/576481/

> (In reply to Alex Schultz from comment #4)
> > Martin Schuppert or Oli Walsh would be the folks to handle the question
> > around the specifics of what this is supposed to do. From a framework
> > standpoint we do have a way to execute things on a single node of a group of
> > nodes (ie a compute bootstrap node)
>
> It's not that simple....
>
> For each compute we need to check that it has registered before running host
> discovery, otherwise it races. I expect we can split the task in two though.
> In docker_config step4, on every compute, we start the nova-compute
> container and then start a (detach=false) container to wait for its service
> to appear in the service list.

They should be up pretty quickly, but yes, we could run a task like waiting for placement to be up.

> In docker_config step5, on the bootstrap node only, we run discovery.
>
> We could have multiple compute roles so there is still the potential for
> collisions here, so we should also have a retry loop in the task that runs
> host discovery.
>
> Martin - WDYT?

Yes, that sounds like a good way forward to reduce the possibility of hitting the issue. With the proposed change to nova to just report a warning, we'll not fail in that case.
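The two-step flow discussed above (step4: every compute waits until its service has registered; step5: the bootstrap node runs discovery with retries) could look roughly like the following. This is a sketch under stated assumptions: `list_services` and `discover` are hypothetical callables standing in for the real nova service-list query and the `nova-manage cell_v2 discover_hosts` invocation.

```python
import time

def wait_for_registration(list_services, host, timeout=300, interval=5):
    """Step 4 (runs on every compute): block until this host's
    nova-compute service appears in the service list."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if host in list_services():
            return True
        time.sleep(interval)
    return False

def discover_with_retries(discover, attempts=5, delay=10):
    """Step 5 (runs on the bootstrap node only): retry discovery,
    since multiple compute roles may still collide occasionally."""
    for _ in range(attempts):
        if discover():
            return True
        time.sleep(delay)
    return False
```

Ordering discovery after registration removes the per-host race; the retry loop covers the residual chance of collisions between bootstrap nodes of different compute roles.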
*** Bug 1711531 has been marked as a duplicate of this bug. ***
*** Bug 1710118 has been marked as a duplicate of this bug. ***
First of all, the initial fix was not in openstack-tripleo-heat-templates-8.3.1-16.el7ost; it got reshuffled and is in openstack-tripleo-heat-templates-8.3.1-19.el7ost. However, the backport [1] failed to trigger the discovery script via bootstrap_host_exec.

[1] https://github.com/openstack/tripleo-heat-templates/commit/64f80a0b458add7e5cd8bea267996a5868ec30fc#diff-f9fbae7caa7594d60063e6915d9f4990
[2] https://review.opendev.org/664573
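The effect of the missing bootstrap_host_exec wrapper can be sketched as follows. This is a hedged illustration, not the tripleo implementation: the function names are hypothetical, and the selection rule shown (first hostname in the sorted role membership) is a stand-in for however tripleo actually designates the bootstrap node. The point is only the gate: without it, every compute runs discovery in parallel and the duplicate-key race from this bug reappears.

```python
def is_bootstrap_host(role_hosts, my_hostname):
    # Hypothetical selection rule: treat the first host of the sorted
    # role membership as the bootstrap node. The real mechanism in
    # tripleo differs; only the "exactly one host runs this" gate matters.
    return sorted(role_hosts)[0] == my_hostname

def maybe_run_discovery(role_hosts, my_hostname, run_discovery):
    """Run cell_v2 host discovery on the bootstrap node only."""
    if is_bootstrap_host(role_hosts, my_hostname):
        return run_discovery()
    return None  # every other host skips, avoiding the parallel-run race
```

The missed backport effectively dropped this gate, so the discovery script ran unguarded on each compute, which [2] then corrected.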
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1738
Is the solution included in z9?
(In reply to Armin Morattab from comment #24)
> Does the solution included in z9?

This was released with z7 via [1] (openstack-tripleo-heat-templates-8.3.1-42.el7ost and higher), so yes, it is included in z9.

[1] https://access.redhat.com/errata/RHBA-2019:1738