Bug 1699393 - [OSP13] deployment fails running nova_cellv2_discover_hosts process with a duplicate key
Summary: [OSP13] deployment fails running nova_cellv2_discover_hosts process with a duplicate key
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z7
Target Release: 13.0 (Queens)
Assignee: Martin Schuppert
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Duplicates: 1710118 1711531
Depends On: 1698630 1700876 1711531
Blocks:
 
Reported: 2019-04-12 15:06 UTC by Martin Schuppert
Modified: 2023-12-15 16:26 UTC
CC List: 25 users

Fixed In Version: openstack-tripleo-heat-templates-8.3.1-42.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1698630
Environment:
Last Closed: 2019-07-10 13:05:11 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 655628 0 'None' MERGED Avoid concurrent nova cell_v2 discovery instances 2021-01-12 15:48:31 UTC
OpenStack gerrit 656074 0 'None' MERGED Run nova_cell_v2_discover_hosts.py on every deploy run 2021-01-12 15:48:31 UTC
OpenStack gerrit 664573 0 'None' MERGED Backport miss to run discovery via bootstrap_host_exec 2021-01-12 15:48:31 UTC
Red Hat Issue Tracker OSP-23657 0 None None None 2023-03-24 14:49:51 UTC
Red Hat Product Errata RHBA-2019:1738 0 None None None 2019-07-10 13:05:23 UTC

Description Martin Schuppert 2019-04-12 15:06:10 UTC
+++ This bug was initially created as a clone of Bug #1698630 +++

Description of problem:
Deployment of the overcloud using the CI job fails. See the link to the Jenkins job below.

Version-Release number of selected component (if applicable):
14.0-RHEL-7/2019-04-05.1

How reproducible:
Happened twice on the same host

Steps to Reproduce:
1.Run deployment using CI job: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-14_director-rhel-virthost-3cont_2comp-ipv4-geneve-dvr/build?delay=0sec

Leave all parameters as is; only specify your host in the IR_PROVISION_HOST field.


Actual results:
Deployment fails

Expected results:
Deployment succeeds

Additional info:

overcloud install log finishes with the following

        "Debug: Processing report from controller-0.localdomain with processor Puppet::Reports::Store", 
        "stderr: + STEP=5", 
        "+ TAGS=file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::constraint::location", 
        "+ CONFIG='include ::tripleo::profile::base::pacemaker;include ::tripleo::profile::pacemaker::cinder::volume_bundle'", 
        "+ EXTRA_ARGS='--debug --verbose'", 
        "+ '[' -d /tmp/puppet-etc ']'", 
        "+ cp -a /tmp/puppet-etc/auth.conf /tmp/puppet-etc/hiera.yaml /tmp/puppet-etc/hieradata /tmp/puppet-etc/modules /tmp/puppet-etc/puppet.conf /tmp/puppet-etc/ssl /etc/puppet", 
        "+ echo '{\"step\": 5}'", 
        "+ export FACTER_deployment_type=containers", 
        "+ FACTER_deployment_type=containers", 
        "+ set +e", 
        "+ puppet apply --debug --verbose --verbose --detailed-exitcodes --summarize --color=false --modulepath /etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules --tags file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::constraint::location -e 'include ::tripleo::profile::base::pacemaker;include ::tripleo::profile::pacemaker::cinder::volume_bundle'", 
        "Warning: Undefined variable 'uuid'; ", 
        "   (file & line not available)", 
        "Warning: Undefined variable 'deploy_config_name'; ", 
        "Warning: ModuleLoader: moduleOvercloud configuration failed.
 'cinder' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules", 
        "Warning: This method is deprecated, please use the stdlib validate_legacy function,", 
        "                    with Pattern[]. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/cinder/manifests/db.pp\", 69]:[\"/etc/puppet/modules/cinder/manifests/init.pp\", 325]", 
        "   (at /etc/puppet/modules/stdlib/lib/puppet/functions/deprecation.rb:28:in `deprecation')", 
        "                    with Stdlib::Compat::Hash. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/cinder/manifests/config.pp\", 38]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/cinder.pp\", 127]", 
        "                    with Stdlib::Compat::Bool. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/cinder/manifests/volume.pp\", 44]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/cinder/volume.pp\", 117]", 
        "Warning: Unknown variable: 'ensure'. at /etc/puppet/modules/cinder/manifests/volume.pp:64:18", 
        "Warning: ModuleLoader: module 'mysql' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules",

--- Additional comment from Roman Safronov on 2019-04-10 19:36:24 UTC ---

Links to the failed deployment results:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-14_director-rhel-virthost-3cont_2comp-ipv4-geneve-dvr/112/


https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-14_director-rhel-virthost-3cont_2comp-ipv4-geneve-dvr/109/

Logs available via Build Artifacts links on these pages.

--- Additional comment from Alex Schultz on 2019-04-10 21:52:04 UTC ---

The bug report is incorrect around the failure cause. The deployment failed on compute-1 due to a duplicate key error during nova_cellv2_discover_hosts.

fatal: [compute-1]: FAILED! => {
    "failed_when_result": true, 
    "outputs.stdout_lines | default([]) | union(outputs.stderr_lines | default([]))": [
        "Error running ['docker', 'run', '--name', 'nova_cellv2_discover_hosts', '--label', 'config_id=tripleo_step5', '--label', 'container_name=nova_cellv2_discover_hosts', '--label', 'manage
d_by=paunch', '--label', 'config_data={\"start_order\": 0, \"command\": \"/docker-config-scripts/nova_cell_v2_discover_host.py\", \"user\": \"root\", \"volumes\": [\"/etc/hosts:/etc/hosts:ro\",
 \"/etc/localtime:/etc/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro\", \"/etc/pki/tls/c
erts/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\
", \"/dev/log:/dev/log\", \"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\", \"/etc/puppet:/etc/puppet:ro\", \"/var/lib/config-data/nova_libvirt/etc/my.cnf.d/:/etc/my.cnf.d/:ro\", \"/var
/lib/config-data/nova_libvirt/etc/nova/:/etc/nova/:ro\", \"/var/log/containers/nova:/var/log/nova\", \"/var/lib/docker-config-scripts/:/docker-config-scripts/\"], \"image\": \"192.168.24.1:8787
/rhosp14/openstack-nova-compute:2019-03-28.1\", \"detach\": false, \"net\": \"host\"}', '--net=host', '--user=root', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime
:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro', '--volume=/etc/pki/tls/certs/ca-bun
dle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:
ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/var/lib/config-data/nova_libvirt/etc/my.cnf
.d/:/etc/my.cnf.d/:ro', '--volume=/var/lib/config-data/nova_libvirt/etc/nova/:/etc/nova/:ro', '--volume=/var/log/containers/nova:/var/log/nova', '--volume=/var/lib/docker-config-scripts/:/docke
r-config-scripts/', '192.168.24.1:8787/rhosp14/openstack-nova-compute:2019-03-28.1', '/docker-config-scripts/nova_cell_v2_discover_host.py']. [1]", 


...snip...
        "DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, u\"Duplicate entry 'compute-0.localdomain' for key 'uniq_host_mappings0host'\") [SQL: u'INSERT INTO host_mappings (created_at, updated_at, cell_id, host) VALUES (%(created_at)s, %(updated_at)s, %(cell_id)s, %(host)s)'] [parameters: {'host': u'compute-0.localdomain', 'cell_id': 5, 'created_at': datetime.datetime(2019, 4, 10, 15, 20, 50, 527925), 'updated_at': None}] (Background on this error at: http://sqlalche.me/e/gkpj)", 


Should we be running the discover hosts on every compute node or is that a bootstrap only thing?

--- Additional comment from melanie witt on 2019-04-11 21:46:22 UTC ---

(In reply to Alex Schultz from comment #2)
> The bug report is incorrect around the failure cause. The deployment failed
> on compute-1 due to a duplicate key error during nova_cellv2_discover_hosts
> 
> ...snip...
>
> Should we be running the discover hosts on every compute node or is that a
> bootstrap only thing?

Thanks for including that error snippet. I did some looking around and found a couple of similar issues [1][2].

Indeed this is [unfortunately] expected behavior if running discover_hosts in parallel because of racing database updates for the same compute host.

The guidance here is to deploy all of the compute hosts and then run discover_hosts once at the end. This will make discover_hosts map all of the compute hosts to cells in one go.

For informational purposes, do you know if nova-manage discover_hosts is being run _on_ compute hosts (by giving compute hosts credentials to the API database in their nova.conf) or is it being run centrally on the same non-compute host? I ask because if the former, we would need a distributed lock in the nova-manage command to address the racing issue and we unfortunately don't have distributed lock support in nova at this time. If it's the latter, we could potentially add a lock to the database access in the nova-manage command if people would find it helpful. But ideally we hope running discover_hosts once per compute host batch will be acceptable for everyone.

[1] https://bugs.launchpad.net/openstack-ansible/+bug/1752540
[2] https://github.com/bloomberg/chef-bcpc/issues/1378
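
A minimal sketch of that "deploy everything, then run discovery once" pattern, for reference - the wrapper function below is illustrative only; the nova-manage cell_v2 discover_hosts command itself is the standard CLI:

    # Illustrative sketch: run cell_v2 host discovery exactly once, from a single
    # host, after all compute nodes have been deployed. This avoids the
    # duplicate-key race seen when every compute runs discovery in parallel.
    import subprocess

    def discover_all_computes_once():
        # nova-manage needs access to the nova API database, so this would run on a
        # host that has it configured (e.g. a controller), not on each compute.
        subprocess.run(
            ["nova-manage", "cell_v2", "discover_hosts", "--verbose"],
            check=True,
        )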

--- Additional comment from Alex Schultz on 2019-04-11 22:24:57 UTC ---

Martin Schuppert or Oli Walsh would be the folks to handle the question around the specifics of what this is supposed to do. From a framework standpoint we do have a way to execute things on a single node of a group of nodes (i.e. a compute bootstrap node).

https://review.openstack.org/#/c/633230/ recently reworked the python script that gets run via this container.

--- Additional comment from melanie witt on 2019-04-11 23:18:33 UTC ---

OK, thanks. On the nova side, I'll propose a change to catch the exception and log a warning explaining the situation and augment the command help to have more information about when to run discover_hosts.

Separately, I had also thought about whether we should just ignore duplicates in nova for this situation, but after discussing it in #openstack-nova we were thinking it's not really a good way of running this (having parallel discover_hosts competing to map compute hosts and hammering collisions in the database), so we'd prefer to log a warning instead with guidance on how discover_hosts should be used.
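
A minimal sketch of what such a nova-side change could look like - illustrative only; DBDuplicateEntry is taken from the traceback above, while the function, object, and variable names are placeholders rather than nova's actual code:

    # Illustrative sketch: treat a concurrent mapping of the same compute host
    # as already done, log a warning, and continue instead of failing.
    from oslo_db import exception as db_exc
    from oslo_log import log as logging

    LOG = logging.getLogger(__name__)

    def map_compute_host(host_mapping, host, cell):
        # host_mapping, host and cell are hypothetical placeholders for the
        # objects used by "nova-manage cell_v2 discover_hosts".
        try:
            host_mapping.create()
        except db_exc.DBDuplicateEntry:
            LOG.warning("Host %s is already mapped to cell %s; a concurrent "
                        "discover_hosts run most likely created the mapping.",
                        host, cell)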

--- Additional comment from Ollie Walsh on 2019-04-12 08:49:36 UTC ---

(In reply to melanie witt from comment #3)
...
> The guidance here is to deploy all of the compute hosts and then run
> discover_hosts once at the end. This will make discover_hosts map all of the
> compute hosts to cells in one go.

So originally that's how it worked: we ran discovery once at the very end on one of the controllers. However, there were cases where that wouldn't happen, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1562082.



(In reply to Alex Schultz from comment #4)
> Martin Schuppert or Oli Walsh would be the folks to handle the question
> around the specifics of what this is supposed to do.  From a framework
> standpoint we do have a way to execute things on a single node of a group of
> nodes (ie a compute bootstrap node)

It's not that simple....

For each compute we need to check that it has registered before running host discovery, otherwise it races. I expect we can split the task in two though.
In docker_config step4, on every compute, we start the nova-compute container and then start a (detach=false) container to wait for its service to appear in the service list.
In docker_config step5, on the bootstrap node only, we run discovery.

We could have multiple compute roles, so there is still the potential for collisions here; we should also have a retry loop in the task that runs host discovery.

Martin - WDYT?
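
A minimal sketch of those two pieces - illustrative only, not the actual docker_config tasks; it assumes python-novaclient with a prepared keystone session for the wait step and the plain nova-manage CLI for the discovery step:

    import subprocess
    import time

    from novaclient import client as nova_client

    # Step 4 sketch (every compute): wait until this host's nova-compute
    # service has registered itself before host discovery is allowed to run.
    def wait_for_compute_registration(session, host, timeout=600, interval=10):
        nova = nova_client.Client("2.1", session=session)
        deadline = time.time() + timeout
        while time.time() < deadline:
            if nova.services.list(host=host, binary="nova-compute"):
                return
            time.sleep(interval)
        raise RuntimeError("nova-compute on %s never appeared in the service list" % host)

    # Step 5 sketch (compute bootstrap node only): run discovery with a retry
    # loop, since other compute roles may still race for the same host mappings.
    def run_discovery_with_retries(attempts=5, delay=30):
        for _ in range(attempts):
            result = subprocess.run(
                ["nova-manage", "cell_v2", "discover_hosts", "--verbose"])
            if result.returncode == 0:
                return
            time.sleep(delay)
        raise RuntimeError("cell_v2 host discovery kept failing")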

--- Additional comment from Martin Schuppert on 2019-04-12 10:01:11 UTC ---

(In reply to Ollie Walsh from comment #6)
> (In reply to melanie witt from comment #3)
> ...
> > The guidance here is to deploy all of the compute hosts and then run
> > discover_hosts once at the end. This will make discover_hosts map all of the
> > compute hosts to cells in one go.
> 
> So originally that's how it worked: we ran discovery once at the very end on
> one of the controllers. However there were cases where that wouldn't happen
> e.g https://bugzilla.redhat.com/show_bug.cgi?id=1562082.

Yes, in the past it ran on the nova api controller host. Just for reference,
it was moved with [1] to the compute service so that compute nodes can be deployed
without updating the controllers, and to support the split control plane scenario
where we won't have controllers in the same stack as the computes.

[1] https://review.openstack.org/#/c/576481/

> 
> 
> 
> (In reply to Alex Schultz from comment #4)
> > Martin Schuppert or Oli Walsh would be the folks to handle the question
> > around the specifics of what this is supposed to do.  From a framework
> > standpoint we do have a way to execute things on a single node of a group of
> > nodes (ie a compute bootstrap node)
> 
> It's not that simple....
> 
> For each compute we need to check that it has registered before running host
> discovery, otherwise it races. I expect we can split the task in two though.
> In docker_config step4, on every compute, we start the nova-compute
> container and then start a (detach=false) container to wait for its service
> to appear in the service list.

They should be up pretty quickly, but yes, we could run a task like waiting
for placement to be up.
 
> In docker_config step5, on the bootstrap node only, we run discovery.
> 
> We could have multiple compute roles so there is still the potential for
> collisions here, so we should also have a retry loop in the task that runs
> host discovery.
> 
> Martin - WDYT?

Yes, that sounds like a good way forward to reduce the chance of hitting the
issue. With the proposed change to nova to just report a warning, we won't
fail in that case.

Comment 4 Martin Schuppert 2019-05-20 06:32:36 UTC
*** Bug 1711531 has been marked as a duplicate of this bug. ***

Comment 10 Bob Fournier 2019-06-05 14:59:24 UTC
*** Bug 1710118 has been marked as a duplicate of this bug. ***

Comment 18 Martin Schuppert 2019-06-11 13:49:58 UTC
First of all, the initial fix was not in openstack-tripleo-heat-templates-8.3.1-16.el7ost; it got reshuffled and is in openstack-tripleo-heat-templates-8.3.1-19.el7ost.
However, the backport [1] missed running the discovery script via bootstrap_host_exec; this is addressed by [2].

[1] https://github.com/openstack/tripleo-heat-templates/commit/64f80a0b458add7e5cd8bea267996a5868ec30fc#diff-f9fbae7caa7594d60063e6915d9f4990
[2] https://review.opendev.org/664573

Comment 23 errata-xmlrpc 2019-07-10 13:05:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1738

Comment 24 Armin Morattab 2019-12-20 18:44:20 UTC
Is the solution included in z9?

Comment 25 Martin Schuppert 2019-12-30 14:05:39 UTC
(In reply to Armin Morattab from comment #24)
> Is the solution included in z9?

This was released with z7 via [1] (openstack-tripleo-heat-templates-8.3.1-42.el7ost and higher), so yes, it is included in z9.

[1] https://access.redhat.com/errata/RHBA-2019:1738

