Bug 1698630

Summary: [OSP15] deployment fails running nova_cellv2_discover_hosts process with a duplicate key
Product: Red Hat OpenStack Reporter: Roman Safronov <rsafrono>
Component: openstack-tripleo-heat-templates    Assignee: Martin Schuppert <mschuppe>
Status: CLOSED ERRATA QA Contact: Joe H. Rahme <jhakimra>
Severity: high Docs Contact:
Priority: high    
Version: 14.0 (Rocky)    CC: aschultz, jjoyce, jschluet, lyarwood, mburns, mcornea, michele, mschuppe, mwitt, owalsh, slinaber, tvignaud
Target Milestone: rc    Keywords: Triaged, ZStream
Target Release: 15.0 (Stein)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-10.5.1-0.20190429000408.3415df5.el8ost Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1699393 1700876 1711531 (view as bug list) Environment:
Last Closed: 2019-09-21 11:21:11 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1699393, 1700876, 1711531    

Description Roman Safronov 2019-04-10 19:33:14 UTC
Description of problem:
Deployment of overcloud using CI job fails. See below link to the jenkins job.

Version-Release number of selected component (if applicable):
14.0-RHEL-7/2019-04-05.1

How reproducible:
Happened twice on the same host

Steps to Reproduce:
1. Run deployment using CI job: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-14_director-rhel-virthost-3cont_2comp-ipv4-geneve-dvr/build?delay=0sec

Leave all parameters as-is; only specify your host in the IR_PROVISION_HOST field.


Actual results:
Deployment fails

Expected results:
Deployment succeeds

Additional info:

The overcloud install log ends with the following:

        "Debug: Processing report from controller-0.localdomain with processor Puppet::Reports::Store", 
        "stderr: + STEP=5", 
        "+ TAGS=file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::constraint::location", 
        "+ CONFIG='include ::tripleo::profile::base::pacemaker;include ::tripleo::profile::pacemaker::cinder::volume_bundle'", 
        "+ EXTRA_ARGS='--debug --verbose'", 
        "+ '[' -d /tmp/puppet-etc ']'", 
        "+ cp -a /tmp/puppet-etc/auth.conf /tmp/puppet-etc/hiera.yaml /tmp/puppet-etc/hieradata /tmp/puppet-etc/modules /tmp/puppet-etc/puppet.conf /tmp/puppet-etc/ssl /etc/puppet", 
        "+ echo '{\"step\": 5}'", 
        "+ export FACTER_deployment_type=containers", 
        "+ FACTER_deployment_type=containers", 
        "+ set +e", 
        "+ puppet apply --debug --verbose --verbose --detailed-exitcodes --summarize --color=false --modulepath /etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules --tags file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::constraint::location -e 'include ::tripleo::profile::base::pacemaker;include ::tripleo::profile::pacemaker::cinder::volume_bundle'", 
        "Warning: Undefined variable 'uuid'; ", 
        "   (file & line not available)", 
        "Warning: Undefined variable 'deploy_config_name'; ", 
        "Warning: ModuleLoader: moduleOvercloud configuration failed.
 'cinder' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules", 
        "Warning: This method is deprecated, please use the stdlib validate_legacy function,", 
        "                    with Pattern[]. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/cinder/manifests/db.pp\", 69]:[\"/etc/puppet/modules/cinder/manifests/init.pp\", 325]", 
        "   (at /etc/puppet/modules/stdlib/lib/puppet/functions/deprecation.rb:28:in `deprecation')", 
        "                    with Stdlib::Compat::Hash. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/cinder/manifests/config.pp\", 38]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/cinder.pp\", 127]", 
        "                    with Stdlib::Compat::Bool. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/cinder/manifests/volume.pp\", 44]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/cinder/volume.pp\", 117]", 
        "Warning: Unknown variable: 'ensure'. at /etc/puppet/modules/cinder/manifests/volume.pp:64:18", 
        "Warning: ModuleLoader: module 'mysql' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules",

Comment 2 Alex Schultz 2019-04-10 21:52:04 UTC
The bug report is incorrect about the failure cause. The deployment failed on compute-1 due to a duplicate key error during nova_cellv2_discover_hosts:

fatal: [compute-1]: FAILED! => {
    "failed_when_result": true, 
    "outputs.stdout_lines | default([]) | union(outputs.stderr_lines | default([]))": [
        "Error running ['docker', 'run', '--name', 'nova_cellv2_discover_hosts', '--label', 'config_id=tripleo_step5', '--label', 'container_name=nova_cellv2_discover_hosts', '--label', 'manage
d_by=paunch', '--label', 'config_data={\"start_order\": 0, \"command\": \"/docker-config-scripts/nova_cell_v2_discover_host.py\", \"user\": \"root\", \"volumes\": [\"/etc/hosts:/etc/hosts:ro\",
 \"/etc/localtime:/etc/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro\", \"/etc/pki/tls/c
erts/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\
", \"/dev/log:/dev/log\", \"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\", \"/etc/puppet:/etc/puppet:ro\", \"/var/lib/config-data/nova_libvirt/etc/my.cnf.d/:/etc/my.cnf.d/:ro\", \"/var
/lib/config-data/nova_libvirt/etc/nova/:/etc/nova/:ro\", \"/var/log/containers/nova:/var/log/nova\", \"/var/lib/docker-config-scripts/:/docker-config-scripts/\"], \"image\": \"192.168.24.1:8787
/rhosp14/openstack-nova-compute:2019-03-28.1\", \"detach\": false, \"net\": \"host\"}', '--net=host', '--user=root', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime
:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro', '--volume=/etc/pki/tls/certs/ca-bun
dle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:
ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/var/lib/config-data/nova_libvirt/etc/my.cnf
.d/:/etc/my.cnf.d/:ro', '--volume=/var/lib/config-data/nova_libvirt/etc/nova/:/etc/nova/:ro', '--volume=/var/log/containers/nova:/var/log/nova', '--volume=/var/lib/docker-config-scripts/:/docke
r-config-scripts/', '192.168.24.1:8787/rhosp14/openstack-nova-compute:2019-03-28.1', '/docker-config-scripts/nova_cell_v2_discover_host.py']. [1]", 


...snip...
        "DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, u\"Duplicate entry 'compute-0.localdomain' for key 'uniq_host_mappings0host'\") [SQL: u'INSERT INTO host_mappings (created_at, updated_at, cell_id, host) VALUES (%(created_at)s, %(updated_at)s, %(cell_id)s, %(host)s)'] [parameters: {'host': u'compute-0.localdomain', 'cell_id': 5, 'created_at': datetime.datetime(2019, 4, 10, 15, 20, 50, 527925), 'updated_at': None}] (Background on this error at: http://sqlalche.me/e/gkpj)", 


Should we be running the discover hosts on every compute node or is that a bootstrap only thing?

Comment 3 melanie witt 2019-04-11 21:46:22 UTC
(In reply to Alex Schultz from comment #2)
> The bug report is incorrect about the failure cause. The deployment failed
> on compute-1 due to a duplicate key error during nova_cellv2_discover_hosts
> 
> ...snip...
>
> Should we be running the discover hosts on every compute node or is that a
> bootstrap only thing?

Thanks for including that error snippet. I did some looking around and found a couple of similar issues [1][2].

Indeed, this is [unfortunately] expected behavior when running discover_hosts in parallel, because of racing database updates for the same compute host.

The guidance here is to deploy all of the compute hosts and then run discover_hosts once at the end. This will make discover_hosts map all of the compute hosts to cells in one go.

For informational purposes, do you know if nova-manage discover_hosts is being run _on_ compute hosts (by giving compute hosts credentials to the API database in their nova.conf) or is it being run centrally on the same non-compute host? I ask because if the former, we would need a distributed lock in the nova-manage command to address the racing issue and we unfortunately don't have distributed lock support in nova at this time. If it's the latter, we could potentially add a lock to the database access in the nova-manage command if people would find it helpful. But ideally we hope running discover_hosts once per compute host batch will be acceptable for everyone.

[1] https://bugs.launchpad.net/openstack-ansible/+bug/1752540
[2] https://github.com/bloomberg/chef-bcpc/issues/1378
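To make the race concrete, here is a minimal, self-contained sketch (plain sqlite3, not nova code; the table and host name just mirror the traceback in comment 2). Two discover_hosts-style runs both check for an existing host mapping before either has inserted one, so the second insert trips the unique constraint, analogous to uniq_host_mappings0host:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE host_mappings (host TEXT UNIQUE, cell_id INTEGER)")

HOST = "compute-0.localdomain"

def already_mapped():
    # The "check" half of a check-then-insert pattern.
    return conn.execute(
        "SELECT 1 FROM host_mappings WHERE host = ?", (HOST,)).fetchone()

# Both runs perform their check before either has inserted (the race window).
seen_by_run_1 = already_mapped()   # None -> "not mapped yet"
seen_by_run_2 = already_mapped()   # None -> "not mapped yet"

for run, seen in (("run-1", seen_by_run_1), ("run-2", seen_by_run_2)):
    if seen is None:
        try:
            conn.execute(
                "INSERT INTO host_mappings (host, cell_id) VALUES (?, 5)",
                (HOST,))
            print(run, "mapped", HOST)
        except sqlite3.IntegrityError as exc:
            # Analogue of the DBDuplicateEntry in comment 2's traceback.
            print(run, "hit duplicate key:", exc)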

Comment 4 Alex Schultz 2019-04-11 22:24:57 UTC
Martin Schuppert or Oli Walsh would be the folks to handle the question around the specifics of what this is supposed to do. From a framework standpoint we do have a way to execute things on a single node of a group of nodes (i.e. a compute bootstrap node).

https://review.openstack.org/#/c/633230/ recently reworked the python script that gets run via this container.

Comment 5 melanie witt 2019-04-11 23:18:33 UTC
OK, thanks. On the nova side, I'll propose a change to catch the exception and log a warning explaining the situation and augment the command help to have more information about when to run discover_hosts.

Separately, I had also thought about whether we should just ignore duplicates in nova for this situation, but after discussing it in #openstack-nova we were thinking it's not really a good way of running this (having parallel discover_hosts competing to map compute hosts and hammering collisions in the database), so we'd prefer to log a warning instead with guidance on how discover_hosts should be used.
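A minimal sketch of the catch-and-warn approach described above (not the actual upstream patch; it assumes oslo.db is installed and uses a hypothetical run_discovery placeholder for whatever performs the real cell v2 host discovery):

import logging

from oslo_db.exception import DBDuplicateEntry  # exception surfaced in comment 2

LOG = logging.getLogger(__name__)

def discover_hosts_tolerating_duplicates(run_discovery):
    """Run host discovery, downgrading duplicate host mappings to a warning."""
    try:
        run_discovery()
    except DBDuplicateEntry:
        # A parallel discover_hosts run most likely mapped this compute host
        # first; warn instead of failing the whole deployment step.
        LOG.warning("Duplicate host mapping detected; another discover_hosts "
                    "run probably mapped this host already. See the command "
                    "help for guidance on when to run discover_hosts.")
        return False
    return True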

Comment 6 Ollie Walsh 2019-04-12 08:49:36 UTC
(In reply to melanie witt from comment #3)
...
> The guidance here is to deploy all of the compute hosts and then run
> discover_hosts once at the end. This will make discover_hosts map all of the
> compute hosts to cells in one go.

So originally that's how it worked: we ran discovery once at the very end on one of the controllers. However, there were cases where that wouldn't happen, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1562082.



(In reply to Alex Schultz from comment #4)
> Martin Schuppert or Oli Walsh would be the folks to handle the question
> around the specifics of what this is supposed to do.  From a framework
> standpoint we do have a way to execute things on a single node of a group of
> nodes (ie a compute bootstrap node)

It's not that simple....

For each compute we need to check that it has registered before running host discovery, otherwise it races. I expect we can split the task in two though:
In docker_config step4, on every compute, we start the nova-compute container and then start a (detach=false) container to wait for its service to appear in the service list (see the sketch after this comment).
In docker_config step5, on the bootstrap node only, we run discovery.

We could have multiple compute roles so there is still the potential for collisions here, so we should also have a retry loop in the task that runs host discovery.

Martin - WDYT?
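A rough sketch of the step4 "wait for the service to register" idea, assuming openstacksdk and a configured cloud entry; the cloud name "overcloud" and the use of the local hostname are assumptions, and the real container would use the credentials from the mounted nova.conf instead:

import socket
import time

import openstack

def wait_for_nova_compute(conn, host, timeout=600, interval=10):
    """Poll the compute service list until this host's nova-compute appears."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        for svc in conn.compute.services():
            if svc.binary == "nova-compute" and svc.host == host:
                return True
        time.sleep(interval)
    return False

if __name__ == "__main__":
    conn = openstack.connect(cloud="overcloud")   # assumed cloud name
    if not wait_for_nova_compute(conn, socket.gethostname()):
        raise SystemExit("nova-compute never appeared in the service list")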

Comment 7 Martin Schuppert 2019-04-12 10:01:11 UTC
(In reply to Ollie Walsh from comment #6)
> (In reply to melanie witt from comment #3)
> ...
> > The guidance here is to deploy all of the compute hosts and then run
> > discover_hosts once at the end. This will make discover_hosts map all of the
> > compute hosts to cells in one go.
> 
> So originally that's how it worked: we ran discovery once at the very end on
> one of the controllers. However there were cases where that wouldn't happen
> e.g https://bugzilla.redhat.com/show_bug.cgi?id=1562082.

Yes, in the past it ran on the nova api controller host. Just for reference,
it was moved with [1] to the compute service to also be able to deploy compute
nodes without updating the controllers and also for the split control plane scenario
where we won't have controllers in the same stack as the computes.

[1] https://review.openstack.org/#/c/576481/

> 
> 
> 
> (In reply to Alex Schultz from comment #4)
> > Martin Schuppert or Oli Walsh would be the folks to handle the question
> > around the specifics of what this is supposed to do.  From a framework
> > standpoint we do have a way to execute things on a single node of a group of
> > nodes (ie a compute bootstrap node)
> 
> It's not that simple....
> 
> For each compute we need to check that it has registered before running host
> discovery, otherwise it races. I expect we can split the task in two though:
> In docker_config step4, on every compute, we start the nova-compute
> container and then start a (detach=false) container to wait for its service
> to appear in the service list.

They should be up pretty quick, but yes we could run a task like waiting
for placement to be up.
 
> In docker_config step5, on the bootstrap node only, we run discovery.
> 
> We could have multiple compute roles so there is still the potential for
> collisions here, so we should also have a retry loop in the task that runs
> host discovery.
> 
> Martin - WDYT?

Yes, that sounds like a good way forward to reduce the possibility of hitting the
issue. With the proposed change to nova to just report a warning, we won't
fail in that case.
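A sketch of the retry loop discussed above for the bootstrap-node discovery task. The nova-manage command is used as a stand-in for the containerized nova_cell_v2_discover_host.py invocation, and the attempt count and delay are arbitrary illustrative values:

import subprocess
import time

DISCOVERY_CMD = ["nova-manage", "cell_v2", "discover_hosts"]  # placeholder for the containerized script

def run_discovery_with_retries(attempts=5, delay=30):
    """Retry host discovery to ride out transient collisions between compute roles."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(DISCOVERY_CMD)
        if result.returncode == 0:
            return True
        print("discover_hosts attempt %d failed (rc=%d), retrying in %ds"
              % (attempt, result.returncode, delay))
        time.sleep(delay)
    return False

if __name__ == "__main__":
    if not run_discovery_with_retries():
        raise SystemExit(1)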

Comment 10 Martin Schuppert 2019-04-24 11:24:40 UTC
*** Bug 1702494 has been marked as a duplicate of this bug. ***

Comment 15 melanie witt 2019-06-14 14:45:58 UTC
Just adding a note that the nova patch to warn and return status code 2 from nova-manage when duplicate compute hosts are encountered during discover_hosts merged upstream yesterday:

https://review.opendev.org/651947
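With that change in place, a deployment-side wrapper can treat status code 2 as "host already mapped" rather than as a failure. A hypothetical sketch of such a caller:

import subprocess

rc = subprocess.run(["nova-manage", "cell_v2", "discover_hosts"]).returncode
if rc == 0:
    print("host discovery completed")
elif rc == 2:
    # Per the merged patch, 2 means duplicates were encountered: another run
    # already mapped this compute host, so there is nothing left to do.
    print("duplicate host mapping reported; host was already mapped")
else:
    raise SystemExit("discover_hosts failed with rc=%d" % rc)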

Comment 17 Joe H. Rahme 2019-08-01 16:09:51 UTC

(undercloud) [stack@undercloud-0 ~]$  ansible -i /usr/bin/tripleo-ansible-inventory Compute -b -a 'grep -A 20 nova_cell_v2_discover_hosts.py /var/log/messages'
 [ERROR]: /usr/lib/python3.6/site-packages/urllib3/connection.py:374: SubjectAltNameWarning: Certificate for 192.168.24.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major
browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)   SubjectAltNameWarning /usr/lib/python3.6/site-packages/urllib3/connection.py:374: SubjectAltNameWarning: Certificate for 192.168.24.2
has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
SubjectAltNameWarning /usr/lib/python3.6/site-packages/urllib3/connection.py:374: SubjectAltNameWarning: Certificate for 192.168.24.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being
removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)   SubjectAltNameWarning

[DEPRECATION WARNING]: Distribution redhat 8.0 on host compute-1 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using
 the discovered platform python for this host. See https://docs.ansible.com/ansible/2.8/reference_appendices/interpreter_discovery.html for more information. This feature will be removed in version 2.12. Deprecation warnings can be 
disabled by setting deprecation_warnings=False in ansible.cfg.
compute-1 | CHANGED | rc=0 >>
Jul 31 10:06:41 compute-1 ansible-command[135920]: Invoked with _raw_params=grep -A 20 nova_cell_v2_discover_hosts.py /var/log/messages warn=True _uses_shell=False stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Jul 31 10:07:42 compute-1 systemd[1]: Starting iscsid healthcheck...
Jul 31 10:07:42 compute-1 systemd[1]: Starting logrotate_crond healthcheck...
Jul 31 10:07:42 compute-1 systemd-logind[909]: Session 23 logged out. Waiting for processes to exit.
Jul 31 10:07:42 compute-1 systemd-logind[909]: Removed session 23.
Jul 31 10:07:42 compute-1 systemd[1]: user-runtime-dir: Unit not needed anymore. Stopping.
Jul 31 10:07:42 compute-1 systemd[1]: Stopping User Manager for UID 1000...
Jul 31 10:07:42 compute-1 systemd[135796]: Stopped target Default.
Jul 31 10:07:42 compute-1 systemd[135796]: Stopped target Basic System.
Jul 31 10:07:42 compute-1 systemd[135796]: Stopped target Paths.
Jul 31 10:07:42 compute-1 systemd[135796]: Stopped target Sockets.
Jul 31 10:07:42 compute-1 systemd[135796]: Closed D-Bus User Message Bus Socket.
Jul 31 10:07:42 compute-1 systemd[135796]: Stopped target Timers.
Jul 31 10:07:42 compute-1 systemd[135796]: Stopped Mark boot as successful after the user session has run 2 minutes.
Jul 31 10:07:42 compute-1 systemd[135796]: Reached target Shutdown.
Jul 31 10:07:42 compute-1 systemd[135796]: Starting Exit the Session...
Jul 31 10:07:42 compute-1 systemd[1]: user-runtime-dir: Unit not needed anymore. Stopping.
Jul 31 10:07:42 compute-1 systemd[1]: Stopped User Manager for UID 1000.
Jul 31 10:07:42 compute-1 systemd[1]: user-runtime-dir: Unit not needed anymore. Stopping.
Jul 31 10:07:42 compute-1 systemd[1]: Removed slice User Slice of UID 1000.
Jul 31 10:07:42 compute-1 systemd[1]: Stopping /run/user/1000 mount wrapper...

[DEPRECATION WARNING]: Distribution redhat 8.0 on host compute-0 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using
 the discovered platform python for this host. See https://docs.ansible.com/ansible/2.8/reference_appendices/interpreter_discovery.html for more information. This feature will be removed in version 2.12. Deprecation warnings can be 
disabled by setting deprecation_warnings=False in ansible.cfg.
compute-0 | CHANGED | rc=0 >>
Jul 31 10:06:41 compute-0 ansible-command[138926]: Invoked with _raw_params=grep -A 20 nova_cell_v2_discover_hosts.py /var/log/messages warn=True _uses_shell=False stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Jul 31 10:07:41 compute-0 systemd[1]: Starting ceilometer_agent_compute healthcheck...
Jul 31 10:07:41 compute-0 systemd[1]: Starting ovn_controller healthcheck...
Jul 31 10:07:41 compute-0 systemd[1]: Starting logrotate_crond healthcheck...
Jul 31 10:07:41 compute-0 systemd[1]: Starting nova_compute healthcheck...
Jul 31 10:07:41 compute-0 systemd-logind[916]: Session 23 logged out. Waiting for processes to exit.
Jul 31 10:07:41 compute-0 systemd-logind[916]: Removed session 23.
Jul 31 10:07:42 compute-0 systemd[1]: user-runtime-dir: Unit not needed anymore. Stopping.
Jul 31 10:07:42 compute-0 systemd[1]: Stopping User Manager for UID 1000...
Jul 31 10:07:42 compute-0 systemd[138807]: Stopped target Default.
Jul 31 10:07:42 compute-0 systemd[138807]: Stopped target Basic System.
Jul 31 10:07:42 compute-0 systemd[138807]: Stopped target Timers.
Jul 31 10:07:42 compute-0 systemd[138807]: Stopped target Sockets.
Jul 31 10:07:42 compute-0 systemd[138807]: Closed D-Bus User Message Bus Socket.
Jul 31 10:07:42 compute-0 systemd[138807]: Stopped Mark boot as successful after the user session has run 2 minutes.
Jul 31 10:07:42 compute-0 systemd[138807]: Stopped target Paths.
Jul 31 10:07:42 compute-0 systemd[138807]: Reached target Shutdown.
Jul 31 10:07:42 compute-0 systemd[138807]: Starting Exit the Session...
Jul 31 10:07:42 compute-0 systemd[1]: Stopped User Manager for UID 1000.
Jul 31 10:07:42 compute-0 systemd[1]: user-runtime-dir: Unit not needed anymore. Stopping.
Jul 31 10:07:42 compute-0 systemd[1]: Stopping /run/user/1000 mount wrapper...

Comment 21 errata-xmlrpc 2019-09-21 11:21:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2811