Bug 1710118 - [DCN][spine & leaf][Scale] 60 compute node deployment Deployment to server failed DBDuplicateEntry
Keywords:
Status: CLOSED DUPLICATE of bug 1699393
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 13.0 (Queens)
Hardware: All
OS: All
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: RHOS Maint
QA Contact: Sasha Smolyak
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-05-14 22:03 UTC by bjacot
Modified: 2019-06-05 14:59 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-05 14:59:24 UTC
Target Upstream Version:
Embargoed:


Attachments
templates used (10.76 KB, application/gzip)
2019-05-23 19:37 UTC, bjacot
sos report for overcloud-compute1-0 (11.40 MB, application/x-xz)
2019-06-03 13:21 UTC, bjacot

Description bjacot 2019-05-14 22:03:54 UTC
Description of problem:
A deployment with 3 controllers and 60 compute nodes failed with a duplicate entry error. The compute nodes are spread across 10 leaf networks.

Version-Release number of selected component (if applicable):
OSP 13 2019-04-23.1

How reproducible:
70% of the time

Steps to Reproduce:
1. Deploy the undercloud (UC) with 10 leaf networks
2. Prepare the overcloud (OC) with 3 controllers, plus 6 computes on each leaf network (10 leaves in total)
3. Deploy the overcloud

Actual results:
failed:

2019-05-14 20:42:24Z [overcloud.AllNodesDeploySteps.Compute9Deployment_Step5.5]: SIGNAL_IN_PROGRESS  Signal: deployment 637b6b6c-6dae-4e68-b84d-77ace81ff340 succeeded

 Stack overcloud CREATE_FAILED 

overcloud.AllNodesDeploySteps.Compute5Deployment_Step5.2:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: cf4b6185-4bfc-4338-9b84-b8b91c7ed605
  status: CREATE_FAILED
  status_reason: |
    Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
            "    raise errorclass(errno, errval)", 
            "DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, u\"Duplicate entry 'overcloud-compute6-4.localdomain' for key 'uniq_host_mappings0host'\") [SQL: u'INSERT INTO host_mappings (created_at, updated_at, cell_id, host) VALUES (%(created_at)s, %(updated_at)s, %(cell_id)s, %(host)s)'] [parameters: {'host': u'overcloud-compute6-4.localdomain', 'cell_id': 5, 'created_at': datetime.datetime(2019, 5, 14, 20, 42, 12, 365188), 'updated_at': None}] (Background on this error at: http://sqlalche.me/e/gkpj)", 
            "stderr: "
        ]
    }
        to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/1a7b531c-e254-4971-a56d-6e64fd5a69d5_playbook.retry
    
    PLAY RECAP *********************************************************************

Expected results:
pass

Additional info:

workaround:
Do a smaller deployment first, then scale up the environment.

Comment 1 Bob Fournier 2019-05-21 21:25:02 UTC
Brad - can you add a sosreport or full set of logs?  It's not clear where this error is coming from.

Comment 2 Harald Jensås 2019-05-22 09:06:09 UTC
For reference, here is the code where we are hitting a constraint issue:

Master: https://opendev.org/openstack/nova/src/branch/master/nova/db/sqlalchemy/api_models.py#L156-L166
Queens: https://opendev.org/openstack/nova/src/branch/stable/queens/nova/db/sqlalchemy/api_models.py#L146-L156

We are placing "overcloud-compute6-4.localdomain" into the Cell HostMapping, but that host already exists.
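To illustrate the constraint being hit, here is a minimal sketch that reproduces the same class of error. It is not the Nova model: it assumes a simplified host_mappings table and uses an in-memory SQLite database as a stand-in for the MySQL database in the deployment. Only the constraint name and the host value are taken from the error above; the helper names are hypothetical.

```python
import sqlite3


def insert_host_mapping(conn, cell_id, host):
    """Insert one row; raises sqlite3.IntegrityError if the host is already mapped."""
    conn.execute(
        "INSERT INTO host_mappings (cell_id, host) VALUES (?, ?)",
        (cell_id, host),
    )


def demo():
    # Assumption: simplified schema, SQLite instead of MySQL.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE host_mappings ("
        " id INTEGER PRIMARY KEY,"
        " cell_id INTEGER NOT NULL,"
        " host TEXT NOT NULL,"
        " CONSTRAINT uniq_host_mappings0host UNIQUE (host))"
    )
    host = "overcloud-compute6-4.localdomain"
    insert_host_mapping(conn, 5, host)  # first mapping is accepted
    try:
        insert_host_mapping(conn, 5, host)  # second mapping violates the unique key
    except sqlite3.IntegrityError as exc:
        return "duplicate rejected: %s" % exc
    return "no error"


if __name__ == "__main__":
    print(demo())
```

The second insert is rejected by the database itself, which is why the error surfaces as an IntegrityError wrapped in DBDuplicateEntry rather than as an application-level check.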


This may just be as expected, with the index starting at 0 for one and 1 for the other, but we have ``overcloud.AllNodesDeploySteps.Compute5Deployment_Step5.2`` and "overcloud-compute6-4.localdomain". "Compute5" and "compute6"? Could there be a template mistake? (I think not, since this is reproducible 70% of the time. A template error should fail 100% of the time.)

@Brad, can you also share templates and deploy command?

Actual access to an environment where this was reproduced would also be great.

Comment 3 bjacot 2019-05-23 19:37:08 UTC
Created attachment 1572675 [details]
templates used

Comment 4 bjacot 2019-05-23 19:39:33 UTC
Hey Harald,

I uploaded a copy of my templates.  Also, this does not seem to happen all the time.  I'll work on getting my environment back up and reproducing this issue.

$ cat overcloud_deploy.sh 
#!/bin/bash

openstack overcloud deploy \
--timeout 240 \
--templates /usr/share/openstack-tripleo-heat-templates \
-n /home/stack/virt/network_data_spine_leaf.yaml \
-r /home/stack/virt/roles_data_spine_leaf.yaml \
--libvirt-type kvm \
--ntp-server 192.168.220.1 \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network-environment.yaml \
-e /home/stack/virt/nodes_data.yaml \
-e /home/stack/docker-images.yaml \
--log-file overcloud_deployment_$(date +%m_%d_%y__%H_%M_%S).log

Comment 6 Harald Jensås 2019-05-27 17:52:27 UTC
(In reply to bjacot from comment #4)
> Hey Harald,
> 
> I uploaded a copy of my templates.  Also, this does not seem to happen all
> the time.  I'll work on getting my environment back up and reproducing this
> issue.
> 
> $ cat overcloud_deploy.sh 
> #!/bin/bash
> 
> openstack overcloud deploy \
> --timeout 240 \
> --templates /usr/share/openstack-tripleo-heat-templates \
> -n /home/stack/virt/network_data_spine_leaf.yaml \
> -r /home/stack/virt/roles_data_spine_leaf.yaml \
> --libvirt-type kvm \
> --ntp-server 192.168.220.1 \
> -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
> -e /home/stack/virt/network-environment.yaml \
> -e /home/stack/virt/nodes_data.yaml \
> -e /home/stack/docker-images.yaml \
> --log-file overcloud_deployment_$(date +%m_%d_%y__%H_%M_%S).log

The templates and deploy command look good.

I wonder if this could be a race, or if it could be related to teardown/re-scheduling on error issues. (for example this, https://bugs.launchpad.net/nova/+bug/1815799 ?)
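If it is a race, it could look like the following sketch: two discovery runs both check whether the host is mapped before either has inserted it, and the slower one then trips the unique key. This is only an illustration of the hypothesis, not a confirmed diagnosis and not Nova code. The table is a simplified stand-in in SQLite, and all function names are hypothetical.

```python
import sqlite3


def host_is_mapped(conn, host):
    """Check step: True if a mapping for this host already exists."""
    row = conn.execute(
        "SELECT 1 FROM host_mappings WHERE host = ?", (host,)
    ).fetchone()
    return row is not None


def create_mapping(conn, cell_id, host):
    """Insert step: raises sqlite3.IntegrityError on a duplicate host."""
    conn.execute(
        "INSERT INTO host_mappings (cell_id, host) VALUES (?, ?)",
        (cell_id, host),
    )
    return True


def create_mapping_tolerant(conn, cell_id, host):
    """Insert step that treats 'already mapped' as a non-error."""
    try:
        return create_mapping(conn, cell_id, host)
    except sqlite3.IntegrityError:
        return False


def simulate_race(tolerant):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE host_mappings (cell_id INTEGER, host TEXT UNIQUE)")
    host = "overcloud-compute1-0.localdomain"
    # Interleaving under test: both runs perform the check before either inserts.
    run_a_sees_mapping = host_is_mapped(conn, host)
    run_b_sees_mapping = host_is_mapped(conn, host)
    create = create_mapping_tolerant if tolerant else create_mapping
    results = []
    for sees_mapping in (run_a_sees_mapping, run_b_sees_mapping):
        if not sees_mapping:
            try:
                results.append(create(conn, 5, host))
            except sqlite3.IntegrityError:
                results.append("DBDuplicateEntry")
    return results


if __name__ == "__main__":
    print(simulate_race(tolerant=False))
    print(simulate_race(tolerant=True))
```

With the plain insert the second run fails. With the tolerant variant it simply reports that the mapping already existed. Whether tolerating the duplicate is acceptable in the real discovery path is a separate question for DFG:Compute.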

We need the sosreports Bob asked for (the logs), and if we don't see anything there we may want to involve DFG:Compute.

Comment 8 Bob Fournier 2019-05-31 21:02:28 UTC
Full traceback from 'openstack stack failures list overcloud --long'

          "DEBUG:novaclient.v2.client:GET call to compute for http://172.25.0.14:8774/v2.1/os-services?binary=nova-compute used request id req-f04c0768-58ff-48fa-ac09-3b4b7fb96627", 
            "INFO:nova_cell_v2_discover_host:(cellv2) Service registered, running discovery", 
            "Found 2 cell mappings.", 
            "Skipping cell0 since it does not contain hosts.", 
            "Getting computes from cell 'default': b4a8af65-cad8-4db4-a02a-220416aeee74", 
            "Creating host mapping for service overcloud-compute1-0.localdomain", 
            "An error has occurred:", 
            "Traceback (most recent call last):", 
            "  File \"/usr/lib/python2.7/site-packages/nova/cmd/manage.py\", line 1654, in main", 
            "    ret = fn(*fn_args, **fn_kwargs)", 
            "  File \"/usr/lib/python2.7/site-packages/nova/cmd/manage.py\", line 1323, in discover_hosts", 
            "    by_service)", 
            "  File \"/usr/lib/python2.7/site-packages/nova/objects/host_mapping.py\", line 265, in discover_hosts", 
            "  File \"/usr/lib/python2.7/site-packages/nova/objects/host_mapping.py\", line 224, in _check_and_create_host_mappings", 
            "    status_fn)", 
            "  File \"/usr/lib/python2.7/site-packages/nova/objects/host_mapping.py\", line 211, in _check_and_create_service_host_mappings", 
            "    host_mapping.create()", 
            "  File \"/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py\", line 226, in wrapper", 
            "    return fn(self, *args, **kwargs)", 
            "  File \"/usr/lib/python2.7/site-packages/nova/objects/host_mapping.py\", line 114, in create", 
            "    db_mapping = self._create_in_db(self._context, changes)", 
            "  File \"/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py\", line 988, in wrapper", 
            "    return fn(*args, **kwargs)", 
            "  File \"/usr/lib/python2.7/site-packages/nova/objects/host_mapping.py\", line 107, in _create_in_db", 
            "    return _apply_updates(context, db_mapping, updates)", 
            "  File \"/usr/lib/python2.7/site-packages/nova/objects/host_mapping.py\", line 33, in _apply_updates", 
            "    db_mapping.save(context.session)", 
            "  File \"/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/models.py\", line 50, in save", 
            "    session.flush()", 
            "  File \"/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py\", line 2243, in flush", 
            "    self._flush(objects)", 
            "  File \"/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py\", line 2369, in _flush", 
            "    transaction.rollback(_capture_exception=True)", 
            "  File \"/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py\", line 66, in __exit__", 
            "    compat.reraise(exc_type, exc_value, exc_tb)", 
            "  File \"/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py\", line 2333, in _flush", 
            "    flush_context.execute()", 
            "  File \"/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py\", line 391, in execute", 
            "    rec.execute(self)", 
            "  File \"/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py\", line 556, in execute", 
            "    uow", 
            "  File \"/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py\", line 181, in save_obj", 
            "    mapper, table, insert)", 
            "  File \"/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py\", line 866, in _emit_insert_statements", 
            "    execute(statement, params)", 
            "  File \"/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py\", line 948, in execute", 
            "    return meth(self, multiparams, params)", 
            "  File \"/usr/lib64/python2.7/site-packages/sqlalchemy/sql/elements.py\", line 269, in _execute_on_connection", 
            "    return connection._execute_clauseelement(self, multiparams, params)", 
            "  File \"/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py\", line 1060, in _execute_clauseelement", 
            "    compiled_sql, distilled_params", 
            "  File \"/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py\", line 1200, in _execute_context", 
            "    context)", 
            "  File \"/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py\", line 1409, in _handle_dbapi_exception", 
            "    util.raise_from_cause(newraise, exc_info)", 
            "  File \"/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py\", line 203, in raise_from_cause", 
            "    reraise(type(exception), exception, tb=exc_tb, cause=cause)", 
            "  File \"/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py\", line 1193, in _execute_context", 
            "  File \"/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py\", line 507, in do_execute", 
            "    cursor.execute(statement, parameters)", 
            "  File \"/usr/lib/python2.7/site-packages/pymysql/cursors.py\", line 166, in execute", 
            "    result = self._query(query)", 
            "  File \"/usr/lib/python2.7/site-packages/pymysql/cursors.py\", line 322, in _query", 
            "    conn.query(q)", 
            "  File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 856, in query", 
            "    self._affected_rows = self._read_query_result(unbuffered=unbuffered)", 
            "  File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 1057, in _read_query_result", 
            "    result.read()", 
            "  File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 1340, in read", 
            "    first_packet = self.connection._read_packet()", 
            "  File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 1014, in _read_packet", 
            "    packet.check_error()", 
            "  File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 393, in check_error", 
            "    err.raise_mysql_exception(self._data)", 
            "  File \"/usr/lib/python2.7/site-packages/pymysql/err.py\", line 107, in raise_mysql_exception", 
            "    raise errorclass(errno, errval)", 
            "DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, u\"Duplicate entry 'overcloud-compute1-0.localdomain' for key 'uniq_host_mappings0host'\") [SQL: u'INSERT INTO host_mappings (created_at, updated_at, cell_id, host) VALUES (%(created_at)s, %(updated_at)s, %(cell_id)s, %(host)s)'] [parameters: {'host': u'overcloud-compute1-0.localdomain', 'cell_id': 5, 'created_at': datetime.datetime(2019, 5, 31, 20, 26, 44, 449336), 'updated_at': None}] (Background on this error at: http://sqlalche.me/e/gkpj)", 
            "stderr: "
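
When comparing several failed runs, it can help to pull the host, key, and cell_id out of the final DBDuplicateEntry line. The following is a minimal sketch using only regular expressions. The sample string is abbreviated from the line above, and the function and pattern names are hypothetical.

```python
import re

# Patterns match the MySQL error text and the parameter dump seen in the log.
DUPLICATE_RE = re.compile(r"Duplicate entry '([^']+)' for key '([^']+)'")
CELL_ID_RE = re.compile(r"'cell_id': (\d+)")

# Abbreviated copy of the error line from the traceback.
SAMPLE = (
    "DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, u\"Duplicate entry "
    "'overcloud-compute1-0.localdomain' for key 'uniq_host_mappings0host'\") "
    "[parameters: {'host': u'overcloud-compute1-0.localdomain', 'cell_id': 5}]"
)


def parse_duplicate_entry(message):
    """Return host, key and cell_id from a DBDuplicateEntry line, or None if absent."""
    dup = DUPLICATE_RE.search(message)
    if dup is None:
        return None
    cell = CELL_ID_RE.search(message)
    return {
        "host": dup.group(1),
        "key": dup.group(2),
        "cell_id": int(cell.group(1)) if cell else None,
    }


if __name__ == "__main__":
    print(parse_duplicate_entry(SAMPLE))
```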

Comment 9 Bob Fournier 2019-05-31 21:08:07 UTC
$ openstack server list
+--------------------------------------+-------------------------+--------+-------------------------+----------------+-----------+
| ID                                   | Name                    | Status | Networks                | Image          | Flavor    |
+--------------------------------------+-------------------------+--------+-------------------------+----------------+-----------+
| 051772dc-6ba1-47a1-bc63-f2394b9ee65c | overcloud-controller0-0 | ACTIVE | ctlplane=192.168.220.31 | overcloud-full | control0  |
| 0eb99cf7-905d-4718-858e-bc79d2d8ed0d | overcloud-controller0-1 | ACTIVE | ctlplane=192.168.220.21 | overcloud-full | control0  |
| 3c8e5b53-468a-4c48-94aa-6b3b6b278a38 | overcloud-compute4-0    | ACTIVE | ctlplane=192.168.224.10 | overcloud-full | compute4  |
| 3e0af48c-cc87-4f00-9596-324006c25728 | overcloud-compute10-0   | ACTIVE | ctlplane=192.168.230.32 | overcloud-full | compute10 |
| 46e8149c-1b12-4213-b8bd-00d94e2ae655 | overcloud-compute7-0    | ACTIVE | ctlplane=192.168.227.18 | overcloud-full | compute7  |
| 4b3e2ecb-25dd-4200-bd65-aab7715b081a | overcloud-controller0-2 | ACTIVE | ctlplane=192.168.220.11 | overcloud-full | control0  |
| a363f5d3-fd7f-4a98-aa37-97e582e7b97b | overcloud-compute3-0    | ACTIVE | ctlplane=192.168.223.26 | overcloud-full | compute3  |
| c458e8a4-0723-4c64-8120-c4482cb84255 | overcloud-compute1-0    | ACTIVE | ctlplane=192.168.221.10 | overcloud-full | compute1  |
| c19e2a2a-03d7-4054-b04a-bcaf2417dc72 | overcloud-compute8-0    | ACTIVE | ctlplane=192.168.228.12 | overcloud-full | compute8  |
| e19303af-7c78-42f7-acc9-f5c4fbad9fd0 | overcloud-compute6-0    | ACTIVE | ctlplane=192.168.226.10 | overcloud-full | compute6  |
| 802be941-e1f6-451f-9603-de40db03f935 | overcloud-compute9-0    | ACTIVE | ctlplane=192.168.229.19 | overcloud-full | compute9  |
| 5fb3fea4-7a25-4ad6-8064-ceeb38d1214e | overcloud-compute5-0    | ACTIVE | ctlplane=192.168.225.13 | overcloud-full | compute5  |
| a74a7022-1e3f-44c2-801a-e629e19f998c | overcloud-compute2-0    | ACTIVE | ctlplane=192.168.222.16 | overcloud-full | compute2  |
| d4399790-563e-45db-93cb-049def7023da | overcloud-compute11-0   | ACTIVE | ctlplane=192.168.231.23 | overcloud-full | compute11 |
+--------------------------------------+-------------------------+--------+-------------------------+----------------+-----------+

$ openstack server show overcloud-compute1-0
+-------------------------------------+----------------------------------------------------------+
| Field                               | Value                                                    |
+-------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                   |
| OS-EXT-AZ:availability_zone         | nova                                                     |
| OS-EXT-SRV-ATTR:host                | core-undercloud-0.redhat.local                           |
| OS-EXT-SRV-ATTR:hypervisor_hostname | dcc268d3-f1d2-4f4e-a37d-32a685306773                     |
| OS-EXT-SRV-ATTR:instance_name       | instance-00000007                                        |
| OS-EXT-STS:power_state              | Running                                                  |
| OS-EXT-STS:task_state               | None                                                     |
| OS-EXT-STS:vm_state                 | active                                                   |
| OS-SRV-USG:launched_at              | 2019-05-31T19:43:17.000000                               |
| OS-SRV-USG:terminated_at            | None                                                     |
| accessIPv4                          |                                                          |
| accessIPv6                          |                                                          |
| addresses                           | ctlplane=192.168.221.10                                  |
| config_drive                        | True                                                     |
| created                             | 2019-05-31T19:36:07Z                                     |
| flavor                              | compute1 (632d1aef-932c-4eb6-9a6d-720729b6bc66)          |
| hostId                              | d478fae80ef194fad70bacdb3080d9e0d2f15caddadf33dd60b44995 |
| id                                  | c458e8a4-0723-4c64-8120-c4482cb84255                     |
| image                               | overcloud-full (8a62c585-e0fe-4a28-abd1-14a90e301297)    |
| key_name                            | default                                                  |
| name                                | overcloud-compute1-0                                     |
| progress                            | 0                                                        |
| project_id                          | 523216891f884db599dac64fc58acad1                         |
| properties                          |                                                          |
| security_groups                     | name='default'                                           |
| status                              | ACTIVE                                                   |
| updated                             | 2019-05-31T19:43:17Z                                     |
| user_id                             | f46eed435d474bde85f3ebe3f09c42ae                         |
| volumes_attached                    |                                                          |
+-------------------------------------+----------------------------------------------------------+

Comment 10 bjacot 2019-06-03 13:21:54 UTC
Created attachment 1576661 [details]
sos report for overcloud-compute1-0

Comment 11 Bob Fournier 2019-06-04 17:40:13 UTC
Brad - if the setup is still available, can we also get the sosreport from the controller, to try to understand when the entries were created?

Including Compute DFG to help understand what can cause the DBDuplicateEntry in nova db.

Comment 12 bjacot 2019-06-04 21:49:42 UTC
Hey Bob and Compute DFG.

I saw this issue again today.  I grabbed sosreports for the 3 controllers and overcloud-compute8-5.localdomain.  I uploaded the files here: http://rhos-release.virt.bos.redhat.com/log/bz1710118/.

Today's error:

overcloud.AllNodesDeploySteps.Compute4Deployment_Step5.5:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 0a4a89ab-2d16-423d-aa61-8d0c48b3ec7c
  status: CREATE_FAILED
  status_reason: |
    Error: resources[5]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
            "    raise errorclass(errno, errval)", 
            "DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, u\"Duplicate entry 'overcloud-compute8-5.localdomain' for key 'uniq_host_mappings0host'\") [SQL: u'INSERT INTO host_mappings (created_at, updated_at, cell_id, host) VALUES (%(created_at)s, %(updated_at)s, %(cell_id)s, %(host)s)'] [parameters: {'created_at': datetime.datetime(2019, 6, 4, 21, 8, 15, 453649), 'cell_id': 6, 'host': u'overcloud-compute8-5.localdomain', 'updated_at': None}] (Background on this error at: http://sqlalche.me/e/gkpj)", 
            "stderr: "
        ]
    }
        to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/0d5fe8c3-b9e3-4c81-930c-b8c851a129a8_playbook.retry

Comment 13 Martin Schuppert 2019-06-05 13:29:12 UTC
As discussed on IRC, this is likely a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1699393 and fixed in openstack-tripleo-heat-templates-8.3.1-16.el7ost and later.

Comment 14 Bob Fournier 2019-06-05 14:59:24 UTC
Thanks Martin! Closing as duplicate.

*** This bug has been marked as a duplicate of bug 1699393 ***

