Bug 1691049 - openstack overcloud node provide hangs due to connectivity issues between containers
Summary: openstack overcloud node provide hangs due to connectivity issues between containers
Keywords:
Status: CLOSED DUPLICATE of bug 1686817
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: OSP DFG:Compute
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-20 17:48 UTC by Yuri Obshansky
Modified: 2023-03-21 19:15 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-22 14:19:44 UTC
Target Upstream Version:
Embargoed:


Links:
Red Hat Issue Tracker OSP-23423 (last updated 2023-03-21 19:15:40 UTC)

Description Yuri Obshansky 2019-03-20 17:48:15 UTC
Description of problem:
$ openstack overcloud node provide --all-manageable 
hangs for hours (I didn't measure the exact time).
The command actually changes the baremetal nodes' state, but then hangs with the message
"Waiting for messages on queue 'tripleo' with no timeout".
The Mistral logs contain many errors reporting DB connection issues.
For example: /var/log/containers/mistral/engine.log
2019-03-19 19:07:06.488 1 ERROR oslo_db.sqlalchemy.engines [-] Database connection was found disconnected; reconnecting: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: 'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)
More errors:
http://pastebin.test.redhat.com/741012

/var/log/containers/mysql/mariadb.log
2019-03-19 19:07:18 17 [Warning] Aborted connection 17 to db: 'keystone' user: 'keystone' host: 'site-undercloud-0.localdomain' (Got an error reading communication packets)
2019-03-19 19:07:22 23 [Warning] Aborted connection 23 to db: 'heat' user: 'heat' host: 'site-undercloud-0.localdomain' (Got an error reading communication packets)
2019-03-19 19:07:22 25 [Warning] Aborted connection 25 to db: 'heat' user: 'heat' host: 'site-undercloud-0.localdomain' (Got an error reading communication packets)


Version-Release number of selected component (if applicable):
RHEL 8 image -  http://rhos-qe-mirror-tlv.usersys.redhat.com/brewroot/packages/rhel-guest-image/8.0/1776/images/rhel-guest-image-8.0-1776.x86_64.qcow2
Compose: RHOS_TRUNK-15.0-RHEL-8-20190314.n.0

How reproducible:
100%

Steps to Reproduce:
1. Install undercloud 
2. Run introspection
3. Move all baremetal nodes to manageable
$ for node in `openstack baremetal node list -f value -c UUID`; do echo $node; openstack baremetal node manage $node; done
4. Move all baremetal nodes to available
$ openstack overcloud node provide --all-manageable
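
To confirm whether the nodes actually reach "available" despite the hang, their provisioning state can be listed from another shell (a minimal sketch):
$ openstack baremetal node list -f value -c UUID -c "Provisioning State"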

Actual results:
Command hangs ->
Waiting for messages on queue 'tripleo' with no timeout.
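
While the client is stuck on that message, the underlying workflow executions can be inspected on the undercloud to see whether they are still running or have failed (a minimal sketch, assuming the Mistral OSC plugin is installed as usual on the undercloud):
$ source ~/stackrc
$ openstack workflow execution list
$ openstack workflow execution show <execution-id>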

Expected results:
Should finish

Additional info:

Comment 6 Damien Ciabrini 2019-03-22 13:59:41 UTC
Had a look today at an environment provided by Yuri.

The symptoms I'm seeing are slightly different from the ones originally reported, but the consequence is the same: "openstack overcloud node provide --all-manageable" is stuck and never finishes.

A couple of remarks:

1. The DB disconnection logs reported above are most probably a red herring. They remind me of the errors you get when configuring too many workers for a service [1]. What changed in the meantime is that with MariaDB 10.3 the error is now also reported server-side, which probably explains these logs in /var/log/containers/mysql/mariadb.log:
2019-03-19 19:07:18 17 [Warning] Aborted connection 17 to db: 'keystone' user: 'keystone' host: 'site-undercloud-0.localdomain' (Got an error reading communication packets)
[...]
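
If the "too many workers" theory were pursued further, the connection pressure on the undercloud database could be checked directly. A sketch, assuming the database container is named mysql (the default on a containerized undercloud) and that root credentials are at hand:
$ sudo podman exec -it mysql mysql -u root -p -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'; SHOW VARIABLES LIKE 'max_connections';"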

2. When connected to the env, I can see that both the mysql and rabbitmq containers are still running and apparently responding fine:

$ ironic node-list
+--------------------------------------+-------------------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name              | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+-------------------+---------------+-------------+--------------------+-------------+
| 1a535c0c-d26e-4ce3-a5c7-4d08e62a09a4 | dcn1-compute-0    | None          | power off   | available          | False       |
| f783df0c-f123-4b90-adc5-ecfc3a93d5be | dcn2-compute-0    | None          | power off   | available          | False       |
| f237f47a-f0b6-41d8-ba1a-89f61785a318 | site-compute-0    | None          | power off   | available          | False       |
| b6fc6beb-dda9-44e5-ae65-605168bc5224 | site-controller-0 | None          | power off   | available          | False       |
| cdd1003b-b784-47d7-be2b-f2623f8b5b0d | site-controller-1 | None          | power off   | available          | False       |
| d8af97a4-94be-466b-bf5c-561835b7192a | site-controller-2 | None          | power off   | available          | False       |
+--------------------------------------+-------------------+---------------+-------------+--------------------+-------------+
 
So I wonder if the stalled behaviour in "openstack overcloud node provide --all-manageable" isn't due to mistral (openstack overcloud commands run through mistral).
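
For completeness, container health and the 'tripleo' queue the client is waiting on can also be checked directly. A sketch, assuming the default undercloud container names mysql and rabbitmq:
$ sudo podman ps --format "{{.Names}} {{.Status}}" | grep -E "mysql|rabbitmq"
$ sudo podman exec rabbitmq rabbitmqctl list_queues name messages consumers | grep tripleo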

3. Looking at the Mistral logs, I can indeed see that a workflow action errored out:

2019-03-22 12:23:40.801 1 WARNING mistral.actions.openstack.base [req-d5012f25-6d22-44ac-bd1f-6907353af620 96b0314a8bbc43f487377f2b8fb7e260 2a61b3502c83486ca907c87d758964e0 - default default] Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/mistral/actions/openstack/base.py", line 117, in run
    result = method(**self._kwargs_for_run)
  File "/usr/lib/python3.6/site-packages/novaclient/base.py", line 418, in find
    raise exceptions.NotFound(404, msg)
novaclient.exceptions.NotFound: No Hypervisor matching {'hypervisor_hostname': 'cdd1003b-b784-47d7-be2b-f2623f8b5b0d'}. (HTTP 404)
: novaclient.exceptions.NotFound: No Hypervisor matching {'hypervisor_hostname': 'cdd1003b-b784-47d7-be2b-f2623f8b5b0d'}. (HTTP 404)
2019-03-22 12:23:40.801 1 WARNING mistral.executors.default_executor [req-d5012f25-6d22-44ac-bd1f-6907353af620 96b0314a8bbc43f487377f2b8fb7e260 2a61b3502c83486ca907c87d758964e0 - default default] The action raised an exception [action_ex_id=cfa5959b-2741-41fd-8ef6-67d8d13f88e9, action_cls='<class 'mistral.actions.action_factory.NovaAction'>', attributes='{'client_method_name': 'hypervisors.find'}', params='{'hypervisor_hostname': 'cdd1003b-b784-47d7-be2b-f2623f8b5b0d'}']
 NovaAction.hypervisors.find failed: No Hypervisor matching {'hypervisor_hostname': 'cdd1003b-b784-47d7-be2b-f2623f8b5b0d'}. (HTTP 404): mistral.exceptions.ActionException: NovaAction.hypervisors.find failed: No Hypervisor matching {'hypervisor_hostname': 'cdd1003b-b784-47d7-be2b-f2623f8b5b0d'}. (HTTP 404)
2019-03-22 12:23:40.801 1 ERROR mistral.executors.default_executor Traceback (most recent call last):
2019-03-22 12:23:40.801 1 ERROR mistral.executors.default_executor   File "/usr/lib/python3.6/site-packages/mistral/actions/openstack/base.py", line 117, in run
2019-03-22 12:23:40.801 1 ERROR mistral.executors.default_executor     result = method(**self._kwargs_for_run)
2019-03-22 12:23:40.801 1 ERROR mistral.executors.default_executor   File "/usr/lib/python3.6/site-packages/novaclient/base.py", line 418, in find
2019-03-22 12:23:40.801 1 ERROR mistral.executors.default_executor     raise exceptions.NotFound(404, msg)
2019-03-22 12:23:40.801 1 ERROR mistral.executors.default_executor novaclient.exceptions.NotFound: No Hypervisor matching {'hypervisor_hostname': 'cdd1003b-b784-47d7-be2b-f2623f8b5b0d'}. (HTTP 404)
2019-03-22 12:23:40.801 1 ERROR mistral.executors.default_executor 
2019-03-22 12:23:40.801 1 ERROR mistral.executors.default_executor During handling of the above exception, another exception occurred:
2019-03-22 12:23:40.801 1 ERROR mistral.executors.default_executor 
2019-03-22 12:23:40.801 1 ERROR mistral.executors.default_executor Traceback (most recent call last):
2019-03-22 12:23:40.801 1 ERROR mistral.executors.default_executor   File "/usr/lib/python3.6/site-packages/mistral/executors/default_executor.py", line 114, in run_action
2019-03-22 12:23:40.801 1 ERROR mistral.executors.default_executor     result = action.run(action_ctx)
2019-03-22 12:23:40.801 1 ERROR mistral.executors.default_executor   File "/usr/lib/python3.6/site-packages/mistral/actions/openstack/base.py", line 130, in run
2019-03-22 12:23:40.801 1 ERROR mistral.executors.default_executor     (self.__class__.__name__, self.client_method_name, str(e))
2019-03-22 12:23:40.801 1 ERROR mistral.executors.default_executor mistral.exceptions.ActionException: NovaAction.hypervisors.find failed: No Hypervisor matching {'hypervisor_hostname': 'cdd1003b-b784-47d7-be2b-f2623f8b5b0d'}. (HTTP 404)
2019-03-22 12:23:40.801 1 ERROR mistral.executors.default_executor 

So it looks like a Nova API call returned an error.
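
That 404 can be reproduced outside of Mistral to confirm the problem sits in Nova rather than in the messaging layer. A sketch:
$ openstack compute service list --service nova-compute
$ openstack hypervisor list
If the ironic node UUIDs never show up in the hypervisor list, the NovaAction.hypervisors.find failure above is expected.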


4. When looking at the Nova logs, I can see:

2019-03-22 12:23:29.938 1 ERROR nova.virt.ironic.driver [req-688eac33-96a1-4ef3-8002-6fb1ee7d6425 - - - - -] An unknown error has occurred when trying to get the list of nodes from the Ironic inventory. Error: maximum recursion depth exceeded while calling a Python object: RecursionError: maximum recursion depth exceeded while calling a Python object
2019-03-22 12:23:29.938 1 WARNING nova.compute.manager [req-688eac33-96a1-4ef3-8002-6fb1ee7d6425 - - - - -] Virt driver is not ready.: nova.exception.VirtDriverNotReady: Virt driver is not ready.
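
The same errors can be pulled straight from the nova_compute container log on the undercloud. A sketch, assuming the default container name:
$ sudo podman logs nova_compute 2>&1 | grep -E "RecursionError|VirtDriverNotReady"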



So, bottom line: it looks like the DB and RabbitMQ are not the root cause of the failure, but rather something in Nova was misconfigured in the first place?
 


[1] http://lists.openstack.org/pipermail/openstack-dev/2015-December/082717.html

Comment 8 Michele Baldessari 2019-03-22 14:19:44 UTC

*** This bug has been marked as a duplicate of bug 1686817 ***

