Bug 1658151
Summary: | ceilometer HTTP Error 500 after controllers were temporarily disconnected | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Noam Manos <nmanos> |
Component: | openstack-nova | Assignee: | OSP DFG:Compute <osp-dfg-compute> |
Status: | CLOSED NOTABUG | QA Contact: | OSP DFG:Compute <osp-dfg-compute> |
Severity: | urgent | Docs Contact: | |
Priority: | unspecified | ||
Version: | 14.0 (Rocky) | CC: | dasmith, eglynn, jdanjou, jhakimra, jruzicka, kchamart, mkolesni, oblaut, sbauza, sgordon, tfreger, vromanso |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2018-12-21 15:38:03 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Noam Manos
2018-12-11 11:55:09 UTC
The 500 error is returned by the Nova API, reassigning. I see two important things here:

    2018-12-11 09:36:24.910 144920 ERROR neutron.agent.dhcp.agent [req-d47e84c0-1e3f-4784-8507-fb2dd90d8f27 - - - - -] Failed reporting state!: MessagingTimeout: Timed out waiting for a reply to message ID ae8e9488964b42d2a59756561074988a

and:

    2018-12-11 09:34:34.933 26 ERROR oslo_db.sqlalchemy.engines [req-1ec53a36-a8be-4627-ae38-aaebe6a9e308 - - - - -] Database connection was found disconnected; reconnecting: DBConnectionError: (pymysql.err.OperationalError) (2006, "MySQL server has gone away (error(110, 'Connection timed out'))") [SQL: u'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)

many times over. This indicates that the database is not in a usable state and probably did not recover correctly from the snapshot process. The same appears to be true for the message queue. Because of this, I don't think this is a bug in Nova (or in any other OSP component). However, I didn't think that creating a snapshot affected the guest OS, so I'm not going to close this NOTABUG just yet; I want to double-check snapshot side effects first, but that is what I'm leaning towards.

> I didn't think that creating a snapshot affected the guest OS. So I'm not
> going to close this NOTABUG just yet to double check snapshot side-effects,
> but this is what I'm leaning towards.

I was wrong. Unless the snapshot is both consistent and live, it will affect the guest OS. There are two ways around this:

1. Take a consistent, live snapshot, so the guest OS is not affected :)
2. Assume any other type of snapshot will affect the guest OS.

We have a guide about rebooting the overcloud [1] that ensures the services are given time to recover. Following this guide, but replacing each reboot action with the snapshot action, should allow the services to recover between snapshots and should avoid the database and/or message queue failures seen here.

Cheers!
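The `[SQL: u'SELECT 1']` in the oslo.db traceback is a connection liveness probe: before handing out a pooled connection, the engine issues a cheap `SELECT 1`, and if that fails it discards the stale handle and reconnects. A minimal sketch of the pattern, using stdlib sqlite3 as a stand-in backend (the class and its names are illustrative, not the actual oslo.db implementation):

```python
import sqlite3

class PingingConnection:
    """Illustrative sketch of a ping-before-use connection wrapper,
    the same pattern behind oslo.db's 'SELECT 1' liveness probe."""

    def __init__(self, connect):
        self._connect = connect   # zero-arg factory returning a DB-API connection
        self._conn = connect()

    def execute(self, sql, params=()):
        try:
            self._conn.execute("SELECT 1")   # liveness probe
        except sqlite3.Error:
            # "server has gone away": drop the dead handle and reconnect
            self._conn = self._connect()
        return self._conn.execute(sql, params)
```

In this bug the probe itself kept failing "many times over", meaning reconnection could not succeed because the database behind it had not actually recovered yet.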
[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/director_installation_and_usage/sect-rebooting_the_overcloud