Description of problem:

If Neutron is slow and the activate step from https://specs.openstack.org/openstack/neutron-specs/_images/seqdiag-7d93af8770687c1a95685208b8e6ca0553a8fb25.png fails, multiple port bindings are left in the Neutron database and the VM ends up in ERROR state. Even though the activate eventually succeeds on the Neutron side (it is just slow) and the live migration itself was successful, things are left in an inconsistent, broken state.

Version-Release number of selected component (if applicable):

OSP 16.1.5

How reproducible:

Always

Steps to Reproduce:
1. Add a sleep, e.g. 120 seconds, to the Neutron activate method: https://github.com/openstack/neutron/blob/150396625aaa870640a6fb9d636aeb90cea4da3e/neutron/plugins/ml2/plugin.py#L2556
2. Run a live migration
3. Observe the ml2_port_bindings table for the VM port

Actual results:

Two port bindings remain and the VM is in ERROR state.

Expected results:

Nova could be more resilient to failures in activate (for example, retrying the activation indefinitely) or have a mechanism to finish the migration in this state, because everything is actually correct from the VM standpoint.

Additional info:
Nova does have the [neutron]http_retries config option, which can be set to any number > 0 to make Nova retry all calls to the Neutron API if the initial request errors out. I'm not sure what else Nova can do here - sure, once we're at the point that we have to activate the port binding on the destination, everything is set up correctly from Nova's POV, but until we get confirmation from Neutron that the port binding has been updated, we can't reasonably move the instance back to ACTIVE, so I feel like setting it to ERROR if the port binding update fails is the correct thing to do. We'll discuss this with the wider team on our bug triage call on Wednesday.
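For reference, bumping the retries is just a nova.conf change on the compute hosts; a minimal example (the value 5 is arbitrary, the option defaults to 3):

[neutron]
http_retries = 5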
(In reply to Artom Lifshitz from comment #2)
> Nova does have the [neutron]http_retries config option, which can be set to
> any number > 0 to make Nova retry all calls to the Neutron API if the
> initial request errors out.

Yeah, this defaults to 3 retries and it does apply to a timeout situation (the retries handle "ConnectionError", which can be things like timeout, SSL error, connection refused, etc) [1]. But if neutron is being consistently slow, then it's possible all 4 attempts will still time out each time.

It's not clear to me what the behavior of the API is if the binding is already activated and we keep doing the PUT to activate it. The API is unfortunately not documented [2].

I'm not sure whether this is a good idea, but I wonder if we could do something like: catch a timeout exception from the PUT to activate the port, and in that case do a GET for the port bindings and check whether the destination binding is ACTIVE and the source INACTIVE (i.e. check whether the activate actually did work), and if so, go ahead and put the instance into ACTIVE state as a successfully completed migration.

[1] https://github.com/openstack/keystoneauth/blob/112bcae1fbec355fcb58da07ae1d9f1adf8b77ba/keystoneauth1/exceptions/connection.py#L38
[2] https://bugs.launchpad.net/neutron/+bug/1853873
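As a very rough sketch of that fallback idea (this is not nova code - the client wiring, URL paths and function name here are hypothetical; only keystoneauth's ConnectTimeout exception and the host/status fields of the port bindings API are assumed), it could look something like:

from keystoneauth1 import exceptions as ksa_exc

def activate_binding_or_verify(client, port_id, dest_host, source_host):
    # Hypothetical helper: try to activate the destination binding and, if
    # the PUT times out, check whether the activation actually happened
    # before giving up and putting the instance into ERROR.
    try:
        client.put('/v2.0/ports/%s/bindings/%s/activate'
                   % (port_id, dest_host))
        return True
    except ksa_exc.ConnectTimeout:
        # The request may still have succeeded on the neutron side, so ask
        # for the current bindings and compare their statuses.
        resp = client.get('/v2.0/ports/%s/bindings' % port_id)
        statuses = {b['host']: b['status'] for b in resp.json()['bindings']}
        if (statuses.get(dest_host) == 'ACTIVE'
                and statuses.get(source_host) == 'INACTIVE'):
            # The activate did work; treat the migration as completed.
            return True
        # Otherwise fall back to the current behavior.
        raise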
I have reproduced the issue in a local devstack by applying the following patch, as suggested in comment 0:

--- a/neutron/plugins/ml2/plugin.py
+++ b/neutron/plugins/ml2/plugin.py
@@ -13,6 +13,8 @@
 # License for the specific language governing permissions and limitations
 # under the License.
 
+import time
+
 from eventlet import greenthread
 import netaddr
 from netaddr.strategy import eui48
@@ -2566,6 +2568,7 @@ class Ml2Plugin(db_base_plugin_v2.NeutronDbPluginV2,
             cur_context, _, try_again = self._commit_port_binding(
                 original_context, bind_context, need_notify=True,
                 update_binding_levels=False)
+            time.sleep(120)
             if not try_again:
                 self.notifier.binding_deactivate(context, port_id,
                                                  active_binding.host,

and then doing live migrations:

(openstack) server create --image 57e9f98f-c7c9-4aac-a289-95ea6bcbcc83 --flavor c1 eight
(openstack) server migrate --live --block-migration eight

I have found that the behavior on the nova side is actually a bug where our [neutron]http_retries config option is _not_ being honored in the case of the neutron port binding API. In our neutron API client code, we actually have two different ways of creating a client to talk to neutron [1], and when the [neutron]http_retries config option was added to nova, the second client (the one that calls the neutron port binding API) was missed when adding the default retry. So, the call to the port binding API /activate is not doing any retries by default.

You can work around the bug by adding the following to nova.conf on the compute hosts:

[neutron]
connect_retries = 3

and the keystoneauth code we use in the clients will pick up the setting automatically.

When retries are actually being done, nova will time out during the first call, then retry and get a 409 Conflict response from neutron, which nova treats as a successful port binding activation, and the live migration completes without error.

These excerpts show the behavior with [neutron]connect_retries = 3, on the nova side and the neutron side:

n-cpu:

May 26 01:09:14 ubuntu-subnode nova-compute[411810]: INFO keystoneauth.session [None req-9327f0c5-e70f-4d54-93af-cae935d1e3bd None None] Failure: Request to http://192.168.33.10:9696/v2.0/ports/d235be7e-6bf8-4403-afbf-ab2963e6574b/bindings/ubuntu-controller/activate timed out. Retrying in 0.5s.
May 26 01:09:15 ubuntu-subnode nova-compute[411810]: WARNING nova.network.neutron [None req-9327f0c5-e70f-4d54-93af-cae935d1e3bd None None] Binding for port d235be7e-6bf8-4403-afbf-ab2963e6574b and host ubuntu-controller is already active.

q-svc:

May 26 01:09:15 ubuntu-controller neutron-server[144111]: INFO neutron.wsgi [req-8160f605-287b-494d-bb22-db3f8b3ee756 req-d439d52a-c18a-43fc-9c7d-79087b44dbae service neutron] 192.168.33.11 "PUT /v2.0/ports/d235be7e-6bf8-4403-afbf-ab2963e6574b/bindings/ubuntu-controller/activate HTTP/1.1" status: 409 len: 386 time: 0.1463201

[...]

May 26 01:10:45 ubuntu-controller neutron-server[144112]: INFO neutron.wsgi [req-8160f605-287b-494d-bb22-db3f8b3ee756 req-17d0d362-c8a1-4f67-bd49-0c68f50715ef service neutron] 192.168.33.11 "PUT /v2.0/ports/d235be7e-6bf8-4403-afbf-ab2963e6574b/bindings/ubuntu-controller/activate HTTP/1.1" status: 200 len: 443 time: 120.8294811

On the neutron side, we can see here ^ that the 409 Conflict reply actually comes back before the original first request from nova completes, because of the sleep(120) that was patched in to reproduce the issue.

To fix the nova bug, we need to pass CONF.neutron.http_retries as connect_retries to the keystoneauth-based client used for calling the port binding API.
[1] https://github.com/openstack/nova/blob/b0cd985f0c09088098f74cc0cb1df616cc0ef12b/nova/network/neutron.py#L232-L274
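Roughly, the fix boils down to something like the following (a simplified sketch of the keystoneauth-based client construction, not the exact code in [1]; the helper name is illustrative):

from keystoneauth1 import adapter as ksa_adapter
from keystoneauth1 import loading as ks_loading

import nova.conf

CONF = nova.conf.CONF

def _get_ksa_client():
    # Sketch: build the ksa Adapter used for the port bindings API and wire
    # CONF.neutron.http_retries through as connect_retries so the /activate
    # call is retried on connection errors, like the other neutron calls.
    auth = ks_loading.load_auth_from_conf_options(CONF, 'neutron')
    session = ks_loading.load_session_from_conf_options(
        CONF, 'neutron', auth=auth)
    return ksa_adapter.Adapter(
        session=session, service_type='network',
        connect_retries=CONF.neutron.http_retries)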
I found that the repeating neutron DuplicateEntry error is actually caused by a neutron-side retry sequence:

q-svc:

May 26 00:59:49 ubuntu-controller neutron-server[144110]: DEBUG neutron.plugins.ml2.managers [req-7f1bf57f-6fdb-4014-9abd-a7317f3c54e2 req-30feb9f0-e0fa-4f48-802b-31e78181805a service neutron] DB exception raised by Mechanism driver 'openvswitch' in update_port_precommit {{(pid=144110) _call_on_drivers /opt/stack/neutron/neutron/plugins/ml2/managers.py:481}}
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers Traceback (most recent call last):
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers   File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers     self.dialect.do_execute(
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers   File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py", line 608, in do_execute
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers     cursor.execute(statement, parameters)
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers   File "/usr/local/lib/python3.8/dist-packages/pymysql/cursors.py", line 148, in execute
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers     result = self._query(query)
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers   File "/usr/local/lib/python3.8/dist-packages/pymysql/cursors.py", line 310, in _query
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers     conn.query(q)
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers   File "/usr/local/lib/python3.8/dist-packages/pymysql/connections.py", line 548, in query
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers     self._affected_rows = self._read_query_result(unbuffered=unbuffered)
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers   File "/usr/local/lib/python3.8/dist-packages/pymysql/connections.py", line 775, in _read_query_result
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers     result.read()
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers   File "/usr/local/lib/python3.8/dist-packages/pymysql/connections.py", line 1156, in read
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers     first_packet = self.connection._read_packet()
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers   File "/usr/local/lib/python3.8/dist-packages/pymysql/connections.py", line 725, in _read_packet
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers     packet.raise_for_error()
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers   File "/usr/local/lib/python3.8/dist-packages/pymysql/protocol.py", line 221, in raise_for_error
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers     err.raise_mysql_exception(self._data)
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers   File "/usr/local/lib/python3.8/dist-packages/pymysql/err.py", line 143, in raise_mysql_exception
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers     raise errorclass(errno, errval)
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers pymysql.err.IntegrityError: (1062, "Duplicate entry '17c5e5e6-be12-44b0-898f-80e0d731fd71-ubuntu-subnode' for key 'ml2_port_bindings.PRIMARY'")

[...]

May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers [SQL: UPDATE ml2_port_bindings SET host=%(host)s, vif_type=%(vif_type)s, vif_details=%(vif_details)s WHERE ml2_port_bindings.port_id = %(ml2_port_bindings_port_id)s AND ml2_port_bindings.host = %(ml2_port_bindings_host)s]
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers [parameters: {'host': 'ubuntu-subnode', 'vif_type': 'unbound', 'vif_details': '', 'ml2_port_bindings_port_id': '17c5e5e6-be12-44b0-898f-80e0d731fd71', 'ml2_port_bindings_host': 'ubuntu-controller'}]
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers (Background on this error at: http://sqlalche.me/e/13/gkpj)
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers
May 26 00:59:49 ubuntu-controller neutron-server[144110]: DEBUG neutron_lib.db.api [req-7f1bf57f-6fdb-4014-9abd-a7317f3c54e2 req-30feb9f0-e0fa-4f48-802b-31e78181805a service neutron] Retry wrapper got retriable exception: (pymysql.err.IntegrityError) (1062, "Duplicate entry '17c5e5e6-be12-44b0-898f-80e0d731fd71-ubuntu-subnode' for key 'ml2_port_bindings.PRIMARY'")
May 26 00:59:49 ubuntu-controller neutron-server[144110]: [SQL: UPDATE ml2_port_bindings SET host=%(host)s, vif_type=%(vif_type)s, vif_details=%(vif_details)s WHERE ml2_port_bindings.port_id = %(ml2_port_bindings_port_id)s AND ml2_port_bindings.host = %(ml2_port_bindings_host)s]
May 26 00:59:49 ubuntu-controller neutron-server[144110]: [parameters: {'host': 'ubuntu-subnode', 'vif_type': 'unbound', 'vif_details': '', 'ml2_port_bindings_port_id': '17c5e5e6-be12-44b0-898f-80e0d731fd71', 'ml2_port_bindings_host': 'ubuntu-controller'}]
May 26 00:59:49 ubuntu-controller neutron-server[144110]: (Background on this error at: http://sqlalche.me/e/13/gkpj) {{(pid=144110) wrapped /usr/local/lib/python3.8/dist-packages/neutron_lib/db/api.py:183}}
May 26 00:59:49 ubuntu-controller neutron-server[144110]: DEBUG oslo_db.api [req-7f1bf57f-6fdb-4014-9abd-a7317f3c54e2 req-30feb9f0-e0fa-4f48-802b-31e78181805a service neutron] Performing DB retry for function neutron.plugins.ml2.plugin.Ml2Plugin.update_port {{(pid=144110) wrapper /usr/local/lib/python3.8/dist-packages/oslo_db/api.py:155}}

[...]

I think this ^ would be a separate bug to open for the neutron component.
(In reply to melanie witt from comment #7)
> I found that the repeating neutron DuplicateEntry error is actually caused
> by a neutron-side retry sequence:
>
> [...]
>
> I think this ^ would be a separate bug to open for the neutron component.

I think this is actually correct behavior. Neutron should retry the port binding multiple times - even if the DB is in bad shape. This is triggered on every port update to fix failed_binding statuses. I'll needinfo Slawek to keep me honest here and to get his opinion.
(In reply to Jakub Libosvar from comment #8)
> (In reply to melanie witt from comment #7)
> > I think this ^ would be a separate bug to open for the neutron component.
>
> I think this is actually correct behavior. Neutron should retry the port
> binding multiple times - even if the DB is in bad shape. This is triggered
> on every port update to fix failed_binding statuses. I'll needinfo Slawek to
> keep me honest here and to get his opinion.

Yeah... what I found odd about the behavior is that it continues to retry forever AFAICT, and it didn't need to: the activation eventually worked (delayed by the sleep(120)), but it kept trying to apply the same DB update repeatedly anyway, even though the DB was already in the desired state.
I'm not sure if it should keep trying to update the DB forever. IMHO it should stop at some point. Please open a new BZ for that and I will try to reproduce such a case and take a deeper look.
(In reply to melanie witt from comment #9)
> (In reply to Jakub Libosvar from comment #8)
> > (In reply to melanie witt from comment #7)
> > > I think this ^ would be a separate bug to open for the neutron component.
> >
> > I think this is actually correct behavior. Neutron should retry the port
> > binding multiple times - even if the DB is in bad shape. This is triggered
> > on every port update to fix failed_binding statuses. I'll needinfo Slawek to
> > keep me honest here and to get his opinion.
>
> Yeah... what I found odd about the behavior is that it continues to retry
> forever AFAICT, and it didn't need to: the activation eventually worked
> (delayed by the sleep(120)), but it kept trying to apply the same DB update
> repeatedly anyway, even though the DB was already in the desired state.

A-ha, I missed that it never stops attempting the DB update. Thanks for the clarification.
*** Bug 1968747 has been marked as a duplicate of this bug. ***
*** Bug 1973635 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3762
*** Bug 2017803 has been marked as a duplicate of this bug. ***
*** Bug 2103781 has been marked as a duplicate of this bug. ***