Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1891461

Summary:	A slight cut in rabbitmq/db connectivity may trigger a full sync in l3 agent that leads to total cut in FIP availability in version <=z7.
Product:	Red Hat OpenStack	Reporter:	Sofer Athlan-Guyot <sathlang>
Component:	openstack-tripleo-heat-templates	Assignee:	RHOS Maint <rhos-maint>
Status:	CLOSED CANTFIX	QA Contact:	David Rosenfeld <drosenfe>
Severity:	high	Docs Contact:
Priority:	medium
Version:	13.0 (Queens)	CC:	bcafarel, mburns
Target Milestone:	---	Keywords:	ZStream
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-01-04 16:42:37 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Sofer Athlan-Guyot 2020-10-26 10:56:30 UTC

Description of problem:

Hi, during debugging of https://bugzilla.redhat.com/show_bug.cgi?id=1885342, it was found that a cut in ctl plane (db/rabbitmb) triggers a full sync that lead to a cut in fip connectivity that is not recoverable for old osp13 release (<=z7) with a networker role.

The ultimate cause is known and described in that bugzilla[1].

At first it was though that we needed a big cut (~30min), but even a small cut (~2min) is enough to trigger it.

I'm opening this bug to document the issue as it stands today, and make sure
 - there is no way to prevents the full sync for small cut on old release.
 - that the reboot of the networker is the only way out of this issue, or is there a way to prevent the full for such small cut ?

The issue from the log point of view:

   The cut in messaging which last ~1min

     2020-10-24 13:40:31.959 27120 ERROR oslo.messaging._drivers.impl_rabbit [-] [2b131c5a-620f-4f0e-85f8-358e1c3f8472] AMQP server on messaging-2.internalapi.redhat.local:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds. Client port: None: error: [Errno 104] Connection reset by peer
     2020-10-24 13:41:41.413 27120 INFO oslo.messaging._drivers.impl_rabbit [-] [2b131c5a-620f-4f0e-85f8-358e1c3f8472] Reconnected to AMQP server on messaging-0.internalapi.redhat.local:5672 via [amqp] client with port 41380.
     2020-10-24 13:41:41.422 27120 INFO oslo.messaging._drivers.impl_rabbit [-] [5ac3cab5-21dd-438f-9121-f0f65aa20278] Reconnected to AMQP server on messaging-0.internalapi.redhat.local:5672 via [amqp] client with port 41378.


   Then a cut in db which last again ~1min


     2020-10-24 13:42:27.069 27120 ERROR neutron.agent.l3.agent RemoteError: Remote error: DBConnectionError (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
     2020-10-24 13:42:27.069 27120 ERROR neutron.agent.l3.agent [u'Traceback (most recent call last):\n', u'  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming\n    res = self.dispatcher.dispatch(message)\n', u'  File "/usr/lib/python2.7/
     site-packages/oslo_messaging/rpc/dispatcher.py", line 220, in dispatch\n    return self._do_dispatch(endpoint, method, ctxt, args)\n', u'  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 190, in _do_dispatch
     2020-10-24 13:42:27.069 27120 ERROR neutron.agent.l3.agent
     2020-10-24 13:42:27.070 27120 WARNING oslo.service.loopingcall [-] Function 'neutron.agent.l3.agent.L3NATAgentWithStateReport._report_state' run outlasted interval by 13.80 sec
     2020-10-24 13:42:51.868 27120 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" acquired by "neutron.agent.linux.external_process._check_child_processes" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
     2020-10-24 13:42:51.868 27120 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" released by "neutron.agent.linux.external_process._check_child_processes" :: held 0.001s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
     2020-10-24 13:42:57.866 27120 DEBUG oslo_service.periodic_task [req-b511e20a-ac7e-4834-99be-c97c89c892fd - - - - -] Running periodic task L3NATAgentWithStateReport.periodic_sync_routers_task run_periodic_tasks /usr/lib/python2.7/site-packages/oslo_service/periodic_task.py:215
     2020-10-24 13:43:09.871 27120 ERROR neutron.agent.l3.agent [-] Failed reporting state!: RemoteError: Remote error: DBConnectionError (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)

     2020-10-24 13:43:09.871 27120 ERROR neutron.agent.l3.agent
     2020-10-24 13:43:09.872 27120 WARNING oslo.service.loopingcall [-] Function 'neutron.agent.l3.agent.L3NATAgentWithStateReport._report_state' run outlasted interval by 12.80 sec
     2020-10-24 13:43:16.033 27120 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" acquired by "neutron.agent.linux.external_process._check_child_processes" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
     2020-10-24 13:43:16.035 27120 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" released by "neutron.agent.linux.external_process._check_child_processes" :: held 0.002s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285


   Then the full sync:


     2020-10-24 13:43:16.239 27120 INFO neutron.agent.l3.agent [-] Agent has just been revived. Doing a full sync.


   Then the bug[1]:

     2020-10-24 13:43:34.662 27120 ERROR neutron.agent.linux.utils [-] Exit code: 1; Stdin: ; Stdout: ; Stderr: setting the network namespace "qrouter-8f44da3d-59ae-4917-bb33-a6e7073971d4" failed: Invalid argument

     2020-10-24 13:43:34.662 27120 WARNING neutron.agent.linux.ip_lib [-] Setting ['sysctl', '-w', u'net.ipv6.conf.qg-30c3ad58-5c.accept_ra=0'] in namespace qrouter-8f44da3d-59ae-4917-bb33-a6e7073971d4 failed: Exit code: 1; Stdin: ; Stdout: ; Stderr: setting the network namespace "qrouter-8f44da3d-59ae-4917-bb33-a6e7073971d4" failed: Invalid argument
     .: ProcessExecutionError: Exit code: 1; Stdin: ; Stdout: ; Stderr: setting the network namespace "qrouter-8f44da3d-59ae-4917-bb33-a6e7073971d4" failed: Invalid argument
     2020-10-24 13:43:34.662 27120 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-8f44da3d-59ae-4917-bb33-a6e7073971d4', 'sysctl', '-w', 'net.ipv6.conf.qg-30c3ad58-5c.forwarding=1'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:108
     2020-10-24 13:43:34.668 27120 ERROR neutron.agent.linux.utils [-] Exit code: 1; Stdin: ; Stdout: ; Stderr: setting the network namespace "qrouter-8f44da3d-59ae-4917-bb33-a6e7073971d4" failed: Invalid argument

     2020-10-24 13:43:34.668 27120 WARNING neutron.agent.linux.ip_lib [-] Setting ['sysctl', '-w', u'net.ipv6.conf.qg-30c3ad58-5c.forwarding=1'] in namespace qrouter-8f44da3d-59ae-4917-bb33-a6e7073971d4 failed: Exit code: 1; Stdin: ; Stdout: ; Stderr: setting the network namespace "qrouter-8f44da3d-59ae-4917-bb33-a6e7073971d4" failed: Invalid argument
     .: ProcessExecutionError: Exit code: 1; Stdin: ; Stdout: ; Stderr: setting the network namespace "qrouter-8f44da3d-59ae-4917-bb33-a6e7073971d4" failed: Invalid argument


Version-Release number of selected component (if applicable):  This happen only if you have osp13 <= z7 (on rhel-7.6) and have networker role installed.  Osp13z8 and above don't have the issue.


How reproducible:  Not always, sometime the cut doesn't trigger the full sync.


Steps to Reproduce:
1. deploy osp13 <= z7 with networkers
2. cut the db or messaging access from the networkers for ~1min
3. see the full sync happening and and fails.

Actual results: The full sync fail and namespaces are left non-working on the networkers as described in [1].


Expected results: Small cut don't trigger the full sync and the bug.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1758547#c55

Comment 4 Sofer Athlan-Guyot 2020-12-09 13:21:53 UTC

Ok,

so just to be clear the workaround is really for another issue. It just help triggers this issue which has currently no clean solution beside restart the networker.

To be more specific updating the messaging nodes shouldn't cause a cut in connectivity as we doing a rolling update. The problem then is that that cut in connectivity cascade into a unresolved issue (for those version of OSP) with neutron which totally cut the network.... until someone restart the networker.

My request here was to investigate why the rolling update was triggering cut in messaging connectivity.

Currently CI is currently broken for other reason, so I don't have a current clear view of this issue.

So feel free to close this, I'll re-raise it if I have reproducer now that we don't use the mentioned workaround anymore as rhos-release solve that problem.

Thanks,

Comment 5 Bernard Cafarelli 2021-01-04 16:42:37 UTC

Sounds good, closing for now then