Description of problem:
neutron.tests.functional.agent.l3.test_ha_router.L3HATestFailover.test_both_ha_router_lost_gw_connection may fail with timeout
neutron.tests.functional.agent.l3.test_ha_router.L3HATestFailover.test_ha_router_lost_gw_connection may fail with mismatch_error

Version-Release number of selected component (if applicable):
latest OSP 11

How reproducible:
recent tests show one of these 2 failures
More tests:
* upstream stable/ocata devstack on centos 7.4: tests do not fail (left running all night)
* osp 11 installed with packstack on rhel 7.4 + checkout of rhos-11.0-patches: tests do not fail
* connecting to one of the CI nodes, rhel 7.5: tests almost always fail now, sample with both timeout and mismatch below

neutron.tests.functional.agent.l3.test_ha_router.L3HATestFailover.test_ha_router_lost_gw_connection
---------------------------------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "neutron/tests/base.py", line 117, in func
        return f(self, *args, **kwargs)
      File "neutron/tests/functional/agent/l3/test_ha_router.py", line 391, in test_ha_router_lost_gw_connection
        self.assertEqual(master_router, new_slave)
      File "/usr/lib/python2.7/site-packages/testtools/testcase.py", line 350, in assertEqual
        self.assertThat(observed, matcher, message)
      File "/usr/lib/python2.7/site-packages/testtools/testcase.py", line 435, in assertThat
        raise mismatch_error
    testtools.matchers._impl.MismatchError: !=:
    reference = <neutron.agent.l3.ha_router.HaRouter object at 0x7f79d481d9d0>
    actual    = <neutron.agent.l3.ha_router.HaRouter object at 0x7f79d46ad210>

Captured stderr:
~~~~~~~~~~~~~~~~
    neutron/agent/ovsdb/native/connection.py:116: DeprecationWarning: Using function/method 'Connection._idl_factory()' is deprecated in version 'Ocata' and will be removed in version 'Pike': Use an idl_factory function instead
      self.idl = self.idl_factory()
    neutron/agent/ovsdb/native/connection.py:98: DeprecationWarning: Using function/method 'Connection.get_schema_helper()' is deprecated in version 'Ocata' and will be removed in version 'Pike': Use idlutils.get_schema_helper(conn, schema, retry=True)
      helper = self.get_schema_helper()
    neutron/agent/ovsdb/native/connection.py:99: DeprecationWarning: Using function/method 'Connection.update_schema_helper()' is deprecated in version 'Ocata' and will be removed in version 'Pike': Use an idl_factory and ovs.db.SchemaHelper for filtering
      self.update_schema_helper(helper)
    neutron/common/utils.py:804: DeprecationWarning: Raising eventlet.TimeoutError by default has been deprecated in version 'Ocata' and will be removed in version 'Pike': wait_until_true() now raises WaitTimeout error by default.
      removal_version="Pike")

neutron.tests.functional.agent.l3.test_ha_router.L3HATestFailover.test_both_ha_router_lost_gw_connection
--------------------------------------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "neutron/tests/base.py", line 119, in func
        self.fail('Execution of this test timed out: %s' % e)
      File "/usr/lib/python2.7/site-packages/unittest2/case.py", line 666, in fail
        raise self.failureException(msg)
    AssertionError: Execution of this test timed out: Timed out after 60 seconds

Captured stderr:
~~~~~~~~~~~~~~~~
    neutron/agent/ovsdb/native/connection.py:116: DeprecationWarning: Using function/method 'Connection._idl_factory()' is deprecated in version 'Ocata' and will be removed in version 'Pike': Use an idl_factory function instead
      self.idl = self.idl_factory()
    neutron/agent/ovsdb/native/connection.py:98: DeprecationWarning: Using function/method 'Connection.get_schema_helper()' is deprecated in version 'Ocata' and will be removed in version 'Pike': Use idlutils.get_schema_helper(conn, schema, retry=True)
      helper = self.get_schema_helper()
    neutron/agent/ovsdb/native/connection.py:99: DeprecationWarning: Using function/method 'Connection.update_schema_helper()' is deprecated in version 'Ocata' and will be removed in version 'Pike': Use an idl_factory and ovs.db.SchemaHelper for filtering
      self.update_schema_helper(helper)
    neutron/common/utils.py:804: DeprecationWarning: Raising eventlet.TimeoutError by default has been deprecated in version 'Ocata' and will be removed in version 'Pike': wait_until_true() now raises WaitTimeout error by default.
      removal_version="Pike")

I captured journalctl, attaching it to the bz
Created attachment 1423562
journalctl log while running failing tests
The difference in OS versions made me test a 7.4 to 7.5 update; it looks like the failure only appears with 7.5 packages. On the "osp 11 installed with packstack on rhel 7.4 + checkout of rhos-11.0-patches" setup, I can reproduce the failures after running yum update on the system (system updates only). Looking into possible changes.
Just updating keepalived to the 7.5 version is enough to trigger the failures. Checking specific changes between builds (both are upstream 1.3.5).
After 1.3.5-4 "Fix bugs related to failures when load modules and/or segfaults" for #1508435 the health check script can not be found anymore

Before:
defiant-rhos Keepalived_vrrp: VRRP_Script(ha_health_check_1) succeeded

After:
defiant-rhos Keepalived_vrrp: WARNING - default user 'keepalived_script' for script execution does not exist - please create.
defiant-rhos Keepalived_vrrp: Unable to access script `/tmp/tmpBwr_Ci/tmpQckjUG/ha_confs/83c87422-74b7-46c8-8f44-79f3fee54062/ha_check_script_1.sh`
defiant-rhos Keepalived_vrrp: Disabling track script ha_health_check_1 since not found

That explains the timeouts observed in tests I guess
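For anyone comparing their own journalctl output against the before/after lines above, here is a minimal sketch of a log scanner. The regular expressions are derived only from the three keepalived messages quoted in this comment; the function name `classify_track_scripts` is hypothetical, and other keepalived versions may word these messages differently.

```python
import re

# Patterns copied from the keepalived messages quoted in this report.
# Assumption: the wording is the same in the build being checked.
SUCCEEDED = re.compile(r"VRRP_Script\((?P<name>[^)]+)\) succeeded")
DISABLED = re.compile(r"Disabling track script (?P<name>\S+) since not found")
UNREADABLE = re.compile(r"Unable to access script `(?P<path>[^`]+)`")


def classify_track_scripts(lines):
    """Return (status_by_script_name, unreadable_script_paths).

    status_by_script_name maps a track script name to "succeeded" or
    "disabled"; a later log line for the same name overrides an earlier one.
    """
    status = {}
    unreadable = []
    for line in lines:
        match = SUCCEEDED.search(line)
        if match:
            status[match.group("name")] = "succeeded"
            continue
        match = DISABLED.search(line)
        if match:
            status[match.group("name")] = "disabled"
            continue
        match = UNREADABLE.search(line)
        if match:
            unreadable.append(match.group("path"))
    return status, unreadable
```

Feeding it the saved journal (for example the lines of the attached journalctl log) gives a quick view of which health check scripts keepalived dropped during a test run.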
OK, so the possible race fixed by https://bugs.launchpad.net/neutron/+bug/1674780 occurs more often with keepalived from rhel 7.5. That explains the "Unable to access script" logs even though nothing had changed in the file generation/path/content.

Backporting this change to get CI opinion
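To illustrate the kind of race being discussed, here is a generic sketch of writing a check script so that a process reading it never sees a missing or half-written file at the final path. This is not taken from the upstream fix and may differ from what that change does; the helper name `write_script_atomically` is hypothetical, and the assumption is that the race is between script creation and keepalived checking the path.

```python
import os
import stat
import tempfile


def write_script_atomically(path, content):
    """Write an executable script to `path` via a temp file and a rename.

    The file only appears at `path` once it is complete and executable,
    so a daemon started afterwards (or checking concurrently) does not
    observe a partial script there.
    """
    directory = os.path.dirname(path) or "."
    # Create the temp file in the same directory so the final rename
    # stays on one filesystem and is atomic on POSIX.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp_script_")
    try:
        with os.fdopen(fd, "w") as script:
            script.write(content)
        # rwxr-xr-x: readable and executable by the user running the check.
        os.chmod(tmp_path, stat.S_IRWXU | stat.S_IRGRP | stat.S_IXGRP
                 | stat.S_IROTH | stat.S_IXOTH)
        os.replace(tmp_path, path)
    except Exception:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The caller would still need to make sure this runs before the daemon is spawned or reloaded; the rename only protects against partially written content.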
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1614