Bug 1565055

Summary: Functional tests fail on L3HATestFailover
Product: Red Hat OpenStack
Reporter: Bernard Cafarelli <bcafarel>
Component: openstack-neutron
Assignee: Bernard Cafarelli <bcafarel>
Status: CLOSED ERRATA
QA Contact: Toni Freger <tfreger>
Severity: medium
Priority: medium
Version: 11.0 (Ocata)
CC: amuller, chrisw, nyechiel, srevivo
Target Milestone: z5
Keywords: AutomationBlocker, Triaged, ZStream
Target Release: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-neutron-10.0.4-8.el7ost
Clone Of:
: 1567493 (view as bug list)
Last Closed: 2018-05-18 16:56:11 UTC
Type: Bug
Bug Blocks: 1567493
Attachments:
journalctl log while running failing tests (flags: none)

Description Bernard Cafarelli 2018-04-09 09:24:07 UTC
Description of problem:

neutron.tests.functional.agent.l3.test_ha_router.L3HATestFailover.test_both_ha_router_lost_gw_connection may fail with a timeout
neutron.tests.functional.agent.l3.test_ha_router.L3HATestFailover.test_ha_router_lost_gw_connection may fail with a MismatchError

Version-Release number of selected component (if applicable): latest OSP 11


How reproducible: recent tests show one of these 2 failures

Comment 5 Bernard Cafarelli 2018-04-18 12:44:16 UTC
More tests:
* upstream stable/ocata devstack on CentOS 7.4: tests do not fail (left running all night)
* OSP 11 installed with packstack on RHEL 7.4 + checkout of rhos-11.0-patches: tests do not fail
* connecting to one of the CI nodes, RHEL 7.5: tests now almost always fail; a sample with both the timeout and the mismatch is below

neutron.tests.functional.agent.l3.test_ha_router.L3HATestFailover.test_ha_router_lost_gw_connection
---------------------------------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "neutron/tests/base.py", line 117, in func
        return f(self, *args, **kwargs)
      File "neutron/tests/functional/agent/l3/test_ha_router.py", line 391, in test_ha_router_lost_gw_connection
        self.assertEqual(master_router, new_slave)
      File "/usr/lib/python2.7/site-packages/testtools/testcase.py", line 350, in assertEqual
        self.assertThat(observed, matcher, message)
      File "/usr/lib/python2.7/site-packages/testtools/testcase.py", line 435, in assertThat
        raise mismatch_error
    testtools.matchers._impl.MismatchError: !=:
    reference = <neutron.agent.l3.ha_router.HaRouter object at 0x7f79d481d9d0>
    actual    = <neutron.agent.l3.ha_router.HaRouter object at 0x7f79d46ad210>
    
    

Captured stderr:
~~~~~~~~~~~~~~~~
    neutron/agent/ovsdb/native/connection.py:116: DeprecationWarning: Using function/method 'Connection._idl_factory()' is deprecated in version 'Ocata' and will be removed in version 'Pike': Use an idl_factory function instead
      self.idl = self.idl_factory()
    neutron/agent/ovsdb/native/connection.py:98: DeprecationWarning: Using function/method 'Connection.get_schema_helper()' is deprecated in version 'Ocata' and will be removed in version 'Pike': Use idlutils.get_schema_helper(conn, schema, retry=True)
      helper = self.get_schema_helper()
    neutron/agent/ovsdb/native/connection.py:99: DeprecationWarning: Using function/method 'Connection.update_schema_helper()' is deprecated in version 'Ocata' and will be removed in version 'Pike': Use an idl_factory and ovs.db.SchemaHelper for filtering
      self.update_schema_helper(helper)
    neutron/common/utils.py:804: DeprecationWarning: Raising eventlet.TimeoutError by default has been deprecated in version 'Ocata' and will be removed in version 'Pike': wait_until_true() now raises WaitTimeout error by default.
      removal_version="Pike")
    

neutron.tests.functional.agent.l3.test_ha_router.L3HATestFailover.test_both_ha_router_lost_gw_connection
--------------------------------------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "neutron/tests/base.py", line 119, in func
        self.fail('Execution of this test timed out: %s' % e)
      File "/usr/lib/python2.7/site-packages/unittest2/case.py", line 666, in fail
        raise self.failureException(msg)
    AssertionError: Execution of this test timed out: Timed out after 60 seconds
    

Captured stderr:
~~~~~~~~~~~~~~~~
    neutron/agent/ovsdb/native/connection.py:116: DeprecationWarning: Using function/method 'Connection._idl_factory()' is deprecated in version 'Ocata' and will be removed in version 'Pike': Use an idl_factory function instead
      self.idl = self.idl_factory()
    neutron/agent/ovsdb/native/connection.py:98: DeprecationWarning: Using function/method 'Connection.get_schema_helper()' is deprecated in version 'Ocata' and will be removed in version 'Pike': Use idlutils.get_schema_helper(conn, schema, retry=True)
      helper = self.get_schema_helper()
    neutron/agent/ovsdb/native/connection.py:99: DeprecationWarning: Using function/method 'Connection.update_schema_helper()' is deprecated in version 'Ocata' and will be removed in version 'Pike': Use an idl_factory and ovs.db.SchemaHelper for filtering
      self.update_schema_helper(helper)
    neutron/common/utils.py:804: DeprecationWarning: Raising eventlet.TimeoutError by default has been deprecated in version 'Ocata' and will be removed in version 'Pike': wait_until_true() now raises WaitTimeout error by default.
      removal_version="Pike")

I captured the journalctl output and am attaching it to this BZ.

Comment 6 Bernard Cafarelli 2018-04-18 12:44:53 UTC
Created attachment 1423562 [details]
journalctl log while running failing tests

Comment 7 Bernard Cafarelli 2018-04-19 10:25:38 UTC
The difference in OS versions prompted me to test an update from RHEL 7.4 to 7.5: the failures only appear with the 7.5 packages.

On the "osp 11 installed with packstack on rhel 7.4 + checkout of rhos-11.0-patches" setup, I can reproduce them after running yum update (system updates only). Now looking into what changed between the two releases.

Comment 9 Bernard Cafarelli 2018-04-19 15:20:37 UTC
Updating only keepalived to the 7.5 version is enough to trigger the failures. Now checking the specific changes between the two builds (both are based on upstream 1.3.5).

Comment 10 Bernard Cafarelli 2018-04-19 15:48:39 UTC
Since keepalived 1.3.5-4 ("Fix bugs related to failures when load modules and/or segfaults", for #1508435), the health check script can no longer be found:

Before:
defiant-rhos Keepalived_vrrp: VRRP_Script(ha_health_check_1) succeeded
After:
defiant-rhos Keepalived_vrrp: WARNING - default user 'keepalived_script' for script execution does not exist - please create.
defiant-rhos Keepalived_vrrp: Unable to access script `/tmp/tmpBwr_Ci/tmpQckjUG/ha_confs/83c87422-74b7-46c8-8f44-79f3fee54062/ha_check_script_1.sh`
defiant-rhos Keepalived_vrrp: Disabling track script ha_health_check_1 since not found

That would explain the timeouts observed in the tests: with the track script disabled, keepalived no longer runs the health check.
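
To check a captured journal for the same signature, the keepalived messages quoted above can be matched with a few regular expressions. This is only a rough sketch: the function name `summarize_track_scripts` and the shape of the returned dictionary are made up for illustration, and the patterns are copied from the three log lines in this comment rather than from any keepalived documentation.

```python
import re

# Patterns taken verbatim from the keepalived log lines quoted in this comment.
SCRIPT_OK = re.compile(r"VRRP_Script\((?P<name>[^)]+)\) succeeded")
SCRIPT_INACCESSIBLE = re.compile(r"Unable to access script `(?P<path>[^`]+)`")
SCRIPT_DISABLED = re.compile(r"Disabling track script (?P<name>\S+) since not found")


def summarize_track_scripts(log_lines):
    """Collect track-script outcomes from keepalived log lines.

    Hypothetical helper: returns a dict of sets with the script names that
    succeeded, the script paths keepalived could not access, and the script
    names it disabled.
    """
    summary = {"succeeded": set(), "inaccessible": set(), "disabled": set()}
    for line in log_lines:
        match = SCRIPT_OK.search(line)
        if match:
            summary["succeeded"].add(match.group("name"))
            continue
        match = SCRIPT_INACCESSIBLE.search(line)
        if match:
            summary["inaccessible"].add(match.group("path"))
            continue
        match = SCRIPT_DISABLED.search(line)
        if match:
            summary["disabled"].add(match.group("name"))
    return summary
```

Passing the lines of a saved journalctl capture to this function gives a quick view of whether any track script ended up in the "disabled" set during a test run.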

Comment 12 Bernard Cafarelli 2018-04-23 10:26:46 UTC
OK, so the possible race fixed by https://bugs.launchpad.net/neutron/+bug/1674780 occurs more often with keepalived from RHEL 7.5.

That explains the "Unable to access script" logs even though nothing had changed in the file generation, path, or content.

I am backporting this change to see what CI makes of it.
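
For illustration only, this is the general shape of guard that avoids this kind of race. It is not the upstream patch. It assumes the race is between the check script being written and keepalived loading its configuration, and the helper name `wait_for_executable` and its parameters are hypothetical.

```python
import os
import time


def wait_for_executable(path, timeout=5.0, interval=0.1):
    """Poll until `path` is a regular file the current user can execute.

    Hypothetical helper: returns True once the file is usable, or False if
    it is still missing or not executable when the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would run this check on the generated ha_check_script path before spawning keepalived, and treat a False result as a setup error instead of letting keepalived silently disable the track script.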

Comment 18 errata-xmlrpc 2018-05-18 16:56:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1614