Bug 2137934

Summary: Controller node nova-* services are down after reboot (happens only with FDP repo)
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Maor <mblue>
Component: openvswitch
Assignee: Timothy Redaelli <tredaelli>
openvswitch sub component: other
QA Contact: qding
Status: CLOSED WORKSFORME
Docs Contact:
Severity: urgent
Priority: urgent
CC: chrisw, ctrautma, ekuris, fleitner, qding, scohen
Version: FDP 22.J
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-12-28 14:28:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Maor 2022-10-26 15:56:26 UTC
Description of problem:
Controller node services (nova-conductor, nova-scheduler, nova-compute) are down and do not start up after node reboot in the test 'test_ovn_dns_name_after_networker_reboot'.
This seems to happen only *when FDP repo is used* for this job.

Job link:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-network-networking-ovn-17.0_director-rhel-virthost-3cont_2comp-ipv4-geneve-gate-ovn/238/

Failure Links:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-network-networking-ovn-17.0_director-rhel-virthost-3cont_2comp-ipv4-geneve-gate-ovn/238/testReport/neutron_plugin.tests.scenario.test_internal_dns/InternalDNSInterruptionsAdvancedTestOvn/test_ovn_dns_name_after_networker_reboot_id_31275dd6_744b_41d2_b4ae_43116901107d_slow_/

https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-network-networking-ovn-17.0_director-rhel-virthost-3cont_2comp-ipv4-geneve-gate-ovn/241/testReport/neutron_plugin.tests.scenario.test_internal_dns/InternalDNSInterruptionsAdvancedTestOvn/test_ovn_dns_name_after_networker_reboot_id_31275dd6_744b_41d2_b4ae_43116901107d_slow_/

Job run that passes without FDP repo:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-network-networking-ovn-17.0_director-rhel-virthost-3cont_2comp-ipv4-geneve-gate-ovn/242/

Version-Release number of selected component (if applicable):
RHOS-17.0-RHEL-9-20220909.n.0
FDP 22.J (Didn't happen with non-FDP runs for this job)

How reproducible:
100% AFAIK
I'm working on deploying and reproducing this failure on a live environment for future debugging.

Steps to Reproduce:
1. Run this test 'test_ovn_dns_name_after_networker_reboot' on this job

Actual results:
Services fail to start up after reboot, test fails.

Expected results:
Services coming back up after reboot, test passing.

Additional info:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-networking-ovn-17.0_director-rhel-virthost-3cont_2comp-ipv4-geneve-gate-ovn/238/undercloud-0/home/stack/tempest-dir/tempest.log.gz

 |
 V

2022-10-17 15:49:47.839 321074 INFO tempest_helper_plugin.common.waiters [-] Node '88ed3b0d-aa0b-42fa-8d63-e1c52f642c94' reached state -> power:power on, provision:active, maintenance:False
2022-10-17 15:51:18.943 321074 INFO tempest.lib.common.rest_client [req-b7c9610b-c945-4bae-b0c8-7e938b68a803 ] Request (InternalDNSInterruptionsAdvancedTestOvn:test_ovn_dns_name_after_networker_reboot): 200 GET http://10.0.0.125:8774/v2.1/os-services 1.083s
2022-10-17 15:51:18.944 321074 DEBUG tempest.lib.common.rest_client [req-b7c9610b-c945-4bae-b0c8-7e938b68a803 ] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: None
    Response - Headers: {'date': 'Mon, 17 Oct 2022 15:51:17 GMT', 'server': 'Apache', 'content-length': '1614', 'openstack-api-version': 'compute 2.1', 'x-openstack-nova-api-version': '2.1', 'vary': 'OpenStack-API-Version,X-OpenStack-Nova-API-Version,Accept-Encoding', 'x-openstack-request-id': 'req-b7c9610b-c945-4bae-b0c8-7e938b68a803', 'x-compute-request-id': 'req-b7c9610b-c945-4bae-b0c8-7e938b68a803', 'content-type': 'application/json', 'connection': 'close', 'status': '200', 'content-location': 'http://10.0.0.125:8774/v2.1/os-services'}
        Body: b'{"services": [{"binary": "nova-conductor", "host": "controller-0.redhat.local", "id": 2, "zone": "internal", "status": "enabled", "state": "down", "updated_at": "2022-10-17T15:48:24.000000", "disabled_reason": null}, {"binary": "nova-scheduler", "host": "controller-0.redhat.local", "id": 8, "zone": "internal", "status": "enabled", "state": "down", "updated_at": "2022-10-17T15:48:25.000000", "disabled_reason": null}, {"binary": "nova-conductor", "host": "controller-1.redhat.local", "id": 26, "zone": "internal", "status": "enabled", "state": "up", "updated_at": "2022-10-17T15:51:08.000000", "disabled_reason": null}, {"binary": "nova-conductor", "host": "controller-2.redhat.local", "id": 29, "zone": "internal", "status": "enabled", "state": "up", "updated_at": "2022-10-17T15:51:16.000000", "disabled_reason": null}, {"binary": "nova-scheduler", "host": "controller-1.redhat.local", "id": 41, "zone": "internal", "status": "enabled", "state": "up", "updated_at": "2022-10-17T15:51:14.000000", "disabled_reason": null}, {"binary": "nova-scheduler", "host": "controller-2.redhat.local", "id": 53, "zone": "internal", "status": "enabled", "state": "up", "updated_at": "2022-10-17T15:51:08.000000", "disabled_reason": null}, {"binary": "nova-compute", "host": "compute-0.redhat.local", "id": 65, "zone": "nova", "status": "enabled", "state": "up", "updated_at": "2022-10-17T15:51:09.000000", "disabled_reason": null}, {"binary": "nova-compute", "host": "compute-1.redhat.local", "id": 68, "zone": "nova", "status": "enabled", "state": "up", "updated_at": "2022-10-17T15:51:09.000000", "disabled_reason": null}]}' _log_request_full /usr/lib/python3.9/site-packages/tempest/lib/common/rest_client.py:455
2022-10-17 15:51:21.024 321074 INFO tempest.lib.common.rest_client [req-d943a211-475a-4a3f-bc8e-d00b7382fafa ] Request (InternalDNSInterruptionsAdvancedTestOvn:test_ovn_dns_name_after_networker_reboot): 200 GET http://10.0.0.125:8774/v2.1/os-services 1.075s
2022-10-17 15:51:21.025 321074 DEBUG tempest.lib.common.rest_client [req-d943a211-475a-4a3f-bc8e-d00b7382fafa ] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: None
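
To make the response body above easier to read, here is a minimal Python sketch that lists the services not reporting state "up". The function name `down_services` and the trimmed `sample` payload are illustrative assumptions; only the field names (`binary`, `host`, `state`) come from the logged response. In the logged body, the only entries with state "down" are nova-conductor and nova-scheduler on controller-0.

```python
import json


def down_services(body: bytes) -> list[tuple[str, str]]:
    """Return (binary, host) pairs for services whose state is not "up".

    `body` is expected to have the same shape as the GET /v2.1/os-services
    response logged above: {"services": [{"binary": ..., "host": ..., "state": ...}]}.
    """
    services = json.loads(body)["services"]
    return [(s["binary"], s["host"]) for s in services if s["state"] != "up"]


# Trimmed sample (hypothetical subset of the logged response, same field names).
sample = (
    b'{"services": ['
    b'{"binary": "nova-conductor", "host": "controller-0.redhat.local", "state": "down"}, '
    b'{"binary": "nova-scheduler", "host": "controller-0.redhat.local", "state": "down"}, '
    b'{"binary": "nova-conductor", "host": "controller-1.redhat.local", "state": "up"}, '
    b'{"binary": "nova-compute", "host": "compute-0.redhat.local", "state": "up"}'
    b']}'
)

for binary, host in down_services(sample):
    print(f"{binary} on {host} is down")
```

The same function can be pointed at the full body from tempest.log once the `b'...'` wrapper is removed.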

Comment 1 Flavio Leitner 2022-10-27 15:01:34 UTC
Hi Maor,

Can you help me understand why this is an Open vSwitch issue?

The failure I see in the log is below:
"""
  tempest.lib.exceptions.TimeoutException: Request timed out
  Details: ServicesClient failed to reach within the required time (300 s).
"""

and that could be anything.

Thanks,
fbl
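
For context on the timeout quoted in Comment 1: a message like that is what a generic polling waiter produces when a condition never becomes true, regardless of the underlying cause. The sketch below is not tempest's actual waiter; `wait_for`, its parameters, and the 2-second interval are assumptions for illustration (the interval roughly matches the spacing between the os-services requests in the log excerpt, and 300 s is the limit quoted in the exception).

```python
import time
from typing import Callable


def wait_for(
    predicate: Callable[[], bool],
    timeout: float = 300.0,
    interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True on success and False on timeout. `clock` and `sleep`
    are injectable so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Example: a check that never succeeds, driven by a fake clock.
now = [0.0]
result = wait_for(
    lambda: False,
    timeout=10.0,
    interval=2.0,
    clock=lambda: now[0],
    sleep=lambda s: now.__setitem__(0, now[0] + s),
)
print("all services up" if result else "timed out")
```

A False result here says only that the condition was not met in time; it says nothing about which component kept the services down.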