Bug 1970353

Summary: osp16.1 Neutron api unresponsive after controller hard reboot
Product: Red Hat OpenStack
Reporter: pkomarov
Component: openstack-neutron
Assignee: Lucas Alvares Gomes <lmartins>
Status: CLOSED DUPLICATE
QA Contact: Eran Kuris <ekuris>
Severity: high
Docs Contact:
Priority: unspecified
Version: 16.1 (Train)
CC: chrisw, scohen
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-02-27 13:11:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description pkomarov 2021-06-10 10:52:43 UTC
Description of problem:
After a hard reboot of a controller, the Neutron API becomes unresponsive.

Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20210604.n.0


How reproducible:
100%

Steps to Reproduce:
1. Deploy the osp16.1 puddle: RHOS-16.1-RHEL-8-20210604.n.0
2. Hard reboot a controller
3. Try to create a network:
openstack network create ...

Actual results:

stderr:
Error while executing command: HttpException: 504, 504 Gateway Time-out: The server didn't respond in time.
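For CI triage it can help to pull the HTTP status code out of openstackclient error lines like the one above. A minimal sketch, assuming the `HttpException: <code>` format shown in the stderr here (this helper is hypothetical, not part of Tobiko or python-openstackclient):

```python
import re

# Matches openstackclient errors of the form "HttpException: 504, ..."
STATUS_RE = re.compile(r"HttpException:\s*(\d{3})")

def http_status_from_stderr(stderr: str):
    """Return the HTTP status code embedded in the error, or None."""
    match = STATUS_RE.search(stderr)
    return int(match.group(1)) if match else None

stderr = ("Error while executing command: HttpException: 504, "
          "504 Gateway Time-out: The server didn't respond in time.")
print(http_status_from_stderr(stderr))  # -> 504
```

A script scanning the job logs could use this to separate gateway timeouts (504) from other command failures.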


Additional info:
Two affected CI jobs:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/gate-sanity-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/188/

https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/gate-sanity-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/187/


Logs:

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/188/

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/187/


Tobiko test results analysis:

For both jobs, we can see that the resource-creation tests pass:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/188/infrared/.workspaces/workspace_2021-06-02_20-05-41/tobiko_faults/tobiko_faults_01_create_resources_scenario.html

This is before any disruptions.

During the disruption tests, we can see that only the first health-check test passes:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/188/infrared/.workspaces/workspace_2021-06-02_20-05-41/tobiko_faults/tobiko_faults_02_faults_faults.html

Look for the test:
tobiko/tests/faults/ha/test_cloud_recovery.py::DisruptTripleoNodesTest::test_0vercloud_health_check

So before any disruptions, all overcloud services, including Neutron, checked out OK.

Looking again at the faults tests:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/188/infrared/.workspaces/workspace_2021-06-02_20-05-41/tobiko_faults/tobiko_faults_02_faults_faults.html

we see that most tests fail, and most of the failed tests involve controller reboots.

Then we have the resource-verification tests:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/188/infrared/.workspaces/workspace_2021-06-02_20-05-41/tobiko_faults/tobiko_faults_03_verify_resources_scenario.html


These run after the disruptions, and they all fail with neutronclient timeout exceptions:

2021-06-03 05:29:55.237 889655 DEBUG neutronclient.v2_0.client [-] Error message: <html><body><h1>504 Gateway Time-out</h1>

The second type of error I see here is heatclient-related:

heatclient.exc.HTTPServiceUnavailable: ERROR: b'<html><body><h1>503 Service Unavailable</h1>\nNo server is available to handle this request.\n</body></html>\n'
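Both error bodies look like load-balancer responses rather than API tracebacks, which suggests the front end (HAProxy on the controllers) could not get an answer from the backend neutron-server/heat-api processes after the reboot. A small, hypothetical classifier for the two statuses seen in these jobs (an illustration, not code from any of the clients above):

```python
# Assumption: 503/504 here come from the gateway layer, not the API services.
GATEWAY_ERRORS = {
    503: "Service Unavailable: no backend server available to handle the request",
    504: "Gateway Time-out: a backend was selected but did not respond in time",
}

def classify(status: int) -> str:
    """Describe a status code as a gateway-layer failure, if it is one."""
    return GATEWAY_ERRORS.get(status, "not a gateway-layer error")

for code in (504, 503, 401):
    print(code, "->", classify(code))
```

Under that reading, the neutronclient 504s and the heatclient 503s would be two symptoms of the same underlying problem: the API backends never recovered after the controller hard reboot.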