Bug 2117182 - HA testing including hard reboot controllers and post OSP update - network resources creation can fail with Gateway Time-out
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: osp-director-operator-container
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: z4
Target Release: ---
Assignee: Jakub Libosvar
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-08-10 09:16 UTC by pkomarov
Modified: 2022-11-16 14:14 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-16 14:14:19 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-18139 0 None None None 2022-08-10 09:27:29 UTC

Comment 2 pkomarov 2022-08-15 20:22:25 UTC
Changing the main DFG to Networking, since the issue is specifically network related. It reproduces as follows:

# Running boot_oc_workload.sh on the openstackclient breaks at:

+ openstack router set --external-gateway public internal_net_6804d6651b_router
HttpException: 504: Server Error for url: https://overcloud.osptest.test.metalkube.org:13696/v2.0/routers?name=internal_net_6804d6651b_router, 504 Gateway Time-out: The server didn't respond in time.

I couldn't find the root cause in the neutron logs yet:
[cloud-admin@openstackclient ~]$ ansible controller -b -mshell -a 'grep -ri internal_net_6804d6651b_router /var/log'
controller-1 | CHANGED | rc=0 >>
Binary file /var/log/journal/e3929330acc880a10e2702906537133c/system.journal matches
/var/log/containers/neutron/server.log:2022-08-15 15:00:59.995 15 DEBUG neutron.api.v2.base [req-91d4669f-26af-4722-8a54-2abeecf3575c 63b79e343f904c12a331da5e56e088e3 084edfb8d4e9430f90ab05f9f840373f - default default] Request body: {'router': {'admin_state_up': True, 'name': 'internal_net_6804d6651b_router'}} prepare_request_body /usr/lib/python3.6/site-packages/neutron/api/v2/base.py:719
/var/log/containers/neutron/server.log:2022-08-15 15:01:00.567 15 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): AddLRouterCommand(name=neutron-db9f1c9d-3611-43b8-93ee-903449e30f11, columns={'external_ids': {'neutron:router_name': 'internal_net_6804d6651b_router', 'neutron:gw_port_id': '', 'neutron:revision_number': '1', 'neutron:availability_zone_hints': ''}, 'enabled': True, 'options': {'always_learn_from_arp_request': 'false', 'dynamic_neigh_routers': 'true'}}, may_exist=True) do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:89
/var/log/messages:Aug 15 15:26:58 controller-1 platform-python[581006]: ansible-command Invoked with _raw_params=grep -ri internal_net_6804d6651b_router /var/log _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
controller-2 | CHANGED | rc=0 >>
Binary file /var/log/journal/e3929330acc880a10e2702906537133c/system.journal matches
/var/log/containers/neutron/server.log:2022-08-15 15:01:16.833 15 INFO neutron.wsgi [req-129eff0e-51f4-40e1-97d5-cc82dd722828 63b79e343f904c12a331da5e56e088e3 084edfb8d4e9430f90ab05f9f840373f - default default] 172.17.0.20 "GET /v2.0/routers/internal_net_6804d6651b_router HTTP/1.1" status: 404  len: 311 time: 0.1050549
/var/log/containers/neutron/server.log:2022-08-15 15:01:39.799 17 INFO neutron.wsgi [req-45c683be-665a-45ea-975e-0cf01f6a5cef 63b79e343f904c12a331da5e56e088e3 084edfb8d4e9430f90ab05f9f840373f - default default] 172.17.0.20 "GET /v2.0/routers/internal_net_6804d6651b_router HTTP/1.1" status: 404  len: 311 time: 0.2596943
/var/log/messages:Aug 15 15:26:58 controller-2 platform-python[974210]: ansible-command Invoked with _raw_params=grep -ri internal_net_6804d6651b_router /var/log _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
controller-0 | CHANGED | rc=0 >>
Binary file /var/log/journal/e3929330acc880a10e2702906537133c/system.journal matches
/var/log/containers/haproxy/haproxy.log:Aug 15 15:01:16 controller-0 haproxy[12]: 10.0.0.251:46100 [15/Aug/2022:15:01:16.727] neutron~ neutron/controller-2.internalapi.osptest.test.metalkube.org 0/0/1/108/109 404 311 - - ---- 3/1/0/1/0 0/0 "GET /v2.0/routers/internal_net_6804d6651b_router HTTP/1.1"
/var/log/containers/haproxy/haproxy.log:Aug 15 15:01:28 controller-0 haproxy[12]: 10.0.0.251:46100 [15/Aug/2022:15:01:16.842] neutron~ neutron/controller-0.internalapi.osptest.test.metalkube.org 0/0/0/11224/11224 200 652 - - ---- 1/1/0/1/0 0/0 "GET /v2.0/routers?name=internal_net_6804d6651b_router HTTP/1.1"
/var/log/containers/haproxy/haproxy.log:Aug 15 15:01:39 controller-0 haproxy[12]: 10.0.0.251:47478 [15/Aug/2022:15:01:39.538] neutron~ neutron/controller-2.internalapi.osptest.test.metalkube.org 0/0/2/261/263 404 311 - - ---- 3/1/0/1/0 0/0 "GET /v2.0/routers/internal_net_6804d6651b_router HTTP/1.1"
/var/log/containers/haproxy/haproxy.log:Aug 15 15:03:39 controller-0 haproxy[12]: 10.0.0.251:47478 [15/Aug/2022:15:01:39.827] neutron~ neutron/controller-0.internalapi.osptest.test.metalkube.org 0/0/2/-1/120004 504 194 - - sH-- 1/1/0/0/0 0/0 "GET /v2.0/routers?name=internal_net_6804d6651b_router HTTP/1.1"
/var/log/containers/neutron/server.log:2022-08-15 15:01:28.066 16 INFO neutron.wsgi [req-607d6aab-289d-4596-9e2c-46ba539870c3 63b79e343f904c12a331da5e56e088e3 084edfb8d4e9430f90ab05f9f840373f - default default] 172.17.0.20 "GET /v2.0/routers?name=internal_net_6804d6651b_router HTTP/1.1" status: 200  len: 652 time: 11.2204776
/var/log/containers/neutron/server.log:2022-08-15 15:04:28.133 16 INFO neutron.wsgi [req-983eedc0-0851-43fd-beba-0c75cc3a4eb3 63b79e343f904c12a331da5e56e088e3 084edfb8d4e9430f90ab05f9f840373f - default default] 172.17.0.20 "GET /v2.0/routers?name=internal_net_6804d6651b_router HTTP/1.1" status: 200  len: 0 time: 168.3015454
/var/log/containers/neutron/server.log:2022-08-15 15:04:48.795 22 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): CheckRevisionNumberCommand(name=neutron-db9f1c9d-3611-43b8-93ee-903449e30f11, resource={'id': 'db9f1c9d-3611-43b8-93ee-903449e30f11', 'name': 'internal_net_6804d6651b_router', 'tenant_id': '084edfb8d4e9430f90ab05f9f840373f', 'admin_state_up': True, 'status': 'ACTIVE', 'external_gateway_info': None, 'gw_port_id': None, 'description': '', 'availability_zones': [], 'distributed': False, 'ha': False, 'ha_vr_id': 0, 'availability_zone_hints': [], 'routes': [], 'tags': [], 'created_at': '2022-08-15T15:01:00Z', 'updated_at': '2022-08-15T15:01:31Z', 'revision_number': 2, 'project_id': '084edfb8d4e9430f90ab05f9f840373f'}, resource_type=routers, if_exists=True) do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:89
/var/log/containers/neutron/server.log:2022-08-15 15:04:48.796 22 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=1): UpdateLRouterCommand(name=neutron-db9f1c9d-3611-43b8-93ee-903449e30f11, columns={'external_ids': {'neutron:router_name': 'internal_net_6804d6651b_router', 'neutron:gw_port_id': '', 'neutron:revision_number': '2', 'neutron:availability_zone_hints': ''}, 'enabled': True}, if_exists=True) do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:89
/var/log/messages:Aug 15 15:26:58 controller-0 platform-python[932056]: ansible-command Invoked with _raw_params=grep -ri internal_net_6804d6651b_router /var/log _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
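
The haproxy.log entries above carry most of the signal. In the line logged at 15:03:39 the fourth timer is -1, the total time is 120004 ms, and the termination state is sH--, which the HAProxy documentation describes as a server-side timeout while the proxy was waiting for response headers. Below is a minimal Python sketch for extracting those fields; it assumes the default HTTP log layout seen in the lines above, and the function names and the regular expression are illustrative only, not part of any tool used in this bug.

```python
import re

# Two entries copied from the haproxy.log output above (file-name prefix dropped).
SLOW_OK = ('Aug 15 15:01:28 controller-0 haproxy[12]: 10.0.0.251:46100 '
           '[15/Aug/2022:15:01:16.842] neutron~ '
           'neutron/controller-0.internalapi.osptest.test.metalkube.org '
           '0/0/0/11224/11224 200 652 - - ---- 1/1/0/1/0 0/0 '
           '"GET /v2.0/routers?name=internal_net_6804d6651b_router HTTP/1.1"')
TIMED_OUT = ('Aug 15 15:03:39 controller-0 haproxy[12]: 10.0.0.251:47478 '
             '[15/Aug/2022:15:01:39.827] neutron~ '
             'neutron/controller-0.internalapi.osptest.test.metalkube.org '
             '0/0/2/-1/120004 504 194 - - sH-- 1/1/0/0/0 0/0 '
             '"GET /v2.0/routers?name=internal_net_6804d6651b_router HTTP/1.1"')

# Assumed layout: frontend backend/server <five timers> status bytes
# <two cookie fields> <4-char termination state> ... "request"
LINE_RE = re.compile(
    r'(?P<frontend>\S+) (?P<backend>[^/\s]+)/(?P<server>\S+) '
    r'(?P<timers>-?\d+(?:/-?\d+){4}) (?P<status>\d{3}) (?P<bytes>\d+) '
    r'\S+ \S+ (?P<term>\S{4}) .*"(?P<request>[^"]*)"'
)


def parse_haproxy_line(line):
    """Return the interesting fields of one HTTP log line, or None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    timers = [int(value) for value in match.group("timers").split("/")]
    return {
        "server": match.group("server"),
        "response_ms": timers[3],   # -1 means no response headers were received
        "total_ms": timers[4],
        "status": int(match.group("status")),
        "termination": match.group("term"),
        "request": match.group("request"),
    }


def is_server_header_timeout(entry):
    """True when the termination state starts with 'sH' (server timeout, waiting for headers)."""
    return entry["termination"].startswith("sH")


if __name__ == "__main__":
    for sample in (SLOW_OK, TIMED_OUT):
        entry = parse_haproxy_line(sample)
        print(entry["status"], entry["total_ms"], entry["termination"],
              is_server_header_timeout(entry))
```

Run against a full haproxy.log, a filter on `is_server_header_timeout` would list every request where the backend neutron-server failed to answer within the proxy's server timeout, which here appears to be about 120 seconds.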

Comment 3 pkomarov 2022-08-22 12:03:51 UTC
Closing, as this issue doesn't reproduce on a more powerful bare-metal host (96 vs. 60 cores, 250G of memory).

Comment 4 Udi Shkalim 2022-09-04 16:45:15 UTC
Re-opening this one, as we see the same error in HA testing.
The cluster regains activity only after the OVN master node/controller is rebooted.
IMO, this is not related to hardware.


Creating network internal_net_3244a4dee9
+ openstack network create internal_net_3244a4dee9
Error while executing command: HttpException: 504, The server didn't respond in time.: 504 Gateway Time-out
+ echo 'Creating subnet internal_net_3244a4dee9_subnet'
Creating subnet internal_net_3244a4dee9_subnet
+ openstack subnet create --subnet-range 192.168.0.0/24 --allocation-pool start=192.168.0.10,end=192.168.0.100 --gateway 192.168.0.254 --dns-nameserver 172.22.0.1 --network internal_net_3244a4dee9 internal_net_3244a4dee9_subnet
HttpException: 504: Server Error for url: https://overcloud.osptest.test.metalkube.org:13696/v2.0/subnets, 504 Gateway Time-out: The server didn't respond in time.
+ echo 'Add subnet internal_net_3244a4dee9_subnet to router internal_net_3244a4dee9_router'
Add subnet internal_net_3244a4dee9_subnet to router internal_net_3244a4dee9_router
+ openstack router add subnet internal_net_3244a4dee9_router internal_net_3244a4dee9_subnet
No Router found for internal_net_3244a4dee9_router
+ echo 'Set external-gateway for internal_net_3244a4dee9_router'
Set external-gateway for internal_net_3244a4dee9_router
+ openstack router set --external-gateway public internal_net_3244a4dee9_router
No Router found for internal_net_3244a4dee9_router

## create security group
openstack security group list | grep ${SECGROUP_NAME}
+ grep allow-icmp-ssh-3244a4dee9
+ openstack security group list
if [ $? -ne 0 ]; then
    echo "Creating security group ${SECGROUP_NAME}"
    openstack security group create ${SECGROUP_NAME}

    echo "Creating rules for ports 22,80,443 in security group ${SECGROUP_NAME}"
    openstack security group rule create --proto icmp ${SECGROUP_NAME}
    openstack security group rule create --proto tcp --dst-port 22 ${SECGROUP_NAME}
    openstack security group rule create --proto tcp --dst-port 80 ${SECGROUP_NAME}
    openstack security group rule create --proto tcp --dst-port 443 ${SECGROUP_NAME}
fi
+ '[' 1 -ne 0 ']'
+ echo 'Creating security group allow-icmp-ssh-3244a4dee9'
Creating security group allow-icmp-ssh-3244a4dee9
+ openstack security group create allow-icmp-ssh-3244a4dee9
Error while executing command: HttpException: 504, 504 Gateway Time-out: The server didn't respond in time.
+ echo 'Creating rules for ports 22,80,443 in security group allow-icmp-ssh-3244a4dee9'
Creating rules for ports 22,80,443 in security group allow-icmp-ssh-3244a4dee9
+ openstack security group rule create --proto icmp allow-icmp-ssh-3244a4dee9
Error while executing command: HttpException: 504, 504 Gateway Time-out: The server didn't respond in time.
+ openstack security group rule create --proto tcp --dst-port 22 allow-icmp-ssh-3244a4dee9
Error while executing command: HttpException: 504, The server didn't respond in time.: 504 Gateway Time-out
+ openstack security group rule create --proto tcp --dst-port 80 allow-icmp-ssh-3244a4dee9
Error while executing command: HttpException: 504, The server didn't respond in time.: 504 Gateway Time-out
+ openstack security group rule create --proto tcp --dst-port 443 allow-icmp-ssh-3244a4dee9
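
The trace above shows the workload script carrying on after the first 504, so the later "No Router found" errors are a consequence of the earlier failures rather than separate problems. One way to make such a run easier to read is to wrap each CLI call so that it retries a few times and then aborts. The Python sketch below is only an illustration under that assumption: `run_with_retry` and its parameters are hypothetical and are not part of boot_oc_workload.sh.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay=5.0, runner=subprocess.run):
    """Run cmd (a list of arguments); retry on a non-zero exit code.

    Raises RuntimeError once all attempts have failed, so the caller stops
    instead of continuing with resources that were never created.
    `runner` is injectable only to make the helper testable.
    """
    result = None
    for attempt in range(1, attempts + 1):
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        if attempt < attempts:
            # Linear back-off between attempts.
            time.sleep(delay * attempt)
    raise RuntimeError(
        f"{' '.join(cmd)} failed after {attempts} attempts: "
        f"{result.stderr.strip()}"
    )


# Example call, using a command that appears in the trace above:
# run_with_retry(["openstack", "security", "group", "list"])
```

A caveat on retries: a 504 from HAProxy does not prove that the request failed. In the logs from comment 2, neutron-server eventually logged status 200 after 168.3 seconds for what appears to be the same request that HAProxy gave up on at 120 seconds. Blindly retrying a create call could therefore produce duplicates, so a wrapper like this is safer for lookups, or when combined with an existence check before each retry.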

Comment 19 Udi Shkalim 2022-11-16 14:14:19 UTC
While trying to set up a reproducer environment, it seems that the problem is fixed: 5/5 HA runs are passing.

