Bug 1857743

Summary: NodePort stuck open within SDN after unidling
Product: OpenShift Container Platform
Reporter: Matthew Robson <mrobson>
Component: Networking
Assignee: Surya Seetharaman <surya>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
CC: aconstan, agk, cdc, nstielau
Version: 3.11.0
Target Release: 4.6.0
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Doc Text:
Cause: In the unidling proxy mode of openshift-sdn, the code assumed that a service's endpoints would always be deleted after the service itself. That ordering is not guaranteed, and when the endpoints were deleted before the service, this bug was triggered.
Consequence: Although the service and its endpoints were deleted, the NodePort was not released and remained stuck open in unidling proxy mode.
Fix: Cleanup no longer depends on the order in which the service and endpoints are deleted. All associated resources are now removed completely in unidling proxy mode, regardless of which object (service or endpoints) is deleted first. This was fixed in openshift-sdn 4.6 and backported to 3.11.
Result: The NodePort is released correctly.
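The ordering problem described in the Doc Text can be illustrated with a small model. This is a hypothetical sketch, not the openshift-sdn code: the type `unidlerProxy`, its handlers, the service key `demo/web` and the port 30080 are all invented for illustration. In "order-dependent" mode the NodePort is only released from the endpoints-delete handler, and only if the service is already gone; in the order-independent mode both handlers call the same cleanup, which releases the port once both objects are deleted.

```go
package main

import "fmt"

// serviceKey identifies a service as "namespace/name" (hypothetical type).
type serviceKey string

// unidlerProxy is a hypothetical, heavily simplified model of a proxier that
// holds NodePorts open for idled services. It is NOT the openshift-sdn code;
// it only illustrates the deletion-ordering problem from the Doc Text.
type unidlerProxy struct {
	orderDependent bool // true models the pre-fix behavior
	services       map[serviceKey]bool
	endpoints      map[serviceKey]bool
	nodePorts      map[serviceKey]int // NodePorts still held open on the host
}

func newUnidlerProxy(orderDependent bool) *unidlerProxy {
	return &unidlerProxy{
		orderDependent: orderDependent,
		services:       map[serviceKey]bool{},
		endpoints:      map[serviceKey]bool{},
		nodePorts:      map[serviceKey]int{},
	}
}

func (p *unidlerProxy) addService(k serviceKey, nodePort int) {
	p.services[k] = true
	p.endpoints[k] = true
	p.nodePorts[k] = nodePort
}

func (p *unidlerProxy) onServiceDelete(k serviceKey) {
	delete(p.services, k)
	// Order-dependent mode does nothing here: it relies on the endpoints
	// deletion arriving later to release the NodePort.
	if !p.orderDependent {
		p.cleanupIfGone(k)
	}
}

func (p *unidlerProxy) onEndpointsDelete(k serviceKey) {
	delete(p.endpoints, k)
	if p.orderDependent {
		// Assumes the service is already gone. If endpoints are deleted
		// first, this check fails and nothing releases the port later.
		if !p.services[k] {
			delete(p.nodePorts, k)
		}
		return
	}
	p.cleanupIfGone(k)
}

// cleanupIfGone releases the NodePort once both the service and its
// endpoints are gone, whichever deletion event arrives last.
func (p *unidlerProxy) cleanupIfGone(k serviceKey) {
	if !p.services[k] && !p.endpoints[k] {
		delete(p.nodePorts, k)
	}
}

func main() {
	for _, orderDependent := range []bool{true, false} {
		for _, endpointsFirst := range []bool{false, true} {
			p := newUnidlerProxy(orderDependent)
			p.addService("demo/web", 30080)
			if endpointsFirst {
				p.onEndpointsDelete("demo/web")
				p.onServiceDelete("demo/web")
			} else {
				p.onServiceDelete("demo/web")
				p.onEndpointsDelete("demo/web")
			}
			fmt.Printf("orderDependent=%v endpointsFirst=%v open NodePorts=%d\n",
				orderDependent, endpointsFirst, len(p.nodePorts))
		}
	}
}
```

In this model, only the order-dependent run with the endpoints deleted first leaves a NodePort open, which mirrors the "stuck open" symptom. Funnelling both delete events into one idempotent cleanup check is what removes the dependency on event order.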
Clones: 1870064
Last Closed: 2020-10-27 16:15:14 UTC
Type: Bug
Bug Blocks: 1870064, 1870303    

Description Matthew Robson 2020-07-16 13:19:26 UTC
Description of problem:

The customer uses NodePort services to provide access to applications through their DMZ, for firewall purposes.

One of their applications was idled and never became reachable again when traffic resumed.

The pod returned to service but was not accessible through the NodePort. It was accessible directly through its endpoints, through a non-NodePort service or Route, and through a newly created NodePort service.

The NodePort was correctly bound and listening on the host, but was not receiving any traffic.

Deleting the NodePort service removed the service from etcd, but did not release the in-use port on the host.


Version-Release number of selected component (if applicable):
3.10


How reproducible:
This is the first time this has been observed. The customer has 60 NodePort services.


Steps to Reproduce:
1. TBD

Actual results:
The service is inaccessible through its NodePort.

Expected results:
The service should become reachable through its NodePort again after unidling.

Additional info:

Comment 21 zhaozhanqi 2020-08-26 09:03:34 UTC
Verified this bug on 4.6.0-0.nightly-2020-08-20-234448

Followed the steps in comment 9.

Comment 23 errata-xmlrpc 2020-10-27 16:15:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196