Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2091780

Summary: unidling of resources via traffic to an idled service shouldn't cost a refused connection
Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Version: 4.11
Target Release: 4.13.0
Status: CLOSED DEFERRED
Severity: medium
Priority: medium
Hardware: Unspecified
OS: Unspecified
Type: Bug
Reporter: Anurag saxena <anusaxen>
Assignee: Andrea Panattoni <apanatto>
QA Contact: Anurag saxena <anusaxen>
CC: akaris, surya
Last Closed: 2023-03-09 01:46:06 UTC

Comment 2 Andrea Panattoni 2023-01-20 14:52:37 UTC
Found the root cause of the problem. 

TL;DR: OVNk reconfigures the Load_Balancer with `reject=true` too early.
The first SYN packet sent by the client triggers the unidling process, and the second (retransmitted) SYN packet triggers a TCP reject.

Proposed solutions:
1. The unidling controller should wait for pods to be ready before removing the `idling.alpha.openshift.io/idled-at` annotation.
2. OVNk should wait for endpoints to be ready before setting `reject=true` on the Load_Balancer.

Details:

I reproduced the issue with the attached scenario by running the following commands:

```
$ oc idle -n ocpbugsm-44873 http-service
The service "ocpbugsm-44873/http-service" has been marked as idled 
The service will unidle ReplicationController "ocpbugsm-44873/http-rc" to 2 replicas once it receives traffic 
ReplicationController "ocpbugsm-44873/http-rc" has been idled 

$ oc exec -n ocpbugsm-44873 client-pod -- curl -v http://http-service.ocpbugsm-44873:8000 2>&1
* Rebuilt URL to: http://http-service.ocpbugsm-44873:8000/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 172.30.219.139...
* TCP_NODELAY set
* connect to 172.30.219.139 port 8000 failed: Connection refused
* Failed to connect to http-service.ocpbugsm-44873 port 8000: Connection refused
* Closing connection 0
curl: (7) Failed to connect to http-service.ocpbugsm-44873 port 8000: Connection refused
command terminated with exit code 7
```

The tcpdump capture on the client pod side shows that the second SYN packet is the one that gets rejected:

```
11:21:12.479752 IP 10.131.0.24.48864 > 172.30.219.139.irdmi: Flags [S], seq 4268499129, win 26583, options [mss 8861,sackOK,TS val 3706051379 ecr 0,nop,wscale 7], length 0
11:21:13.491617 IP 10.131.0.24.48864 > 172.30.219.139.irdmi: Flags [S], seq 4268499129, win 26583, options [mss 8861,sackOK,TS val 3706052391 ecr 0,nop,wscale 7], length 0
11:21:13.492365 IP 172.30.219.139.irdmi > 10.131.0.24.48864: Flags [R.], seq 0, ack 4268499130, win 0, length 0
```

And from ovnkube-master logs:

```
2023-01-20T11:21:12.489634234Z I0120 11:21:12.489583       1 event.go:285] Event(v1.ObjectReference{Kind:"Service", Namespace:"ocpbugsm-44873", Name:"http-service", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NeedPods' The service http-service needs pods
2023-01-20T11:21:12.556406348Z I0120 11:21:12.556360       1 services_controller.go:248] Processing sync for service ocpbugsm-44873/http-service
2023-01-20T11:21:12.556406348Z I0120 11:21:12.556395       1 kube.go:315] Getting endpoints for slice ocpbugsm-44873/http-service-5sb47
2023-01-20T11:21:12.556452629Z I0120 11:21:12.556400       1 kube.go:359] LB Endpoints for ocpbugsm-44873/http-service are: [] / [] on port: 0
2023-01-20T11:21:12.556452629Z I0120 11:21:12.556426       1 services_controller.go:312] Service ocpbugsm-44873/http-service has 1 cluster-wide and 0 per-node configs, making 1 and 0 load balancers
2023-01-20T11:21:12.556764814Z I0120 11:21:12.556739       1 client.go:783]  "msg"="transacting operations" "database"="OVN_Northbound" 
                                                            "operations"="[{Op:update Table:Load_Balancer Row:map[external_ids:{GoMap:map[k8s.ovn.org/kind:Service k8s.ovn.org/owner:ocpbugsm-44873/http-service]} 
                                                            name:Service_ocpbugsm-44873/http-service_TCP_cluster 
                                                            options:{GoMap:map[event:false reject:true skip_snat:false]} 
                                                            protocol:{GoSet:[tcp]} selection_fields:{GoSet:[]} vips:{GoMap:map[172.30.219.139:8000:]}] 
                                                            Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {cb0e516e-41e3-4e75-bdff-3329887997f5}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"
2023-01-20T11:21:12.559968555Z I0120 11:21:12.559925       1 services_controller.go:252] Finished syncing service http-service on namespace ocpbugsm-44873 : 3.571457ms
2023-01-20T11:21:12.589126420Z I0120 11:21:12.589078       1 obj_retry.go:1294] Creating *v1.Pod ocpbugsm-44873/http-rc-mp688 took: 460ns
2023-01-20T11:21:12.589181661Z I0120 11:21:12.589144       1 obj_retry.go:1294] Creating *factory.egressIPPod ocpbugsm-44873/http-rc-mp688 took: 32.59µs
2023-01-20T11:21:12.596951675Z I0120 11:21:12.596892       1 obj_retry.go:1411] Updating *v1.Pod ocpbugsm-44873/http-rc-mp688
...
```

Between the two SYN packets:
- ovnk catches the empty_lb event from OVN and raises a `The service http-service needs pods` k8s event
- The unidling controller removes the `...idled-at` annotation from the service
- ovnk reconfigures the load balancer with reject=true

Comment 3 Shiftzilla 2023-03-09 01:46:06 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9831