1542715 – [RFE] Improve pod termination to minimize application downtime

Keywords:
Status:	CLOSED DUPLICATE of bug 1318680
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	RFE
Sub Component:
Version:	3.5.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Derek Carr
QA Contact:	Xiaoli Tian
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-02-06 21:22 UTC by emahoney
Modified:	2023-09-14 04:16 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-04-20 20:00:38 UTC
Target Upstream Version:
Embargoed:

Description emahoney 2018-02-06 21:22:55 UTC

Description of problem: With the current HAProxy handling of a pod SIGTERM, there is a period during which a pod does not serve traffic and end users get an HTTP 503 response. We have done quite a bit of testing using the methods below and have come up with what we consider a workaround. The end goal of this BZ is to minimize the time between when a pod receives SIGKILL and when its replica starts serving traffic.

CURRENT WORKAROUND: Preliminary testing using 'sleep 60' within the preStop hook (below) appears to be enough to delay SIGTERM and allow Terminating pods to be removed from HAProxy before listening connections are closed.

cat test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: abc-test
spec:
  containers:
    - image: xyz-project/abc-test:latest
      name: prestop-test
      lifecycle:
        preStop:
          exec:
            command:
            - "/tmp/prestop.sh"
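
The report does not show the contents of /tmp/prestop.sh. A minimal sketch of what such a hook script might look like, assuming it does nothing but the 'sleep 60' mentioned above, follows. The function name prestop_delay and the PRESTOP_DELAY variable are hypothetical and only make the delay adjustable.

```shell
#!/bin/bash
# Hypothetical contents for /tmp/prestop.sh; the report does not show the real script.

# Block for the given number of seconds (default 60, the value used in the testing above).
prestop_delay() {
    local seconds="${1:-60}"
    echo "preStop: waiting ${seconds}s before the container receives SIGTERM"
    sleep "$seconds"
}

# Run the delay only when executed as prestop.sh, so the function can also be sourced elsewhere.
# PRESTOP_DELAY is an assumed, optional override for the delay.
case "${0##*/}" in
    prestop.sh) prestop_delay "${PRESTOP_DELAY:-60}" ;;
esac
```

Kubernetes counts the time spent in a preStop hook against the pod's terminationGracePeriodSeconds, which defaults to 30 seconds, so a 60-second sleep would presumably need that value raised in the pod spec to run to completion.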


Version-Release number of selected component (if applicable):
Tested in 3.2 and 3.5

How reproducible:
So far, both tested versions exhibited the behavior. 

Steps to Reproduce:
1. Deploy httpd-example template 
    - Create route for this service.
2. oc rsh to the pods and modify the index.html that they serve such that podA serves 'podA' and podB serves 'podB'.


3. Create a bash script that curls the route endpoint every second for 120/240 seconds.
      
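~~~
#!/bin/bash
# Address of the service under test (column 2 of `oc get svc`).
X=$(oc get svc | grep docker-registry | awk '{print $2}')
echo "$X"
for i in {1..30}; do
    # Record a timestamp and the response body for each request.
    content="$(date; curl -s "$X:5000")"
    echo "$content $i" >> output.txt
    # On the 4th request, delete the matching pods to trigger termination.
    if [ "$i" -eq 4 ]; then
        echo "Deleting" >> output.txt
        oc delete po $(oc get pods | grep docker-registry | awk '{print $1}')
        echo "deleted" >> output.txt
    fi
done
~~~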

4. Create a second replica for the svc in the RC (i.e., wants 2)

5. Run the script from a node with oc. The script sends 30 requests, and around 20 of them get error messages. We have tried other scenarios as well: in one, we sent 1000 requests over a minute with the help of crontab, and around 200 of them got an error message.
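
Counting failed requests by reading output.txt is error-prone, so one possible way to quantify the error window is sketched below. It is not the script used for the results above. The function names probe_route and summarize_probe, the log format, and the 2-second curl timeout are all assumptions made for illustration.

```shell
#!/bin/bash
# Sketch only: probe_route, summarize_probe and the log format are hypothetical.

# Request the URL once per interval and append "<epoch-seconds> <http-status>" to the log.
probe_route() {
    local url="$1"
    local count="$2"
    local interval="$3"
    local log="$4"
    local i=1
    local code
    while [ "$i" -le "$count" ]; do
        # -w '%{http_code}' prints only the status; curl reports 000 when no response arrives.
        code=$(curl -s -o /dev/null --max-time 2 -w '%{http_code}' "$url")
        echo "$(date +%s) $code" >> "$log"
        sleep "$interval"
        i=$((i + 1))
    done
}

# Print total requests, non-200 responses, and the longest run of consecutive failures.
summarize_probe() {
    awk '
        { total++ }
        $2 != "200" { errors++; run++; if (run > longest) longest = run; next }
        { run = 0 }
        END { printf "total=%d errors=%d longest_failure_run=%d\n", total, errors, longest }
    ' "$1"
}
```

For example, running probe_route against the route URL with a count of 120 and an interval of 1, then passing the log to summarize_probe, would give a longest failure run that roughly approximates the downtime in seconds as seen by the client.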

Actual results: Currently, without the preStop hook, it looks like we can get down to a window of around 2s in which podA is no longer serving and podB is not yet serving requests.


Expected results: 0 downtime from the client perspective.


Additional info:

Comment 2 Avesh Agarwal 2018-02-06 22:13:02 UTC

Derek,

Wondering what should be the target release for an RFE? Please help.

Comment 19 Eric Rich 2019-04-20 20:00:38 UTC


*** This bug has been marked as a duplicate of bug 1318680 ***

Comment 21 Red Hat Bugzilla 2023-09-14 04:16:21 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
