Bug 1542715
| Summary: | [RFE] Improve pod termination to minimize application downtime | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | emahoney |
| Component: | RFE | Assignee: | Derek Carr <decarr> |
| Status: | CLOSED DUPLICATE | QA Contact: | Xiaoli Tian <xtian> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.5.0 | CC: | antgarci, aos-bugs, decarr, erich, hgomes, jmalde, jokerman, mmccomas, nbhatt |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-04-20 20:00:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Derek, wondering what should be the target release for an RFE? Please help.

*** This bug has been marked as a duplicate of bug 1318680 ***

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
Description of problem:

With the current HAProxy handling of a pod SIGTERM, there is a window during which a pod no longer serves traffic and end users receive HTTP 503 responses. We have done quite a bit of testing using the methods below and have come up with what we consider a workaround. The end goal of this BZ is to minimize the window between when a pod receives SIGTERM and when its replica starts serving traffic.

CURRENT WORKAROUND:

Preliminary testing using 'sleep 60' within the preStop hook (below) appears to be enough to delay SIGTERM and allow Terminating pods to be removed from HAProxy before listening connections are closed.

~~~
$ cat test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: abc-test
spec:
  containers:
  - image: xyz-project/abc-test:latest
    name: prestop-test
    lifecycle:
      preStop:
        exec:
          command:
          - "/tmp/prestop.sh"
~~~

Version-Release number of selected component (if applicable):

Tested in 3.2 and 3.5

How reproducible:

So far, both tested versions exhibit the behavior.

Steps to Reproduce:
1. Deploy the httpd-example template and create a route for the service.
2. oc rsh into the pods and modify the index.html they serve, so that podA serves 'podA' and podB serves 'podB'.
3. Create a bash script that curls the endpoint every second (we tested runs of 120 and 240 seconds; the example below issues 30 requests and deletes a pod partway through):
~~~
#!/bin/bash
# Cluster IP of the service (second column of 'oc get svc').
X=$(oc get svc | grep docker-registry | awk '{print $2}')
echo "$X"
for i in {1..30}; do
    content="$(date; curl -s "$X":5000)"
    echo "$content $i" >> output.txt
    if [ "$i" -eq "4" ]; then
        echo "Deleting" >> output.txt
        oc delete po $(oc get pods | grep docker-registry | awk '{print $1}')
        echo "deleted" >> output.txt
    fi
    sleep 1   # one request per second
done
~~~
4. Add a second replica for the service in the RC (i.e., desired count of 2).
5. Run the script from a node with oc.

The script issues 30 requests, and around 20 of them return error messages. We have tried other scenarios as well; in one, we issued 1000 requests by running the script for a minute via crontab, and around 200 of them got error messages.

Actual results:

Currently, without the preStop hook, it looks like we can get the window down to around 2s during which podA is no longer serving and podB does not yet serve requests.

Expected results:

0 downtime from the client perspective.

Additional info:
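For reference, a minimal self-contained sketch of the workaround described above. Assumptions on our part: the container image ships /bin/sh, the report's /tmp/prestop.sh presumably just wraps the 'sleep 60' mentioned above, and the 90-second grace period is an illustrative value. The point of raising terminationGracePeriodSeconds is that the default grace period is 30s, so a 60s preStop sleep would otherwise be cut short by SIGKILL.

~~~
apiVersion: v1
kind: Pod
metadata:
  name: abc-test
spec:
  # Assumption: the grace period must exceed the preStop sleep, or the
  # kubelet kills the container before the sleep finishes (default: 30s).
  terminationGracePeriodSeconds: 90
  containers:
  - image: xyz-project/abc-test:latest   # image name taken from the report
    name: prestop-test
    lifecycle:
      preStop:
        exec:
          # Inline sleep instead of a script baked into the image. The kubelet
          # runs this before delivering SIGTERM, giving the router time to
          # drop the Terminating pod from its backend set.
          command: ["/bin/sh", "-c", "sleep 60"]
~~~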
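One way to observe what the delay buys (a sketch of ours, not from the original report; the docker-registry service name is carried over from the script above): a deleted pod is removed from the service's endpoints as soon as it enters Terminating, while the container itself keeps serving until the preStop sleep ends.

~~~
# Terminal 1: watch the endpoint list; the pod's IP should disappear
# as soon as the pod enters Terminating, before the preStop sleep finishes.
oc get endpoints docker-registry --watch

# Terminal 2: delete one pod and watch it linger in Terminating for ~60s.
oc delete po $(oc get pods | grep docker-registry | awk '{print $1}' | head -1)
oc get pods --watch
~~~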