Bug 1573207
Summary: | HTTP requests failing during deployment scaleup and scaledown | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Sudarshan Chaudhari <suchaudh>
Component: | Networking | Assignee: | Ben Bennett <bbennett>
Networking sub component: | router | QA Contact: | zhaozhanqi <zzhao>
Status: | CLOSED NOTABUG | Docs Contact: |
Severity: | medium | |
Priority: | unspecified | CC: | aos-bugs, public, vcorrea
Version: | 3.7.0 | |
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-04-30 17:22:08 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Sudarshan Chaudhari
2018-04-30 13:37:14 UTC
I'm the reporter of this issue via a support case.

> 1. Create a new application "hello-openshift"

This is not entirely correct. I supplied a full configuration which contains readiness and liveness probes.

If I had to guess what's happening at a high level, it's that the application router is still forwarding requests to terminating pods, or that the pods are sent the signal to terminate before the request has been handled. I would expect a deployment to wait until all outstanding requests have been handled (within the termination grace period) and to ensure that no traffic is directed at a pod destined for termination.

My reproduction code in full (hostname replaced):

---
ns=poc-rolling-updates

oc new-project --skip-config-write "$ns"

oc -n "$ns" apply -f - <<'EOF'
apiVersion: v1
kind: List
metadata: {}
items:
- apiVersion: v1
  kind: DeploymentConfig
  metadata:
    name: deployment-example
  spec:
    replicas: 10
    revisionHistoryLimit: 2
    selector:
      app: deployment-example
      deploymentconfig: deployment-example
    strategy:
      activeDeadlineSeconds: 21600
      resources: {}
      rollingParams:
        intervalSeconds: 1
        maxSurge: 1
        maxUnavailable: 0
        timeoutSeconds: 600
        updatePeriodSeconds: 1
      type: Rolling
    template:
      metadata:
        annotations: null
        creationTimestamp: null
        labels:
          app: deployment-example
          deploymentconfig: deployment-example
      spec:
        containers:
        - image: openshift/hello-openshift:latest
          imagePullPolicy: Always
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 8080
              scheme: HTTP
            initialDelaySeconds: 1
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: deployment-example
          ports:
          - containerPort: 8080
            protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 8080
              scheme: HTTP
            initialDelaySeconds: 1
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        terminationGracePeriodSeconds: 30
    test: false
    triggers: []
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      openshift.io/generated-by: OpenShiftNewApp
    creationTimestamp: null
    labels:
      app: deployment-example
    name: deployment-example
  spec:
    ports:
    - name: 8080-tcp
      port: 8080
      protocol: TCP
      targetPort: 8080
    selector:
      app: deployment-example
      deploymentconfig: deployment-example
    sessionAffinity: None
    type: ClusterIP
- apiVersion: v1
  kind: Route
  metadata:
    name: deployment-example
  spec:
    port:
      targetPort: 8080-tcp
    to:
      kind: Service
      name: deployment-example
      weight: 100
    wildcardPolicy: None
EOF

# Start in separate terminal (make sure to use correct domain)
while true; do
  if ! curl --fail --silent --show-error --max-time 1 -o /dev/null "http://deployment-example-${ns}.app.example.com"; then
    echo "$(date): Request failed"
  fi
done

# Repeatedly rescale deployment
while sleep 2; do
  current=$(oc -n "$ns" get dc/deployment-example -o json | jq .spec.replicas)
  if (( current < 20 )); then
    replicas=20
  else
    replicas=10
  fi

  echo "$(date): Scale from ${current} to ${replicas} replicas"
  oc -n "$ns" scale dc/deployment-example --replicas="$replicas"
  oc -n "$ns" deploy --follow --latest dc/deployment-example

  echo "$(date): Wait for pods to become available"
  while [[ "$(oc -n "$ns" get dc/deployment-example -o json | jq .status.unavailableReplicas)" -ne 0 ]]; do
    sleep 1
  done

  echo
done
---

First, does your deployment config have liveness and readiness checks enabled?

Second, what is your router's RELOAD_INTERVAL set to? It defaults to 5 seconds. The lowest it can go is 1s (1 second).
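If you want to check or lower it, one way is via the router's environment (a sketch only; it assumes the default HAProxy router deployment config named "router" in the "default" project, which may be named differently on your cluster):

  # Assumed example: show whether RELOAD_INTERVAL is set (unset means the 5s default),
  # then lower it to 1s. Changing the dc environment triggers a router redeployment.
  oc -n default set env dc/router --list | grep RELOAD_INTERVAL
  oc -n default set env dc/router RELOAD_INTERVAL=1s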
But if you have a lot of routes, then you may need to slow down the deployment to compensate (.spec.minReadySeconds).

I was unable to reproduce this as soon as I enabled readiness checks on my hello-openshift pod.

Ben, I guess our comments overlapped. We do have readiness and liveness probes in the actual reproduction case. I did try setting RELOAD_INTERVAL=2s, as per Sudarshan's request in the support case.

Setting:

  oc annotate route hello-openshift router.openshift.io/haproxy.health.check.interval=500ms --overwrite

helped considerably, BUT you will be sending TCP health checks to pods every 1/2 second.

The real problem is that, during termination, the pod is sent the TERM signal, which it doesn't handle and thus exits immediately. You need to add a little delay between when TERM is received and when the pod shuts down, so that you can cover the slight gap between when the pod is killed and when the router notices it. See: https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods

If you can't or don't want to change the image so it handles TERM, you can install a preStop hook that sleeps for 10 seconds. You'll need to get a sleep binary into the pod, but that should not be hard.

Closed because I believe this is functioning as designed... but if adding a termination handler or a preStop hook doesn't resolve the problem, please feel free to re-open it.

*** Bug 1575761 has been marked as a duplicate of this bug. ***
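For completeness, a minimal sketch of the preStop workaround suggested above, applied to the reproduction's deployment config (assumptions: the "deployment-example" dc and the "$ns" project from the reproduction, a sleep binary available in the image, and a 10-second delay, which stays under the 30s terminationGracePeriodSeconds):

  # Assumed example: add a preStop sleep to the first container so the router has
  # time to stop sending traffic before the process exits. Requires a sleep binary
  # in the image; keep the delay below terminationGracePeriodSeconds (30s here).
  oc -n "$ns" patch dc/deployment-example --type=json -p '[
    {"op": "add",
     "path": "/spec/template/spec/containers/0/lifecycle",
     "value": {"preStop": {"exec": {"command": ["sleep", "10"]}}}}
  ]'

With the hook in place, the 10-second sleep runs before TERM is sent, which gives the router time to notice the endpoint removal and stop routing requests to the pod.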