Bug 1701392
| Summary: | [OCP4 Beta] Rolling update of router-default deployment is not possible | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Stuart Auchterlonie <sauchter> |
| Component: | Networking | Assignee: | Dan Mace <dmace> |
| Networking sub component: | router | QA Contact: | Hongan Li <hongli> |
| Status: | CLOSED ERRATA | Severity: | high |
| Priority: | urgent | CC: | aos-bugs, bbennett, florin-alexandru.peter, jokerman, mmccomas |
| Version: | 4.1.0 | Keywords: | BetaBlocker |
| Target Release: | 4.1.0 | Hardware: | Unspecified |
| OS: | Unspecified | Type: | Bug |
| Last Closed: | 2019-06-04 10:47:47 UTC | | |
Description (Stuart Auchterlonie, 2019-04-18 20:35:44 UTC)
Looks very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1689779, but that one has been fixed. Confirmed that there is no issue with the latest 4.0.0-0.nightly-2019-04-18-170158 build on AWS, but the difference is the `endpointPublishingStrategy` setting in the ingresscontroller; it might be a port conflict when using HostNetwork during a rolling update.

AWS:

```
endpointPublishingStrategy:
  type: LoadBalancerService
```

Customer:

```
endpointPublishingStrategy:
  type: HostNetwork
```

Just got a test environment on bare metal that uses `HostNetwork` and can reproduce this issue:

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-04-18-170158   True        False         52m     Cluster version is 4.0.0-0.nightly-2019-04-18-170158

$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
dell-r730-063.dsal.lab.eng.rdu2.redhat.com   Ready    master   70m   v1.13.4+d4ce02c1d
dell-r730-064.dsal.lab.eng.rdu2.redhat.com   Ready    master   70m   v1.13.4+d4ce02c1d
dell-r730-065.dsal.lab.eng.rdu2.redhat.com   Ready    master   70m   v1.13.4+d4ce02c1d
dell-r730-066.dsal.lab.eng.rdu2.redhat.com   Ready    worker   70m   v1.13.4+d4ce02c1d
dell-r730-067.dsal.lab.eng.rdu2.redhat.com   Ready    worker   70m   v1.13.4+d4ce02c1d

$ oc -n openshift-ingress get rs
NAME                        DESIRED   CURRENT   READY   AGE
router-default-69dc5c9b8c   2         2         2       59m
router-default-6d77f7444f   1         1         0       6m21s

$ oc -n openshift-ingress get pod
NAME                              READY   STATUS    RESTARTS   AGE
router-default-69dc5c9b8c-wcqqw   1/1     Running   0          59m
router-default-69dc5c9b8c-xvxh6   1/1     Running   0          59m
router-default-6d77f7444f-wndvc   0/1     Pending   0          6m32s

$ oc -n openshift-ingress describe pod router-default-6d77f7444f-wndvc
<---snip--->
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  48s (x25 over 2m54s) default-scheduler  0/5 nodes are available: 2 node(s) didn't have free ports for the requested pod ports, 3 node(s) didn't match node selector.
```

In the HostNetwork setup, the router containers use a static port on the host interface.
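The FailedScheduling event above follows directly from the host-port constraint: with HostNetwork, every router pod binds the same host ports, so a node can run at most one router pod at a time. A minimal sketch of why the surge pod has nowhere to go (the function name and counting are illustrative, not taken from the scheduler source):

```python
def free_router_nodes(worker_nodes: int, running_router_pods: int) -> int:
    # With hostNetwork, each running router pod occupies the host ports of
    # exactly one worker node, so only unoccupied workers can accept the
    # new ReplicaSet's pod.
    return max(worker_nodes - running_router_pods, 0)

# The cluster above: 2 workers, both already running an old router pod,
# so the surge pod from the new ReplicaSet stays Pending.
print(free_router_nodes(2, 2))  # 0
```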
Therefore, additional nodes are required to surge a rolling deployment. This is also the case in 3.x for a router with the same rolling update parameters; the primary difference is that the end user can control those parameters in 3.x. In 4.x, the default rolling deployment parameters are a max surge of 25% and a max unavailable of 25%. The absolute value of the proportional max unavailable percentage is rounded down using a floor function [1]. Given a worker node pool of 3, this means the max unavailable value is zero. Given the default install topology (3 workers), the host network constraint, and the rolling update parameters (which are immutable), the only way to execute the in-place rolling upgrade in this case is to add more workers.

For now, this can be a documentation issue. Going forward, we can consider things like:

1. Changing our default rolling update parameters
2. Exposing the rolling update parameters through the configuration API

[1] https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-update-deployment

I did a little more digging here and found the underlying difference from our 3.x setup:

- https://github.com/openshift/origin/blob/master/pkg/apps/util/util.go#L397
- https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/deployment/util/deployment_util.go#L880

In 3.x we set surge to 0, which triggers a fencepost condition that sets a floor of 1 for unavailability even when the spec value is proportional. We are going to consider doing the same. I'm going to keep this bug open while we evaluate our defaults.

We're going to fix this by making the deployment strategy dynamic with https://github.com/openshift/cluster-ingress-operator/pull/219.

Verified with 4.1.0-0.nightly-2019-05-04-210601 on vSphere; the issue has been fixed.
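The rounding and fencepost behavior described above can be sketched in a few lines. This is a simplified model of how Kubernetes resolves percentage rolling-update parameters (maxSurge rounds up, maxUnavailable rounds down, and if both resolve to 0 the rollout could never progress, so maxUnavailable is bumped to 1); the function name is ours, not the upstream API:

```python
import math

def resolve_rolling_update(replicas: int, max_surge: float,
                           max_unavailable: float) -> tuple[int, int]:
    # maxSurge rounds up, maxUnavailable rounds down (Kubernetes convention).
    surge = math.ceil(replicas * max_surge)
    unavailable = math.floor(replicas * max_unavailable)
    # Fencepost condition: if both resolve to 0 the rollout can never make
    # progress, so maxUnavailable is forced to at least 1.
    if surge == 0 and unavailable == 0:
        unavailable = 1
    return surge, unavailable

# Default 4.x parameters with 2 router replicas: surge=1, unavailable=0,
# so the rollout must schedule an extra pod first, which is impossible
# with HostNetwork when every worker already runs a router.
print(resolve_rolling_update(2, 0.25, 0.25))  # (1, 0)

# The 3.x-style fix (surge forced to 0) trips the fencepost, so one old
# pod can be taken down before the replacement is scheduled.
print(resolve_rolling_update(2, 0, 0.25))     # (0, 1)
```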
```
$ oc get deployment/router-default -n openshift-ingress -o yaml
<---snip--->
spec:
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 25%
    type: RollingUpdate
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758