Bug 1670727
| Summary: | port conflict between new cvo pod and the old one during upgrade | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | liujia <jiajliu> |
| Component: | Installer | Assignee: | Abhinav Dahiya <adahiya> |
| Installer sub component: | openshift-installer | QA Contact: | liujia <jiajliu> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | high | CC: | adahiya, aos-bugs, ccoleman, crawford, jokerman, mmccomas, wking |
| Version: | 4.1.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-04 10:42:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (liujia, 2019-01-30 09:23:33 UTC)
Abhinav, can you take a look at this? It might just be as simple as using SO_REUSEPORT (though that might end up being a terrible idea).

AFAIK, upgrade should be a feature in the beta3 release, so I am changing the target release back to 4.0.0. Please correct me if I'm wrong.

Since the issue was found a long time ago, I re-checked it with a recent build. The result shows that the issue should be fixed now.

Steps:

1. Set up the cluster:

```
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-04-234414   True        False         4m40s   Cluster version is 4.0.0-0.nightly-2019-03-04-234414
# oc get pod
NAME                                            READY   STATUS    RESTARTS   AGE
pod/cluster-version-operator-57678fd794-4qd2x   1/1     Running   1          50m
```

2. Edit the cv config to set upstream to https://openshift-release.svc.ci.openshift.org/graph; available updates are retrieved successfully.

3. Do the upgrade manually:

```
# oc adm upgrade --to 4.0.0-0.nightly-2019-03-05-045224
Updating to 4.0.0-0.nightly-2019-03-05-045224
```

4. The upgrade succeeded and the new cvo pod runs well:

```
# oc get po
NAME                                            READY   STATUS    RESTARTS   AGE
pod/cluster-version-operator-7cdf5d6bbb-xgvfk   1/1     Running   0          15m
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-05-045224   True        False         18m     Cluster version is 4.0.0-0.nightly-2019-03-05-045224
```

@Abhinav Dahiya Could you double-confirm that it is fixed now, and ideally give a PR link for it? If it is fixed, would you mind changing the bug status to ON_QA so QE can verify it? Thanks.

> Could you double-confirm that it is fixed now...

I expect this is a race between the old CVO pod being retired and the new CVO pod being created, in which case we'll only see it occasionally, when those pods happen to be scheduled on the same node and happen to overlap in time (or at least the kernel has not yet finished reaping the port from the outgoing pod). When that happens, it's not clear to me which set of metrics we should be serving, or whether it matters.
So I'm with Alex's comment 2 that we should just use SO_REUSEPORT, and metrics scrapers will get either one pod or the other. Looks like semi-convenient use of SO_REUSEPORT is blocked on Go 1.11 [1]?

[1]: https://github.com/golang/go/commit/3c4d3bdd3b454ef45ce00559d705fe5dc6f57cad

Or we can flip the deployment to update style recreate. I don't think we need the old pod.

Hmm, from [1]:

> To prevent port hijacking, all of the processes binding to the same address must have the same effective UID.

I dunno if that will play well with our containers. As I understand it, Kubernetes user namespacing is still in the works [2,3], and OpenShift works around that by using ephemeral random UIDs for each pod (e.g. see [4]).

[1]: http://man7.org/linux/man-pages/man7/socket.7.html
[2]: https://github.com/kubernetes/enhancements/issues/127
[3]: https://github.com/kubernetes/kubernetes/pull/64005
[4]: https://github.com/openshift/release/pull/1178#issuecomment-415213896

(In reply to Clayton Coleman from comment #8)
> Or we can flip the deployment to update style recreate. I don't think we
> need the old pod.

Sounds good to me. I've filed [1] with this.

[1]: https://github.com/openshift/cluster-version-operator/pull/140

The pull request just landed, so no further info is needed.

Version: from 4.0.0-0.nightly-2019-04-02-081046 to 4.0.0-0.nightly-2019-04-02-133735

Steps:

1. Set up the cluster:

```
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-04-02-081046   True        False         4m38s   Cluster version is 4.0.0-0.nightly-2019-04-02-081046
# oc describe deployment cluster-version-operator -n openshift-cluster-version | grep StrategyType:
StrategyType:  RollingUpdate
sh-4.2# cluster-version-operator version
ClusterVersionOperator v4.0.22-201904011459-dirty
```

Before the upgrade, the strategy was RollingUpdate.

2. Edit the cv config to set upstream to my dummy server; available updates are retrieved successfully.

3.
Do the upgrade manually:

```
# oc adm upgrade --to=4.0.0-0.nightly-2019-04-02-133735
Updating to 4.0.0-0.nightly-2019-04-02-133735
```

The upgrade succeeded and the CVO works well:

```
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-04-02-133735   True        False         20m     Cluster version is 4.0.0-0.nightly-2019-04-02-133735
# oc get po -n openshift-cluster-version
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-576749f6f4-8xq9d   1/1     Running   0          47m
# oc get deployment cluster-version-operator -o json -n openshift-cluster-version | jq ".spec.strategy"
{
  "rollingUpdate": {
    "maxSurge": "25%",
    "maxUnavailable": "25%"
  },
  "type": "RollingUpdate"
}
```

After the upgrade, the strategy was still RollingUpdate, but according to pr#140 in comment 10 the strategy should change to "Recreate". I checked the manifest file in the cvo pod: the cvo deployment manifest was in fact updated to the "Recreate" strategy.

```
sh-4.2# cat manifests/0000_00_cluster-version-operator_03_deployment.yaml | grep strategy
  strategy: Recreate
sh-4.2# cluster-version-operator version
```

There must be another config that the cvo deployment depends on, so I am assigning the bug back for a further fix.

Huh, looks like I somehow picked completely the wrong way to set the strategy in cvo#140. Hopefully fixed in [1].

[1]: https://github.com/openshift/cluster-version-operator/pull/155

cvo#155 has been merged [1].

[1]: https://github.com/openshift/cluster-version-operator/pull/155#event-2250667610

Version: from 4.0.0-0.nightly-2019-04-18-170158 to 4.0.0-0.nightly-2019-04-18-190537

Steps:

1. Set up the cluster:

```
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-04-18-170158   True        False         3h5m    Cluster version is 4.0.0-0.nightly-2019-04-18-170158
# oc describe deployment cluster-version-operator -n openshift-cluster-version | grep StrategyType:
StrategyType:  Recreate
```

Before the upgrade, the StrategyType is already updated for a fresh install.

2.
Edit the cv config to set upstream to my dummy server; available updates are retrieved successfully.

3. Do the upgrade manually:

```
# oc adm upgrade --to 4.0.0-0.nightly-2019-04-18-190537
Updating to 4.0.0-0.nightly-2019-04-18-190537
```

The upgrade succeeded and the CVO works well:

```
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-04-18-190537   True        False         4m25s   Cluster version is 4.0.0-0.nightly-2019-04-18-190537
# oc get po -n openshift-cluster-version
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-6c97779749-48w7b   1/1     Running   1          46m
# oc get deployment cluster-version-operator -o json -n openshift-cluster-version | jq ".spec.strategy"
{
  "type": "Recreate"
}
```

After the upgrade, the strategy stays Recreate. Verifying the bug as fixed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
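For reference, the fix verified above is the Deployment strategy switch from cvo#140/cvo#155. A minimal sketch of the relevant manifest fragment, using standard Kubernetes apps/v1 field names (surrounding fields such as `selector` and `template` are omitted; the exact CVO manifest may differ):

```yaml
# Fragment of a Deployment manifest using the Recreate strategy.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-version-operator
  namespace: openshift-cluster-version
spec:
  strategy:
    # Recreate terminates all existing pods before creating new ones, so the
    # old and new CVO pods can never overlap and race for the same host port.
    type: Recreate
```

The trade-off versus RollingUpdate is a brief window with no CVO pod running, which the thread judged acceptable since the old pod is not needed during the swap.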