Description of problem:
Upgrading with "enable auto update" and upgrading manually both hit this issue. Taking a manual upgrade as an example: the upgrade cannot finish and keeps trying to deploy the new-version CVO pod. The new CVO pod keeps restarting because port 9099 is still in use by the existing old-version CVO pod, so the new pod never deploys successfully.

# oc get pod
NAME                                                      READY   STATUS             RESTARTS   AGE
pod/cluster-version-operator-5cb4685c64-5cxxf              1/1     Running            0          2h
pod/cluster-version-operator-fbf467876-vrvhh               0/1     CrashLoopBackOff   18         1h
pod/version-4.0.0-0.nightly-2019-01-25-201056-mbbr6-hpllg  0/1     Completed          0          1h

# oc logs pod/cluster-version-operator-fbf467876-vrvhh
I0129 08:54:04.967210       1 start.go:23] ClusterVersionOperator v4.0.0-0.147.0.0-dirty
I0129 08:54:04.967428       1 merged_client_builder.go:122] Using in-cluster configuration
I0129 08:54:04.969535       1 updatepayload.go:63] Loading updatepayload from "/"
I0129 08:54:04.971565       1 leaderelection.go:185] attempting to acquire leader lease  openshift-cluster-version/version...
F0129 08:54:04.971868       1 start.go:144] Unable to start metrics server: listen tcp 0.0.0.0:9099: bind: address already in use

Version-Release number of the following components:
# oc version
oc v4.0.0-0.149.0
kubernetes v1.12.4+50c2f2340a
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://jliu-demo-api.qe.devcluster.openshift.com:6443
kubernetes v1.11.0+dde478551e

# cluster-version-operator version
ClusterVersionOperator v4.0.0-0.147.0.0-dirty
release payload version: 4.0.0-0.nightly-2019-01-25-200832

How reproducible:
Always

Steps to Reproduce:
1. Install an OCP cluster with the v4.0.0-0.147.0.0-dirty openshift-install based on registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-01-25-200832
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-01-25-200832   True        False         1m      Cluster version is 4.0.0-0.nightly-2019-01-25-200832

2. Edit the ClusterVersion config to point the upstream address at https://openshift-release.svc.ci.openshift.org/graph
Check that the CVO has received the update graph from the server and written it into the ClusterVersion status:
# oc get clusterversion -o json|jq ".items[0].status.availableUpdates"
[
  {
    "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-01-25-201056",
    "version": "4.0.0-0.nightly-2019-01-25-201056"
  },
  {
    "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-01-25-205123",
    "version": "4.0.0-0.nightly-2019-01-25-205123"
  }
]

3. Upgrade manually:
# oc adm upgrade --to=4.0.0-0.nightly-2019-01-25-201056
Updating to 4.0.0-0.nightly-2019-01-25-201056

The deployment has been updated and the upgrade job finishes successfully, but the new pod fails to deploy:
{
  "completionTime": null,
  "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-01-25-201056",
  "startedTime": "2019-01-29T07:51:34Z",
  "state": "Partial",
  "version": "4.0.0-0.nightly-2019-01-25-201056"
},

Actual results:
The upgrade cannot finish; the new CVO pod keeps restarting due to the port conflict.

Expected results:
The upgrade process should handle the hand-off from the old pod to the new pod during the deploy.

Additional info:
Scaling the deployment down and back up works around the issue:
# oc scale --replicas=0 deployment cluster-version-operator
deployment.extensions/cluster-version-operator scaled
# oc scale --replicas=1 deployment cluster-version-operator
deployment.extensions/cluster-version-operator scaled
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-01-25-201056   True        False         7m      Cluster version is 4.0.0-0.nightly-2019-01-25-201056
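For context, the fatal line in the new pod's log corresponds to the metrics listener failing to bind. A minimal sketch of the same failure mode, assuming a plain TCP listener (the actual CVO code may differ):

package main

import (
	"log"
	"net"
	"net/http"
)

func main() {
	// While the old CVO pod still holds 9099 on the same host, this bind
	// fails with "listen tcp 0.0.0.0:9099: bind: address already in use",
	// which is the fatal error seen in the new pod's log above.
	ln, err := net.Listen("tcp", "0.0.0.0:9099")
	if err != nil {
		log.Fatalf("Unable to start metrics server: %v", err)
	}
	log.Fatal(http.Serve(ln, nil)) // serve metrics on the listener
}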
Abhinav, can you take a look at this? It might be as simple as using SO_REUSEPORT (though that might end up being a terrible idea).
AFAIK, upgrade is supposed to be a feature of the beta3 release, so I'm changing the target release back to 4.0.0. Please correct me if that's not right.
Since this issue was found a long time ago, I re-checked it with a recent build. The result suggests the issue is fixed now.

Steps:
1. Set up a cluster
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-04-234414   True        False         4m40s   Cluster version is 4.0.0-0.nightly-2019-03-04-234414
# oc get pod
NAME                                            READY   STATUS    RESTARTS   AGE
pod/cluster-version-operator-57678fd794-4qd2x   1/1     Running   1          50m

2. Edit the ClusterVersion config to set upstream to https://openshift-release.svc.ci.openshift.org/graph; available updates are retrieved successfully.

3. Upgrade manually:
# oc adm upgrade --to 4.0.0-0.nightly-2019-03-05-045224
Updating to 4.0.0-0.nightly-2019-03-05-045224

4. The upgrade succeeds and the new CVO pod runs well.
# oc get po
NAME                                            READY   STATUS    RESTARTS   AGE
pod/cluster-version-operator-7cdf5d6bbb-xgvfk   1/1     Running   0          15m
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-05-045224   True        False         18m     Cluster version is 4.0.0-0.nightly-2019-03-05-045224

@Abhinav Dahiya Could you help double-confirm whether it is fixed now, and ideally give a PR link for the fix? If it is fixed, would you mind moving the bug to ON_QA so QE can verify it? Thanks.
> Could you help double-confirm whether it is fixed now...

I expect this is a race between the old CVO pod being retired and the new CVO pod being created, in which case we'll only see it occasionally, when those pods happen to be scheduled on the same node and happen to overlap in time (or at least the kernel has not yet finished reaping the port from the outgoing pod). When that happens, it's not clear to me which set of metrics we should be serving, or whether it matters. So I'm with Alex's comment 2 that we should just use SO_REUSEPORT and metrics scrapers will get either one or the other.
Looks like semi-convenient use of SO_REUSEPORT is blocked on Go 1.11 [1]? [1]: https://github.com/golang/go/commit/3c4d3bdd3b454ef45ce00559d705fe5dc6f57cad
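For reference, a minimal sketch of what that would look like with Go 1.11's net.ListenConfig, assuming a plain HTTP metrics endpoint (illustrative only, not the CVO's actual metrics code):

package main

import (
	"context"
	"log"
	"net"
	"net/http"
	"syscall"

	"golang.org/x/sys/unix"
)

func listenReusePort(addr string) (net.Listener, error) {
	lc := net.ListenConfig{
		// Control runs on the raw socket before bind, which is where
		// SO_REUSEPORT has to be set.
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
	return lc.Listen(context.Background(), "tcp", addr)
}

func main() {
	// Both the old and new CVO pods could bind 9099 this way; the kernel
	// would then distribute incoming scrapes between them.
	ln, err := listenReusePort("0.0.0.0:9099")
	if err != nil {
		log.Fatalf("Unable to start metrics server: %v", err)
	}
	log.Fatal(http.Serve(ln, nil))
}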
Or we can flip the deployment to the Recreate update strategy. I don't think we need the old pod.
Hmm, from [1]: > To prevent port hijacking, all of the processes binding to the same address must have the same effective UID. I dunno if that will play well with our containers. As I understand it, Kubernetes user namespacing is still in the works [2,3], and OpenShift works around that by using ephemeral random UIDs for each pod (e.g. see [4]). [1]: http://man7.org/linux/man-pages/man7/socket.7.html [2]: kubernetes/enhancements/issues/127 [3]: https://github.com/kubernetes/kubernetes/pull/64005 [4]: https://github.com/openshift/release/pull/1178#issuecomment-415213896
(In reply to Clayton Coleman from comment #8) > Or we can flip the deployment to update style recreate. I don't think we > need the old pod. Sounds good to me. I've filed [1] with this. [1]: https://github.com/openshift/cluster-version-operator/pull/140
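For anyone following along, a rough sketch of what "Recreate" means at the apps/v1 API level (the actual change in the PR may differ; names below are only illustrative):

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// With Recreate, the deployment controller scales the old ReplicaSet to
	// zero before creating the new pod, so the old and new CVO never hold
	// port 9099 at the same time.
	strategy := appsv1.DeploymentStrategy{
		Type: appsv1.RecreateDeploymentStrategyType,
	}
	out, err := yaml.Marshal(strategy)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out)) // prints "type: Recreate", i.e. what the manifest should carry
}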
The pull request just landed.
We don't need any info.
Version: from 4.0.0-0.nightly-2019-04-02-081046 to 4.0.0-0.nightly-2019-04-02-133735

Steps:
1. Set up a cluster
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-04-02-081046   True        False         4m38s   Cluster version is 4.0.0-0.nightly-2019-04-02-081046
# oc describe deployment cluster-version-operator -n openshift-cluster-version|grep StrategyType:
StrategyType: RollingUpdate    //before the upgrade, the strategy is RollingUpdate
sh-4.2# cluster-version-operator version
ClusterVersionOperator v4.0.22-201904011459-dirty

2. Edit the ClusterVersion config to set upstream to my dummy server; available updates are retrieved successfully.

3. Upgrade manually:
# oc adm upgrade --to=4.0.0-0.nightly-2019-04-02-133735
Updating to 4.0.0-0.nightly-2019-04-02-133735

The upgrade succeeds and the CVO works well.
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-04-02-133735   True        False         20m     Cluster version is 4.0.0-0.nightly-2019-04-02-133735
# oc get po -n openshift-cluster-version
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-576749f6f4-8xq9d   1/1     Running   0          47m
# oc get deployment cluster-version-operator -o json -n openshift-cluster-version|jq ".spec.strategy"
{
  "rollingUpdate": {
    "maxSurge": "25%",
    "maxUnavailable": "25%"
  },
  "type": "RollingUpdate"
}
//after the upgrade, the strategy is still RollingUpdate, but according to PR #140 in comment 10 it should have changed to "Recreate".

Checking the manifest file in the CVO pod, the CVO deployment manifest was in fact updated to the "Recreate" strategy:
sh-4.2# cat manifests/0000_00_cluster-version-operator_03_deployment.yaml |grep strategy
  strategy: Recreate
sh-4.2# cluster-version-operator version

There must be some other config the CVO deployment depends on, so I'm assigning the bug back for a further fix.
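A hedged sketch (a hypothetical check, not part of the CVO) of reading back the live deployment's strategy with client-go, for anyone scripting this comparison; in the failing case it prints "RollingUpdate" even though the manifest shipped in the new payload says "Recreate":

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig and fetch the live CVO deployment.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	d, err := client.AppsV1().Deployments("openshift-cluster-version").
		Get(context.TODO(), "cluster-version-operator", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println(d.Spec.Strategy.Type)
}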
Huh, looks like I somehow picked completely the wrong way to set the strategy in cvo#140. Hopefully fixed in [1]. [1]: https://github.com/openshift/cluster-version-operator/pull/155
cvo#155 has been merged [1]. [1]: https://github.com/openshift/cluster-version-operator/pull/155#event-2250667610
Version: from 4.0.0-0.nightly-2019-04-18-170158 to 4.0.0-0.nightly-2019-04-18-190537

Steps:
1. Set up a cluster
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-04-18-170158   True        False         3h5m    Cluster version is 4.0.0-0.nightly-2019-04-18-170158
# oc describe deployment cluster-version-operator -n openshift-cluster-version|grep StrategyType:
StrategyType: Recreate    //before the upgrade, the strategy type is already updated on a fresh install

2. Edit the ClusterVersion config to set upstream to my dummy server; available updates are retrieved successfully.

3. Upgrade manually:
# oc adm upgrade --to 4.0.0-0.nightly-2019-04-18-190537
Updating to 4.0.0-0.nightly-2019-04-18-190537

The upgrade succeeds and the CVO works well.
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-04-18-190537   True        False         4m25s   Cluster version is 4.0.0-0.nightly-2019-04-18-190537
# oc get po -n openshift-cluster-version
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-6c97779749-48w7b   1/1     Running   1          46m
# oc get deployment cluster-version-operator -o json -n openshift-cluster-version|jq ".spec.strategy"
{
  "type": "Recreate"    //after the upgrade, the strategy stays Recreate
}

Verifying the bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758