Description of problem:
We are having trouble with eclipse-che deployments on us-east-2 after the 3.6 upgrade, specifically pods being scaled down to 0 immediately after deployment. We are unsure of the direct cause: whether it is something in Che that 3.6 treats with more scrutiny, or something more directly related to OpenShift. Clayton, Justin and Matt were on a call with a few OSIO people and have more background on this issue. Requesting clayton and amisevsk to supplement with the deployment config and other relevant details.

Version-Release number of selected component (if applicable):
OpenShift Master: v3.6.173.0.5 (online version 3.5.0.20)
Kubernetes Master: v1.6.1+5115d708d7

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Recorded the changes when one of these was created:

1. CREATE - initial create
2. UPDATE - no-op update (no idea who, no status data)
3. UPDATE - scaled from 1 -> 0 (very rapidly, probably a controller of some form)
4. UPDATE - deployment controller updates status with the result from #2

Afterwards I observe that the RS is set to scale 0. We also validated that a pod is briefly created (from step 2) before being removed (by the controller).

There are two primary problems:

1. Something is scaling to 0 immediately - is this part of OSIO, or is it the deployment controller gone rogue?
2. The deployment controller never updates the status of the deployment to reflect that it is scaled to zero (which seems like a bug).
Additional data we need:

1. When creating the same deployment in a completely different context, do we still observe the same scale to zero? (Basically, identify whether this is a cluster problem, a deployment controller problem, or a rogue user-provided controller problem.)

Investigations:

1. Why do we get two events - create, then no-op? Is this the controller checking in but not setting status? Another controller? The client? (One way to capture this is sketched below.)
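A minimal sketch of capturing that sequence, assuming a placeholder project (amisevsk-che) and a placeholder workspace deployment name; substitute the real names:

```
NS=amisevsk-che        # placeholder project
DEPLOY=che-ws-example  # placeholder workspace deployment name

# Stream the full Deployment object on every change so each write (CREATE,
# no-op UPDATE, scale to 0, status update) can be diffed afterwards.
oc get deployment "$DEPLOY" -n "$NS" -o yaml --watch > "deployment-watch-$DEPLOY.yaml" &

# Watch ReplicaSets and pods in the project at the same time to correlate
# when the pod briefly appears and is removed.
oc get rs,pods -n "$NS" --watch > "rs-pod-watch-$NS.log" &
wait
```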
Created attachment 1313835 [details] Sequence of writes observed via watch to the deployment
As a data point, the same deployment run in the same way was tried on OCP 3.5 earlier today and had no problems (the replica count scales to 1 and brings up the service). Looking at the replica set and the deployment for it, both specify replicas: 1 at time of creation and remain at 1 through the life of the app.
Also, we need a user to investigate: after the system has settled, if the Deployment is scaled back to 1, what is the outcome on both the deployment status and the underlying RS?
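A minimal sketch of that check (deployment and project names are placeholders):

```
NS=amisevsk-che        # placeholder project
DEPLOY=che-ws-example  # placeholder deployment name

# Scale the settled Deployment back up to 1...
oc scale deployment "$DEPLOY" -n "$NS" --replicas=1

# ...then compare what the Deployment's status reports with what the RS does.
oc get deployment "$DEPLOY" -n "$NS" -o yaml | sed -n '/^status:/,$p'
oc get rs -n "$NS"
```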
Adding what we've found from the Che side:

- Eclipse Che uses the fabric8 kubernetes-client to create OpenShift resources.
- During workspace creation, Che creates routes, a service, and a deployment for the workspace. All resources are created successfully.
- Immediately after the deployment is created and its pod starts, for some (as yet unclear) reason, kubernetes-client attempts to scale this deployment down to 0 replicas.
- The attempt to scale the deployment to 0 fails, and kubernetes-client waits 10 minutes before deleting the deployment and cleaning up after itself.
- Looking at the deployment that is created by Che in the OpenShift console, we can see that it starts off at 1 replica but is shortly thereafter modified to be zero replicas.
- If the deployment is exported and created in OpenShift manually (with one replica), it starts normally and stays running -- it is not scaled down.
- As far as I can tell, the scale-down call did not occur prior to OS 3.6.

As a result, it seems like the scaling down of workspace deployments is being caused by Che. However, a second issue seems to be occurring: the number of active replicas reported by the deployment does not properly get set to 0 when the deployment is scaled down. I think this is the issue that is causing Che to fail to launch workspaces. kubernetes-client attempts to scale deployments down to zero, and this succeeds, but if you `oc describe` the deployment, the number of replicas is listed as 1/0.

- Once the ten-minute timeout elapses, the error

  2017-08-15 17:41:00,613[aceSharedPool-0] [ERROR] [c.d.i.DeploymentOperationsImpl 149] - 1/0 pod(s) ready for Deployment: che-ws-slygsb8hunnr00md in namespace: amisevsk-che after waiting for 600 seconds so giving up

  appears in the Che logs (note that the class logging the error is part of kubernetes-client, not Che).
- After the deployment is scaled down and its pod terminated, checking the deployment with oc yields [1]. Note that the deployment lists 1 replica available despite the replica set showing 0.
- This can be reproduced manually -- attempting to scale a deployment down through the OpenShift console will yield 1/0 replicas on the deployment, and 0/0 on the replica set. [2]

I think what's happening is that kubernetes-client is waiting on the deployment to show that it has successfully scaled down. This never happens, even though the pod is shut down and the replica set is scaled to zero. As a result kubernetes-client hangs, and this causes Che to stop working.

[1] - https://gist.github.com/amisevsk/a55ceb933fb59bf716066a86fba05308
[2] - https://i.snag.gy/thkVuz.jpg
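For reference, the manual reproduction can also be done from the CLI rather than the console; a rough sketch, with placeholder names:

```
NS=amisevsk-che        # placeholder project
DEPLOY=che-ws-example  # placeholder deployment name

# Scale down from the CLI instead of the console.
oc scale deployment "$DEPLOY" -n "$NS" --replicas=0

# Give the pod time to terminate, then compare: with this bug the Deployment
# keeps reporting 1 available replica while the ReplicaSet shows 0/0.
sleep 30
oc describe deployment "$DEPLOY" -n "$NS" | grep -i replicas
oc get rs -n "$NS"
```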
I've managed to reproduce the issue running locally on minishift with OpenShift version:

openshift v3.6.0+c4dd4cf
kubernetes v1.6.1+5115d708d7
Correction to my earlier comment:

- The issue of Che scaling down deployments is reproduced.
- The issue of deployments staying at 1/0 is *not*.

So Che works half-okay in this case.
I have tried to create a Deployment in v3.6.173.0.5-1 and scale it to 0, and availability went to 0 as well. There can be several reasons why it is showing 1/0, most notably that the "replicas" count is the desired state, not the state the cluster is in - meaning that state hasn't been reconciled yet, e.g. there is a terminating pod. Please provide the output of `oc get deployments -o yaml`, `oc get rs -o yaml` and `oc get po -o yaml` if that still seems like an issue. But looking at [1] I can see that DESIRED is 0 while there is 1 pod under CURRENT, which is the reason why it's still shown as available. (You forgot to list pods there.)

On the main question of replicas being scaled to 0 when the deployment is created/manipulated by your fabric8-client: I would guess that if you can reproduce it on a Deployment without that client/controller involved, it's a bug in upstream Deployments; if not, the chances are that your app/fabric8-client has the bug and not OpenShift. Reading that it works fine when you export it and create it without that client/controller involved would indicate the latter. I am not really sure what the fabric8-client/controller does, but it is possible that the upgrade from 3.5 -> 3.6 just uncovered some bug there, considering that it works fine when you create it manually.
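To make the desired-vs-reconciled distinction concrete, a quick sketch for comparing the two (deployment and project names are placeholders), plus the dumps requested above:

```
NS=amisevsk-che        # placeholder project
DEPLOY=che-ws-example  # placeholder deployment name

# Desired state (spec) vs. what the controller has reconciled (status):
oc get deployment "$DEPLOY" -n "$NS" -o jsonpath='{.spec.replicas}{"\n"}'
oc get deployment "$DEPLOY" -n "$NS" -o yaml | sed -n '/^status:/,$p'

# Full dumps requested above, for attaching to the bug:
oc get deployments -n "$NS" -o yaml > deployments.yaml
oc get rs          -n "$NS" -o yaml > rs.yaml
oc get po          -n "$NS" -o yaml > pods.yaml
```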
I was never seeing the state reconciled, even with zero pods - i.e. the scaled-to-zero deployment never saw a status update to reflect zero pods.
Created attachment 1314528 [details]
yml files and script to set up an example with the issue

Reproduction steps:

1. Extract the files.
2. Run setup.sh to create the imagestream, imagestreamtags, and deployments; wait until the deployments are ready.
3. Execute `oc get all` and compare against the file oc-get-all-before-scale.
4. Scale ruby-istag and ruby-remote to 0 replicas.
5. Execute `oc get all` and compare against the file oc-get-all-after-scale.
Tomáš, I was unclear in my earlier explanation; I apologize. The originally reported issue of scaling down deployments immediately was indeed a Che issue (related to a change in OS 3.6), and we have resolved it. However, there is still an underlying issue on OSO that caused a relatively simple Che issue to be very difficult to diagnose -- scaling down a deployment should not be an issue that causes Che to hang.

The scaling issue seems to only manifest when the Deployment is defined from an ImageStreamTag. If the same deployment is defined with a normal docker image, the issue does not occur. I've created an attachment with a simple example to demonstrate. The two deployments are identical, except that one uses an ImageStreamTag, while the other uses the image that the ImageStreamTag is defined from.

The output I get from step 3 above is:

is/ruby   172.30.98.11:5000/amisevsk-che/ruby   latest,2.3   3 minutes ago

NAME                 DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/ruby-istag    1         0         0            0           3m
deploy/ruby-remote   1         1         1            1           3m

NAME                        DESIRED   CURRENT   READY   AGE
rs/ruby-istag-2604530443    1         1         1       3m
rs/ruby-remote-1464138083   1         1         1       3m

NAME                              READY   STATUS    RESTARTS   AGE
po/ruby-istag-2604530443-qz5x8    1/1     Running   0          3m
po/ruby-remote-1464138083-lpp6k   1/1     Running   0          3m

and the output after step 5 is:

NAME      DOCKER REPO                           TAGS         UPDATED
is/ruby   172.30.98.11:5000/amisevsk-che/ruby   latest,2.3   5 minutes ago

NAME                 DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/ruby-istag    0         1         0            1           5m
deploy/ruby-remote   0         0         0            0           5m

NAME                        DESIRED   CURRENT   READY   AGE
rs/ruby-istag-2604530443    0         0         0       5m
rs/ruby-remote-1464138083   0         0         0       5m

It seems that the CURRENT and AVAILABLE fields are a step behind where they should be. Also, I will link to [1], since I do not believe in coincidences. Let me know if any more info is needed.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1482337
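For readers without access to the attachment, a rough sketch of the kind of pair being compared is below. The real manifests are in attachment 1314528; the source image used for ruby-remote and the labels are placeholders. The only intended difference is the image reference: a local ImageStreamTag name versus the image the tag points at.

```
# Sketch only -- see attachment 1314528 for the actual files.
oc create -f - <<'EOF'
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: ruby-istag
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: ruby-istag
    spec:
      containers:
      - name: ruby
        # Local ImageStreamTag name, resolved by the image policy plugin.
        image: ruby:2.3
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: ruby-remote
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: ruby-remote
    spec:
      containers:
      - name: ruby
        # Placeholder: the image the ImageStreamTag was created from.
        image: registry.example.com/ruby:2.3
EOF
```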
Thanks, I was able to reproduce the issue and I believe that both issues are connected; more precisely, both are caused by some bug in the upstream deployment controller that prevents it from updating the Deployment status. Just one status update succeeds - the one done when scaling. That's why it gets stuck at 1 available replica although there is no pod.

There is actually an error in the master logs indicating the deployment has a collision:

E0817 14:52:16.907932   14594 deployment_controller.go:485] replicasets.extensions "ruby-istag-89978933" already exists

I have yet to find out why it is trying to deploy from the same hash, which it shouldn't. I know there were some collisions in upstream Deployments that should be fixed by the collision avoidance algorithm. Looking into it.
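For anyone else chasing this, a sketch of how to look for the same evidence; where the master logs land depends on the install, and the journald unit name below is an assumption:

```
# Look for the ReplicaSet name collision reported by the deployment controller.
# Unit name is an assumption -- containerized or HA masters log elsewhere.
journalctl -u atomic-openshift-master --since "1 hour ago" | grep 'already exists'

# List the ReplicaSets the Deployment owns; with this bug there is only one,
# and the controller keeps trying to recreate it under the same hashed name.
oc get rs -n amisevsk-che -o wide
```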
There is a bug in the platform causing this weird behaviour. Essentially, when you reference an ImageStream by tag in an upstream Deployment, that will create a ReplicaSet with a hash of template.spec in its name. But some rogue controller in OpenShift (we suspect the new generic image trigger, although that's yet to be confirmed) will rewrite the ReplicaSet's image to a SHA. Since all our tools first resolve a tag into a SHA, this doesn't manifest with them. But the Deployment controller compares its template to the ReplicaSet's to find out whether it has the latest version or should deploy, and because the tag is replaced by a SHA only in the RS, those will ALWAYS differ, so it tries to create a new ReplicaSet. Because it's the same ReplicaSet it was trying to create initially, it has the same hash and the names collide. In newer versions like master we have collision avoidance, and there this will manifest as the Deployment creating new RSes over and over again :/

I have confirmed it worked with 3.5, although you shouldn't use a tag like that - see A). This is certainly a bug we will have to fix, but to get you unblocked:

A) Use a SHA instead of a tag, which you should be doing anyway to maintain immutability. There is an API endpoint you can call. (If you scale that Deployment and the tag gets rewritten in the meantime, you will have pods using different images.)

B) Use DeploymentConfigs, which have native support for ImageStreams.

C) Use the new generic image trigger that allows you to have triggers for upstream objects. I am afraid we don't have documentation for it, at least I couldn't find it - Clayton, is there something? Set an annotation like this and you should be good to go:

```
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    image.alpha.openshift.io/triggers: '[{"from":{"kind":"ImageStreamTag","name":"ruby:latest"},"fieldPath":"spec.template.spec.containers[0].image"}]'
```

C) seems like the easiest change for you, I guess.
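Two small sketches related to the above. The first shows the tag-vs-SHA mismatch between the Deployment and its ReplicaSet; the second is one way to resolve an ImageStreamTag to an immutable SHA reference for option A using `oc` rather than calling the API endpoint directly (verify the field path against `oc get istag ruby:latest -o yaml` on your cluster):

```
NS=amisevsk-che   # placeholder project

# 1) Observe the mismatch: the Deployment template still carries the tag,
#    while the ReplicaSet's template has been rewritten to a SHA.
#    (Substitute the actual RS name from `oc get rs`; the hash suffix varies.)
oc get deploy ruby-istag -n "$NS" \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
oc get rs ruby-istag-2604530443 -n "$NS" \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

# 2) Option A: resolve the tag to a SHA reference and put that in the
#    Deployment instead of the tag.
oc get istag ruby:latest -n "$NS" \
  -o jsonpath='{.image.dockerImageReference}{"\n"}'
```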
The image policy plugin changed in 3.6 to consider more resources than previously (where it only considered pods), to support the "local name" feature that allows deployments to reference images by name. This inadvertently added replica sets to the set of resources that can have images rewritten (without explicit action by the user).

The fix for 3.6.1 is going to be to require an admin to explicitly request rewrites via the config for certain types, and to exclude resources like RS, RC, and Jobs from getting this behavior automatically.

An admin in 3.6.0 can disable this behavior in their master-config.yaml, in the ImagePolicyConfig object:

  executionRules:
  ...
  resolutionRules:
  - targetResource:
      resource: pods
    localNames: true
    type: Attempt

(add a new top-level resolutionRules with only pods).
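For completeness, a sketch of applying that on a 3.6.0 master; the config path and service name below are the usual defaults for an RPM-based install and are assumptions for other install methods:

```
# Assumed defaults for an RPM/ansible install of OCP 3.6; adjust for your
# environment (containerized or HA masters use different service names).
cp /etc/origin/master/master-config.yaml /etc/origin/master/master-config.yaml.bak

# Add the resolutionRules block shown above (pods only, localNames: true) to
# the ImagePolicyConfig section, then restart the master so the admission
# plugin picks up the change.
vi /etc/origin/master/master-config.yaml
systemctl restart atomic-openshift-master
```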
3.6.x fix https://github.com/openshift/origin/pull/15867
master fix https://github.com/openshift/origin/pull/15868
*** Bug 1482337 has been marked as a duplicate of this bug. ***
Fixes are merged in master and the 3.6 branch now.
Will verify it when the env is updated.
Unsure if I am seeing this exact behavior, but I figured I'd post and ask since the problem seems similar enough. I currently have the pv-recycler constantly creating itself, failing, being deleted - rinse and repeat - and it keeps failing with:

Failed to pull image "openshift3/ose-recycler:v3.6.173.0.21": rpc error: code = 2 desc = unauthorized: authentication required

I just upgraded us to v3.6.173.0.21.
Verified with the latest version; the issue has been fixed:

openshift v3.6.173.0.37
kubernetes v1.6.1+5115d708d7

Before scale:

[root@zhouy testjson]# oc get all
NAME                DOCKER REPO                    TAGS         UPDATED
imagestreams/ruby   172.30.98.11:5000/zhouy/ruby   latest,2.3   7 minutes ago

NAME                              READY   STATUS    RESTARTS   AGE
po/ruby-istag-3728341130-t7n8x    1/1     Running   0          6m
po/ruby-remote-1464138083-dbpkw   1/1     Running   0          4m

NAME                 DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/ruby-istag    1         1         1            1           6m
deploy/ruby-remote   1         1         1            1           4m

NAME                        DESIRED   CURRENT   READY   AGE
rs/ruby-istag-3728341130    1         1         1       6m
rs/ruby-remote-1464138083   1         1         1       4m

After scale:

[root@zhouy testjson]# oc get all
NAME                DOCKER REPO                    TAGS         UPDATED
imagestreams/ruby   172.30.98.11:5000/zhouy/ruby   2.3,latest   8 seconds ago

NAME                              READY   STATUS        RESTARTS   AGE
po/ruby-remote-1464138083-g5fwm   1/1     Terminating   0          6s

NAME                 DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/ruby-istag    0         0         0            0           7s
deploy/ruby-remote   0         0         0            0           6s

NAME                        DESIRED   CURRENT   READY   AGE
rs/ruby-istag-3728341130    0         0         0       7s
rs/ruby-remote-1464138083   0         0         0       6s
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days