Bug 1370056

Summary: [preview] Deployments fail because replicationcontrollers request too few resources
Product: OpenShift Online
Component: Deployments
Version: 3.x
Status: CLOSED WONTFIX
Severity: medium
Priority: unspecified
Reporter: Pieter Nagel <pieter>
Assignee: Abhishek Gupta <abhgupta>
QA Contact: zhou ying <yinzhou>
CC: abhgupta, aos-bugs, pieter, pweil, xtian
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-06-15 07:09:39 UTC
Type: Bug
Attachments:
- oc describe output for replication controller that failed to create
- Output of `oc get all -o yaml`
- tau-web-dev.common.yaml
- tau-web-dev.client.yaml
- Output requested in comment 10

Description Pieter Nagel 2016-08-25 08:09:26 UTC
Created attachment 1193894 [details]
oc describe output for replication controller that failed to create

Description of problem:

When deploying a deploymentconfig that only specifies memory limits for the pod containers, and not for the replication controller (I don't know if it is even possible to do so), the deployment fails because the replication controller fails to be created.

Events tab shows messages like "Error creating: pods "tau-web-dev-lautus-1-" is forbidden: [Minimum cpu usage per Container is 29m, but request is 7m., Minimum memory usage per Container is 150Mi, but request is 38Mi.]"
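For reference, here is a minimal sketch of the kind of container resources block described above, where only a memory limit is set and no requests are specified. This is an assumption about the config's shape for illustration, not the actual attached YAML; the image reference and limit value are placeholders:

    containers:
    - name: tau-web          # container name mentioned later in this report
      image: tau-web         # placeholder image reference
      resources:
        limits:
          memory: 512Mi      # placeholder value; only a limit, no requests block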

How reproducible:

Seems somewhat random. I had the same bug yesterday on a different deploymentconfig, but then it disappeared and was replaced by the bug I reported as bug 1370032.


Steps to Reproduce:
1. Log in to project tau-dev of github user pjnagel.
2. Start a deployment of dc/tau-web-dev-lautus

Actual results:
The deployment eventually fails after 10m. In the overview on the web console, the deployment never gets to the point where it allocates more than 0 pods.

Expected results:
The deployment should at least proceed to the point where not only a replication controller is created, but also a pod that hosts the containers and at least pulls the image.

Comment 1 Michal Fojtik 2016-08-25 10:37:40 UTC
I think the replication controller was created (you can check with `oc get rc`), but it failed to create pods because of quota. The quota is set for the pod, per container, so the pod won't be created if the requirements defined in the pod are out of quota.
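For context, the error message above suggests the project has a LimitRange with per-container minimums. A rough sketch of what such an object could look like follows; the minimum values are taken from the error message, while the object name and everything else is an assumption:

    apiVersion: v1
    kind: LimitRange
    metadata:
      name: resource-limits    # hypothetical name
    spec:
      limits:
      - type: Container
        min:
          cpu: 29m             # "Minimum cpu usage per Container is 29m"
          memory: 150Mi        # "Minimum memory usage per Container is 150Mi"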

Comment 2 Michal Fojtik 2016-08-25 10:38:35 UTC
Pieter, can you try to list the RCs when this happens? Also, the output of `oc get all --all -o yaml` or a dump of the DC and RC would help.

Comment 3 Pieter Nagel 2016-08-25 11:26:22 UTC
Created attachment 1193988 [details]
Output of `oc get all -o yaml`

Comment 4 Pieter Nagel 2016-08-25 11:26:58 UTC
Before I give you the requested info, just some background as to what state things are in now:

I emptied out my project by deleting every imagestream, service, route, deploymentconfig, rc, buildconfig, and pod until `oc get all` showed nothing. Then I recreated everything from scratch.

I then clicked Build on the tau-web-dev-gfa buildconfig; this yielded an RC that suffered the same problem. The Events tab listed messages like "Error creating: pods "tau-web-dev-gfa-1-" is forbidden: [Minimum memory usage per Container is 150Mi, but request is 38Mi., Minimum cpu usage per Container is 29m, but request is 7m.]"

Then while that second deployment was running, I ran `oc get rc` (I assume that's what you meant by 'list the RC'):

NAME                DESIRED   CURRENT   AGE
tau-web-dev-gfa-1   0         0         1h
tau-web-dev-gfa-2   1         0         56s

I ran it a few times but never saw anything different.

I attached the output of `oc get all -o yaml` (--all is an invalid flag on my version of oc, v3.2.0.44).

Comment 5 Pieter Nagel 2016-08-29 08:49:13 UTC
I think I just figured out how to replicate this bug.

In short, adding a memcached container to the deployment config seems to 'poison' the OpenShift project; thereafter, replication controllers exhibit this bug. The weird thing is that the bug persists even after removing the memcached container from the deployment config. Deleting the entire project and recreating it from the pre-memcached definitions clears the bug.

Here are the detailed instructions for replicating the bug:

Replicating the working baseline:

- Delete the tau-dev project, if it exists.
- Create the tau-dev project.
- Run `oc process -f tau-web-dev.common.yaml | oc create -f -`
- Ensure the tau-web-dev build succeeds.
- Run `oc create -f tau-web-dev.client.yaml`
- Use 'add to project' to instantiate template tau-web-dev-client-template, choose 'foo' for the client.
- Run the deployment of tau-web-dev-foo.
*** The deployment will succeed, in the sense that it ends with a running pod. 

Replicating the breakage:

- Edit tau-web-dev.client.yaml to add the following container before the tau-web container:
       - image: memcached
         name: memcached
         resources:
           limits:
             memory: 64Mi
- Run `oc apply -f tau-web-dev.client.yaml`
- Use 'add to project' to instantiate template tau-web-dev-client-template, choose 'bar' for the client.
- Run the deployment of tau-web-dev-bar.
*** The deployment will fail, with events like "Error creating: pods "tau-web-dev-bar-1-" is forbidden: [Minimum cpu usage per Container is 29m, but request is 7m., Minimum memory usage per Container is 150Mi, but request is 38Mi.]" on the tau-web-dev-bar-1 replication controller.
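A possible explanation for the mismatch between the 64Mi limit and the 38Mi request in the error (an assumption on my part; nothing in this report confirms it) is that the cluster derives each container's request from its limit, e.g. as a fixed fraction of roughly 60%:

    64Mi limit x ~0.60 ≈ 38Mi request   (below the 150Mi per-container minimum, so pod creation is forbidden)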

The breakage is persistent, even if the memcached container is removed:

- Run `oc export -o yaml dc/tau-web-dev-bar > foo`
- Edit foo to revert the change that added the memcached container.
- Run `oc delete dc/tau-web-dev-bar`
- Run `oc create -f foo`
- Run the deployment of tau-web-dev-bar.
*** The deployment will fail again, with similar events.

Comment 6 Pieter Nagel 2016-08-29 08:49:58 UTC
Created attachment 1195215 [details]
tau-web-dev.common.yaml

Comment 7 Pieter Nagel 2016-08-29 08:50:19 UTC
Created attachment 1195216 [details]
tau-web-dev.client.yaml

Comment 8 Pieter Nagel 2016-08-29 08:58:49 UTC
The pernicious thing about this bug is that the breakage persists for any deploymentconfig that re-uses the 'poisoned' name.

Deleting the deploymentconfig and recreating it with a previously good config does not clear the bug.

Comment 9 Pieter Nagel 2016-08-29 09:13:38 UTC
It's got nothing to do with the memcached image from Docker Hub itself; I just replaced the memcached image with hello-world and could still recreate the bug.

Comment 10 Michail Kargakis 2016-08-29 09:21:21 UTC
Can you post the output of the following commands:

oc logs dc/tau-web-dev-gfa
oc describe project tau-dev
oc get events
oc status -v

Comment 11 Pieter Nagel 2016-08-29 10:21:46 UTC
Created attachment 1195305 [details]
Output requested in comment 10

Comment 12 Pieter Nagel 2016-08-29 10:22:11 UTC
Please note: you asked for logs of dc/tau-web-dev-gfa; what I give you is based on comment 5, where the failing case is dc/tau-web-dev-bar and the succeeding case is dc/tau-web-dev-foo.

I followed all the steps in my comment 5, excluding the steps under the heading 'The breakage is persistent, even if the memcached container is removed:'.

The output you requested is attached as debug.out

Comment 14 Pieter Nagel 2016-08-29 10:24:44 UTC
Earlier today I increased the memory limit for the memcached container from 64Mi to 256Mi, and then the deployment worked. I deleted and recreated the project as per comment 5; I have not yet tested whether that was at all necessary.

Comment 15 Pieter Nagel 2016-08-29 10:31:28 UTC
I just confirmed that, given a deploymentconfig in the 'poisoned' state (a 64Mi memory limit on the memcached container), modifying the limit to 256Mi gets the deploymentconfig 'unpoisoned' and the deployment succeeds, without needing to delete the deploymentconfig or the project it lives in.

That is interesting, because previously I determined that just removing the memcached container and re-running the deployment is not enough, not even when deleting the deploymentconfig and recreating it.
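If, as speculated earlier, the request is derived as roughly 60% of the limit (still only an assumption), the observed fix is consistent with the 150Mi per-container minimum:

    256Mi limit x ~0.60 ≈ 154Mi request   (above the 150Mi minimum, so the pod is admitted)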

Comment 16 Abhishek Gupta 2016-10-14 20:11:09 UTC
The minimum memory that can be specified is 250Mi, and hence specifying 64Mi resulted in the error that you saw. Was that the only issue, and can this bug be closed? If I missed something, please let me know so that I can target the issue specifically.

Comment 17 Abhishek Gupta 2016-10-14 20:12:59 UTC
Lowering severity based on my comment above. Will reconsider severity if some other issue is highlighted.

Comment 18 Pieter Nagel 2016-10-18 06:24:28 UTC
If the minimum that can be specified is 250Mi, then `oc apply` etc. should at least do validation and refuse to update to 64Mi when people like me specify that in the future.

Assuming that people can no longer trip over this bug because they can no longer specify invalid, too-low memory limits, by all means close this bug.
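As a stop-gap, the project's per-container minimums can be inspected before applying a config. This is a suggested manual workflow, not an existing validation feature; the project name is the one used in this report:

    # list the LimitRange objects in the project, including per-container minimums
    oc get limitrange -n tau-dev -o yaml

    # or, for a human-readable summary
    oc describe limitrange -n tau-dev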

Comment 19 Michal Fojtik 2016-10-31 14:46:09 UTC
This is being addressed upstream: https://github.com/kubernetes/kubernetes/pull/20113

Comment 20 Xiaoli Tian 2017-06-15 07:09:39 UTC
OpenShift Online Preview has been decommissioned; go to https://manage.openshift.com/ to use the OpenShift Online Starter cluster.