Bug 1921717 - Deployment creates a huge number of ReplicaSets
Summary: Deployment creates a huge number of ReplicaSets
Status: CLOSED DUPLICATE of bug 1933666
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Dev Console
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
: 4.8.0
Assignee: Christoph Jerolimov
QA Contact: Gajanan More
Depends On:
TreeView+ depends on / blocked
Reported: 2021-01-28 13:43 UTC by Matous Jobanek
Modified: 2021-04-06 22:14 UTC (History)
11 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1925180 (view as bug list)
Last Closed: 2021-04-06 22:14:55 UTC
Target Upstream Version:

Attachments (Terms of Use)
check-replicasets.sh (156 bytes, application/x-shellscript)
2021-04-06 22:06 UTC, Christoph Jerolimov
no flags Details
create-samples-broken.sh (8.29 KB, application/x-shellscript)
2021-04-06 22:06 UTC, Christoph Jerolimov
no flags Details
create-samples-works.sh (8.27 KB, application/x-shellscript)
2021-04-06 22:07 UTC, Christoph Jerolimov
no flags Details
cleanup-all-samples.sh (430 bytes, application/x-shellscript)
2021-04-06 22:07 UTC, Christoph Jerolimov
no flags Details

Description Matous Jobanek 2021-01-28 13:43:36 UTC
Description of problem:

On Dev Sandbox platform
(https://developers.redhat.com/developer-sandbox) a few users experienced that after creating a sample application from the catalog (Go or Node.js), OpenShift started creating a huge number of ReplicaSets - all of them were owned by one deployment.
It seems that some problematic deployments keep generating tons of replica sets. 

It also happened that in a case of one user it created hundreds of thousands of ReplicaSets. This caused that the cluster monitoring collapsed and ate all memory from the node which ended up in the unstable state of the entire OS cluster.

After this experience, we introduced a counter limit for ReplicaSets in our ClusterResourceQuota that is set per user (max 30). After doing this, there were users that hit this limit in the same way ending with 30 ReplicaSets owned by one Deployment.

There is another issue that is related to ClusterResourceQuota: https://bugzilla.redhat.com/show_bug.cgi?id=1920699 so both issues could be related, but it's only my guess.

Version-Release number of selected component (if applicable):

OSD 4.6.8

How reproducible:

Signup for Dev Sandbox https://developers.redhat.com/developer-sandbox and create a sample app. However, it's not 100% reproducible. 

Steps to Reproduce:
1. Signup for Dev Sandbox https://developers.redhat.com/developer-sandbox
2. In Web Console in Developer tab when adding an app click on "Samples"
3. Choose either Go or Node.js
4. Create

Technically, it could be also reproducible in OCP - you can use our ClusterResourceQuota from https://github.com/codeready-toolchain/host-operator/blob/master/deploy/templates/nstemplatetiers/basic/cluster.yaml
The template used for creating the same namespace is here: https://github.com/codeready-toolchain/host-operator/blob/master/deploy/templates/nstemplatetiers/basic/ns_code.yaml

Actual results:

It starts created a huge number of ReplicaSets

Expected results:

Should create only one ReplicaSet

Comment 1 Matous Jobanek 2021-01-28 13:57:30 UTC
I was just able to reproduce it. 
I watched the ReplicaSets with the command: oc get replicasets --watch-only -o json
here is the output: https://gist.githubusercontent.com/MatousJobanek/a6a6b7512f70c0b0e65c8d4d7a2b066d/raw/98d427b6816b032098063a01b26cf085d686a799/replicasets-out.txt

Comment 2 Michal Fojtik 2021-02-01 12:28:09 UTC
Well, based on the replicasets in https://gist.githubusercontent.com/MatousJobanek/a6a6b7512f70c0b0e65c8d4d7a2b066d/raw/98d427b6816b032098063a01b26cf085d686a799/replicasets-out.txt something is constantly updating the pod template inside deployment which results to replicaset creation (it rollout new revision).

To root cause this, we will need to know what field in this deployment pod template is being modified. My guess is some admission hook is used to alter it, but please check what fields are modified and by what client.

Since we are not seeing this in OCP, I'm setting this as blocker- for 4.7.0

Comment 3 Matous Jobanek 2021-02-01 18:01:21 UTC
Thanks for your reply.

We have an admission webhook there, but it doesn't modify the deployment, but a Pod. https://github.com/codeready-toolchain/member-operator/blob/master/deploy/webhook/member-operator-webhook.yaml#L66-L97
We set a priority and priorityClassName to use our custom one (to set lower priority for users' pods) - here is the logic: https://github.com/codeready-toolchain/member-operator/blob/master/pkg/webhook/mutatingwebhook/mutate.go
But as I have said, it should not modify anything in the deployment, but only a Pod.

I'll try to observe which field are modified, but I'm not sure how to figure out which client does it...

Comment 4 W. Trevor King 2021-02-01 20:57:56 UTC
> I'll try to observe which field are modified, but I'm not sure how to figure out which client does it...

YAML metadata.managedFields should say who last set each property, and you can also look in the audit logs [1].

[1]: https://docs.openshift.com/container-platform/4.6/security/audit-log-view.html

Comment 5 Maciej Szulik 2021-02-02 11:14:29 UTC
I'd like to see the following:
- deployment yaml
- replicasets yaml
- pods yaml
- events yaml

Additionally, kube-controller-manager logs at -v=4 and ideally audit events, just like Trevor mentioned in the previous comment.

From the replicaset watch stream it's clear the deployment controller is noticing a change and is rolling new revisions.

Comment 6 Matous Jobanek 2021-02-02 18:24:10 UTC
Ok, I'll try to provide all these data as soon as I reproduce it again - it's pretty hard to hit the issue since it doesn't happen often.

Comment 7 Matous Jobanek 2021-02-03 12:42:27 UTC
I've been trying to reproduce it again for my Sandbox environment, but without any success so far.
However, a colleague of mine figured out that she hit the same issue, but already 4 days ago.

Deployment: https://gist.github.com/MatousJobanek/e3483b95201eadddb10a5095b3c38d22 The last item in the managed fields (manager: member-operator) is from our operator that tried to idle the deployment after 8 hours.
ReplicaSets: https://gist.github.com/MatousJobanek/d485980eea8d4bc0da9c341929a2f257
There are no pods
The only event that is available: 
5m29s       Warning   ReplicaSetCreateError   deployment/python-sample   Failed to create new replica set "python-sample-6cf6dbc5bd": replicasets.apps "python-sample-6cf6dbc5bd" is forbidden: exceeded quota: for-kkanova, requested: count/replicasets.apps=1, used: count/replicasets.apps=30, limited: count/replicasets.apps=30

I'll attach the kube-controller-manager logs as a file
Unfortunately, I - as a dedicated admin - don't have permissions to read the audit logs in OSD. If they are really needed, we can try to ask SRE team for providing the logs.

Comment 8 Matous Jobanek 2021-02-03 12:47:09 UTC
The file with the logs is too big to add it as an attachment. Here is the link to the file:  https://drive.google.com/file/d/1w6RC1bf7IzAVfLOC7SatNawQUcj_hKnD/view?usp=sharing

Comment 9 Maciej Szulik 2021-02-03 12:49:18 UTC
That's good for starters, thx! I'll have a look.

Comment 10 Matous Jobanek 2021-02-03 13:28:09 UTC
Hey, I was just able to reproduce it. I've spent hours trying it in Firefox, then I tried in Chrome and I hit the issue after a few attempts.
Since it's browser specific, then I guess that it's more console issue than a deployment one.

I watched the deployments in the namespace using:
oc get deployments --watch-only -o json
here is the log: https://gist.github.com/MatousJobanek/122a83f77218ee2006ad94938fe83e5d

ReplicaSets: https://gist.github.com/MatousJobanek/09c1681292bb2b49570f5ccb346ebba5

events: https://gist.github.com/MatousJobanek/c39ad3ca88c3a42f34ae7db863e34422 (as a side note: I had created also a python sample app before the Go one when trying to reproduce it)

Comment 11 Matous Jobanek 2021-02-03 13:37:11 UTC
here is the kube-controller-manager log from the time when I started working in the mjobanek-dev namespace: https://gist.githubusercontent.com/MatousJobanek/89cfc8cf7278041efc9ddeed56b4eba7/raw/bce5293f2d176c0c7af19cce67c82c322837bd03/mjobanek-kube-controller-manager.log

Comment 12 Maciej Szulik 2021-02-04 12:37:33 UTC
Looks like the problem is coming from web-console, I was just able to reproduce it simply 
by creating a sample python application. In contrast when I invoked:
oc new-app https://github.com/sclorg/django-ex.git 
there was no such problem. 

It looks like the deployment created from the web-console has incorrect annotations:

"alpha.image.policy.openshift.io/resolve-names": "*",
"image.openshift.io/triggers": "[{\"from\":{\"kind\":\"ImageStreamTag\",\"name\":\"golang-sample:latest\",\"namespace\":\"mjobanek-dev\"},\"fieldPath\":\"spec.template.spec.containers[?(@.name==\\\"golang-sample\\\")].image\",\"pause\":\"false\"}]",

which in new-app version are:

image.openshift.io/triggers: '[{"from":{"kind":"ImageStreamTag","name":"django-ex:latest"},"fieldPath":"spec.template.spec.containers[?(@.name==\"django-ex\")].image"}]'

Sending this over to console team for further investigation.

Comment 13 Maciej Szulik 2021-02-04 14:39:16 UTC
I've created a clone in https://bugzilla.redhat.com/show_bug.cgi?id=1925180 to look into the image-lookup bits more
from the kcm point of view to ensure we don't get trapped in a similar problem again.

Comment 14 Rohit Rai 2021-02-17 18:43:41 UTC
I investigated about this issue and tried to reproduce it across different environments and I could only rarely reproduce it on dev sandbox cluster's console. I was not able to reproduce it on OCP console, local console pointing to dev sandbox cluster and oc command using same resources that are created by web console. I could not even reproduce it on the dev sandbox console enough times to actually find a pattern that could help identify the issue.

Comment 15 Matous Jobanek 2021-02-19 13:19:17 UTC
Have you tried to create an automated test to reproduce it more times on Dev Sandbox? Is there anything we could help you with as a Sandbox team with analyzing the problem?
From our experience, it's an issue that is observed very rarely, but when it's hit then it can break the whole cluster if an appropriate quota is not set. In our case, it breaks "just" the user's environment. Unfortunately, the issue was also observed during a live demo on openshift.tv.

Comment 16 Christoph Jerolimov 2021-03-29 12:41:00 UTC
Hi, I have reproduced this last week, but could not reproduce it today on the same sandbox account.

There is no fix yet available for 4.6, but one idea currently is that it is linked to https://bugzilla.redhat.com/show_bug.cgi?id=1933664.

The web console creates all resources for a sample at the same time (1 ImageStream, 1 BuildConfig, 2 Secrets, 1 Deployment, 1 Service and 1 Route), this could result in some race conditions. The other issue shows some Forbidden in same rare cases. With the latest version this resources will now create sequential. But this fix is currently only available for 4.8 and 4.7, a backport for 4.6 is in progress. The 4.6 backport is tracked here https://bugzilla.redhat.com/show_bug.cgi?id=1933665.

So my expectation is that it should still be possible to reproduce this. I will prepare a shell script which reproduce the API calls automatically.

Comment 17 Christoph Jerolimov 2021-04-06 22:04:25 UTC
After some more investigation I can now confirm that we have this issues 'only' if we create multiple resources at the same time. TL;DR: For the samples in the console we fixed this already in 4.8 and 4.7 and are currently backport this to 4.6 as well. The change creates the Secret, Route, Secret, ImageStream and Deployment sequential instead of parallel.

- for 4.8: https://bugzilla.redhat.com/show_bug.cgi?id=1933664
- for 4.7: https://bugzilla.redhat.com/show_bug.cgi?id=1933665
- for 4.6: https://bugzilla.redhat.com/show_bug.cgi?id=1933666

I attached the scripts to this issue and publish them on https://github.com/jerolimov/openshift/tree/master/issues/bz-1921717 as well.

With the script create-samples-works.sh I could verify over 130 times that the Deployments starts successfully if the resources are created sequential. The app starts 1-2 ReplicaSets and a Pod. The second ReplicaSet was created after the build finish and publish a new ImageStream.

The script create-samples-broken.sh creates the same resources but creates all of them 'at the same time' (in parallel). For this I just run the 7 oc commands in background. Because this fails so often I run this only a few times. I get the following errors while doing this:

- 7x Successful
- 7x Forbidden: this image is prohibited by policy: this image is prohibited by policy (changed after admission)
- 6x Found too much replica sets

> Create golang-sample with suffix -1 in namespace cjerolim-code...
> Create resources...
> secret/golang-sample-1-github-webhook-secret created
> route.route.openshift.io/golang-sample-1 created
> secret/golang-sample-1-generic-webhook-secret created
> imagestream.image.openshift.io/golang-sample-1 created
> buildconfig.build.openshift.io/golang-sample-1 created
> The Deployment "golang-sample-1" is invalid: spec.template.spec.containers[0].image: Forbidden: this image is prohibited by policy: this image is prohibited by policy (changed after admission)
> service/golang-sample-1 created
> Wait 2 minute for build...
> No resources found in cjerolim-code namespace.
> Error from server (NotFound): deployments.apps "golang-sample-1" not found

> Create golang-sample with suffix -1 in namespace cjerolim-code...
> Create resources...
> secret/golang-sample-1-generic-webhook-secret created
> imagestream.image.openshift.io/golang-sample-1 created
> deployment.apps/golang-sample-1 created
> service/golang-sample-1 created
> secret/golang-sample-1-github-webhook-secret created
> route.route.openshift.io/golang-sample-1 created
> buildconfig.build.openshift.io/golang-sample-1 created
> Wait 2 minute for build...
> Found too much replica sets.

The Deployment starts the Pod successfully, but the Deployment golang-sample-1 status still doesn't look good:
> status:
>   conditions:
>     - type: Progressing
>       status: 'False'
>       lastUpdateTime: '2021-04-06T19:36:44Z'
>       lastTransitionTime: '2021-04-06T19:36:44Z'
>       reason: ReplicaSetCreateError
>       message: >-
>         Failed to create new replica set "golang-sample-1-5f48d6dc46":
>         replicasets.apps "golang-sample-1-5f48d6dc46" is forbidden: exceeded
>         quota: for-cjerolim-replicas, requested: count/replicasets.apps=1, used:
>         count/replicasets.apps=30, limited: count/replicasets.apps=30
>   collisionCount: 20

Further investigations on how fast the ReplicaSets was created.
> mm:ss count
> 00:32 0
> 00:37 6
> 00:43 15
> 02:03 18
> 02:09 21
> 07:11 21 (stop watching)

Another case:
> No resources found in cjerolim-code namespace.
> 00:30 0
> 00:36 4
> 00:42 14
> 04:28 21
> 14:29 25
> 24:27 27
> 34:29 29 (stop watching)

Another case:
> 00:43 0
> 00:45 10
> 39:47 13
> 03:38 21
> 08:27 25
> 08:28 27
> 22:36 27 (stop watching)

Comment 18 Christoph Jerolimov 2021-04-06 22:06:21 UTC
Created attachment 1769696 [details]

Comment 19 Christoph Jerolimov 2021-04-06 22:06:55 UTC
Created attachment 1769697 [details]

Comment 20 Christoph Jerolimov 2021-04-06 22:07:19 UTC
Created attachment 1769698 [details]

Comment 21 Christoph Jerolimov 2021-04-06 22:07:45 UTC
Created attachment 1769699 [details]

Comment 22 Christoph Jerolimov 2021-04-06 22:14:55 UTC
Mark this as duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1933666 (4.6 backport) as this backport will fix this for our sandbox users and customers. For more details see comment 17.

And @maszulik@redhat.com created https://bugzilla.redhat.com/show_bug.cgi?id=1925180 for tracking this from the kcm point of view, see comment 13.

*** This bug has been marked as a duplicate of bug 1933666 ***

Note You need to log in before you can comment on or make changes to this bug.