Description of problem:

On the Dev Sandbox platform (https://developers.redhat.com/developer-sandbox) a few users experienced that, after creating a sample application from the catalog (Go or Node.js), OpenShift started creating a huge number of ReplicaSets - all of them owned by one Deployment. It seems that some problematic Deployments keep generating tons of ReplicaSets. In the case of one user it created hundreds of thousands of ReplicaSets. This caused the cluster monitoring to collapse and consume all memory on the node, which left the entire OpenShift cluster in an unstable state.

After this experience, we introduced a counter limit for ReplicaSets in our ClusterResourceQuota that is set per user (max 30); a minimal sketch of such a quota is shown at the end of this comment. After doing this, some users hit this limit in the same way, ending up with 30 ReplicaSets owned by one Deployment.

There is another issue related to ClusterResourceQuota: https://bugzilla.redhat.com/show_bug.cgi?id=1920699 - both issues could be related, but that is only my guess.

Version-Release number of selected component (if applicable):
OSD 4.6.8

How reproducible:
Sign up for Dev Sandbox https://developers.redhat.com/developer-sandbox and create a sample app. However, it's not 100% reproducible.

Steps to Reproduce:
1. Sign up for Dev Sandbox https://developers.redhat.com/developer-sandbox
2. In the web console, in the Developer perspective, when adding an app click on "Samples"
3. Choose either Go or Node.js
4. Create

Technically, it could also be reproducible in OCP - you can use our ClusterResourceQuota from https://github.com/codeready-toolchain/host-operator/blob/master/deploy/templates/nstemplatetiers/basic/cluster.yaml
The template used for creating the namespace is here: https://github.com/codeready-toolchain/host-operator/blob/master/deploy/templates/nstemplatetiers/basic/ns_code.yaml

Actual results:
It starts creating a huge number of ReplicaSets

Expected results:
It should create only one ReplicaSet
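For reference, a minimal sketch of such a per-user ClusterResourceQuota with a ReplicaSet count limit. The quota name, label selector and label key here are illustrative placeholders only - the real template is the cluster.yaml linked above:

{noformat}
cat <<'EOF' | oc apply -f -
apiVersion: quota.openshift.io/v1
kind: ClusterResourceQuota
metadata:
  name: for-exampleuser             # placeholder; the real quota is created per user, e.g. "for-kkanova"
spec:
  selector:
    labels:
      matchLabels:
        sandbox-owner: exampleuser  # hypothetical label key; see the linked cluster.yaml for the real selector
  quota:
    hard:
      count/replicasets.apps: "30"  # the per-user ReplicaSet cap mentioned above
EOF
{noformat}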
I was just able to reproduce it. I watched the ReplicaSets with the command:

oc get replicasets --watch-only -o json

Here is the output: https://gist.githubusercontent.com/MatousJobanek/a6a6b7512f70c0b0e65c8d4d7a2b066d/raw/98d427b6816b032098063a01b26cf085d686a799/replicasets-out.txt
Well, based on the ReplicaSets in https://gist.githubusercontent.com/MatousJobanek/a6a6b7512f70c0b0e65c8d4d7a2b066d/raw/98d427b6816b032098063a01b26cf085d686a799/replicasets-out.txt something is constantly updating the pod template inside the Deployment, which results in ReplicaSet creation (it rolls out a new revision). To root-cause this, we will need to know which field in this Deployment's pod template is being modified. My guess is that some admission hook is altering it, but please check which fields are modified and by what client. Since we are not seeing this in OCP, I'm setting this as blocker- for 4.7.0
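One way to spot the modified field, as a sketch: pick two of the generated ReplicaSets by hand and diff them (the names and namespace below are placeholders):

{noformat}
# dump two consecutive ReplicaSets and compare them; the interesting part is spec.template
oc get rs golang-sample-aaaaaaaaaa -n mjobanek-dev -o yaml > rs-old.yaml
oc get rs golang-sample-bbbbbbbbbb -n mjobanek-dev -o yaml > rs-new.yaml
diff rs-old.yaml rs-new.yaml
{noformat}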
Thanks for your reply. We have an admission webhook there, but it doesn't modify the Deployment, only a Pod: https://github.com/codeready-toolchain/member-operator/blob/master/deploy/webhook/member-operator-webhook.yaml#L66-L97
We set priority and priorityClassName to use our custom one (to set a lower priority for users' pods) - here is the logic: https://github.com/codeready-toolchain/member-operator/blob/master/pkg/webhook/mutatingwebhook/mutate.go
But as I said, it should not modify anything in the Deployment, only a Pod. I'll try to observe which fields are modified, but I'm not sure how to figure out which client does it...
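A quick way to double-check that the webhook only matches Pods could look like this (sketch; the configuration name here is a guess based on the manifest filename linked above):

{noformat}
# list the mutating webhook configurations and inspect their rules -
# for the member-operator webhook the rules should reference pods only, not deployments
oc get mutatingwebhookconfigurations
oc get mutatingwebhookconfiguration member-operator-webhook -o yaml
{noformat}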
> I'll try to observe which fields are modified, but I'm not sure how to figure out which client does it...

The YAML metadata.managedFields should say who last set each property, and you can also look in the audit logs [1].

[1]: https://docs.openshift.com/container-platform/4.6/security/audit-log-view.html
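A sketch of both checks (deployment name and namespace are taken from the examples in this bug; the audit-log command follows the linked doc and needs cluster-admin, which may not be available on OSD):

{noformat}
# on 4.6 the managedFields are included directly in the YAML output
oc get deployment golang-sample -n mjobanek-dev -o yaml | less   # look at metadata.managedFields

# crude filter over the kube-apiserver audit logs for writes to that Deployment
oc adm node-logs --role=master --path=kube-apiserver/audit.log \
  | grep '"resource":"deployments"' | grep golang-sample
{noformat}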
I'd like to see the following:
- deployment yaml
- replicasets yaml
- pods yaml
- events yaml

Additionally, kube-controller-manager logs at -v=4 and ideally audit events, just like Trevor mentioned in the previous comment (see the sketch below for how these could be collected). From the ReplicaSet watch stream it's clear the deployment controller is noticing a change and is rolling out new revisions.
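A possible way to collect that data (sketch; the namespace and pod names are placeholders, and the logLevel patch is my assumption of how to get roughly -v=4 out of the operator-managed KCM):

{noformat}
NS=mjobanek-dev
oc get deployment,replicaset,pod,event -n "$NS" -o yaml > "$NS-dump.yaml"

# kube-controller-manager logs (repeat for each control-plane node's pod)
oc get pods -n openshift-kube-controller-manager
oc logs -n openshift-kube-controller-manager kube-controller-manager-<node-name> \
  -c kube-controller-manager > kcm.log

# raise KCM verbosity via the operator (Debug should correspond to roughly -v=4)
oc patch kubecontrollermanager cluster --type=merge -p '{"spec":{"logLevel":"Debug"}}'
{noformat}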
Ok, I'll try to provide all of this data as soon as I reproduce it again - it's pretty hard to hit the issue since it doesn't happen often.
I've been trying to reproduce it again in my Sandbox environment, but without any success so far. However, a colleague of mine figured out that she had hit the same issue already 4 days ago.

Deployment: https://gist.github.com/MatousJobanek/e3483b95201eadddb10a5095b3c38d22
The last item in the managed fields (manager: member-operator) is from our operator that tried to idle the deployment after 8 hours.

ReplicaSets: https://gist.github.com/MatousJobanek/d485980eea8d4bc0da9c341929a2f257

There are no pods.

The only event that is available:
{noformat}
5m29s Warning ReplicaSetCreateError deployment/python-sample Failed to create new replica set "python-sample-6cf6dbc5bd": replicasets.apps "python-sample-6cf6dbc5bd" is forbidden: exceeded quota: for-kkanova, requested: count/replicasets.apps=1, used: count/replicasets.apps=30, limited: count/replicasets.apps=30
{noformat}

I'll attach the kube-controller-manager logs as a file.

Unfortunately, as a dedicated admin I don't have permissions to read the audit logs in OSD. If they are really needed, we can ask the SRE team to provide the logs.
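For completeness, a small sketch to confirm that all 30 remaining ReplicaSets are owned by the same Deployment (namespace is a placeholder):

{noformat}
oc get rs -n <namespace> \
  -o custom-columns=NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].name | sort -k2
{noformat}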
The file with the logs is too big to add it as an attachment. Here is the link to the file: https://drive.google.com/file/d/1w6RC1bf7IzAVfLOC7SatNawQUcj_hKnD/view?usp=sharing
That's good for starters, thx! I'll have a look.
Hey, I was just able to reproduce it. I had spent hours trying it in Firefox, then I tried Chrome and hit the issue after a few attempts. Since it seems browser-specific, I guess it's more of a console issue than a deployment one.

I watched the deployments in the namespace using:

oc get deployments --watch-only -o json

here is the log: https://gist.github.com/MatousJobanek/122a83f77218ee2006ad94938fe83e5d

ReplicaSets: https://gist.github.com/MatousJobanek/09c1681292bb2b49570f5ccb346ebba5

events: https://gist.github.com/MatousJobanek/c39ad3ca88c3a42f34ae7db863e34422

(as a side note: I had also created a python sample app before the Go one when trying to reproduce it)
here is the kube-controller-manager log from the time when I started working in the mjobanek-dev namespace: https://gist.githubusercontent.com/MatousJobanek/89cfc8cf7278041efc9ddeed56b4eba7/raw/bce5293f2d176c0c7af19cce67c82c322837bd03/mjobanek-kube-controller-manager.log
Looks like the problem is coming from the web console - I was just able to reproduce it simply by creating a sample Python application. In contrast, when I invoked:

oc new-app https://github.com/sclorg/django-ex.git

there was no such problem. It looks like the deployment created from the web console has incorrect annotations:

{noformat}
"alpha.image.policy.openshift.io/resolve-names": "*",
"image.openshift.io/triggers": "[{\"from\":{\"kind\":\"ImageStreamTag\",\"name\":\"golang-sample:latest\",\"namespace\":\"mjobanek-dev\"},\"fieldPath\":\"spec.template.spec.containers[?(@.name==\\\"golang-sample\\\")].image\",\"pause\":\"false\"}]",
{noformat}

whereas in the new-app version they are:

{noformat}
image.openshift.io/triggers: '[{"from":{"kind":"ImageStreamTag","name":"django-ex:latest"},"fieldPath":"spec.template.spec.containers[?(@.name==\"django-ex\")].image"}]'
{noformat}

Sending this over to the console team for further investigation.
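To compare the two, the trigger annotation can be pulled from each Deployment directly (sketch; the deployment names are the ones from the examples above, adjust the namespace as needed):

{noformat}
# console-created sample
oc get deployment golang-sample -n mjobanek-dev -o yaml | grep 'image.openshift.io/triggers'
# new-app-created sample
oc get deployment django-ex -n mjobanek-dev -o yaml | grep 'image.openshift.io/triggers'
{noformat}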
I've created a clone in https://bugzilla.redhat.com/show_bug.cgi?id=1925180 to look into the image-lookup bits more from the kcm point of view to ensure we don't get trapped in a similar problem again.
I investigated this issue and tried to reproduce it across different environments, and I could only rarely reproduce it on the Dev Sandbox cluster's console. I was not able to reproduce it on an OCP console, on a local console pointing to the Dev Sandbox cluster, or with the oc command using the same resources that are created by the web console. I could not even reproduce it on the Dev Sandbox console enough times to find a pattern that could help identify the issue.
Have you tried to create an automated test to reproduce it more often on Dev Sandbox? Is there anything we could help you with as the Sandbox team in analyzing the problem? From our experience, it's an issue that is observed very rarely, but when it is hit it can break the whole cluster if an appropriate quota is not set. In our case, it breaks "just" the user's environment. Unfortunately, the issue was also observed during a live demo on openshift.tv.
Hi, I reproduced this last week, but could not reproduce it today on the same Sandbox account.

There is no fix available for 4.6 yet, but one current idea is that it is linked to https://bugzilla.redhat.com/show_bug.cgi?id=1933664. The web console creates all resources for a sample at the same time (1 ImageStream, 1 BuildConfig, 2 Secrets, 1 Deployment, 1 Service and 1 Route), which could result in some race conditions. The other issue shows some Forbidden errors in the same rare cases. With the latest version these resources are now created sequentially, but that fix is currently only available for 4.8 and 4.7; a backport for 4.6 is in progress and is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1933665.

So my expectation is that it should still be possible to reproduce this. I will prepare a shell script which reproduces the API calls automatically.
After some more investigation I can now confirm that we hit this issue 'only' if we create multiple resources at the same time.

TL;DR: For the samples in the console we fixed this already in 4.8 and 4.7 and are currently backporting it to 4.6 as well. The change creates the Secrets, Route, ImageStream and Deployment sequentially instead of in parallel.

- for 4.8: https://bugzilla.redhat.com/show_bug.cgi?id=1933664
- for 4.7: https://bugzilla.redhat.com/show_bug.cgi?id=1933665
- for 4.6: https://bugzilla.redhat.com/show_bug.cgi?id=1933666

I attached the scripts to this issue and published them on https://github.com/jerolimov/openshift/tree/master/issues/bz-1921717 as well.

With the script create-samples-works.sh I could verify over 130 times that the Deployment starts successfully if the resources are created sequentially. The app starts 1-2 ReplicaSets and a Pod. The second ReplicaSet is created after the build finishes and publishes a new ImageStream.

The script create-samples-broken.sh creates the same resources but creates all of them 'at the same time' (in parallel). For this I just run the 7 oc commands in the background (see the sketch after this comment). Because this fails so often, I ran it only a few times. I got the following results:

- 7x Successful
- 7x Forbidden: this image is prohibited by policy: this image is prohibited by policy (changed after admission)
- 6x Found too much replica sets

FULL FORBIDDEN ERROR LOG:

> Create golang-sample with suffix -1 in namespace cjerolim-code...
>
> Create resources...
>
> secret/golang-sample-1-github-webhook-secret created
> route.route.openshift.io/golang-sample-1 created
> secret/golang-sample-1-generic-webhook-secret created
> imagestream.image.openshift.io/golang-sample-1 created
> buildconfig.build.openshift.io/golang-sample-1 created
> The Deployment "golang-sample-1" is invalid: spec.template.spec.containers[0].image: Forbidden: this image is prohibited by policy: this image is prohibited by policy (changed after admission)
> service/golang-sample-1 created
>
> Wait 2 minute for build...
>
> No resources found in cjerolim-code namespace.
> Error from server (NotFound): deployments.apps "golang-sample-1" not found

FULL TOO MUCH REPLICA SET LOG:

> Create golang-sample with suffix -1 in namespace cjerolim-code...
>
> Create resources...
>
> secret/golang-sample-1-generic-webhook-secret created
> imagestream.image.openshift.io/golang-sample-1 created
> deployment.apps/golang-sample-1 created
> service/golang-sample-1 created
> secret/golang-sample-1-github-webhook-secret created
> route.route.openshift.io/golang-sample-1 created
> buildconfig.build.openshift.io/golang-sample-1 created
>
> Wait 2 minute for build...
>
> Found too much replica sets.

The Deployment starts the Pod successfully, but the status of Deployment golang-sample-1 still doesn't look good:

> status:
>   conditions:
>     - type: Progressing
>       status: 'False'
>       lastUpdateTime: '2021-04-06T19:36:44Z'
>       lastTransitionTime: '2021-04-06T19:36:44Z'
>       reason: ReplicaSetCreateError
>       message: >-
>         Failed to create new replica set "golang-sample-1-5f48d6dc46":
>         replicasets.apps "golang-sample-1-5f48d6dc46" is forbidden: exceeded
>         quota: for-cjerolim-replicas, requested: count/replicasets.apps=1, used:
>         count/replicasets.apps=30, limited: count/replicasets.apps=30
>   collisionCount: 20

Further investigation on how fast the ReplicaSets were created:

> mm:ss  count
> 00:32  0
> 00:37  6
> 00:43  15
> 02:03  18
> 02:09  21
> 07:11  21 (stop watching)

Another case:

> No resources found in cjerolim-code namespace.
> 00:30  0
> 00:36  4
> 00:42  14
> 04:28  21
> 14:29  25
> 24:27  27
> 34:29  29 (stop watching)

Another case:

> 00:43  0
> 00:45  10
> 39:47  13
> 03:38  21
> 08:27  25
> 08:28  27
> 22:36  27 (stop watching)
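For illustration, a rough sketch of the two creation modes described above (not the attached scripts themselves; the manifest file names are placeholders):

{noformat}
MANIFESTS="secret-github.yaml route.yaml secret-generic.yaml imagestream.yaml buildconfig.yaml deployment.yaml service.yaml"

# sequential - what the fixed console does; this worked reliably
for f in $MANIFESTS; do
  oc create -f "$f"
done

# parallel - all 7 creates fired at once; this is what reproduces the race
for f in $MANIFESTS; do
  oc create -f "$f" &
done
wait
{noformat}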
Created attachment 1769696 [details] check-replicasets.sh
Created attachment 1769697 [details] create-samples-broken.sh
Created attachment 1769698 [details] create-samples-works.sh
Created attachment 1769699 [details] cleanup-all-samples.sh
Marking this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1933666 (the 4.6 backport), as that backport will fix this for our Sandbox users and customers. For more details see comment 17.

@maszulik created https://bugzilla.redhat.com/show_bug.cgi?id=1925180 to track this from the kcm point of view, see comment 13.

*** This bug has been marked as a duplicate of bug 1933666 ***