Bug 1925180
| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Summary | Deployment creates a huge number of ReplicaSets - image-lookup bits | | |
| Product | OpenShift Container Platform | Reporter | Maciej Szulik <maszulik> |
| Component | kube-apiserver | Assignee | Filip Krepinsky <fkrepins> |
| Status | CLOSED DUPLICATE | QA Contact | zhou ying <yinzhou> |
| Severity | medium | Priority | medium |
| Version | 4.6 | Target Release | 4.9.0 |
| Target Milestone | --- | Doc Type | Bug Fix |
| Hardware | Unspecified | OS | Unspecified |
| Clone Of | 1921717 | | 1982717 (view as bug list) |
| Last Closed | 2021-10-18 17:29:03 UTC | | |
| CC | alkazako, aos-bugs, cjerolim, cvogt, fkrepins, mfojtik, mgugino, mjobanek, nmukherj, wking, xxia, yinzhou | | |

Doc Text:

- Cause: When a Deployment and an ImageStream are created at the same time, a race condition can occur during the Deployment's image resolution.
- Consequence: The Deployment controller creates ReplicaSets in an infinite loop.
- Fix: The responsibilities of the API server's imagepolicy admission plugin were reduced.
- Result: Concurrent creation of a Deployment and an ImageStream no longer leads to an unbounded number of ReplicaSets.
Description
Maciej Szulik
2021-02-04 14:35:45 UTC
This is to look into the annotation settings:

    "alpha.image.policy.openshift.io/resolve-names": "*",
    "image.openshift.io/triggers": "[{\"from\":{\"kind\":\"ImageStreamTag\",\"name\":\"golang-sample:latest\",\"namespace\":\"mjobanek-dev\"},\"fieldPath\":\"spec.template.spec.containers[?(@.name==\\\"golang-sample\\\")].image\",\"pause\":\"false\"}]",

in comparison with:

    image.openshift.io/triggers: '[{"from":{"kind":"ImageStreamTag","name":"django-ex:latest"},"fieldPath":"spec.template.spec.containers[?(@.name==\"django-ex\")].image"}]'

to ensure we don't shoot ourselves in the foot again in the future. (A minimal oc sketch of attaching these annotations appears after this comment block.)

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason, or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen to Keywords if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.

Hey @maszulik, I've been investigating this issue a little bit and you can find some notes in the origin issue, see https://bugzilla.redhat.com/show_bug.cgi?id=1921717#c17

You can find my scripts to reproduce this issue in the origin issue as well, or on GitHub: https://github.com/jerolimov/openshift/tree/master/issues/bz-1921717. They fail with two different issues.

On the sandbox cluster they fail for me in about 2/3 of the cases (when creating the resources in parallel):

- 7x successful
- 7x failed with "Forbidden: this image is prohibited by policy: this image is prohibited by policy (changed after admission)"
- 6x failed with too many ReplicaSets

On a local CRC cluster the numbers are a little better, but they fail sometimes as well:

- 10x successful
- 1x failed with "Forbidden: this image is prohibited by policy: this image is prohibited by policy (changed after admission)"
- 1x failed with too many ReplicaSets (created over 1000 ReplicaSets within minutes!)

I'm sure you can drop some of the resources (the Secrets, Route, and Service), but I wanted a script that reproduces the complete console API calls. If you have any questions, feel free to contact me here or on Slack.

The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.
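For reference, here is a minimal sketch of attaching the second, simpler form of these annotations to an existing Deployment with oc. The name django-ex is taken from the annotation quoted in the description; the namespace and the existence of such a Deployment are assumptions for illustration, not part of the bug's reproducer.

```bash
# Sketch only (assumed Deployment django-ex in the current project).

# Let the image policy admission plugin resolve image stream references in this Deployment.
oc annotate deployment/django-ex \
  alpha.image.policy.openshift.io/resolve-names='*' --overwrite

# Image trigger in the simpler form quoted in the description: when the ImageStreamTag
# django-ex:latest changes, rewrite the matching container's image field.
oc annotate deployment/django-ex \
  image.openshift.io/triggers='[{"from":{"kind":"ImageStreamTag","name":"django-ex:latest"},"fieldPath":"spec.template.spec.containers[?(@.name==\"django-ex\")].image"}]' \
  --overwrite
```

The same two annotations can equally be set declaratively under metadata.annotations in the Deployment manifest.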
Sample deployment that triggered this problem: https://github.com/openshift/hypershift/pull/207/files#diff-cb5a6beecbd454946c5c62ee50feffee25a9680a6909477c5f3999d10cf1ebd0

Created attachment 1788551 [details]
reproduce.sh
I uploaded a minimal reproducible example (based on @cjerolim's repo). It didn't reproduce for me in the default namespace, but it did in a test namespace. The reproducibility varies; it is between 10-70%. (See the reproducer sketch further below.)

The results observed by @cjerolim are caused by Scenario 1 and Scenario 2 below. The bug itself originates in a kube-apiserver admission plugin, but ultimately it is facilitated by three components (kube-apiserver, kube-controller-manager and openshift-apiserver) and the following edge-case scenarios:

Scenario 1

1. Deployment D admitted by apiserver (has the original image)
2. Deployment D verified by apiserver (has the original image)
3. Deployment D created
4. ImageStream I created
5. Deployment D starts syncing
6. Deployment D creates ReplicaSet R (both have the original image tag)
7. ReplicaSet R admitted by apiserver (changes the image to a newer version according to ImageStream I)
8. ReplicaSet R verified by apiserver (has the new image)
9. ReplicaSet R created
10. Deployment D finishes syncing and fails due to receiving a new ReplicaSet
11. Deployment D starts syncing again
12. Deployment D compares ReplicaSet R with its pod spec and they are not equal
13. Deployment D decides to create a new ReplicaSet R2 (both have the original image tag)
14. ReplicaSet R2 admitted by apiserver (changes the image to a newer version according to ImageStream I)
15. ReplicaSet R2 verified by apiserver (has the new image)
16. Deployment D finishes syncing and fails due to receiving a new ReplicaSet
17. Deployment D starts syncing again
18. ...

Additionally, Deployment D will never have the correct image tag according to ImageStream I, because the image.openshift.io/triggers annotation on Deployment D is in a wrong format and will never resolve in openshift-apiserver. So kube-controller-manager and kube-apiserver will keep reconciling these incompatible states forever.

Scenario 2

1. Deployment A admitted by apiserver (has the original image tag)
2. ImageStream I created
3. Deployment A is being verified by apiserver (wants to change the image according to ImageStream I during verify)
4. apiserver -> Forbidden: this image is prohibited by policy: this image is prohibited by policy (changed after admission)

Scenario 3

1. ImageStream I created
2. Deployment D admitted (changes the image to a newer version according to ImageStream I)
3. all the logic works fine...

Solution

- We can prevent touching references that are owned by controllers when admitting them, and only set the correct images on the parents. This gives the controllers time to react to the changes, and eventually the correct images will be set in the children as well. Fixes Scenario 1.
- We can skip verify when the race occurs as in Scenario 2. This makes the admission atomic and is similar to a flow where ImageStream I gets created right after verify.
- We could also check the format of image.openshift.io/triggers (JSON validation) to prevent admitting resources with an incorrectly set annotation, although I am not sure whether this is the correct approach and whether annotations should affect admission. This would go into another BZ.

Moving to kube-apiserver - I posted a fix for both scenarios to apiserver-library-go.

    for i in {100..1100}; do oc new-project test$i; /tmp/broken.sh ; oc delete project test$i ; done

Ran the loop; can't reproduce this issue; will move to verified status.
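To make Scenario 1 above more concrete, here is a rough, hypothetical reproducer in the spirit of the attached reproduce.sh. All names, the source image, and the timing are assumptions for illustration; the real script also creates Secrets, a Route, and a Service and lives in attachment 1788551.

```bash
#!/usr/bin/env bash
# Sketch of the Scenario 1 race: populate an ImageStreamTag and create an annotated
# Deployment at roughly the same time, then watch the ReplicaSet count.
set -euo pipefail

NS=test-race
oc new-project "$NS"

# Populate the ImageStreamTag and create the Deployment in parallel to widen the race window.
# The source image is illustrative; any importable image works.
oc import-image golang-sample:latest \
  --from=registry.access.redhat.com/ubi8/ubi:latest --confirm -n "$NS" &
oc create -f - -n "$NS" <<'EOF' &
apiVersion: apps/v1
kind: Deployment
metadata:
  name: golang-sample
  annotations:
    alpha.image.policy.openshift.io/resolve-names: "*"
    image.openshift.io/triggers: '[{"from":{"kind":"ImageStreamTag","name":"golang-sample:latest"},"fieldPath":"spec.template.spec.containers[?(@.name==\"golang-sample\")].image"}]'
spec:
  replicas: 1
  selector:
    matchLabels:
      app: golang-sample
  template:
    metadata:
      labels:
        app: golang-sample
    spec:
      containers:
      - name: golang-sample
        # Bare ImageStream-style reference so the image policy plugin rewrites it
        # once golang-sample:latest resolves (the "image-lookup bits" in the summary).
        image: golang-sample:latest
EOF
wait

# On an affected cluster the ReplicaSet count keeps growing instead of settling at 1.
sleep 60
oc get rs -n "$NS" --no-headers | wc -l
```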
    [root@localhost ~]# oc get clusterversion
    NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.9.0-0.nightly-2021-07-12-203753   True        False         25h     Cluster version is 4.9.0-0.nightly-2021-07-12-203753

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

*** This bug has been marked as a duplicate of bug 1976775 ***