Description of problem:

The tl;dr is in https://coreos.slack.com/archives/C014MHHKUSF/p1631629868053400

Adam, feel free to change the severity at your discretion, but it seems plausible users will hit this with upgrades to 4.8.

The high-level parameters of the scenario:
- a BuildConfig (BC) with an ImageChangeTrigger (ICT) is created in 4.7, and the spec lastTriggeredImageID field is updated
- the BC's ICT image does not change again until after an upgrade to 4.8, so only its spec is still updated
- the ICT then fires after the 4.8 upgrade

The details:
- On the OCM / trigger side, at 4.8, we only check the status to see whether we have already triggered: https://github.com/openshift/openshift-controller-manager/blob/release-4.8/pkg/image/trigger/buildconfigs/buildconfigs.go#L179-L186 ... so at 4.8, it attempts to trigger.
- On the API server side, we check both spec and status to see whether we have already triggered for a given image, and if so, error out. This still occurs in 4.10, 4.9, and 4.8 (in 4.9, we had a TODO to remove it in 4.10):
  4.10: https://github.com/openshift/openshift-apiserver/blob/release-4.10/pkg/build/apiserver/buildgenerator/generator.go#L374-L379
  4.9: https://github.com/openshift/openshift-apiserver/blob/release-4.9/pkg/build/apiserver/buildgenerator/generator.go#L374-L379
  4.8: https://github.com/openshift/openshift-apiserver/blob/release-4.8/pkg/build/apiserver/buildgenerator/generator.go#L374-L379

On the API server side, I believe we would remove the spec/req check on those lines for 4.9 and 4.10, as we no longer want to update the spec lastTriggeredImageID, and we want to update status regardless of what is in spec.

On the OCM side, I believe we need to check both spec and status for the foreseeable future, where if either matches, we continue/skip, as we may be dealing with a BC that was not touched since we went to 4.8 ... so that piece needs to go back all the way to 4.8.

Version-Release number of selected component (if applicable):
4.8, 4.9, 4.10

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
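For context, the two fields involved can be inspected directly on a live cluster; a sketch against a hypothetical bc/my-bc in namespace my-ns (the jsonpath expressions reflect my reading of the 4.8 BuildConfig API, they are not taken from this report):

# Spec-side (legacy) last-triggered image ID:
$ oc get bc/my-bc -n my-ns -o jsonpath='{.spec.triggers[*].imageChange.lastTriggeredImageID}{"\n"}'
# Status-side (new in 4.8) last-triggered image IDs:
$ oc get bc/my-bc -n my-ns -o jsonpath='{.status.imageChangeTriggers[*].lastTriggeredImageID}{"\n"}'
# A BC hits this bug when the spec-side field is set but the status side has no matching entry.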
verified
Workaround:

1. Identify the BuildConfigs that are failing to start a build from an ImageChange trigger:

   $ oc get events -A | grep "BuildConfigTriggerFailed" | awk '{ print $5 " -n " $1 }' | sort | uniq

2. For each affected BuildConfig:

   a. Export the BuildConfig spec in YAML format: `$ oc get bc/{name} -n {namespace} -o yaml > bc-{namespace}-{name}.yaml`
   b. Delete the BuildConfig: `$ oc delete bc/{name} -n {namespace}`
   c. In the exported BuildConfig YAML file, remove the status and the unique identifiers in the BuildConfig metadata (uid, creation timestamp, etc.), leaving the BuildConfig spec and pertinent metadata.
   d. Recreate the BuildConfig: `$ oc apply -f bc-{namespace}-{name}.yaml`
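For step c, if jq is available (an assumption on my part; it is not part of the original workaround), the cleanup can be scripted against a JSON export instead of hand-editing the YAML:

$ oc get bc/{name} -n {namespace} -o json \
    | jq 'del(.status, .metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp, .metadata.generation, .metadata.managedFields)' \
    > bc-{namespace}-{name}.json
# Recreate from the filtered export instead of the raw YAML:
$ oc apply -f bc-{namespace}-{name}.json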
Need a slight adjustment to c): you also need to delete the last triggered image ID in the spec; it is not filtered out during create: https://github.com/openshift/openshift-apiserver/blob/5e3c6847d4b5757c5fc7865b1d513a1d8d5eebb2/pkg/build/apiserver/registry/buildconfig/strategy.go
`cat bc-{namespace}-{name}.yaml | grep -v lastTriggeredImageID > filtered-bc-{namespace}-{name}.yaml` worked for me locally to achieve what I noted in comment #14.
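Putting the identification step, the export/filter, and the spec lastTriggeredImageID adjustment together, a one-shot sketch of the whole workaround (untested; assumes jq and the event output columns used above):

$ oc get events -A | grep "BuildConfigTriggerFailed" | awk '{ print $5 " " $1 }' | sort | uniq |
  while read bc ns; do
    # Export, dropping status, unique metadata, and the spec-side lastTriggeredImageID
    oc get "$bc" -n "$ns" -o json \
      | jq 'del(.status, .metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp, .metadata.generation, .metadata.managedFields)
            | .spec.triggers |= map(if .imageChange then (.imageChange |= del(.lastTriggeredImageID)) else . end)' \
      > "filtered-bc-$ns-${bc#*/}.json"
    # Delete and recreate from the filtered export
    oc delete "$bc" -n "$ns"
    oc apply -f "filtered-bc-$ns-${bc#*/}.json"
  done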
It seems that when this problem is detected, we should update the status. If we fail to do this, there will be an eventual point when we retire the spec-checking code, and BuildConfigs that have not been touched in a long time on upgraded clusters will start to exhibit this problem again. If done this way, that workaround code could be retired in 4.10 (because any 4.7 or older cluster would have to upgrade through a newish 4.9 z-stream to get to 4.10).
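To gauge how many BuildConfigs on a cluster are still in that untouched state (spec lastTriggeredImageID populated, no status-side imageChangeTriggers entries), a rough sketch; the field paths are my reading of the 4.8 API, not taken from this report:

$ oc get bc -A -o json | jq -r '
    .items[]
    | select([.spec.triggers[]? | .imageChange.lastTriggeredImageID // empty] | length > 0)
    | select((.status.imageChangeTriggers // []) | length == 0)
    | "\(.metadata.namespace)/\(.metadata.name)"'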
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056