Description of problem:

The tl;dr is in https://coreos.slack.com/archives/C014MHHKUSF/p1631629868053400

Adam, feel free to change the severity at your discretion, but it seems plausible users will hit this with upgrades to 4.8.

The high-level parameters of the scenario:
- a BuildConfig (BC) with an ImageChangeTrigger (ICT) is created in 4.7, and the spec lastTriggeredImageID field is updated
- the BC's ICT image does not change again until after an upgrade to 4.8, so only its spec is still updated
- the ICT then fires after the 4.8 upgrade

The details:
- On the OCM / trigger side, at 4.8, we only check the status to see whether we have already triggered: https://github.com/openshift/openshift-controller-manager/blob/release-4.8/pkg/image/trigger/buildconfigs/buildconfigs.go#L179-L186 ... so at 4.8, it attempts to trigger.
- On the API server side, we check both spec and status to see whether we have already triggered for a given image, and if so, error out. This still occurs in 4.10, 4.9, and 4.8 (in 4.9, we had a TODO to remove it in 4.10):
  4.10: https://github.com/openshift/openshift-apiserver/blob/release-4.10/pkg/build/apiserver/buildgenerator/generator.go#L374-L379
  4.9: https://github.com/openshift/openshift-apiserver/blob/release-4.9/pkg/build/apiserver/buildgenerator/generator.go#L374-L379
  4.8: https://github.com/openshift/openshift-apiserver/blob/release-4.8/pkg/build/apiserver/buildgenerator/generator.go#L374-L379

On the API server side, I believe we would remove the spec/req check on those lines for 4.9 and 4.10, as we no longer want to update the spec lastTriggeredImageID, and we want to update status regardless of what is in spec.

On the OCM side, I believe we need to check both spec and status for the foreseeable future, where if either matches, we continue/skip, as we may be dealing with a BC that was not touched since we went to 4.8 ... so that piece needs to go back all the way to 4.8.

Version-Release number of selected component (if applicable):
4.8, 4.9, 4.10

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
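For context, the two fields involved can be inspected directly on a live cluster; a sketch against a hypothetical bc/my-bc in namespace my-ns (the jsonpath expressions reflect my reading of the 4.8 BuildConfig API, they are not taken from this report):

# Spec-side (legacy) last-triggered image ID:
$ oc get bc/my-bc -n my-ns -o jsonpath='{.spec.triggers[*].imageChange.lastTriggeredImageID}{"\n"}'
# Status-side (new in 4.8) last-triggered image IDs:
$ oc get bc/my-bc -n my-ns -o jsonpath='{.status.imageChangeTriggers[*].lastTriggeredImageID}{"\n"}'
# A BC hits this bug when the spec-side field is set but the status side has no matching entry.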
verified
Workaround:

1. Identify the BuildConfigs that are failing to start a build from an ImageChange trigger:

   $ oc get events -A | grep "BuildConfigTriggerFailed" | awk '{ print $5 " -n " $1 }' | sort | uniq

2. For each affected BuildConfig:

   a. Export the BuildConfig spec in YAML format: `$ oc get bc/{name} -n {namespace} -o yaml > bc-{namespace}-{name}.yaml`
   b. Delete the BuildConfig: `$ oc delete bc/{name} -n {namespace}`
   c. In the exported BuildConfig YAML file, remove the status and the unique identifiers in the BuildConfig metadata (uid, creation timestamp, etc.), leaving the BuildConfig spec and pertinent metadata.
   d. Recreate the BuildConfig: `$ oc apply -f bc-{namespace}-{name}.yaml`
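For step c, if jq is available (an assumption on my part; it is not part of the original workaround), the cleanup can be scripted against a JSON export instead of hand-editing the YAML:

$ oc get bc/{name} -n {namespace} -o json \
    | jq 'del(.status, .metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp, .metadata.generation, .metadata.managedFields)' \
    > bc-{namespace}-{name}.json
# Recreate from the filtered export instead of the raw YAML:
$ oc apply -f bc-{namespace}-{name}.json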
Need a slight adjustment to c): you also need to delete the last triggered image ID in the spec; it is not filtered out during create: https://github.com/openshift/openshift-apiserver/blob/5e3c6847d4b5757c5fc7865b1d513a1d8d5eebb2/pkg/build/apiserver/registry/buildconfig/strategy.go
`cat bc-{namespace}-{name}.yaml | grep -v lastTriggeredImageID > filtered-bc-{namespace}-{name}.yaml` worked for me locally to achieve what I noted in comment #14.
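Putting the identification step, the export/filter, and the spec lastTriggeredImageID adjustment together, a one-shot sketch of the whole workaround (untested; assumes jq and the event output columns used above):

$ oc get events -A | grep "BuildConfigTriggerFailed" | awk '{ print $5 " " $1 }' | sort | uniq |
  while read bc ns; do
    # Export, dropping status, unique metadata, and the spec-side lastTriggeredImageID
    oc get "$bc" -n "$ns" -o json \
      | jq 'del(.status, .metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp, .metadata.generation, .metadata.managedFields)
            | .spec.triggers |= map(if .imageChange then (.imageChange |= del(.lastTriggeredImageID)) else . end)' \
      > "filtered-bc-$ns-${bc#*/}.json"
    # Delete and recreate from the filtered export
    oc delete "$bc" -n "$ns"
    oc apply -f "filtered-bc-$ns-${bc#*/}.json"
  done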
It seems that when this problem is detected, we should update the status. If we fail to do this, there will be an eventual point when we retire the spec-checking code, and BuildConfigs that have not been touched in a long time on upgraded clusters will start to exhibit this problem again. If done this way, that workaround code could be retired in 4.10 (because any 4.7 or older cluster would have to upgrade through a newish 4.9 z-stream to get to 4.10).
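To gauge how many BuildConfigs on a cluster are still in that untouched state (spec lastTriggeredImageID populated, no status-side imageChangeTriggers entries), a rough sketch; the field paths are my reading of the 4.8 API, not taken from this report:

$ oc get bc -A -o json | jq -r '
    .items[]
    | select([.spec.triggers[]? | .imageChange.lastTriggeredImageID // empty] | length > 0)
    | select((.status.imageChangeTriggers // []) | length == 0)
    | "\(.metadata.namespace)/\(.metadata.name)"'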
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056