Bug 1960455
Summary: Performance Addon Operator fails to install after catalog source becomes ready

| Product: | OpenShift Container Platform | Reporter: | Ian Miller <imiller> |
|---|---|---|---|
| Component: | OLM | Assignee: | Anik <anbhatta> |
| OLM sub component: | OLM | QA Contact: | xzha |
| Status: | CLOSED ERRATA | Docs Contact: | Padraig O'Grady <pogrady> |
| Severity: | high | | |
| Priority: | high | CC: | achernet, alukiano, anbhatta, aos-bugs, cchun, eparis, imiller, jokerman, keyoung, krizza, melserng, pogrady, sponnaga |
| Version: | 4.8 | Keywords: | AutomationBlocker, Triaged |
| Target Milestone: | --- | Flags: | anbhatta: needinfo+ |
| Target Release: | 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |

Doc Text:
Cause: In https://github.com/operator-framework/operator-lifecycle-manager/pull/2077, a new
Failed phase was introduced for InstallPlans, and failure to detect a valid
OperatorGroup (OG) or Service Account (SA) in the namespace the InstallPlan was
being created in would transition the InstallPlan to the Failed state, i.e.
failure to detect these resources when the InstallPlan was reconciled the first
time was considered a permanent failure. This is a regression from the previous
behavior of InstallPlans, where failure to detect the OG/SA would requeue the
InstallPlan for reconciliation, so creating the required resources before the
retry limit of the informer queue was reached would transition the InstallPlan
from the Installing phase to the Complete phase (unless the bundle unpacking
step failed, in which case #2093 introduced transitioning the InstallPlan to
the Failed phase).
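For illustration, a pre-fix InstallPlan that never found an OperatorGroup would end up in a terminal status along these lines. This is a reconstructed sketch based on the error messages reported in this bug, not captured output:
```
# Reconstructed sketch, not captured output: pre-fix terminal state
status:
  phase: Failed
  conditions:
  - type: Installed
    status: "False"
    reason: InstallCheckFailed
    message: invalid operator group - no operator group found that is managing this namespace
```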
Consequence: This regression introduced oddities for users who have infrastructure
that applies a set of manifests simultaneously to install an operator, i.e. a
Subscription to the operator (which creates InstallPlans) along with the required
OG/SA. In those cases, whenever there was a delay in the reconciliation of the
OG/SA, the InstallPlan was transitioned to a state of permanent failure.

Fix:
* Removes the logic that transitioned the InstallPlan to Failed. Instead, the
InstallPlan is again requeued on any reconciliation error.
* Introduces logic to bubble reconciliation errors up through the InstallPlan's
status.conditions, e.g.:
Result:
When no OperatorGroup is detected:
```
conditions:
- lastTransitionTime: "2021-06-23T18:16:00Z"
  lastUpdateTime: "2021-06-23T18:16:16Z"
  message: attenuated service account query failed - no operator group found that is managing this namespace
  reason: InstallCheckFailed
  status: "False"
  type: Installed
```
Then when a valid OperatorGroup is created:
```
conditions:
- lastTransitionTime: "2021-06-23T18:33:37Z"
  lastUpdateTime: "2021-06-23T18:33:37Z"
  status: "True"
  type: Installed
```
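To inspect these conditions on a live cluster, a minimal sketch; the namespace name is illustrative:
```
# Print each InstallPlan's Installed condition message in the namespace
oc get installplan -n my-namespace \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.conditions[?(@.type=="Installed")].message}{"\n"}{end}'
```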
| Story Points: | --- | | |
| Clone Of: | | | |
| | 1982249 1982250 (view as bug list) | Environment: | |
| Last Closed: | 2021-10-18 17:31:04 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1982249, 1982250 | | |
Description (Ian Miller, 2021-05-13 22:03:57 UTC)
We are still triaging this issue and will add more detail.

Ian,

> pushes CRs for the CatalogSource, Namespace, OperatorGroup, and Subscription. All CRs are pushed (approximately) simultaneously

That is most likely the issue here. If the catalog-operator sees an unreconciled OperatorGroup without a status, it flags it as an invalid OperatorGroup, and the InstallPlan fails. At the moment the InstallPlan does not retry while waiting for the OperatorGroup to reconcile successfully, so the immediate workaround is to create the CatalogSource and OperatorGroup in step 1, wait for the OperatorGroup to reconcile successfully, and then create the Subscription for your operator in step 2.
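A minimal sketch of that two-step ordering, assuming illustrative file and resource names, and assuming the OperatorGroup's status.lastUpdated field is populated once it has been reconciled:
```
# Step 1: create the CatalogSource and OperatorGroup first (file names illustrative)
oc apply -f catalogsource.yaml -f operatorgroup.yaml

# Poll until the OperatorGroup has a populated status, i.e. it has been reconciled
until [ -n "$(oc get operatorgroup my-og -n my-ns -o jsonpath='{.status.lastUpdated}' 2>/dev/null)" ]; do
  sleep 2
done

# Step 2: only then create the Subscription
oc apply -f subscription.yaml
```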
PS: the GitHub issue link that you shared returns a 404.

Adding NeedInfo to indicate we're waiting on feedback about the solution.
I saw the clear reproduce, when:
1. Create the PAO namespace
2. Create the PAO catalog source
3. Create the PAO subscription

and wait; OLM will create the install plan with the error message: "invalid operator group - no operator group found that is managing this namespace". Once you create the operator group nothing happens, so you will need to re-create the subscription. I am expecting that OLM will reconcile again once the operator group is created.

Hey Artyom,

Is there a reason you're not creating the OperatorGroup before creating the Subscription? All of our documents mention the OperatorGroup as a prerequisite. See this for example: https://olm.operatorframework.io/docs/tasks/install-operator-with-olm/#prerequisites

InstallPlans were designed to be single-execution resources that initiate an installation and live on as a book of record of that transaction. So what you're seeing is the expected behavior, i.e. the error message on the InstallPlan is a record of an attempt at installing the operator without any OperatorGroup present. When you create the OperatorGroup and re-create the Subscription, the new InstallPlan is then the record of your second attempt at installing the operator, this time with a valid OperatorGroup.

Under our deployments, it is often the case that all resources are created via "oc create -f <all_files_under_dir>", and it is possible to hit a race condition here (the install plan is created before the API server is updated with the operator group; see the sketch after this thread). I understand that the behavior is documented, but IMHO it is not very user-friendly and can be a source of errors during deployments via the CLI.

*** Bug 1972925 has been marked as a duplicate of this bug. ***

Kevin,

What are the next steps here? This will definitely impact our customers using ACM.

/KenY

Ken,

We've discussed a possible solution where the creation of the install plan is blocked until a valid OperatorGroup is detected. We should have a PR up implementing that solution soon.
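A minimal sketch of the race described above, with illustrative file names: everything is applied in one shot, so the Subscription's InstallPlan can be reconciled before the OperatorGroup exists:
```
# Apply all manifests at once, as in "oc create -f <all_files_under_dir>"
oc create -f pao-namespace.yaml -f pao-catalogsource.yaml -f pao-operatorgroup.yaml -f pao-subscription.yaml
# If the Subscription's InstallPlan is reconciled before the OperatorGroup is,
# pre-fix OLM marked it Failed ("no operator group found that is managing this
# namespace"); with the fix it is requeued and eventually completes.
```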
https://github.com/operator-framework/operator-lifecycle-manager/pull/2215 - Upstream pull request

verify:

1, install cluster

[root@preserve-olm-agent-test 1960455]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-07-15-015134   True        False         107m    Cluster version is 4.9.0-0.nightly-2021-07-15-015134
[root@preserve-olm-agent-test 1960455]# oc adm release info registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-07-15-015134 --commits|grep operator-lifecycle-manager
  operator-lifecycle-manager   https://github.com/openshift/operator-framework-olm   8740cee32bc0973361238df1ae8af3f87f7d6588

2, install catsrc, sub

[root@preserve-olm-agent-test 1960455]# cat catsrc.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: ditto-operator-index
  namespace: openshift-marketplace
spec:
  displayName: Test
  publisher: OLM-QE
  sourceType: grpc
  image: quay.io/olmqe/ditto-index:v1-4.8-xzha
  updateStrategy:
    registryPoll:
      interval: 10m
[root@preserve-olm-agent-test 1960455]# cat sub.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: ditto-operator
  namespace: test-1
spec:
  channel: "alpha"
  installPlanApproval: Automatic
  name: ditto-operator
  source: ditto-operator-index
  sourceNamespace: openshift-marketplace
[root@preserve-olm-agent-test 1960455]# oc apply -f catsrc.yaml
catalogsource.operators.coreos.com/ditto-operator-index created
[root@preserve-olm-agent-test 1960455]# oc new-project test-1
[root@preserve-olm-agent-test 1960455]# oc apply -f sub.yaml
subscription.operators.coreos.com/ditto-operator created
[root@preserve-olm-agent-test 1960455]# oc get ip -o yaml
apiVersion: v1
items:
- apiVersion: operators.coreos.com/v1alpha1
  kind: InstallPlan
  metadata:
    creationTimestamp: "2021-07-16T03:29:55Z"
    generateName: install-
    generation: 1
    labels:
      operators.coreos.com/ditto-operator.test-1: ""
    name: install-hk2qb
    namespace: test-1
    ownerReferences:
    - apiVersion: operators.coreos.com/v1alpha1
      blockOwnerDeletion: false
      controller: false
      kind: Subscription
      name: ditto-operator
      uid: 12bfa98e-eea5-4835-9479-e70f28cad301
    resourceVersion: "76969"
    uid: affdd1bc-1759-4cb9-9e37-9c197e577418
  spec:
    approval: Automatic
    approved: true
    clusterServiceVersionNames:
    - ditto-operator.v0.1.1
    generation: 1
  status:
    bundleLookups:
    - catalogSourceRef:
        name: ditto-operator-index
        namespace: openshift-marketplace
      conditions:
      - message: bundle contents have not yet been persisted to installplan status
        reason: BundleNotUnpacked
        status: "True"
        type: BundleLookupNotPersisted
      - message: unpack job not yet started
        reason: JobNotStarted
        status: "True"
        type: BundleLookupPending
      identifier: ditto-operator.v0.1.1
      path: quay.io/olmqe/ditto-operator:0.1.1
      properties: '{"properties":[{"type":"olm.gvk","value":{"group":"iot.eclipse.org","kind":"Ditto","version":"v1alpha1"}},{"type":"olm.package","value":{"packageName":"ditto-operator","version":"0.1.1"}}]}'
      replaces: ditto-operator.v0.1.0
    catalogSources: []
    conditions:
    - lastTransitionTime: "2021-07-16T03:29:55Z"
      lastUpdateTime: "2021-07-16T03:30:14Z"
      message: no operator group found that is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
    phase: Installing
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

3) install og

[root@preserve-olm-agent-test 1960455]# oc apply -f og.yaml
operatorgroup.operators.coreos.com/og-single created

check ip/csv

[root@preserve-olm-agent-test 1960455]# oc get ip -o yaml
apiVersion: v1
items:
- apiVersion: operators.coreos.com/v1alpha1
  kind: InstallPlan
  ....
    conditions:
    - lastTransitionTime: "2021-07-16T03:31:09Z"
      lastUpdateTime: "2021-07-16T03:31:09Z"
      status: "True"
      type: Installed
    phase: Complete
  .....
[root@preserve-olm-agent-test 1960455]# oc get csv
NAME                    DISPLAY         VERSION   REPLACES                PHASE
ditto-operator.v0.1.1   Eclipse Ditto   0.1.1     ditto-operator.v0.1.0   Succeeded

4) check event

[root@preserve-olm-agent-test 1960455]# oc get events --sort-by='.lastTimestamp'
LAST SEEN   TYPE     REASON                OBJECT                                        MESSAGE
7m29s       Normal   Scheduled             pod/ditto-operator-75df74ff55-mqbpr           Successfully assigned test-1/ditto-operator-75df74ff55-mqbpr to ip-10-0-179-31.us-east-2.compute.internal
9m19s       Normal   CreatedSCCRanges      namespace/test-1                              created SCC ranges
7m30s       Normal   RequirementsUnknown   clusterserviceversion/ditto-operator.v0.1.1   requirements not yet checked
7m30s       Normal   InstallWaiting        clusterserviceversion/ditto-operator.v0.1.1   installing: waiting for deployment ditto-operator to become ready: deployment "ditto-operator" not available: Deployment does not have minimum availability.
7m30s       Normal   InstallWaiting        clusterserviceversion/ditto-operator.v0.1.1   installing: waiting for deployment ditto-operator to become ready: waiting for spec update of deployment "ditto-operator" to be observed...
7m30s       Normal   InstallSucceeded      clusterserviceversion/ditto-operator.v0.1.1   waiting for install components to report healthy
7m30s       Normal   SuccessfulCreate      replicaset/ditto-operator-75df74ff55          Created pod: ditto-operator-75df74ff55-mqbpr
7m30s       Normal   AllRequirementsMet    clusterserviceversion/ditto-operator.v0.1.1   all requirements found, attempting install
7m30s       Normal   ScalingReplicaSet     deployment/ditto-operator                     Scaled up replica set ditto-operator-75df74ff55 to 1
7m27s       Normal   AddedInterface        pod/ditto-operator-75df74ff55-mqbpr           Add eth0 [10.129.2.20/23] from openshift-sdn
7m27s       Normal   Pulling               pod/ditto-operator-75df74ff55-mqbpr           Pulling image "docker.io/ctron/ditto-operator:0.1.1"
7m21s       Normal   Started               pod/ditto-operator-75df74ff55-mqbpr           Started container ditto-operator
7m21s       Normal   Created               pod/ditto-operator-75df74ff55-mqbpr           Created container ditto-operator
7m21s       Normal   Pulled                pod/ditto-operator-75df74ff55-mqbpr           Successfully pulled image "docker.io/ctron/ditto-operator:0.1.1" in 6.441539021s
7m20s       Normal   InstallSucceeded      clusterserviceversion/ditto-operator.v0.1.1   install strategy completed with no errors

LGTM, verified.
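To script the same check instead of reading the output by eye, a hedged sketch that polls the CSV phase, using the resource names from the verification above (newer oc clients may also support an `oc wait --for=jsonpath=...` form, hence the plain polling loop here):
```
# Poll until the CSV from the verification above reports Succeeded
until [ "$(oc get csv ditto-operator.v0.1.1 -n test-1 -o jsonpath='{.status.phase}' 2>/dev/null)" = "Succeeded" ]; do
  sleep 5
done
echo "ditto-operator install verified"
```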
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759