Bug 1856714
Summary: Rescued tasks cause ansible operator to re-run reconciliation loop
Product: OpenShift Container Platform
Reporter: Omer Yahud <oyahud>
Component: Operator SDK
Assignee: Fabian von Feilitzsch <fabian>
Status: CLOSED CURRENTRELEASE
QA Contact: Cuiping HUO <chuo>
Severity: medium
Priority: medium
Version: 4.5
CC: aos-bugs, dageoffr, jesusr, jfan, jiazha, marobrie
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: ManagedFields were being processed during reconciliation.
Consequence: Potential for infinite loops during reconciles for some resources.
Fix: Patched operator-lib to ignore ManagedFields during reconciliation.
Result: More consistent reconcile loops.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-03-23 20:37:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
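The Doc Text above attributes the loop to ManagedFields being processed during reconciliation: every status write makes the apiserver update `metadata.managedFields` bookkeeping, so a controller that includes that field when comparing the old and new object always perceives a change. A minimal, stdlib-only Go sketch of that comparison, assuming the watched object is held as a plain nested map (the function names here are illustrative, not the actual operator-lib API):

```go
package main

import (
	"fmt"
	"reflect"
)

// stripManagedFields returns a copy of obj with metadata.managedFields
// removed, so server-side bookkeeping churn does not register as a change.
// (Sketch of the operator-lib fix idea; not the real implementation.)
func stripManagedFields(obj map[string]interface{}) map[string]interface{} {
	out := map[string]interface{}{}
	for k, v := range obj {
		out[k] = v
	}
	if meta, ok := out["metadata"].(map[string]interface{}); ok {
		m := map[string]interface{}{}
		for k, v := range meta {
			if k != "managedFields" {
				m[k] = v
			}
		}
		out["metadata"] = m
	}
	return out
}

// changed reports whether an update event should trigger reconciliation.
func changed(oldObj, newObj map[string]interface{}, ignoreManagedFields bool) bool {
	if ignoreManagedFields {
		oldObj = stripManagedFields(oldObj)
		newObj = stripManagedFields(newObj)
	}
	return !reflect.DeepEqual(oldObj, newObj)
}

func main() {
	oldObj := map[string]interface{}{
		"spec":     map[string]interface{}{"size": 3},
		"metadata": map[string]interface{}{"managedFields": "time: 07:39:07Z"},
	}
	// A status write made the apiserver bump managedFields, but nothing
	// the operator cares about changed.
	newObj := map[string]interface{}{
		"spec":     map[string]interface{}{"size": 3},
		"metadata": map[string]interface{}{"managedFields": "time: 07:39:11Z"},
	}
	fmt.Println("naive comparison requeues:", changed(oldObj, newObj, false))
	fmt.Println("ignoring managedFields requeues:", changed(oldObj, newObj, true))
}
```

With the naive comparison the update is always "changed" and the reconcile requeues; with managedFields stripped the objects compare equal and the loop settles.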
Description
Omer Yahud
2020-07-14 09:35:14 UTC
Good catch. Currently we skip errors when they were dropped with `ignore_errors`, but we neglected to handle the `rescue` case. If the information on whether a task was rescued is present in the Ansible event data, this shouldn't be too difficult. If that information is not present in the event data, we may have to do some extra work to determine whether we've failed or not.

The current handling of ignored errors is here: https://github.com/operator-framework/operator-sdk/blob/master/internal/ansible/controller/reconcile.go#L189

Hopefully we just need to add an extra check to the event struct, like `event.Rescued()`, assuming that the information is stored in the event. If the information isn't in the event, we may need to do something a little cleverer at the end of the run to pull the failure out.

Fixed this problem at https://github.com/operator-framework/operator-sdk/pull/3650

Version:
- OCP:
[root@preserve-olm-env memcached-operator]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-18-002612   True        False         72m     Cluster version is 4.6.0-0.nightly-2020-09-18-002612
[root@preserve-olm-env memcached-operator]# oc exec catalog-operator-9ff69c6cf-h8dhz -n openshift-operator-lifecycle-manager -- olm --version
OLM version: 0.16.1
git commit: 1dd1b464f0c72893561e40881f827e3d41d6c934

- operator-sdk:
[root@preserve-olm-env memcached-operator]# operator-sdk version
operator-sdk version: "v1.0.0", commit: "d7d5e0cd6cf5468bb66e0849f08fda5bf557f4fa", kubernetes version: "v1.18.2", go version: "go1.13.11 linux/amd64", GOOS: "linux", GOARCH: "amd64"

Steps to test:

1. Create an ansible operator and API:
[root@preserve-olm-env 1856714]# mkdir memcached-operator
[root@preserve-olm-env memcached-operator]# cd memcached-operator
[root@preserve-olm-env memcached-operator]# operator-sdk init --plugins=ansible --domain example.com
[root@preserve-olm-env memcached-operator]# operator-sdk create api --group cache --version v1alpha1 --kind Memcached --generate-role --generate-playbook

2. Write a playbook with a failed task in a `block` and `rescue` the failure:
[root@preserve-olm-env memcached-operator]# cat playbooks/memcached.yml
---
- name: Rescue example
  hosts: localhost
  tasks:
    - name: handle err
      block:
        - name: i force a failure
          command: /bin/false
        - debug:
            msg: 'I never execute, due to the above task failing, :-('
      rescue:
        - debug:
            msg: 'I caught an error, can do stuff here to fix it, :-)'

3. Modify the manager. Update roles/memcached/tasks/main.yml:
---
- name: start memcached
  community.kubernetes.k8s:
    definition:
      kind: Deployment
      apiVersion: apps/v1
      metadata:
        name: '{{ ansible_operator_meta.name }}-memcached'
        namespace: '{{ ansible_operator_meta.namespace }}'
      spec:
        replicas: "{{size}}"
        selector:
          matchLabels:
            app: memcached
        template:
          metadata:
            labels:
              app: memcached
          spec:
            containers:
            - name: memcached
              command:
              - memcached
              - -m=64
              - -o
              - modern
              - -v
              image: "docker.io/memcached:1.4.36-alpine"
              ports:
                - containerPort: 11211

4. Build and push the image:
[root@preserve-olm-env memcached-operator]# make docker-build docker-push IMG=quay.io/yuhui12/memcached-operator:1856714
docker build . -t quay.io/yuhui12/memcached-operator:1856714
Sending build context to Docker daemon  88.06kB
Step 1/6 : FROM quay.io/operator-framework/ansible-operator:v1.0.0
 ---> be357ef40597
Step 2/6 : COPY requirements.yml ${HOME}/requirements.yml
 ---> Using cache
 ---> eec87f0cf43c
Step 3/6 : RUN ansible-galaxy collection install -r ${HOME}/requirements.yml && chmod -R ug+rwx ${HOME}/.ansible
 ---> Using cache
 ---> c85e954f9333
Step 4/6 : COPY watches.yaml ${HOME}/watches.yaml
 ---> e1777e0a22c0
Step 5/6 : COPY roles/ ${HOME}/roles/
 ---> b89e31d93be0
Step 6/6 : COPY playbooks/ ${HOME}/playbooks/
 ---> 99576f535fc1
Successfully built 99576f535fc1
Successfully tagged quay.io/yuhui12/memcached-operator:1856714
docker push quay.io/yuhui12/memcached-operator:1856714
The push refers to repository [quay.io/yuhui12/memcached-operator]
82b2fdc5ff4e: Pushed
20596896b321: Pushed
12be6e0cbf4d: Pushed
458ddba9e00d: Layer already exists
333d56c9ef9f: Layer already exists
39773a0668bf: Layer already exists
5e0763826339: Layer already exists
2cf9b6efd4b5: Layer already exists
1cee13bc64bc: Layer already exists
ea99055d6b5f: Layer already exists
70056249a0e2: Layer already exists
226bfaae015f: Layer already exists
1856714: digest: sha256:287605c22be42ee11f25221ceb22b93aa9296c958979e88e310a2504890e5dee size: 2821

5. Deploy the operator on the OCP 4.6 cluster:
[root@preserve-olm-env memcached-operator]# make install
/usr/local/go/bin/kustomize build config/crd | kubectl apply -f -
customresourcedefinition.apiextensions.k8s.io/memcacheds.cache.example.com created
[root@preserve-olm-env memcached-operator]# export IMG=quay.io/yuhui12/memcached-operator:1856714
[root@preserve-olm-env memcached-operator]# make deploy
cd config/manager && /usr/local/go/bin/kustomize edit set image controller=quay.io/yuhui12/memcached-operator:1856714
/usr/local/go/bin/kustomize build config/default | kubectl apply -f -
namespace/memcached-operator-system created
customresourcedefinition.apiextensions.k8s.io/memcacheds.cache.example.com unchanged
role.rbac.authorization.k8s.io/memcached-operator-leader-election-role created
clusterrole.rbac.authorization.k8s.io/memcached-operator-manager-role created
clusterrole.rbac.authorization.k8s.io/memcached-operator-proxy-role created
clusterrole.rbac.authorization.k8s.io/memcached-operator-metrics-reader created
rolebinding.rbac.authorization.k8s.io/memcached-operator-leader-election-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/memcached-operator-manager-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/memcached-operator-proxy-rolebinding created
service/memcached-operator-controller-manager-metrics-service created
deployment.apps/memcached-operator-controller-manager created
[root@preserve-olm-env memcached-operator]# kubectl config set-context --current --namespace=memcached-operator-system
Context "admin" modified.
[root@preserve-olm-env memcached-operator]# kubectl get deployment
NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
memcached-operator-controller-manager   1/1     1            1           42s

6. Create the Memcached resource:
[root@preserve-olm-env memcached-operator]# kubectl apply -f config/samples/cache_v1alpha1_memcached.yaml
memcached.cache.example.com/memcached-sample created

7. Check the summary for the playbook. It shows "rescued=1".
[root@preserve-olm-env memcached-operator]# oc logs memcached-operator-controller-manager-5cf755594-ktkxt
...
--------------------------- Ansible Task StdOut -------------------------------
TASK [Gathering Facts] *********************************************************
task path: /opt/ansible/playbooks/memcached.yml:2
-------------------------------------------------------------------------------
{"level":"info","ts":1600414352.3363469,"logger":"logging_event_handler","msg":"[playbook task]","name":"memcached-sample","namespace":"memcached-operator-system","gvk":"cache.example.com/v1alpha1, Kind=Memcached","event_type":"playbook_on_task_start","job":"4037200794235010051","EventData.Name":"Gathering Facts"}
--------------------------- Ansible Task StdOut -------------------------------
{"level":"info","ts":1600414353.3462336,"logger":"logging_event_handler","msg":"[playbook task]","name":"memcached-sample","namespace":"memcached-operator-system","gvk":"cache.example.com/v1alpha1, Kind=Memcached","event_type":"playbook_on_task_start","job":"4037200794235010051","EventData.Name":"i force a failure"}
TASK [i force a failure] *******************************************************
task path: /opt/ansible/playbooks/memcached.yml:7
-------------------------------------------------------------------------------
--------------------------- Ansible Task StdOut -------------------------------
TASK [i force a failure] ********************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["/bin/false"], "delta": "0:00:00.002419", "end": "2020-09-18 07:32:33.862425", "msg": "non-zero return code", "rc": 1, "start": "2020-09-18 07:32:33.860006", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
-------------------------------------------------------------------------------
{"level":"error","ts":1600414353.8915808,"logger":"logging_event_handler","msg":"","name":"memcached-sample","namespace":"memcached-operator-system","gvk":"cache.example.com/v1alpha1, Kind=Memcached","event_type":"runner_on_failed","job":"4037200794235010051","EventData.Task":"i force a failure","EventData.TaskArgs":"","EventData.FailedTaskPath":"/opt/ansible/playbooks/memcached.yml:7","error":"[playbook task failed]","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tpkg/mod/github.com/go-logr/zapr.0/zapr.go:128\ngithub.com/operator-framework/operator-sdk/internal/ansible/events.loggingEventHandler.Handle\n\tsrc/github.com/operator-framework/operator-sdk/internal/ansible/events/log_events.go:87"}
--------------------------- Ansible Task StdOut -------------------------------
TASK [debug] ********************************
ok: [localhost] => {
    "msg": "I caught an error, can do stuff here to fix it, :-)"
}
-------------------------------------------------------------------------------
{"level":"info","ts":1600414353.917217,"logger":"logging_event_handler","msg":"[playbook debug]","name":"memcached-sample","namespace":"memcached-operator-system","gvk":"cache.example.com/v1alpha1, Kind=Memcached","event_type":"runner_on_ok","job":"4037200794235010051","EventData.TaskArgs":""}
{"level":"info","ts":1600414354.1043732,"logger":"runner","msg":"Ansible-runner exited successfully","job":"4037200794235010051","name":"memcached-sample","namespace":"memcached-operator-system"}
--------------------------- Ansible Task Status Event StdOut -----------------
PLAY RECAP *********************************************************************
localhost                  : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=1    ignored=0

8. Check the Memcached resource status:
[root@preserve-olm-env memcached-operator]# oc describe Memcached memcached-sample
Name:         memcached-sample
Namespace:    memcached-operator-system
Labels:       <none>
Annotations:  <none>
API Version:  cache.example.com/v1alpha1
Kind:         Memcached
Metadata:
  Creation Timestamp:  2020-09-18T07:32:29Z
  Generation:          1
  Managed Fields:
    API Version:  cache.example.com/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:size:
    Manager:      kubectl
    Operation:    Update
    Time:         2020-09-18T07:32:29Z
    API Version:  cache.example.com/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:conditions:
    Manager:         ansible-operator
    Operation:       Update
    Time:            2020-09-18T07:39:11Z
  Resource Version:  103935
  Self Link:         /apis/cache.example.com/v1alpha1/namespaces/memcached-operator-system/memcacheds/memcached-sample
  UID:               fbf8b22d-08c8-453c-b579-d6f45122b954
Spec:
  Size:  3
Status:
  Conditions:
    Last Transition Time:  2020-09-18T07:39:07Z
    Message:               Running reconciliation
    Reason:                Running
    Status:                False
    Type:                  Running
    Ansible Result:
      Changed:     0
      Completion:  2020-09-18T07:39:11.165935
      Failures:    0
      Ok:          2
      Skipped:     0
    Last Transition Time:  2020-09-18T07:39:11Z
    Message:               non-zero return code
    Reason:                Failed
    Status:                True
    Type:                  Failure
Events:  <none>

The Memcached resource status is Failure.

9. Check the reconcile frequency:
[root@preserve-olm-env memcached-operator]# oc logs memcached-operator-controller-manager-5cf755594-ktkxt manager | grep "Reconciler error" -c
19
[root@preserve-olm-env 1868712]# oc -n memcached-operator-system logs memcached-operator-controller-manager-5cf755594-ktkxt manager | grep "Reconciler error" -c
27

The playbook is being rerun indefinitely. Assigned the bug on OCP 4.6.

Moving this to 4.7 as this is not an OCP release blocker. Team will continue to review and evaluate further.

I am investigating this.
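The description above suggests an `event.Rescued()`-style check, and the PLAY RECAP in step 7 shows where the information lives: the intermediate `runner_on_failed` event fires, but the final recap reports failed=0, rescued=1. A minimal Go sketch of that idea, using a trimmed-down event type (the struct and function names here are illustrative, not the SDK's actual types):

```go
package main

import "fmt"

// event is a trimmed-down ansible-runner job event.
type event struct {
	Type  string         // e.g. "runner_on_failed", "playbook_on_stats"
	Stats map[string]int // recap counters, only set on playbook_on_stats
}

// runFailed walks the event stream. A runner_on_failed event alone is not
// conclusive: if the final stats report failed=0 (the task was rescued),
// the run is treated as successful. (Sketch of the fix idea from the
// comment above; not the real reconcile.go logic.)
func runFailed(events []event) bool {
	sawFailure := false
	for _, e := range events {
		switch e.Type {
		case "runner_on_failed":
			sawFailure = true
		case "playbook_on_stats":
			// The recap has the final word on whether failures remain.
			return e.Stats["failed"] > 0
		}
	}
	return sawFailure
}

func main() {
	// Mirrors the log above: one failed task, rescued in the same block.
	run := []event{
		{Type: "runner_on_failed"},
		{Type: "playbook_on_stats", Stats: map[string]int{"ok": 2, "failed": 0, "rescued": 1}},
	}
	fmt.Println("run failed:", runFailed(run))
}
```

The key design point is deferring the verdict to the final stats event instead of failing eagerly on the first `runner_on_failed`, which is what caused the rescued run to be retried indefinitely.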
This was fixed in operator-lib (https://github.com/operator-framework/operator-lib/pull/59) and pulled into the downstream operator-sdk 1.7.2-ocp (and 1.8.0-ocp).
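The operator-lib fix linked above can be illustrated end to end with a toy model: each reconcile writes status, which in a real cluster bumps the `metadata.managedFields` timestamp, so a comparison that includes managedFields perceives a change on every pass and requeues forever, while the patched comparison converges immediately. A stdlib-only simulation (all names and the iteration budget are hypothetical):

```go
package main

import "fmt"

// reconcileLoop simulates reconcile passes until the controller perceives
// no change, up to max iterations. Each pass's status write has the side
// effect of changing the managedFields timestamp.
func reconcileLoop(ignoreManagedFields bool, max int) int {
	lastSeen := map[string]string{"spec": "size=3", "managedFields": "t0"}
	for i := 1; i <= max; i++ {
		// Side effect of the previous status update: new timestamp.
		current := map[string]string{"spec": "size=3", "managedFields": fmt.Sprintf("t%d", i)}
		if ignoreManagedFields {
			// The operator-lib fix: drop managedFields before comparing.
			delete(current, "managedFields")
			delete(lastSeen, "managedFields")
		}
		if current["spec"] == lastSeen["spec"] && current["managedFields"] == lastSeen["managedFields"] {
			return i // converged: no perceived change, stop requeueing
		}
		lastSeen = current
	}
	return max // never converged within the budget
}

func main() {
	fmt.Println("passes with managedFields considered:", reconcileLoop(false, 50))
	fmt.Println("passes with managedFields ignored:", reconcileLoop(true, 50))
}
```

With managedFields considered, the loop exhausts its budget (matching the ever-growing "Reconciler error" counts in step 9); with them ignored, it settles on the first pass.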