Description of problem:
There is no easy way to rerun a Kubernetes job. As discussed in bug 1632852, there are times when the job terminates with a failure and does not run again. In that sort of scenario the job has to be recreated in order for it to run again. This should be automated through the installer.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
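The manual workaround described above (recreate the job so it runs again) can be sketched as a small shell snippet. The job and namespace names are taken from this bug; the `recreate_job` helper and the `$OC` dry-run override are illustrative, not part of the fix:

```shell
#!/bin/sh
# Sketch: re-running a Kubernetes Job by hand. A Job cannot be restarted in
# place (its pod template is immutable, so "oc apply" fails), so the
# workaround is save spec -> delete -> recreate.
OC=${OC:-oc}   # set OC="echo oc" for a dry run without a cluster

recreate_job() {
  job=$1; ns=$2; spec="/tmp/${job}.yaml"
  $OC -n "$ns" get job "$job" -o yaml > "$spec"   # save the current spec
  $OC -n "$ns" delete job "$job"                  # immutable fields: delete first
  $OC -n "$ns" create -f "$spec"                  # recreate so it runs again
}

# Example (requires a cluster; uncomment to run):
# recreate_job hawkular-metrics-schema openshift-infra
```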
PR https://github.com/openshift/openshift-ansible/pull/10340
Fix is available in openshift-ansible-3.11.23-1
Tested with openshift-ansible-3.11.23-1.

Scenario 1:
1. Deploy metrics 3.11 and make sure all the pods reach Running status.
2. Delete the hawkular-metrics-schema job and run playbooks/openshift-metrics/schema.yml. The hawkular-metrics-schema job is created, and the hawkular-metrics-schema pod is created.

Scenario 2:
1. Without deploying metrics 3.11, run playbooks/openshift-metrics/schema.yml directly. Although the hawkular-metrics-schema pod stays in ContainerCreating status because the secrets "hawkular-metrics-account" and "hawkular-metrics-certs" have not been created (they are used by the hawkular-metrics pod), the goal of only installing and running the schema installer job is achieved.

# oc -n openshift-infra get job
NAME                      DESIRED   SUCCESSFUL   AGE
hawkular-metrics-schema   1         0            26m

# oc -n openshift-infra get pod
NAME                            READY     STATUS              RESTARTS   AGE
hawkular-metrics-schema-jtms6   0/1       ContainerCreating   0          26m

# oc -n openshift-infra describe pod hawkular-metrics-schema-jtms6
Events:
  Type     Reason       Age                 From                                   Message
  ----     ------       ----                ----                                   -------
  Normal   Scheduled    27m                 default-scheduler                      Successfully assigned openshift-infra/hawkular-metrics-schema-jtms6 to ip-172-18-10-45.ec2.internal
  Warning  FailedMount  17m (x13 over 27m)  kubelet, ip-172-18-10-45.ec2.internal  MountVolume.SetUp failed for volume "hawkular-metrics-certs" : secrets "hawkular-metrics-certs" not found
  Warning  FailedMount  7m (x9 over 25m)    kubelet, ip-172-18-10-45.ec2.internal  Unable to mount volumes for pod "hawkular-metrics-schema-jtms6_openshift-infra(3cff463f-d1e5-11e8-97db-0e7631322b02)": timeout expired waiting for volumes to attach or mount for pod "openshift-infra"/"hawkular-metrics-schema-jtms6". list of unmounted volumes=[hawkular-metrics-certs hawkular-metrics-account]. list of unattached volumes=[hawkular-metrics-certs hawkular-metrics-account default-token-fz7sx]
  Warning  FailedMount  1m (x21 over 27m)   kubelet, ip-172-18-10-45.ec2.internal  MountVolume.SetUp failed for volume "hawkular-metrics-account" : secrets "hawkular-metrics-account" not found

@Vadim Do you think the scenarios are enough to close this defect?
(In reply to Junqi Zhao from comment #3)
> Tested with openshift-ansible-3.11.23-1
> scenario 1:
> 1. Deploy metrics 3.11 and make sure all the pods could be in running status
> 2. Delete hawkular-metrics-schema job, and run
> playbooks/openshift-metrics/schema.yml, hawkular-metrics-schema job could be
> created, and hawkular-metrics-schema pod could be created.

This looks correct. Initially I saw this with Origin, where the image for the schema job was missing and this playbook was necessary to restore the Metrics install.

@Ruben, any ideas how to corrupt the schema so that the job would fix it and QE could verify the schema is restored?

> scenario 2:
> 1. Don't deploy metrics 3.11, run playbooks/openshift-metrics/schema.yml
> directly. Although hawkular-metrics-schema pod is in ContainerCreating
> status due to secrets "hawkular-metrics-account" and
> "hawkular-metrics-certs" are not created (they are used for hawkular-metrics
> pod), purpose to only installing and running the schema installer job is
> achieved.

I don't think it's valid. The schema job playbook should not be used if metrics are not deployed.
(In reply to Vadim Rutkovsky from comment #4)
> @Ruben, any ideas how to corrupt the schema so that the job would fix it and
> QE could verify schema is restored?

Whether or not the schema is corrupted, the hawkular-metrics-schema job is deleted first, then a new hawkular-metrics-schema job is created:

roles/openshift_metrics/tasks/run_schema_job.yaml
- include_tasks: install_hawkular_schema_job.yaml

roles/openshift_metrics/tasks/install_hawkular_schema_job.yaml
---
- name: list installed jobs
  command: >
    {{ openshift_client_binary }} -n {{ openshift_metrics_project }}
    --config={{ mktemp.stdout }}/admin.kubeconfig
    get jobs
  register: jobs

# We cannot use oc apply here because the Job template has immutable fields
# on which oc apply will fail.
- name: remove hawkular-metrics-schema job
  command: >
    {{ openshift_client_binary }} -n {{ openshift_metrics_project }}
    --config={{ mktemp.stdout }}/admin.kubeconfig
    delete job hawkular-metrics-schema
  register: delete_schema_job
  when: "'hawkular-metrics-schema' in jobs.stdout"

- name: generate hawkular-metrics schema job
  template:
    src: hawkular_metrics_schema_job.j2
    dest: "{{ mktemp.stdout }}/templates/hawkular_metrics_schema_job.yaml"
  changed_when: false

So there is no need to test with a corrupted hawkular-metrics-schema job.
Tested with openshift-ansible-3.11.23-1; the issue is fixed.

Steps: scale down the cassandra and hawkular-metrics rc, and after a while scale them back up. There are errors in the hawkular-metrics pod: "The schema version check failed". Then run the playbooks/openshift-metrics/schema.yml playbook; all pods return to running well.

# oc -n openshift-infra get pod
NAME                            READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-x97wh      1/1       Running     0          12m
hawkular-metrics-kwjxz          1/1       Running     1          12m
hawkular-metrics-schema-vf8gj   0/1       Completed   0          5m
heapster-htc2p                  1/1       Running     0          3h
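A minimal sketch of the reproduction steps above, assuming rc names matching the pod listing (hawkular-cassandra-1, hawkular-metrics); the `cycle_rc` helper and the `$OC` dry-run override are illustrative:

```shell
#!/bin/sh
# Sketch of the reproduction: scale the Cassandra and Hawkular Metrics
# replication controllers down and back up, which can leave hawkular-metrics
# failing the schema version check until the schema playbook is re-run.
OC=${OC:-oc}   # set OC="echo oc" for a dry run without a cluster
NS=openshift-infra

cycle_rc() {
  rc=$1
  $OC -n "$NS" scale rc "$rc" --replicas=0   # scale down
  $OC -n "$NS" scale rc "$rc" --replicas=1   # scale back up after a while
}

# Example (requires a cluster; uncomment to run):
# cycle_rc hawkular-cassandra-1
# cycle_rc hawkular-metrics
# then: ansible-playbook playbooks/openshift-metrics/schema.yml
```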
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0024