Description of problem:
After a successful rollout of deploymentconfig.apps.openshift.io/logging-es-data-master-*, the pod is marked as Unhealthy in the output of oc describe.

Version-Release number of selected component (if applicable):
ovirt-engine-metrics-1.2.1-0.0.master.20190220121053.el7.noarch (patchset 48)

How reproducible:
100 %

Steps to Reproduce:
1. Run the install_okd.yml playbook.
2. It fails on the task "Rolling out new pod(s) for {{ _es_node }}".
3. SSH to the OpenShift master node and cancel the current rollout (example commands sketched below).
4. Initiate a new rollout. If you removed all causes of the original failure (e.g. insufficient memory), this rollout should succeed.
5. Run `oc get pods -n openshift-logging -l component=es` and make sure that both containers of the pod are running.
6. Make sure there is no problem in the elasticsearch container's log: `oc logs $(oc get pods -n openshift-logging -l component=es -o name) -c elasticsearch`
7. Now describe the ES pod: `oc describe $(oc get pods -n openshift-logging -l component=es -o name) | tail -n 20`

Actual results:
http://pastebin.test.redhat.com/720256

Expected results:
The pod should pass its readiness probe.
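The rollout handling in steps 3-4 is not spelled out above; a minimal sketch with oc, assuming the DeploymentConfig name follows the logging-es-data-master-* pattern mentioned in the description (the exact name must be taken from `oc get dc`):

# Find the exact ES DeploymentConfig name:
oc get dc -n openshift-logging

# Cancel the stuck rollout (replace <dc-name> with the name found above):
oc rollout cancel dc/<dc-name> -n openshift-logging

# After fixing the original cause (e.g. adding memory), start a new rollout:
oc rollout latest dc/<dc-name> -n openshift-logging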
Moving to ASSIGNED. There is no link to a patch that fixes it, no manual steps, and no workaround. Basically, nothing changed between the NEW and ON_QA statuses.
Current workaround that worked for me:

1) On the successfully deployed bastion VM, get this patch: https://github.com/openshift/openshift-ansible/pull/11220/files
   Note: Since I couldn't apply the patch with yum, I simply fetched the changed files and overwrote the existing ones.

2) Run the metrics-store VM deployment ansible playbook install_okd (it should finish with no errors).
   Note: The metrics-store VM must not already exist, or the playbook will fail.

3) Once the metrics-store VM is created and OpenShift is installed, check the pods, services and routes for errors (see the check commands sketched below). The ES pod (the one without the -deploy suffix) should be running but show as unhealthy when you describe it. If the problem is that '/opt/app-root/src/init_failures' does not exist, open a shell in the pod and create the file:
   oc rsh ES_POD_NAME
   # once inside the pod shell, run:
   touch /opt/app-root/src/init_failures
   chmod 777 /opt/app-root/src/init_failures

4) Leave the pod shell with the exit command and check whether the pod has become healthy (the events disappear; the Events: section should show <none>).
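For step 3, the checks can look roughly like this (a sketch; the openshift-logging namespace and component=es label are taken from the reproduction steps above and may differ in other setups):

# Pods, services and routes of the logging stack:
oc get pods,svc,routes -n openshift-logging

# Describe the ES pod (not the -deploy pod) and look at the Events section:
oc describe $(oc get pods -n openshift-logging -l component=es -o name) | tail -n 20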
Update on the workaround: Do step 3 without the in-pod commands (and without connecting to the pod shell at all). Instead, run this loop over the ES pods, per https://docs.openshift.com/container-platform/3.11/install_config/aggregate_logging.html#troubleshooting-related-to-elasticsearch

for p in $(oc get pods -l component=es -o jsonpath={.items[*].metadata.name}); do \
  oc exec -c elasticsearch $p -- touch /opt/app-root/src/init_failures; \
done

It looks like it takes a lot of time to be applied. Do step 4 without the in-pod commands.
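Since the change takes a while to propagate, the readiness state can be re-checked with the same commands used in the reproduction steps (repeat until READY shows 2/2 and the Events section shows <none>):

oc get pods -n openshift-logging -l component=es
oc describe $(oc get pods -n openshift-logging -l component=es -o name) | tail -n 20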
Fixed in current patch.
4.3.1 has been released; please re-target this bug as soon as possible.
Steps to Reproduce:
1. Run the install_okd.yml playbook.
2. It fails on the task "Rolling out new pod(s) for {{ _es_node }}".
3. SSH to the OpenShift master node and cancel the current rollout.
4. Initiate a new rollout. If you removed all causes of the original failure (e.g. insufficient memory), this rollout should succeed.
5. Run `oc get pods -n openshift-logging -l component=es` and make sure that both containers of the pod are running.
6. Make sure there is no problem in the elasticsearch container's log: `oc logs $(oc get pods -n openshift-logging -l component=es -o name) -c elasticsearch`
7. Now describe the ES pod.

Result: The ES pod is ready and healthy, with no errors in the pod's events.

Verified in:
ovirt-engine-4.2.8.5-0.1.el7ev.noarch
ovirt-engine-metrics-1.2.1.3-1.el7ev.noarch

Also verified in:
ovirt-engine-4.3.3.1-0.1.el7.noarch
ovirt-engine-metrics-1.2.1.3-1.el7ev.noarch
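As an additional check during verification, the Ready condition can be read directly from the pod status; this jsonpath query is a sketch of my own and was not part of the original verification steps:

# Prints "True" for each ES pod whose readiness probe passes:
oc get pods -n openshift-logging -l component=es \
  -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}'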
This bugzilla is included in the oVirt 4.3.3 release, published on April 16th 2019. Since the problem described in this bug report should be resolved in the oVirt 4.3.3 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.