Description of problem: elasticsearch-{delete|rollover}-* pods (from cronjobs) hang on curl -s https://elasticsearch:9200/audit/_settings/index.creation_date Version-Release number of selected component (if applicable): registry.redhat.io/openshift4/ose-elasticsearch-operator@sha256:3ec62b62cfe3a47f9798e05ecce2bae104e4d1a9d4ca57fe16471ada0e32227a How reproducible: Unknown. Happening intermittently on OSD clusters. Steps to Reproduce: ? Actual results: Some pods are Running for a really long time. Like: `oc get pod -l component=indexManagement | grep Running` shows really long durations. Logs are empty. Expected results: These jobs should take well under an hour to complete. Additional info: Debugging an elasticsearch-delete-audit pod, the `delete` script is hanging on the following `curl`:curl -s https://elasticsearch:9200/audit/_settings/index.creation_date --cacert /etc/indexmanagement/keys/admin-ca '-HAuthorization: Bearer {redacted}' -HContent-Type:application/json
Looked at an elasticsearch-rollover-app pod and it's hanging here: curl -s 'https://elasticsearch:9200/app-write/_rollover?pretty' -w '%{response_code}' --cacert /etc/indexmanagement/keys/admin-ca -HContent-Type:application/json -XPOST '-HAuthorization: Bearer {redacted}' -o /tmp/response.txt -d '{"conditions":{"max_age":"8h","max_docs":122880000,"max_size":"120gb"}}'
Hi, I am one of the SREPs working with Eric during APAC hours. Here is what I found: It is worth noting that the cluster which has this issue does not have the recent changes (from - https://github.com/openshift/elasticsearch-operator/pull/477). Therefore I grabbed the latest delete script with the try/catch blocks and ran it to see if we get anything useful. But it basically hangs silently and indefinitely. Trying to run the delete script a few times I also found that it also hangs here (https://github.com/openshift/elasticsearch-operator/blob/8d1d59fcbbf8031f3d5dbbaa8a9eb17a0c1184f8/pkg/indexmanagement/scripts.go#L25) : curl -s 'https://elasticsearch:9200/audit-*/_alias/audit-write' --cacert {redacted} '-HAuthorization: Bearer {redacted} -HContent-Type:application/json Looks like all curls hang (indefinitely without a timeout) on the indexmanagement cronjobs.
Tested with elasticsearch-operator.4.7.0-202011030448.p0, the option `--connect-timeout ${CONNECT_TIMEOUT}` has been added to the curl commands in the rollover and delete scripts.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0652