1881709 – elasticsearch-delete and elasticsearch-rollover pods hanging

Summary: elasticsearch-delete and elasticsearch-rollover pods hanging

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Logging
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	low
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Jeff Cantrill
QA Contact:	Qiaoling Tang
Docs Contact:
URL:
Whiteboard:	osd-45-logging, logging-exploration
Depends On:
Blocks:	1887567

Reported:	2020-09-22 22:37 UTC by Eric Fried
Modified:	2022-05-09 12:57 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Enhancement
Doc Text:	[discrete]
[id="ocp-4-7-curl-conn-timeout"]
// https://bugzilla.redhat.com/show_bug.cgi?id=1881709
==== New connection timeout for deletion jobs
The current release adds a connection timeout for deletion jobs. This helps prevent pods from occasionally hanging when they query Elasticsearch to delete indices. Now, if the underlying 'curl' call does not connect before the timeout period elapses, the timeout terminates the call.
Clone Of:
Environment:
Last Closed:	2021-02-24 11:21:18 UTC
Target Upstream Version:
Embargoed:
Flags:	jcantril: needinfo-

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift elasticsearch-operator pull 512	0	None	closed	Bug 1881709: Add connect timeout for IM jobs	2021-02-09 00:15:03 UTC
Red Hat Product Errata	RHBA-2021:0652	0	None	None	None	2021-02-24 11:21:52 UTC

Description Eric Fried 2020-09-22 22:37:34 UTC

Description of problem:

elasticsearch-{delete|rollover}-* pods (from cronjobs) hang on curl -s https://elasticsearch:9200/audit/_settings/index.creation_date

Version-Release number of selected component (if applicable):

registry.redhat.io/openshift4/ose-elasticsearch-operator@sha256:3ec62b62cfe3a47f9798e05ecce2bae104e4d1a9d4ca57fe16471ada0e32227a

How reproducible:
Unknown. Happening intermittently on OSD clusters.

Steps to Reproduce:
?

Actual results:
Some pods stay in the Running state for a very long time. For example:
`oc get pod -l component=indexManagement | grep Running`
shows very long durations.
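One way to make "very long" concrete is to compare each pod's age against a limit. The sketch below is only an illustration: `overdue_pods` and `LIMIT_SECONDS` are hypothetical names, and the one-hour default is taken from the expected results in this report rather than from any operator setting. It assumes the pod names and start times have already been converted to "name epoch_seconds" lines.

```shell
#!/bin/sh
# Sketch: flag indexManagement pods that have been running longer than a limit.
# LIMIT_SECONDS is a hypothetical threshold (default: one hour).
LIMIT_SECONDS="${LIMIT_SECONDS:-3600}"

# overdue_pods NOW_EPOCH
# Reads "name start_epoch" lines on stdin and prints "name <age>s" for every
# pod whose age (NOW_EPOCH - start_epoch) exceeds LIMIT_SECONDS.
overdue_pods() {
  now="$1"
  while read -r name start; do
    [ -n "$name" ] || continue
    age=$((now - start))
    if [ "$age" -gt "$LIMIT_SECONDS" ]; then
      echo "$name ${age}s"
    fi
  done
}

# Possible wiring (needs cluster access, so it is left as a comment):
#   oc get pod -l component=indexManagement --field-selector=status.phase=Running \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.startTime}{"\n"}{end}'
# This prints RFC 3339 start times, which would have to be converted to epoch
# seconds (for example with GNU `date -d "$ts" +%s`) before being piped into
# overdue_pods "$(date +%s)".
```

The function is kept separate from the `oc` call so the age check can be tried on sample input without a cluster.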

Logs are empty.

Expected results:
These jobs should take well under an hour to complete.

Additional info:
Debugging an elasticsearch-delete-audit pod, the `delete` script is hanging on the following `curl`:

curl -s https://elasticsearch:9200/audit/_settings/index.creation_date --cacert /etc/indexmanagement/keys/admin-ca '-HAuthorization: Bearer {redacted}' -HContent-Type:application/json

Comment 1 Eric Fried 2020-09-22 22:45:41 UTC

Looked at an elasticsearch-rollover-app pod and it's hanging here:

curl -s 'https://elasticsearch:9200/app-write/_rollover?pretty' -w '%{response_code}' --cacert /etc/indexmanagement/keys/admin-ca -HContent-Type:application/json -XPOST '-HAuthorization: Bearer {redacted}' -o /tmp/response.txt -d '{"conditions":{"max_age":"8h","max_docs":122880000,"max_size":"120gb"}}'

Comment 2 Karthik Perumal 2020-09-23 02:06:58 UTC

Hi,

I am one of the SREPs working with Eric during APAC hours. Here is what I found:

It is worth noting that the cluster which has this issue does not have the recent changes (from https://github.com/openshift/elasticsearch-operator/pull/477). Therefore, I grabbed the latest delete script with the try/catch blocks and ran it to see if we get anything useful. But it basically hangs silently and indefinitely.

Running the delete script a few times, I found that it also hangs here (https://github.com/openshift/elasticsearch-operator/blob/8d1d59fcbbf8031f3d5dbbaa8a9eb17a0c1184f8/pkg/indexmanagement/scripts.go#L25):

curl -s 'https://elasticsearch:9200/audit-*/_alias/audit-write' --cacert {redacted} '-HAuthorization: Bearer {redacted}' -HContent-Type:application/json

Looks like all curls hang (indefinitely without a timeout) on the indexmanagement cronjobs.

Comment 4 Qiaoling Tang 2020-11-04 07:57:07 UTC

Tested with elasticsearch-operator.4.7.0-202011030448.p0: the option `--connect-timeout ${CONNECT_TIMEOUT}` has been added to the curl commands in the rollover and delete scripts.
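For readers who want to see the shape of such a call, here is a minimal sketch of a curl wrapper with a bounded connection phase. It is not the operator's actual script: `es_curl`, `explain_curl_exit`, `ES_SERVICE`, `CA_FILE`, `TOKEN`, and the 30-second default for `CONNECT_TIMEOUT` are all assumptions made for illustration. Only the `--connect-timeout ${CONNECT_TIMEOUT}` option, the service URL, and the CA path come from this report.

```shell
#!/bin/sh
# Sketch only. The defaults below are illustrative assumptions, not values
# taken from the operator's scripts.
CONNECT_TIMEOUT="${CONNECT_TIMEOUT:-30}"
ES_SERVICE="${ES_SERVICE:-https://elasticsearch:9200}"
CA_FILE="${CA_FILE:-/etc/indexmanagement/keys/admin-ca}"

# es_curl PATH [extra curl args...]
# Calls Elasticsearch with a limit on how long curl may spend connecting.
# Any extra arguments are passed through to curl unchanged.
es_curl() {
  path="$1"
  shift
  curl -s --connect-timeout "${CONNECT_TIMEOUT}" \
    --cacert "${CA_FILE}" \
    -H "Authorization: Bearer ${TOKEN:-}" \
    -H "Content-Type: application/json" \
    "$@" \
    "${ES_SERVICE}/${path}"
}

# explain_curl_exit CODE
# Maps a few documented curl exit codes to a short message for the job log.
explain_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    6)  echo "could not resolve host" ;;
    7)  echo "failed to connect" ;;
    28) echo "timed out" ;;
    *)  echo "curl failed with exit code $1" ;;
  esac
}

# Example use (needs a reachable Elasticsearch service):
#   es_curl 'audit/_settings/index.creation_date' || explain_curl_exit $?
```

Note that `--connect-timeout` limits only the connection phase. If a call could also stall after the connection is established, curl's `--max-time` option bounds the whole transfer; that option is mentioned here as a possibility and is not part of the fix described in this bug.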

Comment 10 errata-xmlrpc 2021-02-24 11:21:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Errata Advisory for OpenShift Logging 5.0.0), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652
