1889371 – Rollover job fails with null pointer exception

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Logging
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	low
Target Milestone:	---
Target Release:	4.7.z
Assignee:	Jeff Cantrill
QA Contact:	Anping Li
Docs Contact:
URL:
Whiteboard:	logging-exploration
Depends On:
Blocks:

Reported:	2020-10-19 13:43 UTC by tmicheli
Modified:	2024-03-25 16:45 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-03-26 15:37:47 UTC
Target Upstream Version:
Embargoed:

Attachments
Screenshot with failed rollover/delete jobs. (40.25 KB, image/png) 2020-10-20 07:27 UTC, Andreas Furbach
rollover-audit log with the NPE (213 bytes, application/octet-stream) 2020-10-20 07:28 UTC, Andreas Furbach

Description tmicheli 2020-10-19 13:43:30 UTC

Description of problem:
The rollover job for the infra logs fails with a null pointer exception, and the executing pod goes into an error state. The issue occurs when upgrading from 4.5.11 to 4.5.13.

Version-Release number of selected component (if applicable):
4.5.13

How reproducible:
Upgrade from OpenShift Container Platform (OCP) 4.5.11 to 4.5.13.

Actual results:
Log output from the elasticsearch-rollover-infra-xxx pod:

~~~
  "error" : {
    "root_cause" : [
      {
        "type" : "null_pointer_exception",
        "reason" : null
      }
    ],
    "type" : "null_pointer_exception",
    "reason" : null
  },
  "status" : 500
~~~

Expected results:
* No null pointer exception

Additional info:

Comment 4 Jeff Cantrill 2020-10-19 16:09:18 UTC

Can you confirm if this is repeatable?  Does it go away on a subsequent run of the deletion job?

Comment 5 Andreas Furbach 2020-10-20 07:24:01 UTC

(In reply to Jeff Cantrill from comment #4)
> Can you confirm if this is repeatable?  Does it go away on a subsequent run
> of the deletion job?

I still see errors in the OCP summary (see screenshot). I checked the logs: they either do not exist, are empty, or show exactly the same content (NPE).

Comment 6 Andreas Furbach 2020-10-20 07:27:57 UTC

Created attachment 1722784
Screenshot with failed rollover/delete jobs.

Comment 7 Andreas Furbach 2020-10-20 07:28:51 UTC

Created attachment 1722785
rollover-audit log with the NPE

Comment 11 Hui Kang 2021-01-11 22:05:30 UTC

Hi Andreas,
Could you paste the CRL CR's YAML file? I'd like to try reproducing it in a 4.5.13 cluster. Thanks.

Comment 12 ewolinet 2021-03-26 15:37:47 UTC

Closing since the linked customer cases are closed and we mitigate this by retrying in the cronjobs and with logic that verifies that the rollover indices aren't left in a bad state when this happens.

To fix the actual NPE that this stems from, we would need to bump Elasticsearch up to 6.8.6. That would also require its plugins, as well as Kibana and its plugins, to be bumped to the same version, and it is unclear when we will do this.
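
For illustration only, the retry mitigation mentioned in this comment might look roughly like the Python sketch below. This is not the code used by the cronjobs. The names `is_npe_response` and `rollover_with_retry`, the `do_rollover` callable, and the attempt count and delay are all hypothetical. The sketch assumes the rollover call returns an HTTP status and a JSON body shaped like the error in the description.

```python
import json
import time
from typing import Callable, Tuple


def is_npe_response(status: int, body: str) -> bool:
    """Return True if the response looks like the 500 null_pointer_exception in this bug."""
    if status != 500:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    error = payload.get("error") if isinstance(payload, dict) else None
    return isinstance(error, dict) and error.get("type") == "null_pointer_exception"


def rollover_with_retry(
    do_rollover: Callable[[], Tuple[int, str]],  # hypothetical: performs one rollover request
    attempts: int = 3,                           # assumed value, not taken from the bug
    delay_seconds: float = 5.0,                  # assumed value, not taken from the bug
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call do_rollover until it stops returning the NPE or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    status, body = 0, ""
    for attempt in range(1, attempts + 1):
        status, body = do_rollover()
        if not is_npe_response(status, body):
            # Success, or a different error that a blind retry should not hide.
            return status, body
        if attempt < attempts:
            sleep(delay_seconds)
    # Every attempt hit the NPE; return the last response so the caller can fail the job.
    return status, body
```

Only the specific NPE response is retried here; any other result is returned immediately, so unrelated failures still surface. The separate check that the rollover indices are not in a bad state is not covered by this sketch.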
