Bug 1889371 - Rollover job fails with null pointer exception
Summary: Rollover job fails with null pointer exception
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.7.z
Assignee: Jeff Cantrill
QA Contact: Anping Li
Whiteboard: logging-exploration
Reported: 2020-10-19 13:43 UTC by tmicheli
Modified: 2024-03-25 16:45 UTC (History)

Last Closed: 2021-03-26 15:37:47 UTC


Attachments
Screenshot with failed rollover/delete jobs. (40.25 KB, image/png)
2020-10-20 07:27 UTC, Andreas Furbach
rollover-audit log with the NPE (213 bytes, application/octet-stream)
2020-10-20 07:28 UTC, Andreas Furbach

Description tmicheli 2020-10-19 13:43:30 UTC
Description of problem:
The rollover job for the infra logs fails with a null pointer exception, and the executing pod goes into an error state. The issue occurs when upgrading from 4.5.11 to 4.5.13.

Version-Release number of selected component (if applicable):
4.5.13

How reproducible:
Upgrade from OpenShift Container Platform (OCP) 4.5.11 to 4.5.13.

Actual results:
Log output from the elasticsearch-rollover-infra-xxx pod:

~~~
  "error" : {
    "root_cause" : [
      {
        "type" : "null_pointer_exception",
        "reason" : null
      }
    ],
    "type" : "null_pointer_exception",
    "reason" : null
  },
  "status" : 500
~~~

Expected results:
* No null pointer exception

Additional info:

Comment 4 Jeff Cantrill 2020-10-19 16:09:18 UTC
Can you confirm if this is repeatable?  Does it go away on a subsequent run of the deletion job?

Comment 5 Andreas Furbach 2020-10-20 07:24:01 UTC
(In reply to Jeff Cantrill from comment #4)
> Can you confirm if this is repeatable?  Does it go away on a subsequent run
> of the deletion job?

I still see errors in the OCP summary (see screenshot). I checked the logs: they are either missing or empty, or they show exactly the same content (NPE).

Comment 6 Andreas Furbach 2020-10-20 07:27:57 UTC
Created attachment 1722784
Screenshot with failed rollover/delete jobs.

Comment 7 Andreas Furbach 2020-10-20 07:28:51 UTC
Created attachment 1722785
rollover-audit log with the NPE

Comment 11 Hui Kang 2021-01-11 22:05:30 UTC
Hi Andreas,
Could you paste the CRL CR's YAML file? I'd like to try reproducing it in a 4.5.13 cluster. Thanks.

Comment 12 ewolinet 2021-03-26 15:37:47 UTC
Closing, since the linked customer cases are closed and we mitigate this by retrying in the cronjobs and with logic that verifies the rollover indices aren't in a bad state when this happens.
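
For illustration only, the retry side of that mitigation could look roughly like the Python sketch below. This is not the actual cronjob script: `is_npe`, `rollover_with_retry`, and the `call` parameter are hypothetical names, and `call` stands in for whatever issues the rollover request and returns the HTTP status plus the parsed JSON body.

```python
import time
from typing import Callable, Tuple

# (HTTP status, parsed JSON body) as returned by the hypothetical `call`.
Response = Tuple[int, dict]


def is_npe(status: int, body: dict) -> bool:
    """Return True if the response matches the 500 null_pointer_exception in this bug."""
    error = body.get("error")
    if not isinstance(error, dict):
        return False
    return status == 500 and error.get("type") == "null_pointer_exception"


def rollover_with_retry(
    call: Callable[[], Response],
    attempts: int = 3,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Invoke `call` until it stops returning the NPE or `attempts` is exhausted.

    Assumption: retrying the same request is safe. The last response is
    returned either way, so the caller decides whether the job fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        status, body = call()
        if not is_npe(status, body):
            return status, body
        if attempt < attempts:
            sleep(delay_seconds)  # fixed pause before the next try
    return status, body
```

Only the specific NPE response is retried; any other error is returned immediately so that unrelated failures are not hidden by the retry loop.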

To fix the actual NPE that this stems from, we would need to bump Elasticsearch to 6.8.6. That would also require bumping its plugins, as well as Kibana and its plugins, to the same version, and it is unclear when we will do that.

