Bug 1889371

Summary: Rollover job fails with null pointer exception
Product: OpenShift Container Platform
Reporter: tmicheli
Component: Logging
Assignee: Jeff Cantrill <jcantril>
Status: CLOSED CURRENTRELEASE
QA Contact: Anping Li <anli>
Severity: low
Docs Contact:
Priority: unspecified
Version: 4.5
CC: afurbach, aos-bugs, dkulkarn, ewolinet, hkang, jcantril, periklis
Target Milestone: ---
Target Release: 4.7.z
Hardware: Unspecified
OS: Unspecified
Whiteboard: logging-exploration
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-03-26 15:37:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Screenshot with failed rollover/delete jobs. (Flags: none)
rollover-audit log with the NPE (Flags: none)

Description tmicheli 2020-10-19 13:43:30 UTC
Description of problem:
The rollover job for the infra logs fails with a null pointer exception, and the executing pod goes into an error state. The issue occurs when upgrading from 4.5.11 to 4.5.13.

Version-Release number of selected component (if applicable):
4.5.13

How reproducible:
Upgrade from OpenShift Container Platform (OCP) 4.5.11 to 4.5.13.

Actual results:
Log output from the elasticsearch-rollover-infra-xxx pod:

~~~
  "error" : {
    "root_cause" : [
      {
        "type" : "null_pointer_exception",
        "reason" : null
      }
    ],
    "type" : "null_pointer_exception",
    "reason" : null
  },
  "status" : 500
~~~

Expected results:
* No null pointer exception

Additional info:

Comment 4 Jeff Cantrill 2020-10-19 16:09:18 UTC
Can you confirm if this is repeatable?  Does it go away on a subsequent run of the deletion job?

Comment 5 Andreas Furbach 2020-10-20 07:24:01 UTC
(In reply to Jeff Cantrill from comment #4)
> Can you confirm if this is repeatable?  Does it go away on a subsequent run
> of the deletion job?

I still see errors in the OCP summary; see the screenshot. I checked the logs: they either do not exist, are empty, or show exactly the same content (NPE).

Comment 6 Andreas Furbach 2020-10-20 07:27:57 UTC
Created attachment 1722784
Screenshot with failed rollover/delete jobs.

Comment 7 Andreas Furbach 2020-10-20 07:28:51 UTC
Created attachment 1722785
rollover-audit log with the NPE

Comment 11 Hui Kang 2021-01-11 22:05:30 UTC
Hi Andreas,
Could you paste the CRL CR's YAML file? I'd like to try reproducing it in a 4.5.13 cluster. Thanks.

Comment 12 ewolinet 2021-03-26 15:37:47 UTC
Closing, since the linked customer cases are closed and we mitigate this by retrying in the cronjobs and by verifying that the rollover indices aren't in a bad state when this happens.

To fix the actual NPE that this stems from, we would need to bump Elasticsearch up to 6.8.6. That would also require bumping its plugins, as well as Kibana and its plugins, to the same version, and it is unclear when we will do this.
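
For readers who want a feel for the retry mitigation mentioned in comment 12, here is a minimal Python sketch. It is not the actual cronjob code: `is_npe`, `rollover_with_retry`, and the `do_rollover` callable (assumed to return a `(status, body)` pair) are hypothetical names, and the sketch only assumes the error body has the shape shown under "Actual results". The check that the rollover indices aren't in a bad state is not covered here.

```python
import time
from typing import Callable, Tuple

Response = Tuple[int, dict]


def is_npe(body: dict) -> bool:
    """Return True if an error body reports a null_pointer_exception.

    Assumes the body looks like the log output in this bug:
    {"error": {"root_cause": [{"type": ...}], "type": ...}, "status": 500}
    """
    error = body.get("error") or {}
    causes = error.get("root_cause") or []
    return error.get("type") == "null_pointer_exception" or any(
        cause.get("type") == "null_pointer_exception" for cause in causes
    )


def rollover_with_retry(
    do_rollover: Callable[[], Response],  # hypothetical: performs one rollover call
    attempts: int = 3,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Retry the rollover only while the response is the NPE seen in this bug."""
    last: Response = (0, {})
    for attempt in range(1, attempts + 1):
        last = do_rollover()
        _status, body = last
        if not is_npe(body):
            # Either success or a different failure; return it unchanged
            # so that unrelated errors are not hidden by blind retries.
            return last
        if attempt < attempts:
            sleep(delay_seconds)
    # Every attempt hit the NPE; hand the last response back to the caller.
    return last
```

Retrying only on the specific NPE response keeps the workaround narrow: any other error surfaces on the first attempt instead of being masked.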