Bug 1889371

Summary:

Rollover job fails with null pointer exception

Product:

OpenShift Container Platform

Reporter:

tmicheli

Component:

Logging

Assignee:

Jeff Cantrill <jcantril>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Anping Li <anli>

Severity:

low

Docs Contact:

Priority:

unspecified

Version:

4.5

CC:

afurbach, aos-bugs, dkulkarn, ewolinet, hkang, jcantril, periklis

Target Milestone:

---

Target Release:

4.7.z

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

logging-exploration

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2021-03-26 15:37:47 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
Screenshot with failed rollover/delete jobs.	none
rollover-audit log with the NPE	none

Description tmicheli 2020-10-19 13:43:30 UTC

Description of problem:
The rollover job for the infra logs fails with a null pointer exception, and the executing pod goes into an error state. The issue occurs when upgrading from 4.5.11 to 4.5.13.

Version-Release number of selected component (if applicable):
4.5.13

How reproducible:
Upgrade from OpenShift Container Platform (OCP) 4.5.11 to 4.5.13.

Actual results:
Log output from the elasticsearch-rollover-infra-xxx pod:

~~~
  "error" : {
    "root_cause" : [
      {
        "type" : "null_pointer_exception",
        "reason" : null
      }
    ],
    "type" : "null_pointer_exception",
    "reason" : null
  },
  "status" : 500
~~~
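For triage, a response body like the one above can be checked for this specific failure programmatically. The following is only a minimal sketch: the function name `is_null_pointer_error` is hypothetical, and it assumes the complete response is a JSON object (the excerpt above omits the enclosing braces, which are added in the sample here).

```python
import json


def is_null_pointer_error(response_text: str) -> bool:
    """Return True if the body reports a top-level null_pointer_exception.

    Hypothetical helper for illustration; not part of any OpenShift tooling.
    """
    try:
        body = json.loads(response_text)
    except json.JSONDecodeError:
        # Empty or non-JSON log output is not treated as an NPE response.
        return False
    if not isinstance(body, dict):
        return False
    error = body.get("error")
    if not isinstance(error, dict):
        return False
    return error.get("type") == "null_pointer_exception"


# Sample mirroring the excerpt above, wrapped in braces (an assumption).
SAMPLE = """
{
  "error" : {
    "root_cause" : [
      { "type" : "null_pointer_exception", "reason" : null }
    ],
    "type" : "null_pointer_exception",
    "reason" : null
  },
  "status" : 500
}
"""

if __name__ == "__main__":
    print(is_null_pointer_error(SAMPLE))
```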

Expected results:
* No null pointer exception

Additional info:

Comment 4 Jeff Cantrill 2020-10-19 16:09:18 UTC

Can you confirm if this is repeatable?  Does it go away on a subsequent run of the deletion job?

Comment 5 Andreas Furbach 2020-10-20 07:24:01 UTC

(In reply to Jeff Cantrill from comment #4)
> Can you confirm if this is repeatable?  Does it go away on a subsequent run
> of the deletion job?

I still see errors in the OCP summary; see the screenshot. I checked the logs: they are either missing or empty, or they show exactly the same content (NPE).

Comment 6 Andreas Furbach 2020-10-20 07:27:57 UTC

Created attachment 1722784
Screenshot with failed rollover/delete jobs.

Comment 7 Andreas Furbach 2020-10-20 07:28:51 UTC

Created attachment 1722785
rollover-audit log with the NPE

Comment 11 Hui Kang 2021-01-11 22:05:30 UTC

Hi, Andreas
Could you paste the CRL CR's yaml file? I'd like to try reproducing it in a 4.5.13 cluster. Thanks.

Comment 12 ewolinet 2021-03-26 15:37:47 UTC

Closing, since the linked customer cases are closed and we mitigate this by retrying in the cronjobs and with logic that verifies the rollover indices aren't in a bad state when this happens.

To fix the actual NPE that this stems from, we would need to bump Elasticsearch up to 6.8.6. That would also require Elasticsearch's plugins, Kibana, and Kibana's plugins to be bumped to the same version, and it is unclear when we will do this.
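The retry mitigation described in this comment can be illustrated roughly as follows. This is a sketch only and not the code that runs in the cronjobs: `rollover_with_retry` and `attempt_rollover` are hypothetical names, and the caller is assumed to supply a function that performs one rollover request and returns the HTTP status and the parsed JSON body.

```python
import time
from typing import Callable, Tuple


def _is_npe(status: int, body: object) -> bool:
    """True when the response is a 500 carrying a null_pointer_exception."""
    if status != 500 or not isinstance(body, dict):
        return False
    error = body.get("error")
    return isinstance(error, dict) and error.get("type") == "null_pointer_exception"


def rollover_with_retry(
    attempt_rollover: Callable[[], Tuple[int, dict]],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Tuple[int, dict]:
    """Call attempt_rollover until it stops returning the NPE or attempts run out.

    attempt_rollover is a hypothetical caller-supplied function; the last
    response is returned so the caller can decide whether the job failed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, body = attempt_rollover()
    for _ in range(max_attempts - 1):
        if not _is_npe(status, body):
            break
        time.sleep(delay_seconds)  # pause before the next attempt
        status, body = attempt_rollover()
    return status, body


def make_flaky_rollover(failures: int) -> Callable[[], Tuple[int, dict]]:
    """Test double: fails with the NPE `failures` times, then succeeds."""
    state = {"calls": 0}

    def attempt() -> Tuple[int, dict]:
        state["calls"] += 1
        if state["calls"] <= failures:
            return 500, {"error": {"type": "null_pointer_exception", "reason": None}}
        return 200, {"acknowledged": True}

    return attempt


if __name__ == "__main__":
    print(rollover_with_retry(make_flaky_rollover(2), max_attempts=3))
```

Only the NPE response is retried in this sketch; any other error is returned immediately so that unrelated failures are not masked.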