Bug 1416210
| Summary: | Elasticsearch throws IndexAlreadyExistsException During Aggregated Logging Upgrade | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Nick Schuetz <nschuetz> |
| Component: | Logging | Assignee: | Jeff Cantrill <jcantril> |
| Status: | CLOSED ERRATA | QA Contact: | Xia Zhao <xiazhao> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.4.0 | CC: | aos-bugs, ewolinet, jokerman, juzhao, kurktchiev, mmccomas, nschuetz, rmeggins, rromerom, xiazhao |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-03-23 09:21:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1420684 | | |
| Bug Blocks: | | | |

Doc Text:

Cause: The openshift-elasticsearch-plugin always attempted to create the index.

Consequence: The plugin could produce a stack trace if the index already existed. This could occur when the node was drained and the pod was restarted with the same name; the pod name is used as part of the index name, causing the error.

Fix: Make the plugin check for index existence before trying to seed the index.

Result: The plugin determines that the index exists and no longer produces a stack trace.
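The decision logic described in the Doc Text above (check for index existence before seeding) can be illustrated against the Elasticsearch REST API. This is a hypothetical shell sketch, not the plugin's actual Java code; the index name and certificate paths are examples taken from this report.

```shell
# Hypothetical sketch of the fix's logic. The real change lives in the
# openshift-elasticsearch-plugin's Java code; here the decision is modeled
# on the HTTP status of a HEAD request against the index (200 = exists,
# 404 = missing).
decide_action() {
  # $1 = HTTP status code from a HEAD request on the per-pod index
  if [ "$1" = "200" ]; then
    echo "skip-create"      # index already exists: do not re-create or re-seed
  else
    echo "create-and-seed"  # index missing: create it, then seed the ACL document
  fi
}

# In a live cluster the status would come from something like:
#   curl -s -o /dev/null -w '%{http_code}' --head \
#     --cacert /etc/elasticsearch/secret/admin-ca \
#     --cert /etc/elasticsearch/secret/admin-cert \
#     --key /etc/elasticsearch/secret/admin-key \
#     "https://localhost:9200/searchguard-<podname>"
decide_action 404   # prints: create-and-seed
```

Before the fix, the plugin unconditionally took the "create-and-seed" path, which is what produced the IndexAlreadyExistsException stack trace when the pod came back with the same name.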
Description
Nick Schuetz
2017-01-24 21:25:17 UTC
Correction: Use the following documentation to reproduce: https://docs.openshift.com/container-platform/3.4/install_config/upgrading/manual_upgrades.html#manual-upgrading-efk-logging-stack

@Xia Is this something you can take on and verify or reproduce?

@Jeff In my test I used a hostPath PV for ES storage and didn't encounter the exception. I will retry with an NFS PV, as the bug reporter mentioned.

@Xia, Looking at the code, I think the only way to get into this state is if there is some sort of abrupt termination between the time the index is created and the time it is actually seeded with the ACL. It seems the only way to force it into this state would be to stand up the ES instance and then, after cycling it down, manually delete the ACL document. I'm going to see if I can reproduce it that way and create a fix.

@Xia, My attempts to reproduce were unsuccessful. I think, however, the following will guard against future issues: https://github.com/fabric8io/openshift-elasticsearch-plugin/pull/59

@Jeff, Thanks for the info. I would like to know how to manually delete the ACL document; would you please list the steps? Also, are these the recommended steps for reproducing the original issue?

1. Install logging 3.3.1, bound to persistent storage, and populate logs for a user-project index.
2. Scale the ES cluster down to replicas=0 and make sure ES shuts down gracefully.
3. Manually delete the ACL document inside ES (how?).
4. Upgrade the logging stacks to 3.4.1 (the ES cluster will come up after the upgrade).
5. Check the old log entries for the user project in the Kibana UI.
6. Search for IndexAlreadyExistsException in the upgrade logs.

I'm actually not sure whether step 4 should come before steps 2 and 3.

Xia, I exec'd into the ES container while it was running and performed the delete using curl, passing it the certs.
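Pieced together from the comments, the manual ACL-document deletion might look like the sketch below. Two details are assumptions, not confirmed in this report: the Elasticsearch REST API normally listens on port 9200 (9300 is the transport port), and a document delete takes the standard `/<index>/<type>/<id>` form, so `/sg/0` rather than `sg?id=0`.

```shell
# Hypothetical reconstruction of the ACL-document delete (not verified against
# a live cluster). Assumes REST port 9200 and the /<index>/<type>/<id> delete
# form; the pod name is an example from this report, and the searchguard index
# appears to be named after the ES pod itself, not an application pod.
POD="logging-es-gjntg1kf-1-n6fsz"
URL="https://localhost:9200/searchguard-${POD}/sg/0"

# Guarded so the snippet is a no-op when there is no oc client / cluster.
if command -v oc >/dev/null 2>&1; then
  oc rsh -n logging "$POD" curl -s \
    --cacert /etc/elasticsearch/secret/admin-ca \
    --cert /etc/elasticsearch/secret/admin-cert \
    --key /etc/elasticsearch/secret/admin-key \
    -XDELETE "$URL"
fi
```

If the port really is the issue, this would also explain the "This is not a HTTP port" response reported later in this bug when the delete was attempted against 9300.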
It was something like:

    curl -s --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key -XDELETE https://localhost:9300/searchguard-<PODNAME>/sg?id=0

but I would have to try it again to be sure I had it right. I'm not certain to what extent it's worth the effort.

@Jeff, Thanks for the info. I tried it inside my ES container and it failed. My logging stacks are:

    # oc get po -n logging
    NAME                          READY     STATUS      RESTARTS   AGE
    logging-curator-1-ktmlb       1/1       Running     0          3h
    logging-deployer-wqw5l        0/1       Completed   0          3h
    logging-es-gjntg1kf-1-n6fsz   1/1       Running     0          3h
    logging-fluentd-1np15         1/1       Running     0          3h
    logging-kibana-1-kdtlk        2/2       Running     0          3h

And my applications are:

    # oc get po -n test
    NAME              READY     STATUS      RESTARTS   AGE
    ruby-ex-1-61n0p   1/1       Running     0          3h
    ruby-ex-1-build   0/1       Completed   0          3h

Here is how I ran the curl command to delete the ACL doc:

    # oc rsh logging-es-gjntg1kf-1-n6fsz
    sh-4.2$ curl -s --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key -XDELETE https://localhost:9300/searchguard-ruby-ex-1-61n0p/sg?id=0
    This is not a HTTP port

I wonder whether I need to specify my app's namespace somewhere in the curl command. Also, is "id=0" a fixed value? I removed the needinfo flag; please reset it if you want me to reproduce this bug.

Awaiting the plugin SG rpms to be built so new images with the fix can be produced.

Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1421630 and https://bugzilla.redhat.com/show_bug.cgi?id=1422355

So I just ran into this bug yesterday and spoke with Jeff and other members of the team. Here is how I was able to produce it (I have not tried reproducing it, as this happened in production :/):

1. `oc adm drain $infrastructureNode --ignore-daemonsets=true --force=true --grace-period=600 --delete-local-data=true`
2. Wait for the drain to complete.
3. Do some OS-level work and reboot the node.
4. Uncordon the node so it can receive pods again.
5. Watch as my 3-node ES cluster goes haywire and starts spitting out a similar error (a different searchguard index, since a different pod got "stuck").

Solution:

1. Make sure that all nodes in the cluster are at least talking and the cluster is "working".
2. Target the node in the ES cluster that is throwing that message.
3. Spin it down to 0 pods and then spin it back up to 1.
4. Watch the logs from the 3 nodes; after a few minutes (5-10) the cluster recovers.

The issue with the above, though, is that I cannot do any rolling-restart work at the OS level, because I do not know when and which one of the nodes will go kaput.

My attempt to upgrade to 3.5.0 is currently blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1424981

Upgrade to 3.5.0 failed due to https://bugzilla.redhat.com/show_bug.cgi?id=1425312; see https://bugzilla.redhat.com/show_bug.cgi?id=1425312#c9 and https://bugzilla.redhat.com/show_bug.cgi?id=1425312#c10. The workaround for BZ #1425312 cannot be used.

*** Bug 1429827 has been marked as a duplicate of this bug. ***

Upgraded the logging stacks from 3.4.1 to 3.5.0 and it passed verification: I searched the ansible execution log and the ES log after the upgrade, did not find IndexAlreadyExistsException, and the logging system worked fine after the upgrade, so this bug is set to verified. The upgrade was done with the current playbooks from https://github.com/openshift/openshift-ansible

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0835
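For reference, the repro and recovery sequence described in the comments above can be condensed into the following shell sketch. It is hypothetical: `$INFRA_NODE` and the deploymentconfig name are placeholders, and using `oc scale` against the per-member ES deploymentconfig is an assumption about how "spin it down to 0 pods and back up to 1" was done.

```shell
# Hypothetical condensation of the drain/repro and recovery steps above.
# INFRA_NODE and ES_DC are placeholders; the block is guarded so it is a
# no-op when there is no oc client / cluster available.
INFRA_NODE="${INFRA_NODE:-infra-node-1.example.com}"   # placeholder node name
ES_DC="${ES_DC:-logging-es-gjntg1kf}"                  # example ES dc from this report

if command -v oc >/dev/null 2>&1; then
  # Repro: drain the node, do OS-level work, reboot, then let pods back on.
  oc adm drain "$INFRA_NODE" --ignore-daemonsets=true --force=true \
    --grace-period=600 --delete-local-data=true
  # ... OS-level maintenance and reboot happen here ...
  oc adm uncordon "$INFRA_NODE"

  # Recovery for the ES member throwing IndexAlreadyExistsException:
  # cycle it to 0 replicas and back to 1, then watch the cluster settle.
  oc scale dc "$ES_DC" -n logging --replicas=0
  oc scale dc "$ES_DC" -n logging --replicas=1
  oc logs -f "dc/$ES_DC" -n logging
fi
```

With the plugin fix in place (existence check before seeding), the restarted pod should no longer trigger the exception even when it comes back with the same name.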