Bug 1416210

Summary: Elasticsearch throws IndexAlreadyExistsException During Aggregated Logging Upgrade
Product: OpenShift Container Platform
Reporter: Nick Schuetz <nschuetz>
Component: Logging
Assignee: Jeff Cantrill <jcantril>
Status: CLOSED ERRATA
QA Contact: Xia Zhao <xiazhao>
Severity: high
Priority: unspecified
Version: 3.4.0
CC: aos-bugs, ewolinet, jokerman, juzhao, kurktchiev, mmccomas, nschuetz, rmeggins, rromerom, xiazhao
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: Bug Fix
Doc Text:
Cause: The openshift-elasticsearch-plugin always attempted to create the ACL index. Consequence: The plugin could produce a stack trace if the index already existed. This could occur when a node was drained and the pod was restarted with the same name; the pod name is used as part of the index name, which caused the error. Fix: Make the plugin check for index existence before trying to seed the index. Result: The plugin determines that the index exists and no longer produces a stack trace.
Story Points: ---
Last Closed: 2017-03-23 09:21:00 UTC
Type: Bug
Bug Depends On: 1420684    

Description Nick Schuetz 2017-01-24 21:25:17 UTC
Description of problem:

When upgrading to OpenShift 3.4 and updating Aggregated Logging, I get the following error if there is already data in ES:

[2017-01-24 17:29:27,400][ERROR][io.fabric8.elasticsearch.plugin.acl.DynamicACLFilter] [Blindspot] Exception encountered when seeding initial ACL
[.searchguard.logging-es-3wxbh02e-4-ax9j0] IndexAlreadyExistsException[already exists]

Version-Release number of selected component (if applicable):

3.4.0

How reproducible:

Always

Steps to Reproduce:
1. From a 3.3 install, attempt a manual upgrade using the following documentation: https://docs.openshift.com/container-platform/3.3/install_config/upgrading/manual_upgrades.html#manual-upgrading-efk-logging-stack

Actual results:

ES never starts up again because of the following error:

[2017-01-24 17:29:27,400][ERROR][io.fabric8.elasticsearch.plugin.acl.DynamicACLFilter] [Blindspot] Exception encountered when seeding initial ACL
[.searchguard.logging-es-3wxbh02e-4-ax9j0] IndexAlreadyExistsException[already exists]

Expected results:

ES starts and the upgrade deployment succeeds

Additional info:

I was using an NFS persistent volume (PV) for my ES storage, which had worked prior to the upgrade attempt.

Comment 1 Nick Schuetz 2017-01-24 21:31:59 UTC
Correction:

Use the following documentation to reproduce: https://docs.openshift.com/container-platform/3.4/install_config/upgrading/manual_upgrades.html#manual-upgrading-efk-logging-stack

Comment 6 Jeff Cantrill 2017-01-27 15:58:45 UTC
@Xia Is this something you can take on and verify or reproduce?

Comment 7 Xia Zhao 2017-02-03 03:45:10 UTC
@Jeff  In my test I used a hostPath PV for ES storage and didn't encounter the exception. I will retry with an NFS PV as the bug reporter mentioned.

Comment 8 Jeff Cantrill 2017-02-03 20:50:13 UTC
@xia,

Looking at the code, I think the only way to get into this state is if there is some sort of abrupt termination between the time the index is created and the time it is actually seeded with the ACL.  It seems the only way to force it into this state would be to stand up the ES instance and then, after cycling it down, manually delete the ACL document.  I'm going to see if I can reproduce it that way and create a fix.

Comment 9 Jeff Cantrill 2017-02-03 23:32:25 UTC
@Xia,

My attempts to reproduce were unsuccessful.  I think, however, the following will guard against future issues: https://github.com/fabric8io/openshift-elasticsearch-plugin/pull/59
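
For reference, a minimal sketch of the check-before-create guard the PR describes, expressed against the ES REST API rather than the plugin's actual Java code (the index name and certificate paths are illustrative, copied from earlier in this report):

  # HEAD on an index returns HTTP 200 if it exists and 404 if it does not,
  # so only create and seed the ACL index when it is absent.
  INDEX=".searchguard.logging-es-3wxbh02e-4-ax9j0"   # illustrative per-pod index name from the error above
  status=$(curl -s -o /dev/null -w '%{http_code}' \
    --cacert /etc/elasticsearch/secret/admin-ca \
    --cert /etc/elasticsearch/secret/admin-cert \
    --key /etc/elasticsearch/secret/admin-key \
    --head "https://localhost:9200/${INDEX}")
  if [ "$status" = "404" ]; then
    echo "index is absent; safe to create and seed the ACL"
  fi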

Comment 10 Xia Zhao 2017-02-04 03:23:44 UTC
@Jeff, 

Thanks for the info. I'm interested in how to manually delete the ACL document; would you please list the steps? Thanks.

And are these the recommended steps for reproducing the original issue?
1. Install logging 3.3.1, bound with persistent storage, and populate logs for a user-project index
2. Scale down ES cluster to replica=0, and make sure ES gracefully shut down
3. Manually delete the ACL document inside ES (How to achieve this?)
4. Upgrade logging stacks to 3.4.1 (assuming the ES cluster will be up after the upgrade)
5. Check the old log entries of user-project on kibana UI
6. Search for IndexAlreadyExistsException in upgrade logs

-- I'm actually not sure whether step #4 should come before #2 and #3...

Comment 11 Jeff Cantrill 2017-02-05 18:14:06 UTC
Xia,

I exec'd into the ES container while it was running and performed a delete using curl, giving it the certs. It is something like:
'curl -s --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key -XDELETE https://localhost:9300/searchguard-<PODNAME>/sg?id=0'  but I would have to try again to ensure I had it correctly.  I'm not certain to what extent it's worth the effort.
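
(An untested, corrected guess at the command above: 9300 is ES's transport port and the HTTP API listens on 9200, and the error earlier in this report shows the per-pod index name in the '.searchguard.<podname>' form, so the delete would look more like:

  # document delete takes the /index/type/id form in the ES REST API
  curl -s --cacert /etc/elasticsearch/secret/admin-ca \
       --cert /etc/elasticsearch/secret/admin-cert \
       --key /etc/elasticsearch/secret/admin-key \
       -XDELETE "https://localhost:9200/.searchguard.<PODNAME>/sg/0"
)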

Comment 12 Xia Zhao 2017-02-06 10:07:06 UTC
@Jeff,

Thanks for the info. -- Tried it inside my ES container and it failed:

My logging stacks are:
# oc get po -n logging
NAME                          READY     STATUS      RESTARTS   AGE
logging-curator-1-ktmlb       1/1       Running     0          3h
logging-deployer-wqw5l        0/1       Completed   0          3h
logging-es-gjntg1kf-1-n6fsz   1/1       Running     0          3h
logging-fluentd-1np15         1/1       Running     0          3h
logging-kibana-1-kdtlk        2/2       Running     0          3h

And my applications are:
# oc get po -n test
NAME              READY     STATUS      RESTARTS   AGE
ruby-ex-1-61n0p   1/1       Running     0          3h
ruby-ex-1-build   0/1       Completed   0          3h

Here is how I exercised the curl command to delete ACL doc:
# oc rsh logging-es-gjntg1kf-1-n6fsz
sh-4.2$ curl -s --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key -XDELETE https://localhost:9300/searchguard-ruby-ex-1-61n0p/sg?id=0
This is not a HTTP port

I wonder if I need to specify my app's namespace somewhere in the curl command? Also, is "id=0" a fixed value?

-- Removed the needinfo flag; please reset it if you want me to repro this bug.
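
(A hedged suggestion for finding the exact per-pod index name: ES's standard _cat API can list all indices. Note the request must go to the HTTP port 9200, not the transport port 9300, which is why the command above was answered with "This is not a HTTP port":

  oc rsh logging-es-gjntg1kf-1-n6fsz
  curl -s --cacert /etc/elasticsearch/secret/admin-ca \
       --cert /etc/elasticsearch/secret/admin-cert \
       --key /etc/elasticsearch/secret/admin-key \
       "https://localhost:9200/_cat/indices?v"   # look for .searchguard.<podname> entries
)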

Comment 13 Jeff Cantrill 2017-02-07 19:17:58 UTC
Awaiting plugin SG RPMs to be built to produce new images with the fix.

Comment 18 Boris Kurktchiev 2017-02-17 13:32:09 UTC
So I just ran into this bug yesterday and spoke with Jeff and other members of the team. Here is how I was able to produce it (I have not tried reproducing it, as this happened in production :/)

1. oc adm drain $infrastructureNode --ignore-daemonsets=true --force=true --grace-period=600 --delete-local-data=true
2. Wait for the drain to happen.
3. Do some OS-level work and reboot the node.
4. Uncordon the node so it can receive pods again (command sketched below).
5. Watch as my 3-node ES cluster goes haywire and starts spitting out a similar error (a different searchguard index, since a different pod got "stuck").
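
(A sketch of the commands for steps 4 and 5; the ES pod name is a placeholder:)

  oc adm uncordon $infrastructureNode   # step 4: let the node schedule pods again
  oc logs -f <es-pod> -n logging        # step 5: watch for the IndexAlreadyExistsException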

Solution:
1. Make sure that all nodes in the cluster are at least talking and the cluster is "working".
2. Target the node in the ES cluster that is throwing that message.
3. Spin it down to 0 pods and then spin it back up to 1 (commands sketched below).
4. Watch the logs from the 3 nodes; after a few minutes (5-10) the cluster recovers.
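
(A sketch of steps 3 and 4 with an illustrative DC name, derived from the pod naming seen earlier in this thread:)

  oc scale dc/logging-es-gjntg1kf --replicas=0 -n logging   # spin the affected ES deployment down
  oc scale dc/logging-es-gjntg1kf --replicas=1 -n logging   # ...and back up
  oc logs -f dc/logging-es-gjntg1kf -n logging              # watch it rejoin the cluster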

The issue with the above, though, is that I cannot do any rolling-restart work at the OS level, since I do not know when and which one of the nodes will go kaput.

Comment 19 Xia Zhao 2017-02-20 09:16:15 UTC
My attempt to upgrade to 3.5.0 is currently blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1424981

Comment 21 Xia Zhao 2017-03-02 07:20:01 UTC
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1428249

Comment 22 Jeff Cantrill 2017-03-07 16:25:55 UTC
*** Bug 1429827 has been marked as a duplicate of this bug. ***

Comment 23 Xia Zhao 2017-03-14 05:31:16 UTC
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1431935

Comment 24 Xia Zhao 2017-03-15 07:33:39 UTC
Upgraded logging stacks from 3.4.1 to 3.5.0 and the upgrade passed verification: I searched the ansible execution log and the ES log post-upgrade and didn't find IndexAlreadyExistsException, and the logging system worked fine after the upgrade. Setting to verified.

The upgrade was done with current playbooks from https://github.com/openshift/openshift-ansible
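
(A sketch of the post-upgrade log check described above; the component=es label selector is an assumption about how the logging ES pods are labeled:)

  # scan every ES pod's log for the exception; no output means the check passed
  for pod in $(oc get pods -n logging -l component=es -o name); do
    oc logs -n logging "$pod" | grep IndexAlreadyExistsException
  done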

Comment 26 errata-xmlrpc 2017-03-23 09:21:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0835