Description of problem:
When upgrading to OpenShift 3.4 and updating Aggregated Logging, I get the following error if there is already data in ES:

[2017-01-24 17:29:27,400][ERROR][io.fabric8.elasticsearch.plugin.acl.DynamicACLFilter] [Blindspot] Exception encountered when seeding initial ACL [.searchguard.logging-es-3wxbh02e-4-ax9j0] IndexAlreadyExistsException[already exists]

Version-Release number of selected component (if applicable):
3.4.0

How reproducible:
Always

Steps to Reproduce:
1. From a 3.3 install, try manually upgrading using the following documentation: https://docs.openshift.com/container-platform/3.3/install_config/upgrading/manual_upgrades.html#manual-upgrading-efk-logging-stack

Actual results:
ES never starts up again because of the following error:

[2017-01-24 17:29:27,400][ERROR][io.fabric8.elasticsearch.plugin.acl.DynamicACLFilter] [Blindspot] Exception encountered when seeding initial ACL [.searchguard.logging-es-3wxbh02e-4-ax9j0] IndexAlreadyExistsException[already exists]

Expected results:
ES starts and the upgrade deployment succeeds

Additional info:
I was using an NFS persistent volume (PV) for my ES storage that worked prior to the upgrade attempt.
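As a hedged illustration of confirming the failure (assuming the default "logging" project and using the pod name that appears in the error above), the exception can be located in the ES pod log after the upgrade with something like:

# oc logs logging-es-3wxbh02e-4-ax9j0 -n logging | grep IndexAlreadyExistsException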
Correction: Use the following documentation to reproduce: https://docs.openshift.com/container-platform/3.4/install_config/upgrading/manual_upgrades.html#manual-upgrading-efk-logging-stack
@Xia Is this something you can take on and verify or reproduce?
@Jeff In my test I used a hostPath PV for ES storage and didn't encounter the exception. I will retry it with an NFS PV as the bug reporter mentioned.
@Xia, Looking at the code, I think the only way to get into this state is if there is some sort of abrupt termination between the time this index is created and the time it is actually seeded with the ACL. It seems the only way to force it into this state would be to stand up the ES instance and then, after cycling it down, manually delete the ACL document. I'm going to see if I can reproduce that way and create a fix.
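A hedged sketch of checking for this half-initialized state (searchguard index created but never seeded), assuming the standard ES HTTP port 9200 and the admin certs mounted in the ES container; if seeding never happened, the doc count of the .searchguard.* index should be 0:

# oc rsh <es-pod>
sh-4.2$ curl -s --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key 'https://localhost:9200/_cat/indices/.searchguard.*?v'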
@Xia, My attempts to reproduce were unsuccessful. I think, however, the following will guard against future issues: https://github.com/fabric8io/openshift-elasticsearch-plugin/pull/59
@Jeff, Thanks for the info. I'm interested in how to manually delete the ACL document; would you please list the steps? Also, are these the recommended steps for reproducing the original issue?
1. Install logging 3.3.1, bound to persistent storage, and populate logs for a user-project index
2. Scale down the ES cluster to replicas=0, and make sure ES shuts down gracefully (a sketch of this step follows below)
3. Manually delete the ACL document inside ES (how do I achieve this?)
4. Upgrade the logging stacks to 3.4.1 (the ES cluster will be up again after the upgrade)
5. Check the old log entries of the user project on the Kibana UI
6. Search for IndexAlreadyExistsException in the upgrade logs
I'm actually not very sure whether step #4 should come before #2 and #3...
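A minimal sketch of the scale-down in step 2, assuming a default install where the ES deployment config name carries a generated suffix (the suffix here is a placeholder):

# oc scale dc/logging-es-<suffix> --replicas=0 -n logging
# oc get pods -n logging
(confirm the ES pod has terminated before deleting the ACL document)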
Xia, I exec'd into the ES container while it was running and performed a delete using curl, giving it the certs. It is something like: 'curl -s --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key -XDELETE https://localhost:9300/searchguard-<PODNAME>/sg?id=0' but I would have to try again to ensure I had it correctly. I'm not certain to what extent it's worth the effort.
@Jeff, Thanks for the info. I tried this inside my ES container and it failed.

My logging stacks are:
# oc get po -n logging
NAME                          READY     STATUS      RESTARTS   AGE
logging-curator-1-ktmlb       1/1       Running     0          3h
logging-deployer-wqw5l        0/1       Completed   0          3h
logging-es-gjntg1kf-1-n6fsz   1/1       Running     0          3h
logging-fluentd-1np15         1/1       Running     0          3h
logging-kibana-1-kdtlk        2/2       Running     0          3h

And my applications are:
# oc get po -n test
NAME              READY     STATUS      RESTARTS   AGE
ruby-ex-1-61n0p   1/1       Running     0          3h
ruby-ex-1-build   0/1       Completed   0          3h

Here is how I ran the curl command to delete the ACL doc:
# oc rsh logging-es-gjntg1kf-1-n6fsz
sh-4.2$ curl -s --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key -XDELETE https://localhost:9300/searchguard-ruby-ex-1-61n0p/sg?id=0
This is not a HTTP port

I wonder if I need to specify the namespace name of my app in the curl command somewhere? Also, is "id=0" a fixed value?

Removed the needinfo flag; please reset it if you want me to reproduce this bug.
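A hedged aside, not confirmed in this thread: 9300 is the Elasticsearch transport port, which is why curl answers "This is not a HTTP port"; the REST API listens on 9200, and a document delete takes the form /<index>/<type>/<id>. A sketch with the index, type, and id left as placeholders, since the exact coordinates of the ACL document are not established here:

sh-4.2$ curl -s --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key -XDELETE 'https://localhost:9200/<searchguard-index>/<type>/<id>'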
Awaiting the plugin SG rpms to be built so new images with the fix can be produced.
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1421630 and https://bugzilla.redhat.com/show_bug.cgi?id=1422355
So I just ran into this bug yesterday and spoke with Jeff and other members of the team. Here is how I was able to produce this bug (I have not tried reproducing it, as this happened in production :/):
1. oc adm drain $infrastructureNode --ignore-daemonsets=true --force=true --grace-period=600 --delete-local-data=true
2. Wait for the drain to happen.
3. Do some OS-level work and reboot the node.
4. Uncordon the node so it can receive pods again.
5. Watch as my 3-node ES cluster goes haywire and starts spitting out a similar error (a different searchguard index, since a different pod got "stuck").

Solution:
1. Make sure that all nodes in the cluster are at least talking and the cluster is "working".
2. Target the node in the ES cluster that is throwing that message.
3. Spin it down to 0 pods and then spin it back up to 1 (a sketch of this step follows below).
4. Watch the logs from the 3 nodes; after a few minutes (5-10) the cluster recovers.

The issue with the above, though, is that I cannot do any rolling-restart work at the OS level, as I do not know when and which one of the nodes will go kaput.
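A minimal sketch of the scale-down/scale-up in step 3 of the workaround above, assuming the affected ES member has its own deployment config (names below are placeholders):

# oc scale dc/logging-es-<affected-suffix> --replicas=0 -n logging
# oc scale dc/logging-es-<affected-suffix> --replicas=1 -n logging
# oc logs -f <new-es-pod> -n logging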
My attempt to upgrade to 3.5.0 is currently blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1424981
The upgrade to 3.5.0 failed due to https://bugzilla.redhat.com/show_bug.cgi?id=1425312; see https://bugzilla.redhat.com/show_bug.cgi?id=1425312#c9 and https://bugzilla.redhat.com/show_bug.cgi?id=1425312#c10 -- the workaround for BZ #1425312 cannot be used.
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1428249
*** Bug 1429827 has been marked as a duplicate of this bug. ***
blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1431935
Upgraded the logging stacks from 3.4.1 to 3.5.0 and it passed verification: I searched the ansible execution log and the ES log post-upgrade, did not find IndexAlreadyExistsException, and the logging system worked fine after the upgrade, so setting this to verified. The upgrade was done with the current playbooks from https://github.com/openshift/openshift-ansible. The checks are sketched below.
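A hedged sketch of the verification checks (the log path and pod name are placeholders, not taken from this report):

# grep -i IndexAlreadyExistsException <ansible-execution-log>
# oc logs <es-pod> -n logging | grep IndexAlreadyExistsException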
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0835