Bug 1791837

Summary: Kibana user indices continue to need to be deleted and restored due to permissions errors
Product: OpenShift Container Platform
Reporter: Greg Rodriguez II <grodrigu>
Component: Logging
Assignee: Jeff Cantrill <jcantril>
Status: CLOSED ERRATA
QA Contact: Anping Li <anli>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.11.0
CC: agerami, anli, aos-bugs, apurty, brian_mckeown, dkulkarn, jcantril, periklis, ssadhale, stwalter, tnakajo, vjaypurk, xtian
Target Milestone: ---
Keywords: Reopened
Target Release: 3.11.z
Flags: rkonuru: needinfo-, tnakajo: needinfo+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-07-27 13:49:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Greg Rodriguez II 2020-01-16 15:12:27 UTC
Description of problem:
Users with appropriate roles continually hit `[security_exception] no permissions for [indices:data/read/field_caps]` errors when accessing Kibana.  Deleting and restoring the affected user's Kibana index resolves the issue temporarily, but it recurs repeatedly.

Version-Release number of selected component (if applicable):
OCP 3.11.154-1

How reproducible:
Customer is able to reproduce easily

Steps to Reproduce:
1. User with appropriate roles gets `[security_exception] no permissions for [indices:data/read/field_caps]` message when accessing Kibana
2. Affected user's Kibana index is deleted and restored (see the sketch after these steps)
3. Affected user is able to access Kibana without error
4. Days later, the issue occurs again, repeat steps
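
For step 2, a rough sketch of how the per-user Kibana index is typically found and removed; the namespace, pod label, container name, and index naming are assumptions here, so list the indices first and confirm the right one. The index is reseeded the next time the user logs in to Kibana.

~~~
# Hedged sketch only: namespace, pod label, and index name are assumptions.
ES_POD=$(oc -n openshift-logging get pods -l component=es -o jsonpath='{.items[0].metadata.name}')

# Find the affected user's Kibana index (per-user indices are named .kibana.<hash-of-username>):
oc -n openshift-logging exec -c elasticsearch "$ES_POD" -- es_util --query="_cat/indices/.kibana*?v"

# Delete it; it is recreated/reseeded when the user logs back in to Kibana:
oc -n openshift-logging exec -c elasticsearch "$ES_POD" -- es_util --query=".kibana.<user-hash>" -XDELETE
~~~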

Actual results:
User indices continue to need to be deleted and restored, even in green status, in order for users to access Kibana without issues.

Expected results:
The Kibana user indices should not need to be deleted and restored in the first place, let alone multiple times, only for the issue to keep recurring.

Additional info:
Adding customer specific details as private notes in BZ

Comment 3 Jeff Cantrill 2020-01-17 19:33:22 UTC
Consider asking them to try to access Kibana again and, within 60s, run the following script [1].  Clone the entire repo and execute it as someone who has permission to access Elasticsearch.  This will show you the state of all permissions.

* Is the user trying to access anything other than the Discover tab?
* What is their role? Can they see the default namespace?

[1] https://github.com/jcantrill/cluster-logging-tools/blob/release-3.x/scripts/view-es-permissions
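
For reference, a rough sketch of what running that might look like; the namespace and the script's exact invocation are assumptions, so check the script's own usage in the repo.

~~~
# Sketch only: run this while the user reproduces the error, as someone with
# admin access to Elasticsearch. Namespace and script arguments are assumptions;
# check the script's usage/readme before running.
git clone -b release-3.x https://github.com/jcantrill/cluster-logging-tools.git
cd cluster-logging-tools
oc project openshift-logging        # logging namespace on OCP 3.11; may differ
./scripts/view-es-permissions       # dumps the current state of ES permissions
~~~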

Comment 8 Jeff Cantrill 2020-01-28 20:47:24 UTC
What exactly is the user doing when they see this error?
Are they trying to access their Kibana index?
Are they trying to query logs using an index pattern? If so, please list the index pattern.
What is the user's role?

The following is a diagram of how a user's role is determined and how the permissions are seeded:
[1] https://github.com/openshift/origin-aggregated-logging/blob/master/docs/access-control.md#role-definitions-and-permissions

Comment 9 Jeff Cantrill 2020-01-28 20:49:34 UTC
You might temporarily workaround this issue by:

1. exec into an es pod
2. edit sgconfig/sg_action_groups.yaml and add 'indices:data/read/field_caps' to the list for "INDEX_KIBANA"
3. reseed the permissions by 'es_seed_acl'

If this resolves the issue we can modify it for release
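
A minimal shell sketch of the three steps above; the namespace and pod label are illustrative, so adjust them for the cluster.

~~~
# Sketch of the workaround above; pod label and namespace are illustrative.
oc project openshift-logging
ES_POD=$(oc get pods -l component=es -o jsonpath='{.items[0].metadata.name}')
oc exec -it "$ES_POD" -c elasticsearch -- bash
# Inside the pod:
#   vi sgconfig/sg_action_groups.yaml   # add 'indices:data/read/field_caps'
#                                       # to the INDEX_KIBANA action group
#   es_seed_acl                         # reseed the permissions
~~~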

Comment 19 Greg Rodriguez II 2020-02-26 16:40:36 UTC
Received update from customer:

~~~

After working with Engineering, we shortened retention times and fine tuned some settings and did not see issues immediately after.
When updating our environment, all changes were reset and issue was observed again.
We need a solution/process to correct this that will either persist thru OCP version updates, or that can be automated against our environment after OCP version update.

~~~

"Working with Engineering" is referring to c#17.  This worked, until an update, and everything was reset.  Is there a way to achieve this goal that will persist through upgrades?

Comment 20 Jeff Cantrill 2020-02-26 20:53:36 UTC
(In reply to Greg Rodriguez II from comment #19)
> Received update from customer:
> 
> "Working with Engineering" is referring to c#17.  This worked, until an
> update, and everything was reset.  Is there a way to achieve this goal that
> will persist through upgrades?

You say these issues occur after an update, which is when all permissions are reseeded.

Did #c9 resolve the problem, and was it only an issue again after the upgrade?  That would make sense, because the upgrade reverts what is allowed, and we should update it to make the change permanent.  It's easy enough to test: apply #c9, reseed, and test, then revert #c9, reseed, and test; the problem should disappear and then come back.  Can they confirm?

They might also consider accessing Kibana in a "private" browser window to ensure they are not experiencing issues related to caching.

Comment 22 Jeff Cantrill 2020-03-10 01:14:47 UTC
The output doesn't answer my question from #c20. Does making the change and reseeding the permissions resolve the issue until the pods get restarted?

Comment 24 Greg Rodriguez II 2020-04-03 14:32:45 UTC
Customer is requesting update.  Has there been any further progress at this time?

Comment 25 Greg Rodriguez II 2020-04-17 14:04:01 UTC
Are there any updates that can be provided regarding this issue?

Comment 26 Greg Rodriguez II 2020-04-21 14:25:37 UTC
Customer is pushing for movement on this issue.  Are there any updates that can be provided?

Comment 27 Greg Rodriguez II 2020-05-19 16:09:49 UTC
The issue appears to have hit a standstill.  Are there any updates on this case?  The customer is still affected and requesting to move forward.

Comment 28 Greg Rodriguez II 2020-05-28 17:06:41 UTC
Customer is still impacted by this issue.  Is there anything whatsoever that can be communicated to the customer regarding the status of this issue or workarounds?

Comment 29 Jeff Cantrill 2020-06-02 19:20:00 UTC
I'm not exactly sure where the idea originated that a user's Kibana index is "corrupted" and needs to be removed and replaced in order to fix the problem.  Ultimately I believe the issue relates to a user's oauth token expiring, possibly an unexpired cookie in the user's browser, and the interplay between the two.  We are currently working through some odd behavior for another customer related to oauth tokens, cookie expiration, and sign off, which I believe may mitigate some of the frustrations users experience.

I have captured screen shots of behavior I see when my oauth token expires [1] on a 4.4 cluster which is the same logging stack. The first issue occurs by using the discover tab and allowing the browser to refresh until the token expires.  The second issue occurs when navigating to the management tab, choosing an index pattern and then clicking the refresh button.  You will note the last is the same error though it lists the kibana certs instead of the user.  I was able to solve these issues by clicking the logout link in the upper right corner and then signing back in to get a new oauth token.

* What is the user doing to retrieve logs? When they come back to Kibana at a "later" time are they asked to re-enter their credentials?
* How much "later" do they see the error message occur?
* What is the expiration time for tokens for the ocp cluster?
* Have they tried accessing Kibana in a private browser to see if there is still an issue?
* Have they tried to delete their browser cookies and do they still see the issue?


[1] https://docs.google.com/document/d/183TaKCPzaeWxxYbsZDeQVNGwHrgxgSG2a1idune-76Y/edit?usp=sharing
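
Regarding the token-expiration question above, a hedged way to check the configured lifetime on an OCP 3.11 master (the config path below is the 3.11 default and may differ per install):

~~~
# Sketch: on an OCP 3.11 master, the oauth token lifetime is configured in the
# master config. The path below is the 3.11 default; adjust for the install.
grep -A3 tokenConfig /etc/origin/master/master-config.yaml
# Look for accessTokenMaxAgeSeconds; 86400 (24 hours) is the default.
~~~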

Comment 30 Jeff Cantrill 2020-06-02 19:46:56 UTC
I also experimented with, after my token expires, closing the window, opening a new one, and navigating to Kibana.  I discovered the same error reported here related to field_caps, and I resolved it by signing out and then signing back in again to refresh my oauth token.  I'm asking our docs team to document this as a known issue, as we are unlikely to resolve it because of the nature of Kibana.  If signing back in resolves the error, and/or in light of #c29, I am inclined to close this issue as WONTFIX.

Comment 32 brian_mckeown 2020-06-05 16:35:10 UTC
Hello team,
Several IBM Cloud Openshift customers are impacted by defect https://bugzilla.redhat.com/show_bug.cgi?id=1835396.
But 1835396 has been closed as a duplicate of this issue.
So would someone be able to give a status on this issue please?
Thanks in advance. Brian. IBM Cloud Support.

Comment 33 Jeff Cantrill 2020-06-08 00:37:48 UTC
@Brian,

Please ref https://bugzilla.redhat.com/show_bug.cgi?id=1791837#c29 .  I don't believe this is a bug, but rather users needing to clear their browser cache.

Comment 34 Ravi Konuru 2020-06-08 03:49:10 UTC
@Jeff Cantrill, is it just me, or can I not see c29 even after logging in?  Can you help me find that comment?

So, to be clear, is this just a Kibana issue?  Have we confirmed that all log entries from fluentd are being stored in logstash?

Comment 35 tnakajo 2020-06-09 01:58:50 UTC
@Jeff Cantrill Hello Jeff, one of our customers is also having the same issue on IBM Cloud IKS Version: 4.3.12_1520_openshift.  Could you please investigate this further?

1. Configured the ES but it still failed.

   The procedure was as follows:
   a. exec into an es pod  
   b. edit sgconfig/sg_action_groups.yaml and add 'indices:data/read/field_caps' to the list for "INDEX_KIBANA"  
   c. reseed the permissions by 'es_seed_acl'

2. Cleared the browser cache and logged in to Kibana again, but it still failed.

Comment 37 Jeff Cantrill 2020-06-18 19:01:50 UTC
Please see my comment #c33. I made the referenced comment public and don't believe this to be a bug.  Moving to UpcomingSprint.

Comment 38 Anping Li 2020-07-09 09:27:08 UTC
Following comment 29, I can fix the Kibana permissions error by logging out and logging back in, so I am moving the bug to Verified.
Note: logging out navigates to https://kibana.example.com/oauth/sign_in, but you cannot log back in via 'https://kibana.example.com/oauth/sign_in'; you must log in to Kibana via 'https://kibana.example.com'.


@grodrigu @brian_mckeown.com, feel free to reopen this bug if you hit this again, and please provide the oauthaccesstoken too. You can get the token with 'oc get oauthaccesstoken | grep $username'.
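
For example (the token name and $username are placeholders for the affected user):

~~~
# List tokens for the affected user; $username is a placeholder.
oc get oauthaccesstoken | grep "$username"
# Optionally inspect a specific token's owner and lifetime (name taken from the list above):
oc get oauthaccesstoken <token-name> -o yaml | grep -E 'userName|expiresIn'
~~~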

Comment 42 Jeff Cantrill 2020-07-20 12:24:40 UTC
(In reply to Devendra Kulkarni from comment #39)
> Hello @Jeff and @Anping,
> 
> For case 02629833, the issue still persists and reoccurs once in a week and
> the only way to recover is to delete the kibana user index and then re-login.


Please verify the version and SHA of the images the customer tested.  I don't see how they would have had access to the image verified by QE, as it was not added to the errata until after they tested.

Comment 44 errata-xmlrpc 2020-07-27 13:49:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2990

Comment 46 Red Hat Bugzilla 2023-09-18 00:19:49 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days