Bug 1715370
Summary: Some orphaned logs sent to .operations.* index by mistake.

Product: OpenShift Container Platform
Component: Logging
Version: 4.1.0
Target Release: 4.2.0
Target Milestone: ---
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Reporter: Qiaoling Tang <qitang>
Assignee: Noriko Hosoi <nhosoi>
QA Contact: Qiaoling Tang <qitang>
CC: aos-bugs, jcantril, nhosoi, rmeggins
Doc Type: Bug Fix
Doc Text:
Cause: When a Cluster Logging instance was deleted, the resources it owned were deleted independently of one another.
Consequence: Because the deletions were not ordered, there could be a short window in which the fluentd or rsyslog daemonset was still running after its logcollector service account had already been removed. Logs processed during that window were missing Kubernetes metadata, including the namespace name.
Fix: The foregroundDeletion finalizer is now set on the collector service account, and the collector daemonsets are marked as owned by the service account with blockOwnerDeletion enabled. The service account therefore cannot be deleted until all of its dependent resources have been deleted.
Result: The collector daemonsets can no longer run without the logcollector service account.
Story Points: ---
Type: Bug
Last Closed: 2019-10-16 06:29:26 UTC
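The Doc Text above describes the fix in terms of Kubernetes ownership and deletion ordering. As a rough sketch only (not the actual cluster-logging-operator code), marking the collector daemonset as a dependent of the logcollector service account with blockOwnerDeletion, and then deleting the service account with foreground propagation (which applies the foregroundDeletion finalizer), could look like the following in Go with client-go. The package and helper names are hypothetical, and the pre-context Delete signature assumes a client-go version contemporary with OCP 4.2.

// Hypothetical sketch of the ownership pattern described in the Doc Text;
// not the actual cluster-logging-operator implementation.
package collector

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// markOwnedByServiceAccount (hypothetical helper) records the logcollector
// ServiceAccount as an owner of the collector DaemonSet. With
// BlockOwnerDeletion set, the ServiceAccount's foreground deletion cannot
// complete while the DaemonSet still exists.
func markOwnedByServiceAccount(ds *appsv1.DaemonSet, sa *corev1.ServiceAccount) {
	block := true
	ds.OwnerReferences = append(ds.OwnerReferences, metav1.OwnerReference{
		APIVersion:         "v1",
		Kind:               "ServiceAccount",
		Name:               sa.Name,
		UID:                sa.UID,
		BlockOwnerDeletion: &block,
	})
}

// deleteServiceAccountForeground (hypothetical helper) deletes the
// ServiceAccount with foreground propagation, so the API server adds the
// foregroundDeletion finalizer and keeps the ServiceAccount around until
// all dependents that block owner deletion are gone.
func deleteServiceAccountForeground(client kubernetes.Interface, namespace, name string) error {
	propagation := metav1.DeletePropagationForeground
	return client.CoreV1().ServiceAccounts(namespace).Delete(name, &metav1.DeleteOptions{
		PropagationPolicy: &propagation,
	})
}

With this arrangement, the API server does not finish deleting the service account until the daemonset that references it as an owner is gone, which closes the window described under Consequence above.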
Description
Qiaoling Tang
2019-05-30 08:27:45 UTC
Also encountered this issue in another env: when route53 has some problem, some orphaned pod logs whose kubernetes.orphaned_namespace is `openshift-*` are sent to the ".operations.*" index, not to the ".orphaned.*" index. It seems the logs which have kubernetes.orphaned_namespace "openshift-*" are sent to the .operations.* index. See logs in the attachment.

Hi @Qiaoling, did you have a chance to rerun this failure case recently? According to #c0, the bug was observed with the build 4.1.0-0.nightly-2019-05-29-220142. I'd like you to try a build having this fix, that is, 4.1.0-0.nightly-2019-06-21 or newer.

commit 78bcf075887d32de58a15912b473ae721d4a6cbe (origin/pr/1678)
Author: Rich Megginson <rmeggins>
Date:   Fri Jun 21 12:08:40 2019 -0600

    Bug 1722898: Logging data from all projects are stored to .orphaned indexes with Elasticsearch

Thanks!

This could be reproduced in 4.1.0-0.nightly-2019-07-07-154416, and the images are:

ose-cluster-logging-operator:v4.1.6-201907070809
ose-logging-elasticsearch5:v4.1.6-201907070809
ose-logging-fluentd:v4.1.6-201907070809

{
  "_index": ".operations.2019.07.09",
  "_type": "com.redhat.viaq.common",
  "_id": "OWUyOGNiNzktMTZkMi00NmYwLTg4NzctZTI2MjkzZDZjZjcw",
  "_version": 1,
  "_score": null,
  "_source": {
    "docker": {
      "container_id": "f8e24c405c7475d87dcc44793e81a0554f5cda33367a5e2093e94bb07149fcf6"
    },
    "kubernetes": {
      "container_name": "elasticsearch-operator",
      "namespace_name": ".orphaned",
      "pod_name": "elasticsearch-operator-5c9d944579-rqz7f",
      "orphaned_namespace": "openshift-operators-redhat",
      "namespace_id": "orphaned"
    },
    "message": "time=\"2019-07-09T02:01:39Z\" level=info msg=\"Flushing nodes for cluster elasticsearch in openshift-logging\"",
    "level": "info",
    "hostname": "localhost.localdomain",
    "pipeline_metadata": {
      "collector": {
        "ipaddr4": "10.129.2.16",
        "ipaddr6": "fe80::4865:f5ff:fe6f:1f94",
        "inputname": "fluent-plugin-systemd",
        "name": "fluentd",
        "received_at": "2019-07-09T02:01:39.750431+00:00",
        "version": "1.5.1 1.6.0"
      }
    },
    "@timestamp": "2019-07-09T02:01:39.085290+00:00",
    "viaq_msg_id": "OWUyOGNiNzktMTZkMi00NmYwLTg4NzctZTI2MjkzZDZjZjcw"
  },
  "fields": {
    "@timestamp": [ 1562637699085 ],
    "pipeline_metadata.collector.received_at": [ 1562637699750 ]
  },
  "highlight": {
    "kubernetes.namespace_name": [ "@kibana-highlighted-field@.orphaned@/kibana-highlighted-field@" ]
  },
  "sort": [ 1562637699085 ]
}

Please set the fluentd log level to trace:

oc edit configmap fluentd

and replace this line

@include configs.d/openshift/system.conf

with this:

<system>
  @log_level trace
</system>

Then provide the fluentd logs. I'm also interested in the environment variables set in the fluentd pod. Please set VERBOSE to true and share the fluentd log. Thanks!

if [[ $VERBOSE ]]; then
  set -ex
  fluentdargs="$fluentdargs -vv --log-event-verbose"
  echo ">>>>>> ENVIRONMENT VARS <<<<<"
  env | sort
  echo ">>>>>>>>>>>>><<<<<<<<<<<<<<<<"
else

Created attachment 1588951 [details]
fluentd pod logs
This is interesting:

grep orphaned_namespace orphaned-log-in-es | sort -u
    "orphaned_namespace" : "openshift-machine-api",
    "orphaned_namespace" : "openshift-monitoring",
    "orphaned_namespace" : "openshift-operator-lifecycle-manager",

Only those 3 namespaces.

logs/fluentd-2lrcb.log:2019-07-10 01:27:16 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-machine-api
logs/fluentd-2lrcb.log-2019-07-10 01:27:16 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc

logs/fluentd-vmrl6.log:2019-07-10 01:27:18 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-operator-lifecycle-manager
logs/fluentd-vmrl6.log:2019-07-10 01:27:18 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc

logs/fluentd-l8tsm.log:2019-07-10 01:27:10 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-monitoring
logs/fluentd-l8tsm.log-2019-07-10 01:27:10 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc

There are a lot of these errors during the time of the orphaned records, from 01:27:10 to 01:27:40, and only from 4 of the fluentd pods: fluentd-2lrcb, fluentd-ksq8d, fluentd-l8tsm, fluentd-vmrl6.

I'm not sure how the records in orphaned-log-in-es were sorted, but they all have a @timestamp at or around this time frame.

We'll need to look at the API server logs to figure out why the API server is returning a 401 Unauthorized during this time period. I don't think I have ever seen this particular problem.

(In reply to Rich Megginson from comment #8)
> This is interesting:
>
> grep orphaned_namespace orphaned-log-in-es | sort -u
>     "orphaned_namespace" : "openshift-machine-api",
>     "orphaned_namespace" : "openshift-monitoring",
>     "orphaned_namespace" : "openshift-operator-lifecycle-manager",
>
> Only those 3 namespaces.

I also find orphaned_namespace "openshift-logging" and "openshift-operators-redhat" when I reproduce this issue.
> logs/fluentd-2lrcb.log:2019-07-10 01:27:16 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-machine-api
> logs/fluentd-2lrcb.log-2019-07-10 01:27:16 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc
>
> logs/fluentd-vmrl6.log:2019-07-10 01:27:18 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-operator-lifecycle-manager
> logs/fluentd-vmrl6.log:2019-07-10 01:27:18 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc
>
> logs/fluentd-l8tsm.log:2019-07-10 01:27:10 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-monitoring
> logs/fluentd-l8tsm.log-2019-07-10 01:27:10 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc
>
> There are a lot of these errors during the time of the orphaned records, from 01:27:10 to 01:27:40, and only from 4 of the fluentd pods: fluentd-2lrcb, fluentd-ksq8d, fluentd-l8tsm, fluentd-vmrl6.
>
> I'm not sure how the records in orphaned-log-in-es were sorted, but they all have a @timestamp at or around this time frame.

I found these logs by executing the following in the ES pod:

curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key 'https://localhost:9200/.operations.*/_search?pretty&size=1000' -d '{"query":{"match": {"kubernetes.namespace_name": ".orphaned"}}}'

> We'll need to look at the API server logs to figure out why the API server is returning a 401 Unauthorized during this time period. I don't think I have ever seen this particular problem.

I checked the pod logs in the openshift-apiserver namespace (not sure if I checked the correct resources); I just see some logs like:

E0710 05:51:03.580677       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E0710 05:51:55.786022       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W0710 05:51:55.786417       1 reflector.go:256] github.com/openshift/client-go/image/informers/externalversions/factory.go:101: watch of *v1.ImageStream ended with: The resourceVersion for the provided watch is too old.

I can see a lot of these logs, even if the EFK pods are not running.

(In reply to Qiaoling Tang from comment #9)
> (In reply to Rich Megginson from comment #8)
> > This is interesting:
> >
> > grep orphaned_namespace orphaned-log-in-es | sort -u
> >     "orphaned_namespace" : "openshift-machine-api",
> >     "orphaned_namespace" : "openshift-monitoring",
> >     "orphaned_namespace" : "openshift-operator-lifecycle-manager",
> >
> > Only those 3 namespaces.
>
> I also find orphaned_namespace "openshift-logging" and "openshift-operators-redhat" when I reproduce this issue.

I'm assuming it is a timing issue - any namespace lookups during the time interval when the API server is returning 401 are going to fail.
> > logs/fluentd-2lrcb.log:2019-07-10 01:27:16 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-machine-api
> > logs/fluentd-2lrcb.log-2019-07-10 01:27:16 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc
> >
> > logs/fluentd-vmrl6.log:2019-07-10 01:27:18 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-operator-lifecycle-manager
> > logs/fluentd-vmrl6.log:2019-07-10 01:27:18 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc
> >
> > logs/fluentd-l8tsm.log:2019-07-10 01:27:10 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-monitoring
> > logs/fluentd-l8tsm.log-2019-07-10 01:27:10 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc
> >
> > There are a lot of these errors during the time of the orphaned records, from 01:27:10 to 01:27:40, and only from 4 of the fluentd pods: fluentd-2lrcb, fluentd-ksq8d, fluentd-l8tsm, fluentd-vmrl6.
> >
> > I'm not sure how the records in orphaned-log-in-es were sorted, but they all have a @timestamp at or around this time frame.

> I found these logs by executing "curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key 'https://localhost:9200/.operations.*/_search?pretty&size=1000' -d '{"query":{"match": {"kubernetes.namespace_name": ".orphaned"}}}'" in the ES pod

Ok - you can also add `sort` to the url: ?pretty&size=1000&sort=@timestamp:desc to sort from latest to earliest, or use @timestamp:asc for earliest to latest.

> > We'll need to look at the API server logs to figure out why the API server is returning a 401 Unauthorized during this time period. I don't think I have ever seen this particular problem.
>
> I checked the pod logs in the openshift-apiserver namespace (not sure if I checked the correct resources); I just see some logs like:
> E0710 05:51:03.580677       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
> E0710 05:51:55.786022       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
> W0710 05:51:55.786417       1 reflector.go:256] github.com/openshift/client-go/image/informers/externalversions/factory.go:101: watch of *v1.ImageStream ended with: The resourceVersion for the provided watch is too old.
>
> I can see a lot of these logs, even if the EFK pods are not running.

I would expect to see log messages like "... GET /api/v1/namespaces/openshift-monitoring 401 ..." somewhere. Maybe we need to turn up the log level, or enable audit logging.

I checked all the logs around the time when the issue occurred and filtered the logs gathered by the same fluentd pod; I found that some of the logs were parsed correctly.

We need to get the Kubernetes API server audit logs: https://docs.openshift.com/container-platform/4.1/nodes/nodes/nodes-nodes-audit-log.html

I know in the CI tests that audit logging is enabled by default.
For example, here is what a fluentd request looks like:

/var/log/kube-apiserver/audit.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"79b7fe78-34ce-40d3-8c23-86481022478f","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-ingress","verb":"get","user":{"username":"system:serviceaccount:openshift-logging:logcollector","uid":"95461827-a3f3-11e9-afc0-0a18e06541b4","groups":["system:serviceaccounts","system:serviceaccounts:openshift-logging","system:authenticated"]},"sourceIPs":["10.0.129.192"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.3p105","objectRef":{"resource":"namespaces","namespace":"openshift-ingress","name":"openshift-ingress","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2019-07-11T19:20:03.492627Z","stageTimestamp":"2019-07-11T19:20:03.495920Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"cluster-logging-metadata-reader\" of ClusterRole \"metadata-reader\" to ServiceAccount \"logcollector/openshift-logging\""}}

The SA logcollector is used by both fluentd and rsyslog.

Here is the formatted output:

{
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"cluster-logging-metadata-reader\" of ClusterRole \"metadata-reader\" to ServiceAccount \"logcollector/openshift-logging\""
  },
  "apiVersion": "audit.k8s.io/v1",
  "auditID": "36cfcb76-0e91-41e0-8606-3a75e58a779d",
  "kind": "Event",
  "level": "Metadata",
  "objectRef": {
    "apiVersion": "v1",
    "name": "router-default-57cf6578bb-97bzr",
    "namespace": "openshift-ingress",
    "resource": "pods"
  },
  "requestReceivedTimestamp": "2019-07-11T20:24:02.477244Z",
  "requestURI": "/api/v1/namespaces/openshift-ingress/pods/router-default-57cf6578bb-97bzr",
  "responseStatus": {
    "code": 200,
    "metadata": {}
  },
  "sourceIPs": [
    "10.0.129.192"
  ],
  "stage": "ResponseComplete",
  "stageTimestamp": "2019-07-11T20:24:02.480559Z",
  "user": {
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:openshift-logging",
      "system:authenticated"
    ],
    "uid": "95461827-a3f3-11e9-afc0-0a18e06541b4",
    "username": "system:serviceaccount:openshift-logging:logcollector"
  },
  "userAgent": "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.3p105",
  "verb": "get"
}

In the error case, we should see records with a responseStatus.code of 401, and hopefully some additional annotations or metadata to explain why we are getting the 401.
Fluentd log:

2019-07-12 06:57:42 +0000 [trace]: fluent/log.rb:281:trace: fetching pod metadata: openshift-network-operator/network-operator-5c9c6559b-h77b2
2019-07-12 06:57:42 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching pod metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc
2019-07-12 06:57:42 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-network-operator
2019-07-12 06:57:43 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc
2019-07-12 06:57:43 +0000 [trace]: fluent/log.rb:281:trace: orphaning message for: openshift-network-operator/network-operator-5c9c6559b-h77b2
2019-07-12 06:57:44 +0000 [trace]: fluent/log.rb:281:trace: fetching pod metadata: openshift-network-operator/network-operator-5c9c6559b-h77b2
2019-07-12 06:57:44 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching pod metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc
2019-07-12 06:57:44 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-network-operator
2019-07-12 06:57:45 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc
2019-07-12 06:57:45 +0000 [trace]: fluent/log.rb:281:trace: orphaning message for: openshift-network-operator/network-operator-5c9c6559b-h77b2

kube-apiserver audit log:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"36c21c88-5093-4617-9815-221b6a39e78d","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-network-operator","verb":"get","user":{},"sourceIPs":["10.129.0.48"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"namespaces","namespace":"openshift-network-operator","name":"openshift-network-operator","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-07-12T06:57:43.004051Z","stageTimestamp":"2019-07-12T06:57:43.004081Z"}

"user":{}

fluentd is not passing the credentials with the request. Looks like a bug in fluent-plugin-kubernetes_metadata_filter.

(In reply to Rich Megginson from comment #14)
> "user":{}
>
> fluentd is not passing the credentials with the request. Looks like a bug
> in fluent-plugin-kubernetes_metadata_filter

Indeed, the following get_pod and get_namespace calls are failing according to the logs in #c13:

https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/vendored_gem_src/fluent-plugin-kubernetes_metadata_filter/lib/fluent/plugin/filter_kubernetes_metadata.rb#L87
https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/vendored_gem_src/fluent-plugin-kubernetes_metadata_filter/lib/fluent/plugin/filter_kubernetes_metadata.rb#L129

Does that mean the auth_option passed to Kubeclient::Client.new is missing or broken?

https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/vendored_gem_src/fluent-plugin-kubernetes_metadata_filter/lib/fluent/plugin/filter_kubernetes_metadata.rb#L245-L250

Assuming the test is not setting bearer_token_file nor secret_dir in filter-k8s-meta.conf, can we double-check if this file exists and what's in it in a fluentd pod?

/var/run/secrets/kubernetes.io/serviceaccount/token

> Assuming the test is not setting bearer_token_file nor secret_dir in filter-k8s-meta.conf, can we double-check if this file exists and what's in it in a fluentd pod?
> /var/run/secrets/kubernetes.io/serviceaccount/token

That would be good to check at fluentd startup (and rsyslog startup), e.g. in the run.sh script.
(In reply to Rich Megginson from comment #16)
> > Assuming the test is not setting bearer_token_file nor secret_dir in filter-k8s-meta.conf, can we double-check if this file exists and what's in it in a fluentd pod?
> > /var/run/secrets/kubernetes.io/serviceaccount/token
>
> That would be good to check at fluentd startup (and rsyslog startup), e.g. in
> the run.sh script.

Do you mean adding the checking code to run.sh (and rsyslog.sh)? If so, let me work on it!

(In reply to Noriko Hosoi from comment #17)
> (In reply to Rich Megginson from comment #16)
> > > Assuming the test is not setting bearer_token_file nor secret_dir in filter-k8s-meta.conf, can we double-check if this file exists and what's in it in a fluentd pod?
> > > /var/run/secrets/kubernetes.io/serviceaccount/token
> >
> > That would be good to check at fluentd startup (and rsyslog startup), e.g. in
> > the run.sh script.
>
> Do you mean adding the checking code to run.sh (and rsyslog.sh)? If so, let
> me work on it!

Yes, that's what I meant.

(In reply to Rich Megginson from comment #16)
> > Assuming the test is not setting bearer_token_file nor secret_dir in filter-k8s-meta.conf, can we double-check if this file exists and what's in it in a fluentd pod?
> > /var/run/secrets/kubernetes.io/serviceaccount/token
>
> That would be good to check at fluentd startup (and rsyslog startup), e.g. in
> the run.sh script.

Opened PRs to check bearer_token_file:

fluentd - https://github.com/openshift/origin-aggregated-logging/pull/1702
rsyslog - https://github.com/openshift/cluster-logging-operator/pull/216

Not sure yet if it is the direct cause of this bug, so I'm leaving the bug status as ASSIGNED for now. Note that the PRs will not fix the bug, but they will hopefully allow us to determine the cause.

The bug status was automatically switched to POST by the bugzilla checking bot. :) @Qiaoling, please feel free to change it to ASSIGNED/FailedQA with your findings.

> More specifically, can we delete fluentd SA while fluentd is still working on logs by that?

You might be able to reproduce it that way, by deleting the SA, or the secret with the SA token, or both. In the past, it was not so easy to undeploy logging . . . so maybe much harder to reproduce.

> I wonder what happens if we delete serviceaccount [1] after deleting the other resources? Do we have any bad consequences?

That is probably ok. In general, we should try to delete objects in the reverse order that we created them. But I think that as long as you delete the Daemonset first, it shouldn't matter what order you delete the configmaps, secrets, etc. So I think deleting the SA and SA secret last should be fine.

This issue is not fixed in ose-cluster-logging-operator-v4.2.0-201908041300. Reopening this bug.
audit logs: {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"2f55ac91-f44f-409f-8828-e36acc2c9c6b","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-logging/secrets/logcollector-token-pqp7r","verb":"get","user":{"username":"system:apiserver","uid":"27f7e704-1377-4959-92da-21900e4410af","groups":["system:masters"]},"sourceIPs":["::1"],"userAgent":"hyperkube/v1.14.0+bf9534a (linux/amd64) kubernetes/bf9534a","objectRef":{"resource":"secrets","namespace":"openshift-logging","name":"logcollector-token-pqp7r","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"NotFound","code":404},"requestReceivedTimestamp":"2019-08-05T05:12:01.140008Z","stageTimestamp":"2019-08-05T05:12:01.141253Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}} {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"9232413e-0705-4a4d-a898-0d30d372d437","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-operators-redhat/pods/elasticsearch-operator-7b67d65659-sbk8c","verb":"get","user":{},"sourceIPs":["10.0.161.203"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"pods","namespace":"openshift-operators-redhat","name":"elasticsearch-operator-7b67d65659-sbk8c","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-08-05T05:12:01.151497Z","stageTimestamp":"2019-08-05T05:12:01.151531Z"} {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"bd3877ed-cd87-4a7b-83ec-23a80a6c7dee","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-logging/secrets/logcollector-token-pqp7r","verb":"get","user":{"username":"system:apiserver","uid":"27f7e704-1377-4959-92da-21900e4410af","groups":["system:masters"]},"sourceIPs":["::1"],"userAgent":"hyperkube/v1.14.0+bf9534a (linux/amd64) kubernetes/bf9534a","objectRef":{"resource":"secrets","namespace":"openshift-logging","name":"logcollector-token-pqp7r","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"NotFound","code":404},"requestReceivedTimestamp":"2019-08-05T05:12:01.905137Z","stageTimestamp":"2019-08-05T05:12:01.906574Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}} {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"1c2c5449-65ea-469a-956d-8fa3333e50da","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-sdn","verb":"get","user":{},"sourceIPs":["10.0.131.165"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"namespaces","namespace":"openshift-sdn","name":"openshift-sdn","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-08-05T05:12:01.916953Z","stageTimestamp":"2019-08-05T05:12:01.916990Z"} {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"02652404-107f-4d46-aa24-bb2f21b0e328","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-logging/secrets/logcollector-token-pqp7r","verb":"get","user":{"username":"system:apiserver","uid":"27f7e704-1377-4959-92da-21900e4410af","groups":["system:masters"]},"sourceIPs":["::1"],"userAgent":"hyperkube/v1.14.0+bf9534a (linux/amd64) 
kubernetes/bf9534a","objectRef":{"resource":"secrets","namespace":"openshift-logging","name":"logcollector-token-pqp7r","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"NotFound","code":404},"requestReceivedTimestamp":"2019-08-05T05:12:02.611130Z","stageTimestamp":"2019-08-05T05:12:02.612413Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}} {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"eb6cc0a9-2e86-4958-a7f0-576994fd7ded","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-sdn/pods/ovs-q58vj","verb":"get","user":{},"sourceIPs":["10.0.157.95"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"pods","namespace":"openshift-sdn","name":"ovs-q58vj","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-08-05T05:12:02.620634Z","stageTimestamp":"2019-08-05T05:12:02.620663Z"} {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"d8208d48-1b8f-456d-89a2-be1db5c8d6a4","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-monitoring/pods/prometheus-k8s-0","verb":"get","user":{},"sourceIPs":["10.0.131.165"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"pods","namespace":"openshift-monitoring","name":"prometheus-k8s-0","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-08-05T05:12:15.215025Z","stageTimestamp":"2019-08-05T05:12:15.215055Z"} {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"1b0a8aee-8126-4578-9e1d-f8658b10e005","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-logging/secrets/logcollector-token-pqp7r","verb":"get","user":{"username":"system:apiserver","uid":"27f7e704-1377-4959-92da-21900e4410af","groups":["system:masters"]},"sourceIPs":["::1"],"userAgent":"hyperkube/v1.14.0+bf9534a (linux/amd64) kubernetes/bf9534a","objectRef":{"resource":"secrets","namespace":"openshift-logging","name":"logcollector-token-pqp7r","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"NotFound","code":404},"requestReceivedTimestamp":"2019-08-05T05:12:15.223070Z","stageTimestamp":"2019-08-05T05:12:15.224381Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}} {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"1b06f70b-846e-40a8-8f50-433fea9fb2fa","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-monitoring","verb":"get","user":{},"sourceIPs":["10.0.131.165"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"namespaces","namespace":"openshift-monitoring","name":"openshift-monitoring","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-08-05T05:12:15.232701Z","stageTimestamp":"2019-08-05T05:12:15.232739Z"} logs in ES: { "_index": ".operations.2019.08.05", "_type": "com.redhat.viaq.common", "_id": "YzdjOTMwNzgtNzU0OC00YjgyLTliZDktODFmMmJjYzM1NmIw", "_score": 9.096395, "_source": { "docker": { "container_id": "9028976d21dbdf541285451544953d0ec903bf6d49a45d0304e2abe3f8383abc" }, "kubernetes": { "container_name": "rules-configmap-reloader", "namespace_name": ".orphaned", "pod_name": "prometheus-k8s-0", "orphaned_namespace": 
"openshift-monitoring", "namespace_id": "orphaned" }, "message": "2019/08/05 05:12:14 successfully triggered reload", "level": "unknown", "pipeline_metadata": { "collector": { "ipaddr4": "10.128.2.28", "ipaddr6": "fe80::683e:dff:fed5:b310", "inputname": "fluent-plugin-systemd", "name": "fluentd", "received_at": "2019-08-05T05:12:15.233568+00:00", "version": "1.6.2 1.6.0" } }, "@timestamp": "2019-08-05T05:12:14.327828+00:00", "viaq_msg_id": "YzdjOTMwNzgtNzU0OC00YjgyLTliZDktODFmMmJjYzM1NmIw" } $ oc exec elasticsearch-cdm-68e3yxff-1-8f5c4bf57-2hts9 -- es_util --query=.operations.*/_count -d '{"query": { "match": {"kubernetes.namespace_name" : ".orphaned" }}}' |jq Defaulting container name to elasticsearch. Use 'oc describe pod/elasticsearch-cdm-68e3yxff-1-8f5c4bf57-2hts9 -n openshift-logging' to see all of the containers in this pod. { "count": 9, "_shards": { "total": 2, "successful": 2, "skipped": 0, "failed": 0 } } Still there's some period the "user" is not available... The session starting from the second log below has no user value although the previous has. {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"02652404-107f-4d46-aa24-bb2f21b0e328","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-logging/secrets/logcollector-token-pqp7r","verb":"get","user":{"username":"system:apiserver","uid":"27f7e704-1377-4959-92da-21900e4410af","groups":["system:masters"]},"sourceIPs":["::1"],"userAgent":"hyperkube/v1.14.0+bf9534a (linux/amd64) kubernetes/bf9534a","objectRef":{"resource":"secrets","namespace":"openshift-logging","name":"logcollector-token-pqp7r","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"NotFound","code":404},"requestReceivedTimestamp":"2019-08-05T05:12:02.611130Z","stageTimestamp":"2019-08-05T05:12:02.612413Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}} {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"eb6cc0a9-2e86-4958-a7f0-576994fd7ded","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-sdn/pods/ovs-q58vj","verb":"get","user":{},"sourceIPs":["10.0.157.95"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"pods","namespace":"openshift-sdn","name":"ovs-q58vj","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-08-05T05:12:02.620634Z","stageTimestamp":"2019-08-05T05:12:02.620663Z"} Hi @jcantril, Could you please help us understanding the issue? In the following code, does RemoveServiceAccount wait until either of removeFluentd or removeRsyslog completely removes its daemonset? Or requests are not serialized but asynchronous? https://github.com/openshift/cluster-logging-operator/blob/master/pkg/k8shandler/collection.go#L148-L164 We are having an issue that logs are processed without a service account for a short period when deleting and recreating the CLO. I moved the RemoveServiceAccount code after removeFluentd / removeRsyslog, but the problem still exists. If the timing of removing service account is the cause of this issue, what'd be the best way to determine the time to remove the service account? For instance, we could get the daemonset list by this code. Can we wait until it returns empty? 
https://github.com/openshift/cluster-logging-operator/blob/master/pkg/k8shandler/status.go

(In reply to Noriko Hosoi from comment #36)
> Hi @jcantril,
>
> Could you please help us understand the issue? In the following code, does
> RemoveServiceAccount wait until either removeFluentd or removeRsyslog
> completely removes its daemonset? Or are the requests not serialized but
> asynchronous?
>
> https://github.com/openshift/cluster-logging-operator/blob/master/pkg/k8shandler/collection.go#L148-L164

Are you sure that code is being called when the clusterlogging instance is deleted?

    if cluster.Spec.Collection.Logs.Type != logging.LogCollectionTypeFluentd &&
        cluster.Spec.Collection.Logs.Type != logging.LogCollectionTypeRsyslog {

        if err = clusterRequest.RemoveServiceAccount("logcollector"); err != nil {
            return
        }

        if err = clusterRequest.RemovePriorityClass(clusterLoggingPriorityClassName); err != nil {
            return
        }
    }

That would mean that when you do `oc delete clusterlogging instance`, this code is called with cluster.Spec.Collection.Logs.Type != logging.LogCollectionTypeFluentd && cluster.Spec.Collection.Logs.Type != logging.LogCollectionTypeRsyslog - is that even possible?

> 2. delete clusterlogging instance to undeploy logging EFK

What actually happens when `oc delete clusterlogging instance` is called? Does it go through the operator code? Or does Kubernetes itself delete the objects?

If it is the latter, we may need to set some sort of "grace period" on the SA or SA secret, so that it "waits" until the rsyslog container exits, or something like that.

I suspect that the clo isn't called in this case - we could add some tracing to the clo Reconcile function to see if it is called when the clusterlogging instance is deleted.

> We are having an issue that logs are processed without a service account for
> a short period when deleting and recreating the CLO. I moved the
> RemoveServiceAccount code after removeFluentd / removeRsyslog, but the
> problem still exists.
>
> If the timing of removing the service account is the cause of this issue, what'd
> be the best way to determine the time to remove the service account? For
> instance, we could get the daemonset list by this code. Can we wait until
> it returns empty?
>
> https://github.com/openshift/cluster-logging-operator/blob/master/pkg/k8shandler/status.go

Thank you for your input, Jeff.

Jeff Cantrill: isnt there a delay set on the fluent termination...maybe it needs to be shortened... i thought we had a request to do that anyway
https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-terminating-with-grace
i believe it only applies to pods

Tested with ose-cluster-logging-operator-v4.2.0-201908271819, unable to reproduce this issue, moving to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
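For context, the alternative floated in the thread above - polling until the collector daemonsets are gone before removing the logcollector service account - could look roughly like the sketch below. This is a hypothetical illustration only (the package and function names are invented), not the fix that shipped, which instead relies on ownerReferences with blockOwnerDeletion and the foregroundDeletion finalizer as described in the Doc Text; the Get call uses the pre-context client-go signature of that era.

// Hypothetical sketch of "wait until the collector daemonsets are gone
// before removing the service account"; not the merged fix.
package collector

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForCollectorsGone polls until neither the fluentd nor the rsyslog
// DaemonSet exists in the namespace, then returns so the caller could
// safely delete the logcollector ServiceAccount afterwards.
func waitForCollectorsGone(client kubernetes.Interface, namespace string) error {
	return wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
		for _, name := range []string{"fluentd", "rsyslog"} {
			_, err := client.AppsV1().DaemonSets(namespace).Get(name, metav1.GetOptions{})
			if err == nil {
				return false, nil // daemonset still present, keep waiting
			}
			if !errors.IsNotFound(err) {
				return false, fmt.Errorf("checking daemonset %s: %v", name, err)
			}
		}
		return true, nil // both daemonsets are gone
	})
}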