Description of problem:
Some orphaned logs are sent to the .operations.* index by mistake.

{
  "_index": ".operations.2019.05.30",
  "_type": "com.redhat.viaq.common",
  "_id": "ZTg4MmI0MzgtMTc5Ny00M2JiLTlkNWQtNmIwOWNmN2Y4ZTM4",
  "_version": 1,
  "_score": null,
  "_source": {
    "docker": {
      "container_id": "75bd2cb09b9e0652397c804f1135848e728006b156793ec9116464d638ea1ec8"
    },
    "kubernetes": {
      "container_name": "sdn",
      "namespace_name": ".orphaned",
      "pod_name": "sdn-z9gm7",
      "orphaned_namespace": "openshift-sdn",
      "namespace_id": "orphaned"
    },
    "message": "I0530 05:57:46.349079 1434 roundrobin.go:338] LoadBalancerRR: Removing endpoints for openshift-logging/elasticsearch-cluster:elasticsearch",
    "level": "err",
    "hostname": "localhost.localdomain",
    "pipeline_metadata": {
      "collector": {
        "ipaddr4": "10.128.2.18",
        "ipaddr6": "fe80::50f8:7cff:fe26:884c",
        "inputname": "fluent-plugin-systemd",
        "name": "fluentd",
        "received_at": "2019-05-30T05:57:46.942489+00:00",
        "version": "1.4.0 1.6.0"
      }
    },
    "@timestamp": "2019-05-30T05:57:46.349135+00:00",
    "viaq_msg_id": "ZTg4MmI0MzgtMTc5Ny00M2JiLTlkNWQtNmIwOWNmN2Y4ZTM4"
  },
  "fields": {
    "@timestamp": [1559195866349],
    "pipeline_metadata.collector.received_at": [1559195866942]
  },
  "highlight": {
    "kubernetes.namespace_name": ["@kibana-highlighted-field@.orphaned@/kibana-highlighted-field@"]
  },
  "sort": [1559195866349]
}

$ oc exec elasticsearch-cdm-xeolone6-1-77764d496-j9r9g -- es_util --query=.operations.*/_count -d '{"query": { "match": {"kubernetes.namespace_name" : ".orphaned" }}}' | jq
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-xeolone6-1-77764d496-j9r9g -n openshift-logging' to see all of the containers in this pod.
{
  "count": 87,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  }
}

Version-Release number of selected component (if applicable):
openshift/ose-logging-elasticsearch5:v4.1.0-201905191700
openshift/ose-logging-fluentd:v4.1.0-201905191700
4.1.0-0.nightly-2019-05-29-220142

How reproducible:
Always

Steps to Reproduce:
1. Deploy logging and wait for a few minutes.
2. Delete the clusterlogging instance to undeploy the logging EFK stack.
3. Wait until all the EFK resources are removed, then create the clusterlogging instance again.
4. Check the logs in the .operations.* index.

Actual results:
Some orphaned logs are sent to the .operations.* index by mistake.

Expected results:
Orphaned logs should be sent to the .orphaned.* index.

Additional info:
I checked these orphaned logs: their kubernetes.orphaned_namespace values are "openshift-sdn", "openshift-monitoring", and "openshift-operators-redhat", and the messages are related to deleting the logging EFK resources. A fresh install doesn't have this issue.
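For reference, a per-namespace count of the misrouted records can be obtained with a query along these lines (this reuses the pod name and the field names from the record above; adjust the kubernetes.orphaned_namespace value as needed):

$ oc exec elasticsearch-cdm-xeolone6-1-77764d496-j9r9g -- es_util --query=.operations.*/_count -d '{"query": {"bool": {"must": [{"match": {"kubernetes.namespace_name": ".orphaned"}}, {"match": {"kubernetes.orphaned_namespace": "openshift-sdn"}}]}}}' | jq .count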
I also encountered this issue in another environment: when route53 had some problems, some orphaned pod logs whose kubernetes.orphaned_namespace was `openshift-*` were sent to the ".operations.*" index instead of the ".orphaned.*" index. It seems that logs with a kubernetes.orphaned_namespace of "openshift-*" are sent to the .operations.* index. See the logs in the attachment.
Hi @Qiaoling, did you have a chance to rerun this failure case recently? According to #c0, the bug was observed with the build 4.1.0-0.nightly-2019-05-29-220142. I'd like you to try a build that includes this fix, i.e., 4.1.0-0.nightly-2019-06-21 or newer.

commit 78bcf075887d32de58a15912b473ae721d4a6cbe (origin/pr/1678)
Author: Rich Megginson <rmeggins>
Date:   Fri Jun 21 12:08:40 2019 -0600

    Bug 1722898: Logging data from all projects are stored to .orphaned indexes with Elasticsearch

Thanks!
This could be reproduced in 4.1.0-0.nightly-2019-07-07-154416, and the images are:

ose-cluster-logging-operator:v4.1.6-201907070809
ose-logging-elasticsearch5:v4.1.6-201907070809
ose-logging-fluentd:v4.1.6-201907070809

{
  "_index": ".operations.2019.07.09",
  "_type": "com.redhat.viaq.common",
  "_id": "OWUyOGNiNzktMTZkMi00NmYwLTg4NzctZTI2MjkzZDZjZjcw",
  "_version": 1,
  "_score": null,
  "_source": {
    "docker": {
      "container_id": "f8e24c405c7475d87dcc44793e81a0554f5cda33367a5e2093e94bb07149fcf6"
    },
    "kubernetes": {
      "container_name": "elasticsearch-operator",
      "namespace_name": ".orphaned",
      "pod_name": "elasticsearch-operator-5c9d944579-rqz7f",
      "orphaned_namespace": "openshift-operators-redhat",
      "namespace_id": "orphaned"
    },
    "message": "time=\"2019-07-09T02:01:39Z\" level=info msg=\"Flushing nodes for cluster elasticsearch in openshift-logging\"",
    "level": "info",
    "hostname": "localhost.localdomain",
    "pipeline_metadata": {
      "collector": {
        "ipaddr4": "10.129.2.16",
        "ipaddr6": "fe80::4865:f5ff:fe6f:1f94",
        "inputname": "fluent-plugin-systemd",
        "name": "fluentd",
        "received_at": "2019-07-09T02:01:39.750431+00:00",
        "version": "1.5.1 1.6.0"
      }
    },
    "@timestamp": "2019-07-09T02:01:39.085290+00:00",
    "viaq_msg_id": "OWUyOGNiNzktMTZkMi00NmYwLTg4NzctZTI2MjkzZDZjZjcw"
  },
  "fields": {
    "@timestamp": [1562637699085],
    "pipeline_metadata.collector.received_at": [1562637699750]
  },
  "highlight": {
    "kubernetes.namespace_name": ["@kibana-highlighted-field@.orphaned@/kibana-highlighted-field@"]
  },
  "sort": [1562637699085]
}
Please set the fluentd log level to trace:

oc edit configmap fluentd

Replace this line:

@include configs.d/openshift/system.conf

with this:

<system>
  @log_level trace
</system>

And provide the fluentd logs.
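Note that fluentd only reads its configuration at startup, so after editing the configmap the collector pods have to be restarted to pick up the new log level. A rough sketch, assuming the default openshift-logging namespace and the component=fluentd pod label:

# restart the collector pods so they re-read the configmap
oc -n openshift-logging delete pod -l component=fluentd
# once they are running again, collect the logs, e.g. one file per pod
for p in $(oc -n openshift-logging get pods -l component=fluentd -o name); do oc -n openshift-logging logs $p > $(basename $p).log; done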
I'm also interested in the environment variables set in the fluentd pod. Please set VERBOSE to true and share the fluentd log. Thanks!

if [[ $VERBOSE ]]; then
  set -ex
  fluentdargs="$fluentdargs -vv --log-event-verbose"
  echo ">>>>>> ENVIRONMENT VARS <<<<<"
  env | sort
  echo ">>>>>>>>>>>>><<<<<<<<<<<<<<<<"
else
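Setting the variable on the daemonset should be enough to get that output; a quick sketch (the namespace and daemonset names below are the defaults, and the cluster-logging operator may revert manual daemonset changes unless the clusterlogging instance is set to Unmanaged):

oc -n openshift-logging set env daemonset/fluentd VERBOSE=true
# the daemonset rolls out new pods with the variable set; then collect the fluentd logs again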
Created attachment 1588951 [details] fluentd pod logs
This is interesting:

grep orphaned_namespace orphaned-log-in-es | sort -u
    "orphaned_namespace" : "openshift-machine-api",
    "orphaned_namespace" : "openshift-monitoring",
    "orphaned_namespace" : "openshift-operator-lifecycle-manager",

Only those 3 namespaces.

logs/fluentd-2lrcb.log:2019-07-10 01:27:16 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-machine-api
logs/fluentd-2lrcb.log-2019-07-10 01:27:16 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc

logs/fluentd-vmrl6.log:2019-07-10 01:27:18 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-operator-lifecycle-manager
logs/fluentd-vmrl6.log:2019-07-10 01:27:18 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc

logs/fluentd-l8tsm.log:2019-07-10 01:27:10 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-monitoring
logs/fluentd-l8tsm.log-2019-07-10 01:27:10 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc

There are a lot of these errors during the time of the orphaned records, from 01:27:10 to 01:27:40, and only from 4 of the fluentd pods: fluentd-2lrcb, fluentd-ksq8d, fluentd-l8tsm, fluentd-vmrl6.

I'm not sure how the records in orphaned-log-in-es were sorted, but they all have a @timestamp at or around this time frame.

We'll need to look at the API server logs to figure out why the API server is returning a 401 Unauthorized during this time period. I don't think I have ever seen this particular problem before.
(In reply to Rich Megginson from comment #8)
> This is interesting:
>
> grep orphaned_namespace orphaned-log-in-es | sort -u
>     "orphaned_namespace" : "openshift-machine-api",
>     "orphaned_namespace" : "openshift-monitoring",
>     "orphaned_namespace" : "openshift-operator-lifecycle-manager",
>
> Only those 3 namespaces.

I also found orphaned_namespace values of "openshift-logging" and "openshift-operators-redhat" when I reproduced this issue.

> There are a lot of these errors during the time of the orphaned records, from
> 01:27:10 to 01:27:40, and only from 4 of the fluentd pods: fluentd-2lrcb,
> fluentd-ksq8d, fluentd-l8tsm, fluentd-vmrl6.
>
> I'm not sure how the records in orphaned-log-in-es were sorted, but they all
> have a @timestamp at or around this time frame.

I found these logs by executing the following in the ES pod:

curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key 'https://localhost:9200/.operations.*/_search?pretty&size=1000' -d '{"query":{"match": {"kubernetes.namespace_name": ".orphaned"}}}'

> We'll need to look at the API server logs to figure out why the API server
> is returning a 401 Unauthorized during this time period. I don't think I
> have ever seen this particular problem before.

I checked the pod logs in the openshift-apiserver namespace (not sure if I checked the correct resources) and only see logs like:

E0710 05:51:03.580677       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E0710 05:51:55.786022       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W0710 05:51:55.786417       1 reflector.go:256] github.com/openshift/client-go/image/informers/externalversions/factory.go:101: watch of *v1.ImageStream ended with: The resourceVersion for the provided watch is too old.

I can see a lot of these logs, even when the EFK pods are not running.
(In reply to Qiaoling Tang from comment #9)
> (In reply to Rich Megginson from comment #8)
> > This is interesting:
> >
> > grep orphaned_namespace orphaned-log-in-es | sort -u
> >     "orphaned_namespace" : "openshift-machine-api",
> >     "orphaned_namespace" : "openshift-monitoring",
> >     "orphaned_namespace" : "openshift-operator-lifecycle-manager",
> >
> > Only those 3 namespaces.
>
> I also found orphaned_namespace values of "openshift-logging" and
> "openshift-operators-redhat" when I reproduced this issue.

I'm assuming it is a timing issue - any namespace lookups during the time interval when the API server is returning 401 are going to fail.

> I found these logs by executing the following in the ES pod:
>
> curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key 'https://localhost:9200/.operations.*/_search?pretty&size=1000' -d '{"query":{"match": {"kubernetes.namespace_name": ".orphaned"}}}'

Ok - you can also add `sort` to the url:

?pretty&size=1000&sort=@timestamp:desc

to sort from latest to earliest, or use @timestamp:asc for earliest to latest.

> > We'll need to look at the API server logs to figure out why the API server
> > is returning a 401 Unauthorized during this time period. I don't think I
> > have ever seen this particular problem before.
>
> I checked the pod logs in the openshift-apiserver namespace (not sure if I
> checked the correct resources) and only see logs like:
>
> E0710 05:51:03.580677       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
> E0710 05:51:55.786022       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
> W0710 05:51:55.786417       1 reflector.go:256] github.com/openshift/client-go/image/informers/externalversions/factory.go:101: watch of *v1.ImageStream ended with: The resourceVersion for the provided watch is too old.
>
> I can see a lot of these logs, even when the EFK pods are not running.

I would expect to see log messages like "... GET /api/v1/namespaces/openshift-monitoring 401 ..." somewhere. Maybe we need to turn up the log level, or enable audit logging.
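Putting that together with the query from the previous comment, the full command (run in the ES pod) would look like this:

curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key 'https://localhost:9200/.operations.*/_search?pretty&size=1000&sort=@timestamp:desc' -d '{"query":{"match": {"kubernetes.namespace_name": ".orphaned"}}}'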
I checked all the logs around the time when the issue occurred and filtered the logs gathered by the same fluentd pod; I found that some of the logs were parsed correctly.
We need to get the Kubernetes API server logs:
https://docs.openshift.com/container-platform/4.1/nodes/nodes/nodes-nodes-audit-log.html

I know in the CI tests that audit logging is enabled by default. For example, here is what a fluentd request looks like:

/var/log/kube-apiserver/audit.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"79b7fe78-34ce-40d3-8c23-86481022478f","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-ingress","verb":"get","user":{"username":"system:serviceaccount:openshift-logging:logcollector","uid":"95461827-a3f3-11e9-afc0-0a18e06541b4","groups":["system:serviceaccounts","system:serviceaccounts:openshift-logging","system:authenticated"]},"sourceIPs":["10.0.129.192"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.3p105","objectRef":{"resource":"namespaces","namespace":"openshift-ingress","name":"openshift-ingress","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2019-07-11T19:20:03.492627Z","stageTimestamp":"2019-07-11T19:20:03.495920Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"cluster-logging-metadata-reader\" of ClusterRole \"metadata-reader\" to ServiceAccount \"logcollector/openshift-logging\""}}

The SA logcollector is used by both fluentd and rsyslog. Here is the formatted output:

{
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"cluster-logging-metadata-reader\" of ClusterRole \"metadata-reader\" to ServiceAccount \"logcollector/openshift-logging\""
  },
  "apiVersion": "audit.k8s.io/v1",
  "auditID": "36cfcb76-0e91-41e0-8606-3a75e58a779d",
  "kind": "Event",
  "level": "Metadata",
  "objectRef": {
    "apiVersion": "v1",
    "name": "router-default-57cf6578bb-97bzr",
    "namespace": "openshift-ingress",
    "resource": "pods"
  },
  "requestReceivedTimestamp": "2019-07-11T20:24:02.477244Z",
  "requestURI": "/api/v1/namespaces/openshift-ingress/pods/router-default-57cf6578bb-97bzr",
  "responseStatus": {
    "code": 200,
    "metadata": {}
  },
  "sourceIPs": [
    "10.0.129.192"
  ],
  "stage": "ResponseComplete",
  "stageTimestamp": "2019-07-11T20:24:02.480559Z",
  "user": {
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:openshift-logging",
      "system:authenticated"
    ],
    "uid": "95461827-a3f3-11e9-afc0-0a18e06541b4",
    "username": "system:serviceaccount:openshift-logging:logcollector"
  },
  "userAgent": "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.3p105",
  "verb": "get"
}

In the error case, we should see records with a responseStatus.code of 401, and hopefully some additional annotations or metadata to explain why we are getting the 401.
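To pull just the failing collector requests out of the audit log, a filter along these lines should work (run on a master node where audit.log lives; assumes jq is available there):

# 401 responses for requests coming from the collector's ruby rest-client user agent
grep '"code":401' /var/log/kube-apiserver/audit.log | grep '"userAgent":"rest-client' | jq -c '{time: .requestReceivedTimestamp, uri: .requestURI, user: .user, code: .responseStatus.code}'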
Fluentd log:

2019-07-12 06:57:42 +0000 [trace]: fluent/log.rb:281:trace: fetching pod metadata: openshift-network-operator/network-operator-5c9c6559b-h77b2
2019-07-12 06:57:42 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching pod metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc
2019-07-12 06:57:42 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-network-operator
2019-07-12 06:57:43 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc
2019-07-12 06:57:43 +0000 [trace]: fluent/log.rb:281:trace: orphaning message for: openshift-network-operator/network-operator-5c9c6559b-h77b2
2019-07-12 06:57:44 +0000 [trace]: fluent/log.rb:281:trace: fetching pod metadata: openshift-network-operator/network-operator-5c9c6559b-h77b2
2019-07-12 06:57:44 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching pod metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc
2019-07-12 06:57:44 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-network-operator
2019-07-12 06:57:45 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc
2019-07-12 06:57:45 +0000 [trace]: fluent/log.rb:281:trace: orphaning message for: openshift-network-operator/network-operator-5c9c6559b-h77b2

kube-apiserver audit log:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"36c21c88-5093-4617-9815-221b6a39e78d","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-network-operator","verb":"get","user":{},"sourceIPs":["10.129.0.48"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"namespaces","namespace":"openshift-network-operator","name":"openshift-network-operator","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-07-12T06:57:43.004051Z","stageTimestamp":"2019-07-12T06:57:43.004081Z"}
"user":{} fluentd is not passing the credentials with the request. Looks like a bug in fluent-plugin-kubernetes_metadata_filter
(In reply to Rich Megginson from comment #14)
> "user":{}
>
> fluentd is not passing the credentials with the request. Looks like a bug
> in fluent-plugin-kubernetes_metadata_filter

Indeed, the following get_pod and get_namespace calls are failing according to the logs in #c13:

https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/vendored_gem_src/fluent-plugin-kubernetes_metadata_filter/lib/fluent/plugin/filter_kubernetes_metadata.rb#L87
https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/vendored_gem_src/fluent-plugin-kubernetes_metadata_filter/lib/fluent/plugin/filter_kubernetes_metadata.rb#L129

Does that mean the auth_options passed to Kubeclient::Client.new are missing or broken?

https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/vendored_gem_src/fluent-plugin-kubernetes_metadata_filter/lib/fluent/plugin/filter_kubernetes_metadata.rb#L245-L250

Assuming the test is not setting bearer_token_file nor secret_dir in filter-k8s-meta.conf, can we double-check whether this file exists and what's in it in a fluentd pod?

/var/run/secrets/kubernetes.io/serviceaccount/token
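For example, something like this should tell us (the pod name is a placeholder):

oc -n openshift-logging exec <fluentd-pod> -- ls -l /var/run/secrets/kubernetes.io/serviceaccount/
oc -n openshift-logging exec <fluentd-pod> -- wc -c /var/run/secrets/kubernetes.io/serviceaccount/token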
> Assuming the test is not setting bearer_token_file nor secret_dir in filter-k8s-meta.conf, can we double-check whether this file exists and what's in it in a fluentd pod?
> /var/run/secrets/kubernetes.io/serviceaccount/token

That would be good to check at fluentd startup (and rsyslog startup), e.g. in the run.sh script.
(In reply to Rich Megginson from comment #16)
> > Assuming the test is not setting bearer_token_file nor secret_dir in filter-k8s-meta.conf, can we double-check whether this file exists and what's in it in a fluentd pod?
> > /var/run/secrets/kubernetes.io/serviceaccount/token
>
> That would be good to check at fluentd startup (and rsyslog startup), e.g. in the run.sh script.

Do you mean adding the checking code to run.sh (and rsyslog.sh)? If so, let me work on it!
(In reply to Noriko Hosoi from comment #17)
> (In reply to Rich Megginson from comment #16)
> > > Assuming the test is not setting bearer_token_file nor secret_dir in filter-k8s-meta.conf, can we double-check whether this file exists and what's in it in a fluentd pod?
> > > /var/run/secrets/kubernetes.io/serviceaccount/token
> >
> > That would be good to check at fluentd startup (and rsyslog startup), e.g. in the run.sh script.
>
> Do you mean adding the checking code to run.sh (and rsyslog.sh)? If so, let
> me work on it!

Yes, that's what I meant.
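Roughly something like this, just as a sketch of the idea (not the actual change):

# near the top of run.sh / rsyslog.sh - warn early if the SA token is not mounted
token_file=/var/run/secrets/kubernetes.io/serviceaccount/token
if [ ! -s "$token_file" ] ; then
    echo "ERROR: service account token file $token_file is missing or empty - Kubernetes metadata lookups will fail" >&2
fi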
(In reply to Rich Megginson from comment #16)
> > Assuming the test is not setting bearer_token_file nor secret_dir in filter-k8s-meta.conf, can we double-check whether this file exists and what's in it in a fluentd pod?
> > /var/run/secrets/kubernetes.io/serviceaccount/token
>
> That would be good to check at fluentd startup (and rsyslog startup), e.g. in the run.sh script.

Opened PRs to check bearer_token_file:

fluentd - https://github.com/openshift/origin-aggregated-logging/pull/1702
rsyslog - https://github.com/openshift/cluster-logging-operator/pull/216

I'm not sure yet whether this is the direct cause of this bug, so I'm leaving the bug status as ASSIGNED for now.
Note that the PRs will not fix the bug, but they will hopefully allow us to determine the cause.
The bug status was automatically switched to POST by the bugzilla checking bot. :) @Qiaoling, please feel free to change it to ASSIGNED/FailedQA with your findings.
> More specifically, can we delete the fluentd SA while fluentd is still processing logs with it?

You might be able to reproduce it that way, by deleting the SA, or the secret with the SA token, or both. In the past it was not so easy to undeploy logging . . . so it may have been much harder to reproduce.

> I wonder what happens if we delete the serviceaccount [1] after deleting the other resources? Do we have any bad consequences?

That is probably ok. In general, we should try to delete objects in the reverse order that we created them. But I think that as long as you delete the Daemonset first, it shouldn't matter what order you delete the configmaps, secrets, etc. So I think deleting the SA and SA secret last should be fine.
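For example, something along these lines might trigger it while fluentd is running (names are placeholders; the token secret has a random suffix):

oc -n openshift-logging get secret | grep logcollector-token
oc -n openshift-logging delete secret logcollector-token-<suffix>
oc -n openshift-logging delete serviceaccount logcollector
# then watch a collector pod for the 401s from the metadata filter
oc -n openshift-logging logs -f <fluentd-pod> | grep -i unauthorized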
This issue is not fixed in ose-cluster-logging-operator-v4.2.0-201908041300. Reopening this bug.
audit logs:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"2f55ac91-f44f-409f-8828-e36acc2c9c6b","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-logging/secrets/logcollector-token-pqp7r","verb":"get","user":{"username":"system:apiserver","uid":"27f7e704-1377-4959-92da-21900e4410af","groups":["system:masters"]},"sourceIPs":["::1"],"userAgent":"hyperkube/v1.14.0+bf9534a (linux/amd64) kubernetes/bf9534a","objectRef":{"resource":"secrets","namespace":"openshift-logging","name":"logcollector-token-pqp7r","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"NotFound","code":404},"requestReceivedTimestamp":"2019-08-05T05:12:01.140008Z","stageTimestamp":"2019-08-05T05:12:01.141253Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"9232413e-0705-4a4d-a898-0d30d372d437","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-operators-redhat/pods/elasticsearch-operator-7b67d65659-sbk8c","verb":"get","user":{},"sourceIPs":["10.0.161.203"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"pods","namespace":"openshift-operators-redhat","name":"elasticsearch-operator-7b67d65659-sbk8c","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-08-05T05:12:01.151497Z","stageTimestamp":"2019-08-05T05:12:01.151531Z"}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"bd3877ed-cd87-4a7b-83ec-23a80a6c7dee","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-logging/secrets/logcollector-token-pqp7r","verb":"get","user":{"username":"system:apiserver","uid":"27f7e704-1377-4959-92da-21900e4410af","groups":["system:masters"]},"sourceIPs":["::1"],"userAgent":"hyperkube/v1.14.0+bf9534a (linux/amd64) kubernetes/bf9534a","objectRef":{"resource":"secrets","namespace":"openshift-logging","name":"logcollector-token-pqp7r","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"NotFound","code":404},"requestReceivedTimestamp":"2019-08-05T05:12:01.905137Z","stageTimestamp":"2019-08-05T05:12:01.906574Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"1c2c5449-65ea-469a-956d-8fa3333e50da","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-sdn","verb":"get","user":{},"sourceIPs":["10.0.131.165"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"namespaces","namespace":"openshift-sdn","name":"openshift-sdn","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-08-05T05:12:01.916953Z","stageTimestamp":"2019-08-05T05:12:01.916990Z"}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"02652404-107f-4d46-aa24-bb2f21b0e328","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-logging/secrets/logcollector-token-pqp7r","verb":"get","user":{"username":"system:apiserver","uid":"27f7e704-1377-4959-92da-21900e4410af","groups":["system:masters"]},"sourceIPs":["::1"],"userAgent":"hyperkube/v1.14.0+bf9534a (linux/amd64) kubernetes/bf9534a","objectRef":{"resource":"secrets","namespace":"openshift-logging","name":"logcollector-token-pqp7r","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"NotFound","code":404},"requestReceivedTimestamp":"2019-08-05T05:12:02.611130Z","stageTimestamp":"2019-08-05T05:12:02.612413Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"eb6cc0a9-2e86-4958-a7f0-576994fd7ded","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-sdn/pods/ovs-q58vj","verb":"get","user":{},"sourceIPs":["10.0.157.95"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"pods","namespace":"openshift-sdn","name":"ovs-q58vj","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-08-05T05:12:02.620634Z","stageTimestamp":"2019-08-05T05:12:02.620663Z"}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"d8208d48-1b8f-456d-89a2-be1db5c8d6a4","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-monitoring/pods/prometheus-k8s-0","verb":"get","user":{},"sourceIPs":["10.0.131.165"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"pods","namespace":"openshift-monitoring","name":"prometheus-k8s-0","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-08-05T05:12:15.215025Z","stageTimestamp":"2019-08-05T05:12:15.215055Z"}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"1b0a8aee-8126-4578-9e1d-f8658b10e005","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-logging/secrets/logcollector-token-pqp7r","verb":"get","user":{"username":"system:apiserver","uid":"27f7e704-1377-4959-92da-21900e4410af","groups":["system:masters"]},"sourceIPs":["::1"],"userAgent":"hyperkube/v1.14.0+bf9534a (linux/amd64) kubernetes/bf9534a","objectRef":{"resource":"secrets","namespace":"openshift-logging","name":"logcollector-token-pqp7r","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"NotFound","code":404},"requestReceivedTimestamp":"2019-08-05T05:12:15.223070Z","stageTimestamp":"2019-08-05T05:12:15.224381Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"1b06f70b-846e-40a8-8f50-433fea9fb2fa","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-monitoring","verb":"get","user":{},"sourceIPs":["10.0.131.165"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"namespaces","namespace":"openshift-monitoring","name":"openshift-monitoring","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-08-05T05:12:15.232701Z","stageTimestamp":"2019-08-05T05:12:15.232739Z"}
logs in ES:

{
  "_index": ".operations.2019.08.05",
  "_type": "com.redhat.viaq.common",
  "_id": "YzdjOTMwNzgtNzU0OC00YjgyLTliZDktODFmMmJjYzM1NmIw",
  "_score": 9.096395,
  "_source": {
    "docker": {
      "container_id": "9028976d21dbdf541285451544953d0ec903bf6d49a45d0304e2abe3f8383abc"
    },
    "kubernetes": {
      "container_name": "rules-configmap-reloader",
      "namespace_name": ".orphaned",
      "pod_name": "prometheus-k8s-0",
      "orphaned_namespace": "openshift-monitoring",
      "namespace_id": "orphaned"
    },
    "message": "2019/08/05 05:12:14 successfully triggered reload",
    "level": "unknown",
    "pipeline_metadata": {
      "collector": {
        "ipaddr4": "10.128.2.28",
        "ipaddr6": "fe80::683e:dff:fed5:b310",
        "inputname": "fluent-plugin-systemd",
        "name": "fluentd",
        "received_at": "2019-08-05T05:12:15.233568+00:00",
        "version": "1.6.2 1.6.0"
      }
    },
    "@timestamp": "2019-08-05T05:12:14.327828+00:00",
    "viaq_msg_id": "YzdjOTMwNzgtNzU0OC00YjgyLTliZDktODFmMmJjYzM1NmIw"
  }
}

$ oc exec elasticsearch-cdm-68e3yxff-1-8f5c4bf57-2hts9 -- es_util --query=.operations.*/_count -d '{"query": { "match": {"kubernetes.namespace_name" : ".orphaned" }}}' | jq
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-68e3yxff-1-8f5c4bf57-2hts9 -n openshift-logging' to see all of the containers in this pod.
{
  "count": 9,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  }
}
There is still a period during which the "user" is not available... The session starting with the second log entry below has no user value, although the previous one does.

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"02652404-107f-4d46-aa24-bb2f21b0e328","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-logging/secrets/logcollector-token-pqp7r","verb":"get","user":{"username":"system:apiserver","uid":"27f7e704-1377-4959-92da-21900e4410af","groups":["system:masters"]},"sourceIPs":["::1"],"userAgent":"hyperkube/v1.14.0+bf9534a (linux/amd64) kubernetes/bf9534a","objectRef":{"resource":"secrets","namespace":"openshift-logging","name":"logcollector-token-pqp7r","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"NotFound","code":404},"requestReceivedTimestamp":"2019-08-05T05:12:02.611130Z","stageTimestamp":"2019-08-05T05:12:02.612413Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"eb6cc0a9-2e86-4958-a7f0-576994fd7ded","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-sdn/pods/ovs-q58vj","verb":"get","user":{},"sourceIPs":["10.0.157.95"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"pods","namespace":"openshift-sdn","name":"ovs-q58vj","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-08-05T05:12:02.620634Z","stageTimestamp":"2019-08-05T05:12:02.620663Z"}
Hi @jcantril,

Could you please help us understand the issue? In the following code, does RemoveServiceAccount wait until either removeFluentd or removeRsyslog has completely removed its daemonset, or are the calls not serialized but asynchronous?

https://github.com/openshift/cluster-logging-operator/blob/master/pkg/k8shandler/collection.go#L148-L164

We are having an issue where logs are processed without a service account for a short period when deleting and recreating the CLO. I moved the RemoveServiceAccount code after removeFluentd / removeRsyslog, but the problem still exists.

If the timing of removing the service account is the cause of this issue, what would be the best way to determine when to remove the service account? For instance, we could get the daemonset list with this code. Can we wait until it returns empty?

https://github.com/openshift/cluster-logging-operator/blob/master/pkg/k8shandler/status.go
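For illustration, the manual equivalent of the ordering I have in mind would be roughly this (assuming the default resource names and labels):

oc -n openshift-logging delete daemonset fluentd
# wait until the collector pods are actually gone ...
oc -n openshift-logging wait --for=delete pod -l component=fluentd --timeout=120s
# ... and only then remove the service account
oc -n openshift-logging delete serviceaccount logcollector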
(In reply to Noriko Hosoi from comment #36)
> Hi @jcantril,
>
> Could you please help us understand the issue? In the following code, does
> RemoveServiceAccount wait until either removeFluentd or removeRsyslog has
> completely removed its daemonset, or are the calls not serialized but
> asynchronous?
>
> https://github.com/openshift/cluster-logging-operator/blob/master/pkg/k8shandler/collection.go#L148-L164

Are you sure that code is being called when the clusterlogging instance is deleted?

    if cluster.Spec.Collection.Logs.Type != logging.LogCollectionTypeFluentd &&
        cluster.Spec.Collection.Logs.Type != logging.LogCollectionTypeRsyslog {

        if err = clusterRequest.RemoveServiceAccount("logcollector"); err != nil {
            return
        }

        if err = clusterRequest.RemovePriorityClass(clusterLoggingPriorityClassName); err != nil {
            return
        }
    }

That would mean that when you do `oc delete clusterlogging instance` this code is called with cluster.Spec.Collection.Logs.Type != logging.LogCollectionTypeFluentd && cluster.Spec.Collection.Logs.Type != logging.LogCollectionTypeRsyslog - is that even possible?

> 2. delete clusterlogging instance to undeploy logging EFK

What actually happens when `oc delete clusterlogging instance` is called? Does it go through the operator code? Or does Kubernetes itself delete the objects? If it is the latter, we may need to set some sort of "grace period" on the SA or SA secret, so that it "waits" until the rsyslog container exits, or something like that.

I suspect that the CLO isn't called in this case - we could add some tracing to the CLO Reconcile function to see if it is called when the clusterlogging instance is deleted.

> We are having an issue where logs are processed without a service account for
> a short period when deleting and recreating the CLO. I moved the
> RemoveServiceAccount code after removeFluentd / removeRsyslog, but the
> problem still exists.
>
> If the timing of removing the service account is the cause of this issue, what
> would be the best way to determine when to remove the service account? For
> instance, we could get the daemonset list with this code. Can we wait until
> it returns empty?
>
> https://github.com/openshift/cluster-logging-operator/blob/master/pkg/k8shandler/status.go
Thank you for your input, Jeff.

Jeff Cantrill:
> isn't there a delay set on the fluentd termination... maybe it needs to be shortened... I thought we had a request to do that anyway
> https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-terminating-with-grace
> I believe it only applies to pods
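For reference, the collector's current termination grace period can be inspected with something like this (assuming the daemonset is named fluentd):

oc -n openshift-logging get daemonset fluentd -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}{"\n"}'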
Tested with ose-cluster-logging-operator-v4.2.0-201908271819; unable to reproduce this issue. Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922