Bug 1715370 - Some orphaned logs sent to .operations.* index by mistake.
Summary: Some orphaned logs sent to .operations.* index by mistake.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
unspecified
medium
Target Milestone: ---
: 4.2.0
Assignee: Noriko Hosoi
QA Contact: Qiaoling Tang
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-05-30 08:27 UTC by Qiaoling Tang
Modified: 2019-10-16 06:29 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: In the Cluster Logging instance deletion, resources under Cluster Logging were deleted independently. Consequence: Due to independency, there could be a short time period when the fluentd or rsyslog daemonset was still running but its logcollector service account was removed. That made the logs processed in the time miss kubernetes information including the namespace name. Fix: Set foregroundDeletion finalizer to the collector service account. Then set the collector daemonsets owned by the service account with blockOwnerDeletion enabled, which makes the service account wait to be deleted until the descendent resources are all deleted. Result: There is no chance for the collector daemonsets to run without the logcollector service account.
Clone Of:
Environment:
Last Closed: 2019-10-16 06:29:26 UTC
Target Upstream Version:


Attachments (Terms of Use)
fluentd pod logs (977.75 KB, application/gzip)
2019-07-10 01:50 UTC, Qiaoling Tang
no flags Details


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-logging-operator pull 216 'None' closed Bug 1715370: Some orphaned logs sent to .operations.* index by mistake. 2020-11-17 10:40:50 UTC
Github openshift cluster-logging-operator pull 218 'None' closed Bug 1715370: Some orphaned logs sent to .operations.* index by mistake. 2020-11-17 10:40:50 UTC
Github openshift cluster-logging-operator pull 228 'None' closed Bug 1715370: Some orphaned logs sent to .operations.* index by mistake. 2020-11-17 10:40:49 UTC
Github openshift cluster-logging-operator pull 230 'None' closed Bug 1715370: Some orphaned logs sent to .operations.* index by mistake. 2020-11-17 10:40:50 UTC
Github openshift origin-aggregated-logging pull 1702 'None' closed Bug 1715370: Some orphaned logs sent to .operations.* index 2020-11-17 10:40:50 UTC
Red Hat Product Errata RHBA-2019:2922 None None None 2019-10-16 06:29:43 UTC

Description Qiaoling Tang 2019-05-30 08:27:45 UTC
Description of problem:
Some orphaned logs sent to .operations.* index by mistake.

{
  "_index": ".operations.2019.05.30",
  "_type": "com.redhat.viaq.common",
  "_id": "ZTg4MmI0MzgtMTc5Ny00M2JiLTlkNWQtNmIwOWNmN2Y4ZTM4",
  "_version": 1,
  "_score": null,
  "_source": {
    "docker": {
      "container_id": "75bd2cb09b9e0652397c804f1135848e728006b156793ec9116464d638ea1ec8"
    },
    "kubernetes": {
      "container_name": "sdn",
      "namespace_name": ".orphaned",
      "pod_name": "sdn-z9gm7",
      "orphaned_namespace": "openshift-sdn",
      "namespace_id": "orphaned"
    },
    "message": "I0530 05:57:46.349079    1434 roundrobin.go:338] LoadBalancerRR: Removing endpoints for openshift-logging/elasticsearch-cluster:elasticsearch",
    "level": "err",
    "hostname": "localhost.localdomain",
    "pipeline_metadata": {
      "collector": {
        "ipaddr4": "10.128.2.18",
        "ipaddr6": "fe80::50f8:7cff:fe26:884c",
        "inputname": "fluent-plugin-systemd",
        "name": "fluentd",
        "received_at": "2019-05-30T05:57:46.942489+00:00",
        "version": "1.4.0 1.6.0"
      }
    },
    "@timestamp": "2019-05-30T05:57:46.349135+00:00",
    "viaq_msg_id": "ZTg4MmI0MzgtMTc5Ny00M2JiLTlkNWQtNmIwOWNmN2Y4ZTM4"
  },
  "fields": {
    "@timestamp": [
      1559195866349
    ],
    "pipeline_metadata.collector.received_at": [
      1559195866942
    ]
  },
  "highlight": {
    "kubernetes.namespace_name": [
      "@kibana-highlighted-field@.orphaned@/kibana-highlighted-field@"
    ]
  },
  "sort": [
    1559195866349
  ]
}

$ oc exec elasticsearch-cdm-xeolone6-1-77764d496-j9r9g -- es_util --query=.operations.*/_count -d '{"query": { "match": {"kubernetes.namespace_name" : ".orphaned" }}}' |jq
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-xeolone6-1-77764d496-j9r9g -n openshift-logging' to see all of the containers in this pod.
{
  "count": 87,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  }
}


Version-Release number of selected component (if applicable):
openshift/ose-logging-elasticsearch5:v4.1.0-201905191700
openshift/ose-logging-fluentd:v4.1.0-201905191700

4.1.0-0.nightly-2019-05-29-220142

How reproducible:
Always

Steps to Reproduce:
1. Deploy logging, wait for a few minutes
2. delete clusterlogging instance to undeploy logging EFK
3. wait until all the EFK resources removed, create clusterlogging instance again 
4. check logs in .operations.* index

Actual results:
Some orphaned logs are sent to .operations.* index by mistake.

Expected results:
orphaned logs should sent to .orphaned.* index.

Additional info:

I checked these orphaned logs, their kubernetes.orphaned_namespace are "openshift-sdn", "openshift-monitoring", "openshift-operators-redhat", and the messages are related to deleting logging EFK resources.


A fresh install doesn't have this issue.

Comment 1 Qiaoling Tang 2019-05-31 03:33:54 UTC
Also encountered this issue in another env, when the route53 has some problem, some orphaned pod logs who's kubernetes.orphaned_namespace are `openshift-*` are sent to ".operations.*" index, not to the ".orphaned.*" index.

Seems the logs which have kubernetes.orphaned_namespace "openshift-*" are sent to .operations.* index.

See logs in the attachment.

Comment 3 Noriko Hosoi 2019-07-08 21:55:18 UTC
Hi @Qiaoling, did you have a chance to rerun this failure case recently?

According to #c0, the bug was observed with the build 4.1.0-0.nightly-2019-05-29-220142.

I'd like you to try the build having this fix.  That is, 4.1.0-0.nightly-2019-06-21 or newer.

commit 78bcf075887d32de58a15912b473ae721d4a6cbe (origin/pr/1678)
Author: Rich Megginson <rmeggins@redhat.com>
Date:   Fri Jun 21 12:08:40 2019 -0600
    Bug 1722898: Logging data from all projects are stored to .orphaned indexes with Elasticsearch

Thanks!

Comment 4 Qiaoling Tang 2019-07-09 02:06:03 UTC
This could be reproduced in 4.1.0-0.nightly-2019-07-07-154416, and the images are: 
ose-cluster-logging-operator:v4.1.6-201907070809
ose-logging-elasticsearch5:v4.1.6-201907070809
ose-logging-fluentd:v4.1.6-201907070809


{
  "_index": ".operations.2019.07.09",
  "_type": "com.redhat.viaq.common",
  "_id": "OWUyOGNiNzktMTZkMi00NmYwLTg4NzctZTI2MjkzZDZjZjcw",
  "_version": 1,
  "_score": null,
  "_source": {
    "docker": {
      "container_id": "f8e24c405c7475d87dcc44793e81a0554f5cda33367a5e2093e94bb07149fcf6"
    },
    "kubernetes": {
      "container_name": "elasticsearch-operator",
      "namespace_name": ".orphaned",
      "pod_name": "elasticsearch-operator-5c9d944579-rqz7f",
      "orphaned_namespace": "openshift-operators-redhat",
      "namespace_id": "orphaned"
    },
    "message": "time=\"2019-07-09T02:01:39Z\" level=info msg=\"Flushing nodes for cluster elasticsearch in openshift-logging\"",
    "level": "info",
    "hostname": "localhost.localdomain",
    "pipeline_metadata": {
      "collector": {
        "ipaddr4": "10.129.2.16",
        "ipaddr6": "fe80::4865:f5ff:fe6f:1f94",
        "inputname": "fluent-plugin-systemd",
        "name": "fluentd",
        "received_at": "2019-07-09T02:01:39.750431+00:00",
        "version": "1.5.1 1.6.0"
      }
    },
    "@timestamp": "2019-07-09T02:01:39.085290+00:00",
    "viaq_msg_id": "OWUyOGNiNzktMTZkMi00NmYwLTg4NzctZTI2MjkzZDZjZjcw"
  },
  "fields": {
    "@timestamp": [
      1562637699085
    ],
    "pipeline_metadata.collector.received_at": [
      1562637699750
    ]
  },
  "highlight": {
    "kubernetes.namespace_name": [
      "@kibana-highlighted-field@.orphaned@/kibana-highlighted-field@"
    ]
  },
  "sort": [
    1562637699085
  ]
}

Comment 5 Rich Megginson 2019-07-09 15:01:06 UTC
Please set the fluentd log level to trace:

oc edit configmap fluentd

replace this line

    @include configs.d/openshift/system.conf

with this

    <system>
      @log_level trace
    </system>

And provide the fluentd logs

Comment 6 Noriko Hosoi 2019-07-09 18:18:41 UTC
I'm also interested in the environment variables set in the fluentd pod.  Please set VERBOSE to true and share the fluentd log.  Thanks!

if [[ $VERBOSE ]]; then
  set -ex
  fluentdargs="$fluentdargs -vv --log-event-verbose"
  echo ">>>>>> ENVIRONMENT VARS <<<<<"
  env | sort
  echo ">>>>>>>>>>>>><<<<<<<<<<<<<<<<"
else

Comment 7 Qiaoling Tang 2019-07-10 01:50:05 UTC
Created attachment 1588951 [details]
fluentd pod logs

Comment 8 Rich Megginson 2019-07-10 02:51:05 UTC
This is interesting:

grep orphaned_namespace orphaned-log-in-es |sort -u
            "orphaned_namespace" : "openshift-machine-api",
            "orphaned_namespace" : "openshift-monitoring",
            "orphaned_namespace" : "openshift-operator-lifecycle-manager",

Only those 3 namespaces.

logs/fluentd-2lrcb.log:2019-07-10 01:27:16 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-machine-api
logs/fluentd-2lrcb.log-2019-07-10 01:27:16 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc

logs/fluentd-vmrl6.log:2019-07-10 01:27:18 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-operator-lifecycle-manager
logs/fluentd-vmrl6.log:2019-07-10 01:27:18 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc

logs/fluentd-l8tsm.log:2019-07-10 01:27:10 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-monitoring
logs/fluentd-l8tsm.log-2019-07-10 01:27:10 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc

There are a lot of these errors from during the time of the orphaned records, from 1:27:10 to 01:27:40, and only from 4 of the fluentd pods: fluentd-2lrcb fluentd-ksq8d fluentd-l8tsm fluentd-vmrl6

I'm not sure how the records in orphaned-log-in-es were sorted, but they all have a @timestamp from at or around this time frame.

We'll need to look at the api server logs to figure out why the api server is returning a 401 Unauthorized during this time period.  I don't think I have ever seen this particular problem.

Comment 9 Qiaoling Tang 2019-07-10 06:20:14 UTC
(In reply to Rich Megginson from comment #8)
> This is interesting:
> 
> grep orphaned_namespace orphaned-log-in-es |sort -u
>             "orphaned_namespace" : "openshift-machine-api",
>             "orphaned_namespace" : "openshift-monitoring",
>             "orphaned_namespace" : "openshift-operator-lifecycle-manager",
> 
> Only those 3 namespaces.

I also find orphaned_namespace "openshift-logging" and "openshift-operators-redhat" when I reproduce this issue.

> logs/fluentd-2lrcb.log:2019-07-10 01:27:16 +0000 [trace]:
> fluent/log.rb:281:trace: fetching namespace metadata: openshift-machine-api
> logs/fluentd-2lrcb.log-2019-07-10 01:27:16 +0000 [debug]:
> fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized'
> encountered fetching namespace metadata from Kubernetes API v1 endpoint
> https://kubernetes.default.svc
> 
> logs/fluentd-vmrl6.log:2019-07-10 01:27:18 +0000 [trace]:
> fluent/log.rb:281:trace: fetching namespace metadata:
> openshift-operator-lifecycle-manager
> logs/fluentd-vmrl6.log:2019-07-10 01:27:18 +0000 [debug]:
> fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized'
> encountered fetching namespace metadata from Kubernetes API v1 endpoint
> https://kubernetes.default.svc
> 
> logs/fluentd-l8tsm.log:2019-07-10 01:27:10 +0000 [trace]:
> fluent/log.rb:281:trace: fetching namespace metadata: openshift-monitoring
> logs/fluentd-l8tsm.log-2019-07-10 01:27:10 +0000 [debug]:
> fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized'
> encountered fetching namespace metadata from Kubernetes API v1 endpoint
> https://kubernetes.default.svc
> 
> There are a lot of these errors from during the time of the orphaned
> records, from 1:27:10 to 01:27:40, and only from 4 of the fluentd pods:
> fluentd-2lrcb fluentd-ksq8d fluentd-l8tsm fluentd-vmrl6
> 
> I'm not sure how the records in orphaned-log-in-es were sorted, but they all
> have a @timestamp from at or around this time frame.

I find these logs by executing "curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key 'https://localhost:9200/.operations.*/_search?pretty&size=1000' -d '{"query":{"match": {"kubernetes.namespace_name": ".orphaned"}}}'" in es pod

> We'll need to look at the api server logs to figure out why the api server
> is returning a 401 Unauthorized during this time period.  I don't think I
> have ever seen this particular problem.

I checked the pod log in openshift-apiserver namespace (not sure if I checked the correct resources), just see some logs like:
E0710 05:51:03.580677       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E0710 05:51:55.786022       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W0710 05:51:55.786417       1 reflector.go:256] github.com/openshift/client-go/image/informers/externalversions/factory.go:101: watch of *v1.ImageStream ended with: The resourceVersion for the provided watch is too old.

I can see a lot of these logs, even if the EFK pods are not running.

Comment 10 Rich Megginson 2019-07-10 13:27:34 UTC
(In reply to Qiaoling Tang from comment #9)
> (In reply to Rich Megginson from comment #8)
> > This is interesting:
> > 
> > grep orphaned_namespace orphaned-log-in-es |sort -u
> >             "orphaned_namespace" : "openshift-machine-api",
> >             "orphaned_namespace" : "openshift-monitoring",
> >             "orphaned_namespace" : "openshift-operator-lifecycle-manager",
> > 
> > Only those 3 namespaces.
> 
> I also find orphaned_namespace "openshift-logging" and
> "openshift-operators-redhat" when I reproduce this issue.

I'm assuming it is a timing issue - any namespace lookups during the time interval when the api server is returning 401 are going to fail.

> 
> > logs/fluentd-2lrcb.log:2019-07-10 01:27:16 +0000 [trace]:
> > fluent/log.rb:281:trace: fetching namespace metadata: openshift-machine-api
> > logs/fluentd-2lrcb.log-2019-07-10 01:27:16 +0000 [debug]:
> > fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized'
> > encountered fetching namespace metadata from Kubernetes API v1 endpoint
> > https://kubernetes.default.svc
> > 
> > logs/fluentd-vmrl6.log:2019-07-10 01:27:18 +0000 [trace]:
> > fluent/log.rb:281:trace: fetching namespace metadata:
> > openshift-operator-lifecycle-manager
> > logs/fluentd-vmrl6.log:2019-07-10 01:27:18 +0000 [debug]:
> > fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized'
> > encountered fetching namespace metadata from Kubernetes API v1 endpoint
> > https://kubernetes.default.svc
> > 
> > logs/fluentd-l8tsm.log:2019-07-10 01:27:10 +0000 [trace]:
> > fluent/log.rb:281:trace: fetching namespace metadata: openshift-monitoring
> > logs/fluentd-l8tsm.log-2019-07-10 01:27:10 +0000 [debug]:
> > fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized'
> > encountered fetching namespace metadata from Kubernetes API v1 endpoint
> > https://kubernetes.default.svc
> > 
> > There are a lot of these errors from during the time of the orphaned
> > records, from 1:27:10 to 01:27:40, and only from 4 of the fluentd pods:
> > fluentd-2lrcb fluentd-ksq8d fluentd-l8tsm fluentd-vmrl6
> > 
> > I'm not sure how the records in orphaned-log-in-es were sorted, but they all
> > have a @timestamp from at or around this time frame.
> 
> I find these logs by executing "curl -s -k --cert
> /etc/elasticsearch/secret/admin-cert --key
> /etc/elasticsearch/secret/admin-key
> 'https://localhost:9200/.operations.*/_search?pretty&size=1000' -d
> '{"query":{"match": {"kubernetes.namespace_name": ".orphaned"}}}'" in es pod

Ok - you can also add `sort` to the url: ?pretty&size=1000&sort=@timestamp:desc

to sort from latest to earliest, or use @timestamp:asc for earliest to latest.

> 
> > We'll need to look at the api server logs to figure out why the api server
> > is returning a 401 Unauthorized during this time period.  I don't think I
> > have ever seen this particular problem.
> 
> I checked the pod log in openshift-apiserver namespace (not sure if I
> checked the correct resources), just see some logs like:
> E0710 05:51:03.580677       1 watcher.go:208] watch chan error: etcdserver:
> mvcc: required revision has been compacted
> E0710 05:51:55.786022       1 watcher.go:208] watch chan error: etcdserver:
> mvcc: required revision has been compacted
> W0710 05:51:55.786417       1 reflector.go:256]
> github.com/openshift/client-go/image/informers/externalversions/factory.go:
> 101: watch of *v1.ImageStream ended with: The resourceVersion for the
> provided watch is too old.
> 
> I can see a lot of these logs, even if the EFK pods are not running.


I would expect to see log messages like "... GET /api/v1/namespaces/openshift-monitoring 401 ..." somewhere.  Maybe we need to turn up the log level, or enable audit logging.

Comment 11 Qiaoling Tang 2019-07-11 06:45:18 UTC
I checked all the logs around the time when the issue occurred, and I filtered out the logs were gathered by the same fluentd pod, I found some of the logs were parsed correctly.

Comment 12 Rich Megginson 2019-07-11 20:28:37 UTC
We need to get the Kubernetes API server logs https://docs.openshift.com/container-platform/4.1/nodes/nodes/nodes-nodes-audit-log.html
I know in the CI tests that audit logging is enabled by default.  For example, here is what a fluentd request looks like:

/var/log/kube-apiserver/audit.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"79b7fe78-34ce-40d3-8c23-86481022478f","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-ingress","verb":"get","user":{"username":"system:serviceaccount:openshift-logging:logcollector","uid":"95461827-a3f3-11e9-afc0-0a18e06541b4","groups":["system:serviceaccounts","system:serviceaccounts:openshift-logging","system:authenticated"]},"sourceIPs":["10.0.129.192"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.3p105","objectRef":{"resource":"namespaces","namespace":"openshift-ingress","name":"openshift-ingress","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2019-07-11T19:20:03.492627Z","stageTimestamp":"2019-07-11T19:20:03.495920Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"cluster-logging-metadata-reader\" of ClusterRole \"metadata-reader\" to ServiceAccount \"logcollector/openshift-logging\""}}

The SA logcollector is used by both fluentd and rsyslog.  Here is the formatted output:

{
    "annotations": {
        "authorization.k8s.io/decision": "allow",
        "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"cluster-logging-metadata-reader\" of ClusterRole \"metadata-reader\" to ServiceAccount \"logcollector/openshift-logging\""
    },
    "apiVersion": "audit.k8s.io/v1",
    "auditID": "36cfcb76-0e91-41e0-8606-3a75e58a779d",
    "kind": "Event",
    "level": "Metadata",
    "objectRef": {
        "apiVersion": "v1",
        "name": "router-default-57cf6578bb-97bzr",
        "namespace": "openshift-ingress",
        "resource": "pods"
    },
    "requestReceivedTimestamp": "2019-07-11T20:24:02.477244Z",
    "requestURI": "/api/v1/namespaces/openshift-ingress/pods/router-default-57cf6578bb-97bzr",
    "responseStatus": {
        "code": 200,
        "metadata": {}
    },
    "sourceIPs": [
        "10.0.129.192"
    ],
    "stage": "ResponseComplete",
    "stageTimestamp": "2019-07-11T20:24:02.480559Z",
    "user": {
        "groups": [
            "system:serviceaccounts",
            "system:serviceaccounts:openshift-logging",
            "system:authenticated"
        ],
        "uid": "95461827-a3f3-11e9-afc0-0a18e06541b4",
        "username": "system:serviceaccount:openshift-logging:logcollector"
    },
    "userAgent": "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.3p105",
    "verb": "get"
}

In the error case, we should see records with a responseStatus.code of 401, and hopefully some additional annotations or metadata to explain why we are getting the 401.

Comment 13 Qiaoling Tang 2019-07-12 08:08:50 UTC
Fluentd log:

2019-07-12 06:57:42 +0000 [trace]: fluent/log.rb:281:trace: fetching pod metadata: openshift-network-operator/network-operator-5c9c6559b-h77b2
2019-07-12 06:57:42 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching pod metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc
2019-07-12 06:57:42 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-network-operator
2019-07-12 06:57:43 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc
2019-07-12 06:57:43 +0000 [trace]: fluent/log.rb:281:trace: orphaning message for: openshift-network-operator/network-operator-5c9c6559b-h77b2 
2019-07-12 06:57:44 +0000 [trace]: fluent/log.rb:281:trace: fetching pod metadata: openshift-network-operator/network-operator-5c9c6559b-h77b2
2019-07-12 06:57:44 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching pod metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc
2019-07-12 06:57:44 +0000 [trace]: fluent/log.rb:281:trace: fetching namespace metadata: openshift-network-operator
2019-07-12 06:57:45 +0000 [debug]: fluent/log.rb:302:debug: Exception 'HTTP status code 401, Unauthorized' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc
2019-07-12 06:57:45 +0000 [trace]: fluent/log.rb:281:trace: orphaning message for: openshift-network-operator/network-operator-5c9c6559b-h77b2 


kube-apiserver audit log:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"36c21c88-5093-4617-9815-221b6a39e78d","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-network-operator","verb":"get","user":{},"sourceIPs":["10.129.0.48"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"namespaces","namespace":"openshift-network-operator","name":"openshift-network-operator","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-07-12T06:57:43.004051Z","stageTimestamp":"2019-07-12T06:57:43.004081Z"}

Comment 14 Rich Megginson 2019-07-12 13:35:24 UTC
"user":{}

fluentd is not passing the credentials with the request.  Looks like a bug in fluent-plugin-kubernetes_metadata_filter

Comment 15 Noriko Hosoi 2019-07-19 20:23:41 UTC
(In reply to Rich Megginson from comment #14)
> "user":{}
> 
> fluentd is not passing the credentials with the request.  Looks like a bug
> in fluent-plugin-kubernetes_metadata_filter

Indeed, the following get_pod and get_namespace calls are failing according to the logs in #c13.

https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/vendored_gem_src/fluent-plugin-kubernetes_metadata_filter/lib/fluent/plugin/filter_kubernetes_metadata.rb#L87
https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/vendored_gem_src/fluent-plugin-kubernetes_metadata_filter/lib/fluent/plugin/filter_kubernetes_metadata.rb#L129

Does that mean the auth_option passed to Kubeclient::Client.new is missing or broken?
https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/vendored_gem_src/fluent-plugin-kubernetes_metadata_filter/lib/fluent/plugin/filter_kubernetes_metadata.rb#L245-L250

Assuming the test is not setting bearer_token_file nor secret_dir in filter-k8s-meta.conf, can we double-check if this file exists and what's in it in a fluentd pod?
/var/run/secrets/kubernetes.io/serviceaccount/token

Comment 16 Rich Megginson 2019-07-19 20:25:49 UTC
> Assuming the test is not setting bearer_token_file nor secret_dir in filter-k8s-meta.conf, can we double-check if this file exists and what's in it in a fluentd pod?
/var/run/secrets/kubernetes.io/serviceaccount/token

That would be good to check at fluentd startup (and rsyslog startup) e.g. in the run.sh script

Comment 17 Noriko Hosoi 2019-07-19 20:28:10 UTC
(In reply to Rich Megginson from comment #16)
> > Assuming the test is not setting bearer_token_file nor secret_dir in filter-k8s-meta.conf, can we double-check if this file exists and what's in it in a fluentd pod?
> /var/run/secrets/kubernetes.io/serviceaccount/token
> 
> That would be good to check at fluentd startup (and rsyslog startup) e.g. in
> the run.sh script

Do you mean adding the checking code to run.sh (and rsyslog.sh)?  If so, let me work on it!

Comment 18 Rich Megginson 2019-07-19 20:29:24 UTC
(In reply to Noriko Hosoi from comment #17)
> (In reply to Rich Megginson from comment #16)
> > > Assuming the test is not setting bearer_token_file nor secret_dir in filter-k8s-meta.conf, can we double-check if this file exists and what's in it in a fluentd pod?
> > /var/run/secrets/kubernetes.io/serviceaccount/token
> > 
> > That would be good to check at fluentd startup (and rsyslog startup) e.g. in
> > the run.sh script
> 
> Do you mean adding the checking code to run.sh (and rsyslog.sh)?  If so, let
> me work on it!

Yes, that's what I meant.

Comment 19 Noriko Hosoi 2019-07-19 21:37:40 UTC
(In reply to Rich Megginson from comment #16)
> > Assuming the test is not setting bearer_token_file nor secret_dir in filter-k8s-meta.conf, can we double-check if this file exists and what's in it in a fluentd pod?
> /var/run/secrets/kubernetes.io/serviceaccount/token
> 
> That would be good to check at fluentd startup (and rsyslog startup) e.g. in
> the run.sh script

Opened PRs to check bearer_token_file.
fluentd - https://github.com/openshift/origin-aggregated-logging/pull/1702
rsyslog - https://github.com/openshift/cluster-logging-operator/pull/216

Not sure yet if it is the direct cause of this bug, I'm leaving the bug status as ASSIGNED for now.

Comment 20 Rich Megginson 2019-07-19 21:39:50 UTC
Note that the PRs will not fix the bug, but they will hopefully allow us to determine the cause.

Comment 21 Noriko Hosoi 2019-07-19 22:11:31 UTC
The bug status was automatically switched to POST by the bugzilla checking bot. :)
@Qiaoling, please feel free to change it to ASSIGNED/FailedQA with your findings.

Comment 30 Rich Megginson 2019-07-26 01:33:29 UTC
> More specifically, can we delete fluentd SA while fluentd is still working on logs by that?

You might be able to reproduce it that way, by deleting the SA, or the secret with the SA token, or both.  In the past, it was not so easy to undeploy logging . . . so maybe much harder to reproduce.

> I wonder what happens if we delete serviceaccount [1] after deleting the other resources?  Do we have any bad consequences?

That is probably ok.  In general, we should try to delete objects in the reverse order that we created them.  But I think that as long as you delete the Daemonset first, it shouldn't matter what order you delete the configmaps, secrets, etc.  So I think deleting the SA and SA secret last should be fine.

Comment 32 Qiaoling Tang 2019-08-05 05:21:01 UTC
This issue is not fixed in ose-cluster-logging-operator-v4.2.0-201908041300

Reopen this bug.

Comment 33 Qiaoling Tang 2019-08-05 05:49:42 UTC
audit logs:
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"2f55ac91-f44f-409f-8828-e36acc2c9c6b","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-logging/secrets/logcollector-token-pqp7r","verb":"get","user":{"username":"system:apiserver","uid":"27f7e704-1377-4959-92da-21900e4410af","groups":["system:masters"]},"sourceIPs":["::1"],"userAgent":"hyperkube/v1.14.0+bf9534a (linux/amd64) kubernetes/bf9534a","objectRef":{"resource":"secrets","namespace":"openshift-logging","name":"logcollector-token-pqp7r","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"NotFound","code":404},"requestReceivedTimestamp":"2019-08-05T05:12:01.140008Z","stageTimestamp":"2019-08-05T05:12:01.141253Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"9232413e-0705-4a4d-a898-0d30d372d437","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-operators-redhat/pods/elasticsearch-operator-7b67d65659-sbk8c","verb":"get","user":{},"sourceIPs":["10.0.161.203"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"pods","namespace":"openshift-operators-redhat","name":"elasticsearch-operator-7b67d65659-sbk8c","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-08-05T05:12:01.151497Z","stageTimestamp":"2019-08-05T05:12:01.151531Z"}
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"bd3877ed-cd87-4a7b-83ec-23a80a6c7dee","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-logging/secrets/logcollector-token-pqp7r","verb":"get","user":{"username":"system:apiserver","uid":"27f7e704-1377-4959-92da-21900e4410af","groups":["system:masters"]},"sourceIPs":["::1"],"userAgent":"hyperkube/v1.14.0+bf9534a (linux/amd64) kubernetes/bf9534a","objectRef":{"resource":"secrets","namespace":"openshift-logging","name":"logcollector-token-pqp7r","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"NotFound","code":404},"requestReceivedTimestamp":"2019-08-05T05:12:01.905137Z","stageTimestamp":"2019-08-05T05:12:01.906574Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"1c2c5449-65ea-469a-956d-8fa3333e50da","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-sdn","verb":"get","user":{},"sourceIPs":["10.0.131.165"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"namespaces","namespace":"openshift-sdn","name":"openshift-sdn","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-08-05T05:12:01.916953Z","stageTimestamp":"2019-08-05T05:12:01.916990Z"}
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"02652404-107f-4d46-aa24-bb2f21b0e328","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-logging/secrets/logcollector-token-pqp7r","verb":"get","user":{"username":"system:apiserver","uid":"27f7e704-1377-4959-92da-21900e4410af","groups":["system:masters"]},"sourceIPs":["::1"],"userAgent":"hyperkube/v1.14.0+bf9534a (linux/amd64) kubernetes/bf9534a","objectRef":{"resource":"secrets","namespace":"openshift-logging","name":"logcollector-token-pqp7r","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"NotFound","code":404},"requestReceivedTimestamp":"2019-08-05T05:12:02.611130Z","stageTimestamp":"2019-08-05T05:12:02.612413Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"eb6cc0a9-2e86-4958-a7f0-576994fd7ded","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-sdn/pods/ovs-q58vj","verb":"get","user":{},"sourceIPs":["10.0.157.95"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"pods","namespace":"openshift-sdn","name":"ovs-q58vj","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-08-05T05:12:02.620634Z","stageTimestamp":"2019-08-05T05:12:02.620663Z"}
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"d8208d48-1b8f-456d-89a2-be1db5c8d6a4","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-monitoring/pods/prometheus-k8s-0","verb":"get","user":{},"sourceIPs":["10.0.131.165"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"pods","namespace":"openshift-monitoring","name":"prometheus-k8s-0","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-08-05T05:12:15.215025Z","stageTimestamp":"2019-08-05T05:12:15.215055Z"}
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"1b0a8aee-8126-4578-9e1d-f8658b10e005","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-logging/secrets/logcollector-token-pqp7r","verb":"get","user":{"username":"system:apiserver","uid":"27f7e704-1377-4959-92da-21900e4410af","groups":["system:masters"]},"sourceIPs":["::1"],"userAgent":"hyperkube/v1.14.0+bf9534a (linux/amd64) kubernetes/bf9534a","objectRef":{"resource":"secrets","namespace":"openshift-logging","name":"logcollector-token-pqp7r","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"NotFound","code":404},"requestReceivedTimestamp":"2019-08-05T05:12:15.223070Z","stageTimestamp":"2019-08-05T05:12:15.224381Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"1b06f70b-846e-40a8-8f50-433fea9fb2fa","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-monitoring","verb":"get","user":{},"sourceIPs":["10.0.131.165"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"namespaces","namespace":"openshift-monitoring","name":"openshift-monitoring","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-08-05T05:12:15.232701Z","stageTimestamp":"2019-08-05T05:12:15.232739Z"}

Comment 34 Qiaoling Tang 2019-08-05 05:53:08 UTC
logs in ES:

      {
        "_index": ".operations.2019.08.05",
        "_type": "com.redhat.viaq.common",
        "_id": "YzdjOTMwNzgtNzU0OC00YjgyLTliZDktODFmMmJjYzM1NmIw",
        "_score": 9.096395,
        "_source": {
          "docker": {
            "container_id": "9028976d21dbdf541285451544953d0ec903bf6d49a45d0304e2abe3f8383abc"
          },
          "kubernetes": {
            "container_name": "rules-configmap-reloader",
            "namespace_name": ".orphaned",
            "pod_name": "prometheus-k8s-0",
            "orphaned_namespace": "openshift-monitoring",
            "namespace_id": "orphaned"
          },
          "message": "2019/08/05 05:12:14 successfully triggered reload",
          "level": "unknown",
          "pipeline_metadata": {
            "collector": {
              "ipaddr4": "10.128.2.28",
              "ipaddr6": "fe80::683e:dff:fed5:b310",
              "inputname": "fluent-plugin-systemd",
              "name": "fluentd",
              "received_at": "2019-08-05T05:12:15.233568+00:00",
              "version": "1.6.2 1.6.0"
            }
          },
          "@timestamp": "2019-08-05T05:12:14.327828+00:00",
          "viaq_msg_id": "YzdjOTMwNzgtNzU0OC00YjgyLTliZDktODFmMmJjYzM1NmIw"
        }

$ oc exec elasticsearch-cdm-68e3yxff-1-8f5c4bf57-2hts9 -- es_util --query=.operations.*/_count -d '{"query": { "match": {"kubernetes.namespace_name" : ".orphaned" }}}' |jq
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-68e3yxff-1-8f5c4bf57-2hts9 -n openshift-logging' to see all of the containers in this pod.
{
  "count": 9,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  }
}

Comment 35 Noriko Hosoi 2019-08-05 15:30:59 UTC
Still there's some period the "user" is not available...  The session starting from the second log below has no user value although the previous has.

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"02652404-107f-4d46-aa24-bb2f21b0e328","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-logging/secrets/logcollector-token-pqp7r","verb":"get","user":{"username":"system:apiserver","uid":"27f7e704-1377-4959-92da-21900e4410af","groups":["system:masters"]},"sourceIPs":["::1"],"userAgent":"hyperkube/v1.14.0+bf9534a (linux/amd64) kubernetes/bf9534a","objectRef":{"resource":"secrets","namespace":"openshift-logging","name":"logcollector-token-pqp7r","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"NotFound","code":404},"requestReceivedTimestamp":"2019-08-05T05:12:02.611130Z","stageTimestamp":"2019-08-05T05:12:02.612413Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"eb6cc0a9-2e86-4958-a7f0-576994fd7ded","stage":"ResponseStarted","requestURI":"/api/v1/namespaces/openshift-sdn/pods/ovs-q58vj","verb":"get","user":{},"sourceIPs":["10.0.157.95"],"userAgent":"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157","objectRef":{"resource":"pods","namespace":"openshift-sdn","name":"ovs-q58vj","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-08-05T05:12:02.620634Z","stageTimestamp":"2019-08-05T05:12:02.620663Z"}

Comment 36 Noriko Hosoi 2019-08-05 19:37:56 UTC
Hi @jcantril,

Could you please help us understanding the issue?  In the following code, does RemoveServiceAccount wait until either of removeFluentd or removeRsyslog completely removes its daemonset?  Or requests are not serialized but asynchronous?

https://github.com/openshift/cluster-logging-operator/blob/master/pkg/k8shandler/collection.go#L148-L164

We are having an issue that logs are processed without a service account for a short period when deleting and recreating the CLO.  I moved the RemoveServiceAccount code after removeFluentd / removeRsyslog, but the problem still exists.

If the timing of removing service account is the cause of this issue, what'd be the best way to determine the time to remove the service account?  For instance, we could get the daemonset list by this code.  Can we wait until it returns empty?

https://github.com/openshift/cluster-logging-operator/blob/master/pkg/k8shandler/status.go

Comment 37 Rich Megginson 2019-08-05 20:04:40 UTC
(In reply to Noriko Hosoi from comment #36)
> Hi @jcantril,
> 
> Could you please help us understanding the issue?  In the following code,
> does RemoveServiceAccount wait until either of removeFluentd or
> removeRsyslog completely removes its daemonset?  Or requests are not
> serialized but asynchronous?
> 
> https://github.com/openshift/cluster-logging-operator/blob/master/pkg/
> k8shandler/collection.go#L148-L164

Are you sure that code is being called when the clusterlogging instance is deleted?

	if cluster.Spec.Collection.Logs.Type != logging.LogCollectionTypeFluentd && cluster.Spec.Collection.Logs.Type != logging.LogCollectionTypeRsyslog {
		if err = clusterRequest.RemoveServiceAccount("logcollector"); err != nil {
			return
		}

		if err = clusterRequest.RemovePriorityClass(clusterLoggingPriorityClassName); err != nil {
			return
		}
	}

That would mean that when you do `oc delete clusterlogging instance` this code is called with cluster.Spec.Collection.Logs.Type != logging.LogCollectionTypeFluentd && cluster.Spec.Collection.Logs.Type != logging.LogCollectionTypeRsyslog - is that even possible?

> 2. delete clusterlogging instance to undeploy logging EFK

What actually happens when `oc delete clusterlogging instance` is called?  Does it go through the operator code?  Or does Kubernetes itself delete the objects?

If it is the latter, we may need to set some sort of "grace period" on the SA or SA secret, so that it "waits" until the rsyslog container exits, or something like that.  I suspect that the clo isn't called in this case - we could add some tracing to the clo Reconcile function to see if it is called when the clusterlogging instance is deleted.

> 
> We are having an issue that logs are processed without a service account for
> a short period when deleting and recreating the CLO.  I moved the
> RemoveServiceAccount code after removeFluentd / removeRsyslog, but the
> problem still exists.
> 
> If the timing of removing service account is the cause of this issue, what'd
> be the best way to determine the time to remove the service account?  For
> instance, we could get the daemonset list by this code.  Can we wait until
> it returns empty?
> 
> https://github.com/openshift/cluster-logging-operator/blob/master/pkg/
> k8shandler/status.go

Comment 38 Noriko Hosoi 2019-08-05 21:19:34 UTC
Thank you for your input, Jeff.

Jeff Cantrill:
isnt there a delay set on the fluent termination...maybe it needs to be shortened... i thought we had  a request to do that anyway

https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-terminating-with-grace  i believe it only applies to pods
Google Cloud BlogGoogle Cloud Blog
Kubernetes best practices: terminating with grace | Google Cloud Blog
Editor’s note: Today is the fifth installment in a seven-part video and blog series from Google Developer Advocate Sandeep Dinesh on how to get the most ou(200 kB)
https://cloud.google.com/blog/static/assets/GCP_Twitter_Card-2000×1000.png

Comment 43 Qiaoling Tang 2019-08-30 01:24:40 UTC
Tested with ose-cluster-logging-operator-v4.2.0-201908271819, unable to reproduce this issue, move to VERIFIED.

Comment 45 errata-xmlrpc 2019-10-16 06:29:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922


Note You need to log in before you can comment on or make changes to this bug.