Created attachment 1234970 [details] es log #2
Created attachment 1234971 [details] es_ops log #1
Created attachment 1234972 [details] es_ops log #2
Created attachment 1234973 [details] kibana log
Created attachment 1234977 [details] kibana_ops log
Tested on OCP 3.4.0 V38 + Logging 3.4.0; this issue is not reproduced. I suspect this issue is related to memory size. I used "vm_type: n1-standard-2" for GCE, and it took about one hour for the logs to show up in Kibana, much faster than xiazhao's two hours.

oc get configmap logging-deployer -o yaml
apiVersion: v1
data:
  enable-ops-cluster: "true"
  es-cluster-size: "2"
  es-instance-ram: 1G
  es-pvc-dynamic: "true"
  es-pvc-prefix: ""
  es-pvc-size: ""
  kibana-hostname: kibana.0104-03q.qe.rhcloud.com
  kibana-ops-hostname: kibana-ops.0104-03q.qe.rhcloud.com
  public-master-url: https://104.155.144.193:8443
  storage-group: ""
  use-journal: "true"
kind: ConfigMap
metadata:
  creationTimestamp: 2017-01-04T07:06:20Z
  name: logging-deployer
  namespace: logging
  resourceVersion: "3608"
  selfLink: /api/v1/namespaces/logging/configmaps/logging-deployer
  uid: 4ba48527-d24c-11e6-a995-42010af0000e

oc get po
NAME                              READY     STATUS      RESTARTS   AGE
logging-curator-1-fghq5           1/1       Running     0          2h
logging-curator-ops-1-z6jwi       1/1       Running     0          2h
logging-deployer-scs62            0/1       Completed   0          2h
logging-es-0sn7afy4-1-onp0i       1/1       Running     0          2h
logging-es-jqsekt6y-1-h3ogz       1/1       Running     0          2h
logging-es-ops-0dusq0sn-1-avulp   1/1       Running     0          2h
logging-es-ops-jrk7pk5s-1-2oz7i   1/1       Running     0          2h
logging-fluentd-faq7m             1/1       Running     0          2h
logging-kibana-1-oajon            2/2       Running     0          2h
logging-kibana-ops-1-216y9        2/2       Running     0          2h

Image                              ID
openshift3/logging-auth-proxy      e96b37a99960
openshift3/logging-kibana          27f978fc2946
openshift3/logging-fluentd         c493f8b4553b
openshift3/logging-elasticsearch   3ca95a8a9433
openshift3/logging-curator         e39988877cd9
openshift3/logging-deployer        1033ccb0557b
Created attachment 1237079 [details] logs from the two ES pods
@junqi - Is the problem that elasticsearch is missing data that is found in oc logs? Is it missing permanently, or does it take too long to show up? Is this a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1426137 and https://bugzilla.redhat.com/show_bug.cgi?id=1397293 or is this a different issue?
(In reply to Rich Megginson from comment #9)
> @junqi - Is the problem that elasticsearch is missing data that is found in
> oc logs?  Is it missing permanently, or does it take too long to show up?
> 
> Is this a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1426137
> and https://bugzilla.redhat.com/show_bug.cgi?id=1397293 or is this a
> different issue?

Pre-conditions when this issue was found: use_journal = true, enable_kibana_ops = true and elastic_cluster_size = 2.

Issue description: the ES pod logs are not shown completely when checking them with 'oc logs $es-pod'. For example, one ES pod had only the following log lines (see also attachments es log #1 and es log #2):
*******************************************************************************
[2016-12-23 07:36:57,142][WARN ][com.floragunn.searchguard.configuration.PrivilegesEvaluator] project.logging.5cf5542b-c8d8-11e6-bf83-42010af00028.* does not exist in cluster metadata
[2016-12-23 08:37:16,023][WARN ][com.floragunn.searchguard.configuration.PrivilegesEvaluator] logstash-* does not exist in cluster metadata
[2016-12-23 08:37:16,024][WARN ][com.floragunn.searchguard.configuration.PrivilegesEvaluator] logstash-* does not exist in cluster metadata
*******************************************************************************

This issue was originally found on OCP 3.4.0 and happened occasionally; it did not happen in my env.

It is not a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1426137; BZ #1426137 was found on the logging 3.5 ansible installation.

It is not a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1397293; we use journald as the logging driver. Although it took a long time for logs to show up in the Kibana UI, the key point of this defect is that the ES logs are not completely displayed.
I don't understand what you mean by "logs are not completely displayed". Are you saying that the content of es log #1 and es log #2 should be identical? Because they should not be. They are the logs of different instances of elasticsearch serving different clients and different roles in the cluster.
(In reply to Rich Megginson from comment #11)
> I don't understand what you mean by "logs are not completely displayed". 
> Are you saying that the content of es log #1 and es log #2 should be
> identical?  Because they should not be.  They are the logs of different
> instances of elasticsearch serving different clients and different roles in
> the cluster.

Let me explain more clearly. See the attached es log #1. The defect is: while the logging stacks are running, there are only 3 lines recorded in the ES pod log and no further logs appear; that is what I mean by "logs are not completely displayed".

The attached es log #2 is a sample of what the ES log should look like; for example, we can find the update_mapping info and "Cluster health status changed to GREEN" there.
> while the logging stacks are running, there are only 3 lines recorded in the ES pod log and no further logs appear; that is what I mean by "logs are not completely displayed".

Why is it a problem if there are only 3 lines recorded in one of the ES pod logs?
(In reply to Rich Megginson from comment #13)
> > while the logging stacks are running, there are only 3 lines recorded in the ES pod log and no further logs appear; that is what I mean by "logs are not completely displayed".
> 
> Why is it a problem if there are only 3 lines recorded in one of the ES pod
> logs?

See the attached file es log #2; this is what the ES log looks like after the ES pods are running: we can find the user project update_mapping info and "Cluster health status changed to GREEN".

This defect is not reproduced in my env and I had not encountered this error before. We can continue to monitor the ES logs; if this defect does not happen again, I think we can close it as 'Not a Bug' or 'Works for me'.
Changed the status to closed by mistake, reset it.
I'm not sure how Elasticsearch logging works when running in a cluster, and I'm also not sure if this behavior changed from ES 1.5 to 2.x (OSE 3.3 -> 3.5), or changed from ES 2.1 -> 2.4.

When fluentd talks to Elasticsearch through the Service, the Service presumably does some sort of load balancing/round robin, so that fluentd sends data to different Elasticsearch members of the cluster at different times. So each Elasticsearch may not register the same traffic as the other Elasticsearch. For example, if fluentd adds an index to ElasticsearchA, we would expect a message like this in the ElasticsearchA log:

[2017-01-04 07:52:55,356][INFO ][cluster.metadata         ] [Earthquake] [project.install-test.5a80392c-d23f-11e6-9bb9-42010af0000e.2017.01.04] creating index, cause [auto(bulk api)], templates [com.redhat.viaq-openshift-project.template.json], shards [1]/[0], mappings [_default_]

But we wouldn't necessarily expect a log message like this in the ElasticsearchB pod log. We would expect that doing a search on ElasticsearchB would return the data in that index. But the OpenShift Elasticsearch service hides that from you - if you query Elasticsearch from outside the pod (e.g. via Kibana, not via oc exec $espod -- curl ...), you don't really know if you are talking to ElasticsearchA or B. Same with fluentd - when fluentd opens a connection to send data to Elasticsearch, it is using the Elasticsearch Service - it may be talking to ElasticsearchA or ElasticsearchB for the duration of its connection.

Does this explain why there are different logs in the ElasticsearchA and ElasticsearchB logs? If not, can you explain exactly which log messages are completely displayed, and which are not?

I'm including Lukáš and Peter to see if they can help explain why different members of the Elasticsearch cluster would log different messages.
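For reference, a minimal way to confirm that both cluster members serve the same data even though their pod logs differ is to query each ES pod directly and compare the index lists. This is only a sketch: the pod names are placeholders, and the admin cert/key paths under /etc/elasticsearch/secret are assumptions based on a default aggregated-logging deployment.

# List the indices as seen through each Elasticsearch pod, bypassing the Service.
for pod in $ES_POD_A $ES_POD_B; do     # placeholder pod names, e.g. taken from 'oc get po'
  echo "== $pod =="
  oc exec $pod -- curl -s \
    --cacert /etc/elasticsearch/secret/admin-ca \
    --cert /etc/elasticsearch/secret/admin-cert \
    --key /etc/elasticsearch/secret/admin-key \
    "https://localhost:9200/_cat/indices?v"
done

If the cluster is healthy, both pods should report the same cluster-wide index list, regardless of which member fluentd happened to be sending to.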
ping - can someone please comment on https://bugzilla.redhat.com/show_bug.cgi?id=1408397#c16 ? I would like to close this bug if it really isn't a bug, or somehow fix it.
(In reply to Rich Megginson from comment #17)
> ping - can someone please comment on
> https://bugzilla.redhat.com/show_bug.cgi?id=1408397#c16 ?
> 
> I would like to close this bug if it really isn't a bug, or somehow fix it.

Yes, it happened occasionally and is not reproduced in my env. I think we can close it as 'Not a Bug' or 'Works for me' if Lukáš and Peter do not have different opinions.
(In reply to Junqi Zhao from comment #18)
> (In reply to Rich Megginson from comment #17)
> > ping - can someone please comment on
> > https://bugzilla.redhat.com/show_bug.cgi?id=1408397#c16 ?
> > 
> > I would like to close this bug if it really isn't a bug, or somehow fix it.
> 
> Yes, it happened occasionally and is not reproduced in my env. I think we
> can close it as 'Not a Bug' or 'Works for me' if Lukáš and Peter do not
> have different opinions.

Either this is a bug in Elasticsearch, that it somehow doesn't print its own logs, or it is a bug in docker, that it is missing or not displaying some of the logs it gets from containers/pods. At this point I don't know which.
We can verify exactly which logs are missing by comparing the pod log output with the Elasticsearch log files written locally inside the various ES pods.
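A rough sketch of that check, for the record (the pod name and the local log file location inside the image are placeholders/assumptions; adjust them to wherever the Elasticsearch container actually writes its log files):

ES_POD=logging-es-xxxxx                                # placeholder pod name
oc logs $ES_POD > /tmp/es-docker.log                   # what docker/journald captured
oc exec $ES_POD -- find /elasticsearch -name '*.log'   # locate ES's own local log files (assumed base dir)
oc exec $ES_POD -- cat <path-to-local-es-log> > /tmp/es-local.log
diff /tmp/es-docker.log /tmp/es-local.log              # lines only in es-local.log never made it to 'oc logs'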
(In reply to Rich Megginson from comment #19)
> Either this is a bug in Elasticsearch, that it somehow doesn't print its own
> logs, or it is a bug in docker, that it is missing or not displaying some of
> the logs it gets from containers/pods. At this point I don't know which.

This defect did not happen in my environment when I tried to reproduce it, and it does not exist in our environments now, so we don't know the root cause.

In my opinion, we should keep checking whether this error happens again and whether logs are missing, per Peter's suggestion in Comment 20. If it does not happen again, we should not spend much more effort on it and can close it. What do you think?
Maybe I am missing something here, but in general the logs of each ES node are expected to be different. Specifically, logs like:

- creating index
- update_mapping
- auto expanded replicas to [?]
- Cluster health status changed from [RED] to [YELLOW]
- Cluster health status changed from [YELLOW] to [GREEN]

can be expected only on the current cluster master node. They all refer to cluster state changes, and at any given point in time there should be only a single node in charge of making such a change - and that is the master node. Notice that if the existing master node goes away, another master-enabled node is elected to take the role of cluster master, and from then on it is that node's log where we will find these messages.

I tested this with ES 5.2.2 on a two node cluster: no matter whether you send the index create request to the first or the second node (9200 or 9201), the "creating index" message always appears only in the master node's log. I am not sure how this worked in previous ES versions, but I would be very surprised to find out it worked differently.

FYI, in future releases we can expect more role-specific nodes (master only, data only, coordinating [aka client] nodes, etc.), thus we can expect even more diversity in the logs.

Another typical reason why logs differ is JVM GC logging, which is really node specific.

HTH
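Given that, one quick way to tell which pod's log should contain the cluster-state messages is to ask Elasticsearch which node is currently the master. A sketch, with the cert/key paths again assumed from a default aggregated-logging deployment (the _cat/master API is available in ES 1.x/2.x/5.x):

oc exec $ES_POD -- curl -s \
  --cacert /etc/elasticsearch/secret/admin-ca \
  --cert /etc/elasticsearch/secret/admin-cert \
  --key /etc/elasticsearch/secret/admin-key \
  "https://localhost:9200/_cat/master?v"

The node named in the output is the one expected to log "creating index", "update_mapping" and the cluster health transitions; the other member's log can legitimately stay quiet.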
Is this still a problem with the latest 3.4.1 release?
@Peter,

Checked with the latest 3.4.1 images; this issue no longer reproduces, and the full elasticsearch log can be seen. Please feel free to transfer back to ON_QA for closure, thanks!

Images tested with:
openshift3/logging-deployer        dcee53833a87
openshift3/logging-kibana          e7b6eb3c6d3c
openshift3/logging-elasticsearch   fab3ad9b2410
openshift3/logging-fluentd         10c6287124fc
openshift3/logging-curator         e8b1bbdfa30f
openshift3/logging-auth-proxy      f2750505bbf8

# openshift version
openshift v3.4.1.18
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0
Set to verified according to comment #24
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1425