Created attachment 1222617 [details] fluentd_log
Created attachment 1222618 [details] kibana log
Created attachment 1222619 [details] screenshot where Kibana got stuck while working around the "Apply these filters?" prompt
Created attachment 1222620 [details] result of querying ES with index name *2016*
However, I was now able to see log entries in Kibana 3 hours after the deployment completed, and it seems the EFK stack pods restarted themselves 18 minutes ago:

$ oc get po
NAME                              READY     STATUS    RESTARTS   AGE
logging-curator-1-tvo0u           1/1       Running   1          18m
logging-curator-ops-1-eosmh       1/1       Running   1          18m
logging-es-ops-u804xrbi-1-6ch3y   1/1       Running   1          18m
logging-es-tf4pi7xr-1-dpcyi       1/1       Running   0          18m
logging-fluentd-sj5jr             1/1       Running   0          3h
logging-kibana-1-apmko            2/2       Running   1          18m
logging-kibana-ops-1-xi8fr        2/2       Running   1          18m

I am also able to work around bug #1388031 now, and bug #1397683 no longer appears in the Kibana UI.

I can see the logs that were collected by journald before the EFK stack was deployed, but it seems to take a very long time even though read_from_head is set to "false" in fluentd. The json-file driver behaved faster here. The bug title, priority, and severity have been changed.
Thanks for the confirmation, Eric. I retested the journald log driver on AWS; it took 2 hours for log entries to show up in the Kibana UI. AWS seems to behave faster than GCE, so the original issue may be related to the IaaS. Setting to VERIFIED for now; will reopen if this is encountered again on another IaaS.

Env tested:
# openshift version
openshift v3.4.0.29+ca980ba
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

openshift3/logging-deployer        cfa18cbbc8f3
openshift3/logging-fluentd         7b11a29c82c1
openshift3/logging-elasticsearch   6716a0ad8b2b
openshift3/logging-auth-proxy      ec334b0c2669
openshift3/logging-kibana          7fc9916eea4d
openshift3/logging-curator         9af78fc06248
Tried with ES cluster size = 2 on OpenStack; it took 1 hour for log entries to show up in the Kibana UI. So this does seem to be related to the IaaS, with time taken: OpenStack < AWS < GCE.
Reopening this specifically for the GCE platform, since the issue was reproduced: it took 3 hours for log entries to show up in the Kibana UI. A similar issue was not encountered on AWS, OpenStack, or even with the json-file log driver on GCE.

Images tested with:
On ops registry:
openshift3/logging-auth-proxy      e96b37a99960
openshift3/logging-kibana          27f978fc2946
openshift3/logging-fluentd         c493f8b4553b
openshift3/logging-elasticsearch   3ca95a8a9433
openshift3/logging-curator         e39988877cd9
openshift3/logging-deployer        1033ccb0557b

# openshift env
openshift v3.4.0.38
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0
(In reply to Xia Zhao from comment #13)
> Reopening this specifically for the GCE platform, since the issue was
> reproduced: it took 3 hours for log entries to show up in the Kibana UI. A
> similar issue was not encountered on AWS, OpenStack, or even with the
> json-file log driver on GCE.
> 
> Images tested with:
> On ops registry:
> openshift3/logging-auth-proxy      e96b37a99960
> openshift3/logging-kibana          27f978fc2946
> openshift3/logging-fluentd         c493f8b4553b
> openshift3/logging-elasticsearch   3ca95a8a9433
> openshift3/logging-curator         e39988877cd9
> openshift3/logging-deployer        1033ccb0557b
> 
> # openshift env
> openshift v3.4.0.38
> kubernetes v1.4.0+776c994
> etcd 3.1.0-rc.0

It seems the root cause is documented here:
https://docs.openshift.com/container-platform/3.3/install_config/aggregate_logging_sizing.html#install-config-aggregate-logging-sizing-guidelines-rate-limiting
Is this a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1410705 - "Roll over log entries are visible when journal-read-from-head is false"? There is a bug in the fluent-plugin-systemd plugin used when reading from journald: even if you set read_from_head "false", it will still try to read from the head, which may take up to a few hours. Try to reproduce this bug with the fix from https://bugzilla.redhat.com/show_bug.cgi?id=1410705.
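One rough way to see whether fluentd is still working its way up from the head is to compare the position it has committed with the journal's oldest and newest entries. This is only a sketch, not an official debugging step for this bug; it assumes the /var/log/journal.pos pos file seen elsewhere in this bug holds the plugin's saved journal cursor, and it should be run on the node where the fluentd pod runs:

# fluentd's saved position in the journal (a journal cursor string)
$ cat /var/log/journal.pos

# cursor of the newest journal entry
$ journalctl -o export -n 1 | grep '^__CURSOR='

# cursor of the oldest journal entry
$ journalctl -o export | grep -m 1 '^__CURSOR='

If the saved cursor corresponds to an entry near the beginning of the journal rather than near the end, fluentd is effectively still reading from the head.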
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1421011
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1421563
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1420219; will verify after BZ #1420219 gets fixed.
Tested on AWS with a 7.5 GB RAM machine; it is faster than before, but it still takes around 15-20 minutes for logs to show up in Kibana. Will verify again with a larger-memory machine and check whether it can be faster.
Please gather this additional information:

1) use `oc get pods` to find the name of the fluentd pod -> ${fluentd-pod}, then

oc exec ${fluentd-pod} -- cat /etc/fluent/configs.d/dynamic/input-syslog-default-syslog.conf

2) use `oc get pods` to find the name of the elasticsearch pod -> ${es-pod}, then

oc exec ${es-pod} -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es:9200/project.*/_search?size=10\&sort=@timestamp:desc | python -mjson.tool > apps-last-10.json

oc exec ${es-pod} -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es:9200/project.*/_search?size=10\&sort=@timestamp:asc | python -mjson.tool > apps-first-10.json

oc exec ${es-pod} -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es-ops:9200/.operations.*/_search?size=10\&sort=@timestamp:desc | python -mjson.tool > ops-last-10.json

oc exec ${es-pod} -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es-ops:9200/.operations.*/_search?size=10\&sort=@timestamp:asc | python -mjson.tool > ops-first-10.json

The rest of these should be done on the host where the fluentd pod is running:

3) date; date -u
4) ls -alrtF /var/log
5) cat /var/log/journal.pos
6) du -sk /var/log/journal/*
7) du -sk /run/log/journal/*
8) journalctl -o export | head -100
9) journalctl -o export | tail -100
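For convenience, the four Elasticsearch queries above can be wrapped in one small bash loop. This is just a sketch, using the same cert paths and output file names as above; the es_pod variable is a placeholder you need to replace with the actual ES pod name from `oc get pods`:

es_pod=REPLACE_WITH_ES_POD_NAME   # e.g. the logging-es-... pod from `oc get pods`

for spec in "logging-es|project.*|apps" "logging-es-ops|.operations.*|ops"; do
  IFS='|' read -r svc idx prefix <<< "$spec"
  for pair in desc:last asc:first; do
    sort=${pair%%:*}; name=${pair##*:}
    # fetch the newest/oldest 10 documents for this index pattern and save them locally
    oc exec "$es_pod" -- curl -s -k \
      --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      "https://$svc:9200/$idx/_search?size=10&sort=@timestamp:$sort" \
      | python -mjson.tool > "${prefix}-${name}-10.json"
  done
done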
@rmeggins Tested with a 16 GB AWS VM; user project logs could not be found after one hour. Attached is the info you wanted.

# oc get po
NAME                          READY     STATUS    RESTARTS   AGE
logging-curator-1-x2kp3       1/1       Running   0          50m
logging-es-60i3nvwq-1-kj3n8   1/1       Running   0          50m
logging-fluentd-sck4v         1/1       Running   0          50m
logging-kibana-1-tpwvl        2/2       Running   0          50m

1)
# oc exec logging-fluentd-sck4v -- cat /etc/fluent/configs.d/dynamic/input-syslog-default-syslog.conf
<source>
  @type systemd
  @label @INGRESS
  path "/run/log/journal"
  pos_file /var/log/journal.pos
  tag journal
</source>

2) See the attached apps-last-10.json and apps-first-10.json; the last two commands don't return a result.

# oc exec logging-es-60i3nvwq-1-kj3n8 -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es-ops:9200/.operations.*/_search?size=10\&sort=@timestamp:desc | python -mjson.tool > ops-last-10.json
No JSON object could be decoded

# oc exec logging-es-60i3nvwq-1-kj3n8 -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es-ops:9200/.operations.*/_search?size=10\&sort=@timestamp:asc | python -mjson.tool > ops-first-10.json
No JSON object could be decoded

3) For the info collected on the fluentd pod, please see the attached fluentd_pod_info.txt
Created attachment 1257160 [details] apps-last-10.json
Created attachment 1257161 [details] apps-first-10.json
Created attachment 1257162 [details] fluentd_pod_info.txt
According to comment 26, assigning this defect back: logs cannot be found in Kibana after one hour.
I guess I wasn't clear in these instructions. When I say "run on the host", I mean run on the node/machine, not inside the fluentd pod:

The rest of these should be done _on the host_ where the fluentd pod is running, not inside the fluentd pod.

3) date; date -u
4) ls -alrtF /var/log
5) cat /var/log/journal.pos
6) du -sk /var/log/journal/*
7) du -sk /run/log/journal/*
8) journalctl -o export | head -100
9) journalctl -o export | tail -100
"user project logs can not be found after one hour" To me, it looks as though it is the .operations.* logs that cannot be found. According to the apps*.json, you have this: "_index": "project.install-test.c41e7be6-fa2d-11e6-936f-0e4235f26c30.2017.02.24", "_source": { "@timestamp": "2017-02-24T03:21:47.912379+00:00", and "_index": "project.install-test.c41e7be6-fa2d-11e6-936f-0e4235f26c30.2017.02.24", "_source": { "@timestamp": "2017-02-24T02:37:57.900172+00:00", These are user project logs. The first one was generated at 2017-02-24T02:37:57.900172 and the last one at 2017-02-24T03:21:47. Are you saying that these logs are not showing up in kibana? For the .operations.* logs, try changing the curl slightly like this, since you don't have a separate ops cluster: # oc exec logging-es-60i3nvwq-1-kj3n8 -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es:9200/.operations.*/_search?size=10\&sort=@timestamp:desc | python -mjson.tool > ops-last-10.json # oc exec logging-es-60i3nvwq-1-kj3n8 -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es:9200/.operations.*/_search?size=10\&sort=@timestamp:asc | python -mjson.tool > ops-first-10.json
Also please attach the elasticsearch log
(In reply to Rich Megginson from comment #31)
> I guess I wasn't clear in these instructions. When I say "run on the host",
> I mean run on the node/machine, not inside the fluentd pod:
> 
> The rest of these should be done _on the host_ where the fluentd pod is
> running, not inside the fluentd pod.
> 
> 3) date; date -u
> 4) ls -alrtF /var/log
> 5) cat /var/log/journal.pos
> 6) du -sk /var/log/journal/*
> 7) du -sk /run/log/journal/*
> 8) journalctl -o export | head -100
> 9) journalctl -o export | tail -100

See the attached node_info.txt.
Created attachment 1257893 [details] node_info.txt
Created attachment 1257894 [details] node_info.txt, check this file please
(In reply to Rich Megginson from comment #32)
> "user project logs can not be found after one hour"
> 
> To me, it looks as though it is the .operations.* logs that cannot be found.
> According to the apps*.json, you have this:
> 
> "_index": "project.install-test.c41e7be6-fa2d-11e6-936f-0e4235f26c30.2017.02.24",
> "_source": {
>     "@timestamp": "2017-02-24T03:21:47.912379+00:00",
> 
> and
> 
> "_index": "project.install-test.c41e7be6-fa2d-11e6-936f-0e4235f26c30.2017.02.24",
> "_source": {
>     "@timestamp": "2017-02-24T02:37:57.900172+00:00",
> 
> These are user project logs. The first one was generated at
> 2017-02-24T02:37:57.900172 and the last one at 2017-02-24T03:21:47. Are you
> saying that these logs are not showing up in Kibana?
> 
> For the .operations.* logs, try changing the curl slightly like this, since
> you don't have a separate ops cluster:
> 
> # oc exec logging-es-60i3nvwq-1-kj3n8 -- curl -s -k --cert
> /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key
> https://logging-es:9200/.operations.*/_search?size=10\&sort=@timestamp:desc
> | python -mjson.tool > ops-last-10.json
> 
> # oc exec logging-es-60i3nvwq-1-kj3n8 -- curl -s -k --cert
> /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key
> https://logging-es:9200/.operations.*/_search?size=10\&sort=@timestamp:asc
> | python -mjson.tool > ops-first-10.json

The project install-test is created by OCP, not by a user.
Created attachment 1257902 [details] apps(ops)-first-10.json, apps(ops)-last-10.json
Created attachment 1257904 [details] es pod log
Log indices:

.kibana
.kibana.ef0b7ff169fdc9202e567ce53aa5e17320cb2d7d
.operations.2017.02.27
.searchguard.logging-es-6ln1tp4n-1-x5p10
project.install-test.b293a8a4-fc8a-11e6-86b2-0e9faec1ca4a.2017.02.27
project.java.2d0a1297-fc90-11e6-a9c8-0e9faec1ca4a.2017.02.27
project.logging.7d5500cd-fc8a-11e6-86b2-0e9faec1ca4a.2

The "logging" project is where logging is deployed; "java" is a user project.

Tested again in a clean env with 16 GB RAM; logs show up in around 5 minutes. It seems logs show up quickly in a clean env. Last time, I had installed and uninstalled logging a few times before checking the logs, and then the logs could not be found even after one hour.
How can I reproduce the situation where you install and uninstall a few times? Do the logs take too long only in that situation?

In the clean install case, what I would expect is that it takes a few minutes for all of the logs to show up after install, because of all of the initialization taking place during startup. But after everything is running, caches are warmed up, etc., it should only take a few seconds for logs to show up.

I have attached a shell script which can be used to test the timing. You'll need to be admin (oc login --username=system:admin) and be in the logging project (oc project logging).

$ ./check-log-speed.sh 2>&1 | tee log
No resources found.
good - wait_for_fluentd_to_catch_up: found 1 record project logging for 9e054c82-6d4e-433f-862d-b9a609053704
took 8 seconds for project log
good - wait_for_fluentd_to_catch_up: found 1 record project .operations for 8fb31270-c7d2-4a85-98be-c510d3cb70c0
took 4 seconds for operations log

This means it took 8 seconds for a log record from a project "logging" container to show up in Elasticsearch, and 4 seconds for a system log to show up in Elasticsearch in the .operations index.
Created attachment 1258225 [details] check-log-speed.sh
It is fast to read logs with the journald log driver after deploying the logging stack in a clean environment. But after a few hours, records for the last 15 minutes cannot be found in the Kibana UI; selecting 'Today' to show logs, the latest log shown was generated about 1 hour ago. It is the same after re-installing logging.

check-log-speed output after re-installing logging:

./check-log-speed.sh 2>&1 | tee log
No resources found.
good - wait_for_fluentd_to_catch_up: found 1 record project logging for 60ffdf2e-e3bc-425a-a940-ba2bf256f6e5
took 2084 seconds for project log
Error: found 0 for project .operations message 5c454526-2e7b-43fc-bf1a-5f2a9ee51750 - expected 1
{
    "_shards": {
        "failed": 0,
        "successful": 1,
        "total": 1
    },
    "count": 0
}
{
    "_shards": {
        "failed": 0,
        "successful": 1,
        "total": 1
    },
    "hits": {
        "hits": [],
        "max_score": null,
        "total": 0
    },
    "timed_out": false,
    "took": 3
}
failed - wait_for_fluentd_to_catch_up: not found 1 record project .operations for 5c454526-2e7b-43fc-bf1a-5f2a9ee51750
There is a better check-log-speed.sh tool available from https://github.com/openshift/origin-aggregated-logging/pull/347:

https://raw.githubusercontent.com/richm/origin-aggregated-logging/d7ba6a19283b57012cee2f3709c656509b8bee77/hack/check-log-speed.sh

Please use this to measure the fluentd/ES logging speed. Run it periodically, e.g. every minute, like this:

$ while true ; do
date
./check-log-speed.sh
done
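Note that the loop as written starts the next run as soon as the previous one finishes. If the script finishes quickly on your cluster and you want samples roughly a minute apart, you could add a sleep; this is a small variation on the loop above, not part of the PR:

$ while true ; do
date
./check-log-speed.sh
sleep 60   # space the samples roughly one minute apart
done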
It is extremely slow on GCE with the journald log driver. Attached is the check-log-speed output, collected using the script from comment 44.
Created attachment 1266886 [details] log speed check
How much RAM and disk space do the GCE machines have? How many cores? Is fluentd configured with JOURNAL_READ_FROM_HEAD=true? What disk storage is Elasticsearch using?
GCE VM type: n1-standard-4, 4 CPUs, 15 GB memory. Configured with openshift_logging_fluentd_journal_read_from_head=true, but there were only a few projects and not many logs. No PV is attached for Elasticsearch; the Docker storage driver is devicemapper.
(In reply to Junqi Zhao from comment #48)
> GCE VM type: n1-standard-4, 4 CPUs, 15 GB memory.

This is a pretty large machine; it shouldn't be a problem.

> Configured with openshift_logging_fluentd_journal_read_from_head=true, but
> there were only a few projects and not many logs.

If the journal is used, it is used both for system logs and project logs. If you are doing read_from_head=true, it may take hours and hours, depending on how many system logs are in the journal, before the project logs begin to show up. I regret ever giving the ability to set read_from_head=true so easily . . .

In the meantime, what does

$ journalctl | wc -l

say? That is, what is the size of the journal? Also:

$ journalctl | head -2

When does the journal begin?

> No PV is attached for Elasticsearch; the Docker storage driver is
> devicemapper.
The original environment is shut down; I will test again and provide the info.
> > Configured with openshift_logging_fluentd_journal_read_from_head=true, but
> > there were only a few projects and not many logs.
> 
> If the journal is used, it is used both for system logs and project logs.
> If you are doing read_from_head=true, it may take hours and hours, depending
> on how many system logs are in the journal, before the project logs begin to
> show up.
> 
> I regret ever giving the ability to set read_from_head=true so easily . . .

Even when we don't set read_from_head=true, it is still slow to show log entries in Kibana; it usually takes about 2 hours.
I think it's related to GCE. There were only a few people using GCE this morning and I tested again (GCE VM type: n1-standard-4, read_from_head=true not set); after about 2 minutes, log entries could be found in Kibana.

I remember on Monday, when not many colleagues were using GCE, log entries could be found in Kibana quickly, but when many colleagues were using GCE in the afternoon, it took a long time for log entries to show up.

# journalctl | wc -l
240394
# journalctl | head -2
-- Logs begin at Mon 2017-03-27 16:09:47 UTC, end at Wed 2017-03-29 02:11:10 UTC. --
Mar 27 16:09:47 localhost systemd-journal[88]: Runtime journal is using 8.0M (max allowed 91.9M, trying to leave 137.9M free of 911.7M available → current limit 91.9M).

# while true ; do
> date
> ./check-log-speed.sh
> done
Wed Mar 29 01:52:00 UTC 2017
Wed Mar 29 01:57:02 UTC 2017
Wed Mar 29 02:02:03 UTC 2017
Wed Mar 29 02:07:04 UTC 2017
(In reply to Junqi Zhao from comment #52)
> I think it's related to GCE. There were only a few people using GCE this
> morning and I tested again (GCE VM type: n1-standard-4, read_from_head=true
> not set); after about 2 minutes, log entries could be found in Kibana.
> 
> I remember on Monday, when not many colleagues were using GCE, log entries
> could be found in Kibana quickly, but when many colleagues were using GCE in
> the afternoon, it took a long time for log entries to show up.

I'm not sure why it would make a difference. Do we have one GCE machine that is shared by everyone?

> # journalctl | wc -l
> 240394
> # journalctl | head -2
> -- Logs begin at Mon 2017-03-27 16:09:47 UTC, end at Wed 2017-03-29 02:11:10 UTC. --
> Mar 27 16:09:47 localhost systemd-journal[88]: Runtime journal is using 8.0M
> (max allowed 91.9M, trying to leave 137.9M free of 911.7M available →
> current limit 91.9M).
> 
> # while true ; do
> > date
> > ./check-log-speed.sh
> > done
> Wed Mar 29 01:52:00 UTC 2017
> Wed Mar 29 01:57:02 UTC 2017
> Wed Mar 29 02:02:03 UTC 2017
> Wed Mar 29 02:07:04 UTC 2017

Is this the only output from check-log-speed.sh? It isn't telling how far fluentd is behind the journal, etc.?
> I'm not sure why it would make a difference. Do we have one GCE machine
> that is shared by everyone?

No, we use separate GCE machines. I don't know why it happens; AWS doesn't have this issue.

> > # while true ; do
> > > date
> > > ./check-log-speed.sh
> > > done
> > Wed Mar 29 01:52:00 UTC 2017
> > Wed Mar 29 01:57:02 UTC 2017
> > Wed Mar 29 02:02:03 UTC 2017
> > Wed Mar 29 02:07:04 UTC 2017
> 
> Is this the only output from check-log-speed.sh? It isn't telling how far
> fluentd is behind the journal, etc.?

Yes, the only output is the dates; there is no more information.
Hmm - try the latest version from the PR:
https://raw.githubusercontent.com/richm/origin-aggregated-logging/fabb0ccd79f86073ddef746c20d48f3987e112a8/hack/testing/check-log-speed.sh

$ wget https://raw.githubusercontent.com/richm/origin-aggregated-logging/fabb0ccd79f86073ddef746c20d48f3987e112a8/hack/testing/check-log-speed.sh
$ chmod +x check-log-speed.sh
$ ./check-log-speed.sh

If you still don't get output like this:

Fluentd is 4.999937 seconds behind the journal
Fluentd is 5 records behind the journal
Elasticsearch operations index is 5.174718 seconds behind the journal
Elasticsearch operations index is 0.174781 seconds behind Fluentd

then run it with

$ bash -x ./check-log-speed.sh > output 2>&1

and attach the output to this bz.

In general, I'm really not sure how to debug this problem further. I would need to be able to log in to the machine and investigate, because I'm not sure what to look for.
Created attachment 1267447 [details] output log of check-log-speed.sh
Created attachment 1267476 [details] output log of check-log-speed.sh please use this one
Yes, all pod logs could be shown right after deployment, but after a few hours, ALL OF THE POD LOGS ARE EMPTY.

# ps -ef | grep fluentd
root      50183  50165  0 05:34 ?        00:00:00 /usr/bin/ruby /usr/bin/fluentd
root      50368  50183  9 05:34 ?        00:33:43 /usr/bin/ruby /usr/bin/fluentd
root      93150  92577  0 11:38 pts/0    00:00:00 grep --color=auto fluentd
# ps -ef | grep java
root      93232  92577  0 11:39 pts/0    00:00:00 grep --color=auto java

It is related to GCE; AWS and OpenStack don't have this issue. Please ignore this and check the output of ./check-log-speed.sh in comment 61.
Tested during our routine testing, covering OpenStack, AWS, and GCE: log entries show up quickly in Kibana with journald as the logging driver. For GCE, it took a longer time to show log entries. Since this is specific to GCE and the other IaaSes don't have this issue now, closing this defect as VERIFIED.
(In reply to Junqi Zhao from comment #64)
> For GCE, it took a longer time to show log entries.

Change that to: "For GCE, it took a longer time to show log entries if many people are using GCE at the same time."
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1129