Bug 1397293 - It takes too long to show log entries on Kibana with journald log driver after scale up ES
Summary: It takes too long to show log entries on Kibana with journald log driver after scale up ES
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 3.4.z
Assignee: ewolinet
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-22 08:12 UTC by Xia Zhao
Modified: 2017-04-26 05:35 UTC
CC List: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-26 05:35:59 UTC
Target Upstream Version:
Embargoed:


Attachments
fluentd_log (4.26 KB, text/plain) - 2016-11-22 08:13 UTC, Xia Zhao - no flags
kibana log (245.28 KB, text/plain) - 2016-11-22 08:14 UTC, Xia Zhao - no flags
screenshot where Kibana got stuck while working around the "Apply these filters?" prompt (150.16 KB, image/png) - 2016-11-22 08:16 UTC, Xia Zhao - no flags
result of querying ES with index name *2016* (23.15 KB, text/plain) - 2016-11-22 08:19 UTC, Xia Zhao - no flags
apps-last-10.json (21.22 KB, text/plain) - 2017-02-24 07:18 UTC, Junqi Zhao - no flags
apps-first-10.json (21.22 KB, text/plain) - 2017-02-24 07:18 UTC, Junqi Zhao - no flags
fluentd_pod_info.txt (2.70 KB, text/plain) - 2017-02-24 07:19 UTC, Junqi Zhao - no flags
node_info.txt (12.86 KB, text/plain) - 2017-02-27 02:14 UTC, Junqi Zhao - no flags
node_info.txt, check this file please (12.92 KB, text/plain) - 2017-02-27 02:19 UTC, Junqi Zhao - no flags
apps(ops)-first-10.json, apps(ops)-last-10.json (5.62 KB, application/x-gzip) - 2017-02-27 02:46 UTC, Junqi Zhao - no flags
es pod log (17.70 KB, text/plain) - 2017-02-27 02:48 UTC, Junqi Zhao - no flags
check-log-speed.sh (3.98 KB, text/plain) - 2017-02-27 23:38 UTC, Rich Megginson - no flags
log speed check (80.76 KB, text/plain) - 2017-03-28 09:04 UTC, Junqi Zhao - no flags
output log of check-log-speed.sh (22.61 KB, text/plain) - 2017-03-30 08:52 UTC, Junqi Zhao - no flags
output log of check-log-speed.sh (22.61 KB, text/plain) - 2017-03-30 10:18 UTC, Junqi Zhao - no flags


Links
Red Hat Product Errata RHBA-2017:1129 (priority normal, status SHIPPED_LIVE): OpenShift Container Platform 3.5, 3.4, 3.3, and 3.2 bug fix update - last updated 2017-04-26 09:35:35 UTC

Comment 1 Xia Zhao 2016-11-22 08:13:46 UTC
Created attachment 1222617 [details]
fluentd_log

Comment 2 Xia Zhao 2016-11-22 08:14:13 UTC
Created attachment 1222618 [details]
kibana log

Comment 3 Xia Zhao 2016-11-22 08:16:34 UTC
Created attachment 1222619 [details]
screenshot where Kibana got stuck while working around the "Apply these filters?" prompt

Comment 4 Xia Zhao 2016-11-22 08:19:24 UTC
Created attachment 1222620 [details]
result of querying ES with index name *2016*

Comment 8 Xia Zhao 2016-11-23 08:39:29 UTC
However, I was now able to see log entries in Kibana 3 hours after the deployment completed, and it seems the EFK stacks restarted themselves 18 minutes ago:

$ oc get po
NAME                              READY     STATUS    RESTARTS   AGE
logging-curator-1-tvo0u           1/1       Running   1          18m
logging-curator-ops-1-eosmh       1/1       Running   1          18m
logging-es-ops-u804xrbi-1-6ch3y   1/1       Running   1          18m
logging-es-tf4pi7xr-1-dpcyi       1/1       Running   0          18m
logging-fluentd-sj5jr             1/1       Running   0          3h
logging-kibana-1-apmko            2/2       Running   1          18m
logging-kibana-ops-1-xi8fr        2/2       Running   1          18m

I am also able to work around bug #1388031 now, and bug #1397683 has disappeared from the Kibana UI.

I can see that the logs generated before EFK was deployed were collected by journald, but it seems to take very long even though read_from_head is set to "false" in fluentd.

The json-file log driver behaved faster here. The bug title, priority, and severity have been changed accordingly.

Comment 11 Xia Zhao 2016-11-24 05:34:51 UTC
Thanks for the confirmation, Eric. I retested the journald log driver on AWS; it took 2 hours for log entries to show up on the Kibana UI. AWS seems to behave faster than GCE, so the original issue may be related to the IaaS. Setting to VERIFIED for now; I will reopen if this is encountered again on another IaaS.

Env tested:
# openshift version
openshift v3.4.0.29+ca980ba
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

openshift3/logging-deployer    cfa18cbbc8f3
openshift3/logging-fluentd    7b11a29c82c1
openshift3/logging-elasticsearch    6716a0ad8b2b
openshift3/logging-auth-proxy    ec334b0c2669
openshift3/logging-kibana    7fc9916eea4d
openshift3/logging-curator    9af78fc06248

Comment 12 Xia Zhao 2016-11-28 08:29:20 UTC
Tried with an ES cluster size of 2 on OpenStack; it took 1 hour for log entries to show up on the Kibana UI. So this does seem related to the IaaS, with the time taken ordered OpenStack < AWS < GCE.
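For reference, the ES cluster size in these deployer-based 3.4 installs is set through a template parameter; a hedged sketch of the kind of invocation involved (this assumes the logging-deployer template's ES_CLUSTER_SIZE parameter, and the hostnames below are placeholders):

$ oc new-app logging-deployer-template \
    --param ES_CLUSTER_SIZE=2 \
    --param KIBANA_HOSTNAME=kibana.example.com \
    --param PUBLIC_MASTER_URL=https://master.example.com:8443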

Comment 13 Xia Zhao 2016-12-23 08:52:08 UTC
Reopening this specifically for the GCE platform, since the issue was reproduced there:
it took 3 hours for log entries to show up on the Kibana UI. A similar issue was not encountered on AWS or OpenStack, nor with the json-file log driver on GCE.

Images tested with:
On ops registry:
openshift3/logging-auth-proxy    e96b37a99960
openshift3/logging-kibana    27f978fc2946
openshift3/logging-fluentd    c493f8b4553b
openshift3/logging-elasticsearch    3ca95a8a9433
openshift3/logging-curator    e39988877cd9
openshift3/logging-deployer    1033ccb0557b

# openshift env
openshift v3.4.0.38
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

Comment 14 Xia Zhao 2016-12-23 09:44:45 UTC
(In reply to Xia Zhao from comment #13)
> Reopening this specifically for the GCE platform, since the issue was reproduced there:
> it took 3 hours for log entries to show up on the Kibana UI. A similar issue was not
> encountered on AWS or OpenStack, nor with the json-file log driver on GCE.
> 
> Images tested with:
> On ops registry:
> openshift3/logging-auth-proxy    e96b37a99960
> openshift3/logging-kibana    27f978fc2946
> openshift3/logging-fluentd    c493f8b4553b
> openshift3/logging-elasticsearch    3ca95a8a9433
> openshift3/logging-curator    e39988877cd9
> openshift3/logging-deployer    1033ccb0557b
> 
> # openshift env
> openshift v3.4.0.38
> kubernetes v1.4.0+776c994
> etcd 3.1.0-rc.0

The root cause seems to be documented here: https://docs.openshift.com/container-platform/3.3/install_config/aggregate_logging_sizing.html#install-config-aggregate-logging-sizing-guidelines-rate-limiting
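That section of the guide concerns systemd-journald rate limiting; a hedged sketch of checking and, if needed, relaxing it on a node (the values shown are illustrative, not the guide's recommendations):

$ grep -E 'RateLimit(Interval|Burst)' /etc/systemd/journald.conf
# e.g. set RateLimitInterval=1s and RateLimitBurst=10000 in /etc/systemd/journald.conf, then:
$ systemctl restart systemd-journald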

Comment 20 Rich Megginson 2017-02-10 03:47:41 UTC
Is this a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1410705 ("Roll over log entries are visible when journal-read-from-head is false")?

There is a bug in fluent-plugin-systemd, which is used when reading from journald: even if you set

  read_from_head "false"

it will still try to read from head, which may take up to a few hours.

Try to reproduce this bug with the fix from https://bugzilla.redhat.com/show_bug.cgi?id=1410705
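A quick, hedged way to confirm what the deployed fluentd is actually configured with (assuming the config layout under /etc/fluent/configs.d shown later in comment 26, with ${fluentd_pod} taken from `oc get pods`):

$ oc exec ${fluentd_pod} -- grep -ri read_from_head /etc/fluent/configs.d/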

Comment 21 Xia Zhao 2017-02-10 09:15:26 UTC
blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1421011

Comment 22 Xia Zhao 2017-02-13 05:52:59 UTC
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1421563

Comment 23 Junqi Zhao 2017-02-20 05:41:43 UTC
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1420219; will verify after BZ #1420219 gets fixed.

Comment 24 Junqi Zhao 2017-02-23 07:01:30 UTC
Tested on AWS with a 7.5 GB RAM machine; it is faster than before, but it still takes around 15-20 minutes for logs to show up in Kibana.

Will verify again with a larger-memory machine and check whether it is faster.

Comment 25 Rich Megginson 2017-02-23 17:21:24 UTC
Please gather this additional information:

1) use `oc get pods` to find the name of the fluentd pod -> ${fluentd-pod}  then
oc exec ${fluentd-pod} -- cat /etc/fluent/configs.d/dynamic/input-syslog-default-syslog.conf

2) use `oc get pods` to find the name of the elasticsearch pod -> ${es-pod} then

oc exec ${es-pod} -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es:9200/project.*/_search?size=10\&sort=@timestamp:desc | python -mjson.tool > apps-last-10.json
oc exec ${es-pod} -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es:9200/project.*/_search?size=10\&sort=@timestamp:asc | python -mjson.tool > apps-first-10.json
oc exec ${es-pod} -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es-ops:9200/.operations.*/_search?size=10\&sort=@timestamp:desc | python -mjson.tool > ops-last-10.json
oc exec ${es-pod} -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es-ops:9200/.operations.*/_search?size=10\&sort=@timestamp:asc | python -mjson.tool > ops-first-10.json

The rest of these should be done on the host where the fluentd pod is running (a combined capture sketch follows the list):

3) date; date -u
4) ls -alrtF /var/log
5) cat /var/log/journal.pos
6) du -sk /var/log/journal/*
7) du -sk /run/log/journal/*
8) journalctl -o export | head -100
9) journalctl -o export | tail -100
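For convenience, steps 3-9 can be captured in one pass with a small loop run as root on that host (a sketch, not an official tool):

for cmd in 'date; date -u' 'ls -alrtF /var/log' 'cat /var/log/journal.pos' \
           'du -sk /var/log/journal/*' 'du -sk /run/log/journal/*' \
           'journalctl -o export | head -100' 'journalctl -o export | tail -100' ; do
  echo "=== $cmd ==="   # label each section of the capture
  eval "$cmd"
done > node_info.txt 2>&1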

Comment 26 Junqi Zhao 2017-02-24 07:16:05 UTC
@rmeggins
Tested with a 16 GB AWS VM; user project logs cannot be found after one hour.
The info you requested is attached.

# oc get po
NAME                          READY     STATUS    RESTARTS   AGE
logging-curator-1-x2kp3       1/1       Running   0          50m
logging-es-60i3nvwq-1-kj3n8   1/1       Running   0          50m
logging-fluentd-sck4v         1/1       Running   0          50m
logging-kibana-1-tpwvl        2/2       Running   0          50m


1) # oc exec logging-fluentd-sck4v -- cat /etc/fluent/configs.d/dynamic/input-syslog-default-syslog.conf
<source>
  @type systemd
  @label @INGRESS
  path "/run/log/journal"
  pos_file /var/log/journal.pos
  tag journal
</source>

2) See the attached apps-last-10.json and apps-first-10.json; the last two queries did not return a result:
# oc exec logging-es-60i3nvwq-1-kj3n8 -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es-ops:9200/.operations.*/_search?size=10\&sort=@timestamp:desc | python -mjson.tool > ops-last-10.json
No JSON object could be decoded
# oc exec logging-es-60i3nvwq-1-kj3n8 -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es-ops:9200/.operations.*/_search?size=10\&sort=@timestamp:asc | python -mjson.tool > ops-first-10.json
No JSON object could be decoded

3) For info on the fluentd pod, please see the attached fluentd_pod_info.txt.

Comment 27 Junqi Zhao 2017-02-24 07:18:00 UTC
Created attachment 1257160 [details]
apps-last-10.json

Comment 28 Junqi Zhao 2017-02-24 07:18:36 UTC
Created attachment 1257161 [details]
apps-first-10.json

Comment 29 Junqi Zhao 2017-02-24 07:19:15 UTC
Created attachment 1257162 [details]
fluentd_pod_info.txt

Comment 30 Junqi Zhao 2017-02-24 08:20:54 UTC
According to comment 26, assigning this defect back; logs cannot be found in Kibana after one hour.

Comment 31 Rich Megginson 2017-02-24 15:06:41 UTC
I guess I wasn't clear in these instructions.  When I say "run on the host" I mean run on the node/machine, not inside the fluentd pod:

The rest of these should be done _on the host_ where the fluentd pod is running, not inside the fluentd pod.

3) date; date -u
4) ls -alrtF /var/log
5) cat /var/log/journal.pos
6) du -sk /var/log/journal/*
7) du -sk /run/log/journal/*
8) journalctl -o export | head -100
9) journalctl -o export | tail -100

Comment 32 Rich Megginson 2017-02-24 15:11:46 UTC
"user project logs can not be found after one hour"

To me, it looks as though it is the .operations.* logs that cannot be found.  According to the apps*.json, you have this:

                "_index": "project.install-test.c41e7be6-fa2d-11e6-936f-0e4235f26c30.2017.02.24",
                "_source": {
                    "@timestamp": "2017-02-24T03:21:47.912379+00:00",

and 

                "_index": "project.install-test.c41e7be6-fa2d-11e6-936f-0e4235f26c30.2017.02.24",
                "_source": {
                    "@timestamp": "2017-02-24T02:37:57.900172+00:00",

These are user project logs.  The first one was generated at 2017-02-24T02:37:57.900172 and the last one at 2017-02-24T03:21:47.  Are you saying that these logs are not showing up in kibana?

For the .operations.* logs, try changing the curl slightly like this, since you don't have a separate ops cluster:

# oc exec logging-es-60i3nvwq-1-kj3n8 -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es:9200/.operations.*/_search?size=10\&sort=@timestamp:desc | python -mjson.tool > ops-last-10.json

# oc exec logging-es-60i3nvwq-1-kj3n8 -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es:9200/.operations.*/_search?size=10\&sort=@timestamp:asc | python -mjson.tool > ops-first-10.json

Comment 33 Rich Megginson 2017-02-24 21:17:39 UTC
Also please attach the elasticsearch log

Comment 34 Junqi Zhao 2017-02-27 02:14:22 UTC
(In reply to Rich Megginson from comment #31)
> I guess I wasn't clear on this instructions.  When I say "run on the host" I
> mean run on the node/machine, not inside the fluentd pod:
> 
> The rest of these should be done _on the host_ where the fluentd pod is
> running, not inside the fluentd pod.
> 
> 3) date; date -u
> 4) ls -alrtF /var/log
> 5) cat /var/log/journal.pos
> 6) du -sk /var/log/journal/*
> 7) du -sk /run/log/journal/*
> 8) journalctl -o export | head -100
> 9) journalctl -o export | tail -100

See attached node_info.txt

Comment 35 Junqi Zhao 2017-02-27 02:14:56 UTC
Created attachment 1257893 [details]
node_info.txt

Comment 36 Junqi Zhao 2017-02-27 02:19:51 UTC
Created attachment 1257894 [details]
node_info.txt, check this file please

Comment 37 Junqi Zhao 2017-02-27 02:28:33 UTC
(In reply to Rich Megginson from comment #32)
> "user project logs can not be found after one hour"
> 
> To me, it looks as though it is the .operations.* logs that cannot be found.
> According to the apps*.json, you have this:
> 
>                 "_index":
> "project.install-test.c41e7be6-fa2d-11e6-936f-0e4235f26c30.2017.02.24",
>                 "_source": {
>                     "@timestamp": "2017-02-24T03:21:47.912379+00:00",
> 
> and 
> 
>                 "_index":
> "project.install-test.c41e7be6-fa2d-11e6-936f-0e4235f26c30.2017.02.24",
>                 "_source": {
>                     "@timestamp": "2017-02-24T02:37:57.900172+00:00",
> 
> These are user project logs.  The first one was generated at
> 2017-02-24T02:37:57.900172 and the last one at 2017-02-24T03:21:47.  Are you
> saying that these logs are not showing up in kibana?
> 
> For the .operations.* logs, try changing the curl slightly like this, since
> you don't have a separate ops cluster:
> 
> # oc exec logging-es-60i3nvwq-1-kj3n8 -- curl -s -k --cert
> /etc/elasticsearch/secret/admin-cert --key
> /etc/elasticsearch/secret/admin-key
> https://logging-es:9200/.operations.*/_search?size=10\&sort=@timestamp:desc
> | python -mjson.tool > ops-last-10.json
> 
> # oc exec logging-es-60i3nvwq-1-kj3n8 -- curl -s -k --cert
> /etc/elasticsearch/secret/admin-cert --key
> /etc/elasticsearch/secret/admin-key
> https://logging-es:9200/.operations.*/_search?size=10\&sort=@timestamp:asc |
> python -mjson.tool > ops-first-10.json

The install-test project is created by OCP, not by the user.

Comment 38 Junqi Zhao 2017-02-27 02:46:00 UTC
Created attachment 1257902 [details]
apps(ops)-first-10.json, apps(ops)-last-10.json

Comment 39 Junqi Zhao 2017-02-27 02:48:34 UTC
Created attachment 1257904 [details]
es pod log

Comment 40 Junqi Zhao 2017-02-27 02:56:57 UTC
log indices:
.kibana
.kibana.ef0b7ff169fdc9202e567ce53aa5e17320cb2d7d
.operations.2017.02.27
.searchguard.logging-es-6ln1tp4n-1-x5p10
project.install-test.b293a8a4-fc8a-11e6-86b2-0e9faec1ca4a.2017.02.27
project.java.2d0a1297-fc90-11e6-a9c8-0e9faec1ca4a.2017.02.27
project.logging.7d5500cd-fc8a-11e6-86b2-0e9faec1ca4a.2

The logging project is where logging is deployed; the java project is a user project.


Tested again in a clean environment with 16 GB RAM; logs showed up in around 5 minutes. It seems logs show up quickly in a clean environment. Last time, I had installed and uninstalled logging a few times before checking the logs, and then the logs could not be found even after one hour.

Comment 41 Rich Megginson 2017-02-27 23:37:20 UTC
How can I reproduce the situation where you install and uninstall a few times?  Do the logs take too long only in that situation?

In the clean install case, what I would expect is that it takes a few minutes for all of the logs to show up after install because of all of the initializations taking place during startup.  But after everything is running, caches are warmed up, etc. it should only take a few seconds for logs to show up.

I have attached a shell script which can be used to test the timing.  You'll need to be admin (oc login --username=system:admin) and be in the logging project (oc project logging).

$ ./check-log-speed.sh 2>&1 | tee log
No resources found.
good - wait_for_fluentd_to_catch_up: found 1 record project logging for 9e054c82-6d4e-433f-862d-b9a609053704
took 8 seconds for project log
good - wait_for_fluentd_to_catch_up: found 1 record project .operations for 8fb31270-c7d2-4a85-98be-c510d3cb70c0
took 4 seconds for operations log

This means it took 8 seconds for a log record to show up in Elasticsearch from a project "logging" container, and 4 seconds for a system log to show up in Elasticsearch in the .operations index.
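For reference, the core idea behind the script can be sketched roughly like this (a simplified illustration only, not the attached script; it assumes ${es_pod} holds the Elasticsearch pod name and that the host's journal records end up in the .operations.* indices):

uuid=$(uuidgen)
logger "check-log-speed marker $uuid"   # write a uniquely tagged record into the journal
start=$(date +%s)
# poll Elasticsearch until the marker shows up in the .operations.* indices
while ! oc exec ${es_pod} -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      "https://logging-es:9200/.operations.*/_count?q=%22$uuid%22" | grep -q '"count":[1-9]' ; do
  sleep 1
done
echo "took $(( $(date +%s) - start )) seconds for operations log"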

Comment 42 Rich Megginson 2017-02-27 23:38:04 UTC
Created attachment 1258225 [details]
check-log-speed.sh

Comment 43 Junqi Zhao 2017-03-13 03:52:48 UTC
Logs are read quickly with the journald log driver right after the logging stacks are deployed in a clean environment. But after a few hours, records from the last 15 minutes cannot be found in the Kibana UI; when selecting 'Today' to show logs, the latest log shown was generated about 1 hour ago. It is the same after re-installing logging.

Check-log-speed output after re-installing logging:
./check-log-speed.sh 2>&1 | tee log
No resources found.
good - wait_for_fluentd_to_catch_up: found 1 record project logging for 60ffdf2e-e3bc-425a-a940-ba2bf256f6e5
took 2084 seconds for project log
Error: found 0 for project .operations message 5c454526-2e7b-43fc-bf1a-5f2a9ee51750 - expected 1
{
    "_shards": {
        "failed": 0,
        "successful": 1,
        "total": 1
    },
    "count": 0
}
{
    "_shards": {
        "failed": 0,
        "successful": 1,
        "total": 1
    },
    "hits": {
        "hits": [],
        "max_score": null,
        "total": 0
    },
    "timed_out": false,
    "took": 3
}
failed - wait_for_fluentd_to_catch_up: not found 1 record project .operations for 5c454526-2e7b-43fc-bf1a-5f2a9ee51750

Comment 44 Rich Megginson 2017-03-13 22:45:33 UTC
There is a better check-log-speed.sh tool available from https://github.com/openshift/origin-aggregated-logging/pull/347:

https://raw.githubusercontent.com/richm/origin-aggregated-logging/d7ba6a19283b57012cee2f3709c656509b8bee77/hack/check-log-speed.sh

Please use this to measure the fluentd/ES logging speed.  Run it periodically, e.g. every minute, like this:

$ while true ; do
  date
  ./check-log-speed.sh
done
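If the script itself returns quickly, an explicit pause keeps the cadence at roughly one minute (a minor variant, not part of the PR):

$ while true ; do date ; ./check-log-speed.sh ; sleep 60 ; done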

Comment 45 Junqi Zhao 2017-03-28 09:04:12 UTC
It is extremely slow on GCE with the journald log driver; attached is the check-log-speed output produced with the script from comment 44.

Comment 46 Junqi Zhao 2017-03-28 09:04:50 UTC
Created attachment 1266886 [details]
log speed check

Comment 47 Rich Megginson 2017-03-28 14:02:31 UTC
How much RAM and disk space do the GCE machines have?  How many cores?
Is fluentd configured with JOURNAL_READ_FROM_HEAD=true?
What disk storage is Elasticsearch using?

Comment 48 Junqi Zhao 2017-03-29 00:33:06 UTC
GCE VM type: n1-standard-4, 4 CPUs, 15 GB memory, configured with openshift_logging_fluentd_journal_read_from_head=true, but there were only a few projects and they did not have many logs. No PV is attached for Elasticsearch, and the Docker storage driver is devicemapper.

Comment 49 Rich Megginson 2017-03-29 01:17:53 UTC
(In reply to Junqi Zhao from comment #48)
> GCE VM type:n1-standard-4, 4 cpus, 15 GB memory, 

This is a pretty large machine; it shouldn't be a problem.

> configured with
> openshift_logging_fluentd_journal_read_from_head=true, but there were only a
> few projects and did not have many logs,

If the journal is used, it is used both for system logs and project logs.  If you are doing read_from_head=true, it may take hours and hours, depending on how many system logs are in the journal, before the project logs begin to show up.

I regret ever giving the ability to set read_from_head=true so easily . . .

In the meantime, what does

$ journalctl | wc -l

say?  That is, what is the size of the journal?  Also

$ journalctl | head -2

When does the journal begin?

> no pv attached for Elasticsearch,
> Docker Storage Driver is Devicemapper

Comment 50 Junqi Zhao 2017-03-29 01:27:38 UTC
The original environment has been shut down; I will test again and provide the info.

Comment 51 Junqi Zhao 2017-03-29 01:38:48 UTC
> > configured with
> > openshift_logging_fluentd_journal_read_from_head=true, but there were only a
> > few projects and did not have many logs,
> 
> If the journal is used, it is used both for system logs and project logs. 
> If you are doing read_from_head=true, it may take hours and hours, depending
> on how many system logs are in the journal, before the project logs begin to
> show up.
> 
> I regret ever giving the ability to set read_from_head=true so easily . . .

Even if we don't set read_from_head=true, it is still slow to show log entries on Kibana; it usually takes about 2 hours.

Comment 52 Junqi Zhao 2017-03-29 02:12:01 UTC
I think it's related to GCE. There were only a few people using GCE this morning, and I tested again (GCE VM type: n1-standard-4, read_from_head=true not set); after about 2 minutes, log entries could be found on Kibana.

I remember that this Monday, when not many colleagues were using GCE, log entries could be found on Kibana quickly, but when many colleagues were using GCE in the afternoon, it took a long time for log entries to show up.


# journalctl | wc -l
240394
# journalctl | head -2
-- Logs begin at Mon 2017-03-27 16:09:47 UTC, end at Wed 2017-03-29 02:11:10 UTC. --
Mar 27 16:09:47 localhost systemd-journal[88]: Runtime journal is using 8.0M (max allowed 91.9M, trying to leave 137.9M free of 911.7M available → current limit 91.9M).



# while true ; do
>   date
>   ./check-log-speed.sh
> done
Wed Mar 29 01:52:00 UTC 2017
Wed Mar 29 01:57:02 UTC 2017
Wed Mar 29 02:02:03 UTC 2017
Wed Mar 29 02:07:04 UTC 2017

Comment 53 Rich Megginson 2017-03-29 03:09:28 UTC
(In reply to Junqi Zhao from comment #52)
> I think it's related to GCE. There were only a few people using GCE this
> morning, and I tested again (GCE VM type: n1-standard-4, read_from_head=true
> not set); after about 2 minutes, log entries could be found on Kibana.
> 
> I remember that this Monday, when not many colleagues were using GCE, log
> entries could be found on Kibana quickly, but when many colleagues were
> using GCE in the afternoon, it took a long time for log entries to show up.

I'm not sure why it would make a difference. Do we have one GCE machine that is shared by everyone?

> 
> 
> # journalctl | wc -l
> 240394
> # journalctl | head -2
> -- Logs begin at Mon 2017-03-27 16:09:47 UTC, end at Wed 2017-03-29 02:11:10
> UTC. --
> Mar 27 16:09:47 localhost systemd-journal[88]: Runtime journal is using 8.0M
> (max allowed 91.9M, trying to leave 137.9M free of 911.7M available →
> current limit 91.9M).
> 
> 
> 
> # while true ; do
> >   date
> >   ./check-log-speed.sh
> > done
> Wed Mar 29 01:52:00 UTC 2017
> Wed Mar 29 01:57:02 UTC 2017
> Wed Mar 29 02:02:03 UTC 2017
> Wed Mar 29 02:07:04 UTC 2017

Is this the only output from check-log-speed.sh?  It isn't telling how far fluentd is behind journal, etc.?

Comment 54 Junqi Zhao 2017-03-29 05:20:57 UTC
> I'm not sure why it would make a difference. Do we have one GCE machine that
> is shared by everyone?
> 
No, we each use a separate GCE machine. I don't know why it happens; AWS doesn't have such an issue.

> > # while true ; do
> > >   date
> > >   ./check-log-speed.sh
> > > done
> > Wed Mar 29 01:52:00 UTC 2017
> > Wed Mar 29 01:57:02 UTC 2017
> > Wed Mar 29 02:02:03 UTC 2017
> > Wed Mar 29 02:07:04 UTC 2017
> 
> Is this the only output from check-log-speed.sh?  It isn't telling how far
> fluentd is behind journal, etc.?

Yes, it only outputs the date, nothing more.

Comment 55 Rich Megginson 2017-03-29 21:27:34 UTC
Hmm - try the latest version from the PR: https://raw.githubusercontent.com/richm/origin-aggregated-logging/fabb0ccd79f86073ddef746c20d48f3987e112a8/hack/testing/check-log-speed.sh

$ wget https://raw.githubusercontent.com/richm/origin-aggregated-logging/fabb0ccd79f86073ddef746c20d48f3987e112a8/hack/testing/check-log-speed.sh

$ chmod +x check-log-speed.sh

$ ./check-log-speed.sh

If you still don't get output like this:
Fluentd is 4.999937 seconds behind the journal
Fluentd is 5 records behind the journal
Elasticsearch operations index is 5.174718 seconds behind the journal
Elasticsearch operations index is 0.174781 seconds behind Fluentd

Then run with

$ bash -x ./check-log-speed.sh 2>&1 | tee output

and attach the output file to this BZ.

In general, I'm really not sure how to debug this problem further.  I would need to be able to log in to the machine and investigate, because I'm not sure what to look for.

Comment 56 Junqi Zhao 2017-03-30 08:52:04 UTC
Created attachment 1267447 [details]
output log of check-log-speed.sh

Comment 58 Junqi Zhao 2017-03-30 10:18:23 UTC
Created attachment 1267476 [details]
output log of check-log-speed.sh

please use this one

Comment 62 Junqi Zhao 2017-04-01 11:42:00 UTC
Yes, all pod logs could be shown right after deployment, but after a few hours ALL OF THE POD LOGS ARE EMPTY.

# ps -ef | grep fluentd
root      50183  50165  0 05:34 ?        00:00:00 /usr/bin/ruby /usr/bin/fluentd
root      50368  50183  9 05:34 ?        00:33:43 /usr/bin/ruby /usr/bin/fluentd
root      93150  92577  0 11:38 pts/0    00:00:00 grep --color=auto fluentd
# ps -ef | grep java
root      93232  92577  0 11:39 pts/0    00:00:00 grep --color=auto java

It is related to GCE; AWS and OpenStack don't have this issue. Please ignore it and check the output of ./check-log-speed.sh in comment 61.

Comment 64 Junqi Zhao 2017-04-05 00:54:58 UTC
Tested during our routine testing covering OpenStack, AWS, and GCE; log entries show up quickly on Kibana with journald as the logging driver. For GCE, it took a longer time to show log entries; since that is specific to GCE and the other IaaSes don't have such an issue now, closing this defect as VERIFIED.

Comment 65 Junqi Zhao 2017-04-05 00:57:37 UTC
(In reply to Junqi Zhao from comment #64)

> For GCE, it took a longer time to show log entries

Change that to: 'For GCE, it took a longer time to show log entries if there are many people using GCE at the same time.'

Comment 67 errata-xmlrpc 2017-04-26 05:35:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1129

