Created attachment 1583376 [details]
rsyslog pod logs

Description of problem:
When testing the rsyslog log collector I am not seeing the correct number of documents in the resulting elasticsearch index. With 10 pods producing 30k logs at 50 per second I have generally observed 0-50 logs missing per index. The problem only gets worse as the rate gets higher.
-With 20 pods producing 30k logs at 50 per second I can see some indices missing thousands of logs.
-With 10 pods producing 30k logs at 100 per second only 4 out of 10 of my elasticsearch indices were created at all (although only 1 missing log out of those)
I have not seen the same issues when testing with the fluentd log collector.
Originally I thought this may be due to rsyslog pods going OOM as I was seeing rsyslog pod restarts, but after increasing the memory limit to 1Gi the issue still stands.

Version-Release number of selected component (if applicable):

How reproducible:
Very consistent, although some indices do get the expected count while others may only have a small fraction of the expected docs.

Steps to Reproduce:
1. Create clusterlogging operator with rsyslog log collector
2. Create pods that will generate a given number of logs at a given rate
3. Check elasticsearch index document count

Actual results:
Not all logs showing up in elasticsearch index

Expected results:
All logs show up in elasticsearch index

Additional info:
Also seeing this log line on startup of the cluster logging instance, but I assume once the elasticsearch pods become reachable it is no longer an issue.
rsyslogd: omelasticsearch: we are suspending ourselfs due to server failure 7: Failed connect to elasticsearch:9200; No route to host [v8.37.0-9.el7 try http://www.rsyslog.com/e/2007 ]
See attachment for logs of all rsyslog pods.
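For step 3, a quick way to pull the per-index document counts is to run _cat/indices from inside an elasticsearch pod; a rough sketch (the es_util helper and -c elasticsearch container name are the ones used later in this bug, the label selector is an assumption):

espod=$( oc get pods -l component=elasticsearch -o jsonpath='{.items[0].metadata.name}' )
oc exec -c elasticsearch $espod -- es_util --query=_cat/indices?v | grep project.logtest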
> Also seeing this log line on startup of the cluster logging instance, but I assume once the elasticsearch pods become reachable it is no longer an issue. rsyslogd: omelasticsearch: we are suspending ourselfs due to server failure 7: Failed connect to elasticsearch:9200; No route to host [v8.37.0-9.el7 try http://www.rsyslog.com/e/2007 ]

Right. It should clear up.

How did you generate the rsyslog logs?

In addition to `oc logs -c rsyslog $rpod` you need to do `oc exec -c rsyslog $rpod -- logs`

We should also take a look at the prometheus stats.
https://github.com/openshift/origin-aggregated-logging/blob/master/test/zzz-rsyslog.sh#L238
e.g. something like this:

for ii in $( seq 1 600 ) ; do
  for rpod in $( oc get pods -l component=rsyslog -o jsonpath='{.items[*].metadata.name}' ) ; do
    rpod_ip="$( oc get pod $rpod -o jsonpath='{.status.podIP}' )"
    curl -s -k https://${rpod_ip}:24231/metrics >> /var/tmp/${rpod}-metrics-scrape 2>&1
  done
  sleep 10
done

The metrics can indicate problems that might not be logged by default.
Also, is this with 4.1 or 4.2? The Version field says 4.1. I'm assuming you deployed the latest 4.2 candidate, but just want to make sure.
The version I tested on was 4.1.2

I generated the logs using cluster loader and ocp_logtest (https://github.com/openshift/svt/blob/master/openshift_scalability/content/logtest/ocp_logtest-README.md)

I will work on grabbing the additional rsyslog pod logs and prometheus stats you mentioned.
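For reference, each logtest pod's workload is roughly equivalent to a loop like the one below. This is a standalone sketch only, not the actual ocp_logtest code; RATE and LINES mirror the cluster loader parameters, and the message shape mirrors the SVTLogger lines shown later in this bug.

#!/bin/bash
# emit LINES numbered ~1KB log lines at roughly RATE lines per second
RATE=${RATE:-50}
LINES=${LINES:-30000}
PAD=$( head -c 950 /dev/zero | tr '\0' 'x' )
for i in $( seq 1 $LINES ) ; do
    echo "$( date -u '+%Y-%m-%d %H:%M:%S' ) - SVTLogger - INFO - $( hostname ) : $i : $PAD"
    sleep $( awk -v r=$RATE 'BEGIN { printf "%f", 1/r }' )
done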
(In reply to Eric Matysek from comment #3)
> The version I tested on was 4.1.2

rsyslog wasn't fully supported in 4.1. Please try the latest 4.2 candidate to see if you can reproduce with the latest supported rsyslog.

> I generated the logs using cluster loader and ocp_logtest
> (https://github.com/openshift/svt/blob/master/openshift_scalability/content/logtest/ocp_logtest-README.md)
>
> I will work on grabbing the additional rsyslog pod logs and prometheus stats
> you mentioned.
Verified I am still seeing the issue on build 4.2.0-0.nightly-2019-06-24-112621

I expected 30k documents in these indices:

green open project.logtest-rsyslog-4-0.1e50a0c8-96af-11e9-9f33-0e27288c199a.2019.06.24 onuVsHOhS1CrH1MfbFww3g 3 1 29996 0 65.6mb 32.8mb
green open project.logtest-rsyslog-4-1.1e46eb2a-96af-11e9-bd21-12e0af33f230.2019.06.24 LlU8OxzLSBCnk1HTz5e36g 3 1 29984 0 65.3mb 32.3mb
green open project.logtest-rsyslog-4-2.1e5250fc-96af-11e9-bd21-12e0af33f230.2019.06.24 g6tycm1sTFOIBKpCldoVsQ 3 1 29994 0 64.1mb 32mb
green open project.logtest-rsyslog-4-3.1e5361d7-96af-11e9-bd21-12e0af33f230.2019.06.24 GQyZjzW2ShOMN42mz728Tg 3 1 29986 0 64.7mb 32.1mb
green open project.logtest-rsyslog-4-4.1e57d6d6-96af-11e9-9f33-0e27288c199a.2019.06.24 QstnJWsASbeVwr29CPs1mA 3 1 29991 0 65.4mb 32.8mb
...omitting 6-19...

Trying to grab logs as you suggested gave me this for all of my rsyslog pods:

[ematysek@jump ~]$ oc exec -c rsyslog rsyslog-58pg8 -- logs
ls: cannot access /var/log/rsyslog/rsyslog*: No such file or directory

Working on grabbing prometheus metrics
ls -alrtF /var/log/rsyslog /var/lib/rsyslog.pod
Also could you share output from these command lines?
$ oc describe ds rsyslog
$ oc get events | grep -i warning
(In reply to Rich Megginson from comment #6)
> ls -alrtF /var/log/rsyslog /var/lib/rsyslog.pod

[root@rsyslog-2lbgn /]# ls -alrtF /var/log/rsyslog /var/lib/rsyslog.pod
/var/log/rsyslog:
total 0
drwxr-xr-x. 12 root root 189 Jun 25 18:09 ../
drwxr-xr-x.  2 root root   6 Jun 25 18:09 ./

/var/lib/rsyslog.pod:
total 1232
-rw-------. 1 root root    216 Jun 25 19:19 imfile-state:75547144
-rw-------. 1 root root    174 Jun 25 19:19 imfile-state:50530172
-rw-------. 1 root root    206 Jun 25 19:19 imfile-state:25715
-rw-------. 1 root root    206 Jun 25 19:19 imfile-state:251867675
-rw-------. 1 root root    221 Jun 25 19:19 imfile-state:251722273
-rw-------. 1 root root    197 Jun 25 19:19 imfile-state:235350489
-rw-------. 1 root root    209 Jun 25 19:19 imfile-state:235181738
-rw-------. 1 root root    191 Jun 25 19:19 imfile-state:176177139
-rw-------. 1 root root    216 Jun 25 19:19 imfile-state:167811834
-rw-------. 1 root root    215 Jun 25 19:19 imfile-state:12159287
-rw-------. 1 root root    196 Jun 25 19:19 imfile-state:12139139
-rw-------. 1 root root    196 Jun 25 19:19 imfile-state:92287048
-rw-------. 1 root root    201 Jun 25 19:19 imfile-state:83905953
-rw-------. 1 root root    192 Jun 25 19:19 imfile-state:50475188
-rw-------. 1 root root    190 Jun 25 19:19 imfile-state:50364301
-rw-------. 1 root root    185 Jun 25 19:19 imfile-state:50333793
-rw-------. 1 root root    189 Jun 25 19:19 imfile-state:33595529
-rw-------. 1 root root    195 Jun 25 19:19 imfile-state:33595524
-rw-------. 1 root root    185 Jun 25 19:19 imfile-state:33595452
-rw-------. 1 root root    176 Jun 25 19:19 imfile-state:33556122
-rw-------. 1 root root    198 Jun 25 19:19 imfile-state:260056133
-rw-------. 1 root root    195 Jun 25 19:19 imfile-state:25213637
-rw-------. 1 root root    182 Jun 25 19:19 imfile-state:234882904
-rw-------. 1 root root    220 Jun 25 19:19 imfile-state:226493924
-rw-------. 1 root root    172 Jun 25 19:19 imfile-state:218109513
-rw-------. 1 root root    197 Jun 25 19:19 imfile-state:192942257
-rw-------. 1 root root    192 Jun 25 19:19 imfile-state:159455953
-rw-------. 1 root root    190 Jun 25 19:19 imfile-state:159384795
-rw-------. 1 root root    184 Jun 25 19:19 imfile-state:159384696
-rw-------. 1 root root    195 Jun 25 19:19 imfile-state:150995985
-rw-------. 1 root root    191 Jun 25 19:19 imfile-state:134240410
-rw-------. 1 root root    166 Jun 25 19:19 imfile-state:125829935
-rw-------. 1 root root    189 Jun 25 19:19 imfile-state:117465705
-rw-------. 1 root root    191 Jun 25 19:19 imfile-state:117441875
-rw-------. 1 root root    174 Jun 25 19:19 imfile-state:92275490
-rw-------. 1 root root    191 Jun 25 19:19 imfile-state:218104515
-rw-------. 1 root root    182 Jun 25 19:19 imfile-state:209715439
drwxr-xr-x. 1 root root     39 Jun 25 19:21 ../
-rw-------. 1 root root    201 Jun 25 19:38 imfile-state:243486924
-rw-------. 1 root root    201 Jun 25 19:38 imfile-state:109232691
-rw-------. 1 root root    201 Jun 25 19:38 imfile-state:25253818
-rw-------. 1 root root    200 Jun 25 19:38 imfile-state:184845298
-rw-------. 1 root root    200 Jun 25 19:38 imfile-state:75682204
-rw-------. 1 root root    200 Jun 25 19:38 imfile-state:235466422
-rw-------. 1 root root    201 Jun 25 19:41 imfile-state:243486923
-rw-------. 1 root root    201 Jun 25 19:41 imfile-state:109232690
-rw-------. 1 root root    201 Jun 25 19:41 imfile-state:25253815
-rw-------. 1 root root    200 Jun 25 19:41 imfile-state:184845295
-rw-------. 1 root root    200 Jun 25 19:41 imfile-state:75682201
-rw-------. 1 root root    200 Jun 25 19:41 imfile-state:235466416
-rw-r--r--. 1 root root    123 Jun 25 20:06 imjournal.state
-rw-------. 1 root root    578 Jun 25 20:06 es-app.qi
drwxr-xr-x. 2 root root   4096 Jun 25 20:06 ./
-rw-------. 1 root root 872631 Jun 25 20:06 es-app.00000268
(In reply to Noriko Hosoi from comment #7)
> Also could you share output from these command lines?
> $ oc describe ds rsyslog
> $ oc get events | grep -i warning

[ematysek@jump ~]$ oc describe ds rsyslog
Name:           rsyslog
Selector:       component=rsyslog,logging-infra=rsyslog,provider=openshift
Node-Selector:  <none>
Labels:         component=rsyslog
                logging-infra=rsyslog
                provider=openshift
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 5
Current Number of Nodes Scheduled: 5
Number of Nodes Scheduled with Up-to-date Pods: 5
Number of Nodes Scheduled with Available Pods: 5
Number of Nodes Misscheduled: 0
Pods Status:  5 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           component=rsyslog
                    logging-infra=rsyslog
                    provider=openshift
  Annotations:      scheduler.alpha.kubernetes.io/critical-pod:
  Service Account:  logcollector
  Containers:
   rsyslog:
    Image:      quay.io/openshift/origin-logging-rsyslog:latest
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
    Args:
      /opt/app-root/bin/rsyslog.sh
    Limits:
      memory:  1Gi
    Environment:
      MERGE_JSON_LOG:         false
      K8S_HOST_URL:           https://kubernetes.default.svc
      ES_HOST:                elasticsearch
      ES_PORT:                9200
      ES_CLIENT_CERT:         /etc/rsyslog/keys/app-cert
      ES_CLIENT_KEY:          /etc/rsyslog/keys/app-key
      ES_CA:                  /etc/rsyslog/keys/app-ca
      OPS_HOST:               elasticsearch
      OPS_PORT:               9200
      OPS_CLIENT_CERT:        /etc/rsyslog/keys/infra-cert
      OPS_CLIENT_KEY:         /etc/rsyslog/keys/infra-key
      OPS_CA:                 /etc/rsyslog/keys/infra-ca
      BUFFER_QUEUE_LIMIT:     32
      BUFFER_SIZE_LIMIT:      8m
      FILE_BUFFER_LIMIT:      256Mi
      RSYSLOG_CPU_LIMIT:      node allocatable (limits.cpu)
      RSYSLOG_MEMORY_LIMIT:   1073741824 (limits.memory)
      NODE_IPV4:              (v1:status.hostIP)
    Mounts:
      /etc/docker-hostname from dockerhostname (ro)
      /etc/localtime from localtime (ro)
      /etc/machine-id from machineid (ro)
      /etc/rsyslog.d from config (ro)
      /etc/rsyslog/conf from main (ro)
      /etc/rsyslog/keys from certs (ro)
      /opt/app-root/bin from bin (ro)
      /run/log/journal from runlogjournal (rw)
      /var/lib/docker from varlibdockercontainers (ro)
      /var/lib/rsyslog.pod from filebufferstorage (rw)
      /var/log from varlog (rw)
  Volumes:
   runlogjournal:
    Type:          HostPath (bare host directory volume)
    Path:          /run/log/journal
    HostPathType:
   varlog:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:
   varlibdockercontainers:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/docker
    HostPathType:
   bin:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rsyslog-bin
    Optional:  false
   main:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rsyslog-main
    Optional:  false
   config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rsyslog
    Optional:  false
   certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rsyslog
    Optional:    false
   dockerhostname:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/hostname
    HostPathType:
   localtime:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/localtime
    HostPathType:
   machineid:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/machine-id
    HostPathType:
   filebufferstorage:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rsyslog.pod
    HostPathType:
Events:
  Type    Reason            Age  From                  Message
  ----    ------            ---  ----                  -------
  Normal  SuccessfulCreate  50m  daemonset-controller  Created pod: rsyslog-d5zhw
  Normal  SuccessfulCreate  50m  daemonset-controller  Created pod: rsyslog-mbhmv
  Normal  SuccessfulCreate  50m  daemonset-controller  Created pod: rsyslog-2lbgn
  Normal  SuccessfulCreate  50m  daemonset-controller  Created pod: rsyslog-xc9cv
  Normal  SuccessfulCreate  50m  daemonset-controller  Created pod: rsyslog-tx97q
[ematysek@jump ~]$
[ematysek@jump ~]$ oc get events | grep -i warning
51m   Warning  FailedAttachVolume  pod/elasticsearch-cdm-2ojk3phw-1-5c8c5f7586-ts5vx  AttachVolume.Attach failed for volume "pvc-56e2bfa5-977e-11e9-934e-128329b34930" : "Error attaching EBS volume \"vol-0c44d516d73d7b574\"" to instance "i-0a9c2918c261c4e5e" since volume is in "creating" state
49m   Warning  Unhealthy           pod/elasticsearch-cdm-2ojk3phw-1-5c8c5f7586-ts5vx  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]
48m   Warning  Unhealthy           pod/elasticsearch-cdm-2ojk3phw-1-5c8c5f7586-ts5vx  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 503]
50m   Warning  FailedAttachVolume  pod/elasticsearch-cdm-2ojk3phw-2-7cdc4879f-rg6n6   AttachVolume.Attach failed for volume "pvc-5700eafc-977e-11e9-934e-128329b34930" : "Error attaching EBS volume \"vol-0e08d10df1de14b8c\"" to instance "i-081a6b4a7bfcd4bd9" since volume is in "creating" state
49m   Warning  Unhealthy           pod/elasticsearch-cdm-2ojk3phw-2-7cdc4879f-rg6n6   Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]
48m   Warning  Unhealthy           pod/elasticsearch-cdm-2ojk3phw-2-7cdc4879f-rg6n6   Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 503]
49m   Warning  FailedAttachVolume  pod/elasticsearch-cdm-2ojk3phw-3-7d495899b9-wg4fb  AttachVolume.Attach failed for volume "pvc-571f70cc-977e-11e9-934e-128329b34930" : "Error attaching EBS volume \"vol-07508359dc15a71be\"" to instance "i-0a9c2918c261c4e5e" since volume is in "creating" state
48m   Warning  Unhealthy           pod/elasticsearch-cdm-2ojk3phw-3-7d495899b9-wg4fb  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]
47m   Warning  Unhealthy           pod/elasticsearch-cdm-2ojk3phw-3-7d495899b9-wg4fb  Readiness probe failed: cat: /opt/app-root/src/init_failures: No such file or directory
122m  Warning  FailedAttachVolume  pod/elasticsearch-cdm-gv0bryoh-1-657ff768f8-mnjvf  AttachVolume.Attach failed for volume "pvc-531d34a5-9774-11e9-934e-128329b34930" : "Error attaching EBS volume \"vol-00dbaa1d5bbe58283\"" to instance "i-0a9c2918c261c4e5e" since volume is in "creating" state
121m  Warning  Unhealthy           pod/elasticsearch-cdm-gv0bryoh-1-657ff768f8-mnjvf  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]
120m  Warning  Unhealthy           pod/elasticsearch-cdm-gv0bryoh-1-657ff768f8-mnjvf  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 503]
52m   Warning  Unhealthy           pod/elasticsearch-cdm-gv0bryoh-1-657ff768f8-mnjvf  Readiness probe errored: rpc error: code = Unknown desc = container is not created or running
122m  Warning  FailedAttachVolume  pod/elasticsearch-cdm-gv0bryoh-2-65b854478-vng6v   AttachVolume.Attach failed for volume "pvc-533b4e22-9774-11e9-934e-128329b34930" : "Error attaching EBS volume \"vol-0bd152cc1e009bfbc\"" to instance "i-081a6b4a7bfcd4bd9" since volume is in "creating" state
121m  Warning  Unhealthy           pod/elasticsearch-cdm-gv0bryoh-2-65b854478-vng6v   Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]
120m  Warning  Unhealthy           pod/elasticsearch-cdm-gv0bryoh-2-65b854478-vng6v   Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 503]
119m  Warning  Unhealthy           pod/elasticsearch-cdm-gv0bryoh-2-65b854478-vng6v   Readiness probe failed: cat: /opt/app-root/src/init_failures: No such file or directory
52m   Warning  Unhealthy           pod/elasticsearch-cdm-gv0bryoh-2-65b854478-vng6v   Readiness probe errored: rpc error: code = Unknown desc = container is not created or running
121m  Warning  FailedAttachVolume  pod/elasticsearch-cdm-gv0bryoh-3-7c6bccc6bd-g2gkj  AttachVolume.Attach failed for volume "pvc-5359cc87-9774-11e9-934e-128329b34930" : "Error attaching EBS volume \"vol-0a120cf2ef2311fc8\"" to instance "i-0a9c2918c261c4e5e" since volume is in "creating" state
120m  Warning  Unhealthy           pod/elasticsearch-cdm-gv0bryoh-3-7c6bccc6bd-g2gkj  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]
119m  Warning  Unhealthy           pod/elasticsearch-cdm-gv0bryoh-3-7c6bccc6bd-g2gkj  Readiness probe failed: cat: /opt/app-root/src/init_failures: No such file or directory
52m   Warning  Unhealthy           pod/elasticsearch-cdm-gv0bryoh-3-7c6bccc6bd-g2gkj  Readiness probe errored: rpc error: code = Unknown desc = container is not created or running
51m   Warning  Unhealthy           pod/kibana-579964d87f-4q7q2                        Readiness probe errored: rpc error: code = Unknown desc = container is not created or running
[ematysek@jump ~]$
> build 4.2.0-0.nightly-2019-06-24-112621

I'm wondering what is the source of logging/rsyslog to build this version...

I expect to see the logrotate container in the rsyslog daemonset like this.

$ oc describe ds rsyslog | egrep logrotate
   logrotate:
      /etc/cron.d from logrotate-crontab (ro)
      /opt/app-root/bin from logrotate-bin (ro)
   logrotate-bin:
    Name:      logrotate-bin
   logrotate-crontab:
    Name:      logrotate-crontab
(In reply to Noriko Hosoi from comment #10)
> > build 4.2.0-0.nightly-2019-06-24-112621
>
> I'm wondering what is the source of logging/rsyslog to build this version...

Is there something additional that I need to specify in the clusterlogging CRD spec to enable logrotate?
(In reply to Eric Matysek from comment #11)
> Is there something additional that I need to specify in the clusterlogging
> CRD spec to enable logrotate?

No, it's supposed to be available OOTB.
https://github.com/openshift/cluster-logging-operator/blob/master/pkg/k8shandler/rsyslog.go

If you run "oc get cm", what do you get? Here's mine.

$ oc get cm
NAME                            DATA   AGE
cluster-logging-operator-lock   0      2m59s
curator                         3      2m53s
elasticsearch                   3      2m54s
elasticsearch-operator-lock     0      3m4s
logrotate-bin                   3      17s
logrotate-crontab               1      17s
rsyslog                         22     17s
rsyslog-bin                     1      17s
rsyslog-main                    1      17s
sharing-config                  2      2m54s
This testing is on 4.2, but the logging images are coming from public OperatorHub (clusterlogging is v4.1.2).

rsyslog.x86_64               8.37.0-9.el7   installed
rsyslog-mmkubernetes.x86_64  8.37.0-9.el7   installed

root@ip-172-31-64-58: ~ # oc describe ds rsyslog | egrep logrotate
root@ip-172-31-64-58: ~ # oc describe ds rsyslog | grep Image
    Image:      quay.io/openshift/origin-logging-rsyslog:latest

Is there a different rsyslog image than quay.io/openshift/origin-logging-rsyslog:latest? Should we be running a different upstream version of clusterlogging?

root@ip-172-31-64-58: ~ # oc get cm
NAME             DATA   AGE
curator          3      5m18s
elasticsearch    3      5m22s
rsyslog          15     5m14s
rsyslog-bin      1      5m14s
rsyslog-main     1      5m14s
sharing-config   2      5m21s
DO NOT USE quay.io/openshift/origin-logging-* images. Logging images from CI are not published there anymore.

The dates are very misleading. For example, https://quay.io/repository/openshift/origin-logging-rsyslog?tag=latest&tab=tags says "latest" was "modified" 2 hours ago. I have no idea what "modified" means in this context. If you look at the actual image:

>docker run -it quay.io/openshift/origin-logging-rsyslog:latest bash
# rpm -qi rsyslog
Install Date: Wed May 15 04:51:34 2019

Instead, you should be using the images built in the ci registry in registry.svc.ci.openshift.org/origin/4.2

>docker run -it registry.svc.ci.openshift.org/origin/4.2:logging-rsyslog bash
# rpm -qi rsyslog
Install Date: Wed Jun 26 21:54:01 2019
Build Date  : Wed Jun 26 21:53:19 2019

Same with the operator images. Use registry.svc.ci.openshift.org/origin/4.2:elasticsearch-operator and cluster-logging-operator. Those are the real 4.2 candidate images.

Logging CI uses registry.svc.ci.openshift.org/origin/4.2:*
(sorry, i mistakenly changed the bz status...)
logging CI uses https://github.com/openshift/origin-aggregated-logging/blob/master/hack/deploy-logging.sh for CI testing in api.ci

We are currently using the deploy_logging_using_clo_make method (the default).

This is how the script is run from logging CI or manual testing, by the hack/get-cluster-run-tests.sh script:
https://github.com/openshift/origin-aggregated-logging/blob/master/hack/get-cluster-run-tests.sh#L162

USE_EO_LATEST_IMAGE=true and USE_CLO_LATEST_IMAGE=true tell the script to use the latest logging and logging operator images in registry.svc.ci.openshift.org/origin/4.2

You can override all of these using EXTERNAL_REGISTRY, EXT_REG_IMAGE_NS, and MASTER_VERSION.

For 4.1, you can use USE_OLM=true to do a deployment using OLM (catalogs, subscriptions, etc.)
I got logging deployed using comment 17. The operators come from registry.svc.ci.openshift.org

registry.svc.ci.openshift.org/origin/4.2:cluster-logging-operator
registry.svc.ci.openshift.org/origin/4.2:elasticsearch-operator

But the actual logging images are coming from quay.io. I'll look at how to override.

quay.io/openshift/origin-logging-rsyslog:latest
quay.io/openshift/origin-logging-kibana5:latest
quay.io/openshift/origin-logging-elasticsearch5:latest
(In reply to Mike Fiedler from comment #18)
> I got logging deployed using comment 17. The operators come from
> registry.svc.ci.openshift.org
>
> registry.svc.ci.openshift.org/origin/4.2:cluster-logging-operator
> registry.svc.ci.openshift.org/origin/4.2:elasticsearch-operator
>
> But the actual logging images are coming from quay.io. I'll look at how to
> override.

Try USE_CUSTOM_IMAGES=true - that should cause the logging deploy script to use the latest 4.2 tagged images in registry.svc.ci.openshift.org

> quay.io/openshift/origin-logging-rsyslog:latest
> quay.io/openshift/origin-logging-kibana5:latest
> quay.io/openshift/origin-logging-elasticsearch5:latest
This does the trick:

USE_CUSTOM_IMAGES=true MASTER_VERSION=4.2 EXT_REG_IMAGE_NS=origin EXTERNAL_REGISTRY=registry.svc.ci.openshift.org hack/deploy-logging.sh
I am still seeing large portions of logs missing from elasticsearch using the logging deploy script to pull the correct 4.2 images.

$ oc describe ds/rsyslog | grep -C 1 -i image
   rsyslog:
    Image:      registry.svc.ci.openshift.org/origin/4.2:logging-rsyslog
    Port:       24231/TCP
--
   logrotate:
    Image:      registry.svc.ci.openshift.org/origin/4.2:logging-rsyslog
    Port:       <none>

These elasticsearch indices should all have 60,000 documents:

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open project.logtest-rsyslog-3-0.ba169bc7-a25e-11e9-badd-0ac45d2c0912.2019.07.09 F0XFOFtySA2lOR0A8Sd4Rg 3 1 60000 0 128.9mb 64.5mb
green open project.logtest-rsyslog-3-1.ba16698e-a25e-11e9-badd-0ac45d2c0912.2019.07.09 aQu6Ocl8RPWwqQUHuawfjQ 3 1 50000 0 104.8mb 52.3mb
green open project.logtest-rsyslog-3-2.ba254ee3-a25e-11e9-984f-0e3fa727759a.2019.07.09 jhRp-_SnR6up5bErftUC3w 3 1 50000 0 106.1mb 52.9mb
green open project.logtest-rsyslog-3-3.ba1196ab-a25e-11e9-984f-0e3fa727759a.2019.07.09 6CLFlpVOQcChUCip8j5vOw 3 1 50000 0 108.8mb 54.2mb
green open project.logtest-rsyslog-3-4.ba111abd-a25e-11e9-984f-0e3fa727759a.2019.07.09 7ROY7LdxTXi40nsuu2u1Eg 3 1 50200 0 105.1mb 52.9mb

The indices with missing documents are missing big chunks of sequential messages, e.g.:

index: project.logtest-rsyslog-3-2.ba254ee3-a25e-11e9-984f-0e3fa727759a.2019.07.09
Missing log lines: 20151-30150
Number of missing logs: 10000
16.6667% message loss rate

Observed no restarts for rsyslog pods. I don't see any indication of an issue in rsyslog pod logs.

The following test parameters consistently show missing logs:
# of projects: 5
lines per project: 60,000
rate: 6,000 logs/m/project

The fluentd log collector has no issues with the same test (it also produces a smaller store.size, I guess less metadata):

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open project.logtest-fluentd-3-0.6e6b89b5-a263-11e9-badd-0ac45d2c0912.2019.07.09 ksVZR5jLS5-RsaYWceQnOQ 3 1 60000 0 86.3mb 43mb
green open project.logtest-fluentd-3-1.6e68aacd-a263-11e9-badd-0ac45d2c0912.2019.07.09 P80969ysQnet8V-7PPE9OQ 3 1 60000 0 85.8mb 42.8mb
green open project.logtest-fluentd-3-2.6e736e91-a263-11e9-984f-0e3fa727759a.2019.07.09 -4OhBXadTZuAM9dGl14gzg 3 1 60000 0 86.3mb 43.3mb
green open project.logtest-fluentd-3-3.6e6a3bd0-a263-11e9-984f-0e3fa727759a.2019.07.09 d_b_XoOiRECX_NHCwTZl3A 3 1 60000 0 85.8mb 42.8mb
green open project.logtest-fluentd-3-4.6e6a36fb-a263-11e9-984f-0e3fa727759a.2019.07.09 w15HjQHkQXmzzGwEMs00mQ 3 1 60000 0 86mb 43mb
Please include the metrics from rsyslog and the logs, e.g. something like this:

for pod in $( oc get pods -l component=rsyslog -o jsonpath='{.items[*].metadata.name}' ) ; do
  oc exec -c rsyslog $pod -- curl -sk https://localhost:24231/metrics > $pod.metrics
  oc exec -c rsyslog $pod -- logs > $pod.log
done

Please include the other indices - perhaps the logs are going to .operations.* or .orphaned.* or some other indices:

oc exec -c elasticsearch $espod -- es_util --query=_cat/indices?v
Created attachment 1588820 [details] rsyslog metrics and logs
All indices:

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open project.logtest-fluentd-3-0.6e6b89b5-a263-11e9-badd-0ac45d2c0912.2019.07.09 ksVZR5jLS5-RsaYWceQnOQ 3 1 60450 0 87.5mb 43.6mb
green open project.logtest-fluentd-3-1.6e68aacd-a263-11e9-badd-0ac45d2c0912.2019.07.09 P80969ysQnet8V-7PPE9OQ 3 1 60500 0 87.7mb 43.7mb
green open project.logtest-fluentd-3-2.6e736e91-a263-11e9-984f-0e3fa727759a.2019.07.09 -4OhBXadTZuAM9dGl14gzg 3 1 60500 0 87.7mb 43.8mb
green open project.logtest-fluentd-3-3.6e6a3bd0-a263-11e9-984f-0e3fa727759a.2019.07.09 d_b_XoOiRECX_NHCwTZl3A 3 1 60600 0 87.5mb 43.7mb
green open project.logtest-fluentd-3-4.6e6a36fb-a263-11e9-984f-0e3fa727759a.2019.07.09 w15HjQHkQXmzzGwEMs00mQ 3 1 60300 0 87.3mb 43.6mb
green open project.logtest-rsyslog-3-0.ba169bc7-a25e-11e9-badd-0ac45d2c0912.2019.07.09 F0XFOFtySA2lOR0A8Sd4Rg 3 1 60000 0 128.9mb 64.5mb
green open project.logtest-rsyslog-3-1.ba16698e-a25e-11e9-badd-0ac45d2c0912.2019.07.09 aQu6Ocl8RPWwqQUHuawfjQ 3 1 50000 0 104.8mb 52.3mb
green open project.logtest-rsyslog-3-2.ba254ee3-a25e-11e9-984f-0e3fa727759a.2019.07.09 jhRp-_SnR6up5bErftUC3w 3 1 50000 0 106.1mb 52.9mb
green open project.logtest-rsyslog-3-3.ba1196ab-a25e-11e9-984f-0e3fa727759a.2019.07.09 6CLFlpVOQcChUCip8j5vOw 3 1 50000 0 108.8mb 54.2mb
green open project.logtest-rsyslog-3-4.ba111abd-a25e-11e9-984f-0e3fa727759a.2019.07.09 7ROY7LdxTXi40nsuu2u1Eg 3 1 50200 0 105.1mb 52.9mb
green open project.logtest-rsyslog-5-0.c50d1a5e-a267-11e9-badd-0ac45d2c0912.2019.07.09 HfgyfipgQIWGm95EWVHxqQ 3 1 30000 0 65.3mb 32.6mb
green open project.logtest-rsyslog-5-1.c516f449-a267-11e9-badd-0ac45d2c0912.2019.07.09 epCY9HOiQs2-kGu5v496Fw 3 1 30000 0 65.2mb 32.6mb
green open project.logtest-rsyslog-5-2.c515e15a-a267-11e9-badd-0ac45d2c0912.2019.07.09 wcwNktiHQe6AFEr9yGp97w 3 1 30000 0 65.5mb 32.7mb
green open project.logtest-rsyslog-5-3.c51580f0-a267-11e9-984f-0e3fa727759a.2019.07.09 hrSjnGU7SrGrDSs4PXJVbA 3 1 30000 0 65.9mb 32.9mb
green open project.logtest-rsyslog-5-4.c5099874-a267-11e9-984f-0e3fa727759a.2019.07.09 KxXZ9j0aS_yRrJT3s6DnvA 3 1 30000 0 65.2mb 32.5mb
green open project.logtest-rsyslog-5-5.c519a1e2-a267-11e9-badd-0ac45d2c0912.2019.07.09 A7TUHs2qTh6NRg_Kkl-jmg 3 1 30000 0 65.7mb 32.8mb
green open project.logtest-rsyslog-5-6.c512ab96-a267-11e9-badd-0ac45d2c0912.2019.07.09 jt8mGc7_TfaKkewxCX7wZw 3 1 30000 0 65.4mb 32.7mb
green open project.logtest-rsyslog-5-7.c517af54-a267-11e9-984f-0e3fa727759a.2019.07.09 K7WoETsnTeeBX8iREqrbkQ 3 1 30000 0 64.8mb 32.3mb
green open project.logtest-rsyslog-5-8.c521b7a6-a267-11e9-984f-0e3fa727759a.2019.07.09 0lJcmrs2Q4mlMTP4Fr_Zmg 3 1 30000 0 65mb 32.5mb
green open project.logtest-rsyslog-5-9.c5177cc0-a267-11e9-984f-0e3fa727759a.2019.07.09 H1Xa7udrQnmRaU8VD1Xr1g 3 1 30000 0 65.5mb 32.5mb
green open .kibana x3ffge2bQNaSr1vrR7qwnQ 1 1 5 0 112.2kb 54.8kb
green open .kibana.647a750f1787408bf50088234ec0edd5a6a9b2ac al6wnGjnSuSuW3QR4qz30A 1 1 2 1 103.4kb 51.7kb
green open .operations.2019.07.09 4z82jh_vQ_GVG4c7xKrSvw 3 1 711396 0 2.2gb 1gb
green open .operations... yMFATbstRimOr_AQR-XjLg 3 1 60 0 187.8kb 72.8kb
green open .searchguard KwgVp98NSGilYY2qhV5UOA 1 1 0 0 149.8kb 74.9kb
This query returns 0 hits, so I don't think the logs are mistakenly going to .operations*:

/.operations.2019.07.09/_search?q=message:SVTLogger
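To also rule out the missing records landing in some other index entirely, the same q= style query can be run without an index name so it searches everything; a rough sketch (20151 is just one of the missing line numbers from the earlier comment, used for illustration):

oc exec -c elasticsearch $espod -- es_util --query='_count?q=message:20151'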
There must be a clue in the fact that there are exactly 10000 records missing . . .

Also

green open .operations... yMFATbstRimOr_AQR-XjLg 3 1 60 0 187.8kb 72.8kb

This is bad - it means we're not correctly parsing the date for some of the operations logs - please file a bug for this.

When did the test start? I see this in the rsyslog logs:

Jul 9 15:27:03 rsyslog-lzmp7 rsyslogd: [origin software="rsyslogd" swVersion="8.37.0-9.el7" x-pid="15" x-info="http://www.rsyslog.com"] start
Jul 9 15:29:12 rsyslog-lzmp7 rsyslogd-2007: omelasticsearch: we are suspending ourselfs due to server failure 7: Failed connect to elasticsearch:9200; Connection timed out [v8.37.0-9.el7 try http://www.rsyslog.com/e/2007 ]
Jul 9 15:29:12 rsyslog-lzmp7 rsyslogd-2007: action 'elasticsearch' suspended (module 'omelasticsearch'), retry 0. There should be messages before this one giving the reason for suspension. [v8.37.0-9.el7 try http://www.rsyslog.com/e/2007 ]
Jul 9 15:29:17 rsyslog-lzmp7 rsyslogd-2007: omelasticsearch: checkConn failed after 1 attempts. [v8.37.0-9.el7 try http://www.rsyslog.com/e/2007 ]
Jul 9 15:29:47 rsyslog-lzmp7 rsyslogd-2359: action 'elasticsearch' resumed (module 'omelasticsearch') [v8.37.0-9.el7 try http://www.rsyslog.com/e/2359 ]

It seems unlikely that exactly 10000 records were dropped during this time, but maybe? How does this happen? Do you redeploy elasticsearch after starting rsyslog?

I also see this:

Jul 9 15:51:30 rsyslog-lzmp7 rsyslogd-3000: omelasticsearch: could not parse JSON result [v8.37.0-9.el7]

Please file a bz for this. I don't think it is related to the missing 10000 records, but it should be investigated.
Also - the rsyslog prometheus metrics aren't very helpful in diagnosing this problem. We need the raw impstats.json.

oc edit cm rsyslog - scroll down to 10-viaq-modules.conf

Uncomment this line:

module(load="impstats" interval="1" format="cee" log.syslog="off" log.file=`echo $RSYSLOG_IMPSTATS_FILE`)

comment out everything that follows:

#module(load="omprog")
#module(
#    load="impstats"
#    interval="10"
#    format="json"
#    resetCounters="off"
#    ruleset="prometheus_stats"
#)
#ruleset(name="prometheus_stats") {
#    action(
#        type="omprog"
#        name="prometheus_stats"
#        binary="/usr/local/bin/rsyslog_exporter --web.listen-address=:24231 --tls.server-crt=/etc/rsyslog/metrics/tls.crt --tls.server-key=/etc/rsyslog/metrics/tls.key"
#    )
#}

That will disable the prometheus exporter and just dump the raw stats to /var/lib/rsyslog.pod/impstats.json

Then, at the end of the run, use

for pod in $( oc get pods -l component=rsyslog -o jsonpath='{.items[*].metadata.name}' ) ; do
  oc exec -c rsyslog $pod -- cat /var/lib/rsyslog.pod/impstats.json > $pod.stats.json
done

I believe QE is going to open another bug about the usefulness of the prometheus exporter metrics.
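Once the run is done, a quick way to see whether any rsyslog queue actually threw messages away is to scan those stats for nonzero discard counters. A rough sketch, assuming the cee-format lines are prefixed with "@cee:" and carry the standard impstats queue fields discarded.full / discarded.nf:

for pod in $( oc get pods -l component=rsyslog -o jsonpath='{.items[*].metadata.name}' ) ; do
  echo "=== $pod ==="
  # strip the @cee: cookie, then show any samples where a queue discarded messages
  oc exec -c rsyslog $pod -- cat /var/lib/rsyslog.pod/impstats.json \
    | sed 's/^@cee: *//' \
    | grep -E '"discarded\.(full|nf)" *: *[1-9]' || echo "no discards reported"
done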
Agreed about the 10,000 records missing clue; from my testing it is almost always chunks of 9000-10000 missing logs, even when I adjust the number of pods or rate of messages.

Ex. from a different test: 10x projects, 4500 logs/min/project

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open project.logtest-rsyslog-6-5.2a19f4d2-a272-11e9-984f-0e3fa727759a.2019.07.09 80qb3n4ZQi2ZS5p5px5thg 3 1 25886 0 56.1mb 27.9mb

Expected max number: 45000
Missing log line(s): 9425-19342 (9917)
Missing log line(s): 28577-37772 (9195)
Number of missing logs: 19114
42.4756% message loss rate

I will file a bz for the .operations... index

The rsyslog-3 test started at 15:32 and ended at 15:42. I believe those logs are generated because rsyslog pods are up and ready much quicker than the elasticsearch pods, which take a few minutes to get to the ready state. There may also be some extra rsyslog pod log weirdness due to me switching the log collector back and forth between rsyslog/fluentd between tests; I can re-run and grab new logs from a fresh cluster if needed.

I will file a bz for the 'could not parse JSON result' log
I was able to confirm that changing the log line size from 1KB to 2KB caused the chunks of missing records to reduce from ~10k to ~5k. This supports Mike's theory that this issue is related to the kubelet wrapping logs at 10MB.

Missing log line(s): 30251-35300 (5049)
Missing log line(s): 40301-45350 (5049)
Missing log line(s): 50351-55400 (5049)

I'll look into increasing this limit and see how the test performs after that.
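Rough arithmetic behind that (assuming the kubelet's default containerLogMaxSize of 10Mi and that each rendered line is a bit over its nominal payload size):

  10 MiB / ~1 KiB per line ≈ 10,000 lines per rotated file
  10 MiB / ~2 KiB per line ≈ 5,000 lines per rotated file

which lines up with the ~10k chunks seen with 1KB messages and the ~5k chunks seen with 2KB messages: each missing chunk is roughly one rotated-away log file.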
(In reply to Eric Matysek from comment #29)
> I was able to confirm that changing the log line size from 1KB to 2KB caused
> the chunks of missing records to reduce from ~10k to ~5k.
> This supports Mike's theory that this issue is related to the kubelet
> wrapping logs at 10MB.
>
> Missing log line(s): 30251-35300 (5049)
> Missing log line(s): 40301-45350 (5049)
> Missing log line(s): 50351-55400 (5049)
>
> I'll look into increasing this limit and see how the test performs after that.

hmm - could be related to the rsyslog endmsg.regex support for cri-o partial records. Can you provide an example of the raw cri-o file log under /var/log/containers/*.log for a record that is not processed by rsyslog into Elasticsearch?
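For reference, the cri-o/CRI log format writes one file line per chunk with a timestamp, the stream, and a P (partial) or F (full) tag, so a line that cri-o had to split looks roughly like this (illustrative content, not taken from this cluster):

2019-07-15T19:28:55.601740695+00:00 stdout P first half of a very long application line...
2019-07-15T19:28:55.601740695+00:00 stdout F ...second half of that same line
2019-07-15T19:28:56.000000000+00:00 stdout F a short line that fits in one chunk

If the endmsg.regex handling were the problem, the records that never reach Elasticsearch would be the ones whose raw file lines carry the P tag.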
While looking into this, one interesting thing I noticed was that all the logtest logs appear to be going through the same rsyslog pod.

[root@rsyslog-zmx4v containers]# pwd
/var/log/containers
[root@rsyslog-zmx4v containers]# ls | grep logtest
centos-logtest-75666_logtest-3-3_centos-logtest-6b0145e5d63a5a91bd469f26c4e55f859d79d5802068c1fa1f878722eb180b0a.log
centos-logtest-9z8s2_logtest-3-0_centos-logtest-9977c254377e503a483c5a7c59a73908e792f3a3cb93ec3f9ee1894994111c8b.log
centos-logtest-dckwr_logtest-3-2_centos-logtest-cad7f493ba3e0dc5ce2918c2ce5bd626a4a5e0cfd17eee1b26bd7a2edb1d1c75.log
centos-logtest-ms9xh_logtest-3-4_centos-logtest-978d2d02791ac5c2924c8a3a9578845f7827f7f1bc1a292ebf21723b623c47d7.log
centos-logtest-rcgjk_logtest-3-1_centos-logtest-f4fb630ab339aa5b7c1dfa0e45868d60fe979e5e69be6c2bff4ead6564d537c3.log

logtest-3-0 results (this cluster rolls logs at 20Mi instead of 10Mi so we expect missing logs in ~20k chunks):

Expected max number: 60000
Missing log line(s): 37601-56600 (18999)
Number of missing logs: 19000
31.6667% message loss rate

Here is one of the missing log lines:

[root@rsyslog-zmx4v centos-logtest]# pwd
/var/log/pods/logtest-3-0_centos-logtest-9z8s2_7ac4e8c7-a735-11e9-888f-0265ad267b82/centos-logtest
[root@rsyslog-zmx4v centos-logtest]# grep 55000 0.log.20190715-192912
2019-07-15T19:28:55.601740695+00:00 stderr F 2019-07-15 19:28:55,601 - SVTLogger - INFO - centos-logtest-9z8s2 : 55000 : DVHtpC9S5 IVjquBCSA o4FkqalUC 4sxs0N3ll kKrUHLQJy PYWZzzeXM x8wvMC8Yn n7dniTU3x m9aEljYB5 AckAxmF9U du8flHeKm F53jeyGgw RxOjXwlCC ag7jMN9wl ubSU5lru5 5VwA8KEch DvMRbX0cR yXVFkZDyU LyzIjn8RY IHbA1CxZO rJPCbesue nCZPYKlFs Mxt34ZCt3 E39iXa7Zy dH5hu14cI A1kLR7Vea 7eX6doYOG 2ODhcoMFl 0FEN9fdri IGT3kKtqT mswtMZohg ZRAzjwQWr dH5cERq9c WkI1WAnAP 8EpTLT0Za pYFdUL1RS sHWBN44rk 6SRCBERHC Y6viOmmlr PBxUjRegt LrvJ0Pksf 8zvvsBjGD bIZzNG2uh AFFJRn3PM RoOgB2B6Z AXKzwTeoU GxpSZhxBv cCsNkTFF7 1wLzDXl3l b3B4qZbmw 1D2peR6Qt WUauJhpei hFAbGx6p8 OuAjrTKkA X0SNRkTJ2 eybqz1Ig7 7CZAx6HGp Hbc9EwPoJ QUqq3Y1xA HhsxTJnox Z20PaeC2G AmOx55gUa tH5iPcYUg BVW1GPKKn I9S6nxUuS jjIWnORgK MQHB9s4O9 ZpVfOsrus KPLonlBrM 4NDg2gUiE luHt1am37 UqmeF4CgX nVQyWswCd clAuQLqlW s7hgrhj2j ThhQWpOXl hZWJWh7DS vdeESPPdW Ue52nsUkN pMmKdIjlU qpRr4RKZx D0W5nWoOy rfvPUaqTC 4U2iiqd8a wDMNZOdqt lmWVvaPdf H3ZbYcxIV fLd1sRsIK JECotRN3s 8kkZB7wVR L01X1YNnk IIlZUGrMV DqSrRGQJR EC7a5wkyV roY1J3tyj vEMAKWP5v oyAORvyLS 3LIvzcFig kHHbXyW1P N1ru7sh5a 4NtQ0Va1M mrjOyCv3G FbY5

All the missing log lines from this run are in the same file:
/var/log/pods/logtest-3-0_centos-logtest-9z8s2_7ac4e8c7-a735-11e9-888f-0265ad267b82/centos-logtest/0.log.20190715-192912
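A quick sanity check that the missing lines were actually written to disk (i.e. lost between the file and Elasticsearch, not dropped by the application) is to count them directly on the node; a sketch using the pod name and paths from the listing above:

oc exec -c rsyslog rsyslog-zmx4v -- sh -c 'wc -l /var/log/pods/*_centos-logtest-9z8s2_*/centos-logtest/0.log*'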
Created attachment 1590873 [details] Missing logs
> While looking into this, one interesting thing I noticed was that all the logtest logs appear to be going through the same rsyslog pod.

Do you mean - all of the logtest pods are running on the same node? I guess this is a good stress test for a single rsyslog node, but if that was not your intention, do you need to set some sort of anti-affinity on the logtest pods so that they are all scheduled on different nodes, spread out as much as possible?
(In reply to Rich Megginson from comment #33)
> Do you mean - all of the logtest pods are running on the same node? I guess
> this is a good stress test for a single rsyslog node, but if that was not
> your intention, do you need to set some sort of anti-affinity on the logtest
> pods so that they are all scheduled on different nodes, spread out as much
> as possible?

I mean I have 5 logtest projects and there are 5 rsyslog pods, but all 5 logtest logs are on the same rsyslog pod. There were no logs in the /var/log/containers/ directory for logtest pods on any of the other 4 rsyslog pods.

[root@rsyslog-xbppx /]# ls /var/log/containers/cent*
ls: cannot access /var/log/containers/cent*: No such file or directory

Should these be getting load balanced across all 5 rsyslog pods?
> I mean I have 5 logtest projects and there are 5 rsyslog pods, but all 5 logtest logs are on the same rsyslog pod.

There is 1 rsyslog pod per node. rsyslog is a daemonset, which means usually there is one and only one pod per node. If you do `oc get pods -o wide` you can see which nodes your pods are running on.

> Should these be getting load balanced across all 5 rsyslog pods?

I don't know, it depends on what you are trying to test. If all of your logtest pods are running on the same node, and this is not what you want, you'll need to ask Mike how to run logtest so that it is spread out evenly among all of the nodes. Or investigate pod anti-affinity, taints, and tolerations.
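One way to see the mapping directly (a sketch building on the commands already used in this bug) is to walk the rsyslog pods, print the node each one runs on, and list any logtest container logs visible to it:

for p in $( oc get pods -l component=rsyslog -o jsonpath='{.items[*].metadata.name}' ) ; do
  node=$( oc get pod $p -o jsonpath='{.spec.nodeName}' )
  echo "=== $p on $node ==="
  oc exec -c rsyslog $p -- ls /var/log/containers/ | grep logtest || echo "no logtest logs on this node"
done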
> There is 1 rsyslog pod per node. rsyslog is a daemonset, which means usually there is one and only one pod per node.

Ah, I see, thank you for the clarification.
I think it's safe to say the issue is related to the logs getting rotated, since following the steps here (https://github.com/mffiedler/ocp-svt/blob/master/svt-notes/logging/config_kubelet_larger_pod_logs.md) changed the size of the missing chunks. Is it possible rsyslog is falling behind and the logs are getting gzipped before rsyslog is able to push them to elasticsearch?
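A quick way to check for that (a sketch; imfile only follows the plain-text files, so anything already compressed under the pod log directories has rotated out from under it - paths follow the earlier examples in this bug):

for p in $( oc get pods -l component=rsyslog -o jsonpath='{.items[*].metadata.name}' ) ; do
  echo "=== $p ==="
  # rotated-and-compressed pod logs on this node
  oc exec -c rsyslog $p -- find /var/log/pods -name '*.gz' 2>/dev/null
  # rotated copies that are still uncompressed
  oc exec -c rsyslog $p -- find /var/log/pods -name '0.log.*' -not -name '*.gz' 2>/dev/null
done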
(In reply to Eric Matysek from comment #37)
> I think it's safe to say the issue is related to the logs getting rotated,
> since following the steps here
> (https://github.com/mffiedler/ocp-svt/blob/master/svt-notes/logging/config_kubelet_larger_pod_logs.md)
> changed the size of the missing chunks.
> Is it possible rsyslog is falling behind and the logs are getting gzipped
> before rsyslog is able to push them to elasticsearch?

Yes, that is possible, but very disappointing, as it means that rsyslog is not performing as well as fluentd :-( (assuming everything else is the same, only using rsyslog instead of fluentd)

We'll need to gather some monitoring information from rsyslog.

Change rsyslog to produce impstats.json as in https://bugzilla.redhat.com/show_bug.cgi?id=1722944#c27

Also, every minute do `oc exec -c rsyslog $rsyslogpod -- ls -alrtF /var/lib/rsyslog.pod` so we can see what is happening with the queue files and the position files.
Performed a fresh test today: 5 logtest projects each producing 60k logs. Only 50 logs were missing from the whole test, so let me know if it would be better to load it more to get more missing records.

index: project.logtest-3-1.c6b276d1-a809-11e9-b8a7-0ad653da7944.2019.07.16
Expected max number: 60000
Missing log line(s): 9301-9350 (50)
Number of missing logs: 50
0.0833% message loss rate

rsyslog pod responsible: rsyslog-gbqw9
Created attachment 1591207 [details] impstats and rsyslog.pod output
Performance did improve with PR220 but fluentd is still more performant than rsyslog by quite a bit. Because this bug has so much history at this point I'm going to mark this verified and open a new bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922