Created attachment 1583376 [details]
rsyslog pod logs

Description of problem:
When testing the rsyslog log collector I am not seeing the correct number of documents in the resulting elasticsearch index. With 10 pods producing 30k logs at 50 per second I have generally observed 0-50 logs missing per index. The problem only gets worse as the rate gets higher.
-With 20 pods producing 30k logs at 50 per second I can see some indices missing thousands of logs.
-With 10 pods producing 30k logs at 100 per second only 4 out of 10 of my elasticsearch indices were created at all (although only 1 missing log out of those)
I have not seen the same issues when testing with the fluentd log collector.
Originally I thought this may be due to rsyslog pods going OOM as I was seeing rsyslog pod restarts, but after increasing the memory limit to 1Gi the issue still stands.

Version-Release number of selected component (if applicable):

How reproducible:
Very consistent, although some indices do get the expected count while others may only have a small fraction of the expected docs.

Steps to Reproduce:
1. Create clusterlogging operator with rsyslog log collector
2. Create pods that will generate a given number of logs at a given rate
3. Check elasticsearch index document count

Actual results:
Not all logs showing up in elasticsearch index

Expected results:
All logs show up in elasticsearch index

Additional info:
Also seeing this log line on startup of the cluster logging instance, but I assume once the elasticsearch pods become reachable it is no longer an issue.
rsyslogd: omelasticsearch: we are suspending ourselfs due to server failure 7: Failed connect to elasticsearch:9200; No route to host [v8.37.0-9.el7 try http://www.rsyslog.com/e/2007 ]
See attachment for logs of all rsyslog pods.
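For step 3, a quick way to pull the per-index document counts is to run _cat/indices from inside an elasticsearch pod; a rough sketch (the es_util helper and -c elasticsearch container name are the ones used later in this bug, the label selector is an assumption):

espod=$( oc get pods -l component=elasticsearch -o jsonpath='{.items[0].metadata.name}' )
oc exec -c elasticsearch $espod -- es_util --query=_cat/indices?v | grep project.logtest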
> Also seeing this log line on startup of the cluster logging instance, but I assume once the elasticsearch pods become reachable it is no longer an issue. rsyslogd: omelasticsearch: we are suspending ourselfs due to server failure 7: Failed connect to elasticsearch:9200; No route to host [v8.37.0-9.el7 try http://www.rsyslog.com/e/2007 ]

Right. It should clear up.

How did you generate the rsyslog logs?

In addition to `oc logs -c rsyslog $rpod` you need to do `oc exec -c rsyslog $rpod -- logs`

We should also take a look at the prometheus stats.
https://github.com/openshift/origin-aggregated-logging/blob/master/test/zzz-rsyslog.sh#L238
e.g. something like this:

for ii in $( seq 1 600 ) ; do
  for rpod in $( oc get pods -l component=rsyslog -o jsonpath='{.items[*].metadata.name}' ) ; do
    rpod_ip="$( oc get pod $rpod -o jsonpath='{.status.podIP}' )"
    curl -s -k https://${rpod_ip}:24231/metrics >> /var/tmp/${rpod}-metrics-scrape 2>&1
  done
  sleep 10
done

The metrics can indicate problems that might not be logged by default.
Also, is this with 4.1 or 4.2? The Version field says 4.1. I'm assuming you deployed the latest 4.2 candidate, but just want to make sure.
The version I tested on was 4.1.2

I generated the logs using cluster loader and ocp_logtest (https://github.com/openshift/svt/blob/master/openshift_scalability/content/logtest/ocp_logtest-README.md)

I will work on grabbing the additional rsyslog pod logs and prometheus stats you mentioned.
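For reference, each logtest pod's workload is roughly equivalent to a loop like the one below. This is a standalone sketch only, not the actual ocp_logtest code; RATE and LINES mirror the cluster loader parameters, and the message shape mirrors the SVTLogger lines shown later in this bug.

#!/bin/bash
# emit LINES numbered ~1KB log lines at roughly RATE lines per second
RATE=${RATE:-50}
LINES=${LINES:-30000}
PAD=$( head -c 950 /dev/zero | tr '\0' 'x' )
for i in $( seq 1 $LINES ) ; do
    echo "$( date -u '+%Y-%m-%d %H:%M:%S' ) - SVTLogger - INFO - $( hostname ) : $i : $PAD"
    sleep $( awk -v r=$RATE 'BEGIN { printf "%f", 1/r }' )
done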
(In reply to Eric Matysek from comment #3)
> The version I tested on was 4.1.2

rsyslog wasn't fully supported in 4.1. Please try the latest 4.2 candidate to see if you can reproduce with the latest supported rsyslog.

> I generated the logs using cluster loader and ocp_logtest
> (https://github.com/openshift/svt/blob/master/openshift_scalability/content/logtest/ocp_logtest-README.md)
>
> I will work on grabbing the additional rsyslog pod logs and prometheus stats
> you mentioned.
Verified I am still seeing the issue on build 4.2.0-0.nightly-2019-06-24-112621

I expected 30k documents in these indices:

green open project.logtest-rsyslog-4-0.1e50a0c8-96af-11e9-9f33-0e27288c199a.2019.06.24 onuVsHOhS1CrH1MfbFww3g 3 1 29996 0 65.6mb 32.8mb
green open project.logtest-rsyslog-4-1.1e46eb2a-96af-11e9-bd21-12e0af33f230.2019.06.24 LlU8OxzLSBCnk1HTz5e36g 3 1 29984 0 65.3mb 32.3mb
green open project.logtest-rsyslog-4-2.1e5250fc-96af-11e9-bd21-12e0af33f230.2019.06.24 g6tycm1sTFOIBKpCldoVsQ 3 1 29994 0 64.1mb 32mb
green open project.logtest-rsyslog-4-3.1e5361d7-96af-11e9-bd21-12e0af33f230.2019.06.24 GQyZjzW2ShOMN42mz728Tg 3 1 29986 0 64.7mb 32.1mb
green open project.logtest-rsyslog-4-4.1e57d6d6-96af-11e9-9f33-0e27288c199a.2019.06.24 QstnJWsASbeVwr29CPs1mA 3 1 29991 0 65.4mb 32.8mb
...omitting 6-19...

Trying to grab logs as you suggested gave me this for all of my rsyslog pods:

[ematysek@jump ~]$ oc exec -c rsyslog rsyslog-58pg8 -- logs
ls: cannot access /var/log/rsyslog/rsyslog*: No such file or directory

Working on grabbing prometheus metrics
ls -alrtF /var/log/rsyslog /var/lib/rsyslog.pod
Also could you share output from these command lines?
$ oc describe ds rsyslog
$ oc get events | grep -i warning
(In reply to Rich Megginson from comment #6)
> ls -alrtF /var/log/rsyslog /var/lib/rsyslog.pod

[root@rsyslog-2lbgn /]# ls -alrtF /var/log/rsyslog /var/lib/rsyslog.pod
/var/log/rsyslog:
total 0
drwxr-xr-x. 12 root root 189 Jun 25 18:09 ../
drwxr-xr-x.  2 root root   6 Jun 25 18:09 ./

/var/lib/rsyslog.pod:
total 1232
-rw-------. 1 root root    216 Jun 25 19:19 imfile-state:75547144
-rw-------. 1 root root    174 Jun 25 19:19 imfile-state:50530172
-rw-------. 1 root root    206 Jun 25 19:19 imfile-state:25715
-rw-------. 1 root root    206 Jun 25 19:19 imfile-state:251867675
-rw-------. 1 root root    221 Jun 25 19:19 imfile-state:251722273
-rw-------. 1 root root    197 Jun 25 19:19 imfile-state:235350489
-rw-------. 1 root root    209 Jun 25 19:19 imfile-state:235181738
-rw-------. 1 root root    191 Jun 25 19:19 imfile-state:176177139
-rw-------. 1 root root    216 Jun 25 19:19 imfile-state:167811834
-rw-------. 1 root root    215 Jun 25 19:19 imfile-state:12159287
-rw-------. 1 root root    196 Jun 25 19:19 imfile-state:12139139
-rw-------. 1 root root    196 Jun 25 19:19 imfile-state:92287048
-rw-------. 1 root root    201 Jun 25 19:19 imfile-state:83905953
-rw-------. 1 root root    192 Jun 25 19:19 imfile-state:50475188
-rw-------. 1 root root    190 Jun 25 19:19 imfile-state:50364301
-rw-------. 1 root root    185 Jun 25 19:19 imfile-state:50333793
-rw-------. 1 root root    189 Jun 25 19:19 imfile-state:33595529
-rw-------. 1 root root    195 Jun 25 19:19 imfile-state:33595524
-rw-------. 1 root root    185 Jun 25 19:19 imfile-state:33595452
-rw-------. 1 root root    176 Jun 25 19:19 imfile-state:33556122
-rw-------. 1 root root    198 Jun 25 19:19 imfile-state:260056133
-rw-------. 1 root root    195 Jun 25 19:19 imfile-state:25213637
-rw-------. 1 root root    182 Jun 25 19:19 imfile-state:234882904
-rw-------. 1 root root    220 Jun 25 19:19 imfile-state:226493924
-rw-------. 1 root root    172 Jun 25 19:19 imfile-state:218109513
-rw-------. 1 root root    197 Jun 25 19:19 imfile-state:192942257
-rw-------. 1 root root    192 Jun 25 19:19 imfile-state:159455953
-rw-------. 1 root root    190 Jun 25 19:19 imfile-state:159384795
-rw-------. 1 root root    184 Jun 25 19:19 imfile-state:159384696
-rw-------. 1 root root    195 Jun 25 19:19 imfile-state:150995985
-rw-------. 1 root root    191 Jun 25 19:19 imfile-state:134240410
-rw-------. 1 root root    166 Jun 25 19:19 imfile-state:125829935
-rw-------. 1 root root    189 Jun 25 19:19 imfile-state:117465705
-rw-------. 1 root root    191 Jun 25 19:19 imfile-state:117441875
-rw-------. 1 root root    174 Jun 25 19:19 imfile-state:92275490
-rw-------. 1 root root    191 Jun 25 19:19 imfile-state:218104515
-rw-------. 1 root root    182 Jun 25 19:19 imfile-state:209715439
drwxr-xr-x. 1 root root     39 Jun 25 19:21 ../
-rw-------. 1 root root    201 Jun 25 19:38 imfile-state:243486924
-rw-------. 1 root root    201 Jun 25 19:38 imfile-state:109232691
-rw-------. 1 root root    201 Jun 25 19:38 imfile-state:25253818
-rw-------. 1 root root    200 Jun 25 19:38 imfile-state:184845298
-rw-------. 1 root root    200 Jun 25 19:38 imfile-state:75682204
-rw-------. 1 root root    200 Jun 25 19:38 imfile-state:235466422
-rw-------. 1 root root    201 Jun 25 19:41 imfile-state:243486923
-rw-------. 1 root root    201 Jun 25 19:41 imfile-state:109232690
-rw-------. 1 root root    201 Jun 25 19:41 imfile-state:25253815
-rw-------. 1 root root    200 Jun 25 19:41 imfile-state:184845295
-rw-------. 1 root root    200 Jun 25 19:41 imfile-state:75682201
-rw-------. 1 root root    200 Jun 25 19:41 imfile-state:235466416
-rw-r--r--. 1 root root    123 Jun 25 20:06 imjournal.state
-rw-------. 1 root root    578 Jun 25 20:06 es-app.qi
drwxr-xr-x. 2 root root   4096 Jun 25 20:06 ./
-rw-------. 1 root root 872631 Jun 25 20:06 es-app.00000268
(In reply to Noriko Hosoi from comment #7)
> Also could you share output from these command lines?
> $ oc describe ds rsyslog
> $ oc get events | grep -i warning

[ematysek@jump ~]$ oc describe ds rsyslog
Name:           rsyslog
Selector:       component=rsyslog,logging-infra=rsyslog,provider=openshift
Node-Selector:  <none>
Labels:         component=rsyslog
                logging-infra=rsyslog
                provider=openshift
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 5
Current Number of Nodes Scheduled: 5
Number of Nodes Scheduled with Up-to-date Pods: 5
Number of Nodes Scheduled with Available Pods: 5
Number of Nodes Misscheduled: 0
Pods Status:  5 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           component=rsyslog
                    logging-infra=rsyslog
                    provider=openshift
  Annotations:      scheduler.alpha.kubernetes.io/critical-pod:
  Service Account:  logcollector
  Containers:
   rsyslog:
    Image:      quay.io/openshift/origin-logging-rsyslog:latest
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
    Args:
      /opt/app-root/bin/rsyslog.sh
    Limits:
      memory:  1Gi
    Environment:
      MERGE_JSON_LOG:         false
      K8S_HOST_URL:           https://kubernetes.default.svc
      ES_HOST:                elasticsearch
      ES_PORT:                9200
      ES_CLIENT_CERT:         /etc/rsyslog/keys/app-cert
      ES_CLIENT_KEY:          /etc/rsyslog/keys/app-key
      ES_CA:                  /etc/rsyslog/keys/app-ca
      OPS_HOST:               elasticsearch
      OPS_PORT:               9200
      OPS_CLIENT_CERT:        /etc/rsyslog/keys/infra-cert
      OPS_CLIENT_KEY:         /etc/rsyslog/keys/infra-key
      OPS_CA:                 /etc/rsyslog/keys/infra-ca
      BUFFER_QUEUE_LIMIT:     32
      BUFFER_SIZE_LIMIT:      8m
      FILE_BUFFER_LIMIT:      256Mi
      RSYSLOG_CPU_LIMIT:      node allocatable (limits.cpu)
      RSYSLOG_MEMORY_LIMIT:   1073741824 (limits.memory)
      NODE_IPV4:              (v1:status.hostIP)
    Mounts:
      /etc/docker-hostname from dockerhostname (ro)
      /etc/localtime from localtime (ro)
      /etc/machine-id from machineid (ro)
      /etc/rsyslog.d from config (ro)
      /etc/rsyslog/conf from main (ro)
      /etc/rsyslog/keys from certs (ro)
      /opt/app-root/bin from bin (ro)
      /run/log/journal from runlogjournal (rw)
      /var/lib/docker from varlibdockercontainers (ro)
      /var/lib/rsyslog.pod from filebufferstorage (rw)
      /var/log from varlog (rw)
  Volumes:
   runlogjournal:
    Type:          HostPath (bare host directory volume)
    Path:          /run/log/journal
    HostPathType:
   varlog:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:
   varlibdockercontainers:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/docker
    HostPathType:
   bin:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rsyslog-bin
    Optional:  false
   main:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rsyslog-main
    Optional:  false
   config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rsyslog
    Optional:  false
   certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rsyslog
    Optional:    false
   dockerhostname:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/hostname
    HostPathType:
   localtime:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/localtime
    HostPathType:
   machineid:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/machine-id
    HostPathType:
   filebufferstorage:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rsyslog.pod
    HostPathType:
Events:
  Type    Reason            Age  From                  Message
  ----    ------            ---  ----                  -------
  Normal  SuccessfulCreate  50m  daemonset-controller  Created pod: rsyslog-d5zhw
  Normal  SuccessfulCreate  50m  daemonset-controller  Created pod: rsyslog-mbhmv
  Normal  SuccessfulCreate  50m  daemonset-controller  Created pod: rsyslog-2lbgn
  Normal  SuccessfulCreate  50m  daemonset-controller  Created pod: rsyslog-xc9cv
  Normal  SuccessfulCreate  50m  daemonset-controller  Created pod: rsyslog-tx97q
[ematysek@jump ~]$
[ematysek@jump ~]$ oc get events | grep -i warning
51m   Warning  FailedAttachVolume  pod/elasticsearch-cdm-2ojk3phw-1-5c8c5f7586-ts5vx  AttachVolume.Attach failed for volume "pvc-56e2bfa5-977e-11e9-934e-128329b34930" : "Error attaching EBS volume \"vol-0c44d516d73d7b574\"" to instance "i-0a9c2918c261c4e5e" since volume is in "creating" state
49m   Warning  Unhealthy           pod/elasticsearch-cdm-2ojk3phw-1-5c8c5f7586-ts5vx  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]
48m   Warning  Unhealthy           pod/elasticsearch-cdm-2ojk3phw-1-5c8c5f7586-ts5vx  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 503]
50m   Warning  FailedAttachVolume  pod/elasticsearch-cdm-2ojk3phw-2-7cdc4879f-rg6n6   AttachVolume.Attach failed for volume "pvc-5700eafc-977e-11e9-934e-128329b34930" : "Error attaching EBS volume \"vol-0e08d10df1de14b8c\"" to instance "i-081a6b4a7bfcd4bd9" since volume is in "creating" state
49m   Warning  Unhealthy           pod/elasticsearch-cdm-2ojk3phw-2-7cdc4879f-rg6n6   Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]
48m   Warning  Unhealthy           pod/elasticsearch-cdm-2ojk3phw-2-7cdc4879f-rg6n6   Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 503]
49m   Warning  FailedAttachVolume  pod/elasticsearch-cdm-2ojk3phw-3-7d495899b9-wg4fb  AttachVolume.Attach failed for volume "pvc-571f70cc-977e-11e9-934e-128329b34930" : "Error attaching EBS volume \"vol-07508359dc15a71be\"" to instance "i-0a9c2918c261c4e5e" since volume is in "creating" state
48m   Warning  Unhealthy           pod/elasticsearch-cdm-2ojk3phw-3-7d495899b9-wg4fb  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]
47m   Warning  Unhealthy           pod/elasticsearch-cdm-2ojk3phw-3-7d495899b9-wg4fb  Readiness probe failed: cat: /opt/app-root/src/init_failures: No such file or directory
122m  Warning  FailedAttachVolume  pod/elasticsearch-cdm-gv0bryoh-1-657ff768f8-mnjvf  AttachVolume.Attach failed for volume "pvc-531d34a5-9774-11e9-934e-128329b34930" : "Error attaching EBS volume \"vol-00dbaa1d5bbe58283\"" to instance "i-0a9c2918c261c4e5e" since volume is in "creating" state
121m  Warning  Unhealthy           pod/elasticsearch-cdm-gv0bryoh-1-657ff768f8-mnjvf  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]
120m  Warning  Unhealthy           pod/elasticsearch-cdm-gv0bryoh-1-657ff768f8-mnjvf  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 503]
52m   Warning  Unhealthy           pod/elasticsearch-cdm-gv0bryoh-1-657ff768f8-mnjvf  Readiness probe errored: rpc error: code = Unknown desc = container is not created or running
122m  Warning  FailedAttachVolume  pod/elasticsearch-cdm-gv0bryoh-2-65b854478-vng6v   AttachVolume.Attach failed for volume "pvc-533b4e22-9774-11e9-934e-128329b34930" : "Error attaching EBS volume \"vol-0bd152cc1e009bfbc\"" to instance "i-081a6b4a7bfcd4bd9" since volume is in "creating" state
121m  Warning  Unhealthy           pod/elasticsearch-cdm-gv0bryoh-2-65b854478-vng6v   Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]
120m  Warning  Unhealthy           pod/elasticsearch-cdm-gv0bryoh-2-65b854478-vng6v   Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 503]
119m  Warning  Unhealthy           pod/elasticsearch-cdm-gv0bryoh-2-65b854478-vng6v   Readiness probe failed: cat: /opt/app-root/src/init_failures: No such file or directory
52m   Warning  Unhealthy           pod/elasticsearch-cdm-gv0bryoh-2-65b854478-vng6v   Readiness probe errored: rpc error: code = Unknown desc = container is not created or running
121m  Warning  FailedAttachVolume  pod/elasticsearch-cdm-gv0bryoh-3-7c6bccc6bd-g2gkj  AttachVolume.Attach failed for volume "pvc-5359cc87-9774-11e9-934e-128329b34930" : "Error attaching EBS volume \"vol-0a120cf2ef2311fc8\"" to instance "i-0a9c2918c261c4e5e" since volume is in "creating" state
120m  Warning  Unhealthy           pod/elasticsearch-cdm-gv0bryoh-3-7c6bccc6bd-g2gkj  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]
119m  Warning  Unhealthy           pod/elasticsearch-cdm-gv0bryoh-3-7c6bccc6bd-g2gkj  Readiness probe failed: cat: /opt/app-root/src/init_failures: No such file or directory
52m   Warning  Unhealthy           pod/elasticsearch-cdm-gv0bryoh-3-7c6bccc6bd-g2gkj  Readiness probe errored: rpc error: code = Unknown desc = container is not created or running
51m   Warning  Unhealthy           pod/kibana-579964d87f-4q7q2                        Readiness probe errored: rpc error: code = Unknown desc = container is not created or running
[ematysek@jump ~]$
> build 4.2.0-0.nightly-2019-06-24-112621

I'm wondering what is the source of logging/rsyslog to build this version...

I expect to see the logrotate container in the rsyslog daemonset like this.

$ oc describe ds rsyslog | egrep logrotate
   logrotate:
      /etc/cron.d from logrotate-crontab (ro)
      /opt/app-root/bin from logrotate-bin (ro)
   logrotate-bin:
    Name:      logrotate-bin
   logrotate-crontab:
    Name:      logrotate-crontab
(In reply to Noriko Hosoi from comment #10)
> > build 4.2.0-0.nightly-2019-06-24-112621
>
> I'm wondering what is the source of logging/rsyslog to build this version...

Is there something additional that I need to specify in the clusterlogging CRD spec to enable logrotate?
(In reply to Eric Matysek from comment #11)
> Is there something additional that I need to specify in the clusterlogging
> CRD spec to enable logrotate?

No, it's supposed to be available OOTB.
https://github.com/openshift/cluster-logging-operator/blob/master/pkg/k8shandler/rsyslog.go

If you run "oc get cm", what do you get? Here's mine.

$ oc get cm
NAME                            DATA   AGE
cluster-logging-operator-lock   0      2m59s
curator                         3      2m53s
elasticsearch                   3      2m54s
elasticsearch-operator-lock     0      3m4s
logrotate-bin                   3      17s
logrotate-crontab               1      17s
rsyslog                         22     17s
rsyslog-bin                     1      17s
rsyslog-main                    1      17s
sharing-config                  2      2m54s
This testing is on 4.2, but the logging images are coming from public OperatorHub (clusterlogging is v4.1.2).

rsyslog.x86_64               8.37.0-9.el7   installed
rsyslog-mmkubernetes.x86_64  8.37.0-9.el7   installed

root@ip-172-31-64-58: ~ # oc describe ds rsyslog | egrep logrotate
root@ip-172-31-64-58: ~ # oc describe ds rsyslog | grep Image
    Image:      quay.io/openshift/origin-logging-rsyslog:latest

Is there a different rsyslog image than quay.io/openshift/origin-logging-rsyslog:latest? Should we be running a different upstream version of clusterlogging?

root@ip-172-31-64-58: ~ # oc get cm
NAME             DATA   AGE
curator          3      5m18s
elasticsearch    3      5m22s
rsyslog          15     5m14s
rsyslog-bin      1      5m14s
rsyslog-main     1      5m14s
sharing-config   2      5m21s
DO NOT USE quay.io/openshift/origin-logging-* images. Logging images from CI are not published there anymore.

The dates are very misleading. For example, https://quay.io/repository/openshift/origin-logging-rsyslog?tag=latest&tab=tags says "latest" was "modified" 2 hours ago. I have no idea what "modified" means in this context. If you look at the actual image:

>docker run -it quay.io/openshift/origin-logging-rsyslog:latest bash
# rpm -qi rsyslog
Install Date: Wed May 15 04:51:34 2019

Instead, you should be using the images built in the ci registry in registry.svc.ci.openshift.org/origin/4.2

>docker run -it registry.svc.ci.openshift.org/origin/4.2:logging-rsyslog bash
# rpm -qi rsyslog
Install Date: Wed Jun 26 21:54:01 2019
Build Date  : Wed Jun 26 21:53:19 2019

Same with the operator images. Use registry.svc.ci.openshift.org/origin/4.2:elasticsearch-operator and cluster-logging-operator. Those are the real 4.2 candidate images.

Logging CI uses registry.svc.ci.openshift.org/origin/4.2:*
(sorry, i mistakenly changed the bz status...)
logging CI uses https://github.com/openshift/origin-aggregated-logging/blob/master/hack/deploy-logging.sh for CI testing in api.ci

We are currently using the deploy_logging_using_clo_make method (the default).

This is how the script is run from logging CI or manual testing, by the hack/get-cluster-run-tests.sh script:
https://github.com/openshift/origin-aggregated-logging/blob/master/hack/get-cluster-run-tests.sh#L162

USE_EO_LATEST_IMAGE=true and USE_CLO_LATEST_IMAGE=true tell the script to use the latest logging and logging operator images in registry.svc.ci.openshift.org/origin/4.2

You can override all of these using EXTERNAL_REGISTRY, EXT_REG_IMAGE_NS, and MASTER_VERSION.

For 4.1, you can use USE_OLM=true to do a deployment using OLM (catalogs, subscriptions, etc.)
I got logging deployed using comment 17. The operators come from registry.svc.ci.openshift.org

registry.svc.ci.openshift.org/origin/4.2:cluster-logging-operator
registry.svc.ci.openshift.org/origin/4.2:elasticsearch-operator

But the actual logging images are coming from quay.io. I'll look at how to override.

quay.io/openshift/origin-logging-rsyslog:latest
quay.io/openshift/origin-logging-kibana5:latest
quay.io/openshift/origin-logging-elasticsearch5:latest
(In reply to Mike Fiedler from comment #18)
> I got logging deployed using comment 17. The operators come from
> registry.svc.ci.openshift.org
>
> registry.svc.ci.openshift.org/origin/4.2:cluster-logging-operator
> registry.svc.ci.openshift.org/origin/4.2:elasticsearch-operator
>
> But the actual logging images are coming from quay.io. I'll look at how to
> override.

Try USE_CUSTOM_IMAGES=true - that should cause the logging deploy script to use the latest 4.2 tagged images in registry.svc.ci.openshift.org

> quay.io/openshift/origin-logging-rsyslog:latest
> quay.io/openshift/origin-logging-kibana5:latest
> quay.io/openshift/origin-logging-elasticsearch5:latest
This does the trick:

USE_CUSTOM_IMAGES=true MASTER_VERSION=4.2 EXT_REG_IMAGE_NS=origin EXTERNAL_REGISTRY=registry.svc.ci.openshift.org hack/deploy-logging.sh
I am still seeing large portions of logs missing from elasticsearch using the logging deploy script to pull the correct 4.2 images.

$ oc describe ds/rsyslog | grep -C 1 -i image
   rsyslog:
    Image:      registry.svc.ci.openshift.org/origin/4.2:logging-rsyslog
    Port:       24231/TCP
--
   logrotate:
    Image:      registry.svc.ci.openshift.org/origin/4.2:logging-rsyslog
    Port:       <none>

These elasticsearch indices should all have 60,000 documents:

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open project.logtest-rsyslog-3-0.ba169bc7-a25e-11e9-badd-0ac45d2c0912.2019.07.09 F0XFOFtySA2lOR0A8Sd4Rg 3 1 60000 0 128.9mb 64.5mb
green open project.logtest-rsyslog-3-1.ba16698e-a25e-11e9-badd-0ac45d2c0912.2019.07.09 aQu6Ocl8RPWwqQUHuawfjQ 3 1 50000 0 104.8mb 52.3mb
green open project.logtest-rsyslog-3-2.ba254ee3-a25e-11e9-984f-0e3fa727759a.2019.07.09 jhRp-_SnR6up5bErftUC3w 3 1 50000 0 106.1mb 52.9mb
green open project.logtest-rsyslog-3-3.ba1196ab-a25e-11e9-984f-0e3fa727759a.2019.07.09 6CLFlpVOQcChUCip8j5vOw 3 1 50000 0 108.8mb 54.2mb
green open project.logtest-rsyslog-3-4.ba111abd-a25e-11e9-984f-0e3fa727759a.2019.07.09 7ROY7LdxTXi40nsuu2u1Eg 3 1 50200 0 105.1mb 52.9mb

The indices with missing documents are missing big chunks of sequential messages, e.g.:

index: project.logtest-rsyslog-3-2.ba254ee3-a25e-11e9-984f-0e3fa727759a.2019.07.09
Missing log lines: 20151-30150
Number of missing logs: 10000
16.6667% message loss rate

Observed no restarts for rsyslog pods. I don't see any indication of an issue in rsyslog pod logs.

The following test parameters consistently show missing logs:
# of projects: 5
lines per project: 60,000
rate: 6,000 logs/m/project

The fluentd log collector has no issues with the same test (it also produces a smaller store.size, I guess less metadata):

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open project.logtest-fluentd-3-0.6e6b89b5-a263-11e9-badd-0ac45d2c0912.2019.07.09 ksVZR5jLS5-RsaYWceQnOQ 3 1 60000 0 86.3mb 43mb
green open project.logtest-fluentd-3-1.6e68aacd-a263-11e9-badd-0ac45d2c0912.2019.07.09 P80969ysQnet8V-7PPE9OQ 3 1 60000 0 85.8mb 42.8mb
green open project.logtest-fluentd-3-2.6e736e91-a263-11e9-984f-0e3fa727759a.2019.07.09 -4OhBXadTZuAM9dGl14gzg 3 1 60000 0 86.3mb 43.3mb
green open project.logtest-fluentd-3-3.6e6a3bd0-a263-11e9-984f-0e3fa727759a.2019.07.09 d_b_XoOiRECX_NHCwTZl3A 3 1 60000 0 85.8mb 42.8mb
green open project.logtest-fluentd-3-4.6e6a36fb-a263-11e9-984f-0e3fa727759a.2019.07.09 w15HjQHkQXmzzGwEMs00mQ 3 1 60000 0 86mb 43mb
Please include the metrics from rsyslog and the logs, e.g. something like this:

for pod in $( oc get pods -l component=rsyslog -o jsonpath='{.items[*].metadata.name}' ) ; do
  oc exec -c rsyslog $pod -- curl -sk https://localhost:24231/metrics > $pod.metrics
  oc exec -c rsyslog $pod -- logs > $pod.log
done

Please include the other indices - perhaps the logs are going to .operations.* or .orphaned.* or some other indices:

oc exec -c elasticsearch $espod -- es_util --query=_cat/indices?v
Created attachment 1588820 [details] rsyslog metrics and logs
All indices:

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open project.logtest-fluentd-3-0.6e6b89b5-a263-11e9-badd-0ac45d2c0912.2019.07.09 ksVZR5jLS5-RsaYWceQnOQ 3 1 60450 0 87.5mb 43.6mb
green open project.logtest-fluentd-3-1.6e68aacd-a263-11e9-badd-0ac45d2c0912.2019.07.09 P80969ysQnet8V-7PPE9OQ 3 1 60500 0 87.7mb 43.7mb
green open project.logtest-fluentd-3-2.6e736e91-a263-11e9-984f-0e3fa727759a.2019.07.09 -4OhBXadTZuAM9dGl14gzg 3 1 60500 0 87.7mb 43.8mb
green open project.logtest-fluentd-3-3.6e6a3bd0-a263-11e9-984f-0e3fa727759a.2019.07.09 d_b_XoOiRECX_NHCwTZl3A 3 1 60600 0 87.5mb 43.7mb
green open project.logtest-fluentd-3-4.6e6a36fb-a263-11e9-984f-0e3fa727759a.2019.07.09 w15HjQHkQXmzzGwEMs00mQ 3 1 60300 0 87.3mb 43.6mb
green open project.logtest-rsyslog-3-0.ba169bc7-a25e-11e9-badd-0ac45d2c0912.2019.07.09 F0XFOFtySA2lOR0A8Sd4Rg 3 1 60000 0 128.9mb 64.5mb
green open project.logtest-rsyslog-3-1.ba16698e-a25e-11e9-badd-0ac45d2c0912.2019.07.09 aQu6Ocl8RPWwqQUHuawfjQ 3 1 50000 0 104.8mb 52.3mb
green open project.logtest-rsyslog-3-2.ba254ee3-a25e-11e9-984f-0e3fa727759a.2019.07.09 jhRp-_SnR6up5bErftUC3w 3 1 50000 0 106.1mb 52.9mb
green open project.logtest-rsyslog-3-3.ba1196ab-a25e-11e9-984f-0e3fa727759a.2019.07.09 6CLFlpVOQcChUCip8j5vOw 3 1 50000 0 108.8mb 54.2mb
green open project.logtest-rsyslog-3-4.ba111abd-a25e-11e9-984f-0e3fa727759a.2019.07.09 7ROY7LdxTXi40nsuu2u1Eg 3 1 50200 0 105.1mb 52.9mb
green open project.logtest-rsyslog-5-0.c50d1a5e-a267-11e9-badd-0ac45d2c0912.2019.07.09 HfgyfipgQIWGm95EWVHxqQ 3 1 30000 0 65.3mb 32.6mb
green open project.logtest-rsyslog-5-1.c516f449-a267-11e9-badd-0ac45d2c0912.2019.07.09 epCY9HOiQs2-kGu5v496Fw 3 1 30000 0 65.2mb 32.6mb
green open project.logtest-rsyslog-5-2.c515e15a-a267-11e9-badd-0ac45d2c0912.2019.07.09 wcwNktiHQe6AFEr9yGp97w 3 1 30000 0 65.5mb 32.7mb
green open project.logtest-rsyslog-5-3.c51580f0-a267-11e9-984f-0e3fa727759a.2019.07.09 hrSjnGU7SrGrDSs4PXJVbA 3 1 30000 0 65.9mb 32.9mb
green open project.logtest-rsyslog-5-4.c5099874-a267-11e9-984f-0e3fa727759a.2019.07.09 KxXZ9j0aS_yRrJT3s6DnvA 3 1 30000 0 65.2mb 32.5mb
green open project.logtest-rsyslog-5-5.c519a1e2-a267-11e9-badd-0ac45d2c0912.2019.07.09 A7TUHs2qTh6NRg_Kkl-jmg 3 1 30000 0 65.7mb 32.8mb
green open project.logtest-rsyslog-5-6.c512ab96-a267-11e9-badd-0ac45d2c0912.2019.07.09 jt8mGc7_TfaKkewxCX7wZw 3 1 30000 0 65.4mb 32.7mb
green open project.logtest-rsyslog-5-7.c517af54-a267-11e9-984f-0e3fa727759a.2019.07.09 K7WoETsnTeeBX8iREqrbkQ 3 1 30000 0 64.8mb 32.3mb
green open project.logtest-rsyslog-5-8.c521b7a6-a267-11e9-984f-0e3fa727759a.2019.07.09 0lJcmrs2Q4mlMTP4Fr_Zmg 3 1 30000 0 65mb 32.5mb
green open project.logtest-rsyslog-5-9.c5177cc0-a267-11e9-984f-0e3fa727759a.2019.07.09 H1Xa7udrQnmRaU8VD1Xr1g 3 1 30000 0 65.5mb 32.5mb
green open .kibana x3ffge2bQNaSr1vrR7qwnQ 1 1 5 0 112.2kb 54.8kb
green open .kibana.647a750f1787408bf50088234ec0edd5a6a9b2ac al6wnGjnSuSuW3QR4qz30A 1 1 2 1 103.4kb 51.7kb
green open .operations.2019.07.09 4z82jh_vQ_GVG4c7xKrSvw 3 1 711396 0 2.2gb 1gb
green open .operations... yMFATbstRimOr_AQR-XjLg 3 1 60 0 187.8kb 72.8kb
green open .searchguard KwgVp98NSGilYY2qhV5UOA 1 1 0 0 149.8kb 74.9kb
This query returns 0 hits, so I don't think the logs are mistakenly going to .operations*:

/.operations.2019.07.09/_search?q=message:SVTLogger
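To also rule out the missing records landing in some other index entirely, the same q= style query can be run without an index name so it searches everything; a rough sketch (20151 is just one of the missing line numbers from the earlier comment, used for illustration):

oc exec -c elasticsearch $espod -- es_util --query='_count?q=message:20151'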
There must be a clue in the fact that there are exactly 10000 records missing . . .

Also

green open .operations... yMFATbstRimOr_AQR-XjLg 3 1 60 0 187.8kb 72.8kb

This is bad - it means we're not correctly parsing the date for some of the operations logs - please file a bug for this.

When did the test start? I see this in the rsyslog logs:

Jul 9 15:27:03 rsyslog-lzmp7 rsyslogd: [origin software="rsyslogd" swVersion="8.37.0-9.el7" x-pid="15" x-info="http://www.rsyslog.com"] start
Jul 9 15:29:12 rsyslog-lzmp7 rsyslogd-2007: omelasticsearch: we are suspending ourselfs due to server failure 7: Failed connect to elasticsearch:9200; Connection timed out [v8.37.0-9.el7 try http://www.rsyslog.com/e/2007 ]
Jul 9 15:29:12 rsyslog-lzmp7 rsyslogd-2007: action 'elasticsearch' suspended (module 'omelasticsearch'), retry 0. There should be messages before this one giving the reason for suspension. [v8.37.0-9.el7 try http://www.rsyslog.com/e/2007 ]
Jul 9 15:29:17 rsyslog-lzmp7 rsyslogd-2007: omelasticsearch: checkConn failed after 1 attempts. [v8.37.0-9.el7 try http://www.rsyslog.com/e/2007 ]
Jul 9 15:29:47 rsyslog-lzmp7 rsyslogd-2359: action 'elasticsearch' resumed (module 'omelasticsearch') [v8.37.0-9.el7 try http://www.rsyslog.com/e/2359 ]

It seems unlikely that exactly 10000 records were dropped during this time, but maybe? How does this happen? Do you redeploy elasticsearch after starting rsyslog?

I also see this:

Jul 9 15:51:30 rsyslog-lzmp7 rsyslogd-3000: omelasticsearch: could not parse JSON result [v8.37.0-9.el7]

Please file a bz for this. I don't think it is related to the missing 10000 records, but it should be investigated.
Also - the rsyslog prometheus metrics aren't very helpful in diagnosing this problem. We need the raw impstats.json.

oc edit cm rsyslog - scroll down to 10-viaq-modules.conf

Uncomment this line:

module(load="impstats" interval="1" format="cee" log.syslog="off" log.file=`echo $RSYSLOG_IMPSTATS_FILE`)

comment out everything that follows:

#module(load="omprog")
#module(
#    load="impstats"
#    interval="10"
#    format="json"
#    resetCounters="off"
#    ruleset="prometheus_stats"
#)
#ruleset(name="prometheus_stats") {
#    action(
#        type="omprog"
#        name="prometheus_stats"
#        binary="/usr/local/bin/rsyslog_exporter --web.listen-address=:24231 --tls.server-crt=/etc/rsyslog/metrics/tls.crt --tls.server-key=/etc/rsyslog/metrics/tls.key"
#    )
#}

That will disable the prometheus exporter and just dump the raw stats to /var/lib/rsyslog.pod/impstats.json

Then, at the end of the run, use

for pod in $( oc get pods -l component=rsyslog -o jsonpath='{.items[*].metadata.name}' ) ; do
  oc exec -c rsyslog $pod -- cat /var/lib/rsyslog.pod/impstats.json > $pod.stats.json
done

I believe QE is going to open another bug about the usefulness of the prometheus exporter metrics.
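Once the run is done, a quick way to see whether any rsyslog queue actually threw messages away is to scan those stats for nonzero discard counters. A rough sketch, assuming the cee-format lines are prefixed with "@cee:" and carry the standard impstats queue fields discarded.full / discarded.nf:

for pod in $( oc get pods -l component=rsyslog -o jsonpath='{.items[*].metadata.name}' ) ; do
  echo "=== $pod ==="
  # strip the @cee: cookie, then show any samples where a queue discarded messages
  oc exec -c rsyslog $pod -- cat /var/lib/rsyslog.pod/impstats.json \
    | sed 's/^@cee: *//' \
    | grep -E '"discarded\.(full|nf)" *: *[1-9]' || echo "no discards reported"
done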
Agreed about the 10,000 records missing clue; from my testing it is almost always chunks of 9000-10000 missing logs, even when I adjust the number of pods or rate of messages.

Ex. from a different test: 10x projects, 4500 logs/min/project

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open project.logtest-rsyslog-6-5.2a19f4d2-a272-11e9-984f-0e3fa727759a.2019.07.09 80qb3n4ZQi2ZS5p5px5thg 3 1 25886 0 56.1mb 27.9mb

Expected max number: 45000
Missing log line(s): 9425-19342 (9917)
Missing log line(s): 28577-37772 (9195)
Number of missing logs: 19114
42.4756% message loss rate

I will file a bz for the .operations... index

The rsyslog-3 test started at 15:32 and ended at 15:42. I believe those logs are generated because rsyslog pods are up and ready much quicker than the elasticsearch pods, which take a few minutes to get to the ready state. There may also be some extra rsyslog pod log weirdness due to me switching the log collector back and forth between rsyslog/fluentd between tests; I can re-run and grab new logs from a fresh cluster if needed.

I will file a bz for the 'could not parse JSON result' log
I was able to confirm that changing the log line size from 1KB to 2KB caused the chunks of missing records to reduce from ~10k to ~5k. This supports Mike's theory that this issue is related to the kubelet wrapping logs at 10MB.

Missing log line(s): 30251-35300 (5049)
Missing log line(s): 40301-45350 (5049)
Missing log line(s): 50351-55400 (5049)

I'll look into increasing this limit and see how the test performs after that.
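Rough arithmetic behind that (assuming the kubelet's default containerLogMaxSize of 10Mi and that each rendered line is a bit over its nominal payload size):

  10 MiB / ~1 KiB per line ≈ 10,000 lines per rotated file
  10 MiB / ~2 KiB per line ≈ 5,000 lines per rotated file

which lines up with the ~10k chunks seen with 1KB messages and the ~5k chunks seen with 2KB messages: each missing chunk is roughly one rotated-away log file.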
(In reply to Eric Matysek from comment #29)
> I was able to confirm that changing the log line size from 1KB to 2KB caused
> the chunks of missing records to reduce from ~10k to ~5k.
> This supports Mike's theory that this issue is related to the kubelet
> wrapping logs at 10MB.
>
> Missing log line(s): 30251-35300 (5049)
> Missing log line(s): 40301-45350 (5049)
> Missing log line(s): 50351-55400 (5049)
>
> I'll look into increasing this limit and see how the test performs after that.

hmm - could be related to the rsyslog endmsg.regex support for cri-o partial records. Can you provide an example of the raw cri-o file log under /var/log/containers/*.log for a record that is not processed by rsyslog into Elasticsearch?
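For reference, the cri-o/CRI log format writes one file line per chunk with a timestamp, the stream, and a P (partial) or F (full) tag, so a line that cri-o had to split looks roughly like this (illustrative content, not taken from this cluster):

2019-07-15T19:28:55.601740695+00:00 stdout P first half of a very long application line...
2019-07-15T19:28:55.601740695+00:00 stdout F ...second half of that same line
2019-07-15T19:28:56.000000000+00:00 stdout F a short line that fits in one chunk

If the endmsg.regex handling were the problem, the records that never reach Elasticsearch would be the ones whose raw file lines carry the P tag.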
While looking into this, one interesting thing I noticed was that all the logtest logs appear to be going through the same rsyslog pod.

[root@rsyslog-zmx4v containers]# pwd
/var/log/containers
[root@rsyslog-zmx4v containers]# ls | grep logtest
centos-logtest-75666_logtest-3-3_centos-logtest-6b0145e5d63a5a91bd469f26c4e55f859d79d5802068c1fa1f878722eb180b0a.log
centos-logtest-9z8s2_logtest-3-0_centos-logtest-9977c254377e503a483c5a7c59a73908e792f3a3cb93ec3f9ee1894994111c8b.log
centos-logtest-dckwr_logtest-3-2_centos-logtest-cad7f493ba3e0dc5ce2918c2ce5bd626a4a5e0cfd17eee1b26bd7a2edb1d1c75.log
centos-logtest-ms9xh_logtest-3-4_centos-logtest-978d2d02791ac5c2924c8a3a9578845f7827f7f1bc1a292ebf21723b623c47d7.log
centos-logtest-rcgjk_logtest-3-1_centos-logtest-f4fb630ab339aa5b7c1dfa0e45868d60fe979e5e69be6c2bff4ead6564d537c3.log

logtest-3-0 results (this cluster rolls logs at 20Mi instead of 10Mi so we expect missing logs in ~20k chunks):

Expected max number: 60000
Missing log line(s): 37601-56600 (18999)
Number of missing logs: 19000
31.6667% message loss rate

Here is one of the missing log lines:

[root@rsyslog-zmx4v centos-logtest]# pwd
/var/log/pods/logtest-3-0_centos-logtest-9z8s2_7ac4e8c7-a735-11e9-888f-0265ad267b82/centos-logtest
[root@rsyslog-zmx4v centos-logtest]# grep 55000 0.log.20190715-192912
2019-07-15T19:28:55.601740695+00:00 stderr F 2019-07-15 19:28:55,601 - SVTLogger - INFO - centos-logtest-9z8s2 : 55000 : DVHtpC9S5 IVjquBCSA o4FkqalUC 4sxs0N3ll kKrUHLQJy PYWZzzeXM x8wvMC8Yn n7dniTU3x m9aEljYB5 AckAxmF9U du8flHeKm F53jeyGgw RxOjXwlCC ag7jMN9wl ubSU5lru5 5VwA8KEch DvMRbX0cR yXVFkZDyU LyzIjn8RY IHbA1CxZO rJPCbesue nCZPYKlFs Mxt34ZCt3 E39iXa7Zy dH5hu14cI A1kLR7Vea 7eX6doYOG 2ODhcoMFl 0FEN9fdri IGT3kKtqT mswtMZohg ZRAzjwQWr dH5cERq9c WkI1WAnAP 8EpTLT0Za pYFdUL1RS sHWBN44rk 6SRCBERHC Y6viOmmlr PBxUjRegt LrvJ0Pksf 8zvvsBjGD bIZzNG2uh AFFJRn3PM RoOgB2B6Z AXKzwTeoU GxpSZhxBv cCsNkTFF7 1wLzDXl3l b3B4qZbmw 1D2peR6Qt WUauJhpei hFAbGx6p8 OuAjrTKkA X0SNRkTJ2 eybqz1Ig7 7CZAx6HGp Hbc9EwPoJ QUqq3Y1xA HhsxTJnox Z20PaeC2G AmOx55gUa tH5iPcYUg BVW1GPKKn I9S6nxUuS jjIWnORgK MQHB9s4O9 ZpVfOsrus KPLonlBrM 4NDg2gUiE luHt1am37 UqmeF4CgX nVQyWswCd clAuQLqlW s7hgrhj2j ThhQWpOXl hZWJWh7DS vdeESPPdW Ue52nsUkN pMmKdIjlU qpRr4RKZx D0W5nWoOy rfvPUaqTC 4U2iiqd8a wDMNZOdqt lmWVvaPdf H3ZbYcxIV fLd1sRsIK JECotRN3s 8kkZB7wVR L01X1YNnk IIlZUGrMV DqSrRGQJR EC7a5wkyV roY1J3tyj vEMAKWP5v oyAORvyLS 3LIvzcFig kHHbXyW1P N1ru7sh5a 4NtQ0Va1M mrjOyCv3G FbY5

All the missing log lines from this run are in the same file:
/var/log/pods/logtest-3-0_centos-logtest-9z8s2_7ac4e8c7-a735-11e9-888f-0265ad267b82/centos-logtest/0.log.20190715-192912
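A quick sanity check that the missing lines were actually written to disk (i.e. lost between the file and Elasticsearch, not dropped by the application) is to count them directly on the node; a sketch using the pod name and paths from the listing above:

oc exec -c rsyslog rsyslog-zmx4v -- sh -c 'wc -l /var/log/pods/*_centos-logtest-9z8s2_*/centos-logtest/0.log*'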
Created attachment 1590873 [details] Missing logs
> While looking into this, one interesting thing I noticed was that all the logtest logs appear to be going through the same rsyslog pod.

Do you mean - all of the logtest pods are running on the same node? I guess this is a good stress test for a single rsyslog node, but if that was not your intention, do you need to set some sort of anti-affinity on the logtest pods so that they are all scheduled on different nodes, spread out as much as possible?
(In reply to Rich Megginson from comment #33)
> Do you mean - all of the logtest pods are running on the same node? I guess
> this is a good stress test for a single rsyslog node, but if that was not
> your intention, do you need to set some sort of anti-affinity on the logtest
> pods so that they are all scheduled on different nodes, spread out as much
> as possible?

I mean I have 5 logtest projects and there are 5 rsyslog pods, but all 5 logtest logs are on the same rsyslog pod. There were no logs in the /var/log/containers/ directory for logtest pods on any of the other 4 rsyslog pods.

[root@rsyslog-xbppx /]# ls /var/log/containers/cent*
ls: cannot access /var/log/containers/cent*: No such file or directory

Should these be getting load balanced across all 5 rsyslog pods?
> I mean I have 5 logtest projects and there are 5 rsyslog pods, but all 5 logtest logs are on the same rsyslog pod.

There is 1 rsyslog pod per node. rsyslog is a daemonset, which means usually there is one and only one pod per node. If you do `oc get pods -o wide` you can see which nodes your pods are running on.

> Should these be getting load balanced across all 5 rsyslog pods?

I don't know, it depends on what you are trying to test. If all of your logtest pods are running on the same node, and this is not what you want, you'll need to ask Mike how to run logtest so that it is spread out evenly among all of the nodes. Or investigate pod anti-affinity, taints, and tolerations.
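One way to see the mapping directly (a sketch building on the commands already used in this bug) is to walk the rsyslog pods, print the node each one runs on, and list any logtest container logs visible to it:

for p in $( oc get pods -l component=rsyslog -o jsonpath='{.items[*].metadata.name}' ) ; do
  node=$( oc get pod $p -o jsonpath='{.spec.nodeName}' )
  echo "=== $p on $node ==="
  oc exec -c rsyslog $p -- ls /var/log/containers/ | grep logtest || echo "no logtest logs on this node"
done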
> There is 1 rsyslog pod per node. rsyslog is a daemonset, which means usually there is one and only one pod per node.

Ah, I see, thank you for the clarification.
I think it's safe to say the issue is related to the logs getting rotated, since following the steps here (https://github.com/mffiedler/ocp-svt/blob/master/svt-notes/logging/config_kubelet_larger_pod_logs.md) changed the size of the missing chunks. Is it possible rsyslog is falling behind and the logs are getting gzipped before rsyslog is able to push them to elasticsearch?
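A quick way to check for that (a sketch; imfile only follows the plain-text files, so anything already compressed under the pod log directories has rotated out from under it - paths follow the earlier examples in this bug):

for p in $( oc get pods -l component=rsyslog -o jsonpath='{.items[*].metadata.name}' ) ; do
  echo "=== $p ==="
  # rotated-and-compressed pod logs on this node
  oc exec -c rsyslog $p -- find /var/log/pods -name '*.gz' 2>/dev/null
  # rotated copies that are still uncompressed
  oc exec -c rsyslog $p -- find /var/log/pods -name '0.log.*' -not -name '*.gz' 2>/dev/null
done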
(In reply to Eric Matysek from comment #37)
> I think it's safe to say the issue is related to the logs getting rotated,
> since following the steps here
> (https://github.com/mffiedler/ocp-svt/blob/master/svt-notes/logging/config_kubelet_larger_pod_logs.md)
> changed the size of the missing chunks.
> Is it possible rsyslog is falling behind and the logs are getting gzipped
> before rsyslog is able to push them to elasticsearch?

Yes, that is possible, but very disappointing, as it means that rsyslog is not performing as well as fluentd :-( (assuming everything else is the same, only using rsyslog instead of fluentd)

We'll need to gather some monitoring information from rsyslog.

Change rsyslog to produce impstats.json as in https://bugzilla.redhat.com/show_bug.cgi?id=1722944#c27

Also, every minute do `oc exec -c rsyslog $rsyslogpod -- ls -alrtF /var/lib/rsyslog.pod` so we can see what is happening with the queue files and the position files.
Performed a fresh test today: 5 logtest projects each producing 60k logs. Only 50 logs were missing from the whole test, so let me know if it would be better to load it more to get more missing records.

index: project.logtest-3-1.c6b276d1-a809-11e9-b8a7-0ad653da7944.2019.07.16
Expected max number: 60000
Missing log line(s): 9301-9350 (50)
Number of missing logs: 50
0.0833% message loss rate

rsyslog pod responsible: rsyslog-gbqw9
Created attachment 1591207 [details] impstats and rsyslog.pod output
Performance did improve with PR220 but fluentd is still more performant than rsyslog by quite a bit. Because this bug has so much history at this point I'm going to mark this verified and open a new bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922