Description of problem:
No logs are showing after a certain date. The issue occurred after an upgrade. Scaling the fluentd pods down and back up let some logs from after the upgrade through, but none after a certain point.

Version-Release number of selected component (if applicable):
logging-elasticsearch-v3.6.173.0.5-5
logging-fluentd-v3.6.173.0.21-17

How reproducible:
Unconfirmed

Actual results:
2017-09-21 22:06:00 +0200 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-09-22 00:39:02 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-09-22 00:39:03 +0200 error_class="TypeError" error="no implicit conversion of nil into String" plugin_id="object:11b86f8"
2017-09-22 00:39:02 +0200 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5/lib/fluent/plugin/out_elasticsearch_dynamic.rb:240:in `sub!'
2017-09-22 00:39:02 +0200 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5/lib/fluent/plugin/out_elasticsearch_dynamic.rb:240:in `expand_param'
2017-09-22 00:39:02 +0200 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5/lib/fluent/plugin/out_elasticsearch_dynamic.rb:130:in `block (2 levels) in write_objects'
2017-09-22 00:39:02 +0200 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5/lib/fluent/plugin/out_elasticsearch_dynamic.rb:125:in `each'
2017-09-22 00:39:02 +0200 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5/lib/fluent/plugin/out_elasticsearch_dynamic.rb:125:in `each_with_index'
2017-09-22 00:39:02 +0200 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5/lib/fluent/plugin/out_elasticsearch_dynamic.rb:125:in `block in write_objects'
2017-09-22 00:39:02 +0200 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:123:in `each'
2017-09-22 00:39:02 +0200 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:123:in `block in msgpack_each'
2017-09-22 00:39:02 +0200 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/plugin/buf_file.rb:71:in `open'
2017-09-22 00:39:02 +0200 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:120:in `msgpack_each'
2017-09-22 00:39:02 +0200 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5/lib/fluent/plugin/out_elasticsearch_dynamic.rb:121:in `write_objects'
2017-09-22 00:39:02 +0200 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:490:in `write'
2017-09-22 00:39:02 +0200 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:354:in `write_chunk'
2017-09-22 00:39:02 +0200 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:333:in `pop'
2017-09-22 00:39:02 +0200 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:342:in `try_flush'
2017-09-22 00:39:02 +0200 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:149:in `run'
2017-09-22 00:39:03 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-09-22 00:39:05 +0200 error_class="TypeError" error="no implicit conversion of nil into String" plugin_id="object:11b86f8"
2017-09-22 00:39:04 +0200 [warn]: suppressed same stacktrace
2017-09-20 19:52:13 +0200 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-09-20 19:52:55 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-09-20 19:52:20 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:1c7ae4c"

Contacting elasticsearch cluster 'elasticsearch' and wait for YELLOW clusterstate ...
Clustername: logging-es
Clusterstate: GREEN
Number of nodes: 3
Number of data nodes: 3

Additional info:
May be related to https://bugzilla.redhat.com/show_bug.cgi?id=1486493 -- but no projects/namespaces were deleted. The issue occurred after an upgrade.
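A quick way to see how widespread the TypeError is across the fluentd daemonset is to count its occurrences in each pod's log. This is only a hedged triage sketch; it assumes logging runs in the default "logging" project and that the fluentd pods carry the component=fluentd label:

for p in $(oc get pods -n logging -l component=fluentd -o jsonpath='{.items[*].metadata.name}'); do
  echo "== $p"
  # count how often the failing flush shows up in this collector's log
  oc logs -n logging $p | grep -c 'no implicit conversion of nil into String'
done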
I have a post about additional steps to get more information: https://richmegginson.livejournal.com/29741.html

Note: if you change the configmap to do this, be sure to change it back _immediately_ after reproducing the problem, as this will cause fluentd to spew a large volume of data.
I've run into the same issue, and as per comment 2, here's my output. I think the problem is in one of the two JSON objects, but they look fine to me: they both have a @timestamp, and the kubernetes one has both namespace_id and namespace_name.

2017-10-06 12:51:58 +0200 journal.system: {"systemd":{"t":{"MACHINE_ID":"00d1b79716a34614976af02f296f6e24","BOOT_ID":"f8d86ce7bcef4f1091852d9b49144f55","CAP_EFFECTIVE":"1fffffffff","CMDLINE":"/usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2","COMM":"openshift","EXE":"/usr/bin/openshift","GID":"0","HOSTNAME":"node02.domain.it","PID":"2466","SELINUX_CONTEXT":"system_u:system_r:init_t:s0","SYSTEMD_CGROUP":"/system.slice/atomic-openshift-node.service","SYSTEMD_SLICE":"system.slice","SYSTEMD_UNIT":"atomic-openshift-node.service","TRANSPORT":"stdout","UID":"0"},"u":{"SYSLOG_FACILITY":"3","SYSLOG_IDENTIFIER":"atomic-openshift-node"}},"hostname":"node02.domain.it","message":"I1006 11:16:17.251671 2466 operation_generator.go:609] MountVolume.SetUp succeeded for volume \"kubernetes.io/secret/fcd6174c-aa76-11e7-9286-005056ba13d4-deployer-token-6x17r\" (spec.Name: \"deployer-token-6x17r\") pod \"fcd6174c-aa76-11e7-9286-005056ba13d4\" (UID: \"fcd6174c-aa76-11e7-9286-005056ba13d4\").","pipeline_metadata":{"collector":{"ipaddr4":"10.128.5.100","ipaddr6":"fe80::858:aff:fe80:564","inputname":"fluent-plugin-systemd","name":"fluentd openshift","received_at":"2017-10-06T09:16:17.000000+00:00","version":"0.12.39 1.6.0"}},"level":"info","@timestamp":"2017-10-06T09:16:17.000000+00:00"}

2017-10-06 12:51:58 +0200 kubernetes.journal.container: {"docker":{"container_id":"aec5c57642c2547362fbe485cb57832d6ef4da957bcc3446c1d03f0beed9a834"},"kubernetes":{"container_name":"deployment","namespace_name":"imprese","pod_name":"imprese-2-deploy","namespace_id":"b93ecfd0-aa67-11e7-b8c1-005056ba5ce6"},"hostname":"node02.domain.it","message":"--> Scaling up imprese-2 from 0 to 1, scaling down imprese-1 from 1 to 0 (keep 1 pods available, don't exceed 2 pods)","level":"info","pipeline_metadata":{"collector":{"ipaddr4":"10.128.5.100","ipaddr6":"fe80::858:aff:fe80:564","inputname":"fluent-plugin-systemd","name":"fluentd openshift","received_at":"2017-10-06T09:16:17.846085+00:00","version":"0.12.39 1.6.0"}},"systemd":{"t":{"MACHINE_ID":"00d1b79716a34614976af02f296f6e24","BOOT_ID":"f8d86ce7bcef4f1091852d9b49144f55","CAP_EFFECTIVE":"1fffffffff","CMDLINE":"/usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --selinux-enabled --log-driver=journald --log-level=warn --ipv6=false --storage-driver overlay2 --mtu=1450 --add-registry registry.access.redhat.com --add-registry registry.access.redhat.com","COMM":"dockerd-current","EXE":"/usr/bin/dockerd-current","GID":"0","HOSTNAME":"node02.domain.it","PID":"1771","SELINUX_CONTEXT":"system_u:system_r:container_runtime_t:s0","SOURCE_REALTIME_TIMESTAMP":"1507281377846085","SYSTEMD_CGROUP":"/system.slice/docker.service","SYSTEMD_SLICE":"system.slice","SYSTEMD_UNIT":"docker.service","TRANSPORT":"journal","UID":"0"}},"@timestamp":"2017-10-06T09:16:17.846085+00:00"}

2017-10-06 12:51:59 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-10-06 12:51:59 +0200 error_class="TypeError" error="no implicit conversion of nil into String" plugin_id="object:19e6b04"
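If the debug output from comment 2 is captured to a file, a quick filter can surface container records that are missing namespace metadata, which is the condition the fix committed later in this bug addresses. A hedged sketch only; the capture file name is an assumption, and the tag and field names come from the records shown above:

# Save the verbose fluentd output, then look for container records without namespace_name
oc logs $FLUENTDPOD > fluentd-debug.log
grep 'kubernetes.journal.container' fluentd-debug.log | grep -v 'namespace_name'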
The following solves the issue. I saw a similar MUX-related BZ which fails like this.

Uninstall logging -> https://docs.openshift.com/container-platform/3.6/install_config/aggregate_logging.html#aggregate-logging-cleanup

Then put openshift_logging_image_version=v3.6.173.0.5 in the inventory.

Install logging with https://docs.openshift.com/container-platform/3.6/install_config/aggregate_logging.html#deploying-the-efk-stack
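For reference, a minimal sketch of that redeploy with the image version pinned. The inventory path and the 3.6 byo playbook path are assumptions and should be adjusted to the environment; passing -e on the command line is equivalent to setting the variable under [OSEv3:vars] in the inventory:

# Redeploy the EFK stack with the logging images pinned to the last known-good version
ansible-playbook -i /etc/ansible/hosts \
  /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/openshift-logging.yml \
  -e openshift_logging_install_logging=true \
  -e openshift_logging_image_version=v3.6.173.0.5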
Hello,

I have the same case: logs from some projects are not transferred to Elasticsearch. However, it seems that it worked with the problematic version before, but now it doesn't work at all.

The current version marked as latest is v3.6.173.0.49-4 - is the new image OK?

Thank you
(In reply to Vladislav Walek from comment #6)
> Hello,
> 
> I have same case when I see that logs from some projects are not transferred
> to the elastic search.
> However, it seems that it worked with the problematic version before, but
> now it doesn't work at all.
> 
> The current version marked as latest is v3.6.173.0.49-4, is the new image ok?

If it isn't working for you, then it probably isn't ok.

> 
> Thank you
Hello,

I'm working on the case 01960527 that Vladislav commented on before. You said v3.6.173.0.49-4, marked as the latest, is not OK - could you tell us why you think so?

Since the customer is facing a product issue, we should fix it soon. If rolling back to an older version (for example v3.6.173.0.5, which Miheer tried) is a valid workaround, we can recommend it to the customer. However, we don't have a solid justification for that at this point.

Regards,
(In reply to Takayoshi Tanaka from comment #8)
> Hello,
> 
> I'm working on the case 01960527, commented by Vladislav before. As you said
> v3.6.173.0.49-4, marked as the latest, is not OK, could you tell us why you
> think so?

I don't know. But if it isn't working, then it isn't OK.

> 
> Since the customer is facing the product issue, we should fix it soon. If
> rolling back to the old version: for example v3.6.173.0.5 which Miheer
> tried, is a valid workaround, we can recommend to the customer. However, it
> has not a valid reason at this point.
> 
> Regards,

We have later versions too - the latest version is https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=617058 logging-fluentd-docker-v3.6.173.0.63-2 which may also fix the problem (I don't know because I don't know what the problem is)
Hello Rich,

the issue is that logs are not sent to Elasticsearch. The case I am working on shows:
- there are 2 app nodes, n01 and n02
- there are 2 projects running on them (2 pods each) - p01 and p02
- for p01, when the pod is running on n01 you can see logs, but only until the 30th of October
- for p01, when the pod is running on n02 you can't see the logs
- for p02, whether running on n01 or n02 you can't see anything
- the logs will be deleted after 7 days

We can see logs on the pod, but can't see the index in Elasticsearch:
- both p01 and p02 are generating logs from the 27th until the 31st of October
- in Elasticsearch, only part of the logs show up in Kibana

Version 49-4 doesn't fix the issue. We tried running that container, but no logs were sent to ES.

Any thoughts?
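One way to confirm which project indices actually exist in Elasticsearch is to query _cat/indices from an ES pod. A hedged sketch, assuming an ES pod name in $ESPOD and the admin client certificates at the usual secret mount path; both are assumptions to verify against the cluster:

oc exec $ESPOD -- curl -s \
  --cacert /etc/elasticsearch/secret/admin-ca \
  --cert /etc/elasticsearch/secret/admin-cert \
  --key /etc/elasticsearch/secret/admin-key \
  'https://localhost:9200/_cat/indices?v' | grep -E 'project\.(p01|p02)'

If the indices for a project are missing entirely, the records never reached Elasticsearch, which points back at the stuck fluentd buffers rather than at index retention.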
Hello,

also, the latest version available to the customer in the registry is 49-4. The one mentioned in Comment #9 is not available.

Thank you
(In reply to Miheer Salunke from comment #5)
> The following solves the issue. I saw something similar BZ related MUX which
> fails like this.
> 
> Uninstall logging ->
> https://docs.openshift.com/container-platform/3.6/install_config/
> aggregate_logging.html#aggregate-logging-cleanup
> 
> Then put openshift_logging_image_version=v3.6.173.0.5 in the inventory.
> 
> Install logging with
> https://docs.openshift.com/container-platform/3.6/install_config/
> aggregate_logging.html#deploying-the-efk-stack

I am trying to check with the customer whether the uninstall is necessary, as the issue could be with fluentd alone. The alternative is to change only the image tag in the daemonset from v3.6 (which points to latest) to v3.6.173.0.5 (which is an alias for tag 3.6.173.0.5-5); see the sketch below.
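A minimal sketch of that daemonset-only rollback. The daemonset name logging-fluentd, the container name fluentd-elasticsearch, the registry path, and the component=fluentd label are assumptions based on the default ansible install and should be checked against the cluster before use:

oc project logging
# Point the fluentd container at the pinned tag instead of v3.6
oc set image daemonset/logging-fluentd \
  fluentd-elasticsearch=registry.access.redhat.com/openshift3/logging-fluentd:v3.6.173.0.5
# Delete the running pods so the daemonset recreates them with the new image
oc delete pods -l component=fluentd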
(In reply to Vladislav Walek from comment #11)
> Hello,
> 
> also the latest version in registry available for customer is 49-4.
> The one mentioned in Comment #9 is not available.
> 
> Thank you

Right. I'm not sure when it will be going out - 3.6.3, etc.
1) If it is OK with you:

Uninstall logging -> https://docs.openshift.com/container-platform/3.6/install_config/aggregate_logging.html#aggregate-logging-cleanup

Then put openshift_logging_image_version=v3.6.173.0.5 in the inventory.

Install logging with https://docs.openshift.com/container-platform/3.6/install_config/aggregate_logging.html#deploying-the-efk-stack

OR

2) The fluentd version could be changed in the daemonset to the tag you want to pull - in this case v3.6.173.0.5 (remember that it must be identical to the tag in the registry [1]). Then, after just deleting the pod, it should be automatically redeployed with the specified image version.

The second option was tried, but it still fails.

Rich, any suggestions on how to move ahead with this issue?
I have 2 questions.

In the fluentd pod:

oc rsh $FLUENTDPOD

Do we have a filter-post-z-* config file in /etc/fluent/configs.d?
# ls /etc/fluent/configs.d/openshift/filter-post-z-*
/etc/fluent/configs.d/openshift/filter-post-z-retag-two.conf

Also, what does the fluentd configmap look like?

oc edit configmap $FLUENTDPOD

Does the configmap have <label @OUTPUT> as follows?
8<-----------------------------------------------------------------------------------------
<label @INGRESS>
## filters
@include configs.d/openshift/filter-pre-*.conf
@include configs.d/openshift/filter-retag-journal.conf
@include configs.d/openshift/filter-k8s-meta.conf
@include configs.d/openshift/filter-kibana-transform.conf
@include configs.d/openshift/filter-k8s-flatten-hash.conf
@include configs.d/openshift/filter-k8s-record-transform.conf
@include configs.d/openshift/filter-syslog-record-transform.conf
@include configs.d/openshift/filter-viaq-data-model.conf
@include configs.d/openshift/filter-post-*.conf
##
</label>

<label @OUTPUT>
## matches
@include configs.d/openshift/output-pre-*.conf
@include configs.d/openshift/output-operations.conf
@include configs.d/openshift/output-applications.conf
# no post - applications.conf matches everything left
##
</label>
8<-----------------------------------------------------------------------------------------

If there is no filter-post-z-* config file in /etc/fluent/configs.d/openshift, please remove </label> and <label @OUTPUT> as follows:
8<-----------------------------------------------------------------------------------------
<label @INGRESS>
## filters
@include configs.d/openshift/filter-pre-*.conf
@include configs.d/openshift/filter-retag-journal.conf
@include configs.d/openshift/filter-k8s-meta.conf
@include configs.d/openshift/filter-kibana-transform.conf
@include configs.d/openshift/filter-k8s-flatten-hash.conf
@include configs.d/openshift/filter-k8s-record-transform.conf
@include configs.d/openshift/filter-syslog-record-transform.conf
@include configs.d/openshift/filter-viaq-data-model.conf
@include configs.d/openshift/filter-post-*.conf
##

## matches
@include configs.d/openshift/output-pre-*.conf
@include configs.d/openshift/output-operations.conf
@include configs.d/openshift/output-applications.conf
# no post - applications.conf matches everything left
##
</label>
8<-----------------------------------------------------------------------------------------

If you have the filter-post-z-* config file in /etc/fluent/configs.d/openshift and do not have </label> and <label @OUTPUT>, please add them. (I don't think that's the case, since the fluentd run.sh does not install filter-post-z-* unless <label @OUTPUT> is found in the configmap.)

Thanks,
--noriko
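The same checks can also be done without an interactive shell. A hedged sketch; the component=fluentd label and the configmap name logging-fluentd are assumptions based on the default ansible install:

FLUENTDPOD=$(oc get pods -l component=fluentd -o jsonpath='{.items[0].metadata.name}')
# List the installed config snippets, looking for filter-post-z-*
oc exec $FLUENTDPOD -- ls /etc/fluent/configs.d/openshift/
# Check whether the configmap contains the <label @INGRESS>/<label @OUTPUT> sections
oc get configmap logging-fluentd -o yaml | grep -nE '<label @(INGRESS|OUTPUT)>|</label>'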
(In reply to Rich Megginson from comment #20)
> I have 2 questions.
> 
> In the fluentd pod:
> 
> oc rsh $FLUENTDPOD
> 
> Do we have a filter-post-z-* config file in /etc/fluent/configs.d?
> # ls /etc/fluent/configs.d/openshift/filter-post-z-*
> /etc/fluent/configs.d/openshift/filter-post-z-retag-two.conf
> 
> Also, how does the fluentd's configmap look like?
> oc edit configmap $FLUENTDPOD
> 
> Does the configmap have <label @OUTPUT> as follows?
> 8<-----------------------------------------------------------------------------------------
> <label @INGRESS>
> ## filters
> @include configs.d/openshift/filter-pre-*.conf
> @include configs.d/openshift/filter-retag-journal.conf
> @include configs.d/openshift/filter-k8s-meta.conf
> @include configs.d/openshift/filter-kibana-transform.conf
> @include configs.d/openshift/filter-k8s-flatten-hash.conf
> @include configs.d/openshift/filter-k8s-record-transform.conf
> @include configs.d/openshift/filter-syslog-record-transform.conf
> @include configs.d/openshift/filter-viaq-data-model.conf
> @include configs.d/openshift/filter-post-*.conf
> ##
> </label>
> 
> <label @OUTPUT>
> ## matches
> @include configs.d/openshift/output-pre-*.conf
> @include configs.d/openshift/output-operations.conf
> @include configs.d/openshift/output-applications.conf
> # no post - applications.conf matches everything left
> ##
> </label>
> 8<-----------------------------------------------------------------------------------------
> 
> If there is no filter-post-z-* config file in
> /etc/fluent/configs.d/openshift, please remove </label> and <label @OUTPUT>
> as follows:
> 8<-----------------------------------------------------------------------------------------
> <label @INGRESS>
> ## filters
> @include configs.d/openshift/filter-pre-*.conf
> @include configs.d/openshift/filter-retag-journal.conf
> @include configs.d/openshift/filter-k8s-meta.conf
> @include configs.d/openshift/filter-kibana-transform.conf
> @include configs.d/openshift/filter-k8s-flatten-hash.conf
> @include configs.d/openshift/filter-k8s-record-transform.conf
> @include configs.d/openshift/filter-syslog-record-transform.conf
> @include configs.d/openshift/filter-viaq-data-model.conf
> @include configs.d/openshift/filter-post-*.conf
> ##
> 
> ## matches
> @include configs.d/openshift/output-pre-*.conf
> @include configs.d/openshift/output-operations.conf
> @include configs.d/openshift/output-applications.conf
> # no post - applications.conf matches everything left
> ##
> </label>
> 8<-----------------------------------------------------------------------------------------
> 
> If you have the filter-post-z-* config file in
> /etc/fluent/configs.d/openshift and do not have </label> and <label
> @OUTPUT>, please add them. (I don't think that's the case since the fluentd
> run.sh does not install filter-post-z-* unless <label @OUTPUT> is found in
> the configmap.)
> 
> Thanks,
> --noriko

@Rich, they don't have such a filter, nor a <label @INGRESS> within the configmap.
(In reply to Nicolas Nosenzo from comment #24)
> @Rich, they don't such a filter, neither a <label @INGRESS> within the
> configmap

@Nicolas, how about <label @OUTPUT>?
(In reply to Noriko Hosoi from comment #25)
> (In reply to Nicolas Nosenzo from comment #24)
> > @Rich, they don't such a filter, neither a <label @INGRESS> within the
> > configmap
> 
> @Nicolas, how about <label @OUTPUT>?

@Noriko, I meant <label @OUTPUT>, sorry.
The following was done and this is what happened ->

The customer was told to proceed as per https://bugzilla.redhat.com/show_bug.cgi?id=1494612#c14

Still, one fluentd did not send logs to Elasticsearch:
error_class="TypeError" error="no implicit conversion of nil into String" plugin_id=""

So then we did the following ->

Stop fluentd:
oc label node $nodename logging-infra-fluentd-

Delete the stale buffer files found with ls -al /var/lib/fluentd

Restart fluentd:
oc label node $nodename logging-infra-fluentd=true

Then fluentd was up again.
(In reply to Miheer Salunke from comment #34)
> Following things were done and had happened ->
> 
> Customer was told to as per this ->
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1494612#c14
> 
> still one fluentd did not send logs to the elastic search.
> " error_class="TypeError" error="no implicit conversion of nil into String"
> plugin_id=""
> 
> So then we did the following ->
> Stop fluentd
> 
> oc label node $nodename logging-infra-fluentd-
> 
> Delete stale buffer files in ls -al /var/lib/fluentd

We were only able to do this because these were infra node logs that were older than the retention policy. This is not a general purpose solution. The customer saved the stale buffer files so that we can do further analysis.

> 
> oc label node $nodename logging-infra-fluentd=true
> 
> Then fluentd was again up.
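To keep that caution actionable, here is a hedged sketch of the safer variant: back the buffers up instead of deleting them. The /var/lib/fluentd path and the label commands come from the comments above; the backup directory is an assumption:

# On the affected node, after unlabeling it so fluentd stops:
oc label node $nodename logging-infra-fluentd-
mkdir -p /root/fluentd-buffer-backup
# Move (rather than delete) the stale buffer chunks so they stay available for analysis
mv /var/lib/fluentd/* /root/fluentd-buffer-backup/
# Relabel the node so the daemonset schedules fluentd again
oc label node $nodename logging-infra-fluentd=true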
For customers hitting this on 3.5, is there a confirmed 3.5 image to use as a workaround that does not exhibit the behavior? If not, is it safe to use 3.6.173.0.5 -- in particular, is it safe to use 3.6.173.0.5 when you only care about exporting logs to an external ELK stack?
The "working" version in 3.6 is part of this release: https://docs.openshift.com/container-platform/3.6/release_notes/ocp_3_6_release_notes.html#ocp-3-6-rhba-2017-1829 -- and it is later releases causing issue. The closest equivalent in 3.5 is here: https://docs.openshift.com/container-platform/3.5/release_notes/ocp_3_5_release_notes.html#ocp-3-5-rhba-2017-1828 The logging images associated with that release are: openshift3/logging-auth-proxy:3.5.0-28 openshift3/logging-elasticsearch:3.5.0-37 openshift3/logging-fluentd:3.5.0-26 openshift3/logging-kibana:3.5.0-30 Does this seem correct?
Updating priority as this is:
- part of the prio-list thread
- causing customers to need to use earlier images to avoid the bug, which leaves them exposed to bugs that are fixed in later images
Commits pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/3459ae8165c856539aa2fb10ce8c2649f2d1a395
bug 1494612. Orphan records missing namespace_name and/or namespace_id

https://github.com/openshift/origin-aggregated-logging/commit/064628dd44f5731d0947749bece9d27d9a45157d
Merge pull request #856 from jcantrill/1494612_orphan_namespaces

Automatic merge from submit-queue.

bug 1494612. Orphan records missing namespace_name and/or namespace_id
Jeff,

For test purposes, is there any good method to create orphan records missing namespace_name and/or namespace_id?
Verified - the orphaned records are indexed in the .orphaned project when using openshift3/logging-fluentd/images/v3.6.173.0.95-1:

green open .orphaned.2018.01.08 1 0 94 0 97.2kb 97.2kb
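That verification can be reproduced from an ES pod by listing the orphaned indices directly. A hedged sketch, assuming an ES pod name in $ESPOD and the admin certificates at the usual secret mount path, as in the earlier sketch:

oc exec $ESPOD -- curl -s \
  --cacert /etc/elasticsearch/secret/admin-ca \
  --cert /etc/elasticsearch/secret/admin-cert \
  --key /etc/elasticsearch/secret/admin-key \
  'https://localhost:9200/_cat/indices/.orphaned*?v'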
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0113