Bug 1486493 - logging-mux pod looping on "no implicit conversion of nil into String" when namespace deleted while logging
Keywords:
Status: CLOSED DUPLICATE of bug 1500882
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.6.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.6.z
Assignee: Rich Megginson
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-08-29 23:57 UTC by Mike Fiedler
Modified: 2021-09-09 12:33 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-18 16:32:14 UTC
Target Upstream Version:


Attachments
logging-mux dc (4.83 KB, text/plain)
2017-08-30 18:37 UTC, Mike Fiedler
logging-fluentd ds (4.96 KB, text/plain)
2017-08-30 18:37 UTC, Mike Fiedler
logging-logs-lab.tar.gz (49.49 KB, application/x-gzip)
2017-09-25 14:38 UTC, Ruben Romero Montes
var/lib/fluentd (2.95 KB, application/x-gzip)
2017-09-28 08:36 UTC, Ruben Romero Montes


Links
Red Hat Knowledge Base (Solution) 3211281, last updated 2017-10-24 07:18:52 UTC

Description Mike Fiedler 2017-08-29 23:57:46 UTC
Description of problem:

I haven't found the exact trigger for this yet, but hopefully the stack trace will help. Running the latest logging-fluentd image as a logging-mux pod, fluentd occasionally starts looping and using a full core of CPU (the most fluentd can use without the multi-process plugin), with the following messages in the logging-mux pod log.

Stopping and restarting logging-mux sometimes clears the issue. Stopping and restarting the logging-fluentd pods in the daemonset sometimes clears it as well.

2017-08-29 20:31:32 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-29 20:31:32 +0000 error_class="TypeError" error="no implicit conversion of nil into String" plugin_id="object:1c0c8e8"
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5/lib/fluent/plugin/out_elasticsearch_dynamic.rb:240:in `sub!'
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5/lib/fluent/plugin/out_elasticsearch_dynamic.rb:240:in `expand_param'
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5/lib/fluent/plugin/out_elasticsearch_dynamic.rb:130:in `block (2 levels) in write_objects'
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5/lib/fluent/plugin/out_elasticsearch_dynamic.rb:125:in `each'
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5/lib/fluent/plugin/out_elasticsearch_dynamic.rb:125:in `each_with_index'
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5/lib/fluent/plugin/out_elasticsearch_dynamic.rb:125:in `block in write_objects'
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:123:in `each'
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:123:in `block in msgpack_each'
2017-08-29 20:31:39 +0000 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:1959a38" elapsed_time=26.55471477 slow_flush_log_threshold=20.0
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/plugin/buf_file.rb:71:in `open'
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:120:in `msgpack_each'
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5/lib/fluent/plugin/out_elasticsearch_dynamic.rb:121:in `write_objects'
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:490:in `write'
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:354:in `write_chunk'
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:333:in `pop'
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:342:in `try_flush'
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:149:in `run'
2017-08-29 20:31:46 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-29 20:31:34 +0000 error_class="TypeError" error="no implicit conversion of nil into String" plugin_id="object:1c0c8e8"
  2017-08-29 20:31:46 +0000 [warn]: suppressed same stacktrace
2017-08-29 20:31:46 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-29 20:31:38 +0000 error_class="TypeError" error="no implicit conversion of nil into String" plugin_id="object:1c0c8e8"
  2017-08-29 20:31:47 +0000 [warn]: suppressed same stacktrace
2017-08-29 20:31:48 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-29 20:31:47 +0000 error_class="TypeError" error="no implicit conversion of nil into String" plugin_id="object:1c0c8e8"
  2017-08-29 20:31:48 +0000 [warn]: suppressed same stacktrace
2017-08-29 20:31:49 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-29 20:32:01 +0000 error_class="TypeError" error="no implicit conversion of nil into String" plugin_id="object:1c0c8e8"
  2017-08-29 20:31:49 +0000 [warn]: suppressed same stacktrace
2017-08-29 20:32:02 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-29 20:32:36 +0000 error_class="TypeError" error="no implicit conversion of nil into String" plugin_id="object:1c0c8e8"
  2017-08-29 20:32:02 +0000 [warn]: suppressed same stacktrace
2017-08-29 20:32:36 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-29 20:33:35 +0000 error_class="TypeError" error="no implicit conversion of nil into String" plugin_id="object:1c0c8e8"
  2017-08-29 20:32:37 +0000 [warn]: suppressed same stacktrace
2017-08-29 20:33:36 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-29 20:35:40 +0000 error_class="TypeError" error="no implicit conversion of nil into String" plugin_id="object:1c0c8e8"
  2017-08-29 20:33:36 +0000 [warn]: suppressed same stacktrace
2017-08-29 20:35:40 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-29 20:40:08 +0000 error_class="TypeError" error="no implicit conversion of nil into String" plugin_id="object:1c0c8e8"
  2017-08-29 20:35:40 +0000 [warn]: suppressed same stacktrace
root@ip-172-31-11-80: ~/svt/openshift_scalability # oc logs -f logging-mux-8-vhfkp
WARNING: Available disk space (2681203712 bytes) is less than the user specified file buffer limit (2Gi times 2).
2017-08-29 20:31:00 +0000 [info]: reading config file path="/etc/fluent/fluent.conf"
2017-08-29 20:31:00 +0000 [warn]: 'exclude1' parameter is deprecated: Use <exclude> section
2017-08-29 20:31:32 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-29 20:31:32 +0000 error_class="TypeError" error="no implicit conversion of nil into String" plugin_id="object:1c0c8e8"
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5/lib/fluent/plugin/out_elasticsearch_dynamic.rb:240:in `sub!'
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5/lib/fluent/plugin/out_elasticsearch_dynamic.rb:240:in `expand_param'
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5/lib/fluent/plugin/out_elasticsearch_dynamic.rb:130:in `block (2 levels) in write_objects'
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5/lib/fluent/plugin/out_elasticsearch_dynamic.rb:125:in `each'
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5/lib/fluent/plugin/out_elasticsearch_dynamic.rb:125:in `each_with_index'
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5/lib/fluent/plugin/out_elasticsearch_dynamic.rb:125:in `block in write_objects'
  2017-08-29 20:31:33 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:123:in `each'
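
For reference, expand_param at out_elasticsearch_dynamic.rb:240 appears to be substituting record fields into a dynamic parameter (such as the index name) when it hits the TypeError. A minimal, self-contained Ruby sketch of that failure mode (illustrative only, not the plugin's actual code):

  # Illustrative only: a record whose kubernetes.namespace_name has gone
  # missing (e.g. the namespace was deleted while its logs were still
  # buffered) yields nil, and String#sub! refuses a nil replacement with
  # exactly the TypeError shown in the stack trace above.
  record   = { "kubernetes" => { "namespace_name" => nil } }
  template = "project.${namespace_name}.${namespace_id}".dup

  begin
    template.sub!("${namespace_name}", record["kubernetes"]["namespace_name"])
  rescue TypeError => e
    puts e.message   # => no implicit conversion of nil into String
  end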



Version-Release number of selected component (if applicable): logging-fluentd:3.6.173.0.5-10 from brew.

Additional info:  Logging install inventory (logging-fluentd:3.6.173.0.5-10 added after initial install):

[oo_first_master]
ip-172-31-26-103

[oo_first_master:vars]
openshift_deployment_type=openshift-enterprise
openshift_release=v3.6.0

openshift_logging_install_logging=true
openshift_logging_master_url=https://ec2-54-213-80-174.us-west-2.compute.amazonaws.com:8443
openshift_logging_master_public_url=https://ec2-54-213-80-174.us-west-2.compute.amazonaws.com:8443
openshift_logging_kibana_hostname=kibana.0823-vpd.qe.rhcloud.com
openshift_logging_namespace=logging
openshift_logging_image_prefix=registry.ops.openshift.com/openshift3/
openshift_logging_image_version=v3.6
openshift_logging_es_cluster_size=1
openshift_logging_es_pvc_dynamic=true
openshift_logging_es_pvc_size=50Gi
openshift_logging_fluentd_use_journal=true
openshift_logging_use_mux=true
openshift_logging_mux_client_mode=maximal
openshift_logging_use_ops=false

openshift_logging_es_cpu_limit=2000m
openshift_logging_fluentd_cpu_limit=200m
openshift_logging_mux_cpu_limit=1000m
openshift_logging_kibana_cpu_limit=200m
openshift_logging_kibana_proxy_cpu_limit=100m
openshift_logging_es_memory_limit=9Gi
openshift_logging_fluentd_memory_limit=512Mi
openshift_logging_mux_memory_limit=2Gi
openshift_logging_kibana_memory_limit=1Gi
openshift_logging_kibana_proxy_memory_limit=256Mi

Comment 1 Noriko Hosoi 2017-08-30 00:27:18 UTC
I see this warning in the log:
WARNING: Available disk space (2681203712 bytes) is less than the user specified file buffer limit (2Gi times 2).

Could you configure a hostmount or PVC for the mux file buffer with more than 4GB?
    openshift_logging_mux_file_buffer_storage_type=hostmount
  or
    openshift_logging_mux_file_buffer_storage_type=pvc
    openshift_logging_mux_file_buffer_pvc_name=logging-mux-pvc
    openshift_logging_mux_file_buffer_pvc_size=4Gi (or more)

And when the error is reported, could you run these command lines and attach the output to this bz?

oc exec $MUX_POD -- ls -l /var/lib/fluentd

oc exec $ES_POD -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
  --key /etc/elasticsearch/secret/admin-key "https://localhost:9200/_cat/indices?v"

Comment 2 Mike Fiedler 2017-08-30 17:59:09 UTC
Hit the issue again with the logging-mux file buffer storage configured with a 4Gi pvc as suggested in comment 1.

root@ip-172-31-11-80: ~ # oc exec $mux -- ls -l /var/lib/fluentd
total 33764
-rw-r--r--. 1 root root  455386 Aug 30 17:53 buffer-output-es-config.kubernetes.mux.var.log.containers.mux-mux.mux-mux_mux-undefined_mux-0123456789012345678901234567890123456789012345678901234567890123.log.b557fc34c6c3d2b5e.log
-rw-r--r--. 1 root root  558315 Aug 30 17:51 buffer-output-es-config.kubernetes.mux.var.log.containers.mux-mux.mux-mux_mux-undefined_mux-0123456789012345678901234567890123456789012345678901234567890123.log.q557fc2dc29017d08.log
-rw-r--r--. 1 root root 4193041 Aug 30 17:51 buffer-output-es-config.kubernetes.mux.var.log.containers.mux-mux.mux-mux_mux-undefined_mux-0123456789012345678901234567890123456789012345678901234567890123.log.q557fc2df4c1ec5b8.log
-rw-r--r--. 1 root root 4193030 Aug 30 17:52 buffer-output-es-config.kubernetes.mux.var.log.containers.mux-mux.mux-mux_mux-undefined_mux-0123456789012345678901234567890123456789012345678901234567890123.log.q557fc2ed2d4733db.log
-rw-r--r--. 1 root root 4193047 Aug 30 17:52 buffer-output-es-config.kubernetes.mux.var.log.containers.mux-mux.mux-mux_mux-undefined_mux-0123456789012345678901234567890123456789012345678901234567890123.log.q557fc2f782eaa83a.log
-rw-r--r--. 1 root root 4193038 Aug 30 17:52 buffer-output-es-config.kubernetes.mux.var.log.containers.mux-mux.mux-mux_mux-undefined_mux-0123456789012345678901234567890123456789012345678901234567890123.log.q557fc301f40facb5.log
-rw-r--r--. 1 root root 4193036 Aug 30 17:53 buffer-output-es-config.kubernetes.mux.var.log.containers.mux-mux.mux-mux_mux-undefined_mux-0123456789012345678901234567890123456789012345678901234567890123.log.q557fc30fb0887f51.log
-rw-r--r--. 1 root root 4193043 Aug 30 17:53 buffer-output-es-config.kubernetes.mux.var.log.containers.mux-mux.mux-mux_mux-undefined_mux-0123456789012345678901234567890123456789012345678901234567890123.log.q557fc330ba221aca.log
-rw-r--r--. 1 root root 4193045 Aug 30 17:53 buffer-output-es-config.kubernetes.mux.var.log.containers.mux-mux.mux-mux_mux-undefined_mux-0123456789012345678901234567890123456789012345678901234567890123.log.q557fc3356a194a3e.log
-rw-r--r--. 1 root root 4193039 Aug 30 17:53 buffer-output-es-config.kubernetes.mux.var.log.containers.mux-mux.mux-mux_mux-undefined_mux-0123456789012345678901234567890123456789012345678901234567890123.log.q557fc34462fe8b1d.log


oc exec -n logging $POD -- curl --connect-timeout 2 -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es:9200/_cat/indices?v
health status index                                                        pri rep docs.count docs.deleted store.size pri.store.size 
green  open   .kibana                                                        1   0          1            0      3.1kb          3.1kb 
green  open   .operations.2017.08.30                                         1   0       1673            0   1020.5kb       1020.5kb 
green  open   project.svt0.e204fe47-8daa-11e7-bb99-02a1c4e79572.2017.08.30   1   0      54909            0      102mb          102mb

Comment 4 Mike Fiedler 2017-08-30 18:01:12 UTC
Marking this as severity high since the logging-mux pod goes into a state where it is chewing up a core and not forwarding anything to ES.

Comment 5 Noriko Hosoi 2017-08-30 18:10:31 UTC
Thanks, Mike.

"_mux-undefined_" in the index name means the mux clients are not sending the kubernetes namespace name.

input-post-forward-mux.conf: mux_namespace_name ${record['namespace_name'] || (tag_parts[0] == "project" && tag_parts[1]) || ENV["MUX_UNDEFINED_NAMESPACE"] || "mux-undefined"}
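
In plain Ruby, the fallback chain above amounts to roughly the following (self-contained sketch of the assumed behavior, not the actual mux config machinery):

  # A record with no namespace_name, arriving with a non-"project." tag and
  # with MUX_UNDEFINED_NAMESPACE unset, falls through to "mux-undefined".
  record    = {}
  tag_parts = "kubernetes.var.log.containers.app.log".split('.')

  mux_namespace_name =
    record['namespace_name'] ||
    (tag_parts[0] == "project" && tag_parts[1]) ||
    ENV["MUX_UNDEFINED_NAMESPACE"] ||
    "mux-undefined"
  puts mux_namespace_name   # => "mux-undefined"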

What is the MUX_CLIENT_MODE value set in the mux clients?  minimal?

Comment 6 Rich Megginson 2017-08-30 18:17:36 UTC
Let's get some information from ES to see how it is keeping up with the ingestion rate: https://www.elastic.co/guide/en/elasticsearch/reference/2.4/cat-thread-pool.html#_thread_pool_fields

oc exec -n logging $POD -- curl --connect-timeout 2 -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es:9200/_cat/thread_pool?v\&h=host,ba,bs,bq,bqs,br,bl,bc

Look for any non-zero value in the br or bq column - this indicates that ES is having trouble keeping up with mux. If the bc column is continually increasing, ES is ingesting data.
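
For illustration, a small Ruby sketch of that check against the output columns (hypothetical helper; column meanings taken from the linked docs, sample row from the output later in this bug):

  # Pair the header with one data row from _cat/thread_pool and flag trouble
  # when bulk.rejected (br) or bulk.queue (bq) is non-zero.
  header = %w[host ba bs bq bqs br bl bc]
  row    = "172.20.1.62  0  8  0  50  0  8 3935".split
  stats  = header.zip(row).to_h
  puts "ES may be falling behind mux" if stats["br"].to_i > 0 || stats["bq"].to_i > 0
  puts "bulk operations completed so far: #{stats['bc']}"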

Comment 7 Rich Megginson 2017-08-30 18:18:40 UTC
(In reply to Noriko Hosoi from comment #5)
> Thanks, Mike.
> 
> "_mux-undefined_" in the index name means, the mux clients are not sending
> the kubernetes namespace name.
> 
> input-post-forward-mux.conf: mux_namespace_name ${record['namespace_name']
> || (tag_parts[0] == "project" && tag_parts[1]) ||
> ENV["MUX_UNDEFINED_NAMESPACE"] || "mux-undefined"}
> 
> What is the MUX_CLIENT_MODE value set in the mux clients?  minimal?

It should be "maximal", which means the fluentd clients are doing all of the processing _except_ for kubernetes metadata, which should be done by mux. For the OpenShift-only use case, this is the whole point of mux: to offload the kubernetes metadata processing.

Comment 8 Noriko Hosoi 2017-08-30 18:20:33 UTC
(In reply to Rich Megginson from comment #7)
> It should be "maximal", which means the fluentd clients are doing all of the
> processing _except_ for kubernetes metadata, which should be done by mux. 
> For the OpenShift only use case, this is the whole point of mux, to offload
> the kubernetes metadata processing.

Yes, that's what I expect, too.

Comment 9 Rich Megginson 2017-08-30 18:25:11 UTC
@nhosoi the mux-undefined in the buffer file name is correct - this is what I would expect when clients send data that needs k8s metadata - https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/configs.d/input-post-forward-mux.conf#L81 - the rewrite tag filter tags those records with kubernetes.mux.var.log.containers.mux-mux.mux-mux.mux-undefined.

For records read from the journal by the fluentd mux clients, the CONTAINER_NAME should be in the record and should contain the correct kubernetes data which is used to look up the additional k8s metadata: https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/configs.d/filter-pre-mux.conf

Finally, these records are not re-tagged because, once the k8s metadata has been added, the tag no longer matters.

Comment 10 Noriko Hosoi 2017-08-30 18:27:06 UTC
@richm: ok, thanks!

Comment 11 Mike Fiedler 2017-08-30 18:35:23 UTC
From the logging-fluentd daemonset:

        - name: MUX_CLIENT_MODE
          value: maximal
        image: registry.ops.openshift.com/openshift3/logging-fluentd:v3.6.173.0.21


Even when things seem to be working well, I see files appearing and disappearing from /var/lib/fluentd with mux-undefined in the name.

Logging was working fine in this cluster and then hit this condition suddenly. The logging-mux logs are quiet when everything is working OK.

I can usually get it working again by deleting the bad files, re-tagging the nodes to restart logging-fluentd and re-scaling logging-mux to restart it.

I am attaching the full daemonset for logging-fluentd and the dc for logging-mux - let me know what other information might help.

Comment 12 Mike Fiedler 2017-08-30 18:37:03 UTC
Created attachment 1320259 [details]
logging-mux dc

Comment 13 Mike Fiedler 2017-08-30 18:37:25 UTC
Created attachment 1320260 [details]
logging-fluentd ds

Comment 14 Noriko Hosoi 2017-08-30 18:52:12 UTC
(In reply to Mike Fiedler from comment #12)
> Created attachment 1320259 [details]
> logging-mux dc

Sorry for my ignorance...  What does this 'cpu: "1"' mean?

        resources:
          limits:
            cpu: "1"
            memory: 2Gi

This is my mux resources:

        resources:
          limits:
            cpu: 500m
            memory: 2Gi
          requests:
            memory: 2Gi

Comment 15 Mike Fiedler 2017-08-30 18:57:40 UTC
It means 1 core/CPU.   Same meaning as 1000m (millicore).   Fluentd can only use 1 core without the multi-process plugin, so I let it use a full core.  It is the same as removing the limit, which might be less confusing.

The instance where ES and logging-mux are running has 8 cores and 32GB RAM

Comment 16 Mike Fiedler 2017-08-30 19:16:38 UTC
Output from the command in comment 6, immediately after triggering the issue:

root@ip-172-31-11-80: ~/svt/openshift_scalability # oc exec -n logging $POD -- curl --connect-timeout 2 -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es:9200/_cat/thread_pool?v\&h=host,ba,bs,bq,bqs,br,bl,bc
host        ba bs bq bqs br bl   bc 
172.20.1.62  0  8  0  50  0  8 3935

I will run this periodically and watch for non-0 bq and br.

Comment 17 Mike Fiedler 2017-08-30 19:18:28 UTC
oc exec -n logging $POD --   curl -s -k --cert /etc/elasticsearch/secret/admin-cert   --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/thread_pool?v\&h=host,bulk.completed,bulk.rejected,bulk.queue,bulk.active,bulk.queueSize

host        bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize 
172.20.1.62           3935             0          0           0             50

Comment 18 Rich Megginson 2017-08-30 22:43:46 UTC
ok - let's find out why mux is throwing that exception - 

oc edit configmap logging-mux

just before the first output, add this:

<filter **>
  @type stdout
</filter>

then redeploy mux

mux will spew a lot of messages to its pod log, so we may have to revert this once we have reproduced the exception.

Comment 24 Mike Fiedler 2017-09-01 00:40:41 UTC
Setting severity back down to medium since the issue is related to timing around namespace deletion. When the issue occurs with persistent file buffer storage for mux, the "bad" buffers have to be deleted and logging-mux restarted for indexing to resume.

Comment 25 Peter Portante 2017-09-09 19:49:05 UTC
I don't think we can delete any logs. If mux has a buffer payload, we need to index it. However, if we can't fetch the namespace information, we should still index it into the operations logs, adding some sort of tag so they can easily be found.

Comment 26 Rich Megginson 2017-09-19 13:10:57 UTC
I think this is related to https://bugzilla.redhat.com/show_bug.cgi?id=1493022 and is possibly a duplicate.

Comment 27 Peter Portante 2017-09-19 13:33:18 UTC
I'd like to amend comment 25: I don't think data from a namespace should land in the operations index. Instead, I would like to propose that it land in an index using the namespace_id. That way, if records are missing, they could be found via inspection later.

Comment 28 Rich Megginson 2017-09-19 14:19:22 UTC
(In reply to Peter Portante from comment #27)
> I'd like to amend comment 25, I don't think data from a namespace should
> land in the operations index.  Instead, I would like to propose that it land
> in an index using the namespace_id.  That way, if records are missing, they
> could be found via inspection later.

Do you mean namespace_name?

Comment 29 Peter Portante 2017-09-20 10:16:08 UTC
(In reply to Rich Megginson from comment #28)
> Do you mean namespace_name?

Thanks, I incorrectly referred to namespace_id.  But if namespace_name is not available, we still have to save the log entry to some index that operations can review.
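
A rough Ruby sketch of that proposed fallback (hypothetical only; the index name, field name, and helper are assumptions, not an agreed or shipped design):

  # Route records that lack kubernetes.namespace_name to a catch-all index
  # and mark them so operations can find them by inspection later.
  def target_index(record, time)
    ns = record.dig("kubernetes", "namespace_name")
    if ns
      "project.#{ns}.#{time.strftime('%Y.%m.%d')}"
    else
      record["orphaned"] = true
      ".orphaned.#{time.strftime('%Y.%m.%d')}"
    end
  end

  puts target_index({ "kubernetes" => {} }, Time.now)   # => .orphaned.<today>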

Comment 32 Ruben Romero Montes 2017-09-25 14:38:46 UTC
Created attachment 1330590 [details]
logging-logs-lab.tar.gz

Attached are logs from an isolated new instance. It seems this is not related to the deletion of a namespace.
It happens in 2 of the 3 fluentd pods.

Comment 33 Ruben Romero Montes 2017-09-28 08:36:47 UTC
Created attachment 1331832 [details]
var/lib/fluentd

Docker is configured to use the journald log driver on all nodes.

Comment 36 Rich Megginson 2017-09-28 20:09:32 UTC
I have a post about additional steps to get more information: https://richmegginson.livejournal.com/29741.html

Note: if you change the configmap to do this, be sure to change it back _immediately_ after reproducing the problem, as this will cause fluentd to spew a large volume of data.

Comment 56 Mike Fiedler 2017-10-16 18:08:50 UTC
In logging-fluentd 3.7.0-0.146, the symptoms have changed somewhat. The scenario is the same: delete projects while log records for those projects are in flight.

1) The message from the summary and description of this bz now identifies the missing field (detailed message below). The log record is still dropped (tracked by this bz), but the cause of the error is much clearer.

2) In 3.7, logging-mux does not appear to get "wedged"/stuck/hung when this error is seen; it continues to process additional log records.


2017-10-16 14:59:27 +0000 [error]: record cannot use elasticsearch index name type project_full: record is missing kubernetes.namespace_id field: {"message"=>"centos-logtest-2m0mh : 9146 : pltsv7c UIkDnWI w53XbuF 5T2fi16 rSz9FeR sO8udd8 CqMFXHt HWfNaPz MpoCAhP rc5JE1T 5PmQMWz 24m5XTe 44GDlMs fgg333a eOGHpXe mVQeORs 3yUwveU GCbbCFL lKnvRPm rpE4mlp 1YBqq5X mgJ7uoa 1VQIAkq Q1vcDlf 6DNCGhZ JQVq36V oSj3hH0 Y1zxPGE 1JppZ9S YRMrgE7 yzXoYRV KYsZ00z D5HEQgt hlH07GV QfvpAbz IdSD4rm 8oh3dEm RH4LBlf LVDf2be dJrEXnf DNqKAhN 0fgwd16 rVYn7QW 8jnvlXd o0pVrKq yWRO4FK Lg8K2Hy uGQoldq KSBApDL qdTM2c6 jRfDGo2 1WOVADQ kijJjwc Y23DhV7 14Vpi8T Cu1D6uG NFjBZwg 6ZK4d4o OplqWmR Yq0t3oD QvMCgQk kBEIYwV iE9qYI1 UgIuIcy IszjfCe oj3xVuV BZlhamz 18RDnaz TI9oZw9 rrglonY AIxiITW bC01VyL eQxcl1r jp4ARzC KG0lS7a KUuyhYd wSFi2kz o1ry07g 9BjdyZa LlHodLh RAI4g9o JvAqN05 vwuHtMt KYcS2px 4HGSgzg 0kV0G6y MFv5rM1 CDQElCj Za29ug9 VYTPYlO M2nEq7u Crtcxll ZFT5eI6 PNcYlQh ZPqAocy nTOODwC FONf0fn TUfzDWZ 3KcMuip YuezYdV X9puqct Et2LLXd 3OTt18Z pJn9WJq yWRU0DN 41GMLN3 s5qQcGC EqX2gkU gi9tbyG RrHjY41 wrxqhTD ktYwuNj 8nPRrxz 9V8WVgV S2djzxv eWvTy55 iAeHOBR 7Ktzp8u BhleJtj MldegvK e9CYpxw JbqVcBa nU7B96H rs16FXc Otvdbyl gqmuk1E Ma8Zol9 y12UUwY \n", "level"=>"info", "hostname"=>"ip-172-31-17-160.us-west-2.compute.internal", "pipeline_metadata"=>{"collector"=>{"ipaddr4"=>"172.22.0.116", "ipaddr6"=>"fe80::858:acff:fe16:74", "inputname"=>"fluent-plugin-systemd", "name"=>"fluentd", "received_at"=>"2017-10-16T14:57:34.607762+00:00", "version"=>"0.12.39 1.6.0"}, "normalizer"=>{"ipaddr4"=>"172.22.2.5", "ipaddr6"=>"fe80::ec7c:49ff:fe45:2d93", "inputname"=>"fluent-plugin-systemd", "name"=>"fluentd", "received_at"=>"2017-10-16T14:59:27.590889+00:00", "version"=>"0.12.39 1.6.0"}}, "@timestamp"=>"2017-10-16T14:57:33.960808+00:00", "CONTAINER_NAME"=>"k8s_centos-logtest.0_centos-logtest-2m0mh_svt8_0_01234567", "CONTAINER_ID_FULL"=>"b93345ce4c762916d349581ba27dd11b14954fe45183fb192dde4a4a5b988327", "docker"=>{"container_id"=>"b93345ce4c762916d349581ba27dd11b14954fe45183fb192dde4a4a5b988327"}, "kubernetes"=>{"container_name"=>"centos-logtest", "namespace_name"=>"svt8", "pod_name"=>"centos-logtest-2m0mh", "pod_id"=>"1f95265f-b282-11e7-b635-0211864528ea", "labels"=>{"run"=>"centos-logtest", "test"=>"centos-logtest"}, "host"=>"ip-172-31-17-160.us-west-2.compute.internal", "master_url"=>"https://ec2-54-200-230-39.us-west-2.compute.amazonaws.com:8443"}}
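
For context, the project_full index name needs both the namespace name and its UID (compare the index listing in comment 2, project.svt0.e204fe47-....2017.08.30). A minimal Ruby sketch of why the record above is rejected (illustrative only, not the actual plugin logic):

  # The record carries kubernetes.namespace_name ("svt8") but no
  # kubernetes.namespace_id, so the full index name cannot be built.
  k8s = { "namespace_name" => "svt8" }   # namespace_id missing
  if k8s["namespace_id"]
    puts "project.#{k8s['namespace_name']}.#{k8s['namespace_id']}.2017.10.16"
  else
    puts "record is missing kubernetes.namespace_id field"
  end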

Comment 57 Jeff Cantrill 2017-10-18 16:32:14 UTC
Closing as a duplicate, as the new issue narrows the focus to the actual problem.

*** This bug has been marked as a duplicate of bug 1500882 ***

