Bug 1500882 - missing messages + [error]: record cannot use elasticsearch index name type project_full: record is missing kubernetes.namespace_id field:
Keywords:
Status: CLOSED DUPLICATE of bug 1493022
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.7.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.7.z
Assignee: Jeff Cantrill
QA Contact: Vikas Laad
URL:
Whiteboard:
Duplicates: 1486493
Depends On:
Blocks:
 
Reported: 2017-10-11 16:31 UTC by Mike Fiedler
Modified: 2021-09-09 12:41 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1506286
Environment:
Last Closed: 2017-12-04 18:44:34 UTC
Target Upstream Version:
Embargoed:


Attachments
logging-mux, logging-fluentd and node logs (3.19 MB, application/x-gzip), 2017-10-11 16:33 UTC, Mike Fiedler
More reasonable master log, starting closer to error in description (3.15 MB, text/plain), 2017-10-11 17:44 UTC, Mike Fiedler


Links
GitHub: fabric8io/fluent-plugin-kubernetes_metadata_filter pull 90 (closed), "WIP - Cache strategies", last updated 2020-08-25 09:27:28 UTC
Red Hat Knowledge Base (Solution) 3211281, last updated 2017-10-25 15:16:09 UTC

Description Mike Fiedler 2017-10-11 16:31:50 UTC
Description of problem:

I believe this is load-related but need to run some variations to confirm. During a run with:

2 logging-mux
4 logging-fluentd nodes
200 pods logging on those 4 nodes at an aggregate rate of 500 1K messages/sec/node, for an aggregate throughput of 1000 messages/second per fluentd, which should be close to its limits based on other testing.

At this rate I am seeing lost messages from the index and many instances of this error in the logging-mux pod logs:

2017-10-11 14:45:22 +0000 [error]: record cannot use elasticsearch index name type project_full: record is missing kubernetes.namespace_id field: {"message"=>"centos-logtest-4jlkl : 20814 : jTYkLKm N76gxm8 9dga01D 1mJ9jYZ YMY8R0p TqlfUHd 0YtllCF xFnQEj7 hCtEjhP PEstpaz 6XQzS12 hXshOM2 6BGNYbD 0BpWoQ5 WLZIdYh KVa9tEJ rOTPlgg TRC7YZH SO1UlI3 Z1sbKTp eXaruLP vsuu2eb OeS00h6 7bnGPp3 jX5x8w4 DPlZUGB vZutgSS VLjmOIU rq4Yycg ER8eu9s tslVre2 kELwT5r JqJjwuI NQIPoSV liT6n9Z AUD5MBV 06yxali K8XvjUP vuOqNju XXrw32I b0dHfTG 3tMMoZC GF9EMet I2uD2mi Fy3QKEO n4wpNz0 74Q5bJR zofX64W wKpNooL 697MHTN zXUaiB4 i8xjhKK A06pMCJ IFRsDGU 10Fxchb nw43KhD 5ifvZjC qOFPc7z uKvxePG rDHqbUN fziqAuw a5YmWN2 QKxuyPb GYATczC OkkzAGg IkGwfpY QVDkdtq xQZHCZh Vfrie53 To0ebDC UWGnJkp FpRHmuv v5Oay5S cM0JwvH I7MZlvj f1epunP SXmlIgC 46ymaxh iYV9X6H ylqyyXV J9i6rEa mJah6aH cjK9vIM JyU5RpZ Thuuc23 39iqRrm vTIRyqz kaDSyEE nQMBZWl goyrbbW HVidJ3J qDovDpS kv9zCzG gDffCmn JERwW9Z Amf5OJB tOthSIy mFIUmPu 7HxG5nT XjdHIHE tOFWsqQ 3fa123s qfBpg6w f2JFOU6 Kwyrlmc LIHp0ic MVbfYpQ V5mwv5J TOY3KnY wYYf3YG oNUOmuD BLHP0FJ v4kay2X F2CS1Dh zTjRnyW 0WM5twi UbJ2VVb 04rNxXe 7ZU3gQe 0HFrFr2 S0fkaLH 11T5igI fXFf8Wm Ol3PAu6 u4P1L1N 04mr0R6 M1YXKRR RZYqg9j \n", "level"=>"info", "hostname"=>"ip-172-31-45-166.us-west-2.compute.internal", "pipeline_metadata"=>{"collector"=>{"ipaddr4"=>"172.21.2.54", "ipaddr6"=>"fe80::858:acff:fe15:236", "inputname"=>"fluent-plugin-systemd", "name"=>"fluentd", "received_at"=>"2017-10-11T14:45:19.412212+00:00", "version"=>"0.12.39 1.6.0"}, "normalizer"=>{"ipaddr4"=>"172.22.0.4", "ipaddr6"=>"fe80::bc29:93ff:fe61:d6f5", "inputname"=>"fluent-plugin-systemd", "name"=>"fluentd", "received_at"=>"2017-10-11T14:45:22.755423+00:00", "version"=>"0.12.39 1.6.0"}}, "@timestamp"=>"2017-10-11T14:45:18.992215+00:00", "CONTAINER_NAME"=>"k8s_centos-logtest.0_centos-logtest-4jlkl_logtest43_0_01234567", "CONTAINER_ID_FULL"=>"9b33cc0ca48e2a3f411e1e4077d5a329b28c610f794eab42078ae58f31372847", "docker"=>{"container_id"=>"9b33cc0ca48e2a3f411e1e4077d5a329b28c610f794eab42078ae58f31372847"}, "kubernetes"=>{"container_name"=>"centos-logtest", "namespace_name"=>"logtest43", "pod_name"=>"centos-logtest-4jlkl", "pod_id"=>"f0fa5131-ae8d-11e7-a158-0211864528ea", "labels"=>{"run"=>"centos-logtest", "test"=>"centos-logtest"}, "host"=>"ip-172-31-45-166.us-west-2.compute.internal", "master_url"=>"https://ec2-54-200-230-39.us-west-2.compute.amazonaws.com:8443"}}

The project exists (this is not a deletion issue) and some kubernetes.* info is in the metadata. Not sure why namespace_id is missing. I'll attach the full pod and node logs for logging-mux. The logging-fluentd logs show no errors, but I'll include a couple of samples.
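
For reference, two quick checks against the collected logs and the cluster (standard oc access assumed; the namespace below is just the example from the error above):

  # how many records hit this error in each mux pod log
  grep -c 'record is missing kubernetes.namespace_id' logging-mux-0.log logging-mux-1.log

  # confirm the project still exists and show its UID, which is the value the
  # kubernetes metadata filter should have put into kubernetes.namespace_id
  # (the UUID segment of the project.<name>.<uuid>.<date> index name)
  oc get namespace logtest43 -o jsonpath='{.metadata.uid}{"\n"}'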

Elasticsearch bulk rejects were 0 at the end of the run. CPU on the logging-mux and Elasticsearch nodes seemed reasonable, but I will collect pbench data to confirm.

Version-Release number of selected component (if applicable):  logging-fluentd v3.7.0-0.146.0.1  <internal brew registry>/openshift3/logging-fluentd:v3.7.0-0.146.0.1


How reproducible:  Unknown - repeating some experiments now.


Steps to Reproduce:
1.  1 master/2 infra/4 computes.  Single ES, 2 logging-mux, 4 logging-fluentd
2.  Logging deployed with the inventory below.
3.  50 pods running on each compute each logging 10 1K messages/second.  See above for aggregate logging rate. 

Actual results:

- 99/200 indices are missing messages; 101 have all expected messages.
- logging-mux logs are full of the message above.
- Example: the project logtest115 index had 43150/60000 expected messages and 758 instances of the above message.



Expected results:

All messages contain the expected metadata and all expected messages are indexed.

Additional info:

[OSEv3:children]                  
masters                           
etcd                              

[masters]                         
ec2-54-200-230-39.us-west-2.compute.amazonaws.com                   

[etcd]                            
ec2-54-200-230-39.us-west-2.compute.amazonaws.com                   
[OSEv3:vars]                      
deployment_type=openshift-enterprise                                

openshift_deployment_type=openshift-enterprise                      
openshift_release=v3.7            
openshift_image_tag=v3.7.0        



openshift_logging_install_logging=true                              
openshift_logging_master_url=https://ec2-54-200-230-39.us-west-2.compute.amazonaws.com:8443                                             
openshift_logging_master_public_url=https://ec2-54-200-230-39.us-west-2.compute.amazonaws.com:8443                                      
openshift_logging_kibana_hostname=kibana.0918-s73.qe.rhcloud.com    
openshift_logging_namespace=logging                                 
openshift_logging_image_prefix=registry.ops.openshift.com/openshift3/                                                                   
openshift_logging_image_version=v3.7                                
openshift_logging_es_cluster_size=1                                 
openshift_logging_es_pvc_dynamic=true                               
openshift_logging_es_pvc_size=50Gi                                  
openshift_logging_fluentd_read_from_head=false                      
openshift_logging_use_mux=true    
openshift_logging_mux_client_mode=maximal                           
openshift_logging_use_ops=false   

openshift_logging_fluentd_cpu_limit=1000m                           
openshift_logging_mux_cpu_limit=1000m                               
openshift_logging_kibana_cpu_limit=200m                             
openshift_logging_kibana_proxy_cpu_limit=100m                       
openshift_logging_es_memory_limit=9Gi 
openshift_logging_es_cpu_limit=2500m
openshift_logging_fluentd_memory_limit=1Gi                          
openshift_logging_mux_memory_limit=2Gi                              
openshift_logging_kibana_memory_limit=1Gi                           
openshift_logging_kibana_proxy_memory_limit=256Mi                   

openshift_logging_mux_file_buffer_storage_type=pvc                  
openshift_logging_mux_file_buffer_pvc_name=logging-mux-pvc          
openshift_logging_mux_file_buffer_pvc_size=30Gi
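
For reference, this inventory is consumed by the openshift-ansible logging playbook; the path below is the usual location for a 3.7 RPM install of openshift-ansible and may differ in other releases:

  ansible-playbook -i <inventory_file> \
      /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/openshift-logging.yml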

Comment 1 Mike Fiedler 2017-10-11 16:33:46 UTC
Created attachment 1337307 [details]
logging-mux, logging-fluentd and node logs

Comment 2 Rich Megginson 2017-10-11 17:06:55 UTC
Can we get the OpenShift server API logs to see if mux is making an API call to get the namespace id?

Comment 4 Mike Fiedler 2017-10-11 17:44:08 UTC
Created attachment 1337349 [details]
More reasonable master log, starting closer to error in description

Comment 5 Rich Megginson 2017-10-11 18:00:09 UTC
Do you see these errors for all namespaces all the time, or do you see it only for this one specific namespace, logtest43?
Do you see this error for all records from logtest43?

Comment 6 Mike Fiedler 2017-10-11 18:27:19 UTC
It is more than just that specific namespace; that was just an example.

grep -oE "logtest[0-9]+" logging-mux-*.log | sort -u                                                         
logging-mux-0.log:logtest115      
logging-mux-0.log:logtest121      
logging-mux-0.log:logtest167      
logging-mux-0.log:logtest25       
logging-mux-0.log:logtest65       
logging-mux-0.log:logtest79       
logging-mux-1.log:logtest126      
logging-mux-1.log:logtest137      
logging-mux-1.log:logtest166      
logging-mux-1.log:logtest22       
logging-mux-1.log:logtest38       
logging-mux-1.log:logtest42       
logging-mux-1.log:logtest43 

So the errors occur for the above namespaces, and all of those namespaces also have documents that were successfully indexed in addition to the errors in the log.

I have a new run going now and am not hitting it. I will see if I can come up with a reliable reproducer.

Comment 7 Jeff Cantrill 2017-10-18 16:32:14 UTC
*** Bug 1486493 has been marked as a duplicate of this bug. ***

Comment 8 Jeff Cantrill 2017-10-18 16:34:12 UTC
Targeting 3.8 since:

* We would like to move these records to the orphan index.
* The container continues to run but emits a verbose log message.

Comment 9 Mike Fiedler 2017-10-18 16:42:41 UTC
Adding TestBlocker and OpsBlocker tags from duplicated bug 1486493

Comment 10 N. Harrison Ripps 2017-10-25 18:53:55 UTC
After you set the target version to 3.8.0, this bug was flagged as a test blocker on 3.7 due to the record change made in comment 9. Can you please review this with QE and decide whether or not it is actually a 3.7.0 test blocker? If it is, it will have to be fixed now rather than for 3.8.0.

Comment 13 Vikas Laad 2017-11-20 20:21:31 UTC
@Jeff

I just tried to reproduce this issue; I still see the following errors in the mux pod logs:

2017-11-20 20:19:21 +0000 [error]: record cannot use elasticsearch index name type project_full: record is missing kubernetes.namespace_id field: {"message"=>"centos-logtest-z2qnd : 5706 : RIidFlO9F Bl6XFK0jU H4rTTk3T5 xX9Ak3lqs IZJ2Crqts oAuC4mLza oF9vFtmE4 079CRDhjx XR9PMd2mc p5oyddT2I GJj5njbPJ UyCfsvkIt E3LYhVXU7 ZaaC19oKn 2Lku9HW0p p9QvSp6A5 7eTr2hoBA FAzjs8JCw cbLah6xKX wGdMWCJoY WKncnO1Go iq46KWrkY 3ZcoVcSo9 H6GOfqaDe 7FC8xOLaZ SQbhjEBGw UKUcWp3Bd cZDSFMnvk 8vd8txV6c MrpcAd3AS UNzh5aYGe 7e88gPF9D KZsRcl019 KqzzxthVE udkgaMzWD bGc8AS0XG j08dBk6un IDWlrg4ng CcbmUkriK VmKWPMICs b4bNLSBY5 mvGaY26u5 AmjroOBeu wzJSv3JLC Fh86OFeon hXh2xR0oV a6HZrjWpH pu5qRi47t qCvWlhJIB IlVriEol8 Mv2paZNuK bbGV0VfMB 731y3ntqQ O7knRUKpy JBMyeBfKY bokBBevrh E9Pf5fBVb czjItXHBX xst5tHKmD LmXGVy6OF j7EZ51QK2 9FniVcymM MV041ZEho ukZmh91p9 JJsHbf7jH 64Sjmsfky SZFwPdsyI W3gl4Um38 bWFoODbpv bc2VGGFyC CJmY9b9ct hX4SD9Oeg anHVioxlb 3BCZzwpDy t66EkgI0k j6pc7XPvU IIoXgxYN0 Smg7cEraN 9M7aYzcNY 17BwxAZ55 IyiGTRoHV huXcHk9Gc BDxOYpfwf Fd9KEAxLA VGxgslnk3 h6fNrE7ge hbdWL9vrd 4kwS6Zzxv iEXMzM3eP HnRFFUJjF T3cXxUciW KVAU8crtP 5sb5Ty2KI ViU1YYBZR uv2W5udUI Lwu3oer9o ptcQgEfED XXdhClBJE oVcby7xhL vMM2CSGy8 bjHXbTh34 Ox67mYmsF GXB2\n", "level"=>"info", "hostname"=>"ip-172-31-58-107.us-west-2.compute.internal", "pipeline_metadata"=>{"collector"=>{"ipaddr4"=>"172.22.2.107", "ipaddr6"=>"fe80::f403:4eff:fed0:1ccc", "inputname"=>"fluent-plugin-systemd", "name"=>"fluentd", "received_at"=>"2017-11-20T20:11:01.998263+00:00", "version"=>"0.12.39 1.6.0"}, "normalizer"=>{"ipaddr4"=>"172.22.2.108", "ipaddr6"=>"fe80::900c:57ff:feed:6a35", "inputname"=>"fluent-plugin-systemd", "name"=>"fluentd", "received_at"=>"2017-11-20T20:19:21.476078+00:00", "version"=>"0.12.39 1.6.0"}}, "@timestamp"=>"2017-11-20T20:11:01.179703+00:00", "CONTAINER_NAME"=>"k8s_centos-logtest.0_centos-logtest-z2qnd_svt-logging-6_0_01234567", "CONTAINER_ID_FULL"=>"d74a3adb2bd3b93baebf4ace6fec380cc22bfba5ef3af82ecfc012f199097f2b", "docker"=>{"container_id"=>"d74a3adb2bd3b93baebf4ace6fec380cc22bfba5ef3af82ecfc012f199097f2b"}, "kubernetes"=>{"container_name"=>"centos-logtest", "namespace_name"=>"svt-logging-6", "pod_name"=>"centos-logtest-z2qnd"}}

Comment 15 Vikas Laad 2017-11-20 20:48:04 UTC
For every error I see in the mux logs in comment #13, I should see an entry under .orphaned.

When I list the indices, I don't see an index named .orphaned.

Please let me know if I am missing something.
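
One way to check (the component=es pod label and the cert paths are assumptions based on the standard logging deployment; adjust as needed):

  es_pod=$(oc -n logging get pods -l component=es -o jsonpath='{.items[0].metadata.name}')
  oc -n logging exec $es_pod -- curl -s \
      --cacert /etc/elasticsearch/secret/admin-ca \
      --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      'https://localhost:9200/_cat/indices?v' | grep -i orphaned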

Comment 16 Rich Megginson 2017-11-20 20:54:28 UTC
(In reply to Vikas Laad from comment #15)
> For every error I see in the mux logs in comment #13, I should see an entry
> under .orphaned.
> 
> When I list the indices, I don't see an index named .orphaned.
> 
> Please let me know if I am missing something.

What indices do you see?

Comment 17 Vikas Laad 2017-11-20 20:56:49 UTC
health status index                                                                  pri rep docs.count docs.deleted store.size pri.store.size 
green  open   project.svt-logging-26.9bd040cc-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       8560            0      4.7mb          4.7mb 
green  open   project.svt-logging-8.93a596d2-ce2e-11e7-84a3-02696c11b946.2017.11.20    1   0       2600            0      1.5mb          1.5mb 
green  open   project.svt-logging-5.93880e8c-ce2e-11e7-84a3-02696c11b946.2017.11.20    1   0       7680            0      4.4mb          4.4mb 
green  open   project.svt-logging-3.93c1a49b-ce2e-11e7-84a3-02696c11b946.2017.11.20    1   0       5100            0      3.2mb          3.2mb 
green  open   project.svt-logging-21.9c2b871e-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0      21020            0     11.2mb         11.2mb 
green  open   project.svt-logging-20.9c3f3a59-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0      14460            0      7.8mb          7.8mb 
green  open   project.svt-logging-40.a4265595-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       7540            0      4.1mb          4.1mb 
green  open   project.svt-logging-27.9c163a27-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0      12980            0        7mb            7mb 
green  open   project.svt-logging-46.a41e66b8-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       6440            0      3.6mb          3.6mb 
green  open   project.svt-logging-36.a034de43-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0      14240            0        8mb            8mb 
green  open   project.svt-logging-47.a4468d73-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0      13960            0      7.6mb          7.6mb 
green  open   project.svt-logging-13.97ddb938-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       8380            0      5.1mb          5.1mb 
green  open   project.svt-logging-31.a027e528-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       6580            0      3.6mb          3.6mb 
green  open   project.svt-logging-42.a431235a-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0      14000            0      7.7mb          7.7mb 
green  open   project.svt-logging-48.a45d2ae8-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0      22800            0     12.7mb         12.7mb 
green  open   project.svt-logging-22.9c141c74-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       8020            0      4.8mb          4.8mb 
green  open   project.svt-logging-11.981e26e8-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       8780            0      4.9mb          4.9mb 
green  open   project.svt-logging-44.a3f02150-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       6560            0      3.6mb          3.6mb 
green  open   project.svt-logging-14.97925ff9-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       7187            0        4mb            4mb 
green  open   project.svt-logging-9.93e10189-ce2e-11e7-84a3-02696c11b946.2017.11.20    1   0      23900            0       13mb           13mb 
green  open   project.svt-logging-38.a0315d53-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       6660            0      3.6mb          3.6mb 
green  open   project.svt-logging-35.a031a758-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       6680            0      3.8mb          3.8mb 
green  open   project.svt-logging-41.a41303e4-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       7620            0      4.1mb          4.1mb 
green  open   .operations.2017.11.20                                                   1   0     150513            0     65.5mb         65.5mb 
green  open   .searchguard.logging-es-data-master-wx2itgxl                             1   0          5            0     33.5kb         33.5kb 
green  open   project.svt-logging-19.97ad5e69-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0      23440            0     12.6mb         12.6mb 
green  open   project.svt-logging-0.9392607e-ce2e-11e7-84a3-02696c11b946.2017.11.20    1   0      21240            0     11.9mb         11.9mb 
green  open   project.svt-logging-18.9836c17c-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       7689            0      4.2mb          4.2mb 
green  open   project.svt-logging-12.97ffe622-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       7057            0      3.9mb          3.9mb 
green  open   project.svt-logging-15.981a8c32-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0      23380            0     12.8mb         12.8mb 
green  open   project.svt-logging-32.a007c371-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0      14320            0      7.8mb          7.8mb 
green  open   project.svt-logging-2.93bb49ee-ce2e-11e7-84a3-02696c11b946.2017.11.20    1   0       5040            0      3.2mb          3.2mb 
green  open   project.svt-logging-34.9ff349af-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       7920            0      4.3mb          4.3mb 
green  open   project.svt-logging-24.9c3c4be6-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0      14520            0      7.8mb          7.8mb 
green  open   project.svt-logging-28.9be45ff7-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       8140            0      4.5mb          4.5mb 
green  open   project.svt-logging-2.c65871fe-ce2d-11e7-84a3-02696c11b946.2017.11.20    1   0       2380            0      1.4mb          1.4mb 
green  open   project.svt-logging-25.9c4ab691-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0      21151            0     11.9mb         11.9mb 
green  open   project.svt-logging-33.a026bb4c-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       6540            0      3.9mb          3.9mb 
green  open   .operations.2017.11.18                                                   1   0      84222            0     35.4mb         35.4mb 
green  open   project.svt-logging-16.9842eff2-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       4900            0      2.6mb          2.6mb 
green  open   .kibana                                                                  1   0          1            0      3.1kb          3.1kb 
green  open   project.svt-logging-1.c66db2f1-ce2d-11e7-84a3-02696c11b946.2017.11.20    1   0       2380            0      1.4mb          1.4mb 
green  open   project.logging.9e130cf4-cbb1-11e7-8bab-02696c11b946.2017.11.20          1   0     121761            0    128.4mb        128.4mb 
green  open   project.svt-logging-30.a031b7dd-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0      22960            0     12.2mb         12.2mb 
green  open   project.svt-logging-45.a45e4e2f-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       7520            0      4.1mb          4.1mb 
green  open   project.svt-logging-39.9ffdfff6-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0      23080            0     12.4mb         12.4mb 
green  open   project.svt-logging-1.93b8e72e-ce2e-11e7-84a3-02696c11b946.2017.11.20    1   0       5000            0      3.2mb          3.2mb 
green  open   project.svt-logging-43.a4656995-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0      22720            0     12.7mb         12.7mb 
green  open   project.svt-logging-3.c64123c7-ce2d-11e7-84a3-02696c11b946.2017.11.20    1   0       2420            0      1.5mb          1.5mb 
green  open   project.svt-logging-29.9c030417-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       8080            0      4.9mb          4.9mb 
green  open   project.svt-logging-10.982275fd-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0      23340            0     12.4mb         12.4mb 
green  open   project.svt-logging-7.93e1e5e6-ce2e-11e7-84a3-02696c11b946.2017.11.20    1   0       9060            0      4.9mb          4.9mb 
green  open   project.svt-logging-37.a0376888-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       1780            0        1mb            1mb 
green  open   project.svt-logging-4.c64ef39d-ce2d-11e7-84a3-02696c11b946.2017.11.20    1   0       2560            0      1.5mb          1.5mb 
green  open   project.svt-logging-6.93d9e344-ce2e-11e7-84a3-02696c11b946.2017.11.20    1   0       2520            0      1.5mb          1.5mb 
green  open   project.svt-logging-49.a4667054-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       6440            0      3.5mb          3.5mb 
green  open   fluentd                                                                  1   0     137640            0     58.5mb         58.5mb 
green  open   project.svt-logging-17.97d2d896-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0       8820            0      4.8mb          4.8mb 
green  open   project.svt-logging-0.c66815ab-ce2d-11e7-84a3-02696c11b946.2017.11.20    1   0       2460            0      1.4mb          1.4mb 
green  open   project.svt-logging-23.9beebcfd-ce2e-11e7-84a3-02696c11b946.2017.11.20   1   0      12100            0      6.9mb          6.9mb

Comment 18 Rich Megginson 2017-11-20 21:38:36 UTC
green  open   fluentd                                                                  1   0     137640            0     58.5mb         58.5mb 

I think this is where the records are going.
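
One way to confirm would be to count documents in that index for one of the affected namespaces (same assumed exec/cert setup as the index listing above; the field name comes from the record in comment 13):

  oc -n logging exec $es_pod -- curl -s \
      --cacert /etc/elasticsearch/secret/admin-ca \
      --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      'https://localhost:9200/fluentd/_count?q=kubernetes.namespace_name:svt-logging-6&pretty'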

Comment 19 Vikas Laad 2017-11-20 22:04:01 UTC
OK, I will look for the missing messages under this index and will update the BZ.

Comment 30 Rich Megginson 2017-11-21 21:54:29 UTC
Note: do not try to use <filter **> @type stdout </filter> unless fluentd is running with DEBUG=true; it will create a very bad feedback loop and kill fluentd.
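
If DEBUG is needed, setting it on the mux deployment before adding such a filter should be enough (logging-mux is assumed here as the mux DeploymentConfig name, matching this deployment):

  oc -n logging set env dc/logging-mux DEBUG=true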

Comment 31 Vikas Laad 2017-11-27 16:29:16 UTC
I am seeing this error in the fluentd pods; I will post more details when I enable logging.

2017-11-27 16:16:27 +0000 [warn]: disconnected host="logging-mux" port=24284
2017-11-27 16:16:30 +0000 [warn]: dead connection found: logging-mux, reconnecting...
2017-11-27 16:16:31 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-11-27 16:16:32 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" plugin_id="object:e0cc0c"
  2017-11-27 16:16:31 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/out_secure_forward.rb:259:in `write_objects'
  2017-11-27 16:16:31 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:490:in `write'
  2017-11-27 16:16:31 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:354:in `write_chunk'
  2017-11-27 16:16:31 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:333:in `pop'
  2017-11-27 16:16:31 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:342:in `try_flush'
  2017-11-27 16:16:31 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:149:in `run'
2017-11-27 16:16:32 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-11-27 16:16:34 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" plugin_id="object:e0cc0c"
  2017-11-27 16:16:32 +0000 [warn]: suppressed same stacktrace
2017-11-27 16:16:34 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-11-27 16:16:38 +0000 error_class="RuntimeError" error="no one nodes with valid ssl session" plugin_id="object:e0cc0c"
  2017-11-27 16:16:34 +0000 [warn]: suppressed same stacktrace
2017-11-27 16:16:35 +0000 [warn]: recovered connection to dead node: logging-mux
2017-11-27 16:16:38 +0000 [warn]: retry succeeded. plugin_id="object:e0cc0c"

Comment 32 Rich Megginson 2017-11-27 16:40:51 UTC
(In reply to Vikas Laad from comment #31)
> I am seeing this error in the fluentd pods; I will post more details when I
> enable logging.
> 
> 2017-11-27 16:16:27 +0000 [warn]: disconnected host="logging-mux" port=24284
> 2017-11-27 16:16:30 +0000 [warn]: dead connection found: logging-mux,
> reconnecting...

What is happening on the mux pod at this time?

> 2017-11-27 16:16:31 +0000 [warn]: temporarily failed to flush the buffer.
> next_retry=2017-11-27 16:16:32 +0000 error_class="RuntimeError" error="no
> one nodes with valid ssl session" plugin_id="object:e0cc0c"
>   2017-11-27 16:16:31 +0000 [warn]:
> /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/
> out_secure_forward.rb:259:in `write_objects'
>   2017-11-27 16:16:31 +0000 [warn]:
> /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:490:in `write'
>   2017-11-27 16:16:31 +0000 [warn]:
> /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:354:in
> `write_chunk'
>   2017-11-27 16:16:31 +0000 [warn]:
> /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:333:in `pop'
>   2017-11-27 16:16:31 +0000 [warn]:
> /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:342:in `try_flush'
>   2017-11-27 16:16:31 +0000 [warn]:
> /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:149:in `run'
> 2017-11-27 16:16:32 +0000 [warn]: temporarily failed to flush the buffer.
> next_retry=2017-11-27 16:16:34 +0000 error_class="RuntimeError" error="no
> one nodes with valid ssl session" plugin_id="object:e0cc0c"
>   2017-11-27 16:16:32 +0000 [warn]: suppressed same stacktrace
> 2017-11-27 16:16:34 +0000 [warn]: temporarily failed to flush the buffer.
> next_retry=2017-11-27 16:16:38 +0000 error_class="RuntimeError" error="no
> one nodes with valid ssl session" plugin_id="object:e0cc0c"
>   2017-11-27 16:16:34 +0000 [warn]: suppressed same stacktrace
> 2017-11-27 16:16:35 +0000 [warn]: recovered connection to dead node:
> logging-mux
> 2017-11-27 16:16:38 +0000 [warn]: retry succeeded. plugin_id="object:e0cc0c"

Comment 33 Vikas Laad 2017-11-27 16:46:19 UTC
I see this error in the mux pod:

2017-11-27 15:39:40 +0000 [info]: reading config file path="/etc/fluent/fluent.conf"
2017-11-27 15:39:41 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-11-27 16:16:11 +0000 [warn]: emit transaction failed: error_class=Errno::ENOMEM error="Cannot allocate memory - /var/lib/fluentd/buffer-output-es-config.output_tag.b55ef9387c25e4c5f.log" tag="output_tag"
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/plugin/buf_file.rb:39:in `write'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/plugin/buf_file.rb:39:in `<<'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:200:in `block in emit'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/ruby/monitor.rb:211:in `mon_synchronize'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:193:in `emit'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:467:in `emit'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:42:in `next'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/plugin/out_copy.rb:78:in `emit'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:90:in `emit_stream'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:81:in `emit'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-rewrite-tag-filter-1.5.6/lib/fluent/plugin/out_rewrite_tag_filter.rb:75:in `block in emit'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:130:in `call'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:130:in `block in each'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:129:in `each'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:129:in `each'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-rewrite-tag-filter-1.5.6/lib/fluent/plugin/out_rewrite_tag_filter.rb:72:in `emit'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:154:in `emit'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:90:in `emit_stream'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:81:in `emit'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-rewrite-tag-filter-1.5.6/lib/fluent/plugin/out_rewrite_tag_filter.rb:75:in `block in emit'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:130:in `call'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:130:in `block in each'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:129:in `each'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event.rb:129:in `each'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-rewrite-tag-filter-1.5.6/lib/fluent/plugin/out_rewrite_tag_filter.rb:72:in `emit'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:154:in `emit'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/event_router.rb:90:in `emit_stream'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/in_secure_forward.rb:293:in `on_message'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/input_session.rb:124:in `on_read'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/input_session.rb:185:in `feed_each'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/input_session.rb:185:in `block in start'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/input_session.rb:178:in `loop'
  2017-11-27 16:16:11 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-secure-forward-0.4.5/lib/fluent/plugin/input_session.rb:178:in `start'
2017-11-27 16:16:11 +0000 [warn]: emit transaction failed: error_class=Errno::ENOMEM error="Cannot allocate memory - /var/lib/fluentd/buffer-output-es-config.output_tag.b55ef9387c25e4c5f.log" tag="kubernetes.var.log.containers.centos-logtest-8n86c_svt-logging-5_centos-logtest-ea00414d1242444633cdcc52e2b3465acf6738c2c22553015bf00e7de765ad7e.log.mux"
  2017-11-27 16:16:11 +0000 [warn]: suppressed same stacktrace
2017-11-27 16:16:11 +0000 [warn]: emit transaction failed: error_class=Errno::ENOMEM error="Cannot allocate memory - /var/lib/fluentd/buffer-output-es-config.output_tag.b55ef9387c25e4c5f.log" tag="kubernetes.var.log.containers.centos-logtest-8n86c_svt-logging-5_centos-logtest-ea00414d1242444633cdcc52e2b3465acf6738c2c22553015bf00e7de765ad7e.log"
  2017-11-27 16:16:11 +0000 [warn]: suppressed same stacktrace
2017-11-27 16:16:11 +0000 [warn]: unexpected error in in_secure_forward from 172.21.3.13:53076 error_class=Errno::ENOMEM error=#<Errno::ENOMEM: Cannot allocate memory - /var/lib/fluentd/buffer-output-es-config.output_tag.b55ef9387c25e4c5f.log>
2017-11-27 16:16:14 +0000 [warn]: Elasticsearch rejected 1 documents due to invalid field arguments
2017-11-27 16:16:27 +0000 [error]: fluentd main process died unexpectedly. restarting.
2017-11-27 16:16:28 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not

Comment 34 Rich Megginson 2017-11-27 17:10:43 UTC
OK, so mux is hitting an OOM kill and restarting, and fluentd is just detecting that. So this particular issue will be resolved if/when we resolve the mux OOM issue.
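
To confirm on the cluster, something like the following should show it (the component=mux label is an assumption from the standard logging deployment; if only the fluentd worker process died, the restart will show up in the mux log rather than in the pod's Last State):

  oc -n logging get pods -l component=mux
  oc -n logging describe pod <mux-pod> | grep -A 5 'Last State'
  oc -n logging get events --sort-by=.lastTimestamp | grep -iE 'oom|mux'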

Comment 35 Jeff Cantrill 2017-12-04 18:44:34 UTC

*** This bug has been marked as a duplicate of bug 1493022 ***

