Bug 1578504 - fluentd pods fill up with logs saying primary shard is not active Timeout: [1m]
Summary: fluentd pods fill up with logs saying primary shard is not active Timeout: [1m]
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.6.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.6.z
Assignee: Jeff Cantrill
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-15 17:45 UTC by Steven Walter
Modified: 2021-09-09 14:04 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-25 17:29:57 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Knowledge Base (Solution) 3444981 (last updated 2018-05-15 17:59:35 UTC)

Description Steven Walter 2018-05-15 17:45:42 UTC
Description of problem:
Fluentd pod logs fill up over time with repeated messages. Fluentd pods either crash and restart, or have to be restarted manually at which point they resume functioning for a while. Fluentd memory usage becomes very high.

Version-Release number of selected component (if applicable):
3.6, 3.7 (verifying exact versions soon)

How reproducible:
Unconfirmed

Actual results:


\"error\"=>{\"type\"=>\"unavailable_shards_exception\", \"reason\"=>\"[.operations.2018.04.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [.operations.2018.04.25] containing [3562] requests]\"}}}, {\"create\"=>{\"_index\"=>\"project.test-prod.18a7d719-48a4-11e7-9d36-1293b0956c5c.2018.04.19\", \"_type\"=>\"com.redhat.viaq.common\", \"_id\"=>\"AWNgFjm-zg6tB7-fHPSL\", \"_version\"=>1, \"_shards\"=>{\"total\"=>1, \"successful\"=>1, \"failed\"=>0}, \"status\"=>201}}, {\"create\"=>{\"_index\"=>\".operations.2018.04.25\", \"_type\"=>\"com.redhat.viaq.common\", \"_id\"=>\"AWNgFjm-zg6tB7-fHPSM\", \"status\"=>503, \"error\"=>{\"type\"=>\"unavailable_shards_exception\", \"reason\"=>\"[.operations.2018.04.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [.operations.2018.04.25] containing [3562] requests]\"}}}, {\"create\"=>{\"_index\"=>\"project.test-prod.18a7d719-48a4-11e7-9d36-1293b0956c5c.2018.04.19\", \"_type\"=>\"com.redhat.viaq.common\", \"_id\"=>\"AWNgFjm-zg6tB7-fHPSN\", \"_version\"=>1, \"_shards\"=>{\"total\"=>1, \"successful\"=>1, \"failed\"=>0}, \"status\"=>201}}, {\"create\"=>{\"_index\"=>\".operations.2018.04.25\", \"_type\"=>\"com.redhat.viaq.common\", \"_id\"=>\"AWNgFjm-zg6tB7-fHPSO\", \"status\"=>503, \"error\"=>{\"type\"=>\"unavailable_shards_exception\", \"reason\"=>\"[.operations.2018.04.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [.operations.2018.04.25] containing [3562] requests]\"}}}, . . .

Expected results:
If a shard is not available, fluentd should not repeatedly log the same error.

Additional info:
Possibly related, but fixed only in 3.8/3.9+: https://bugzilla.redhat.com/show_bug.cgi?id=1534419

Comment 5 Peter Portante 2018-05-15 20:41:35 UTC
Here is the original error message posted in the description "pretty printed" to make it easier to review:

        \"error\"=>{
            \"type\"=>\"unavailable_shards_exception\",
            \"reason\"=>\"[.operations.2018.04.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [.operations.2018.04.25] containing [3562] requests]\"
            }
    }
}, {
    \"create\"=>{
        \"_index\"=>\"project.test-prod.18a7d719-48a4-11e7-9d36-1293b0956c5c.2018.04.19\",
        \"_type\"=>\"com.redhat.viaq.common\",
        \"_id\"=>\"AWNgFjm-zg6tB7-fHPSL\",
        \"_version\"=>1,
        \"_shards\"=>{
            \"total\"=>1,
            \"successful\"=>1,
            \"failed\"=>0
            },
        \"status\"=>201
        }
}, {
    \"create\"=>{
        \"_index\"=>\".operations.2018.04.25\",
        \"_type\"=>\"com.redhat.viaq.common\",
        \"_id\"=>\"AWNgFjm-zg6tB7-fHPSM\",
        \"status\"=>503,
        \"error\"=>{
            \"type\"=>\"unavailable_shards_exception\",
            \"reason\"=>\"[.operations.2018.04.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [.operations.2018.04.25] containing [3562] requests]\"
            }
        }
}, {
    \"create\"=>{
        \"_index\"=>\"project.test-prod.18a7d719-48a4-11e7-9d36-1293b0956c5c.2018.04.19\",
        \"_type\"=>\"com.redhat.viaq.common\",
        \"_id\"=>\"AWNgFjm-zg6tB7-fHPSN\",
        \"_version\"=>1,
        \"_shards\"=>{
            \"total\"=>1,
            \"successful\"=>1,
            \"failed\"=>0},
            \"status\"=>201
            }
}, {
    \"create\"=>{
        \"_index\"=>\".operations.2018.04.25\",
        \"_type\"=>\"com.redhat.viaq.common\",
        \"_id\"=>\"AWNgFjm-zg6tB7-fHPSO\",
        \"status\"=>503,
        \"error\"=>{
            \"type\"=>\"unavailable_shards_exception\",
            \"reason\"=>\"[.operations.2018.04.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [.operations.2018.04.25] containing [3562] requests]\"
            }
        }
}, . . .

Comment 7 Peter Portante 2018-05-15 21:36:16 UTC
The error indicates that logs are being sent to the index ".operations.2018.04.25". It appears that index is not open, or has some other problem, as it does not show up in the list of indices (see the contents of the attached logging dump).

However, the project.test-prod.* index does not show up in that list either, which raises the question of whether the reported error message matches the logging dump data.

Can somebody provide a list of all the indices[1] and all the shards[2] that match the posted error above?

[1] https://gist.github.com/portante/5395b05e9b996adfe60c12e8261774de
[2] https://gist.github.com/portante/cadd4b4b22b89f96645f021fe7d972e4
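
(For reference, a sketch of the kind of queries that would produce those lists, assuming the es_util helper in the logging image and $POD set to one of the logging-es pods; the exact invocation varies by image version, as noted in comment 14.)

    oc exec -c elasticsearch $POD -- es_util --query="_cat/indices?v"
    oc exec -c elasticsearch $POD -- es_util --query="_cat/shards?v"

Both _cat endpoints are read-only and safe to run against a live cluster.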

Comment 8 Peter Portante 2018-05-15 22:07:04 UTC
A Google search of that error message, "primary shard is not active", found: https://stackoverflow.com/questions/27547091/primary-shard-is-not-active-or-isnt-assigned-is-a-known-node
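
(That thread points at the disk-based shard allocation watermarks as a common cause. A sketch of how to check whether the watermark settings have been overridden, assuming the same es_util helper and $POD as above; the Elasticsearch defaults are cluster.routing.allocation.disk.watermark.low=85% and cluster.routing.allocation.disk.watermark.high=90%.)

    oc exec -c elasticsearch $POD -- es_util --query="_cluster/settings?pretty"

An empty persistent/transient section in the response means the defaults are in effect.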

Comment 11 Peter Portante 2018-05-17 15:29:55 UTC
How much disk space is available?

Comment 12 Steven Walter 2018-05-17 16:19:01 UTC
The /elasticsearch/persistent filesystem is only at 74% usage.

sh-4.2$ df -h                                                                                                                                                                 
Filesystem                                                                                         Size  Used Avail Use% Mounted on                                           
/dev/mapper/docker-253:7-4336916-f0806b7a3d0d1a7b99b1627e5790c2a2821183dc138397716814f0cc2b476f50   10G  474M  9.6G   5% /                                                    
tmpfs                                                                                               16G     0   16G   0% /dev                                                 
tmpfs                                                                                               16G     0   16G   0% /sys/fs/cgroup                                       
/dev/loop0                                                                                         299G  220G   80G  74% /elasticsearch/persistent                            
/dev/mapper/VolGroup00-LogVol02                                                                     30G  1.8G   29G   6% /etc/hosts                                           
shm                                                                                                 64M     0   64M   0% /dev/shm                                             
tmpfs                                                                                               16G   32K   16G   1% /etc/elasticsearch/secret                            
tmpfs                                                                                               16G   24K   16G   1% /run/secrets/kubernetes.io/serviceaccount
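
(A possible cross-check of how Elasticsearch itself sees the disk, again assuming the es_util helper is available: _cat/allocation reports disk used and available per data node, alongside the number of shards assigned to it.)

    oc exec -c elasticsearch $POD -- es_util --query="_cat/allocation?v"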

Comment 13 Steven Walter 2018-05-17 16:21:58 UTC
(Noting that 74% appears to be below the default disk-usage threshold at which Elasticsearch stops assigning new shards.)

I wonder if there was an issue removing the oldest shard:

.searchguard.logging-es-ops-data-master-m5muqedr 0     p      STARTED          5  38.3kb 10.1.2.14 logging-es-ops-data-master-m5muqedr 
.operations.2018.05.03                           0     p      STARTED    1668352     1gb 10.1.2.14 logging-es-ops-data-master-m5muqedr 
.operations.2018.05.02                           0     p      STARTED    1612317     1gb 10.1.2.14 logging-es-ops-data-master-m5muqedr 
.operations.2018.05.01                           0     p      STARTED     315882 205.5mb 10.1.2.14 logging-es-ops-data-master-m5muqedr 
.kibana                                          0     p      STARTED          1   3.1kb 10.1.2.14 logging-es-ops-data-master-m5muqedr 
.kibana.70dff5faf497aad5ee662c3ea35bd5c0fe0367ab 0     p      STARTED          3  32.7kb 10.1.2.14 logging-es-ops-data-master-m5muqedr 
.operations.2018.04.19                           0     p      UNASSIGNED                                                               
.operations.2018.05.06                           0     p      UNASSIGNED                                                               
.operations.2018.05.05                           0     p      STARTED    4768370   3.3gb 10.1.2.14 logging-es-ops-data-master-m5muqedr 
.kibana.77c40e5d3e7aea435aa73891b206dcd3731ad21e 0     p      STARTED          3  32.7kb 10.1.2.14 logging-es-ops-data-master-m5muqedr 
.operations.2018.05.04                           0     p      STARTED    2308948   1.5gb 10.1.2.14 logging-es-ops-data-master-m5muqedr 

That doesn't explain why the newest index (.operations.2018.05.06) is also unassigned, though.

Comment 14 Jeff Cantrill 2018-05-17 17:06:49 UTC
We could review why they are unassigned. Depending on which version of 3.6 is installed, run either:

oc exec -c elasticsearch $POD -- es_util --query="_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED

or

oc exec -c elasticsearch $POD -- env QUERY="_cat/shards?h=index,shard,prirep,state,unassigned.reason" es_util | grep UNASSIGNED
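
A cluster-level summary can also confirm how many shards are unassigned overall (same assumptions about es_util and $POD as above):

    oc exec -c elasticsearch $POD -- es_util --query="_cluster/health?pretty"

The unassigned_shards count and the cluster status (green/yellow/red) should line up with the _cat/shards output.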

Comment 15 Peter Portante 2018-05-18 05:31:01 UTC
I think we need to see all the available disk space across all ES PVs, as well.

Comment 17 Peter Portante 2018-05-25 16:55:27 UTC
Comment 16 reports that they are not using PVs for Elasticsearch.

If that is the case, Elasticsearch state is evaporating with the pod when it goes away.  There is almost no way to make this work without persistent state.
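
(For reference, a minimal sketch of attaching persistent storage to an existing Elasticsearch deployment; the deploymentconfig and claim names are examples and would need to match the actual logging-es DC and a pre-created PVC.)

    oc set volume dc/logging-es-<unique-name> --add --overwrite \
        --name=elasticsearch-storage --type=persistentVolumeClaim \
        --claim-name=logging-es-1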

