Description of problem:
Fluentd pod logs fill up over time with repeated messages. Fluentd pods either crash and restart, or have to be restarted manually, at which point they resume functioning for a while. Fluentd memory usage becomes very high.

Version-Release number of selected component (if applicable):
3.6, 3.7 (verifying exact versions soon)

How reproducible:
Unconfirmed

Actual results:
"error"=>{"type"=>"unavailable_shards_exception", "reason"=>"[.operations.2018.04.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [.operations.2018.04.25] containing [3562] requests]"}}}, {"create"=>{"_index"=>"project.test-prod.18a7d719-48a4-11e7-9d36-1293b0956c5c.2018.04.19", "_type"=>"com.redhat.viaq.common", "_id"=>"AWNgFjm-zg6tB7-fHPSL", "_version"=>1, "_shards"=>{"total"=>1, "successful"=>1, "failed"=>0}, "status"=>201}}, {"create"=>{"_index"=>".operations.2018.04.25", "_type"=>"com.redhat.viaq.common", "_id"=>"AWNgFjm-zg6tB7-fHPSM", "status"=>503, "error"=>{"type"=>"unavailable_shards_exception", "reason"=>"[.operations.2018.04.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [.operations.2018.04.25] containing [3562] requests]"}}}, {"create"=>{"_index"=>"project.test-prod.18a7d719-48a4-11e7-9d36-1293b0956c5c.2018.04.19", "_type"=>"com.redhat.viaq.common", "_id"=>"AWNgFjm-zg6tB7-fHPSN", "_version"=>1, "_shards"=>{"total"=>1, "successful"=>1, "failed"=>0}, "status"=>201}}, {"create"=>{"_index"=>".operations.2018.04.25", "_type"=>"com.redhat.viaq.common", "_id"=>"AWNgFjm-zg6tB7-fHPSO", "status"=>503, "error"=>{"type"=>"unavailable_shards_exception", "reason"=>"[.operations.2018.04.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [.operations.2018.04.25] containing [3562] requests]"}}}, . . .

Expected results:
If a shard is not available, Fluentd should not log this error over and over; the repetition fills the pod logs and drives memory usage up.

Additional info:
Possibly related, but fixed only in 3.8/3.9+: https://bugzilla.redhat.com/show_bug.cgi?id=1534419
Here is the original error message posted in the description, "pretty printed" to make it easier to review:

"error"=>{
  "type"=>"unavailable_shards_exception",
  "reason"=>"[.operations.2018.04.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [.operations.2018.04.25] containing [3562] requests]"
}}},
{"create"=>{
  "_index"=>"project.test-prod.18a7d719-48a4-11e7-9d36-1293b0956c5c.2018.04.19",
  "_type"=>"com.redhat.viaq.common",
  "_id"=>"AWNgFjm-zg6tB7-fHPSL",
  "_version"=>1,
  "_shards"=>{"total"=>1, "successful"=>1, "failed"=>0},
  "status"=>201
}},
{"create"=>{
  "_index"=>".operations.2018.04.25",
  "_type"=>"com.redhat.viaq.common",
  "_id"=>"AWNgFjm-zg6tB7-fHPSM",
  "status"=>503,
  "error"=>{
    "type"=>"unavailable_shards_exception",
    "reason"=>"[.operations.2018.04.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [.operations.2018.04.25] containing [3562] requests]"
  }
}},
{"create"=>{
  "_index"=>"project.test-prod.18a7d719-48a4-11e7-9d36-1293b0956c5c.2018.04.19",
  "_type"=>"com.redhat.viaq.common",
  "_id"=>"AWNgFjm-zg6tB7-fHPSN",
  "_version"=>1,
  "_shards"=>{"total"=>1, "successful"=>1, "failed"=>0},
  "status"=>201
}},
{"create"=>{
  "_index"=>".operations.2018.04.25",
  "_type"=>"com.redhat.viaq.common",
  "_id"=>"AWNgFjm-zg6tB7-fHPSO",
  "status"=>503,
  "error"=>{
    "type"=>"unavailable_shards_exception",
    "reason"=>"[.operations.2018.04.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [.operations.2018.04.25] containing [3562] requests]"
  }
}},
. . .
The error indicates logs are being sent to the index ".operations.2018.04.25". It appears that index is not open, or has some other problem, as it does not show up in the list of indices (see contents of the attached logging dump). However, the project.test-prod.* index does not show up in that list either, which raises the question of whether the reported error message matches the logging dump data. Can somebody provide a list of all the indices[1] and all the shards[2] that match the posted error above?

[1] https://gist.github.com/portante/5395b05e9b996adfe60c12e8261774de
[2] https://gist.github.com/portante/cadd4b4b22b89f96645f021fe7d972e4
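For reference, those listings can be gathered from inside an Elasticsearch pod with the standard _cat APIs. A minimal sketch, assuming the es_util wrapper referenced later in this bug is present in the image ($POD is the logging-es pod name; older images may need the equivalent curl against localhost:9200 with the admin certs):

  oc exec -c elasticsearch $POD -- es_util --query='_cat/indices?v'
  oc exec -c elasticsearch $POD -- es_util --query='_cat/shards?v'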
A Google search of that error message, "primary shard is not active", found this: https://stackoverflow.com/questions/27547091/primary-shard-is-not-active-or-isnt-assigned-is-a-known-node
How much disk space is available?
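(One way to check, from inside the Elasticsearch pod; $POD is the logging-es pod name:)

  oc exec -c elasticsearch $POD -- df -h /elasticsearch/persistent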
The /elasticsearch/persistent mount is only at 74% usage:

sh-4.2$ df -h
Filesystem                                                                                          Size  Used Avail Use% Mounted on
/dev/mapper/docker-253:7-4336916-f0806b7a3d0d1a7b99b1627e5790c2a2821183dc138397716814f0cc2b476f50    10G  474M  9.6G   5% /
tmpfs                                                                                                16G     0   16G   0% /dev
tmpfs                                                                                                16G     0   16G   0% /sys/fs/cgroup
/dev/loop0                                                                                          299G  220G   80G  74% /elasticsearch/persistent
/dev/mapper/VolGroup00-LogVol02                                                                      30G  1.8G   29G   6% /etc/hosts
shm                                                                                                  64M     0   64M   0% /dev/shm
tmpfs                                                                                                16G   32K   16G   1% /etc/elasticsearch/secret
tmpfs                                                                                                16G   24K   16G   1% /run/secrets/kubernetes.io/serviceaccount
(Noting that 74% is below Elasticsearch's default low disk watermark of 85%, the point at which it stops assigning new shards.) I wonder if there was an issue removing the oldest shard:

.searchguard.logging-es-ops-data-master-m5muqedr  0 p STARTED       5  38.3kb 10.1.2.14 logging-es-ops-data-master-m5muqedr
.operations.2018.05.03                            0 p STARTED 1668352     1gb 10.1.2.14 logging-es-ops-data-master-m5muqedr
.operations.2018.05.02                            0 p STARTED 1612317     1gb 10.1.2.14 logging-es-ops-data-master-m5muqedr
.operations.2018.05.01                            0 p STARTED  315882 205.5mb 10.1.2.14 logging-es-ops-data-master-m5muqedr
.kibana                                           0 p STARTED       1   3.1kb 10.1.2.14 logging-es-ops-data-master-m5muqedr
.kibana.70dff5faf497aad5ee662c3ea35bd5c0fe0367ab  0 p STARTED       3  32.7kb 10.1.2.14 logging-es-ops-data-master-m5muqedr
.operations.2018.04.19                            0 p UNASSIGNED
.operations.2018.05.06                            0 p UNASSIGNED
.operations.2018.05.05                            0 p STARTED 4768370   3.3gb 10.1.2.14 logging-es-ops-data-master-m5muqedr
.kibana.77c40e5d3e7aea435aa73891b206dcd3731ad21e  0 p STARTED       3  32.7kb 10.1.2.14 logging-es-ops-data-master-m5muqedr
.operations.2018.05.04                            0 p STARTED 2308948   1.5gb 10.1.2.14 logging-es-ops-data-master-m5muqedr

That doesn't explain why the newest index (.operations.2018.05.06) is also unassigned, though.
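The effective watermark settings can be double-checked as well; a sketch using the same es_util wrapper (only values explicitly set will show up, so anything absent means the 85%/90% defaults apply):

  oc exec -c elasticsearch $POD -- es_util --query='_cluster/settings?pretty'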
We could review why they are unassigned. Depending on which build of 3.6 is installed, either:

  oc exec -c elasticsearch $POD -- es_util --query='_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED

or:

  oc exec -c elasticsearch $POD -- env QUERY='_cat/shards?h=index,shard,prirep,state,unassigned.reason' es_util | grep UNASSIGNED

(The query string is quoted so the shell does not interpret the "?", and "env" is needed in the second form because oc exec does not run the command through a shell.)
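If the reason comes back as a failed allocation, the ES 2.x API shipped with 3.6 allows force-allocating a primary via _cluster/reroute. This is only a sketch: it assumes es_util forwards extra arguments to curl, and allow_primary discards whatever data that shard held, so it is strictly a last resort:

  oc exec -c elasticsearch $POD -- es_util --query='_cluster/reroute' -XPOST -d \
    '{"commands":[{"allocate":{"index":".operations.2018.04.19","shard":0,"node":"logging-es-ops-data-master-m5muqedr","allow_primary":true}}]}'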
I think we need to see all the available disk space across all ES PVs, as well.
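Elasticsearch can report per-node disk usage for the whole cluster in one query (same es_util assumption as above):

  oc exec -c elasticsearch $POD -- es_util --query='_cat/allocation?v'

And from the OpenShift side, a loop over the ES pods (assuming the standard component=es / component=es-ops labels on the logging pods):

  for p in $(oc get pods -l component=es -o jsonpath='{.items[*].metadata.name}'); do
    oc exec -c elasticsearch $p -- df -h /elasticsearch/persistent
  done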
Comment 16 reports that they are not using PVs for Elasticsearch. If that is the case, Elasticsearch state is evaporating with the pod when it goes away. There is almost no way to make this work without persistent state.
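For completeness, persistent storage for Elasticsearch is normally requested through the openshift_logging ansible role at deploy time. A minimal inventory sketch (variable names as documented for 3.6/3.7; the sizes here are examples only):

  openshift_logging_es_pvc_size=100Gi
  openshift_logging_es_pvc_dynamic=true
  openshift_logging_es_ops_pvc_size=100Gi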