Bug 1431551 - OpenShift Logging not stable - does not survive node outage
Summary: OpenShift Logging not stable - does not survive node outage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.4.z
Assignee: Jeff Cantrill
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-03-13 08:37 UTC by Miheer Salunke
Modified: 2020-12-14 08:20 UTC
CC: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The index used by the SG plugin was not seeded properly. Consequence: The ES node waits for the SG ACLs and does not service any requests. Fix: Moved the initial seeding logic to server startup in run.sh. Result: The initial SG documents are seeded and the plugin services requests.
Clone Of:
Environment:
Last Closed: 2017-04-04 14:28:48 UTC
Target Upstream Version:
Embargoed:


Attachments
es, kibana,fluentd pod log (327.54 KB, text/plain)
2017-03-16 10:33 UTC, Junqi Zhao
ES, Kibana and fluentd pod log after rebooting os (95.62 KB, text/plain)
2017-03-17 08:59 UTC, Junqi Zhao
ES, Kibana logs and other info, kibana can be accessed (7.43 MB, text/plain)
2017-03-20 02:10 UTC, Junqi Zhao
ES, Kibana logs and other info, before reboot (160.11 KB, text/plain)
2017-03-21 03:08 UTC, Junqi Zhao
ES, Kibana logs and other info, after reboot (57.42 KB, text/plain)
2017-03-21 03:10 UTC, Junqi Zhao
before rebooting os, pods log (135.48 KB, text/plain)
2017-03-23 06:09 UTC, Junqi Zhao
after rebooting os, pods log (66.62 KB, text/plain)
2017-03-23 06:09 UTC, Junqi Zhao
fluentd log (74.21 KB, text/plain)
2017-03-24 08:04 UTC, Junqi Zhao
es pod log (198.31 KB, text/plain)
2017-03-27 00:38 UTC, Junqi Zhao
ES, Kibana logs and other info, before reboot -20170331 (108.86 KB, text/plain)
2017-03-31 06:23 UTC, Junqi Zhao
ES, Kibana logs and other info, after reboot -20170331 (56.56 KB, text/plain)
2017-03-31 06:24 UTC, Junqi Zhao


Links
Red Hat Product Errata RHBA-2017:0865 (SHIPPED_LIVE): OpenShift Container Platform 3.4.1.12, 3.3.1.17-4, and 3.2.1.30 bug fix update. Last updated 2017-04-04 18:27:43 UTC

Description Miheer Salunke 2017-03-13 08:37:09 UTC
Description of problem:
The OpenShift aggregated logging stack has been successfully deployed in our environment. During failover testing (rebooting nodes) we discovered that logging no longer works.



Where are you experiencing the behavior?  What environment?

Disconnected environment. Aggregated logging is deployed.
Virtual machines on RHEV
1 - LB
3 - Masters (each contains a fluentd pod)
3 - etcd
2 - nodes (each contains a fluentd pod and a logging-es pod. Logging-es writes to a local volume)

Version-Release number of selected component (if applicable):
3.4

How reproducible:
Customer end

Steps to Reproduce:
1. With aggregated logging deployed, reboot a node (as mentioned in the description).

Actual results:
After the reboot, logging no longer works.

Expected results:
After the reboot, logging should continue to work.

Additional info:

Comment 9 Junqi Zhao 2017-03-16 09:20:01 UTC
@Jeff,
I see the customer's description:
'After reboot logging does not work'

What needs to be rebooted: the OS, the atomic-openshift-node service, or should the logging stack be redeployed?

Comment 10 Junqi Zhao 2017-03-16 10:32:08 UTC
@Jeff,
The test environment is the same as the customer's:
Virtual machines on RHEV
1 - LB
3 - Masters (each contains a fluentd pod)
3 - etcd
2 - nodes (each contains a fluentd pod and a logging-es pod. Logging-es writes to a local volume)

Tested two scenarios:
1. Restarted the atomic-openshift-node service: the error the customer reported did not occur, and logs could be found in the Kibana UI after the restart.

2. Rebooted the OS: this defect was reproduced, with the same error the customer reported.
See the attached log file.


Logging 3.4.1 image IDs:
openshift3/logging-elasticsearch    ead829151d09
openshift3/logging-deployer    eb90e5126c8d
openshift3/logging-auth-proxy    d85303b2c262
openshift3/logging-kibana    03900b0b9416
openshift3/logging-fluentd    e4b97776c79b

Comment 11 Junqi Zhao 2017-03-16 10:33:05 UTC
Created attachment 1263623 [details]
es, kibana,fluentd pod log

Comment 12 Jeff Cantrill 2017-03-16 12:26:52 UTC
The images from #3 did not include some of the fixes identified in the attached stack trace. Please retest with 3.4.1-10, which is:

https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=543826

I do not have the SHA information for the ES image.

Comment 13 Junqi Zhao 2017-03-17 08:58:32 UTC
Tested with logging-elasticsearch:3.4.1-10; the same error as yesterday occurred.
See the attached log file.

The curator pod's status also cycled from Running -> CrashLoopBackOff -> Running, restarted many times, and its final status is CrashLoopBackOff.

Comment 14 Junqi Zhao 2017-03-17 08:59:51 UTC
Created attachment 1263976 [details]
ES, Kibana and fluentd pod log after rebooting os

Comment 18 Junqi Zhao 2017-03-20 02:08:51 UTC
@Jeff,
Did you make any changes to the environment? I checked the pods and the Kibana UI; logs can be shown in Kibana, although there were errors in the ES and Kibana pod logs, see the attached file.

I also found the curator pod had been restarted 126 times, yet there were no events found:

# oc get events
No resources found.
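
For reference, the restart count is visible in the RESTARTS column of 'oc get pods' (a sketch, assuming the logging project is the current project):

# oc get pods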

I cannot find the repo where you want me to reproduce this issue; please give me the full address.

Comment 19 Junqi Zhao 2017-03-20 02:10:04 UTC
Created attachment 1264691 [details]
ES, Kibana logs and other info, kibana can be accessed

Comment 20 Jeff Cantrill 2017-03-20 13:02:33 UTC
@Junqi,

I did two things:

1. oc rollout latest $DCNAME - This alone made the ES pods work properly.
2. Updated the $DC to the images from #17 (both steps are sketched at the end of this comment).

Neither helped me understand what was happening; they only made logging work properly. I have been investigating this issue and:

1. I am able to recreate the problem.
2. I am working on a solution now.

More info as I have it.
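
For reference, a minimal sketch of those two steps, assuming a hypothetical ES deploymentconfig named logging-es-abc123 (substitute the real DC name from 'oc get dc'); step two edits the DC image before triggering a new rollout:

# oc rollout latest dc/logging-es-abc123
# oc edit dc/logging-es-abc123
# oc rollout latest dc/logging-es-abc123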

Comment 22 Junqi Zhao 2017-03-21 01:12:40 UTC
@Jeff,

Do you want me to use jcantrill/logging-elasticsearch:3.4.fix for Elasticsearch?
What about kibana, fluentd, curator and logging-auth-proxy, shall I still use those images from the brew registry?

Authentication is required when I try to pull these images from your registry:
# docker pull docker.io/jcantrill/logging-kibana:3.4.fix
Trying to pull repository docker.io/jcantrill/logging-kibana ... 
unauthorized: authentication required

Comment 23 Junqi Zhao 2017-03-21 03:07:28 UTC
I used the brew registry to deploy the logging stack first, then scaled down all ES DCs, edited the ES DCs to use the jcantrill/logging-elasticsearch:3.4.fix image, and scaled the ES DCs back up to redeploy them.
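
A sketch of that sequence with a hypothetical ES deploymentconfig name logging-es-abc123 (substitute each real DC name from 'oc get dc'); the edit step changes the DC's image to docker.io/jcantrill/logging-elasticsearch:3.4.fix:

# oc scale dc/logging-es-abc123 --replicas=0
# oc edit dc/logging-es-abc123
# oc scale dc/logging-es-abc123 --replicas=1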

Although there were errors in the ES pod, after rebooting the OS, logs could be found in Kibana. It seems those errors do not affect the overall function.

I think you can push the ES image to the brew registry; then I can test it again.


Attached are the logs from before and after rebooting the OS.

Comment 24 Junqi Zhao 2017-03-21 03:08:34 UTC
Created attachment 1264898 [details]
ES, Kibana logs and other info, before reboot

Comment 25 Junqi Zhao 2017-03-21 03:10:07 UTC
Created attachment 1264899 [details]
ES, Kibana logs and other info, after reboot

The environment is still available; you can check this issue there.

Comment 26 Jeff Cantrill 2017-03-22 20:50:52 UTC
Please validate with this image: 

12835998 buildContainer (noarch) completed successfully
koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=545812
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:rhaos-3.4-rhel-7-docker-candidate-20170322162429
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:3.4.1-15
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:3.4.1
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:v3.4

You should be able to:

1. Deploy the image in a clean installation with no issues.
2. Update the image in an existing installation.
3. 'Bounce' the ES pod and see no non-transient errors (see the sketch below).
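
A minimal sketch of the 'bounce' check in item 3, using hypothetical pod and project names (take the real ones from 'oc get pods -n logging'); the deploymentconfig recreates the deleted pod, and the new pod's log should show only transient errors:

# oc delete pod logging-es-abc123-1-xyz12 -n logging
# oc get pods -n logging -w
# oc logs logging-es-abc123-1-ab34c -n logging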

Comment 27 openshift-github-bot 2017-03-22 21:37:34 UTC
Commit pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/cc6573b5744b4a06b37324d9259f2006c8f59c10
bug 1431551. Seed SG config on start instead of in plugin

(cherry picked from commit f4d71e004af5690f46f287f0fff70ff5cbbb4cb6)
(cherry picked from commit 942ed95fdf629847260507b3906d9a01f364ab0d)

Comment 28 Junqi Zhao 2017-03-23 06:08:26 UTC
Verified with the ES image logging-elasticsearch:3.4.1-15.

Before rebooting the OS, one ES pod's log contains many '[com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized' error messages, while another ES pod's log contains a TimeoutException: 'Timeout after 30SECONDS while retrieving configuration for [config, roles, rolesmapping, internalusers, actiongroups]'. The fluentd pods' logs contain a ConnectionFailure error.

After rebooting the OS, a NoRouteToHostException is found in the ES pod logs, which I think is expected. Log entries can be found in the Kibana UI.

See the attached pod log files; the environment mentioned in Comment 15 is still available.

Comment 29 Junqi Zhao 2017-03-23 06:09:20 UTC
Created attachment 1265602 [details]
before rebooting os, pods log

Comment 30 Junqi Zhao 2017-03-23 06:09:46 UTC
Created attachment 1265603 [details]
after rebooting os, pods log

Comment 31 Jeff Cantrill 2017-03-23 18:47:34 UTC
Junqi,

Those are transient failures from ES accepting requests before the SG documents have been seeded by the run.sh script.  The logs show several statements like:

Will update 'rolesmapping' with /opt/app-root/src/sgconfig/sg_roles_mapping.yml
   SUCC: Configuration for 'rolesmapping' created or updated

Followed by a statement like:

Done with success
Seeded the searchguard ACL index

This indicates the seeding was complete (a quick log check is sketched below). Furthermore, you should now be able to:

1. Access logs via Kibana
2. See logs being pushed in by fluentd.
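
A quick way to confirm the seeding finished, sketched with a hypothetical ES pod name (take the real one from 'oc get pods'):

# oc logs logging-es-abc123-1-xyz12 | grep "Seeded the searchguard ACL index"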

Comment 32 Junqi Zhao 2017-03-24 07:53:08 UTC
Although there are transient failures in the ES pods, I am able to:
1. Access logs via Kibana
2. See logs being pushed in by fluentd.

Comment 33 Junqi Zhao 2017-03-24 08:04:29 UTC
In the fluentd pod log, we found that fluentd temporarily failed to flush the buffer, but logs can still be pushed; this issue is already tracked in:
https://bugzilla.redhat.com/show_bug.cgi?id=1408633

Comment 34 Junqi Zhao 2017-03-24 08:04:53 UTC
Created attachment 1265986 [details]
fluentd log

Comment 35 Jeff Cantrill 2017-03-24 13:26:47 UTC
@Junqi,

It looks like fluentd is unable to connect to ES. Please provide the ES logs so we can understand what is happening with the ES cluster.

Comment 36 Junqi Zhao 2017-03-27 00:38:00 UTC
The error shown in the fluentd pod is not caused by the fix for BZ 1431551; it already existed during logging 3.4.0 testing, see https://bugzilla.redhat.com/show_bug.cgi?id=1408633

Attached is the ES pod log.

Comment 37 Junqi Zhao 2017-03-27 00:38:22 UTC
Created attachment 1266521 [details]
es pod log

Comment 39 Junqi Zhao 2017-03-31 06:22:42 UTC
After rebooting the OS, Elasticsearch and Kibana work well, and log entries can be found in the Kibana UI. No errors appear other than the one tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1408633, and that transient error does not affect the overall logging function.

Setting this to VERIFIED and closing it.

Image IDs:
/openshift3/logging-kibana          3.4.1               40f32a3a1bd6        7 hours ago         339.1 MB
/openshift3/logging-fluentd         3.4.1               a14ecb38f3a4        8 hours ago         233 MB
/openshift3/logging-curator         3.4.1               65574bade959        8 hours ago         244.3 MB
/openshift3/logging-deployer        3.4.1               6d9424eb2d29        8 hours ago         889.5 MB
/openshift3/logging-auth-proxy      3.4.1               cbb060b8e773        8 hours ago         215.1 MB
/openshift3/logging-elasticsearch   3.4.1               246537fe4546        8 days ago          399.2 MB

Comment 40 Junqi Zhao 2017-03-31 06:23:52 UTC
Created attachment 1267753 [details]
ES, Kibana logs and other info, before reboot -20170331

Comment 41 Junqi Zhao 2017-03-31 06:24:40 UTC
Created attachment 1267754 [details]
ES, Kibana logs and other info, after reboot -20170331

Comment 42 Peter Portante 2017-03-31 12:13:31 UTC
The "no route to host" warnings are something of a concern.

Comment 43 Junqi Zhao 2017-04-01 01:10:42 UTC
(In reply to Peter Portante from comment #42)
> The "no route to host" warnings are something of a concern.

From the attachment in Comment 40, the two ES pods' IPs were:
NAME			      IP
logging-es-mstyj4hi-2-kcvw7   10.2.2.26
logging-es-q1vahu51-2-h2zt6   10.2.8.29


After rebooting the OS, the IPs changed to:
NAME			      IP
logging-es-mstyj4hi-2-kcvw7   10.2.2.30
logging-es-q1vahu51-2-h2zt6   10.2.8.35

From the ES pod's NoRouteToHostException info after rebooting:
"[io.fabric8.elasticsearch.discovery.kubernetes.KubernetesUnicastHostsProvider] [Asmodeus] adding endpoint /10.2.8.29, transport_address 10.2.8.29:9300"

It tried to connect to 10.2.8.29, which no longer exists; this likely caused the NoRouteToHostException. Eventually it connected to the correct IP, 10.2.8.35.
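
For reference, a sketch of how the pod IPs can be compared before and after a reboot, assuming the project is named 'logging'; the wide output includes each pod's IP and node:

# oc get pods -n logging -o wide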

Comment 46 errata-xmlrpc 2017-04-04 14:28:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0865

Comment 50 Peter Portante 2017-05-10 15:56:08 UTC
Daniel, what you are seeing now appears to be a different problem; can you use BZ #1443350 to track it, as it appears to be the same issue or very similar?

