Description of problem:
The OpenShift aggregated logging stack has been successfully deployed in our environment. During failover testing (rebooting nodes) we discovered that logging no longer works.

Where are you experiencing the behavior? What environment?
Disconnected environment. Aggregated logging is deployed. Virtual machines on RHEV:
1 - LB
3 - Masters (each contains a fluentd pod)
3 - etcd
2 - nodes (each contains a fluentd pod and a logging-es pod. Logging-es writes to a local volume)

Version-Release number of selected component (if applicable):
3.4

How reproducible:
Customer end

Steps to Reproduce:
1. Mentioned in the description

Actual results:
After reboot, logging does not work.

Expected results:
After reboot, logging should work.

Additional info:
@Jeff, I see the customer's description: 'After reboot logging does not work'. What needs to be rebooted? The OS, the atomic-openshift-node service, or should the logging stack be redeployed?
@Jeff, The test environment is the same as the customer's: virtual machines on RHEV:
1 - LB
3 - Masters (each contains a fluentd pod)
3 - etcd
2 - nodes (each contains a fluentd pod and a logging-es pod. Logging-es writes to a local volume)

Tested two scenarios (commands sketched below):
1. Restart the atomic-openshift-node service: the error the customer reported was not found, and logs can be found in the Kibana UI after the restart.
2. Reboot the OS: this defect was reproduced, with the same error the customer reported. See the attached log file.

Logging 3.4.1 image IDs:
openshift3/logging-elasticsearch ead829151d09
openshift3/logging-deployer eb90e5126c8d
openshift3/logging-auth-proxy d85303b2c262
openshift3/logging-kibana 03900b0b9416
openshift3/logging-fluentd e4b97776c79b
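For reference, a rough sketch of the two scenarios, run on each node (assuming the logging project is named "logging"; adjust names to your environment):

Scenario 1, restart only the node service:
# systemctl restart atomic-openshift-node

Scenario 2, reboot the whole OS:
# reboot

After either step, check pod status and look for new log entries in the Kibana UI:
# oc get pods -n logging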
Created attachment 1263623 [details] ES, Kibana, fluentd pod logs
The images from #3 did not include some of the fixes identified in the attached stack trace. Please retest with 3.4.1-10, which is: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=543826 I do not have SHA information for the ES image.
Tested with logging-elasticsearch:3.4.1-10; same error as yesterday. See the attached log file. Also, the curator pod's status changed from Running -> CrashLoopBackOff -> Running, it restarted many times, and its final status is CrashLoopBackOff.
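For reference, a quick way to watch the curator restarts and capture the failure reason (the pod name is an example; substitute the actual curator pod):
# oc get pods -n logging -w
# oc describe pod logging-curator-1-xxxxx -n logging
# oc logs logging-curator-1-xxxxx -n logging --previous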
Created attachment 1263976 [details] ES, Kibana and fluentd pod logs after rebooting the OS
@Jeff, Did you make any changes to the environment? I checked the pods and the Kibana UI; logs can be shown in Kibana although there are errors in the ES and Kibana pod logs, see the attached file. I also found the curator pod had restarted 126 times. No events were found:
# oc get events
No resources found.
I cannot find the repo you want me to use to reproduce this issue; please give me the full address.
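For reference, the restart count above is visible in the RESTARTS column of:
# oc get pods -n logging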
Created attachment 1264691 [details] ES, Kibana logs and other info, kibana can be accessed
@Junqui, I did two things:
1. oc rollout latest $DCNAME - this alone made the ES pods work properly.
2. Updated $DC to the images from #17.
Neither helped me understand what was happening; they only made logging work properly. I have been investigating this issue and:
1. Am able to recreate the problem.
2. Am working on a solution now.
More info as I have it.
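For reference, a sketch of those two steps (the dc name is an example; list the real ones with "oc get dc -n logging"):
# oc rollout latest dc/logging-es-xxxxxxxx -n logging
# oc edit dc/logging-es-xxxxxxxx -n logging     (point the ES container's image at the new build)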
@Jeff, Do you want me to use jcantrill/logging-elasticsearch:3.4.fix for Elasticsearch? What about kibana, fluentd, curator and logging-auth-proxy - shall I still use the images from the brew registry? Authentication is required when I try to pull these images from your registry:
# docker pull docker.io/jcantrill/logging-kibana:3.4.fix
Trying to pull repository docker.io/jcantrill/logging-kibana ...
unauthorized: authentication required
Used the brew registry to deploy the logging stacks first, then scaled down all ES dcs, edited the ES dcs to use the jcantrill/logging-elasticsearch:3.4.fix image, and scaled the ES dcs back up after redeploying them (a sketch of the sequence is below). Although there are errors in the ES pod, logs could be found in Kibana after rebooting the OS; it seems those errors do not affect the overall function. I think you can push the ES image to the brew registry, then I can test it again. Attached the logs from before and after rebooting the OS.
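For reference, the rough sequence used to swap the ES image (dc names are examples; substitute the real ES deploymentconfigs):
# oc scale dc/logging-es-xxxxxxxx --replicas=0 -n logging
# oc edit dc/logging-es-xxxxxxxx -n logging     (change the image to jcantrill/logging-elasticsearch:3.4.fix)
# oc scale dc/logging-es-xxxxxxxx --replicas=1 -n logging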
Created attachment 1264898 [details] ES, Kibana logs and other info, before reboot
Created attachment 1264899 [details] ES, Kibana logs and other info, after reboot. The environment is still available; you can check this issue there.
Please validate with this image:
12835998 buildContainer (noarch) completed successfully
koji_builds: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=545812
repositories:
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:rhaos-3.4-rhel-7-docker-candidate-20170322162429
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:3.4.1-15
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:3.4.1
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:latest
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:v3.4

You should be able to:
1. Deploy the image in a clean installation with no issues.
2. Update the image in an existing installation.
3. 'Bounce' the ES pod and see no non-transient errors (sketch below).
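For reference, one way to 'bounce' the ES pod and watch the replacement come up (pod names are examples):
# oc delete pod logging-es-xxxxxxxx-n-xxxxx -n logging
# oc logs -f logging-es-xxxxxxxx-n-yyyyy -n logging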
Commit pushed to master at https://github.com/openshift/origin-aggregated-logging
https://github.com/openshift/origin-aggregated-logging/commit/cc6573b5744b4a06b37324d9259f2006c8f59c10
bug 1431551. Seed SG config on start instead of in plugin
(cherry picked from commit f4d71e004af5690f46f287f0fff70ff5cbbb4cb6)
(cherry picked from commit 942ed95fdf629847260507b3906d9a01f364ab0d)
Verified with ES image logging-elasticsearch:3.4.1-15. Before rebooting the OS, one ES pod's log contains many '[com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized' error messages, and another ES pod's log contains a TimeoutException: 'Timeout after 30SECONDS while retrieving configuration for [config, roles, rolesmapping, internalusers, actiongroups]'; the fluentd pod logs contain a ConnectionFailure error. After rebooting the OS, a NoRouteToHostException appears in the ES pod logs; I think this is expected. Log entries can be found on the Kibana UI. See the attached pod log files; the environment mentioned in Comment 15 is still available.
Created attachment 1265602 [details] pod logs before rebooting the OS
Created attachment 1265603 [details] pod logs after rebooting the OS
Junqui, Those are transient failures from ES accepting requests before the SG documents have been seeded by the run.sh script. The logs show several statements like:
Will update 'rolesmapping' with /opt/app-root/src/sgconfig/sg_roles_mapping.yml
SUCC: Configuration for 'rolesmapping' created or updated
followed by statements like:
Done with success
Seeded the searchguard ACL index
which indicate the seeding completed (see the grep sketch below). Furthermore, you should now be able to:
1. Access logs via Kibana
2. See logs being pushed in by fluentd
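For reference, a quick way to confirm the seeding completed in a given ES pod (the pod name is an example):
# oc logs logging-es-xxxxxxxx-n-xxxxx -n logging | grep -i -e 'Seeded the searchguard' -e 'Done with success'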
Although there are transient failures in the ES pods, I am able to:
1. Access logs via Kibana
2. See logs being pushed in by fluentd
In the fluentd pod log we found that fluentd temporarily failed to flush the buffer, but logs can still be pushed to fluentd. This issue is already tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1408633
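For reference, the flush failures can be spotted in the fluentd pod log like so (the pod name is an example):
# oc logs logging-fluentd-xxxxx -n logging | grep 'temporarily failed to flush the buffer'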
Created attachment 1265986 [details] fluentd log
@Junqui, It looks like fluentd is unable to connect to ES. Please provide the ES logs so we can understand what is happening with the ES cluster.
The error shown in the fluentd pod is not caused by the fix for BZ 1431551; it already existed when we did logging 3.4.0 testing, see https://bugzilla.redhat.com/show_bug.cgi?id=1408633 Attached the ES pod log.
Created attachment 1266521 [details] ES pod log
After rebooting the OS, Elasticsearch and Kibana work well and log entries can be found on the Kibana UI. No errors show up besides the one tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1408633 , and that transient error does not affect the overall logging function. Setting this to VERIFIED and closing it.
Image IDs:
/openshift3/logging-kibana 3.4.1 40f32a3a1bd6 7 hours ago 339.1 MB
/openshift3/logging-fluentd 3.4.1 a14ecb38f3a4 8 hours ago 233 MB
/openshift3/logging-curator 3.4.1 65574bade959 8 hours ago 244.3 MB
/openshift3/logging-deployer 3.4.1 6d9424eb2d29 8 hours ago 889.5 MB
/openshift3/logging-auth-proxy 3.4.1 cbb060b8e773 8 hours ago 215.1 MB
/openshift3/logging-elasticsearch 3.4.1 246537fe4546 8 days ago 399.2 MB
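For reference, the listing above is the kind of output produced on a node by:
# docker images | grep logging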
Created attachment 1267753 [details] ES, Kibana logs and other info, before reboot -20170331
Created attachment 1267754 [details] ES, Kibana logs and other info, after reboot -20170331
The "no route to host" warnings are something of a concern.
(In reply to Peter Portante from comment #42)
> The "no route to host" warnings are something of a concern.

From the attachment in Comment 40, the two ES pods' IPs were:
NAME                         IP
logging-es-mstyj4hi-2-kcvw7  10.2.2.26
logging-es-q1vahu51-2-h2zt6  10.2.8.29

After rebooting the OS, the IPs changed to:
NAME                         IP
logging-es-mstyj4hi-2-kcvw7  10.2.2.30
logging-es-q1vahu51-2-h2zt6  10.2.8.35

From the ES pod's NoRouteToHostException info after rebooting:
"[io.fabric8.elasticsearch.discovery.kubernetes.KubernetesUnicastHostsProvider] [Asmodeus] adding endpoint /10.2.8.29, transport_address 10.2.8.29:9300"
It tried to connect to 10.2.8.29, which no longer exists; that likely caused the NoRouteToHostException. It eventually connected to the right IP, 10.2.8.35.
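For reference, the pod IPs before and after the reboot can be compared with:
# oc get pods -n logging -o wide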
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0865
Daniel, the problem you are seeing now appears to be different from this bug. Can you use BZ #1443350 to track it? It appears to be the same as, or very similar to, that one.