Install logging on a new cluster. Lately (since the last kernel CVE update), this has caused the ES pods to fail to deploy. *If* they actually deploy, delete the replicationcontrollers for the ES pods - leave curator running! This should be enough to put curator into a continuous restart loop. If the curator pod doesn't enter the restart loop, run 'oc rollout latest logging-curator' to start a fresh pod that has nothing to connect to.
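A rough sketch of that sequence (the logging namespace and the component=es label selector are assumptions here; adjust them to whatever labels your ES replicationcontrollers actually carry):

# oc delete rc -l component=es -n logging          # remove the ES RCs, leave curator deployed
# oc rollout latest logging-curator -n logging     # fresh curator pod with nothing to connect to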
3.9 PR https://github.com/openshift/origin-aggregated-logging/pull/1095
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1572419
Tested with logging-curator/images/v3.9.27-2; the curator pod restarted 4 times within 4m.

# oc describe po logging-curator-1-twg6w
Name:           logging-curator-1-twg6w
Namespace:      logging
Node:           172.16.120.93/172.16.120.93
Start Time:     Thu, 03 May 2018 01:31:02 -0400
Labels:         component=curator
                deployment=logging-curator-1
                deploymentconfig=logging-curator
                logging-infra=curator
                provider=openshift
Annotations:    openshift.io/deployment-config.latest-version=1
                openshift.io/deployment-config.name=logging-curator
                openshift.io/deployment.name=logging-curator-1
                openshift.io/scc=restricted
Status:         Running
IP:             10.129.0.16
Controlled By:  ReplicationController/logging-curator-1
Containers:
  curator:
    Container ID:   docker://87f5da98dc7d29417cc5b0d34288226567e229771de55e5aa300d61de8d2904a
    Image:          brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-curator:v3.9
    Image ID:       docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-curator@sha256:e299568c9ec84353f9029ac22e425910d2da22c80ecb87b4a76ba9c5a9ba352d
    Port:           <none>
    State:          Running
      Started:      Thu, 03 May 2018 01:32:43 -0400
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 03 May 2018 01:32:00 -0400
      Finished:     Thu, 03 May 2018 01:32:00 -0400
    Ready:          True
    Restart Count:  4
    Limits:
      memory:  256Mi
    Requests:
      cpu:     100m
      memory:  256Mi
    Environment:
      K8S_HOST_URL:              https://kubernetes.default.svc.cluster.local
      ES_HOST:                   logging-es
      ES_PORT:                   9200
      ES_CLIENT_CERT:            /etc/curator/keys/cert
      ES_CLIENT_KEY:             /etc/curator/keys/key
      ES_CA:                     /etc/curator/keys/ca
      CURATOR_DEFAULT_DAYS:      30
      CURATOR_RUN_HOUR:          3
      CURATOR_RUN_MINUTE:        30
      CURATOR_RUN_TIMEZONE:      UTC
      CURATOR_SCRIPT_LOG_LEVEL:  INFO
      CURATOR_LOG_LEVEL:         ERROR
    Mounts:
      /etc/curator/keys from certs (ro)
      /etc/curator/settings from config (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from aggregated-logging-curator-token-dxjnt (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          True
  PodScheduled   True
Volumes:
  certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  logging-curator
    Optional:    false
  config:
    Type:        ConfigMap (a volume populated by a ConfigMap)
    Name:        logging-curator
    Optional:    false
  aggregated-logging-curator-token-dxjnt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  aggregated-logging-curator-token-dxjnt
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason                 Age              From                     Message
  ----     ------                 ----             ----                     -------
  Normal   Scheduled              7m               default-scheduler        Successfully assigned logging-curator-1-twg6w to 172.16.120.93
  Normal   SuccessfulMountVolume  7m               kubelet, 172.16.120.93   MountVolume.SetUp succeeded for volume "config"
  Normal   SuccessfulMountVolume  7m               kubelet, 172.16.120.93   MountVolume.SetUp succeeded for volume "certs"
  Normal   SuccessfulMountVolume  7m               kubelet, 172.16.120.93   MountVolume.SetUp succeeded for volume "aggregated-logging-curator-token-dxjnt"
  Warning  BackOff                5m (x6 over 7m)  kubelet, 172.16.120.93   Back-off restarting failed container
  Normal   Pulled                 5m (x5 over 7m)  kubelet, 172.16.120.93   Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-curator:v3.9" already present on machine
  Normal   Created                5m (x5 over 7m)  kubelet, 172.16.120.93   Created container
  Normal   Started                5m (x5 over 7m)  kubelet, 172.16.120.93   Started container
NAME                         READY     STATUS    RESTARTS   AGE
po/logging-curator-1-twg6w   1/1       Running   4          4m
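While reproducing, the restart count can also be watched live instead of re-running oc describe; the label selector and namespace below are assumptions:

# oc get pod -l component=curator -n logging -w    # RESTARTS column should keep climbing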
Created attachment 1430448: logging 3.9 environment dump
@josef: it throws the following message when no ES pod can be found.

logging-curator-2-m5qjd   0/1   Error   2   41s

[root@anli host3ha]# oc logs logging-curator-2-m5qjd
WARNING:elasticsearch:HEAD http://logging-es:9200/ [status:N/A request:2.710s]
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 94, in perform_request
    response = self.pool.urlopen(method, url, body, retries=False, headers=self.headers, **kw)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 649, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python2.7/site-packages/urllib3/util/retry.py", line 333, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 345, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 844, in _validate_conn
    conn.connect()
  File "/usr/lib/python2.7/site-packages/urllib3/connection.py", line 284, in connect
    conn = self._new_conn()
  File "/usr/lib/python2.7/site-packages/urllib3/connection.py", line 150, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f41b9c95690>: Failed to establish a new connection: [Errno 113] No route to host
Traceback (most recent call last):
  File "run_cron.py", line 93, in <module>
    ccj.run()
  File "run_cron.py", line 38, in run
    if self.server_ready():
  File "run_cron.py", line 70, in server_ready
    if es.ping():
  File "/usr/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 69, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/lib/python2.7/site-packages/elasticsearch/client/__init__.py", line 208, in ping
    self.transport.perform_request('HEAD', '/', params=params)
  File "/usr/lib/python2.7/site-packages/elasticsearch/transport.py", line 329, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 105, in perform_request
    raise ConnectionError('N/A', str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f41b9c95690>: Failed to establish a new connection: [Errno 113] No route to host) caused by: NewConnectionError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f41b9c95690>: Failed to establish a new connection: [Errno 113] No route to host)
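For anyone hitting the same trace: the [Errno 113] No route to host just means nothing is answering for the logging-es service. A quick way to confirm that state (label selector and namespace are assumptions):

# oc get pods -l component=es -n logging           # expect no ES pods running
# oc get endpoints logging-es -n logging           # expect no ready endpoints behind the service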
Additional 3.9 fix: https://github.com/openshift/origin-aggregated-logging/pull/1138
The PR has been tested and passes. Waiting for the new images.
Anli: the v3.9.29 image has the latest PR.
Tested with logging-curator:v3.9.38. logging-curator did not continually restart; curator now runs run_cron.py whether or not it has connected to Elasticsearch.

# oc get pod | grep curator
logging-curator-1-gjczc   1/1   Running   0   7m

# oc logs logging-curator-1-gjczc
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 1
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 2
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 3
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 4
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 5
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 6
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 7
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 8
INFO:__main__:curator running [1] jobs
INFO:__main__:No indices matched provided args: {'regex': None, 'index': (), 'suffix': None, 'newer_than': None, 'closed_only': False, 'prefix': None, 'time_unit': 'days', 'timestring': '%Y.%m.%d', 'exclude': ('^\\.searchguard\\..*$', '^\\.kibana.*$'), 'older_than': 30, 'all_indices': False}
INFO:__main__:curator run finish
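To re-check the fixed behaviour against the original scenario, roughly (again, the label selectors and namespace are assumptions):

# oc delete rc -l component=es -n logging          # take ES away again
# oc rollout latest logging-curator -n logging     # fresh curator pod with nothing to connect to
# oc get pod -l component=curator -n logging       # RESTARTS should stay at 0
# oc logs -f dc/logging-curator -n logging         # bounded retry errors, then a normal curator run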
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2335