Bug 1564350 - logging-curator continually restarts
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.9.z
Assigned To: Josef Karasek
QA Contact: Junqi Zhao
Whiteboard: OpsBlocker
Depends On: 1572419
Blocks:
 
Reported: 2018-04-05 23:33 EDT by Josef Karasek
Modified: 2018-08-09 18:14 EDT
CC: 9 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Curator checked readiness of elasticsearch at start-up. If elasticsearch wasn't ready after 1 minute, curator gave up. This was repeated 5 times with exponential backoff via the pod restart policy with the default backofflimit=5 (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy).
Consequence: Curator could not be deployed without elasticsearch.
Fix: Curator checks for elasticsearch readiness indefinitely before each run.
Result: Curator and elasticsearch can be deployed independently.
Story Points: ---
Clone Of: 1557483
Environment:
Last Closed: 2018-08-09 18:13:46 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
logging 3.9 environment dump (38.84 KB, application/x-gzip)
2018-05-03 02:12 EDT, Junqi Zhao


External Trackers
Tracker ID: Red Hat Product Errata RHBA-2018:2335
Priority: None    Status: None    Summary: None
Last Updated: 2018-08-09 18:14 EDT

Comment 1 Dan Yocum 2018-04-10 10:34:52 EDT
Install logging on a new cluster.  Lately (since the last kernel CVE update), this has caused the ES pods to fail to deploy.  *If* they actually deploy, then delete the replicationcontrollers for the ES pods - leave curator running!  This should be sufficient to put the curator into a continuous restart loop.  If the curator pod doesn't start the restart loop, then run 'oc rollout latest logging-curator' to start a fresh pod that has nothing to connect to.
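
A condensed form of those reproduction steps; the namespace and the component=curator label are taken from the pod description in comment 8, the ES replicationcontroller names are placeholders, and the rollout command is quoted from the comment above:

# oc -n logging get rc | grep logging-es
# oc -n logging delete rc <logging-es rc names from the previous command>
# oc -n logging rollout latest logging-curator
# oc -n logging get pod -l component=curator -w

The last command should show the curator pod's RESTARTS column climbing, as in comment 9.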
Comment 6 Junqi Zhao 2018-05-01 22:51:15 EDT
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1572419
Comment 8 Junqi Zhao 2018-05-03 02:09:52 EDT
Tested with logging-curator/images/v3.9.27-2, curator pod restarted 4 times within 4 minutes


# oc describe po logging-curator-1-twg6w
Name:           logging-curator-1-twg6w
Namespace:      logging
Node:           172.16.120.93/172.16.120.93
Start Time:     Thu, 03 May 2018 01:31:02 -0400
Labels:         component=curator
                deployment=logging-curator-1
                deploymentconfig=logging-curator
                logging-infra=curator
                provider=openshift
Annotations:    openshift.io/deployment-config.latest-version=1
                openshift.io/deployment-config.name=logging-curator
                openshift.io/deployment.name=logging-curator-1
                openshift.io/scc=restricted
Status:         Running
IP:             10.129.0.16
Controlled By:  ReplicationController/logging-curator-1
Containers:
  curator:
    Container ID:   docker://87f5da98dc7d29417cc5b0d34288226567e229771de55e5aa300d61de8d2904a
    Image:          brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-curator:v3.9
    Image ID:       docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-curator@sha256:e299568c9ec84353f9029ac22e425910d2da22c80ecb87b4a76ba9c5a9ba352d
    Port:           <none>
    State:          Running
      Started:      Thu, 03 May 2018 01:32:43 -0400
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 03 May 2018 01:32:00 -0400
      Finished:     Thu, 03 May 2018 01:32:00 -0400
    Ready:          True
    Restart Count:  4
    Limits:
      memory:  256Mi
    Requests:
      cpu:     100m
      memory:  256Mi
    Environment:
      K8S_HOST_URL:              https://kubernetes.default.svc.cluster.local
      ES_HOST:                   logging-es
      ES_PORT:                   9200
      ES_CLIENT_CERT:            /etc/curator/keys/cert
      ES_CLIENT_KEY:             /etc/curator/keys/key
      ES_CA:                     /etc/curator/keys/ca
      CURATOR_DEFAULT_DAYS:      30
      CURATOR_RUN_HOUR:          3
      CURATOR_RUN_MINUTE:        30
      CURATOR_RUN_TIMEZONE:      UTC
      CURATOR_SCRIPT_LOG_LEVEL:  INFO
      CURATOR_LOG_LEVEL:         ERROR
    Mounts:
      /etc/curator/keys from certs (ro)
      /etc/curator/settings from config (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from aggregated-logging-curator-token-dxjnt (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          True 
  PodScheduled   True 
Volumes:
  certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  logging-curator
    Optional:    false
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      logging-curator
    Optional:  false
  aggregated-logging-curator-token-dxjnt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  aggregated-logging-curator-token-dxjnt
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason                 Age              From                    Message
  ----     ------                 ----             ----                    -------
  Normal   Scheduled              7m               default-scheduler       Successfully assigned logging-curator-1-twg6w to 172.16.120.93
  Normal   SuccessfulMountVolume  7m               kubelet, 172.16.120.93  MountVolume.SetUp succeeded for volume "config"
  Normal   SuccessfulMountVolume  7m               kubelet, 172.16.120.93  MountVolume.SetUp succeeded for volume "certs"
  Normal   SuccessfulMountVolume  7m               kubelet, 172.16.120.93  MountVolume.SetUp succeeded for volume "aggregated-logging-curator-token-dxjnt"
  Warning  BackOff                5m (x6 over 7m)  kubelet, 172.16.120.93  Back-off restarting failed container
  Normal   Pulled                 5m (x5 over 7m)  kubelet, 172.16.120.93  Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-curator:v3.9" already present on machine
  Normal   Created                5m (x5 over 7m)  kubelet, 172.16.120.93  Created container
  Normal   Started                5m (x5 over 7m)  kubelet, 172.16.120.93  Started container
Comment 9 Junqi Zhao 2018-05-03 02:10:30 EDT
NAME                                         READY     STATUS    RESTARTS   AGE
po/logging-curator-1-twg6w                   1/1       Running   4          4m
Comment 10 Junqi Zhao 2018-05-03 02:12 EDT
Created attachment 1430448 [details]
logging 3.9 environment dump
Comment 11 Anping Li 2018-05-04 05:59:46 EDT
@josef,
It throws the following message when no ES pod can be found.

logging-curator-2-m5qjd       0/1       Error     2          41s

[root@anli host3ha]# oc logs logging-curator-2-m5qjd
WARNING:elasticsearch:HEAD http://logging-es:9200/ [status:N/A request:2.710s]
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 94, in perform_request
    response = self.pool.urlopen(method, url, body, retries=False, headers=self.headers, **kw)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 649, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python2.7/site-packages/urllib3/util/retry.py", line 333, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 345, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 844, in _validate_conn
    conn.connect()
  File "/usr/lib/python2.7/site-packages/urllib3/connection.py", line 284, in connect
    conn = self._new_conn()
  File "/usr/lib/python2.7/site-packages/urllib3/connection.py", line 150, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f41b9c95690>: Failed to establish a new connection: [Errno 113] No route to host
Traceback (most recent call last):
  File "run_cron.py", line 93, in <module>
    ccj.run()
  File "run_cron.py", line 38, in run
    if self.server_ready():
  File "run_cron.py", line 70, in server_ready
    if es.ping():
  File "/usr/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 69, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/lib/python2.7/site-packages/elasticsearch/client/__init__.py", line 208, in ping
    self.transport.perform_request('HEAD', '/', params=params)
  File "/usr/lib/python2.7/site-packages/elasticsearch/transport.py", line 329, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 105, in perform_request
    raise ConnectionError('N/A', str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f41b9c95690>: Failed to establish a new connection: [Errno 113] No route to host) caused by: NewConnectionError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f41b9c95690>: Failed to establish a new connection: [Errno 113] No route to host)
Comment 12 Jeff Cantrill 2018-05-04 09:23:01 EDT
Additional 3.9 fix: https://github.com/openshift/origin-aggregated-logging/pull/1138
Comment 13 Anping Li 2018-05-07 03:44:20 EDT
The PR has been tested and passes. Waiting for the new images.
Comment 15 Josef Karasek 2018-05-09 03:22:29 EDT
Anli: v3.9.29 image has the latest PR
Comment 19 Junqi Zhao 2018-07-29 21:47:43 EDT
Tested with logging-curator:v3.9.38; logging-curator no longer restarts continually. It now runs run_cron.py whether or not curator has connected to elasticsearch.

# oc get pod | grep curator
logging-curator-1-gjczc                   1/1       Running   0          7m

# oc logs logging-curator-1-gjczc
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 1
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 2
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 3
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 4
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 5
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 6
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 7
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 8
INFO:__main__:curator running [1] jobs
INFO:__main__:No indices matched provided args: {'regex': None, 'index': (), 'suffix': None, 'newer_than': None, 'closed_only': False, 'prefix': None, 'time_unit': 'days', 'timestring': '%Y.%m.%d', 'exclude': ('^\\.searchguard\\..*$', '^\\.kibana.*$'), 'older_than': 30, 'all_indices': False}
INFO:__main__:curator run finish
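
The log above matches the fix described in the Doc Text: rather than giving up after a bounded number of readiness checks and relying on the pod restart policy, curator now pings elasticsearch indefinitely before each run. A minimal sketch of that pattern, assuming the elasticsearch-py client; the function name and the standalone usage are illustrative, not the actual run_cron.py code:

import logging
import time

from elasticsearch import Elasticsearch

logger = logging.getLogger(__name__)

def wait_for_elasticsearch(es, host="logging-es:9200", delay=60):
    """Ping elasticsearch and keep retrying until it answers."""
    attempts = 0
    while True:
        try:
            if es.ping():  # HEAD /; returns False or raises while ES is unreachable
                return
        except Exception:
            pass
        attempts += 1
        logger.error("Connection to elasticsearch at [%s] failed. "
                     "Number of failed retries: %d", host, attempts)
        time.sleep(delay)

if __name__ == "__main__":
    logging.basicConfig(level=logging.ERROR)
    es = Elasticsearch(["https://logging-es:9200"])  # client cert/key/CA options omitted
    wait_for_elasticsearch(es)
    # run the curator jobs here, e.g. the cron job wrapper in run_cron.py

With a loop like this the pod stays Running and only logs the failed-retry errors seen above, instead of exiting and being restarted by the kubelet.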
Comment 21 errata-xmlrpc 2018-08-09 18:13:46 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2335
