Bug 1564350
| Summary: | logging-curator continually restarts | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Josef Karasek <jkarasek> |
| Component: | Logging | Assignee: | Josef Karasek <jkarasek> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.9.0 | CC: | anli, aos-bugs, dyocum, jcantril, jkarasek, juzhao, pportant, rmeggins, wsun |
| Target Milestone: | --- | Keywords: | OpsBlocker |
| Target Release: | 3.9.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: Curator checked readiness of elasticsearch at start-up. If elasticsearch wasn't ready after 1 minute, curator gave up. This was repeated 5 times with exponential backoff via the pod restart policy, with the default backoffLimit=5 (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy).<br>Consequence: Curator could not be deployed without elasticsearch.<br>Fix: Curator checks for elasticsearch readiness indefinitely before each run.<br>Result: Curator and elasticsearch can be deployed independently. | | |
| Story Points: | --- | | |
| Clone Of: | 1557483 | Environment: | |
| Last Closed: | 2018-08-09 22:13:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1572419 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
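The fix described in the Doc Text above (retry the readiness check indefinitely rather than giving up after a fixed timeout) can be sketched as follows. This is a hypothetical illustration of the approach, not the actual run_cron.py code; `wait_for_es` and its `ping` parameter are stand-in names:

```python
import logging
import time

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger(__name__)


def wait_for_es(ping, delay=5):
    """Block until the Elasticsearch readiness check succeeds, retrying forever.

    `ping` is any zero-argument callable that returns True once ES answers
    (e.g. elasticsearch.Elasticsearch.ping); it is passed in as a parameter
    here only to keep the sketch self-contained.
    """
    retries = 0
    while not ping():
        retries += 1
        # Mirrors the log line seen in the verified v3.9.38 output below.
        log.error(
            "Connection to elasticsearch at [logging-es:9200] failed. "
            "Number of failed retries: %d", retries)
        time.sleep(delay)
    return retries
```

Keeping the retry policy separate from the Elasticsearch client like this also makes the loop easy to exercise with a fake ping in tests.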
Comment 1
Dan Yocum
2018-04-10 14:34:52 UTC
Tested with logging-curator/images/v3.9.27-2; the curator pod restarted 4 times within 4m.
# oc describe po logging-curator-1-twg6w
Name: logging-curator-1-twg6w
Namespace: logging
Node: 172.16.120.93/172.16.120.93
Start Time: Thu, 03 May 2018 01:31:02 -0400
Labels: component=curator
deployment=logging-curator-1
deploymentconfig=logging-curator
logging-infra=curator
provider=openshift
Annotations: openshift.io/deployment-config.latest-version=1
openshift.io/deployment-config.name=logging-curator
openshift.io/deployment.name=logging-curator-1
openshift.io/scc=restricted
Status: Running
IP: 10.129.0.16
Controlled By: ReplicationController/logging-curator-1
Containers:
curator:
Container ID: docker://87f5da98dc7d29417cc5b0d34288226567e229771de55e5aa300d61de8d2904a
Image: brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-curator:v3.9
Image ID: docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-curator@sha256:e299568c9ec84353f9029ac22e425910d2da22c80ecb87b4a76ba9c5a9ba352d
Port: <none>
State: Running
Started: Thu, 03 May 2018 01:32:43 -0400
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Thu, 03 May 2018 01:32:00 -0400
Finished: Thu, 03 May 2018 01:32:00 -0400
Ready: True
Restart Count: 4
Limits:
memory: 256Mi
Requests:
cpu: 100m
memory: 256Mi
Environment:
K8S_HOST_URL: https://kubernetes.default.svc.cluster.local
ES_HOST: logging-es
ES_PORT: 9200
ES_CLIENT_CERT: /etc/curator/keys/cert
ES_CLIENT_KEY: /etc/curator/keys/key
ES_CA: /etc/curator/keys/ca
CURATOR_DEFAULT_DAYS: 30
CURATOR_RUN_HOUR: 3
CURATOR_RUN_MINUTE: 30
CURATOR_RUN_TIMEZONE: UTC
CURATOR_SCRIPT_LOG_LEVEL: INFO
CURATOR_LOG_LEVEL: ERROR
Mounts:
/etc/curator/keys from certs (ro)
/etc/curator/settings from config (ro)
/var/run/secrets/kubernetes.io/serviceaccount from aggregated-logging-curator-token-dxjnt (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
certs:
Type: Secret (a volume populated by a Secret)
SecretName: logging-curator
Optional: false
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: logging-curator
Optional: false
aggregated-logging-curator-token-dxjnt:
Type: Secret (a volume populated by a Secret)
SecretName: aggregated-logging-curator-token-dxjnt
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m default-scheduler Successfully assigned logging-curator-1-twg6w to 172.16.120.93
Normal SuccessfulMountVolume 7m kubelet, 172.16.120.93 MountVolume.SetUp succeeded for volume "config"
Normal SuccessfulMountVolume 7m kubelet, 172.16.120.93 MountVolume.SetUp succeeded for volume "certs"
Normal SuccessfulMountVolume 7m kubelet, 172.16.120.93 MountVolume.SetUp succeeded for volume "aggregated-logging-curator-token-dxjnt"
Warning BackOff 5m (x6 over 7m) kubelet, 172.16.120.93 Back-off restarting failed container
Normal Pulled 5m (x5 over 7m) kubelet, 172.16.120.93 Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-curator:v3.9" already present on machine
Normal Created 5m (x5 over 7m) kubelet, 172.16.120.93 Created container
Normal Started 5m (x5 over 7m) kubelet, 172.16.120.93 Started container
NAME                        READY  STATUS   RESTARTS  AGE
po/logging-curator-1-twg6w  1/1    Running  4         4m

Created attachment 1430448 [details]
logging 3.9 environment dump
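For context on the BackOff events in the pod description above: the kubelet restarts a crashing container with an exponentially growing delay, which is the "exponential backoff" the Doc Text refers to. A minimal sketch of that schedule, assuming the commonly documented 10-second base and 5-minute cap (the real kubelet also adds jitter and resets the counter after a period of healthy running, which this ignores):

```python
def crashloop_delays(restarts, base=10, cap=300):
    """Approximate kubelet crash-loop back-off delays, in seconds.

    The delay doubles after each restart and is capped; with 4 restarts
    observed in ~4 minutes, the pod above is still early in this curve.
    """
    return [min(base * 2 ** i, cap) for i in range(restarts)]


print(crashloop_delays(5))  # -> [10, 20, 40, 80, 160]
```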
@josef, it throws the following message when no ES pod can be found.

logging-curator-2-m5qjd 0/1 Error 2 41s

[root@anli host3ha]# oc logs logging-curator-2-m5qjd
WARNING:elasticsearch:HEAD http://logging-es:9200/ [status:N/A request:2.710s]
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 94, in perform_request
    response = self.pool.urlopen(method, url, body, retries=False, headers=self.headers, **kw)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 649, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python2.7/site-packages/urllib3/util/retry.py", line 333, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 345, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 844, in _validate_conn
    conn.connect()
  File "/usr/lib/python2.7/site-packages/urllib3/connection.py", line 284, in connect
    conn = self._new_conn()
  File "/usr/lib/python2.7/site-packages/urllib3/connection.py", line 150, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f41b9c95690>: Failed to establish a new connection: [Errno 113] No route to host
Traceback (most recent call last):
  File "run_cron.py", line 93, in <module>
    ccj.run()
  File "run_cron.py", line 38, in run
    if self.server_ready():
  File "run_cron.py", line 70, in server_ready
    if es.ping():
  File "/usr/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 69, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/lib/python2.7/site-packages/elasticsearch/client/__init__.py", line 208, in ping
    self.transport.perform_request('HEAD', '/', params=params)
  File "/usr/lib/python2.7/site-packages/elasticsearch/transport.py", line 329, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 105, in perform_request
    raise ConnectionError('N/A', str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f41b9c95690>: Failed to establish a new connection: [Errno 113] No route to host) caused by: NewConnectionError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f41b9c95690>: Failed to establish a new connection: [Errno 113] No route to host)

Additional 3.9 fix: https://github.com/openshift/origin-aggregated-logging/pull/1138

The PR has been tested and passes; waiting for the new images.

Anli: the v3.9.29 image has the latest PR.

Tested with logging-curator:v3.9.38: logging-curator no longer continually restarts; it now keeps run_cron.py running whether or not curator has connected to elasticsearch.
# oc get pod | grep curator
logging-curator-1-gjczc 1/1 Running 0 7m
# oc logs logging-curator-1-gjczc
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 1
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 2
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 3
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 4
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 5
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 6
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 7
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 8
INFO:__main__:curator running [1] jobs
INFO:__main__:No indices matched provided args: {'regex': None, 'index': (), 'suffix': None, 'newer_than': None, 'closed_only': False, 'prefix': None, 'time_unit': 'days', 'timestring': '%Y.%m.%d', 'exclude': ('^\\.searchguard\\..*$', '^\\.kibana.*$'), 'older_than': 30, 'all_indices': False}
INFO:__main__:curator run finish
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2335