Bug 1564350
| Summary: | logging-curator continually restarts | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Josef Karasek <jkarasek> |
| Component: | Logging | Assignee: | Josef Karasek <jkarasek> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.9.0 | CC: | anli, aos-bugs, dyocum, jcantril, jkarasek, juzhao, pportant, rmeggins, wsun |
| Target Milestone: | --- | Keywords: | OpsBlocker |
| Target Release: | 3.9.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: Curator checked readiness of elasticsearch at start-up. If elasticsearch wasn't ready after 1 minute, curator gave up. This was repeated 5 times with exponential backoff via the pod restart policy, with the default backoffLimit=5 (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy).<br>Consequence: Curator could not be deployed without elasticsearch.<br>Fix: Curator checks for elasticsearch readiness indefinitely before each run.<br>Result: Curator and elasticsearch can be deployed independently. | | |
| Story Points: | --- | | |
| Clone Of: | 1557483 | Environment: | |
| Last Closed: | 2018-08-09 22:13:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1572419 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
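The fix described in the Doc Text above (retry the readiness check indefinitely rather than giving up after a fixed timeout) can be sketched as follows. This is a hypothetical illustration of the approach, not the actual run_cron.py code; `wait_for_es` and its `ping` parameter are stand-in names:

```python
import logging
import time

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger(__name__)


def wait_for_es(ping, delay=5):
    """Block until the Elasticsearch readiness check succeeds, retrying forever.

    `ping` is any zero-argument callable that returns True once ES answers
    (e.g. elasticsearch.Elasticsearch.ping); it is passed in as a parameter
    here only to keep the sketch self-contained.
    """
    retries = 0
    while not ping():
        retries += 1
        # Mirrors the log line seen in the verified v3.9.38 output below.
        log.error(
            "Connection to elasticsearch at [logging-es:9200] failed. "
            "Number of failed retries: %d", retries)
        time.sleep(delay)
    return retries
```

Keeping the retry policy separate from the Elasticsearch client like this also makes the loop easy to exercise with a fake ping in tests.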
Comment 1
Dan Yocum
2018-04-10 14:34:52 UTC
Tested with logging-curator/images/v3.9.27-2; the curator pod restarted 4 times within 4m.
# oc describe po logging-curator-1-twg6w
Name: logging-curator-1-twg6w
Namespace: logging
Node: 172.16.120.93/172.16.120.93
Start Time: Thu, 03 May 2018 01:31:02 -0400
Labels: component=curator
deployment=logging-curator-1
deploymentconfig=logging-curator
logging-infra=curator
provider=openshift
Annotations: openshift.io/deployment-config.latest-version=1
openshift.io/deployment-config.name=logging-curator
openshift.io/deployment.name=logging-curator-1
openshift.io/scc=restricted
Status: Running
IP: 10.129.0.16
Controlled By: ReplicationController/logging-curator-1
Containers:
curator:
Container ID: docker://87f5da98dc7d29417cc5b0d34288226567e229771de55e5aa300d61de8d2904a
Image: brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-curator:v3.9
Image ID: docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-curator@sha256:e299568c9ec84353f9029ac22e425910d2da22c80ecb87b4a76ba9c5a9ba352d
Port: <none>
State: Running
Started: Thu, 03 May 2018 01:32:43 -0400
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Thu, 03 May 2018 01:32:00 -0400
Finished: Thu, 03 May 2018 01:32:00 -0400
Ready: True
Restart Count: 4
Limits:
memory: 256Mi
Requests:
cpu: 100m
memory: 256Mi
Environment:
K8S_HOST_URL: https://kubernetes.default.svc.cluster.local
ES_HOST: logging-es
ES_PORT: 9200
ES_CLIENT_CERT: /etc/curator/keys/cert
ES_CLIENT_KEY: /etc/curator/keys/key
ES_CA: /etc/curator/keys/ca
CURATOR_DEFAULT_DAYS: 30
CURATOR_RUN_HOUR: 3
CURATOR_RUN_MINUTE: 30
CURATOR_RUN_TIMEZONE: UTC
CURATOR_SCRIPT_LOG_LEVEL: INFO
CURATOR_LOG_LEVEL: ERROR
Mounts:
/etc/curator/keys from certs (ro)
/etc/curator/settings from config (ro)
/var/run/secrets/kubernetes.io/serviceaccount from aggregated-logging-curator-token-dxjnt (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
certs:
Type: Secret (a volume populated by a Secret)
SecretName: logging-curator
Optional: false
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: logging-curator
Optional: false
aggregated-logging-curator-token-dxjnt:
Type: Secret (a volume populated by a Secret)
SecretName: aggregated-logging-curator-token-dxjnt
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m default-scheduler Successfully assigned logging-curator-1-twg6w to 172.16.120.93
Normal SuccessfulMountVolume 7m kubelet, 172.16.120.93 MountVolume.SetUp succeeded for volume "config"
Normal SuccessfulMountVolume 7m kubelet, 172.16.120.93 MountVolume.SetUp succeeded for volume "certs"
Normal SuccessfulMountVolume 7m kubelet, 172.16.120.93 MountVolume.SetUp succeeded for volume "aggregated-logging-curator-token-dxjnt"
Warning BackOff 5m (x6 over 7m) kubelet, 172.16.120.93 Back-off restarting failed container
Normal Pulled 5m (x5 over 7m) kubelet, 172.16.120.93 Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-curator:v3.9" already present on machine
Normal Created 5m (x5 over 7m) kubelet, 172.16.120.93 Created container
Normal Started 5m (x5 over 7m) kubelet, 172.16.120.93 Started container
NAME                        READY  STATUS   RESTARTS  AGE
po/logging-curator-1-twg6w  1/1    Running  4         4m

Created attachment 1430448 [details]
logging 3.9 environment dump
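For context on the BackOff events in the pod description above: the kubelet restarts a crashing container with an exponentially growing delay, which is the "exponential backoff" the Doc Text refers to. A minimal sketch of that schedule, assuming the commonly documented 10-second base and 5-minute cap (the real kubelet also adds jitter and resets the counter after a period of healthy running, which this ignores):

```python
def crashloop_delays(restarts, base=10, cap=300):
    """Approximate kubelet crash-loop back-off delays, in seconds.

    The delay doubles after each restart and is capped; with 4 restarts
    observed in ~4 minutes, the pod above is still early in this curve.
    """
    return [min(base * 2 ** i, cap) for i in range(restarts)]


print(crashloop_delays(5))  # -> [10, 20, 40, 80, 160]
```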
@josef, it throws the following message when no ES pod can be found.

logging-curator-2-m5qjd 0/1 Error 2 41s

[root@anli host3ha]# oc logs logging-curator-2-m5qjd
WARNING:elasticsearch:HEAD http://logging-es:9200/ [status:N/A request:2.710s]
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 94, in perform_request
    response = self.pool.urlopen(method, url, body, retries=False, headers=self.headers, **kw)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 649, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python2.7/site-packages/urllib3/util/retry.py", line 333, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 345, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 844, in _validate_conn
    conn.connect()
  File "/usr/lib/python2.7/site-packages/urllib3/connection.py", line 284, in connect
    conn = self._new_conn()
  File "/usr/lib/python2.7/site-packages/urllib3/connection.py", line 150, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f41b9c95690>: Failed to establish a new connection: [Errno 113] No route to host
Traceback (most recent call last):
  File "run_cron.py", line 93, in <module>
    ccj.run()
  File "run_cron.py", line 38, in run
    if self.server_ready():
  File "run_cron.py", line 70, in server_ready
    if es.ping():
  File "/usr/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 69, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/lib/python2.7/site-packages/elasticsearch/client/__init__.py", line 208, in ping
    self.transport.perform_request('HEAD', '/', params=params)
  File "/usr/lib/python2.7/site-packages/elasticsearch/transport.py", line 329, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 105, in perform_request
    raise ConnectionError('N/A', str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f41b9c95690>: Failed to establish a new connection: [Errno 113] No route to host) caused by: NewConnectionError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f41b9c95690>: Failed to establish a new connection: [Errno 113] No route to host)

Additional 3.9 fix: https://github.com/openshift/origin-aggregated-logging/pull/1138

The PR has been tested and passes; waiting for the new images.

Anli: the v3.9.29 image has the latest PR.

Tested with logging-curator:v3.9.38: logging-curator no longer continually restarts; it now keeps run_cron.py running whether or not curator has connected to elasticsearch.
# oc get pod | grep curator
logging-curator-1-gjczc 1/1 Running 0 7m
# oc logs logging-curator-1-gjczc
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 1
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 2
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 3
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 4
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 5
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 6
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 7
ERROR:__main__:Connection to elasticsearch at [logging-es:9200] failed. Number of failed retries: 8
INFO:__main__:curator running [1] jobs
INFO:__main__:No indices matched provided args: {'regex': None, 'index': (), 'suffix': None, 'newer_than': None, 'closed_only': False, 'prefix': None, 'time_unit': 'days', 'timestring': '%Y.%m.%d', 'exclude': ('^\\.searchguard\\..*$', '^\\.kibana.*$'), 'older_than': 30, 'all_indices': False}
INFO:__main__:curator run finish
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2335