Bug 1557483 - logging-curator continually restarts
Summary: logging-curator continually restarts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.7.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.7.z
Assignee: Josef Karasek
QA Contact: Anping Li
URL:
Whiteboard:
Depends On: 1575820
Blocks:
Reported: 2018-03-16 16:31 UTC by Dan Yocum
Modified: 2018-06-27 07:59 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Curator gave up waiting for Elasticsearch to become ready after 60 seconds.
Consequence: Curator exited with an error status.
Fix: Curator now waits for Elasticsearch with no time limit, using an exponential wait capped at 300 seconds.
Result: Curator is passivated if Elasticsearch is not reachable, instead of crashing. Elasticsearch status is polled every x seconds.
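
The fix described in the Doc Text (wait with no time limit, with an exponential wait capped at 300 seconds) can be illustrated with a short sketch. This is not the code from the actual fix. The function name `wait_until_ready`, the 1-second initial delay, and the doubling factor are assumptions made for illustration; only the 300-second cap comes from the text above.

```python
import time


def wait_until_ready(is_ready, initial_delay=1.0, max_delay=300.0,
                     factor=2.0, sleep=time.sleep):
    """Block until is_ready() returns True; return the number of failed probes.

    Hypothetical helper: there is no overall time limit, and the pause
    between probes grows exponentially until it reaches max_delay.
    """
    delay = initial_delay
    failed_probes = 0
    while not is_ready():
        failed_probes += 1
        sleep(delay)
        # Grow the pause, but never beyond the cap (300 s per the Doc Text).
        delay = min(delay * factor, max_delay)
    return failed_probes
```

With these assumed defaults the pauses would be 1, 2, 4, ... 256 seconds and then 300 seconds for every later probe, so the process stays alive and idle rather than exiting when Elasticsearch is down.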
Clone Of:
Clones: 1564350
Environment:
Last Closed: 2018-06-27 07:59:12 UTC
Target Upstream Version:



Links
System | ID | Priority | Status | Summary | Last Updated
Github | openshift origin-aggregated-logging pull 1096 | None | closed | [release-3.7] Curator continually restarts | 2020-05-04 09:06:40 UTC
Red Hat Product Errata | RHBA-2018:2009 | None | None | None | 2018-06-27 07:59:48 UTC

Description Dan Yocum 2018-03-16 16:31:47 UTC
Description of problem:

The logging-curator pod restarts continually.

Version-Release number of selected component (if applicable):

3.7.23 built on 3/10

How reproducible:

Always

Steps to Reproduce:
1. Install logging

Actual results:

# oc get pods
NAME                                       READY     STATUS    RESTARTS   AGE
logging-curator-6-mnc4r                    1/1       Running   4          19m


Expected results:

No restarts

Additional info:

# oc describe po logging-curator-6-mnc4r 
Name:		logging-curator-6-mnc4r
Namespace:	logging
Node:		ip-172-31-65-241.us-east-2.compute.internal/172.31.65.241
Start Time:	Fri, 16 Mar 2018 16:10:39 +0000
Labels:		component=curator
		deployment=logging-curator-6
		deploymentconfig=logging-curator
		logging-infra=curator
		provider=openshift
Annotations:	kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"logging","name":"logging-curator-6","uid":"7aebf18d-2934-11e8-8fe3-023...
		openshift.io/deployment-config.latest-version=6
		openshift.io/deployment-config.name=logging-curator
		openshift.io/deployment.name=logging-curator-6
		openshift.io/scc=restricted
Status:		Running
IP:		10.130.0.23
Created By:	ReplicationController/logging-curator-6
Controlled By:	ReplicationController/logging-curator-6
Containers:
  curator:
    Container ID:	docker://2141b137581ce972ca554a53a3e9860c787e611f0c978d7592e6d7c1b1f0d619
    Image:		registry.reg-aws.openshift.com:443/openshift3/logging-curator:v3.7
    Image ID:		docker-pullable://registry.reg-aws.openshift.com:443/openshift3/logging-curator@sha256:d75529267ef94e64a859fb7622914decc34b2b854ea2fab7b1d97b7fdf779ab6
    Port:		<none>
    State:		Terminated
      Reason:		Error
      Exit Code:	255
      Started:		Fri, 16 Mar 2018 16:22:22 +0000
      Finished:		Fri, 16 Mar 2018 16:25:58 +0000
    Last State:		Terminated
      Reason:		Error
      Exit Code:	255
      Started:		Fri, 16 Mar 2018 16:18:15 +0000
      Finished:		Fri, 16 Mar 2018 16:21:50 +0000
    Ready:		False
    Restart Count:	3
    Limits:
      memory:	512Mi
    Requests:
      cpu:	25m
      memory:	512Mi
    Environment:
      K8S_HOST_URL:		https://kubernetes.default.svc.cluster.local
      ES_HOST:			logging-es
      ES_PORT:			9200
      ES_CLIENT_CERT:		/etc/curator/keys/cert
      ES_CLIENT_KEY:		/etc/curator/keys/key
      ES_CA:			/etc/curator/keys/ca
      CURATOR_DEFAULT_DAYS:	14
      CURATOR_RUN_HOUR:		0
      CURATOR_RUN_MINUTE:	0
      CURATOR_RUN_TIMEZONE:	UTC
      CURATOR_SCRIPT_LOG_LEVEL:	INFO
      CURATOR_LOG_LEVEL:	WARN
    Mounts:
      /etc/curator/keys from certs (ro)
      /etc/curator/settings from config (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from aggregated-logging-curator-token-4p68h (ro)
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	False 
  PodScheduled 	True 
Volumes:
  certs:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	logging-curator
    Optional:	false
  config:
    Type:	ConfigMap (a volume populated by a ConfigMap)
    Name:	logging-curator
    Optional:	false
  aggregated-logging-curator-token-4p68h:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	aggregated-logging-curator-token-4p68h
    Optional:	false
QoS Class:	Burstable
Node-Selectors:	type=infra
Tolerations:	<none>
Events:
  FirstSeen	LastSeen	Count	From							SubObjectPath			Type		Reason			Message
  ---------	--------	-----	----							-------------			--------	------			-------
  15m		15m		1	default-scheduler									Normal		Scheduled		Successfully assigned logging-curator-6-mnc4r to ip-172-31-65-241.us-east-2.compute.internal
  15m		15m		1	kubelet, ip-172-31-65-241.us-east-2.compute.internal					Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "config" 
  15m		15m		1	kubelet, ip-172-31-65-241.us-east-2.compute.internal					Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "certs" 
  15m		15m		1	kubelet, ip-172-31-65-241.us-east-2.compute.internal					Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "aggregated-logging-curator-token-4p68h" 
  15m		3m		4	kubelet, ip-172-31-65-241.us-east-2.compute.internal	spec.containers{curator}	Normal		Pulling			pulling image "registry.reg-aws.openshift.com:443/openshift3/logging-curator:v3.7"
  15m		3m		4	kubelet, ip-172-31-65-241.us-east-2.compute.internal	spec.containers{curator}	Normal		Pulled			Successfully pulled image "registry.reg-aws.openshift.com:443/openshift3/logging-curator:v3.7"
  15m		3m		4	kubelet, ip-172-31-65-241.us-east-2.compute.internal	spec.containers{curator}	Normal		Created			Created container
  15m		3m		4	kubelet, ip-172-31-65-241.us-east-2.compute.internal	spec.containers{curator}	Normal		Started			Started container
  8m		1s		4	kubelet, ip-172-31-65-241.us-east-2.compute.internal	spec.containers{curator}	Warning		BackOff			Back-off restarting failed container
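
As a side calculation, the Started and Finished timestamps in the two terminated container states above give the length of each run before it exited with code 255. The sketch below only does that arithmetic; the names `RUNS` and `run_seconds` are illustrative, and the timestamps are copied from the `oc describe` output.

```python
from email.utils import parsedate_to_datetime

# (Started, Finished) pairs copied from the "Last State" and "State" sections.
RUNS = [
    ("Fri, 16 Mar 2018 16:18:15 +0000", "Fri, 16 Mar 2018 16:21:50 +0000"),
    ("Fri, 16 Mar 2018 16:22:22 +0000", "Fri, 16 Mar 2018 16:25:58 +0000"),
]


def run_seconds(started, finished):
    """Return the whole seconds elapsed between two RFC 2822 timestamps."""
    delta = parsedate_to_datetime(finished) - parsedate_to_datetime(started)
    return int(delta.total_seconds())


DURATIONS = [run_seconds(started, finished) for started, finished in RUNS]
```

Both runs come out at roughly three and a half minutes (215 and 216 seconds).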

Comment 1 Dan Yocum 2018-03-16 16:34:33 UTC
oc logs logging-curator-6-mnc4r yields no output

Comment 2 Rich Megginson 2018-03-16 16:51:13 UTC
@jcantrill is this the same problem we had previously with a bad upstream merge of curator that got built and pushed?

Comment 3 Jeff Cantrill 2018-03-16 17:45:11 UTC
Possibly.

The version that fixed the missing files per dist-git [1] is v3.7.38-1.

Can we verify whether that is the version in use?

[1] http://pkgs.devel.redhat.com/cgit/rpms/logging-curator-docker/commit/?h=rhaos-3.7-rhel-7&id=13573496cabb60dff1dace3e92ffe3a74fa524e5

Comment 4 Dan Yocum 2018-03-16 21:05:50 UTC
docker inspect says:


 "build-date": "2018-03-16T15:46:54.797125",

 "url": "https://access.redhat.com/containers/#/registry.access.redhat.com/openshift3/logging-curator/images/v3.7.39-1",

Comment 6 Peter Portante 2018-03-18 23:17:58 UTC
Can we also check the state of elasticsearch with the health.sh script? [1]

[1] https://gist.githubusercontent.com/portante/c801acce8ae275fdcfc743b346981b13/raw/45cff354ec4a96882f6abc0725b2b6a646c5003c/health.sh

Comment 7 Josef Karasek 2018-03-19 13:45:03 UTC
@Dan if there's nothing in the log, it might mean that curator can't reach[0] elasticsearch. Can you enable debug logging to get more info?

oc set env dc/logging-curator --overwrite CURATOR_SCRIPT_LOG_LEVEL=DEBUG CURATOR_LOG_LEVEL=DEBUG
oc rollout latest logging-curator

[0] https://github.com/openshift/origin-aggregated-logging/blob/release-3.7/curator/run.sh#L21

Comment 14 Jeff Cantrill 2018-04-05 14:03:56 UTC
https://github.com/openshift/origin-aggregated-logging/pull/1064

We will need to clone to put this in 3.9

Comment 15 Josef Karasek 2018-04-06 03:35:43 UTC
cloned https://bugzilla.redhat.com/show_bug.cgi?id=1564350

Comment 17 Jeff Cantrill 2018-05-02 15:37:11 UTC
3.7 CI is broken. While trying to determine whether the cherry-pick is good, I get the following, even though the pod does not appear to restart constantly:

Deployed using latest 3.7 ansible and image built from the PR in c#16


WARNING:elasticsearch:HEAD https://logging-es:9200/ [status:N/A request:0.009s]
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 95, in perform_request
    response = self.pool.urlopen(method, url, body, retries=False, headers=self.headers, **kw)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python2.7/site-packages/urllib3/util/retry.py", line 333, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 346, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 850, in _validate_conn
    conn.connect()
  File "/usr/lib/python2.7/site-packages/urllib3/connection.py", line 326, in connect
    ssl_context=context)
  File "/usr/lib/python2.7/site-packages/urllib3/util/ssl_.py", line 329, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/lib64/python2.7/ssl.py", line 350, in wrap_socket
    _context=self)
  File "/usr/lib64/python2.7/ssl.py", line 611, in __init__
    self.do_handshake()
  File "/usr/lib64/python2.7/ssl.py", line 833, in do_handshake
    self._sslobj.do_handshake()
SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:579)
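
A CERTIFICATE_VERIFY_FAILED error like the one above means the client could not validate the server certificate against the CA it was given. One way to separate that case from a plain connectivity failure is a standalone handshake probe. The sketch below is a diagnostic illustration only, written with the Python 3 standard library (the traceback is from Python 2.7), and it is not part of curator. The helper names `build_es_ssl_context` and `check_handshake` are hypothetical.

```python
import socket
import ssl


def build_es_ssl_context(ca_path=None, cert_path=None, key_path=None):
    """Create a client TLS context that verifies the server certificate."""
    # create_default_context turns on CERT_REQUIRED and hostname checking.
    # With ca_path=None the system trust store is used instead of a CA file.
    ctx = ssl.create_default_context(cafile=ca_path)
    if cert_path:
        # Present a client certificate (mutual TLS).
        ctx.load_cert_chain(certfile=cert_path, keyfile=key_path)
    return ctx


def check_handshake(host, port, ctx, timeout=5.0):
    """Attempt one TLS handshake and return a (status, detail) tuple."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return ("ok", "")
    except ssl.SSLCertVerificationError as exc:
        # The same class of failure as CERTIFICATE_VERIFY_FAILED above.
        return ("cert_verify_failed", exc.verify_message)
    except ssl.SSLError as exc:
        return ("tls_error", str(exc))
    except OSError as exc:
        # Refused connection, DNS failure, timeout, and similar.
        return ("unreachable", str(exc))
```

Inside the curator pod, the values from the pod environment shown earlier in this report could be passed in, for example `check_handshake("logging-es", 9200, build_es_ssl_context("/etc/curator/keys/ca", "/etc/curator/keys/cert", "/etc/curator/keys/key"))`, assuming a Python 3 interpreter is available there.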

Comment 19 Anping Li 2018-05-08 09:43:24 UTC
Waiting for the new images.

Comment 21 Junqi Zhao 2018-05-11 02:05:08 UTC
A new curator image is needed; the latest image is still logging-curator/images/v3.7.46-1.

Comment 22 Junqi Zhao 2018-05-11 02:31:53 UTC
# oc logs logging-curator-1-sbmpq -n logging
sh: run.sh: No such file or directory

Comment 23 Jeff Cantrill 2018-05-11 14:29:19 UTC
This requires a new build to satisfy.

Comment 25 Josef Karasek 2018-05-17 09:34:42 UTC
`docker inspect logging-curator/images:v3.7.48-1` shows correct CMD

Comment 26 Anping Li 2018-05-22 07:59:31 UTC
works well with logging-curator:v3.7.48

Comment 28 errata-xmlrpc 2018-06-27 07:59:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2009

