Description of problem:
The logging-curator pod restarts continually.

Version-Release number of selected component (if applicable):
3.7.23 built on 3/10

How reproducible:
Always

Steps to Reproduce:
1. Install logging

Actual results:
# oc get pods
NAME                      READY     STATUS    RESTARTS   AGE
logging-curator-6-mnc4r   1/1       Running   4          19m

Expected results:
No restarts

Additional info:
# oc describe po logging-curator-6-mnc4r
Name:           logging-curator-6-mnc4r
Namespace:      logging
Node:           ip-172-31-65-241.us-east-2.compute.internal/172.31.65.241
Start Time:     Fri, 16 Mar 2018 16:10:39 +0000
Labels:         component=curator
                deployment=logging-curator-6
                deploymentconfig=logging-curator
                logging-infra=curator
                provider=openshift
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"logging","name":"logging-curator-6","uid":"7aebf18d-2934-11e8-8fe3-023...
                openshift.io/deployment-config.latest-version=6
                openshift.io/deployment-config.name=logging-curator
                openshift.io/deployment.name=logging-curator-6
                openshift.io/scc=restricted
Status:         Running
IP:             10.130.0.23
Created By:     ReplicationController/logging-curator-6
Controlled By:  ReplicationController/logging-curator-6
Containers:
  curator:
    Container ID:   docker://2141b137581ce972ca554a53a3e9860c787e611f0c978d7592e6d7c1b1f0d619
    Image:          registry.reg-aws.openshift.com:443/openshift3/logging-curator:v3.7
    Image ID:       docker-pullable://registry.reg-aws.openshift.com:443/openshift3/logging-curator@sha256:d75529267ef94e64a859fb7622914decc34b2b854ea2fab7b1d97b7fdf779ab6
    Port:           <none>
    State:          Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Fri, 16 Mar 2018 16:22:22 +0000
      Finished:     Fri, 16 Mar 2018 16:25:58 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Fri, 16 Mar 2018 16:18:15 +0000
      Finished:     Fri, 16 Mar 2018 16:21:50 +0000
    Ready:          False
    Restart Count:  3
    Limits:
      memory:  512Mi
    Requests:
      cpu:     25m
      memory:  512Mi
    Environment:
      K8S_HOST_URL:              https://kubernetes.default.svc.cluster.local
      ES_HOST:                   logging-es
      ES_PORT:                   9200
      ES_CLIENT_CERT:            /etc/curator/keys/cert
      ES_CLIENT_KEY:             /etc/curator/keys/key
      ES_CA:                     /etc/curator/keys/ca
      CURATOR_DEFAULT_DAYS:      14
      CURATOR_RUN_HOUR:          0
      CURATOR_RUN_MINUTE:        0
      CURATOR_RUN_TIMEZONE:      UTC
      CURATOR_SCRIPT_LOG_LEVEL:  INFO
      CURATOR_LOG_LEVEL:         WARN
    Mounts:
      /etc/curator/keys from certs (ro)
      /etc/curator/settings from config (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from aggregated-logging-curator-token-4p68h (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  logging-curator
    Optional:    false
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      logging-curator
    Optional:  false
  aggregated-logging-curator-token-4p68h:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  aggregated-logging-curator-token-4p68h
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  type=infra
Tolerations:     <none>
Events:
  FirstSeen  LastSeen  Count  From                                                  SubObjectPath             Type     Reason                 Message
  ---------  --------  -----  ----                                                  -------------             ----     ------                 -------
  15m        15m       1      default-scheduler                                                               Normal   Scheduled              Successfully assigned logging-curator-6-mnc4r to ip-172-31-65-241.us-east-2.compute.internal
  15m        15m       1      kubelet, ip-172-31-65-241.us-east-2.compute.internal                            Normal   SuccessfulMountVolume  MountVolume.SetUp succeeded for volume "config"
  15m        15m       1      kubelet, ip-172-31-65-241.us-east-2.compute.internal                            Normal   SuccessfulMountVolume  MountVolume.SetUp succeeded for volume "certs"
  15m        15m       1      kubelet, ip-172-31-65-241.us-east-2.compute.internal                            Normal   SuccessfulMountVolume  MountVolume.SetUp succeeded for volume "aggregated-logging-curator-token-4p68h"
  15m        3m        4      kubelet, ip-172-31-65-241.us-east-2.compute.internal  spec.containers{curator}  Normal   Pulling                pulling image "registry.reg-aws.openshift.com:443/openshift3/logging-curator:v3.7"
  15m        3m        4      kubelet, ip-172-31-65-241.us-east-2.compute.internal  spec.containers{curator}  Normal   Pulled                 Successfully pulled image "registry.reg-aws.openshift.com:443/openshift3/logging-curator:v3.7"
  15m        3m        4      kubelet, ip-172-31-65-241.us-east-2.compute.internal  spec.containers{curator}  Normal   Created                Created container
  15m        3m        4      kubelet, ip-172-31-65-241.us-east-2.compute.internal  spec.containers{curator}  Normal   Started                Started container
  8m         1s        4      kubelet, ip-172-31-65-241.us-east-2.compute.internal  spec.containers{curator}  Warning  BackOff                Back-off restarting failed container
`oc logs logging-curator-6-mnc4r` yields no output.
@jcantrill is this the same problem we had previously with a bad upstream merge of curator that got built and pushed?
Possibly. The version that fixed the missing files per dist-git [1] is v3.7.38-1. Can we verify whether that is the version in use?

[1] http://pkgs.devel.redhat.com/cgit/rpms/logging-curator-docker/commit/?h=rhaos-3.7-rhel-7&id=13573496cabb60dff1dace3e92ffe3a74fa524e5
docker inspect says:

"build-date": "2018-03-16T15:46:54.797125",
"url": "https://access.redhat.com/containers/#/registry.access.redhat.com/openshift3/logging-curator/images/v3.7.39-1",
Can we also check the state of Elasticsearch with the health.sh script? [1]

[1] https://gist.githubusercontent.com/portante/c801acce8ae275fdcfc743b346981b13/raw/45cff354ec4a96882f6abc0725b2b6a646c5003c/health.sh
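The linked health.sh boils down to querying Elasticsearch's standard `_cluster/health` endpoint and inspecting the cluster status. A minimal Python sketch of that check, using a sample response body for illustration (the in-cluster command in the comment uses the service and cert paths from the pod spec in this report):

```python
import json

def cluster_status(health_json):
    """Extract the status field ("green" / "yellow" / "red") from an
    Elasticsearch _cluster/health response body."""
    return json.loads(health_json)["status"]

# In the cluster, the JSON would come from something like:
#   oc exec <es-pod> -- curl -s --cacert /etc/curator/keys/ca \
#       --cert /etc/curator/keys/cert --key /etc/curator/keys/key \
#       https://logging-es:9200/_cluster/health
sample = '{"cluster_name":"logging-es","status":"green","number_of_nodes":1}'
print(cluster_status(sample))  # -> green
```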
@Dan if there's nothing in the log, it might mean that curator can't reach [0] Elasticsearch. Can you enable debug logging to get more info?

oc set env dc/logging-curator --overwrite CURATOR_SCRIPT_LOG_LEVEL=DEBUG CURATOR_LOG_LEVEL=DEBUG
oc rollout latest logging-curator

[0] https://github.com/openshift/origin-aggregated-logging/blob/release-3.7/curator/run.sh#L21
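For context on why a healthy curator pod can sit silent for hours: the run.sh wrapper sleeps until the configured CURATOR_RUN_HOUR/CURATOR_RUN_MINUTE before invoking curator. A simplified sketch of that scheduling computation (not the actual run.sh code; CURATOR_RUN_TIMEZONE handling is omitted):

```python
from datetime import datetime, timedelta

def seconds_until_next_run(now, run_hour, run_minute):
    """Seconds from `now` until the next daily HH:MM run slot.

    Simplified model of the wait done by the curator run.sh wrapper
    based on CURATOR_RUN_HOUR / CURATOR_RUN_MINUTE; the real script
    also honors CURATOR_RUN_TIMEZONE, which is ignored here.
    """
    target = now.replace(hour=run_hour, minute=run_minute,
                         second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)  # today's slot already passed
    return int((target - now).total_seconds())

# With CURATOR_RUN_HOUR=0, CURATOR_RUN_MINUTE=0 (the pod's settings),
# a pod started at 16:10 UTC waits until the next midnight:
print(seconds_until_next_run(datetime(2018, 3, 16, 16, 10), 0, 0))  # -> 28200
```

So an absence of output alone does not distinguish "waiting for the run slot" from "crashed before logging anything"; the restart count and exit code 255 are what point at a crash here.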
https://github.com/openshift/origin-aggregated-logging/pull/1064

We will need to clone this bug to get the fix into 3.9.
Cloned: https://bugzilla.redhat.com/show_bug.cgi?id=1564350
3.7 PR: https://github.com/openshift/origin-aggregated-logging/pull/1096
3.7 CI is broken. While trying to determine whether the cherry-pick is good, I get the following, even though the pod does not appear to constantly restart. Deployed using the latest 3.7 ansible and an image built from the PR in c#16:

WARNING:elasticsearch:HEAD https://logging-es:9200/ [status:N/A request:0.009s]
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 95, in perform_request
    response = self.pool.urlopen(method, url, body, retries=False, headers=self.headers, **kw)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python2.7/site-packages/urllib3/util/retry.py", line 333, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 346, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 850, in _validate_conn
    conn.connect()
  File "/usr/lib/python2.7/site-packages/urllib3/connection.py", line 326, in connect
    ssl_context=context)
  File "/usr/lib/python2.7/site-packages/urllib3/util/ssl_.py", line 329, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/lib64/python2.7/ssl.py", line 350, in wrap_socket
    _context=self)
  File "/usr/lib64/python2.7/ssl.py", line 611, in __init__
    self.do_handshake()
  File "/usr/lib64/python2.7/ssl.py", line 833, in do_handshake
    self._sslobj.do_handshake()
SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:579)
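The CERTIFICATE_VERIFY_FAILED at do_handshake() means the server certificate presented by logging-es does not chain to the CA the client trusts. A minimal sketch of the TLS setup such a client needs, assuming the mounted paths from the pod spec in this report; this is not the curator code itself:

```python
import ssl

def make_es_ssl_context(cafile, certfile, keyfile):
    """Build a TLS context that verifies the server against a specific
    CA and presents a client certificate, as the curator's connection
    to Elasticsearch requires. CERTIFICATE_VERIFY_FAILED during the
    handshake means the server cert does not chain to `cafile` (e.g.
    the ES certs were regenerated but the client still holds the old
    CA)."""
    ctx = ssl.create_default_context(cafile=cafile)          # trust only this CA
    ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)  # client auth
    return ctx

# In the pod this would use the mounted secret (paths from the pod spec):
#   make_es_ssl_context("/etc/curator/keys/ca",
#                       "/etc/curator/keys/cert",
#                       "/etc/curator/keys/key")

# Default contexts always require server-certificate verification:
print(ssl.create_default_context().verify_mode == ssl.CERT_REQUIRED)  # -> True
```

A mismatch like this in CI usually points at stale or regenerated secrets rather than the code change under test.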
Waiting for the new images.
We need a new curator image; the latest available is still logging-curator/images/v3.7.46-1.
# oc logs logging-curator-1-sbmpq -n logging
sh: run.sh: No such file or directory
This requires a new image build to resolve.
`docker inspect logging-curator/images:v3.7.48-1` shows correct CMD
Works well with logging-curator:v3.7.48.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2009