Description of problem:
The logging-curator pod restarts continually.

Version-Release number of selected component (if applicable):
3.7.23 built on 3/10

How reproducible:
Always

Steps to Reproduce:
1. Install logging

Actual results:
# oc get pods
NAME                      READY     STATUS    RESTARTS   AGE
logging-curator-6-mnc4r   1/1       Running   4          19m

Expected results:
No restarts

Additional info:
# oc describe po logging-curator-6-mnc4r
Name:           logging-curator-6-mnc4r
Namespace:      logging
Node:           ip-172-31-65-241.us-east-2.compute.internal/172.31.65.241
Start Time:     Fri, 16 Mar 2018 16:10:39 +0000
Labels:         component=curator
                deployment=logging-curator-6
                deploymentconfig=logging-curator
                logging-infra=curator
                provider=openshift
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"logging","name":"logging-curator-6","uid":"7aebf18d-2934-11e8-8fe3-023...
                openshift.io/deployment-config.latest-version=6
                openshift.io/deployment-config.name=logging-curator
                openshift.io/deployment.name=logging-curator-6
                openshift.io/scc=restricted
Status:         Running
IP:             10.130.0.23
Created By:     ReplicationController/logging-curator-6
Controlled By:  ReplicationController/logging-curator-6
Containers:
  curator:
    Container ID:   docker://2141b137581ce972ca554a53a3e9860c787e611f0c978d7592e6d7c1b1f0d619
    Image:          registry.reg-aws.openshift.com:443/openshift3/logging-curator:v3.7
    Image ID:       docker-pullable://registry.reg-aws.openshift.com:443/openshift3/logging-curator@sha256:d75529267ef94e64a859fb7622914decc34b2b854ea2fab7b1d97b7fdf779ab6
    Port:           <none>
    State:          Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Fri, 16 Mar 2018 16:22:22 +0000
      Finished:     Fri, 16 Mar 2018 16:25:58 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Fri, 16 Mar 2018 16:18:15 +0000
      Finished:     Fri, 16 Mar 2018 16:21:50 +0000
    Ready:          False
    Restart Count:  3
    Limits:
      memory:  512Mi
    Requests:
      cpu:     25m
      memory:  512Mi
    Environment:
      K8S_HOST_URL:              https://kubernetes.default.svc.cluster.local
      ES_HOST:                   logging-es
      ES_PORT:                   9200
      ES_CLIENT_CERT:            /etc/curator/keys/cert
      ES_CLIENT_KEY:             /etc/curator/keys/key
      ES_CA:                     /etc/curator/keys/ca
      CURATOR_DEFAULT_DAYS:      14
      CURATOR_RUN_HOUR:          0
      CURATOR_RUN_MINUTE:        0
      CURATOR_RUN_TIMEZONE:      UTC
      CURATOR_SCRIPT_LOG_LEVEL:  INFO
      CURATOR_LOG_LEVEL:         WARN
    Mounts:
      /etc/curator/keys from certs (ro)
      /etc/curator/settings from config (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from aggregated-logging-curator-token-4p68h (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  logging-curator
    Optional:    false
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      logging-curator
    Optional:  false
  aggregated-logging-curator-token-4p68h:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  aggregated-logging-curator-token-4p68h
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  type=infra
Tolerations:     <none>
Events:
  FirstSeen  LastSeen  Count  From                                                  SubObjectPath             Type     Reason                 Message
  ---------  --------  -----  ----                                                  -------------             ----     ------                 -------
  15m        15m       1      default-scheduler                                                               Normal   Scheduled              Successfully assigned logging-curator-6-mnc4r to ip-172-31-65-241.us-east-2.compute.internal
  15m        15m       1      kubelet, ip-172-31-65-241.us-east-2.compute.internal                            Normal   SuccessfulMountVolume  MountVolume.SetUp succeeded for volume "config"
  15m        15m       1      kubelet, ip-172-31-65-241.us-east-2.compute.internal                            Normal   SuccessfulMountVolume  MountVolume.SetUp succeeded for volume "certs"
  15m        15m       1      kubelet, ip-172-31-65-241.us-east-2.compute.internal                            Normal   SuccessfulMountVolume  MountVolume.SetUp succeeded for volume "aggregated-logging-curator-token-4p68h"
  15m        3m        4      kubelet, ip-172-31-65-241.us-east-2.compute.internal  spec.containers{curator}  Normal   Pulling                pulling image "registry.reg-aws.openshift.com:443/openshift3/logging-curator:v3.7"
  15m        3m        4      kubelet, ip-172-31-65-241.us-east-2.compute.internal  spec.containers{curator}  Normal   Pulled                 Successfully pulled image "registry.reg-aws.openshift.com:443/openshift3/logging-curator:v3.7"
  15m        3m        4      kubelet, ip-172-31-65-241.us-east-2.compute.internal  spec.containers{curator}  Normal   Created                Created container
  15m        3m        4      kubelet, ip-172-31-65-241.us-east-2.compute.internal  spec.containers{curator}  Normal   Started                Started container
  8m         1s        4      kubelet, ip-172-31-65-241.us-east-2.compute.internal  spec.containers{curator}  Warning  BackOff                Back-off restarting failed container
`oc logs logging-curator-6-mnc4r` yields no output.
@jcantrill is this the same problem we had previously with a bad upstream merge of curator that got built and pushed?
Possibly. The version that fixed the missing files per dist-git [1] is v3.7.38-1. Can we verify whether that is the version in use?

[1] http://pkgs.devel.redhat.com/cgit/rpms/logging-curator-docker/commit/?h=rhaos-3.7-rhel-7&id=13573496cabb60dff1dace3e92ffe3a74fa524e5
docker inspect says:

"build-date": "2018-03-16T15:46:54.797125",
"url": "https://access.redhat.com/containers/#/registry.access.redhat.com/openshift3/logging-curator/images/v3.7.39-1",
Can we also check the state of Elasticsearch with the health.sh script? [1]

[1] https://gist.githubusercontent.com/portante/c801acce8ae275fdcfc743b346981b13/raw/45cff354ec4a96882f6abc0725b2b6a646c5003c/health.sh
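The linked health.sh boils down to querying Elasticsearch's standard `_cluster/health` endpoint and inspecting the cluster status. A minimal Python sketch of that check, using a sample response body for illustration (the in-cluster command in the comment uses the service and cert paths from the pod spec in this report):

```python
import json

def cluster_status(health_json):
    """Extract the status field ("green" / "yellow" / "red") from an
    Elasticsearch _cluster/health response body."""
    return json.loads(health_json)["status"]

# In the cluster, the JSON would come from something like:
#   oc exec <es-pod> -- curl -s --cacert /etc/curator/keys/ca \
#       --cert /etc/curator/keys/cert --key /etc/curator/keys/key \
#       https://logging-es:9200/_cluster/health
sample = '{"cluster_name":"logging-es","status":"green","number_of_nodes":1}'
print(cluster_status(sample))  # -> green
```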
@Dan if there's nothing in the log, it might mean that curator can't reach [0] Elasticsearch. Can you enable debug logging to get more info?

oc set env dc/logging-curator --overwrite CURATOR_SCRIPT_LOG_LEVEL=DEBUG CURATOR_LOG_LEVEL=DEBUG
oc rollout latest logging-curator

[0] https://github.com/openshift/origin-aggregated-logging/blob/release-3.7/curator/run.sh#L21
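For context on why a healthy curator pod can sit silent for hours: the run.sh wrapper sleeps until the configured CURATOR_RUN_HOUR/CURATOR_RUN_MINUTE before invoking curator. A simplified sketch of that scheduling computation (not the actual run.sh code; CURATOR_RUN_TIMEZONE handling is omitted):

```python
from datetime import datetime, timedelta

def seconds_until_next_run(now, run_hour, run_minute):
    """Seconds from `now` until the next daily HH:MM run slot.

    Simplified model of the wait done by the curator run.sh wrapper
    based on CURATOR_RUN_HOUR / CURATOR_RUN_MINUTE; the real script
    also honors CURATOR_RUN_TIMEZONE, which is ignored here.
    """
    target = now.replace(hour=run_hour, minute=run_minute,
                         second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)  # today's slot already passed
    return int((target - now).total_seconds())

# With CURATOR_RUN_HOUR=0, CURATOR_RUN_MINUTE=0 (the pod's settings),
# a pod started at 16:10 UTC waits until the next midnight:
print(seconds_until_next_run(datetime(2018, 3, 16, 16, 10), 0, 0))  # -> 28200
```

So an absence of output alone does not distinguish "waiting for the run slot" from "crashed before logging anything"; the restart count and exit code 255 are what point at a crash here.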
https://github.com/openshift/origin-aggregated-logging/pull/1064

We will need to clone this bug to get the fix into 3.9.
Cloned: https://bugzilla.redhat.com/show_bug.cgi?id=1564350
3.7 PR: https://github.com/openshift/origin-aggregated-logging/pull/1096
3.7 CI is broken. While trying to determine whether the cherry-pick is good, I get the following, even though the pod does not appear to constantly restart. Deployed using the latest 3.7 ansible and an image built from the PR in c#16:

WARNING:elasticsearch:HEAD https://logging-es:9200/ [status:N/A request:0.009s]
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 95, in perform_request
    response = self.pool.urlopen(method, url, body, retries=False, headers=self.headers, **kw)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python2.7/site-packages/urllib3/util/retry.py", line 333, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 346, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 850, in _validate_conn
    conn.connect()
  File "/usr/lib/python2.7/site-packages/urllib3/connection.py", line 326, in connect
    ssl_context=context)
  File "/usr/lib/python2.7/site-packages/urllib3/util/ssl_.py", line 329, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/lib64/python2.7/ssl.py", line 350, in wrap_socket
    _context=self)
  File "/usr/lib64/python2.7/ssl.py", line 611, in __init__
    self.do_handshake()
  File "/usr/lib64/python2.7/ssl.py", line 833, in do_handshake
    self._sslobj.do_handshake()
SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:579)
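The CERTIFICATE_VERIFY_FAILED at do_handshake() means the server certificate presented by logging-es does not chain to the CA the client trusts. A minimal sketch of the TLS setup such a client needs, assuming the mounted paths from the pod spec in this report; this is not the curator code itself:

```python
import ssl

def make_es_ssl_context(cafile, certfile, keyfile):
    """Build a TLS context that verifies the server against a specific
    CA and presents a client certificate, as the curator's connection
    to Elasticsearch requires. CERTIFICATE_VERIFY_FAILED during the
    handshake means the server cert does not chain to `cafile` (e.g.
    the ES certs were regenerated but the client still holds the old
    CA)."""
    ctx = ssl.create_default_context(cafile=cafile)          # trust only this CA
    ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)  # client auth
    return ctx

# In the pod this would use the mounted secret (paths from the pod spec):
#   make_es_ssl_context("/etc/curator/keys/ca",
#                       "/etc/curator/keys/cert",
#                       "/etc/curator/keys/key")

# Default contexts always require server-certificate verification:
print(ssl.create_default_context().verify_mode == ssl.CERT_REQUIRED)  # -> True
```

A mismatch like this in CI usually points at stale or regenerated secrets rather than the code change under test.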
Waiting for the new images.
We need a new curator image; the latest available is still logging-curator/images/v3.7.46-1.
# oc logs logging-curator-1-sbmpq -n logging
sh: run.sh: No such file or directory
This requires a new image build to resolve.
`docker inspect logging-curator/images:v3.7.48-1` shows correct CMD
Works well with logging-curator:v3.7.48.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2009