Created attachment 1219294 [details]
deployer_pod_log_1st_upgrade

Description of problem:
Upgrading the logging stacks from the 3.3.1 level to the 3.4.0 level first failed with:

+ oc delete daemonset logging-fluentd
error: timed out waiting for the condition

Rerunning with MODE=upgrade, it failed again:

logging-deployer-e7n25        0/1   Error              0   40m
logging-deployer-ppviz        0/1   Completed          0   1h
logging-deployer-sefg4        0/1   Error              0   22m
logging-es-ujttvlhx-4-qfvwc   0/1   CrashLoopBackOff   3   1m

Version-Release number of selected component (if applicable):
brew registry: openshift3/logging-deployer 3.4.0 c364ab9c2f75

# openshift version
openshift v3.4.0.23+24b1a58
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

How reproducible:
Always

Steps to Reproduce:
1. Install OpenShift 3.4.0
2. Deploy logging at the 3.3.1 level: IMAGE_PREFIX=brew...:xxxx/openshift3/,IMAGE_VERSION=3.3.1,MODE=install
3. Configure a hostPath PV for Elasticsearch
4. Upgrade the logging stacks:
$ oadm policy add-cluster-role-to-user cluster-admin xiazhao
$ oc delete template logging-deployer-account-template logging-deployer-template
$ oc create -f https://raw.githubusercontent.com/openshift/origin-aggregated-logging/master/deployer/deployer.yaml
$ oc new-app logging-deployer-account-template
$ oc get template logging-deployer-template -o yaml -n logging | sed 's/\(image:\s.*\)logging-deployment\(.*\)/\1logging-deployer\2/g' | oc apply -n logging -f -
$ oc policy add-role-to-user edit --serviceaccount logging-deployer
$ oc policy add-role-to-user daemonset-admin --serviceaccount logging-deployer
$ oadm policy add-cluster-role-to-user oauth-editor system:serviceaccount:logging:logging-deployer
$ oadm policy add-cluster-role-to-user rolebinding-reader system:serviceaccount:logging:aggregated-logging-elasticsearch
$ oc new-app logging-deployer-template -p IMAGE_PREFIX=brew...:xxxx/openshift3/,IMAGE_VERSION=3.4.0,MODE=upgrade
5. Check the upgrade result
6. Rerun the upgrade and check the result again:
$ oc new-app logging-deployer-template -p IMAGE_PREFIX=brew...:xxxx/openshift3/,IMAGE_VERSION=3.4.0,MODE=upgrade

Actual results:
5., 6.
# oc get po
NAME                          READY   STATUS    RESTARTS   AGE
logging-curator-1-xncx8       1/1     Running   0          32m
logging-deployer-e7n25        0/1     Error     0          15m
logging-es-ujttvlhx-2-ahto8   1/1     Running   0          22m
logging-deployer-sefg4        0/1     Error     0          22m
logging-kibana-1-ohveh        2/2     Running   0          31m

Expected results:
The upgrade to 3.4.0 completes successfully.

Additional info:
Deployer pod logs attached.
Created attachment 1219296 [details]
deployer_pod_log_2nd_upgrade
Created attachment 1219298 [details]
configmap used
I'm not sure what caused the first upgrade to fail. After deleting the daemonset we should see the deployer loop and wait for the fluentd pods to stop, but it appears to have failed while issuing the delete itself. I'm not able to recreate this.

The second failure looks like it timed out waiting to confirm that the components started back up. The Elasticsearch pod was having errors starting up (CrashLoopBackOff), so it is expected that the deployer would fail in that case. Could you post the logs for the failing ES pod here?
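For reference, a minimal sketch of the kind of wait loop described above -- not the deployer's actual code. It assumes fluentd pods carry the label logging-infra=fluentd, as the scaleDown trace below shows:

# Sketch only: poll until no fluentd pods remain, with a bounded retry count.
wait_for_fluentd_to_stop() {
  local tries=0 max_tries=60
  while oc get pods -l logging-infra=fluentd -o name | grep -q .; do
    if [ "$tries" -ge "$max_tries" ]; then
      echo "timed out waiting for fluentd pods to terminate" >&2
      return 1
    fi
    sleep 5
    tries=$((tries + 1))
  done
}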
Issue reproduced while upgrading with this latest deployer image:

brew...:xxxx/openshift3/logging-deployer 3.4.0 08eaf2753130 2 days ago 764.3 MB

# openshift version
openshift v3.4.0.25+1f36858
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

$ oc get po
NAME                          READY   STATUS      RESTARTS   AGE
logging-curator-1-03b1r       1/1     Running     0          26m
logging-deployer-6mnzo        0/1     Completed   0          27m
logging-deployer-pazr6        0/1     Error       0          13m
logging-es-8nm18kcw-2-63gnd   1/1     Running     0          22m
logging-kibana-1-70r1l        2/2     Running     1          26m

$ oc logs -f logging-deployer-pazr6
...
+ scaleDown
+ fluentd_dc=(`oc get dc -l logging-infra=fluentd -o jsonpath='{.items[*].metadata.name}'`)
++ oc get dc -l logging-infra=fluentd -o 'jsonpath={.items[*].metadata.name}'
No resources found.
+ local fluentd_dc
+ [[ -z '' ]]
++ oc get daemonset -l logging-infra=fluentd -o 'jsonpath={.items[*].spec.template.spec.nodeSelector}'
+ local 'selector=map[logging-infra-fluentd:true]'
+ [[ -n map[logging-infra-fluentd:true] ]]
++ sed s/:/=/g
++ echo logging-infra-fluentd:true
+ fluentd_nodeselector=logging-infra-fluentd=true
+ oc delete daemonset logging-fluentd
error: timed out waiting for the condition
This can also be recreated by running:

$ oc delete daemonset/logging-fluentd

If we surround it with 'date' commands, we see that it times out after 5 minutes. Adding '--timeout=360s' to the oc delete did not resolve this; it still timed out after 5 minutes:

# date; oc delete daemonset/logging-fluentd --timeout=360s; date
Mon Nov 14 12:14:48 EST 2016
error: timed out waiting for the condition
Mon Nov 14 12:19:48 EST 2016

Adding --grace-period=360 causes this to still time out:

# date; oc delete daemonset/logging-fluentd --grace-period=360; date
Mon Nov 14 12:25:21 EST 2016
error: timed out waiting for the condition
Mon Nov 14 12:30:22 EST 2016

# date; oc delete daemonset/logging-fluentd --grace-period=360 --timeout=360s; date
Mon Nov 14 12:34:48 EST 2016
error: timed out waiting for the condition
Mon Nov 14 12:39:49 EST 2016

Will update the deployer so the oc delete will not cause the upgrade to error out.
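One plausible shape for that change -- a sketch under the assumptions above, not necessarily the actual patch -- is to treat the client-side timeout as non-fatal and instead poll until the daemonset object is actually gone:

# Sketch only: tolerate the delete's client-side timeout, since deletion
# continues server-side, then wait for the object to disappear.
delete_fluentd_daemonset() {
  oc delete daemonset logging-fluentd || :

  local tries=0 max_tries=120
  while oc get daemonset logging-fluentd >/dev/null 2>&1; do
    if [ "$tries" -ge "$max_tries" ]; then
      echo "daemonset logging-fluentd still present after waiting" >&2
      return 1
    fi
    sleep 5
    tries=$((tries + 1))
  done
}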
Should be resolved as of:

12091903 buildContainer (noarch) completed successfully

koji_builds:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=524330

repositories:
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:rhaos-3.4-rhel-7-docker-candidate-20161114141130
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:v3.4.0.26-2
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:v3.4.0.26
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:latest
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:v3.4
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:3.4.0
Tested with the new image; it's fixed. The upgrade finished successfully, and logging worked fine after upgrading. Setting to VERIFIED.

Image tested:
brew....:xxxx/openshift3/logging-deployer 3.4.0 b8847e716761 11 hours ago 762.7 MB
Prerelease issue, no docs needed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066