Created attachment 1219294 [details]
deployer_pod_log_1st_upgrade

Description of problem:
Upgrading the logging stacks from the 3.3.1 level to the 3.4.0 level first failed with:

+ oc delete daemonset logging-fluentd
error: timed out waiting for the condition

Rerunning with MODE=upgrade, it failed again:

logging-deployer-e7n25        0/1   Error              0   40m
logging-deployer-ppviz        0/1   Completed          0   1h
logging-deployer-sefg4        0/1   Error              0   22m
logging-es-ujttvlhx-4-qfvwc   0/1   CrashLoopBackOff   3   1m

Version-Release number of selected component (if applicable):
brew registry: openshift3/logging-deployer 3.4.0 c364ab9c2f75

# openshift version
openshift v3.4.0.23+24b1a58
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

How reproducible:
Always

Steps to Reproduce:
1. Install OpenShift 3.4.0
2. Deploy logging at the 3.3.1 level: IMAGE_PREFIX=brew...:xxxx/openshift3/,IMAGE_VERSION=3.3.1,MODE=install
3. Configure a hostPath PV for Elasticsearch
4. Upgrade the logging stacks:
$ oadm policy add-cluster-role-to-user cluster-admin xiazhao
$ oc delete template logging-deployer-account-template logging-deployer-template
$ oc create -f https://raw.githubusercontent.com/openshift/origin-aggregated-logging/master/deployer/deployer.yaml
$ oc new-app logging-deployer-account-template
$ oc get template logging-deployer-template -o yaml -n logging | sed 's/\(image:\s.*\)logging-deployment\(.*\)/\1logging-deployer\2/g' | oc apply -n logging -f -
$ oc policy add-role-to-user edit --serviceaccount logging-deployer
$ oc policy add-role-to-user daemonset-admin --serviceaccount logging-deployer
$ oadm policy add-cluster-role-to-user oauth-editor system:serviceaccount:logging:logging-deployer
$ oadm policy add-cluster-role-to-user rolebinding-reader system:serviceaccount:logging:aggregated-logging-elasticsearch
$ oc new-app logging-deployer-template -p IMAGE_PREFIX=brew...:xxxx/openshift3/,IMAGE_VERSION=3.4.0,MODE=upgrade
5. Check the upgrade result
6. Rerun the upgrade and check the result again:
$ oc new-app logging-deployer-template -p IMAGE_PREFIX=brew...:xxxx/openshift3/,IMAGE_VERSION=3.4.0,MODE=upgrade

Actual results:
5., 6.
# oc get po
NAME                          READY   STATUS    RESTARTS   AGE
logging-curator-1-xncx8       1/1     Running   0          32m
logging-deployer-e7n25        0/1     Error     0          15m
logging-es-ujttvlhx-2-ahto8   1/1     Running   0          22m
logging-deployer-sefg4        0/1     Error     0          22m
logging-kibana-1-ohveh        2/2     Running   0          31m

Expected results:
The upgrade to 3.4.0 completes successfully.

Additional info:
Deployer pod logs attached.
Created attachment 1219296 [details]
deployer_pod_log_2nd_upgrade
Created attachment 1219298 [details]
configmap used
I'm not sure what caused the first upgrade to fail. After deleting the daemonset we should see the deployer loop and wait for the fluentd pods to stop, but it appears to have failed while issuing the delete itself. I'm not able to recreate this.

The second failure looks like it timed out waiting to confirm that the components started back up. The Elasticsearch pod was having errors starting up (CrashLoopBackOff), so it is expected that the deployer would fail in that case. Could you post the logs for the failing ES pod here?
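For reference, a minimal sketch of the kind of wait loop described above -- not the deployer's actual code. It assumes fluentd pods carry the label logging-infra=fluentd, as the scaleDown trace below shows:

# Sketch only: poll until no fluentd pods remain, with a bounded retry count.
wait_for_fluentd_to_stop() {
  local tries=0 max_tries=60
  while oc get pods -l logging-infra=fluentd -o name | grep -q .; do
    if [ "$tries" -ge "$max_tries" ]; then
      echo "timed out waiting for fluentd pods to terminate" >&2
      return 1
    fi
    sleep 5
    tries=$((tries + 1))
  done
}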
Issue reproduced while upgrading with this latest deployer image:

brew...:xxxx/openshift3/logging-deployer 3.4.0 08eaf2753130 2 days ago 764.3 MB

# openshift version
openshift v3.4.0.25+1f36858
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

$ oc get po
NAME                          READY   STATUS      RESTARTS   AGE
logging-curator-1-03b1r       1/1     Running     0          26m
logging-deployer-6mnzo        0/1     Completed   0          27m
logging-deployer-pazr6        0/1     Error       0          13m
logging-es-8nm18kcw-2-63gnd   1/1     Running     0          22m
logging-kibana-1-70r1l        2/2     Running     1          26m

$ oc logs -f logging-deployer-pazr6
...
+ scaleDown
+ fluentd_dc=(`oc get dc -l logging-infra=fluentd -o jsonpath='{.items[*].metadata.name}'`)
++ oc get dc -l logging-infra=fluentd -o 'jsonpath={.items[*].metadata.name}'
No resources found.
+ local fluentd_dc
+ [[ -z '' ]]
++ oc get daemonset -l logging-infra=fluentd -o 'jsonpath={.items[*].spec.template.spec.nodeSelector}'
+ local 'selector=map[logging-infra-fluentd:true]'
+ [[ -n map[logging-infra-fluentd:true] ]]
++ sed s/:/=/g
++ echo logging-infra-fluentd:true
+ fluentd_nodeselector=logging-infra-fluentd=true
+ oc delete daemonset logging-fluentd
error: timed out waiting for the condition
This can also be recreated by running:

$ oc delete daemonset/logging-fluentd

If we surround it with 'date' commands, we see that it times out after 5 minutes. Adding '--timeout=360s' to the oc delete did not resolve this; it still timed out after 5 minutes:

# date; oc delete daemonset/logging-fluentd --timeout=360s; date
Mon Nov 14 12:14:48 EST 2016
error: timed out waiting for the condition
Mon Nov 14 12:19:48 EST 2016

Adding --grace-period=360 causes this to still time out:

# date; oc delete daemonset/logging-fluentd --grace-period=360; date
Mon Nov 14 12:25:21 EST 2016
error: timed out waiting for the condition
Mon Nov 14 12:30:22 EST 2016

# date; oc delete daemonset/logging-fluentd --grace-period=360 --timeout=360s; date
Mon Nov 14 12:34:48 EST 2016
error: timed out waiting for the condition
Mon Nov 14 12:39:49 EST 2016

Will update the deployer so the oc delete will not cause the upgrade to error out.
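One plausible shape for that change -- a sketch under the assumptions above, not necessarily the actual patch -- is to treat the client-side timeout as non-fatal and instead poll until the daemonset object is actually gone:

# Sketch only: tolerate the delete's client-side timeout, since deletion
# continues server-side, then wait for the object to disappear.
delete_fluentd_daemonset() {
  oc delete daemonset logging-fluentd || :

  local tries=0 max_tries=120
  while oc get daemonset logging-fluentd >/dev/null 2>&1; do
    if [ "$tries" -ge "$max_tries" ]; then
      echo "daemonset logging-fluentd still present after waiting" >&2
      return 1
    fi
    sleep 5
    tries=$((tries + 1))
  done
}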
Should be resolved as of:

12091903 buildContainer (noarch) completed successfully

koji_builds:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=524330

repositories:
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:rhaos-3.4-rhel-7-docker-candidate-20161114141130
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:v3.4.0.26-2
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:v3.4.0.26
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:latest
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:v3.4
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:3.4.0
Tested with the new image; it's fixed. The upgrade finished successfully, and logging worked fine after upgrading. Setting to VERIFIED.

Image tested:
brew....:xxxx/openshift3/logging-deployer 3.4.0 b8847e716761 11 hours ago 762.7 MB
Prerelease issue, no docs needed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066