Bug 1393775

Summary: Logging upgrade to 3.4.0 failed with "Unable to find log message from cluster.service from pod logging-es-3bjvollr-4-mhyt5 within 300 seconds"
Product: OpenShift Container Platform
Component: Logging
Version: 3.4.0
Type: Bug
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Xia Zhao <xiazhao>
Assignee: ewolinet
QA Contact: Xia Zhao <xiazhao>
CC: aos-bugs, tdawson, xiazhao
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2017-01-18 12:54:35 UTC

Attachments:
deployer_pod_log

Description Xia Zhao 2016-11-10 10:24:08 UTC
Created attachment 1219302: deployer_pod_log

Description of problem:
Upgrading the logging stack from the 3.2.0 level to the 3.4.0 level fails with:
Unable to find log message from cluster.service from pod logging-es-3bjvollr-4-mhyt5 within 300 seconds
# oc get po
NAME                          READY     STATUS             RESTARTS   AGE
logging-curator-1-rbae8       0/1       CrashLoopBackOff   4          18m
logging-deployer-cwpmt        0/1       Error              0          22m
logging-deployer-pdkwp        0/1       Completed          0          31m
logging-es-3bjvollr-4-mhyt5   0/1       CrashLoopBackOff   8          17m
logging-fluentd-f31ok         1/1       Running            0          18m
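
For reference, the usual next step with pods stuck in CrashLoopBackOff is to pull the logs of the last terminated container and check pod events; a minimal sketch using the pod names from this run:

# Logs of the previously terminated container in the crashing ES pod:
$ oc logs logging-es-3bjvollr-4-mhyt5 --previous
# Pod events often name the crash reason as well:
$ oc describe pod logging-curator-1-rbae8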


Version-Release number of selected component (if applicable):
brew registry:
openshift3/logging-deployer        3.4.0               c364ab9c2f75

# openshift version
openshift v3.4.0.23+24b1a58
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0


How reproducible:
Always

Steps to Reproduce:
1. Install OpenShift 3.2.0
2. Deploy logging at the 3.2.0 level:
IMAGE_PREFIX=brew...:xxxx/openshift3/,IMAGE_VERSION=3.2.0,MODE=install

3. Upgrade the logging stack:
$oadm policy add-cluster-role-to-user cluster-admin xiazhao
$oc delete template logging-deployer-account-template logging-deployer-template
$oc create -f https://raw.githubusercontent.com/openshift/origin-aggregated-logging/master/deployer/deployer.yaml
$oc new-app logging-deployer-account-template
$oc get template logging-deployer-template -o yaml -n logging | sed  's/\(image:\s.*\)logging-deployment\(.*\)/\1logging-deployer\2/g' | oc apply -n logging -f -
$oc policy add-role-to-user edit --serviceaccount logging-deployer
$oc policy add-role-to-user daemonset-admin --serviceaccount logging-deployer
$oadm policy add-cluster-role-to-user oauth-editor system:serviceaccount:logging:logging-deployer
$oadm policy add-cluster-role-to-user rolebinding-reader system:serviceaccount:logging:aggregated-logging-elasticsearch
$oc new-app logging-deployer-template -p PUBLIC_MASTER_URL=https://{master-domain}:8443,ENABLE_OPS_CLUSTER=false,IMAGE_PREFIX=brew...:xxxx/openshift3/,IMAGE_VERSION=3.4.0,ES_INSTANCE_RAM=1G,ES_CLUSTER_SIZE=1,KIBANA_HOSTNAME={kibana-route},KIBANA_OPS_HOSTNAME={kibana-ops-route},MASTER_URL=https://{master-domain}:8443,MODE=upgrade

4. Check the upgrade result (one way is sketched below)
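
A minimal way to do step 4, assuming the project is named logging (substitute the actual deployer pod name from the pod listing):

$ oc get pods -n logging
$ oc logs logging-deployer-cwpmt -n logging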


Actual results:
The upgrade failed.

Expected results:
The upgrade to 3.4.0 succeeds.

Additional info:
deployer pod logs attached

Comment 1 ewolinet 2016-11-10 17:31:48 UTC
This looks to be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1393769

The deployer failed while waiting for the EFK components to scale up.

The difference in the error message is that there was a window of time during which the deployer could see that the ES pod had started, but it could not find a message in the pod's logs confirming that the service was available.
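
The check it was performing boils down to polling the pod's log for a marker string until a timeout expires. A rough shell illustration of that pattern (a sketch of the idea, not the deployer's actual code; the function name is made up):

wait_for_log_message() {
  # Poll the pod's log for a marker string, giving up after $timeout seconds.
  local pod=$1 message=$2 timeout=${3:-300} waited=0
  while [ "$waited" -lt "$timeout" ]; do
    oc logs "$pod" 2>/dev/null | grep -q "$message" && return 0
    sleep 5
    waited=$((waited + 5))
  done
  echo "Unable to find log message $message from pod $pod within $timeout seconds" >&2
  return 1
}

wait_for_log_message logging-es-3bjvollr-4-mhyt5 cluster.service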

Comment 4 Xia Zhao 2016-11-14 09:50:44 UTC
It's fixed. Tested with the latest 3.4.0 deployer image: the upgrade succeeded, and the Kibana and Kibana ops UIs are accessible with log entries:

$ oc get po
NAME                              READY     STATUS      RESTARTS   AGE
logging-curator-1-n27sm           1/1       Running     0          4m
logging-curator-ops-1-izno3       1/1       Running     0          4m
logging-deployer-o8b77            0/1       Completed   0          10m
logging-deployer-r8kpd            0/1       Completed   0          6m
logging-es-flruj8ta-4-4gnaz       1/1       Running     0          4m
logging-es-ops-rpbmoj63-4-mgqhe   1/1       Running     0          4m
logging-fluentd-qxxjh             1/1       Running     0          4m
logging-kibana-2-j5ohm            2/2       Running     0          3m
logging-kibana-ops-3-9pr69        2/2       Running     0          3m

I'm not sure why the upgrade pod stopped showing the full log with a "short write" error:

$ oc logs -f logging-deployer-r8kpd
++ oc get dc -l logging-infra=elasticsearch -o 'jsonpath={.items[*].metadata.name}'
+ for dc in '$(oc get dc -l $label -o jsonpath='\''{.items[*].metadata.name}'\'')'
+ patchDCImage logging-es-flruj8ta logging-elasticsearch false
+ local dc=logging-es-flruj8ta
+ local image=logging-elasticsearch
+ local kibana=false
++ oc get dc/logging-es-flruj8ta -o 'jsonpath={.status.latestVersion}'
+ local version=1
+ local authProxy_patch
+ '[' false = true ']'
+ patchIfValid dc/logging-es-flruj8ta
'{.spec.template.spec.containers[0].image}=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:3.4.0
'
error: short write
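
If the streamed read keeps dying like this, fetching the completed pod's log in a single request (without -f) may sidestep it; this is an assumption, not a verified workaround:

$ oc logs logging-deployer-r8kpd > /tmp/deployer-upgrade.log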


# openshift version
openshift v3.4.0.25+1f36858
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0


Images tested with:
brew....:xxxx/openshift3/logging-deployer        3.4.0               08eaf2753130        2 days ago          764.3 MB

Comment 5 ewolinet 2016-12-12 15:48:12 UTC
Prerelease issue, no docs needed.

Comment 7 errata-xmlrpc 2017-01-18 12:54:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066