Bug 1372540

Summary: Upgrade failed from logging 3.2.1 to 3.3 on upgraded OSE env
Product: OpenShift Container Platform
Reporter: Anping Li <anli>
Component: Logging
Assignee: ewolinet
Status: CLOSED CURRENTRELEASE
QA Contact: Wei Sun <wsun>
Severity: medium
Priority: medium
Version: 3.3.0
CC: anli, aos-bugs, ewolinet, jokerman, lmeyer, mmccomas, tdawson
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2016-11-22 22:24:48 UTC

Description Anping Li 2016-09-02 03:30:08 UTC
Description of problem:
Upgrading logging from 3.2.1 to 3.3 failed after upgrading OSE 3.2 to OCP 3.3.

Version-Release number of selected component (if applicable):
openshift-ansible-3.3.20

How reproducible:
always

Steps to Reproduce:
1. Install OSE 3.2.
2. Deploy logging.
3. Upgrade OSE 3.2.1 to OCP 3.3.
4. Check the status of the logging applications.
[root@anli-working ha2]# oc get pods
NAME                              READY     STATUS    RESTARTS   AGE
logging-curator-1-9gmwb           1/1       Running   0          26m
logging-curator-ops-1-aeykn       1/1       Running   0          30m
logging-es-2ct7rh6u-2-0czhe       1/1       Running   0          27m
logging-es-ops-hrp6pnho-1-tfmpe   1/1       Running   0          27m
logging-fluentd-1-agmf9           1/1       Running   0          27m
logging-fluentd-1-ewfst           1/1       Running   0          29m
logging-fluentd-1-ezbyq           1/1       Running   0          26m
logging-fluentd-1-gm3vk           1/1       Running   0          31m
logging-fluentd-1-p52a7           1/1       Running   0          30m
logging-kibana-1-ywbq2            2/2       Running   0          29m
logging-kibana-ops-1-syq3k        2/2       Running   2          26m


5. Deploy the logging deployer service account and roles.

[root@anli-working ha2]# oc new-app logging-deployer-account-template
--> Deploying template "logging-deployer-account-template" in project "openshift"

     logging-deployer-account-template
     ---------
     Template for creating the deployer account and roles needed for the aggregated logging deployer. Create as cluster-admin.

--> Creating resources with label app=logging-deployer-account-template ...
    error: serviceaccounts "logging-deployer" already exists
    error: serviceaccounts "aggregated-logging-kibana" already exists
    error: serviceaccounts "aggregated-logging-elasticsearch" already exists
    error: serviceaccounts "aggregated-logging-fluentd" already exists
    error: serviceaccounts "aggregated-logging-curator" already exists
    clusterrole "oauth-editor" created
    clusterrole "daemonset-admin" created
    rolebinding "logging-deployer-edit-role" created
    rolebinding "logging-deployer-dsadmin-role" created
--> Failed
[root@anli-working ha2]# oc policy add-role-to-user edit --serviceaccount logging-deployer
[root@anli-working ha2]# oc policy add-role-to-user daemonset-admin --serviceaccount logging-deployer
[root@anli-working ha2]# oadm policy add-cluster-role-to-user oauth-editor system:serviceaccount:logging:logging-deployer
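
The "already exists" errors above appear to be expected on an upgrade, since those service accounts were created by the 3.2 install; only the new cluster roles and role bindings are added before the template reports Failed. As an optional sanity check (a minimal sketch, assuming the logging components live in the "logging" project referenced by the oadm command above), the bindings can be inspected with:

oc describe rolebinding logging-deployer-edit-role -n logging
oc describe rolebinding logging-deployer-dsadmin-role -n logging
oc get sa -n logging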

6. Upgrade logging by running the deployer template with MODE=upgrade.

[root@anli-working ha2]# oc new-app logging-deployer-template -p ENABLE_OPS_CLUSTER=true,IMAGE_PREFIX=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/,KIBANA_HOSTNAME=kibana.0823-voo.qe.rhcloud.com,KIBANA_OPS_HOSTNAME=kibana-ops.0823-voo.qe.rhcloud.com,PUBLIC_MASTER_URL=https://openshift-166.lab.sjc.redhat.com:443,ES_INSTANCE_RAM=2048M,MASTER_URL=https://openshift-166.lab.sjc.redhat.com:443,MODE=upgrade
--> Deploying template "logging-deployer-template" in project "openshift"

     logging-deployer-template
     ---------
     Template for running the aggregated logging deployer in a pod. Requires empowered 'logging-deployer' service account.

     * With parameters:
        * MODE=upgrade
        * IMAGE_PREFIX=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/
        * IMAGE_VERSION=latest
        * IMAGE_PULL_SECRET=
        * INSECURE_REGISTRY=false
        * ENABLE_OPS_CLUSTER=true
        * KIBANA_HOSTNAME=kibana.0823-voo.qe.rhcloud.com
        * KIBANA_OPS_HOSTNAME=kibana-ops.0823-voo.qe.rhcloud.com
        * PUBLIC_MASTER_URL=https://openshift-166.lab.sjc.redhat.com:443
        * MASTER_URL=https://openshift-166.lab.sjc.redhat.com:443
        * ES_CLUSTER_SIZE=1
        * ES_INSTANCE_RAM=2048M
        * ES_PVC_SIZE=
        * ES_PVC_PREFIX=logging-es-
        * ES_PVC_DYNAMIC=
        * ES_NODE_QUORUM=
        * ES_RECOVER_AFTER_NODES=
        * ES_RECOVER_EXPECTED_NODES=
        * ES_RECOVER_AFTER_TIME=5m
        * ES_OPS_CLUSTER_SIZE=
        * ES_OPS_INSTANCE_RAM=8G
        * ES_OPS_PVC_SIZE=
        * ES_OPS_PVC_PREFIX=logging-es-ops-
        * ES_OPS_PVC_DYNAMIC=
        * ES_OPS_NODE_QUORUM=
        * ES_OPS_RECOVER_AFTER_NODES=
        * ES_OPS_RECOVER_EXPECTED_NODES=
        * ES_OPS_RECOVER_AFTER_TIME=5m
        * FLUENTD_NODESELECTOR=logging-infra-fluentd=true
        * ES_NODESELECTOR=
        * ES_OPS_NODESELECTOR=
        * KIBANA_NODESELECTOR=
        * KIBANA_OPS_NODESELECTOR=
        * CURATOR_NODESELECTOR=
        * CURATOR_OPS_NODESELECTOR=

--> Creating resources with label app=logging-deployer-template ...
    pod "logging-deployer-67o1q" created
--> Success
    Run 'oc status' to view your app.
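
To follow the upgrade while it runs, the deployer pod and the other logging pods can be watched directly (a minimal sketch, assuming the "logging" project and the deployer pod name from the output above):

oc logs -f logging-deployer-67o1q -n logging
oc get pods -n logging -w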

7. Check pod status as the upgrade progresses.
[root@anli-working ha2]# oc get pods
NAME                              READY     STATUS        RESTARTS   AGE
logging-deployer-67o1q            1/1       Running       0          3m
logging-es-2ct7rh6u-2-0czhe       1/1       Running       0          31m
logging-es-ops-hrp6pnho-1-tfmpe   1/1       Running       0          31m
logging-kibana-1-ywbq2            2/2       Terminating   0          32m
logging-kibana-ops-1-syq3k        2/2       Terminating   2          29m


NAME                           READY     STATUS              RESTARTS   AGE
logging-curator-2-deploy       0/1       ContainerCreating   0          1m
logging-deployer-67o1q         1/1       Running             0          5m
logging-es-2ct7rh6u-3-deploy   1/1       Running             0          1m
logging-kibana-2-deploy        1/1       Running             0          1m
[root@anli-working ha2]# oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
logging-curator-2-deploy   1/1       Running   0          1m
logging-deployer-67o1q     1/1       Running   0          5m
[root@anli-working ha2]# oc get pods
NAME                     READY     STATUS    RESTARTS   AGE
logging-deployer-67o1q   1/1       Running   0          5m


[root@anli-working ha2]# oc get pods
NAME                           READY     STATUS              RESTARTS   AGE
logging-curator-1-deploy       1/1       Running             0          1m
logging-curator-1-g5mgc        0/1       RunContainerError   1          1m
logging-curator-ops-1-bqfe2    0/1       RunContainerError   0          1m
logging-curator-ops-1-deploy   1/1       Running             0          1m
logging-deployer-67o1q         0/1       Error               0          7m

[root@anli-working ha2]# oc get pods
NAME                           READY     STATUS             RESTARTS   AGE
logging-curator-1-deploy       1/1       Running            0          8m
logging-curator-1-g5mgc        0/1       CrashLoopBackOff   6          7m
logging-curator-ops-1-bqfe2    0/1       CrashLoopBackOff   6          8m
logging-curator-ops-1-deploy   1/1       Running            0          8m
logging-deployer-67o1q         0/1       Error              0          14m

[root@anli-working ha2]# oc get pods
NAME                           READY     STATUS    RESTARTS   AGE
logging-curator-1-deploy       0/1       Error     0          14m
logging-curator-ops-1-deploy   0/1       Error     0          14m
logging-deployer-67o1q         0/1       Error     0          20m

8. Check the deployer and curator logs; refer to the attached files.
oc logs logging-curator-1-deploy >logging-curator-1-deploy.logs
oc logs logging-curator-ops-1-deploy >logging-curator-ops-1-deploy.logs
oc logs logging-deployer-67o1q >logging-deployer-67o1q.logs
oc describe pod logging-curator-1-g5mgc > logging-curator-1-g5mgc.describe
oc describe pod logging-curator-ops-1-bqfe2 > logging-curator-ops-1-bqfe2.describe

Actual results:
In logging-deployer-67o1q.logs (tailf logging-deployer-67o1q.logs):
+  oc patch deploymentconfig/logging-es-2ct7rh6u --type=json --patch  '[{"op": "replace", "path":  "/spec/template/spec/containers/0/volumeMounts/0/mountPath", "value":  "/etc/elasticsearch/secret"},{"op": "add", "path":  "/spec/template/spec/containers/0/volumeMounts/1", "value": {"name":  "elasticsearch-config", "mountPath": "/usr/share/elasticsearch/config",  "readOnly": true}},{"op": "add", "path":  "/spec/template/spec/volumes/1", "value": {"name":  "elasticsearch-config", "configMap": {"name":  "logging-elasticsearch"}}}]'
"logging-es-2ct7rh6u" patched
+ oc deploy deploymentconfig/logging-es-2ct7rh6u --latest
Error  from server: Operation cannot be fulfilled on deploymentconfigs  "logging-es-2ct7rh6u": the object has been modified; please apply your  changes to the latest version and try again

In logging-curator-1-g5mgc.describe
  3m    3m      1       {kubelet openshift-155.lab.sjc.redhat.com}      spec.containers{curator}        Warning Failed          Failed to start container with docker id de85f02852f4 with error: Error response from daemon: Cannot start container de85f02852f4f76fabf4752a4b076e6cacf6aa7f470cde4ae8adb57c4bd0196c: [9] System error: invalid character '}' looking for beginning of value
  2m    2m      1       {kubelet openshift-155.lab.sjc.redhat.com}      spec.containers{curator}        Normal  Created         Created container with docker id 8d6946ca3815
  2m    2m      1       {kubelet openshift-155.lab.sjc.redhat.com}      spec.containers{curator}        Warning Failed          Failed to start container with docker id 8d6946ca3815 with error: Error response from daemon: Cannot start container 8d6946ca3815b85459e5f4a1dda88842773aa4a8e0270b446f680b260ada395f: [9] System error: invalid character '}' looking for beginning of value
  2m    2m      1       {kubelet openshift-155.lab.sjc.redhat.com}                                      Warning FailedSync      Error syncing pod, skipping: failed to "StartContainer" for "curator" with RunContainerError: "runContainer: Error response from daemon: Cannot start container 8d6946ca3815b85459e5f4a1dda88842773aa4a8e0270b446f680b260ada395f: [9] System error: invalid character '}' looking for beginning of value"


In logging-curator-ops-1-bqfe2.describe
  6m    6m      1       {kubelet openshift-114.lab.sjc.redhat.com}      spec.containers{curator}        Warning Failed          Failed to start container with docker id 1620eda5c6be with error: Error response from daemon: Cannot start container 1620eda5c6be602917675e02debbac2e26b503924ff87a135912cca8afc5b261: [9] System error: invalid character '}' looking for beginning of value
  6m    6m      1       {kubelet openshift-114.lab.sjc.redhat.com}      spec.containers{curator}        Normal  Created         Created container with docker id 2e366425c3d8
  6m    6m      1       {kubelet openshift-114.lab.sjc.redhat.com}                                      Warning FailedSync      Error syncing pod, skipping: failed to "StartContainer" for "curator" with RunContainerError: "runContainer: Error response from daemon: Cannot start container 2e366425c3d83954faf323937db2f575725835739247593fe992a7164221fb55: [9] System error: invalid character '}' looking for beginning of value"

  6m    6m      1       {kubelet openshift-114.lab.sjc.redhat.com}      spec.containers{curator}        Warning Failed          Failed to start container with docker id 2e366425c3d8 with error: Error response from daemon: Cannot start container 2e366425c3d83954faf323937db2f575725835739247593fe992a7164221fb55: [9] System error: invalid character '}' looking for beginning of value
  6m    6m      1       {kubelet openshift-114.lab.sjc.redhat.com}                                      Warning FailedSync      Error syncing pod, skipping: failed to "StartContainer" for "curator" with CrashLoopBackOff: "Back-off 20s restarting failed container=curator pod=logging-curator-ops-1-bqfe2_logging(db3b6ad0-70b8-11e6-b266-fa163e493d67)"

  5m    5m      1       {kubelet openshift-114.lab.sjc.redhat.com}      spec.containers{curator}        Normal  Created         Created container with docker id 95341f392f63
  5m    5m      1       {kubelet openshift-114.lab.sjc.redhat.com}      spec.containers{curator}        Warning Failed          Failed to start container with docker id 95341f392f63 with error: Error response from daemon: Cannot start container 95341f392f6394306d3dcda45b860c13a81d4afa2ab27524dd666bf2dc6abb56: [9] System error: could not synchronise with container process
  5m    5m      1       {kubelet openshift-114.lab.sjc.redhat.com}                                      Warning FailedSync      Error syncing pod, skipping: failed to "StartContainer" for "curator" with RunContainerError: "runContainer: Error response from daemon: Cannot start container 95341f392f6394306d3dcda45b860c13a81d4afa2ab27524dd666bf2dc6abb56: [9] System error: could not synchronise with container process"

  5m    5m      3       {kubelet openshift-114.lab.sjc.redhat.com}              Warning FailedSync      Error syncing pod, skipping: failed to "StartContainer" for "curator" with CrashLoopBackOff: "Back-off 40s restarting failed container=curator pod=logging-curator-ops-1-bqfe2_logging(db3b6ad0-70b8-11e6-b266-fa163e493d67)"


Expected results:
The logging components are upgraded to 3.3 and all logging pods (elasticsearch, kibana, curator, fluentd) return to Running status.

Additional info:

Comment 1 Luke Meyer 2016-09-02 16:20:44 UTC
What was the IMAGE_PREFIX when installing 3.2? I ask because obviously the upgraded version is internal-only, and I'm wondering if a different registry was used for the initial install. The upgrade is in-place, and though it does update image tags, it doesn't update image names/repos (unfortunately... maybe it should), so the upgrade could be trying to deploy e.g. registry.access.redhat.com/openshift3/logging-curator:3.3.0 and that will not resolve until release.

In other words, you can't upgrade to a different IMAGE_PREFIX.

A full describe of one of the upgraded DCs and/or full list of namespace events would probably shed light on what it's trying to do.
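
One way to confirm which registry each deployment config still points at (a minimal sketch, assuming the components are in the "logging" project; the loop just prints the first container image of every DC):

for dc in $(oc get dc -n logging -o name); do
  oc get "$dc" -n logging -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
done

If the images still reference registry.access.redhat.com while the upgrade was run with the brew-pulp IMAGE_PREFIX, that would match the failure mode described above.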

Comment 2 ewolinet 2016-09-02 17:42:23 UTC
I just ran an upgrade from 3.2.1 to 3.3.0 for aggregated logging and wasn't able to recreate this -- I see that the deployer completed successfully.

It looks like an oc deploy failed during the deployer run for one of the ES deployment configs. If you look, you can see that the deployer pod for your MODE=upgrade run is in status Error.

A list of events from that time as Luke is requesting would be useful.
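
For reference, the "object has been modified" error in the deployer log is the generic optimistic-concurrency conflict from the API server. Collecting the namespace events and, if needed, manually re-triggering the ES rollout might look like this (a sketch assuming the "logging" project and the DC name from the log above; not a verified fix):

oc get events -n logging
oc describe dc logging-es-2ct7rh6u -n logging
oc deploy deploymentconfig/logging-es-2ct7rh6u --latest -n logging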

Comment 3 Anping Li 2016-09-05 01:17:22 UTC
Yes, I used a different docker registry server during the upgrade: registry.access.redhat.com -> brew-pulp-docker01.web.prod.ext.phx2.redhat.com

Comment 4 ewolinet 2016-09-06 13:13:39 UTC
Anping, can you verify whether rerunning this test, first installing logging 3.2 from the brew-pulp repo and then upgrading to 3.3 using the same repo, still causes you to see this issue?

When I ran my test and was unable to recreate the issue, I was using the same repo for both 3.2.1 and 3.3.

Comment 5 Luke Meyer 2016-09-06 13:36:16 UTC
I'm tempted to close this NOTABUG, but maybe we should keep it around to remind ourselves to handle this situation (upgrading to a different registry) better.
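
Until the deployer handles a registry change, one possible manual workaround (a hedged sketch, not an officially supported step; the image value below is illustrative) would be to patch each logging DC to the new image prefix before re-running the upgrade, mirroring the json patch the deployer already uses:

oc patch dc/logging-curator -n logging --type=json --patch '[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-curator:3.3.0"}]'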

Comment 6 Anping Li 2016-09-09 12:39:33 UTC
(In reply to ewolinet from comment #4)
> Anping, can you verify if rerunning this test with first installing logging
> 3.2 from the brew-pulp repo and then upgrade to 3.3 using the same repo
> still causes you to see this issue?
> 
> When I ran my test and was unable to recreate, it was while using the same
> repo for 3.2.1 and 3.3

If rerun with the install configuration (the same registry for both install and upgrade), the upgrade works well.

Comment 8 Anping Li 2016-09-21 01:30:33 UTC
The issue will be addressed in the repo, so QA is moving this to VERIFIED. Please feel free to reopen it if needed.