Bug 1557044 - Logging error while upgrading 3.6 to 3.7.23
Summary: Logging error while upgrading 3.6 to 3.7.23
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.7.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.7.z
Assignee: ewolinet
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-03-15 20:53 UTC by Matthew Barnes
Modified: 2018-10-08 10:08 UTC
CC List: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: v3.7 introduced a second container in the ES pod, and our role did not account for the fact that a pod being upgraded to 3.7 still has only one container.
Consequence: The playbook run fails because the pod does not have a second container.
Fix: Expand the check to also verify whether the pod has more than one container, to account for upgrades from pre-v3.7 to v3.7.
Result: The playbook runs to completion when upgrading from 3.6 to 3.7.
Clone Of:
Environment:
Last Closed: 2018-06-27 07:59:11 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2018:2009  0        None      None    None     2018-06-27 07:59:38 UTC

Description Matthew Barnes 2018-03-15 20:53:37 UTC
While upgrading OCP from 3.6 to 3.7.23 using openshift-ansible 3.7.37, I encountered this error:

TASK [openshift_logging_elasticsearch : set_fact] ******************************
Thursday 15 March 2018  19:04:57 +0000 (0:00:00.621)       0:03:55.458 ******** 
fatal: [204.236.205.127]: FAILED! => {"msg": "The conditional check 'item.status.containerStatuses[1].ready == true' failed. The error was: error while evaluating conditional (item.status.containerStatuses[1].ready == true): list object has no element 1\n\nThe error appears to have been in '/home/opsmedic/aos-cd/git/openshift-tools/openshift/installer/vendored/openshift-ansible-3.7.37/roles/openshift_logging_elasticsearch/tasks/get_es_version.yml': line 9, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- set_fact:\n  ^ here\n"}

PLAY RECAP *********************************************************************
18.232.94.26               : ok=47   changed=5    unreachable=0    failed=0   
204.236.205.127            : ok=105  changed=8    unreachable=0    failed=1   
34.207.102.240             : ok=47   changed=5    unreachable=0    failed=0   
34.227.95.221              : ok=47   changed=5    unreachable=0    failed=0   
34.228.247.65              : ok=47   changed=5    unreachable=0    failed=0   
35.173.181.73              : ok=47   changed=5    unreachable=0    failed=0   
35.174.0.226               : ok=48   changed=6    unreachable=0    failed=0   
52.71.96.22                : ok=47   changed=5    unreachable=0    failed=0   
52.91.74.69                : ok=48   changed=6    unreachable=0    failed=0   
54.152.199.184             : ok=47   changed=5    unreachable=0    failed=0   
localhost                  : ok=11   changed=0    unreachable=0    failed=0   


INSTALLER STATUS ***************************************************************
Initialization             : Complete
Logging Install            : In Progress
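
The failing conditional indexes containerStatuses[1], which only exists once a pod is already running the second container that 3.7 adds to the ES deployment; pods carried over from 3.6 have a single container, hence "list object has no element 1". A guard on the container count avoids this; a minimal sketch of that kind of check (task and variable names here are assumed for illustration, not the actual role code):

# Illustrative sketch only -- not the actual get_es_version.yml task.
# "_es_pods" stands in for the registered JSON output of an "oc get pods" lookup.
- set_fact:
    _es_ready_pods: "{{ _es_ready_pods | default([]) + [item] }}"
  when: >
    item.status.containerStatuses[0].ready == true
    and (item.status.containerStatuses | length == 1
    or item.status.containerStatuses[1].ready == true)
  with_items: "{{ _es_pods['items'] }}"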


Deployment Config
-----------------

apiVersion: v1
kind: DeploymentConfig
metadata:
  creationTimestamp: 2018-03-14T18:02:25Z
  generation: 2
  labels:
    component: es
    deployment: logging-es-data-master-1stfd5d8
    logging-infra: elasticsearch
    provider: openshift
  name: logging-es-data-master-1stfd5d8
  namespace: logging
  resourceVersion: "57277"
  selfLink: /oapi/v1/namespaces/logging/deploymentconfigs/logging-es-data-master-1stfd5d8
  uid: d9f830ac-27b1-11e8-9294-0ede7012d9c6
spec:
  replicas: 1
  selector:
    component: es
    deployment: logging-es-data-master-1stfd5d8
    logging-infra: elasticsearch
    provider: openshift
  strategy:
    activeDeadlineSeconds: 21600
    recreateParams:
      timeoutSeconds: 600
    resources: {}
    type: Recreate
  template:
    metadata:
      creationTimestamp: null
      labels:
        component: es
        deployment: logging-es-data-master-1stfd5d8
        logging-infra: elasticsearch
        provider: openshift
      name: logging-es-data-master-1stfd5d8
    spec:
      containers:
      - env:
        - name: DC_NAME
          value: logging-es-data-master-1stfd5d8
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: KUBERNETES_TRUST_CERT
          value: "true"
        - name: SERVICE_DNS
          value: logging-es-cluster
        - name: CLUSTER_NAME
          value: logging-es
        - name: INSTANCE_RAM
          value: 12Gi
        - name: HEAP_DUMP_LOCATION
          value: /elasticsearch/persistent/heapdump.hprof
        - name: NODE_QUORUM
          value: "2"
        - name: RECOVER_EXPECTED_NODES
          value: "3"
        - name: RECOVER_AFTER_TIME
          value: 5m
        - name: READINESS_PROBE_TIMEOUT
          value: "30"
        - name: POD_LABEL
          value: component=es
        - name: IS_MASTER
          value: "true"
        - name: HAS_DATA
          value: "true"
        image: registry.reg-aws.openshift.com:443/openshift3/logging-elasticsearch:v3.6
        imagePullPolicy: IfNotPresent
        name: elasticsearch
        ports:
        - containerPort: 9200
          name: restapi
          protocol: TCP
        - containerPort: 9300
          name: cluster
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - /usr/share/java/elasticsearch/probe/readiness.sh
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 30
        resources:
          limits:
            memory: 12Gi
          requests:
            cpu: 375m
            memory: 12Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/elasticsearch/secret
          name: elasticsearch
          readOnly: true
        - mountPath: /usr/share/java/elasticsearch/config
          name: elasticsearch-config
          readOnly: true
        - mountPath: /elasticsearch/persistent
          name: elasticsearch-storage
      dnsPolicy: ClusterFirst
      nodeSelector:
        type: infra
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        supplementalGroups:
        - 65534
      serviceAccount: aggregated-logging-elasticsearch
      serviceAccountName: aggregated-logging-elasticsearch
      terminationGracePeriodSeconds: 30
      volumes:
      - name: elasticsearch
        secret:
          defaultMode: 420
          secretName: logging-elasticsearch
      - configMap:
          defaultMode: 420
          name: logging-elasticsearch
        name: elasticsearch-config
      - name: elasticsearch-storage
        persistentVolumeClaim:
          claimName: logging-es-0
  test: false
  triggers:
  - type: ConfigChange
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: 2018-03-14T18:03:54Z
    lastUpdateTime: 2018-03-14T18:03:54Z
    message: replication controller "logging-es-data-master-1stfd5d8-1" successfully
      rolled out
    reason: NewReplicationControllerAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: 2018-03-14T22:36:42Z
    lastUpdateTime: 2018-03-14T22:36:42Z
    message: Deployment config has minimum availability.
    status: "True"
    type: Available
  details:
    causes:
    - type: ConfigChange
      message: config change
  latestVersion: 1
  observedGeneration: 2
  readyReplicas: 1
  replicas: 1
  unavailableReplicas: 0
  updatedReplicas: 1

Comment 2 Matthew Barnes 2018-03-19 20:50:35 UTC
Changing to "installer" component since this was an openshift-ansible issue.

Comment 4 Matthew Barnes 2018-04-23 16:42:44 UTC
I retested a 3.6->3.7 upgrade with openshift-ansible 3.7.42 on RHEL7, but it failed for a different reason:

TASK [Upgrade master packages] *************************************************
Monday 23 April 2018  16:32:29 +0000 (0:00:05.043)       0:03:23.697 ********** 
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: TemplateRuntimeError: no test named 'equalto'
fatal: [mbarnestest-master-84785]: FAILED! => {"msg": "Unexpected failure during module execution.", "stdout": ""}

The "equalto" filter was added in Jinja 2.8 [1], but the latest available package on RHEL7 is python-jinja2-2.7.2-2.el7.

Because of that failure, I can't confirm this particular bug is fixed yet.

[1] http://jinja.pocoo.org/docs/2.10/changelog/#version-2-8
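
For reference, "equalto" is a Jinja test (used with filters such as selectattr), so any template that applies it fails at render time on python-jinja2 2.7, which matches the traceback above. A standalone illustration of the failure mode, not the installer's actual template:

# Illustrative only -- not the installer's template. On python-jinja2 < 2.8 this
# task fails with "TemplateRuntimeError: no test named 'equalto'".
- debug:
    msg: "{{ packages | selectattr('state', 'equalto', 'installed') | map(attribute='name') | list }}"
  vars:
    packages:
    - { name: origin-master, state: installed }
    - { name: origin-node, state: absent }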

Comment 5 Matthew Barnes 2018-04-23 16:45:03 UTC
Looks like the equalto issue might be fixed in 3.7.43.  Will retry.

Comment 13 Matthew Barnes 2018-05-04 20:58:57 UTC
Okay, finally got a successful upgrade with openshift-ansible 3.7.44 after adding a 3rd "infra" node for the 3rd ES pod to run on.

Apparently I had been getting away with running 3 ES pods on only 2 nodes before, but I guess the extra container per pod pushed the memory requirement beyond what 2 nodes could handle?

In any case, sorry for the tangent.  Looks like this is fixed.

Comment 15 Anping Li 2018-05-10 09:20:31 UTC
Logging can be upgraded from v3.6.173.0.118 to v3.7.46 via openshift-ansible:v3.7.46. After the upgrade, the indices can be retrieved in Kibana.

Key variables:
openshift_logging_es_pvc_dynamic=true
openshift_logging_es_number_of_shards=1
openshift_logging_es_number_of_replicas=1
openshift_logging_es_memory_limit=2Gi
openshift_logging_es_cluster_size=3


After the upgrade:
1) The cluster is healthy.
oc exec -c elasticsearch logging-es-data-master-ew54449w-2-hrnn7 -- curl -s -XGET --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "logging-es",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 10,
  "active_shards" : 23,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

2) New logs are gathered and can be viewed in Kibana:
project.anlitest.a840e525-5431-11e8-9ea4-fa163e36cc89.2018.05.10

3) The old indices can be viewed in Kibana:
project.install-test.20a798cb-53fe-11e8-8ad7-fa163e36cc89.2018.05.10

Comment 16 Marc Jadoul 2018-05-19 11:56:05 UTC
I am trying to upgrade to 3.7.46, but I still get the error.

Is the fix really in 3.7.46?

Comment 17 ewolinet 2018-05-22 19:36:19 UTC
@Marc,

What version of openshift-ansible are you using?

Comment 18 Marc Jadoul 2018-05-23 08:05:59 UTC
Hello,
We are using 3.7.46.

Comment 19 ewolinet 2018-05-23 16:33:32 UTC
Thanks for that, Marc,

I think I see what the issue is. 
Just to confirm though, do you happen to have a Logging Ops deployment? And which line are you seeing the failure occur on; is it line 45 of get_es_version.yml?
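
Presumably the block around line 45 of get_es_version.yml is the same container-status lookup repeated for the Ops ES cluster, so an Ops deployment can still hit an unguarded copy of the check even after the first one is fixed. The same container-count guard would be needed there as well; a sketch with assumed variable names rather than the role's actual ones:

# Illustrative only -- the same length guard applied to the Ops-cluster lookup.
- set_fact:
    _es_ops_pod_ready: true
  when: >
    item.status.containerStatuses | length == 1
    or item.status.containerStatuses[1].ready == true
  with_items: "{{ _es_ops_pods['items'] }}"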

Comment 21 Marc Jadoul 2018-05-23 17:36:34 UTC
Hello,
Yes, we deployed Logging Ops.

Yes line 45!

fatal: [XXX]: FAILED! => {"msg": "The conditional check 'item.status.containerStatuses[1].ready == true' failed. The error was: error while evaluating conditional (item.status.containerStatuses[1].ready == true): list object has no element 1\n\nThe error appears to have been in '/usr/share/ansible/openshift-ansible/roles/openshift_logging_elasticsearch/tasks/get_es_version.yml': line 45, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- set_fact:\n  ^ here\n"}

Comment 25 Anping Li 2018-06-21 06:18:15 UTC
The upgrade passes with openshift3/ose-ansible/images/v3.7.55-1.

Comment 27 errata-xmlrpc 2018-06-27 07:59:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2009

