Bug 1421630 - [Intservice_public_324]logging upgrade failed since node not well labeled
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.5.z
Assignee: Jeff Cantrill
QA Contact: Xia Zhao
 
Reported: 2017-02-13 09:59 UTC by Xia Zhao
Modified: 2017-12-14 21:01 UTC
CC: 3 users

Doc Type: No Doc Update
Last Closed: 2017-12-14 21:01:20 UTC


Attachments
ansible_upgrade_log (1.30 MB, text/plain), attached 2017-02-13 09:59 UTC by Xia Zhao


Links
Red Hat Product Errata RHBA-2017:3438 (normal, SHIPPED_LIVE): OpenShift Container Platform 3.6 and 3.5 bug fix and enhancement update, last updated 2017-12-15 01:58:11 UTC

Description Xia Zhao 2017-02-13 09:59:04 UTC
Created attachment 1249787 [details]
ansible_upgrade_log

Description of problem:
When upgrading the logging stack from 3.3.1 to 3.5.0, the ansible script eventually failed at TASK [openshift_logging : command]:
RUNNING HANDLER [openshift_logging : restart master] ***************************

PLAY RECAP *********************************************************************
$master               : ok=428  changed=52   unreachable=0    failed=1   


# oc get po
NAME                          READY     STATUS      RESTARTS   AGE
logging-curator-2-deploy      0/1       Error       0          20m
logging-deployer-vf4l1        0/1       Completed   0          36m
logging-es-5glarbby-2-hrs2f   0/1       Pending     0          19m


The ES pod is unable to start because the node is not labeled correctly:
            -------
  35m        1s        126    {default-scheduler }            Warning        FailedScheduling    pod (logging-es-5glarbby-2-hrs2f) failed to fit in any node
fit failure summary on nodes : MatchNodeSelector (1)
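
For reference, the scheduling events above can be pulled with oc describe (using the pod name from the listing above):

# oc describe po logging-es-5glarbby-2-hrs2f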

Also, the node label "logging-infra-fluentd=true" that was set before the upgrade is lost:

# oc get node --show-labels
NAME                                                STATUS                     AGE       LABELS
$master   Ready,SchedulingDisabled   6h        beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=$master,role=node
$node   Ready                      6h        beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=$node,registry=enabled,role=node,router=enabled
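
As a manual workaround sketch (assuming the missing label is the only scheduling blocker; not verified in this environment), the label could be reapplied with oc label and then re-checked, where $node is the placeholder hostname from the output above:

# oc label node $node logging-infra-fluentd=true
# oc get node --show-labels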


Version-Release number of selected component (if applicable):
https://github.com/openshift/openshift-ansible/

How reproducible:
Always

Steps to Reproduce:
1. Deploy the logging 3.3.1 stack (on OCP 3.5.0) with the journald log driver enabled and node selectors defined in the deployer configmap (see the sketch after this list):
"use-journal": "true"
"curator-nodeselector": "logging-infra-fluentd=true"
"es-nodeselector": "logging-infra-fluentd=true"
"kibana-nodeselector": "logging-infra-fluentd=true"

Also bind ES to hostPath PV storage on the ES node, and wait until log entries show up in the Kibana UI.

2. Upgrade to the logging 3.5.0 stack using ansible.
3. Check the upgrade result.
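
A minimal sketch of the deployer configmap from step 1, assuming the 3.3 deployer convention of a configmap named "logging-deployer" in the logging project (any other settings used in the actual environment are omitted):

# oc create configmap logging-deployer \
    --from-literal=use-journal=true \
    --from-literal=curator-nodeselector=logging-infra-fluentd=true \
    --from-literal=es-nodeselector=logging-infra-fluentd=true \
    --from-literal=kibana-nodeselector=logging-infra-fluentd=true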

Actual results:
Upgrade failed at TASK [openshift_logging : command]

Expected results:
The CEFK (Curator, Elasticsearch, Fluentd, Kibana) pods should be running post upgrade.

Additional info:
Ansible log attached
Repro env attached

Comment 4 Jeff Cantrill 2017-02-16 15:04:57 UTC
Per @ewolinetz, running the playbook during an upgrade scales the components down and removes the node labels.

Upon creation of the 3.5 logging objects, node selectors will only be applied if you set the following in the inventory:

openshift_logging_es_nodeselector
openshift_logging_es_ops_nodeselector
openshift_logging_kibana_nodeselector
openshift_logging_kibana_ops_nodeselector
openshift_logging_curator_nodeselector
openshift_logging_curator_ops_nodeselector
openshift_logging_fluentd_nodeselector
openshift_logging_fluentd_ops_nodeselector

Each of these must be a hash, e.g.: {'logging-infra-fluentd': 'true'}
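
For illustration, a minimal inventory sketch showing where these variables live (assuming the standard [OSEv3:vars] group of an openshift-ansible inventory; only the non-ops selectors are shown):

[OSEv3:vars]
openshift_logging_es_nodeselector={'logging-infra-fluentd': 'true'}
openshift_logging_kibana_nodeselector={'logging-infra-fluentd': 'true'}
openshift_logging_curator_nodeselector={'logging-infra-fluentd': 'true'}
openshift_logging_fluentd_nodeselector={'logging-infra-fluentd': 'true'}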

Comment 5 Xia Zhao 2017-02-20 09:15:26 UTC
The original issue is fixed (though https://bugzilla.redhat.com/show_bug.cgi?id=1424981 was encountered) after setting these in the inventory file used for the upgrade:
openshift_logging_es_nodeselector={'logging-infra-fluentd':'true'}
openshift_logging_kibana_nodeselector={'logging-infra-fluentd':'true'}
openshift_logging_curator_nodeselector={'logging-infra-fluentd':'true'}
openshift_logging_fluentd_nodeselector={'logging-infra-fluentd':'true'}


It would be better to mention this in the docs; otherwise end users may assume that node selectors are inherited from the pre-upgrade deployment.

@jcantril Do you think it is necessary to have a separate doc issue to track this?

Comment 6 Jeff Cantrill 2017-02-20 20:26:26 UTC
In the 3.5 documentation PR we have made the change to reflect this. It may be worth a separate issue to explicitly note the difference between 3.4 and 3.5.

Comment 9 errata-xmlrpc 2017-12-14 21:01:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3438

