Bug 1421630

Summary: [Intservice_public_324] logging upgrade failed since node not well labeled
Product: OpenShift Container Platform
Component: Installer
Version: 3.5.0
Target Release: 3.5.z
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Xia Zhao <xiazhao>
Assignee: Jeff Cantrill <jcantril>
QA Contact: Xia Zhao <xiazhao>
CC: aos-bugs, jokerman, mmccomas
Doc Type: No Doc Update
Last Closed: 2017-12-14 21:01:20 UTC
Type: Bug
Attachments: ansible_upgrade_log

Description Xia Zhao 2017-02-13 09:59:04 UTC
Created attachment 1249787 [details]
ansible_upgrade_log

Description of problem:
When upgrading the logging stack from 3.3.1 to 3.5.0, the ansible playbook eventually failed at TASK [openshift_logging : command]:

RUNNING HANDLER [openshift_logging : restart master] ***************************

PLAY RECAP *********************************************************************
$master               : ok=428  changed=52   unreachable=0    failed=1   


# oc get po
NAME                          READY     STATUS      RESTARTS   AGE
logging-curator-2-deploy      0/1       Error       0          20m
logging-deployer-vf4l1        0/1       Completed   0          36m
logging-es-5glarbby-2-hrs2f   0/1       Pending     0          19m


The ES pod is unable to start because it cannot be scheduled onto any node with a matching label:
  35m        1s        126    {default-scheduler }            Warning        FailedScheduling    pod (logging-es-5glarbby-2-hrs2f) failed to fit in any node
fit failure summary on nodes : MatchNodeSelector (1)
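
For reference, scheduling events like the above can be inspected with the standard commands below; the pod name is the one from this run, and the project name "logging" is an assumption:

# oc describe po logging-es-5glarbby-2-hrs2f -n logging
# oc get events -n logging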

In addition, the node label "logging-infra-fluentd=true" that was present before the upgrade has been lost:

# oc get node --show-labels
NAME                                                STATUS                     AGE       LABELS
$master   Ready,SchedulingDisabled   6h        beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=$master,role=node
$node   Ready                      6h        beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=$node,registry=enabled,role=node,router=enabled
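
As a hedged workaround sketch (not verified against this upgrade path), the lost label can be re-applied manually so the ES pod can be scheduled again, where $node is the node name shown above:

# oc label node $node logging-infra-fluentd=true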


Version-Release number of selected component (if applicable):
https://github.com/openshift/openshift-ansible/

How reproducible:
Always

Steps to Reproduce:
1. Deploy the logging 3.3.1 stack (on OCP 3.5.0) with the journald log driver enabled and node selectors defined in the deployer configmap (see the example command after this list):
"use-journal": "true"
"curator-nodeselector": "logging-infra-fluentd=true"
"es-nodeselector": "logging-infra-fluentd=true"
"kibana-nodeselector": "logging-infra-fluentd=true"

Also bind ES to hostPath PV storage on the ES node, and wait until log entries are shown on the Kibana UI.

2. Upgrade to the logging 3.5.0 stack using ansible.
3. Check the upgrade result.
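
For illustration, the deployer configmap in step 1 can be created with something like the command below; the configmap name "logging-deployer" follows the 3.3 deployer convention, and the exact key set shown is an assumption based on the values quoted above:

# oc create configmap logging-deployer \
    --from-literal use-journal=true \
    --from-literal es-nodeselector=logging-infra-fluentd=true \
    --from-literal curator-nodeselector=logging-infra-fluentd=true \
    --from-literal kibana-nodeselector=logging-infra-fluentd=true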

Actual results:
Upgrade failed at TASK [openshift_logging : command]

Expected results:
The Curator/ES/Fluentd/Kibana (CEFK) pods should be running post-upgrade.

Additional info:
Ansible log attached
Repro env attached

Comment 4 Jeff Cantrill 2017-02-16 15:04:57 UTC
Per @ewolinetz, running the playbook during an upgrade scales the components down and removes the node labels.

Upon creation of the 3.5 logging objects, node selectors will only be applied if you set the following in the inventory:

openshift_logging_es_nodeselector
openshift_logging_es_ops_nodeselector
openshift_logging_kibana_nodeselector
openshift_logging_kibana_ops_nodeselector
openshift_logging_curator_nodeselector
openshift_logging_curator_ops_nodeselector
openshift_logging_fluentd_nodeselector
openshift_logging_fluentd_ops_nodeselector

each of which must be a hash like {'logging-infra-fluentd': 'true'}.
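
For illustration only, a minimal inventory fragment with the non-ops variables relevant to this report; placing them under [OSEv3:vars] is an assumption about the inventory layout:

[OSEv3:vars]
openshift_logging_es_nodeselector={'logging-infra-fluentd': 'true'}
openshift_logging_kibana_nodeselector={'logging-infra-fluentd': 'true'}
openshift_logging_curator_nodeselector={'logging-infra-fluentd': 'true'}
openshift_logging_fluentd_nodeselector={'logging-infra-fluentd': 'true'}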

Comment 5 Xia Zhao 2017-02-20 09:15:26 UTC
The original issue is fixed (with https://bugzilla.redhat.com/show_bug.cgi?id=1424981 encountered along the way) after setting these in the inventory file used for the upgrade:
openshift_logging_es_nodeselector={'logging-infra-fluentd':'true'}
openshift_logging_kibana_nodeselector={'logging-infra-fluentd':'true'}
openshift_logging_curator_nodeselector={'logging-infra-fluentd':'true'}
openshift_logging_fluentd_nodeselector={'logging-infra-fluentd':'true'}
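
A quick way to confirm the labels and pods after rerunning the upgrade (the project name "logging" is an assumption):

# oc get node --show-labels | grep logging-infra-fluentd
# oc get po -n logging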


It would be better to mention this in the docs; otherwise end users may think that node selectors are inherited from the pre-upgrade deployment.

@jcantril Do you think it is necessary to have a separate doc issue to track this?

Comment 6 Jeff Cantrill 2017-02-20 20:26:26 UTC
We have made the change in the 3.5 documentation PR to reflect this. It may be worth a separate issue to explicitly note the difference between 3.4 and 3.5.

Comment 9 errata-xmlrpc 2017-12-14 21:01:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3438