Bug 1426511 - Failed to fit node if nodeselector specified when upgrading logging stacks via ansible
Summary: Failed to fit node if nodeselector specified when upgrading logging stacks via...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Assignee: ewolinet
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-02-24 05:41 UTC by Junqi Zhao
Modified: 2017-07-24 14:11 UTC
CC List: 5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-26 05:36:20 UTC
Target Upstream Version:
Embargoed:


Attachments
pod description info (8.52 KB, text/plain) - 2017-02-24 05:41 UTC, Junqi Zhao
fully ansible running log (873.10 KB, text/plain) - 2017-02-24 05:42 UTC, Junqi Zhao
ansible inventory file (986 bytes, text/plain) - 2017-02-24 05:52 UTC, Junqi Zhao
logging_331_dc_info.txt (11.42 KB, text/plain) - 2017-02-28 06:21 UTC, Junqi Zhao
logging_350_dc_info.txt (22.56 KB, text/plain) - 2017-02-28 06:22 UTC, Junqi Zhao


Links
Red Hat Product Errata RHBA-2017:1129 (normal, SHIPPED_LIVE): OpenShift Container Platform 3.5, 3.4, 3.3, and 3.2 bug fix update, last updated 2017-04-26 09:35:35 UTC

Description Junqi Zhao 2017-02-24 05:41:09 UTC
Created attachment 1257146 [details]
pod description info

Description of problem:

This issue was found while verifying https://bugzilla.redhat.com/show_bug.cgi?id=1424981.
Install logging 3.3.1 first with a nodeSelector specified in the configmap (see 'Steps to Reproduce' below). After upgrading logging from 3.3.1 to 3.5.0, the pods stay in Pending status; describing a pod shows the error "failed to fit in any node fit failure summary on nodes : MatchNodeSelector (1)". Details are in the attached file.

# oc get po
NAME                           READY     STATUS      RESTARTS   AGE
logging-curator-2-deploy       0/1       Pending     0          38m
logging-deployer-d29lm         0/1       Completed   0          1h
logging-es-0j5o77ot-2-deploy   0/1       Pending     0          37m
logging-kibana-2-deploy        0/1       Pending     0          37m

# oc describe po logging-curator-2-deploy
-------------------------------------------snip-------------------------------------------
Events:
  FirstSeen    LastSeen    Count    From            SubObjectPath    Type        Reason            Message
  ---------    --------    -----    ----            -------------    --------    ------            -------
  22m        2s        81    {default-scheduler }            Warning        FailedScheduling    pod (logging-curator-2-deploy) failed to fit in any node
fit failure summary on nodes : MatchNodeSelector (1)
-------------------------------------------snip-------------------------------------------

Version-Release number of selected component (if applicable):
https://github.com/openshift/openshift-ansible/ -b master
head version: aec9cd888fdc83b8c3f82f469bd07c66aa5183b1


How reproducible:
Always

Steps to Reproduce:
1. Deploy logging 3.3.1 stacks (on OCP 3.5.0) with the journald log driver enabled and node selectors defined in the configmap:
"use-journal": "true"
"curator-nodeselector": "logging-infra-fluentd=true"
"es-nodeselector": "logging-infra-fluentd=true"
"kibana-nodeselector": "logging-infra-fluentd=true"

Also bind ES to host PV storage on the ES node and wait until log entries are shown in the Kibana UI (a sketch of creating this configmap follows the steps below).

2. Upgrade to logging 3.5.0 stacks using ansible, specifying these parameters in the inventory file (as in the attachment):
openshift_logging_fluentd_use_journal=true

openshift_logging_es_nodeselector={'logging-infra-fluentd':'true'}
openshift_logging_kibana_nodeselector={'logging-infra-fluentd':'true'}
openshift_logging_curator_nodeselector={'logging-infra-fluentd':'true'}
openshift_logging_fluentd_nodeselector={'logging-infra-fluentd':'true'}

3. Check the upgrade result.
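For reference, a minimal sketch of how the step-1 configmap could be created for the 3.3.1 deployer. The configmap name "logging-deployer" and the oc invocation are assumptions based on the logging deployer usage of that era; only the keys shown in step 1 come from this report:

# oc create configmap logging-deployer \
    --from-literal use-journal=true \
    --from-literal curator-nodeselector=logging-infra-fluentd=true \
    --from-literal es-nodeselector=logging-infra-fluentd=true \
    --from-literal kibana-nodeselector=logging-infra-fluentd=true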

Actual results:
The upgrade failed; pods failed to fit on any node when a nodeSelector was specified.

Expected results:
Upgrade should be successful

Additional info:
Ansible upgrade log attached.
Inventory file for the upgrade attached.

Comment 1 Junqi Zhao 2017-02-24 05:42:42 UTC
Created attachment 1257147 [details]
fully ansible running log

Comment 2 Junqi Zhao 2017-02-24 05:52:52 UTC
Created attachment 1257150 [details]
ansible inventory file

Comment 3 Jeff Cantrill 2017-02-27 19:14:14 UTC
Can you attach the dc definitions after the upgrade, as well as node details, so we can confirm we have matching selectors?

Comment 4 Junqi Zhao 2017-02-28 06:21:05 UTC
Tested again. The nodeSelector logging-infra-fluentd: "true" is present in both the 3.3.1 and the 3.5.0 dc, but the pod is still in Pending status after upgrading to 3.5.0 and fails to fit on any node.
See the attached logging_331_dc_info.txt and logging_350_dc_info.txt.

Comment 5 Junqi Zhao 2017-02-28 06:21:30 UTC
Created attachment 1258273 [details]
logging_331_dc_info.txt

Comment 6 Junqi Zhao 2017-02-28 06:22:02 UTC
Created attachment 1258274 [details]
logging_350_dc_info.txt

Comment 7 Jeff Cantrill 2017-02-28 14:21:35 UTC
Can you attach the node info: 'oc get nodes -o yaml'?

Comment 8 Jeff Cantrill 2017-02-28 14:25:13 UTC
@Junqi, I'm asking for the node yaml because I want to confirm the node is labeled with the selector that matches the pod specs.
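To make the requested check concrete, the node labels can be inspected with standard oc commands (the node name below is a placeholder):

# oc get nodes --show-labels
# oc get node <node-name> -o yaml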

Comment 9 Jeff Cantrill 2017-02-28 14:46:47 UTC
Found the issue: during the upgrade we stop the cluster, which unlabels the node. We then try to start ES for the upgrade, but it can't be placed because it has the same nodeSelector that is used to deploy Fluentd.

Comment 10 ewolinet 2017-02-28 14:53:31 UTC
@Junqi,

Can you use a different node label (or omit it) for Elasticsearch?
As part of the upgrade entry point, only Elasticsearch is running during the index migration. However, you are using the same node selector for ES as for Fluentd, and because the node is not labelled for Fluentd to be deployed, the ES pod can never be scheduled on that node, so the role will fail waiting for the pod to be available.

If you'd still like to verify that we are adding the node selector to the ES DC, I would recommend pre-labelling the node with that label prior to running the playbook with the logging entry point.
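A minimal sketch of that workaround, assuming <node-name> is the node that should host ES and that the logging entry-point playbook lives at the path below (the path is an assumption and may differ by openshift-ansible version):

# oc label node <node-name> logging-infra-fluentd=true --overwrite
# ansible-playbook -i /path/to/inventory playbooks/byo/openshift-cluster/openshift-logging.yml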

Comment 11 Jeff Cantrill 2017-02-28 16:39:21 UTC
Lowering the severity, as this is not a blocker and can be resolved by using a nodeSelector that is different from Fluentd's.

Comment 12 Junqi Zhao 2017-03-02 07:15:44 UTC
(In reply to ewolinet from comment #10)
> @Junqi,
> 
> Can you use a different node label (or omit it) for Elasticsearch?
> As part of the upgrade entry point, for the index migration we have only
> Elasticsearch running. However you are using the same node selector for ES
> as you are for Fluentd, and with the node not being labelled for Fluentd to
> be deployed, this means that the ES pod will never be able to be deployed
> scheduled on that node so the role will fail waiting for the pod to be
> available.
> 
> If you'd like to still verify that we are adding the node selector to the ES
> DC, i would recommend pre-labelling the node with that label prior to
> running the playbook with the logging entry point.


I have one question:
1. Deployed logging 3.3.1 successfully, using the same nodeSelector for curator, es and kibana:
curator-nodeselector=logging-infra-fluentd=true
es-nodeselector=logging-infra-fluentd=true
kibana-nodeselector=logging-infra-fluentd=true

For the upgrade to 3.5.0, es_nodeselector and fluentd_nodeselector were given different values, yet the same error happens:
openshift_logging_es_nodeselector={'logging-infra-elasticsearch':'true'}
openshift_logging_kibana_nodeselector={'logging-infra-fluentd':'true'}
openshift_logging_curator_nodeselector={'logging-infra-fluentd':'true'}
openshift_logging_fluentd_nodeselector={'logging-infra-fluentd':'true'}

If we want to use nodeSelectors for es, kibana and curator, should they have values different from each other and from fluentd_nodeselector, and should the components be deployed on different nodes?

Comment 13 Jeff Cantrill 2017-03-02 15:10:02 UTC
They need to be different than fluentd's, since this is the mechanism by which fluentd gets deployed/undeployed. In practice, we should use these selectors and affinity/anti-affinity to spread the components across the infra nodes.
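As an illustration of that advice, an inventory sketch with per-component selectors distinct from Fluentd's (the kibana and curator label names are examples, not values mandated by the role; the matching labels must also be applied to the intended nodes before the playbook runs):

openshift_logging_es_nodeselector={'logging-infra-elasticsearch':'true'}
openshift_logging_kibana_nodeselector={'logging-infra-kibana':'true'}
openshift_logging_curator_nodeselector={'logging-infra-curator':'true'}
openshift_logging_fluentd_nodeselector={'logging-infra-fluentd':'true'}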

Comment 14 ewolinet 2017-03-02 16:15:38 UTC
@Junqi

To add to what Jeff said above:

This is in part due to the difference between how the 3.3 deployer unscheduled the Fluentd pods as part of the upgrade and how the 3.5 ansible role does it.

With the deployer, we did not grant it access to label and unlabel nodes, so we simply deleted the logging-fluentd daemonset object and later recreated it from its template in order to unschedule and then reschedule the pods on the nodes.

With the new role, we have access to label and unlabel nodes so we use that approach rather than deleting and recreating the daemonset object.
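Roughly, the two mechanisms look like this (the node name and template name are placeholders; the exact commands run by the deployer and the role are not shown in this report):

3.3 deployer style - delete and later recreate the daemonset:
# oc delete daemonset logging-fluentd
# oc process <fluentd-template> | oc create -f -

3.5 role style - toggle the node label that the daemonset's nodeSelector matches:
# oc label node <node-name> logging-infra-fluentd-
# oc label node <node-name> logging-infra-fluentd=true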

Comment 15 Junqi Zhao 2017-03-03 08:51:32 UTC
Using a nodeSelector different from Fluentd's, this issue does not occur.
But another error happened: https://bugzilla.redhat.com/show_bug.cgi?id=1428711

Setting this defect to VERIFIED.

Comment 16 Rich Megginson 2017-03-04 02:58:56 UTC
How did you do the upgrade from 3.3 to 3.5?  I'm following the official documentation for 3.4: https://docs.openshift.com/container-platform/3.4/install_config/upgrading/automated_upgrades.html#preparing-for-an-automated-upgrade

except that I'm using 

ansible-playbook -vvv -i /root/ansible-inventory playbooks/byo/openshift-cluster/upgrades/v3_5/upgrade.yml

And I get this error message:

MSG:

openshift_release is 3.3 which is not a valid release for a 3.5 upgrade

Comment 17 Junqi Zhao 2017-03-06 00:32:02 UTC
(In reply to Rich Megginson from comment #16)
> How did you do the upgrade from 3.3 to 3.5?  I'm following the official
> documentation for 3.4:
> https://docs.openshift.com/container-platform/3.4/install_config/upgrading/
> automated_upgrades.html#preparing-for-an-automated-upgrade
> 
> except that I'm using 
> 
> ansible-playbook -vvv -i /root/ansible-inventory
> playbooks/byo/openshift-cluster/upgrades/v3_5/upgrade.yml
> 
> And I get this error message:
> 
> MSG:
> 
> openshift_release is 3.3 which is not a valid release for a 3.5 upgrade

We specified the following ansible parameters to upgrade from 3.3.1 to 3.5.0:

openshift_logging_install_logging=false
openshift_logging_upgrade_logging=true


Since the upgrade hits a lot of errors, we have not yet completed a successful upgrade. This defect is about the nodeSelector; with a nodeSelector different from Fluentd's, the issue does not exist. We will continue the upgrade testing and will let you know if the upgrade process succeeds.
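For completeness, a sketch of how those flags might be passed on the command line together with the logging entry-point playbook (the playbook path is an assumption based on the openshift-ansible layout of that era; only the two openshift_logging_* flags come from this comment, and they can equally be set in the inventory as described above):

# ansible-playbook -i /path/to/inventory playbooks/byo/openshift-cluster/openshift-logging.yml \
    -e openshift_logging_install_logging=false \
    -e openshift_logging_upgrade_logging=true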

Comment 19 errata-xmlrpc 2017-04-26 05:36:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1129

