Created attachment 1257146 [details]
pod description info

Description of problem:
This issue was found when verifying https://bugzilla.redhat.com/show_bug.cgi?id=1424981. Logging 3.3.1 was installed first with a nodeselector specified in the configmap (steps are in 'Steps to Reproduce' below). After upgrading logging from 3.3.1 to 3.5.0, the pods are stuck in Pending status; describing a pod shows the error "failed to fit in any node fit failure summary on nodes : MatchNodeSelector (1)". Details are in the attached file.

# oc get po
NAME                           READY     STATUS      RESTARTS   AGE
logging-curator-2-deploy       0/1       Pending     0          38m
logging-deployer-d29lm         0/1       Completed   0          1h
logging-es-0j5o77ot-2-deploy   0/1       Pending     0          37m
logging-kibana-2-deploy        0/1       Pending     0          37m

# oc describe po logging-curator-2-deploy
-------------------------------------------snip-------------------------------------------
Events:
  FirstSeen   LastSeen   Count   From                  SubObjectPath   Type      Reason             Message
  ---------   --------   -----   ----                  -------------   ----      ------             -------
  22m         2s         81      {default-scheduler }                  Warning   FailedScheduling   pod (logging-curator-2-deploy) failed to fit in any node
                                                                                                     fit failure summary on nodes : MatchNodeSelector (1)
-------------------------------------------snip-------------------------------------------

Version-Release number of selected component (if applicable):
https://github.com/openshift/openshift-ansible/ -b master
head version: aec9cd888fdc83b8c3f82f469bd07c66aa5183b1

How reproducible:
Always

Steps to Reproduce:
1. Deploy the logging 3.3.1 stacks (on OCP 3.5.0) with the journald log driver enabled and node selectors defined in the configmap:
   "use-journal": "true"
   "curator-nodeselector": "logging-infra-fluentd=true"
   "es-nodeselector": "logging-infra-fluentd=true"
   "kibana-nodeselector": "logging-infra-fluentd=true"
   Also bind ES to hostPV storage on the ES node and wait until log entries show up on the Kibana UI.
2. Upgrade to the logging 3.5.0 stacks using ansible, specifying these parameters in the inventory file (as in the attachment):
   openshift_logging_fluentd_use_journal=true
   openshift_logging_es_nodeselector={'logging-infra-fluentd':'true'}
   openshift_logging_kibana_nodeselector={'logging-infra-fluentd':'true'}
   openshift_logging_curator_nodeselector={'logging-infra-fluentd':'true'}
   openshift_logging_fluentd_nodeselector={'logging-infra-fluentd':'true'}
3. Check the upgrade result.

Actual results:
The upgrade failed; pods fail to fit on any node when a nodeselector is specified.

Expected results:
The upgrade should be successful.

Additional info:
Ansible upgrade log attached.
Inventory file for the upgrade attached.
Created attachment 1257147 [details]
full ansible run log
Created attachment 1257150 [details] ansible inventory file
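For reference, the MatchNodeSelector failure above can be checked by comparing the pending pod's nodeSelector with the labels currently present on the nodes (pod name taken from the output above):

# oc get pod logging-curator-2-deploy -o jsonpath='{.spec.nodeSelector}'
# oc get nodes --show-labels

A pod stays Pending as long as no schedulable node carries every label listed in its nodeSelector.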
Can you attach the dc definitions after the upgrade, as well as node details, so we can confirm we have matching selectors?
Tested again. The nodeSelector 'logging-infra-fluentd: "true"' is present in both the 3.3.1 and 3.5.0 dc, but the pods are still in Pending status after upgrading to 3.5.0 and fail to fit on any node. See the attached logging_331_dc_info.txt and logging_350_dc_info.txt.
Created attachment 1258273 [details] logging_331_dc_info.txt
Created attachment 1258274 [details] logging_350_dc_info.txt
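The nodeSelector values captured in those attachments can also be pulled out directly with something like the loop below (it assumes the stack lives in the default 'logging' project; adjust -n if another project was used):

# for dc in $(oc get dc -o name -n logging); do echo "== $dc"; oc get "$dc" -n logging -o jsonpath='{.spec.template.spec.nodeSelector}'; echo; done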
Can you attach node info: 'oc get nodes -o yaml'
@Junqi I'm asking for the node yaml because I want to confirm the node is labeled with the selector to match the pod specs.
Found the issue: during the upgrade we stop the cluster, which unlabels the node. We then try to start ES for the upgrade, but it can't be placed because it has the same nodeSelector that is used to deploy fluentd.
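In other words, the stop step during the upgrade is roughly equivalent to removing the fluentd label from the node, while the ES dc keeps requiring it (the node name is a placeholder; the dc name is the one from this report):

# oc label node <node-name> logging-infra-fluentd-
# oc get dc logging-es-0j5o77ot -o jsonpath='{.spec.template.spec.nodeSelector}'

With the label gone from every node, the ES pod that the upgrade brings up for the index migration can no longer match its own nodeSelector and stays Pending with MatchNodeSelector.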
@Junqi,

Can you use a different node label (or omit it) for Elasticsearch?
As part of the upgrade entry point, for the index migration we have only Elasticsearch running. However, you are using the same node selector for ES as you are for Fluentd, and with the node not being labelled for Fluentd to be deployed, the ES pod will never be able to be scheduled on that node, so the role will fail waiting for the pod to be available.

If you'd like to still verify that we are adding the node selector to the ES DC, I would recommend pre-labelling the node with that label prior to running the playbook with the logging entry point.
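If you go the pre-labelling route, a minimal example would be the following, run before invoking the playbook (the node name is a placeholder for the node that should host Elasticsearch):

# oc label node <node-name> logging-infra-fluentd=true --overwrite

That way the ES pod has a node it can land on during the index migration even though fluentd is unscheduled.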
Lowering the severity as this is not a blocker and can be resolved by using a nodeSelector that is different from fluentd's.
(In reply to ewolinet from comment #10)
> @Junqi,
> 
> Can you use a different node label (or omit it) for Elasticsearch?
> As part of the upgrade entry point, for the index migration we have only
> Elasticsearch running. However, you are using the same node selector for ES
> as you are for Fluentd, and with the node not being labelled for Fluentd to
> be deployed, the ES pod will never be able to be scheduled on that node, so
> the role will fail waiting for the pod to be available.
> 
> If you'd like to still verify that we are adding the node selector to the ES
> DC, I would recommend pre-labelling the node with that label prior to
> running the playbook with the logging entry point.

I have one question:
1. Logging 3.3.1 was deployed successfully using the same nodeSelector for curator, es and kibana:
   curator-nodeselector=logging-infra-fluentd=true
   es-nodeselector=logging-infra-fluentd=true
   kibana-nodeselector=logging-infra-fluentd=true
Then es_nodeselector and fluentd_nodeselector were given different values for the upgrade to 3.5.0, and the same error happens:
   openshift_logging_es_nodeselector={'logging-infra-elasticsearch':'true'}
   openshift_logging_kibana_nodeselector={'logging-infra-fluentd':'true'}
   openshift_logging_curator_nodeselector={'logging-infra-fluentd':'true'}
   openshift_logging_fluentd_nodeselector={'logging-infra-fluentd':'true'}
If we want to use nodeSelectors for es, kibana and curator, should each of them have a value different from the others and from fluentd_nodeselector, and should they be deployed on different nodes?
They need to be different from fluentd's, since the fluentd label is the mechanism by which fluentd gets deployed/undeployed. In practice, we should use these selectors and affinity/anti-affinity to spread the components across the infra nodes.
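A sketch of inventory values that keep the components apart from fluentd's selector (the elasticsearch label name is the one already tried above; the kibana and curator label names are purely illustrative, and the target nodes must be labelled accordingly before the run):

openshift_logging_es_nodeselector={'logging-infra-elasticsearch':'true'}
openshift_logging_kibana_nodeselector={'logging-infra-kibana':'true'}
openshift_logging_curator_nodeselector={'logging-infra-curator':'true'}
openshift_logging_fluentd_nodeselector={'logging-infra-fluentd':'true'}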
@Junqi To add to what Jeff said above - this is in part due to the difference between how the 3.3 deployer unscheduled the Fluentd pods as part of the upgrade and how the 3.5 ansible role does it. With the deployer, we did not grant it access to label and unlabel nodes, so we simply deleted the logging-fluentd daemonset object and later recreated it from its template in order to unschedule and reschedule Fluentd on the nodes. With the new role, we do have access to label and unlabel nodes, so we use that approach rather than deleting and recreating the daemonset object.
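Roughly, the two mechanisms compare like this (the daemonset name is the stack default mentioned above; the node name is a placeholder, and the deployer's template handling is paraphrased):

3.3 deployer, unschedule/reschedule by deleting the daemonset and later recreating it from its template:
# oc delete daemonset logging-fluentd

3.5 ansible role, leave the daemonset in place and toggle the node label instead:
# oc label node <node-name> logging-infra-fluentd-
# oc label node <node-name> logging-infra-fluentd=true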
With a nodeSelector different from fluentd's, this issue does not occur, but another error happened: https://bugzilla.redhat.com/show_bug.cgi?id=1428711. Setting this defect to VERIFIED.
How did you do the upgrade from 3.3 to 3.5? I'm following the official documentation for 3.4:
https://docs.openshift.com/container-platform/3.4/install_config/upgrading/automated_upgrades.html#preparing-for-an-automated-upgrade

except that I'm using

ansible-playbook -vvv -i /root/ansible-inventory playbooks/byo/openshift-cluster/upgrades/v3_5/upgrade.yml

And I get this error message:

MSG:

openshift_release is 3.3 which is not a valid release for a 3.5 upgrade
(In reply to Rich Megginson from comment #16)
> How did you do the upgrade from 3.3 to 3.5? I'm following the official
> documentation for 3.4:
> https://docs.openshift.com/container-platform/3.4/install_config/upgrading/
> automated_upgrades.html#preparing-for-an-automated-upgrade
> 
> except that I'm using
> 
> ansible-playbook -vvv -i /root/ansible-inventory
> playbooks/byo/openshift-cluster/upgrades/v3_5/upgrade.yml
> 
> And I get this error message:
> 
> MSG:
> 
> openshift_release is 3.3 which is not a valid release for a 3.5 upgrade

We specified the following ansible parameters to upgrade from 3.3.1 to 3.5.0:
openshift_logging_install_logging=false
openshift_logging_upgrade_logging=true

Since the upgrade hit a lot of errors, we have not yet upgraded successfully. This defect is only about the nodeSelector; with a nodeSelector different from fluentd's, the issue does not occur. We will continue the upgrade testing and will let you know once the upgrade process succeeds.
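For reference, that means the logging upgrade was driven through the logging playbook entry point rather than the cluster upgrade playbook; roughly like the following (the playbook path is an assumption and depends on the openshift-ansible checkout, and the two variables can also be set in the inventory file as in the attachment):

# ansible-playbook -i /root/ansible-inventory playbooks/byo/openshift-cluster/openshift-logging.yml -e openshift_logging_install_logging=false -e openshift_logging_upgrade_logging=true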
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1129