Bug 1738758

Summary: the logging-es-ops nodeSelector is changed after upgrade
Product: OpenShift Container Platform
Reporter: Anping Li <anli>
Component: Logging
Assignee: Noriko Hosoi <nhosoi>
Status: CLOSED DEFERRED
QA Contact: Anping Li <anli>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.11.0
CC: aos-bugs, jcantril, nhosoi, rmeggins
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-02-02 01:32:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
- deploymentconfig before upgrade (Flags: none)
- deploymentconfig after upgrade (Flags: none)

Description Anping Li 2019-08-08 05:57:56 UTC
Description of problem:

The nodeSelector is changed in the logging-es-ops deploymentconfigs after the upgrade, so the logging-es-ops pods couldn't be started.

1) nodeSelector before Upgrade:
cat elasticsearch-dc-before-upgrade.json | jq '.items[].metadata.name, .items[].spec.template.spec.nodeSelector'
"logging-es-data-master-ajbqhp8h"
"logging-es-data-master-telafmeq"
"logging-es-ops-data-master-0fr84k1a"
"logging-es-ops-data-master-9961o92h"
"logging-es-ops-data-master-o7nhcbo4"
{
  "logging-es-node": "1"
}
{
  "logging-es-node": "0"
}
{
  "logging-es-ops-node": "2"
}
{
  "logging-es-ops-node": "0"
}
{
  "logging-es-ops-node": "1"
}
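The two jq streams above print the deploymentconfig names and their nodeSelectors separately, which makes pairing them by eye error-prone. As an illustration only (sample data modeled on the output above; field paths follow the DeploymentConfig schema), a small Python sketch pairs each name with its selector and flags any es-ops deploymentconfig whose selector has lost the `logging-es-ops-node` key, which is the symptom reported in this bug:

```python
# Sample DC list modeled on the output in this bug report; the real
# input would come from `oc get dc -o json` in the logging project.
dc_list = {
    "items": [
        {"metadata": {"name": "logging-es-data-master-ajbqhp8h"},
         "spec": {"template": {"spec": {"nodeSelector": {"logging-es-node": "1"}}}}},
        {"metadata": {"name": "logging-es-ops-data-master-0fr84k1a"},
         "spec": {"template": {"spec": {"nodeSelector": {"logging-es-node": "2"}}}}},
    ]
}

def selectors_by_name(dc_list):
    """Pair each deploymentconfig name with its nodeSelector."""
    return {
        item["metadata"]["name"]: item["spec"]["template"]["spec"].get("nodeSelector", {})
        for item in dc_list["items"]
    }

def broken_ops_dcs(dc_list):
    """Return es-ops DCs whose selector lacks the logging-es-ops-node key."""
    return [
        name
        for name, sel in selectors_by_name(dc_list).items()
        if "-es-ops-" in name and "logging-es-ops-node" not in sel
    ]

print(broken_ops_dcs(dc_list))  # prints ['logging-es-ops-data-master-0fr84k1a']
```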


2) Logging Inventory used for upgrade
openshift_logging_install_logging=true
openshift_logging_es_cluster_size=2
openshift_logging_es_number_of_replicas=1
openshift_logging_es_number_of_shards=1
openshift_logging_es_memory_limit=2Gi
openshift_logging_es_nodeselector={"node-role.kubernetes.io/infra": "true"}

openshift_logging_use_ops=true
openshift_logging_es_ops_cluster_size=3
openshift_logging_es_ops_number_of_replicas=1
openshift_logging_es_ops_number_of_shards=1
openshift_logging_es_ops_memory_limit=2Gi
openshift_logging_es_ops_nodeselector={"node-role.kubernetes.io/compute": "true"}

openshift_logging_elasticsearch_storage_type=hostmount

3) nodeSelector after Upgrade:
cat elasticsearch-dc-after.json | jq '.items[].metadata.name, .items[].spec.template.spec.nodeSelector'
"logging-es-data-master-ajbqhp8h"
"logging-es-data-master-telafmeq"
"logging-es-ops-data-master-0fr84k1a"
"logging-es-ops-data-master-9961o92h"
"logging-es-ops-data-master-o7nhcbo4"
{
  "logging-es-node": "1"
}
{
  "logging-es-node": "0"
}
{
  "logging-es-node": "2"
}
{
  "logging-es-node": "0"
}
{
  "logging-es-node": "1"
}

Version-Release number of selected component (if applicable):
openshift3/ose-ansible:v3.11.135 

How reproducible:
Always

Steps to Reproduce:
1. deploy logging using openshift_logging_use_ops=true

openshift_logging_install_logging=true
openshift_logging_es_cluster_size=2
openshift_logging_es_number_of_replicas=1
openshift_logging_es_number_of_shards=1
openshift_logging_es_memory_limit=2Gi
openshift_logging_es_nodeselector={"node-role.kubernetes.io/infra": "true"}

openshift_logging_use_ops=true
openshift_logging_es_ops_cluster_size=3
openshift_logging_es_ops_number_of_replicas=1
openshift_logging_es_ops_number_of_shards=1
openshift_logging_es_ops_memory_limit=2Gi
openshift_logging_es_ops_nodeselector={"node-role.kubernetes.io/compute": "true"}

2. Add hostpath volumes and nodeSelectors to the ES and ES-Ops deploymentconfigs
"logging-es-data-master-ajbqhp8h"
"logging-es-data-master-telafmeq"
"logging-es-ops-data-master-0fr84k1a"
"logging-es-ops-data-master-9961o92h"
"logging-es-ops-data-master-o7nhcbo4"
{
  "logging-es-node": "1"
}
{
  "logging-es-node": "0"
}
{
  "logging-es-ops-node": "2"
}
{
  "logging-es-ops-node": "0"
}
{
  "logging-es-ops-node": "1"
}

3. Upgrade to latest version using openshift3/ose-ansible:v3.11.135 

Actual results:
The logging-es-ops pods couldn't be started, as the nodeSelector is changed after the upgrade:
cat elasticsearch-dc-after.json | jq '.items[].metadata.name, .items[].spec.template.spec.nodeSelector'
"logging-es-data-master-ajbqhp8h"
"logging-es-data-master-telafmeq"
"logging-es-ops-data-master-0fr84k1a"
"logging-es-ops-data-master-9961o92h"
"logging-es-ops-data-master-o7nhcbo4"
{
  "logging-es-node": "1"
}
{
  "logging-es-node": "0"
}
{
  "logging-es-node": "2"
}
{
  "logging-es-node": "0"
}
{
  "logging-es-node": "1"
}

Comment 1 Anping Li 2019-08-08 06:13:34 UTC
Created attachment 1601703 [details]
deploymentconfig before upgrade

Comment 2 Anping Li 2019-08-08 06:14:07 UTC
Created attachment 1601704 [details]
deploymentconfig after upgrade

Comment 3 Anping Li 2019-08-08 06:35:21 UTC
There is a timespan between the deploymentconfig change and the logging-es-ops deploymentconfig rollout. Workaround: correct the nodeSelector in the logging-es-ops deploymentconfigs before the rollout; you can correct them at the point when logging-es is restarting.
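The workaround above amounts to putting the original `logging-es-ops-node` keyed selector back on each es-ops deploymentconfig before it rolls out. A hedged sketch of what that patch could look like follows; the DC name and selector values come from this report, but the project name (`openshift-logging`) and the use of a strategic-merge patch with the `$patch: replace` directive (so the wrong `logging-es-node` key is dropped rather than merged) are assumptions about how one would apply it, not a confirmed procedure from this bug:

```python
import json

def ops_selector_patch(node_index):
    """Strategic-merge patch restoring the es-ops nodeSelector.

    "$patch": "replace" makes the whole nodeSelector map be replaced,
    dropping the incorrect "logging-es-node" key instead of merging it.
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "nodeSelector": {
                        "$patch": "replace",
                        "logging-es-ops-node": str(node_index),
                    }
                }
            }
        }
    }

def oc_patch_command(dc_name, node_index):
    """Build the oc invocation; namespace is an assumption."""
    patch = json.dumps(ops_selector_patch(node_index))
    return f"oc -n openshift-logging patch dc/{dc_name} -p '{patch}'"

print(oc_patch_command("logging-es-ops-data-master-0fr84k1a", 2))
```

Run once per es-ops deploymentconfig, with the node index each DC carried before the upgrade.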

Comment 4 Noriko Hosoi 2019-09-17 22:53:34 UTC
Hi Anping,

Sorry for my ignorance, but I'd like to learn a couple more things...

1) Could you share these outputs?

   - the ansible log from the upgrade
   - ES log when it fails to start
   - oc get events | grep Warning

2) If you label nodes and nodeSelector like this from the beginning, the logging-es pods do not start?

The logging-es-ops pods couldn't be started, as the nodeSelector is changed after the upgrade:
cat elasticsearch-dc-after.json | jq '.items[].metadata.name, .items[].spec.template.spec.nodeSelector'
"logging-es-data-master-ajbqhp8h"
"logging-es-data-master-telafmeq"
"logging-es-ops-data-master-0fr84k1a"
"logging-es-ops-data-master-9961o92h"
"logging-es-ops-data-master-o7nhcbo4"
{
  "logging-es-node": "1"
}
{
  "logging-es-node": "0"
}
{
  "logging-es-node": "2"
}
{
  "logging-es-node": "0"
}
{
  "logging-es-node": "1"
}

Comment 5 Anping Li 2019-10-22 03:12:01 UTC
Sorry for the delay; I will provide the logs in the next 3.11 testing.

Comment 7 Jeff Cantrill 2020-02-02 01:32:09 UTC
Closing DEFERRED. Please reopen if problem persists and there are open customer cases.