Bug 1842608

Summary: Logging upgrade to 3.11.219 fails
Product: OpenShift Container Platform
Reporter: Jon <jharding>
Component: Logging
Assignee: Sergey Yedrikov <syedriko>
Status: CLOSED WONTFIX
QA Contact: Anping Li <anli>
Severity: high
Docs Contact:
Priority: high
Version: 3.11.0
CC: aos-bugs, jcantril, mburke, periklis, syedriko
Target Milestone: ---
Keywords: Reopened
Target Release: 3.11.z
Flags: jharding: needinfo-
Hardware: Unspecified
OS: Unspecified
Whiteboard: logging-core
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-12-15 19:36:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Jon 2020-06-01 17:08:21 UTC
Description of problem:
Failing to upgrade ELK after cluster upgrade 

Version-Release number of selected component (if applicable):
3.11.153  --> 3.11.219

How reproducible:
100% so far

Steps to Reproduce:
1. Upgrade the cluster from 3.11.153 to 3.11.219
2. Upgrade the ELK stack per https://docs.openshift.com/container-platform/3.11/upgrading/automated_upgrades.html#upgrading-efk-logging-stack


Actual results:
TASK [openshift_logging : fail] ******************************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_logging/tasks/main.yaml:2
Monday 01 June 2020  10:45:54 -0600 (0:00:00.555)       0:00:18.187 *********** 
fatal: [masterc01.testnet.net]: FAILED! => {
    "changed": false, 
    "msg": "Only one Fluentd nodeselector key pair should be provided"

Expected results:
Upgrade completes without failure

Additional info:
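For reference, the failing task is the sanity check at the top of roles/openshift_logging/tasks/main.yaml, which fires when the Fluentd nodeselector inventory variable carries more than one key/value pair. A minimal sketch of the relevant inventory line (assuming the standard openshift_logging_fluentd_nodeselector variable; the value shown is only an example):

```yaml
# Hypothetical inventory fragment -- exactly one key/value pair is allowed.
# A second pair, e.g. adding 'region': 'infra' alongside the one below,
# would trigger "Only one Fluentd nodeselector key pair should be provided".
openshift_logging_fluentd_nodeselector: {'logging-infra-fluentd': 'true'}
```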

Comment 2 Jeff Cantrill 2020-06-08 00:10:54 UTC
Please attach your inventory file and the entire log. It looks like the error is telling us what the problem is: "Only one Fluentd nodeselector key pair should be provided"

Comment 4 Periklis Tsirakidis 2020-07-10 14:26:30 UTC
Moving to UpcomingSprint as unlikely to be resolved by EOS

Comment 6 Jeff Cantrill 2020-08-20 13:39:00 UTC

*** This bug has been marked as a duplicate of bug 1848454 ***

Comment 11 Sergey Yedrikov 2020-09-10 14:36:31 UTC
Hi Jon,

A couple of questions:
Is there a particular reason you're not relying on the playbooks/openshift-logging/config.yml playbook, and are instead shutting things down, patching them, etc. manually?
At which point of this workflow did you run the logging dump tool? Could you run it just before you run the Ansible playbook?

Regards,
Sergey.

Comment 13 Jeff Cantrill 2020-09-12 01:58:23 UTC
Moving to UpcomingSprint as unlikely to be addressed by EOD

Comment 15 Sergey Yedrikov 2020-09-22 15:41:43 UTC
Filed a docs issue for 3.11: https://github.com/openshift/openshift-docs/issues/25677, closing this one.

Comment 16 Sergey Yedrikov 2020-10-20 21:33:56 UTC
From the update documentation:

3. Run the openshift-logging/config.yml playbook according to the deploying the EFK stack instructions to complete the logging upgrade. You run the installation playbook for the new OpenShift Container Platform version to upgrade the logging deployment.

so /usr/share/ansible/openshift-ansible/playbooks/openshift-logging/config.yml

So yes, setting the nodeSelector to non-existing: true is what the instructions say to do, and this causes the DS to terminate the logging-fluentd pods as expected.

Then we have to remove the other nodeSelector so that we don't get the error about nodeSelectors and break the updating of the EFK stack, so I run the oc command you had me try:
oc patch ds logging-fluentd --type json -p '[{ "op": "remove", "path": "/spec/template/spec/nodeSelector/logging-infra-fluentd" }]'

This works. Then we update the EFK stack:
ansible-playbook -i inventory /usr/share/ansible/openshift-ansible/playbooks/openshift-logging/config.yml
The EFK stack gets updated; however, during that process, as I showed in the last email, it labels all nodes with non-existing: true, and the DS starts spinning up logging-fluentd pods of the correct version.

After the update I remove the patch that was applied and add back the original:
oc patch ds logging-fluentd -p '{"spec": {"template": {"spec": {"nodeSelector": {"logging-infra-fluentd": "true"}}}}}'
oc patch ds logging-fluentd --type json -p '[{ "op": "remove", "path": "/spec/template/spec/nodeSelector/non-existing" }]'

It all works except that now all nodes have an extra label of non-existing: true. 
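To clean up that leftover label, something along these lines should strip non-existing: true from every node again (a sketch, not from the docs; the trailing dash is oc's label-removal syntax, and this assumes cluster-admin access):

```shell
# Remove the stray "non-existing" label from all nodes
# (a trailing "-" after the label key tells oc to delete it).
oc label nodes --all non-existing-
```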

I found this out by updating from 3.11.153 to 3.11.248. I then went to update to 3.11.286, and that is when I saw this: when I went to patch out logging-infra-fluentd: true, it did not terminate the logging-fluentd pods, which started me looking into why.

Comment 18 Sergey Yedrikov 2021-01-04 14:51:10 UTC
Docs PR https://github.com/openshift/openshift-docs/pull/27310