Bug 1842608 - Logging upgrade to 3.11.219 fails
Summary: Logging upgrade to 3.11.219 fails
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Sergey Yedrikov
QA Contact: Anping Li
URL:
Whiteboard: logging-core
Depends On:
Blocks:
 
Reported: 2020-06-01 17:08 UTC by Jon
Modified: 2021-01-04 14:51 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-15 19:36:46 UTC
Target Upstream Version:
Embargoed:
jharding: needinfo-



Description Jon 2020-06-01 17:08:21 UTC
Description of problem:
The EFK logging stack fails to upgrade after the cluster upgrade

Version-Release number of selected component (if applicable):
3.11.153  --> 3.11.219

How reproducible:
100% so far

Steps to Reproduce:
1. Upgrade the cluster from 3.11.153 to 3.11.219
2. Upgrade the EFK stack per https://docs.openshift.com/container-platform/3.11/upgrading/automated_upgrades.html#upgrading-efk-logging-stack (example invocation sketched below)
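
Step 2 boils down to running the logging playbook against the cluster inventory; a minimal sketch, assuming the default openshift-ansible install path and an inventory file at /etc/ansible/hosts:

# Assumed inventory path; substitute your own.
ansible-playbook -i /etc/ansible/hosts \
    /usr/share/ansible/openshift-ansible/playbooks/openshift-logging/config.yml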


Actual results:
TASK [openshift_logging : fail] ******************************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_logging/tasks/main.yaml:2
Monday 01 June 2020  10:45:54 -0600 (0:00:00.555)       0:00:18.187 *********** 
fatal: [masterc01.testnet.net]: FAILED! => {
    "changed": false, 
    "msg": "Only one Fluentd nodeselector key pair should be provided"

Expected results:
Upgrade completes without failure

Additional info:
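For context, the openshift_logging role fails early when more than one key/value pair is supplied for the Fluentd nodeselector. A minimal sketch of an inventory setting that satisfies that check, assuming the standard openshift_logging_fluentd_nodeselector variable and its default label:

# Hypothetical inventory excerpt: exactly one key/value pair for the Fluentd nodeselector.
[OSEv3:vars]
openshift_logging_install_logging=true
openshift_logging_fluentd_nodeselector={'logging-infra-fluentd': 'true'}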

Comment 2 Jeff Cantrill 2020-06-08 00:10:54 UTC
Please attach your inventory file and the entire log. It looks like the error is telling us what the problem is: "Only one Fluentd nodeselector key pair should be provided"
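
One way to gather part of that information is to dump the nodeSelector the DaemonSet currently carries and which nodes match it; a sketch, assuming the default openshift-logging project and the logging-infra-fluentd label used elsewhere in this bug:

# Show the Fluentd DaemonSet's pod nodeSelector and the nodes carrying the matching label.
oc -n openshift-logging get ds logging-fluentd -o jsonpath='{.spec.template.spec.nodeSelector}'
oc get nodes -l logging-infra-fluentd=true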

Comment 4 Periklis Tsirakidis 2020-07-10 14:26:30 UTC
Moving to UpcomingSprint as unlikely to be resolved by EOS

Comment 6 Jeff Cantrill 2020-08-20 13:39:00 UTC

*** This bug has been marked as a duplicate of bug 1848454 ***

Comment 11 Sergey Yedrikov 2020-09-10 14:36:31 UTC
Hi Jon,

A couple of questions:
Any particular reason you're not relying on playbooks/openshift-logging/config.yml, instead of shutting things down, patching them, etc. manually?
At which point in this workflow did you run the logging dump tool? Could you run it just before you run the Ansible playbook?

Regards,
Sergey.

Comment 13 Jeff Cantrill 2020-09-12 01:58:23 UTC
Moving to UpcomingSprint as unlikely to be addressed by EOD

Comment 15 Sergey Yedrikov 2020-09-22 15:41:43 UTC
Filed a docs issue for 3.11: https://github.com/openshift/openshift-docs/issues/25677, closing this one.

Comment 16 Sergey Yedrikov 2020-10-20 21:33:56 UTC
From the update documentation:

3. Run the openshift-logging/config.yml playbook according to the deploying the EFK stack instructions to complete the logging upgrade. You run the installation playbook for the new OpenShift Container Platform version to upgrade the logging deployment.

so /usr/share/ansible/openshift-ansible/playbooks/openshift-logging/config.yml

So yes, setting the nodeSelector to non-existing: true is what the instructions say to do, and this will cause the DS to terminate the logging-fluentd pods as expected.
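
For completeness, the patch the instructions describe can be written with the same strategic-merge syntax as the restore command further down; a sketch, assuming it is run in the logging project:

# Point the DS at a label no node carries so the logging-fluentd pods terminate.
oc patch ds logging-fluentd -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'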

Then we have to remove the other nodeSelector so that we don't get the error about nodeSelectors and break the update of the EFK stack, so I run the oc command you had me try:
oc patch ds logging-fluentd --type json -p '[{ "op": "remove", "path": "/spec/template/spec/nodeSelector/logging-infra-fluentd" }]'

This works. Then we update the EFK stack:
ansible-playbook -i inventory /usr/share/ansible/openshift-ansible/playbooks/openshift-logging/config.yml
The EFK stack gets updated; however, during that process, as I showed in the last email, it labels all nodes with non-existing: true, and the DS starts spinning up logging-fluentd pods of the correct version.

After the update I remove the patch that was applied and add back the original:
oc patch ds logging-fluentd -p '{"spec": {"template": {"spec": {"nodeSelector": {"logging-infra-fluentd": "true"}}}}}'
oc patch ds logging-fluentd --type json -p '[{ "op": "remove", "path": "/spec/template/spec/nodeSelector/non-existing" }]'
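
To confirm the restore took effect, something like the following can be used; a sketch, assuming the default openshift-logging project and the component=fluentd pod label:

# Check that the DS is back to the original selector and that Fluentd pods are scheduled again.
oc -n openshift-logging get ds logging-fluentd
oc -n openshift-logging get pods -l component=fluentd -o wide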

It all works except that now all nodes have an extra label of non-existing: true. 
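
If that extra label is unwanted, it can be stripped again with oc label; a sketch, where the trailing '-' on the key deletes the label:

# Remove the leftover label from every node that carries it.
oc label nodes -l non-existing=true non-existing-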

I found this out while updating from 3.11.153 to 3.11.248. Then I went to update to 3.11.286, and that is when I saw this: when I went to patch out "logging-infra-fluentd": "true", it did not terminate the logging-fluentd pods, which started me looking into why.

Comment 18 Sergey Yedrikov 2021-01-04 14:51:10 UTC
Docs PR https://github.com/openshift/openshift-docs/pull/27310

