Bug 1446504

Summary: Not able to collect any log entries after upgrading to 3.4.1
Product: OpenShift Container Platform
Reporter: Xia Zhao <xiazhao>
Component: Logging
Assignee: Peter Portante <pportant>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Xia Zhao <xiazhao>
Severity: medium
Priority: high
Version: 3.4.1
CC: aos-bugs, jcantril, pportant, pweil, rmeggins, wabouham, xiazhao
Target Milestone: ---
Keywords: Regression, Reopened
Target Release: 3.7.0
Hardware: All
OS: All
Type: Bug
Last Closed: 2017-09-13 18:02:28 UTC
Attachments:
Upgrade log
fluentd_daemonset_before_upgrade
fluentd_daemonset_after_upgrade
fluentd_daemonset_after_upgrade_Jun29_latest
new upgrade log on July 13, 2017
output of command $ oc describe ds logging-fluentd

Description Xia Zhao 2017-04-28 08:56:03 UTC
Created attachment 1274842 [details]
Upgrade log

Description of problem:
Not able to collect any log entries after upgrading to 3.4.1. The upgrade pod completed successfully, but the fluentd pod is not running.

Version-Release number of selected component (if applicable):
the latest deployer image on brew registry:
openshift3/logging-deployer        3.4.1               7ca4d5f34949        36 hours ago        857.5 MB

# openshift version
openshift v3.4.1.18
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

How reproducible:
Always

Steps to Reproduce:
1. Install an OpenShift 3.4.1 env and deploy the logging stacks at the 3.3.1 level onto it (with Elasticsearch bound to persistent storage)
2. Create some user projects and wait for log entries to appear in Kibana
3. Upgrade the logging stacks from the 3.3.1 level to the 3.4.1 level
4. Visit Kibana after the upgrade

Actual results:
3. The upgrade pod completed successfully, but the fluentd pod is not running
4. Not able to collect any log entries after upgrading to 3.4.1

Expected results:
3. The fluentd pod should be running
4. Should be able to collect log entries after upgrading to 3.4.1

Additional info:
Upgrade log attached
Test env provided
This is a regression; it used to work fine previously

Comment 3 Rich Megginson 2017-05-05 15:14:59 UTC
Hmm - something was wrong with fluentd before the upgrade?

+ oc delete daemonset logging-fluentd
error: timed out waiting for the condition

not sure if this is expected

+++ oc get pods -l component=fluentd -o 'jsonpath={.items[?@.status.phase=="Running")].metadata.name}'
No resources found.

I suppose this is ok if there is no daemonset?

If you run this test again, can you verify that fluentd is running and the daemonset exists before?

oc get pods|grep fluentd
oc get daemonset|grep fluentd

Comment 4 Xia Zhao 2017-05-09 02:57:20 UTC
(In reply to Rich Megginson from comment #3)
> Hmm - something was wrong with fluentd before the upgrade?
> 
> + oc delete daemonset logging-fluentd
> error: timed out waiting for the condition
> 
> not sure if this is expected
> 
> +++ oc get pods -l component=fluentd -o
> 'jsonpath={.items[?@.status.phase=="Running")].metadata.name}'
> No resources found.
> 
> I suppose this is ok if there is no daemonset?
> 
> If you run this test again, can you verify that fluentd is running and the
> daemonset exists before?
> 
> oc get pods|grep fluentd
> oc get daemonset|grep fluentd

Hi Rich,

Before the upgrade, I checked that fluentd was running, and I also logged in to the Kibana UI at the 3.3.1 level and saw some log entries there.

Step 3 and Step 7 in this comment also reflect the fact that fluentd was fine before the upgrade but missing after the upgrade: https://bugzilla.redhat.com/show_bug.cgi?id=1440855#c14.

I'll also double check this and get back to you later.

Thanks,
Xia

Comment 5 Xia Zhao 2017-05-10 02:56:32 UTC
Tested with the latest 3.4.1 deployer; the issue was reproduced:

Images tested with:
openshift3/logging-deployer        3.4.1               dcee53833a87

# openshift version
openshift v3.4.1.18
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

Before the upgrade, I checked that fluentd was running and the fluentd daemonset existed.

After the upgrade, fluentd is missing but the daemonset still exists:

$ oc get po
NAME                          READY     STATUS      RESTARTS   AGE
logging-curator-2-e20qw       1/1       Running     0          18h
logging-deployer-eq3z2        0/1       Completed   0          19h
logging-deployer-wmegy        0/1       Completed   0          18h
logging-es-mtnndj71-3-o6byt   1/1       Running     0          18h
logging-kibana-3-j87ka        2/2       Running     7          18h

$ oc get daemonset
NAME              DESIRED   CURRENT   READY     NODE-SELECTOR                                                               AGE
logging-fluentd   0         0         0         54247ed8-348f-11e7-81e5-4a55ae8fca9f=54247f09-348f-11e7-81e5-4a55ae8fca9f   19h

Attached the fluentd daemonset before and after the upgrade.

Comment 6 Xia Zhao 2017-05-10 02:58:43 UTC
Created attachment 1277486 [details]
fluentd_daemonset_before_upgrade

Comment 7 Xia Zhao 2017-05-10 02:59:10 UTC
Created attachment 1277487 [details]
fluentd_daemonset_after_upgrade

Comment 8 Rich Megginson 2017-05-10 03:11:20 UTC
Did the node labels change?

If you do

oc label nodes --all logging-infra-fluentd=true

does fluentd start?
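
If the label is present but fluentd still does not start, the daemonset's own nodeSelector is worth comparing against the node labels. A minimal sketch of that check (assumed commands, not quoted from the original report):

# nodes that carry the fluentd label
oc get nodes -l logging-infra-fluentd=true
# selector the daemonset actually schedules against
oc get ds logging-fluentd -o jsonpath='{.spec.template.spec.nodeSelector}'
# whether any fluentd pods were scheduled at all
oc get pods -l component=fluentd -o wide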

Comment 10 Rich Megginson 2017-05-11 17:38:46 UTC
ok, here is the problem: attachment 1277487 [details]
fluentd_daemonset_after_upgrade

      nodeSelector:
        54247ed8-348f-11e7-81e5-4a55ae8fca9f: 54247f09-348f-11e7-81e5-4a55ae8fca9f

upgrade clobbered the nodeSelector :-(

This is what it was before the upgrade:

      nodeSelector:
        logging-infra-fluentd: "true"

The workaround is to `oc edit daemonset logging-fluentd` and fix the nodeSelector.
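
A minimal sketch of that fix, assuming the original selector was logging-infra-fluentd: "true" as shown above (the bogus UUID key also has to be removed in the same edit, otherwise the pods still cannot be scheduled):

# interactive: restore .spec.template.spec.nodeSelector to
#   nodeSelector:
#     logging-infra-fluentd: "true"
# and delete the UUID key
oc edit daemonset logging-fluentd

# non-interactive alternative (strategic merge patch); this only adds the
# expected key, so the UUID key must still be removed by hand
oc patch daemonset logging-fluentd \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"logging-infra-fluentd":"true"}}}}}'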

Comment 11 Jeff Cantrill 2017-05-11 17:42:03 UTC
How was that change introduced?  Ansible?  Something else?  Should we close this issue with a note to check the DS or do we need to find root cause?

Comment 12 Rich Megginson 2017-05-11 17:52:17 UTC
(In reply to Jeff Cantrill from comment #11)
> How was that change introduced?  Ansible?  Something else?  Should we close
> this issue with a note to check the DS or do we need to find root cause?

We need to find the root cause.  I don't think Xia used ansible to upgrade - 3.4 uses the deployer pod with MODE=upgrade - Xia, please confirm.

Looking at the upgrade code, it isn't obvious to me where the nodeselector would be changed.

Comment 13 Xia Zhao 2017-05-12 06:07:18 UTC
(In reply to Rich Megginson from comment #12)
> (In reply to Jeff Cantrill from comment #11)
> > How was that change introduced?  Ansible?  Something else?  Should we close
> > this issue with a note to check the DS or do we need to find root cause?
> 
> We need to find the root cause.  I don't think Xia used ansible to upgrade -
> 3.4 uses the deployer pod with MODE=upgrade - Xia, please confirm.

Yes, I used the deployer pod with MODE=upgrade, not Ansible.
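
For reference, a minimal sketch of how such a deployer-driven upgrade is typically kicked off (the template and parameter names other than MODE are assumptions and may differ between releases):

# instantiate the logging deployer template in upgrade mode
# (IMAGE_VERSION is an assumed parameter for selecting the 3.4.1 images)
oc new-app logging-deployer-template \
  --param MODE=upgrade \
  --param IMAGE_VERSION=3.4.1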

> 
> Looking at the upgrade code, it isn't obvious to me where the nodeselector
> would be changed.

Comment 14 Jeff Cantrill 2017-05-22 17:36:42 UTC
Is there any way you are able to tell from which version to which version you upgraded? This would help us narrow down the problem.

Comment 15 Xia Zhao 2017-05-25 08:37:21 UTC
Hi Jeff,

Here are the detailed steps (copied from comment #0); please let me know if there is anything I can assist with:

Steps to Reproduce:
1. Install an OpenShift 3.4.1 env and deploy the logging stacks at the 3.3.1 level onto it (with Elasticsearch bound to persistent storage)
2. Create some user projects and wait for log entries to appear in Kibana
3. Upgrade the logging stacks from the 3.3.1 level to the 3.4.1 level
4. Visit Kibana after the upgrade

Thanks,
Xia

Comment 16 Jeff Cantrill 2017-06-02 21:22:11 UTC
I was referring to the exact version, such as 3.4.1-20.  We can set up a 3.3.1 cluster from which to start, but there have been many changes in the 3.4.1 release stream which may have resolved your issue.  Are you able to provide more fine-grained information regarding the upgrade issue?

Comment 17 Xia Zhao 2017-06-08 09:59:58 UTC
Understood -- I noticed the deployer image in comment #0 somehow did not exist on the brew registry when I curled it with this command:

# curl -X GET -k  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/v1/repositories/openshift3/logging-deployer/tags  | python -mjson.tool | grep 7ca4

Let me retry it with the latest one, 3.4.0-21, whose image id is 7a004ab105c -- I will update you soon.

Comment 18 Xia Zhao 2017-06-13 14:24:12 UTC
Upgraded to logging-deployer v3.4.1.34-1, but I have not been able to look into the original problem since the upgrade failed earlier:

# oc get po
NAME                          READY     STATUS      RESTARTS   AGE
logging-deployer-5wwr8        0/1       Error       0          4h
logging-deployer-f31ae        0/1       Completed   0          4h
logging-es-8ei2lcah-3-55kvc   1/1       Running     0          4h

It failed with this error:
Unable to find log message from cluster.service from pod logging-es-8ei2lcah-3-55kvc within 300 seconds

I'll retest to see if this can be reproduced.

Comment 20 Xia Zhao 2017-06-14 07:22:51 UTC
Reproduced the blocking issue in comment #18 and created https://bugzilla.redhat.com/show_bug.cgi?id=1461294 to track it. The needinfo work has to be blocked here.

Comment 21 Jeff Cantrill 2017-06-26 15:19:42 UTC
The workaround is to relabel the node selector for fluentd.

Comment 22 Xia Zhao 2017-06-29 07:24:24 UTC
Hi Jeff, 

Relabeling the node is not the issue here. According to what I mentioned in comment #9, the following message given at the end of the upgrade pod's log is not true:

"
Note: if your previous deployment used a DaemonSet for Fluentd, there should be
no additional actions to deploy your pods -- the deployer did not unlabel any nodes.
Upgrade complete!
"

Here is the situation I encountered when verifying https://bugzilla.redhat.com/show_bug.cgi?id=1461294:

Before the upgrade, fluentd was running fine:
$ oc get po
NAME                          READY     STATUS      RESTARTS   AGE
logging-curator-1-9a34s       1/1       Running     0          5m
logging-deployer-yb0dt        0/1       Completed   0          5m
logging-es-qjyixjac-1-sa5kz   1/1       Running     0          5m
logging-fluentd-5vae8         1/1       Running     0          5m
logging-kibana-1-i6dtj        2/2       Running     2          5m

After the upgrade, the fluentd pod is missing even though the node labels were not changed:

$ oc get po
NAME                          READY     STATUS      RESTARTS   AGE
logging-curator-2-3wk7c       1/1       Running     0          35m
logging-deployer-85sda        0/1       Completed   0          43m
logging-deployer-yb0dt        0/1       Completed   0          53m
logging-es-qjyixjac-3-kb1zf   1/1       Running     0          36m
logging-kibana-3-jqbs9        2/2       Running     0          35m

I checked on the master that I did have a node labeled with logging-infra-fluentd=true.

Images tested with:
logging-deployer        3.4.1               3cfbb48d63f0        5 days ago          855.8 MB

# openshift version
openshift v3.4.1.44
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

Comment 23 Xia Zhao 2017-06-29 07:31:30 UTC
Created attachment 1292801 [details]
fluentd_daemonset_after_upgrade_Jun29_latest

Comment 24 Jeff Cantrill 2017-06-29 13:20:58 UTC
There is something missing from the details of what is happening with this cluster.  The log is referencing the wrong version of fluentd if this is an upgrade to 3.4:  'image: registry.ops.openshift.com/openshift3/logging-fluentd:3.3.1'  

* Can you provide more details of the environment: https://github.com/openshift/origin-aggregated-logging/blob/master/docs/issues.md

* Have you tried 'oc describe ds logging-fluentd' to see if maybe there are some hints there?
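
A minimal sketch of commands that usually capture the state asked for above (an assumed list, not quoted from issues.md):

# full daemonset spec, including nodeSelector and the fluentd image tag
oc get ds logging-fluentd -o yaml
# scheduling events and hints for the daemonset
oc describe ds logging-fluentd
# confirm which nodes carry logging-infra-fluentd=true
oc get nodes --show-labels
# recent events in the logging project (assuming it is named "logging")
oc get events -n logging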

Comment 26 Xia Zhao 2017-07-13 08:05:59 UTC
Created attachment 1297447 [details]
new upgrade log on July 13, 2017

Comment 27 Xia Zhao 2017-07-13 08:07:56 UTC
Created attachment 1297448 [details]
output of command $ oc describe ds logging-fluentd

Comment 29 Xia Zhao 2017-07-24 02:41:14 UTC
The test env in comment #28 was recycled.