Bug 1446504 - Not able to collect any log entries after upgrading to 3.4.1

Product:          OpenShift Container Platform
Component:        Logging
Version:          3.4.1
Status:           CLOSED INSUFFICIENT_DATA
Severity:         medium
Priority:         high
Reporter:         Xia Zhao <xiazhao>
Assignee:         Peter Portante <pportant>
QA Contact:       Xia Zhao <xiazhao>
CC:               aos-bugs, jcantril, pportant, pweil, rmeggins, wabouham, xiazhao
Target Milestone: ---
Target Release:   3.7.0
Hardware:         All
OS:               All
Keywords:         Regression, Reopened
Type:             Bug
Last Closed:      2017-09-13 18:02:28 UTC
Hmm - something was wrong with fluentd before the upgrade?

  + oc delete daemonset logging-fluentd
  error: timed out waiting for the condition

Not sure if this is expected.

  +++ oc get pods -l component=fluentd -o 'jsonpath={.items[?@.status.phase=="Running")].metadata.name}'
  No resources found.

I suppose this is ok if there is no daemonset?

If you run this test again, can you verify that fluentd is running and the daemonset exists beforehand?

  oc get pods | grep fluentd
  oc get daemonset | grep fluentd

(In reply to Rich Megginson from comment #3)
> Hmm - something was wrong with fluentd before the upgrade?
>
> + oc delete daemonset logging-fluentd
> error: timed out waiting for the condition
>
> not sure if this is expected
>
> +++ oc get pods -l component=fluentd -o
> 'jsonpath={.items[?@.status.phase=="Running")].metadata.name}'
> No resources found.
>
> I suppose this is ok if there is no daemonset?
>
> If you run this test again, can you verify that fluentd is running and the
> daemonset exists before?
>
> oc get pods|grep fluentd
> oc get daemonset|grep fluentd

Hi Rich,

Before the upgrade I checked that fluentd was running, and I also logged in to the Kibana UI at the 3.3.1 level and saw log entries there. Steps 3 and 7 in this comment also reflect the fact that fluentd is fine before the upgrade but missing afterwards: https://bugzilla.redhat.com/show_bug.cgi?id=1440855#c14. I'll also double-check this and get back to you later.

Thanks,
Xia

Tested with the latest 3.4.1 deployer; the issue was reproduced.

Images tested with:
  openshift3/logging-deployer   3.4.1   dcee53833a87

  # openshift version
  openshift v3.4.1.18
  kubernetes v1.4.0+776c994
  etcd 3.1.0-rc.0

Before the upgrade I checked that fluentd was running and the fluentd daemonset existed. After the upgrade, the fluentd pod is missing but the daemonset still exists:

  $ oc get po
  NAME                          READY     STATUS      RESTARTS   AGE
  logging-curator-2-e20qw       1/1       Running     0          18h
  logging-deployer-eq3z2        0/1       Completed   0          19h
  logging-deployer-wmegy        0/1       Completed   0          18h
  logging-es-mtnndj71-3-o6byt   1/1       Running     0          18h
  logging-kibana-3-j87ka        2/2       Running     7          18h

  $ oc get daemonset
  NAME              DESIRED   CURRENT   READY   NODE-SELECTOR                                                               AGE
  logging-fluentd   0         0         0       54247ed8-348f-11e7-81e5-4a55ae8fca9f=54247f09-348f-11e7-81e5-4a55ae8fca9f   19h

Attached the fluentd daemonset before and after the upgrade.

Created attachment 1277486 [details]
fluentd_daemonset_before_upgrade
Created attachment 1277487 [details]
fluentd_daemonset_after_upgrade
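For reference, a minimal version of the pre-upgrade check Rich asked for could look like this (a sketch, assuming the default component label and daemonset name; the snapshot filename is hypothetical):

  $ oc get pods -l component=fluentd -o wide
  $ oc get daemonset logging-fluentd
  $ oc get nodes -L logging-infra-fluentd
  $ oc get daemonset logging-fluentd -o yaml > fluentd-ds-before-upgrade.yaml   # snapshot to compare against after the upgrade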
Did the node labels change? If you do

  oc label nodes --all logging-infra-fluentd=true

does fluentd start?

OK, here is the problem: attachment 1277487 [details]
fluentd_daemonset_after_upgrade
nodeSelector:
54247ed8-348f-11e7-81e5-4a55ae8fca9f: 54247f09-348f-11e7-81e5-4a55ae8fca9f
upgrade clobbered the nodeSelector :-(
This is what it was before the upgrade:
nodeSelector:
logging-infra-fluentd: "true"
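A quick way to read the selector the daemonset is actually carrying, without dumping the whole object (a sketch; assumes the daemonset name shown above):

  $ oc get daemonset logging-fluentd -o jsonpath='{.spec.template.spec.nodeSelector}'
  # on the broken cluster this shows the UUID-style selector instead of logging-infra-fluentd: "true"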
The workaround is to `oc edit daemonset logging-fluentd` and fix the nodeSelector.
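If an interactive edit is inconvenient, the same fix can probably be applied non-interactively; a sketch, not verified against a 3.4 cluster:

  $ oc patch daemonset logging-fluentd --type=json \
      -p '[{"op":"replace","path":"/spec/template/spec/nodeSelector","value":{"logging-infra-fluentd":"true"}}]'
  # a JSON-patch "replace" is used so the stale UUID-style key is removed rather than merged with the correct one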
How was that change introduced? Ansible? Something else? Should we close this issue with a note to check the DS, or do we need to find the root cause?

(In reply to Jeff Cantrill from comment #11)
> How was that change introduced? Ansible? Something else? Should we close
> this issue with a note to check the DS or do we need to find root cause?

We need to find the root cause. I don't think Xia used ansible to upgrade - 3.4 uses the deployer pod with MODE=upgrade - Xia, please confirm.

Looking at the upgrade code, it isn't obvious to me where the nodeSelector would be changed.

(In reply to Rich Megginson from comment #12)
> We need to find the root cause. I don't think Xia used ansible to upgrade -
> 3.4 uses the deployer pod with MODE=upgrade - Xia, please confirm.

Yes, I used the deployer pod with MODE=upgrade, not ansible.

> Looking at the upgrade code, it isn't obvious to me where the nodeselector
> would be changed.

Is there any way you can tell from which version to which version you upgraded? That would help us narrow down the problem.

Hi Jeff,

Here are the detailed steps (copied from comment #0); please let me know if there is anything I can assist with:

Steps to Reproduce:
1. Install an OpenShift 3.4.1 env and deploy the logging 3.3.1 stacks onto it (bind Elasticsearch to persistent storage)
2. Create some user projects and wait for log entries to appear in Kibana
3. Upgrade the logging stacks from the 3.3.1 level to the 3.4.1 level
4. Visit Kibana after the upgrade

Thanks,
Xia

I was referring to the version, such as 3.4.1-20. We can set up a 3.3.1 cluster from which to start, but there have been many changes in the 3.4.1 release stream which may have resolved your issue. Are you able to provide more fine-grained information regarding the upgrade issue?

Understood -- I noticed the deployer image in comment #0 somehow did not exist on the brew registry when I curled it with this command:

  # curl -X GET -k brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/v1/repositories/openshift3/logging-deployer/tags | python -mjson.tool | grep 7ca4

Let me retry with the latest one, 3.4.0-21, whose image id is 7a004ab105c -- will update you soon.

Upgraded to logging-deployer v3.4.1.34-1; I have not been able to look into the original problem because the upgrade failed earlier:

  # oc get po
  NAME                          READY     STATUS      RESTARTS   AGE
  logging-deployer-5wwr8        0/1       Error       0          4h
  logging-deployer-f31ae        0/1       Completed   0          4h
  logging-es-8ei2lcah-3-55kvc   1/1       Running     0          4h

It failed with this error:

  Unable to find log message from cluster.service from pod logging-es-8ei2lcah-3-55kvc within 300 seconds

I'll retest to see if this can be reproduced.

Reproduced the blocking issue in comment #18; created https://bugzilla.redhat.com/show_bug.cgi?id=1461294 to track it. The needinfo work has to be blocked here.

The workaround is to relabel the node selector for fluentd.

Hi Jeff,

Relabeling the node is not the answer. As I mentioned in comment #9, the following message given by the upgrade pod's log at the end is not true:

  "Note: if your previous deployment used a DaemonSet for Fluentd, there should be no additional actions to deploy your pods -- the deployer did not unlabel any nodes. Upgrade complete!"

Here is the situation I encountered when verifying https://bugzilla.redhat.com/show_bug.cgi?id=1461294. Before the upgrade, fluentd was running fine:

  $ oc get po
  NAME                          READY     STATUS      RESTARTS   AGE
  logging-curator-1-9a34s       1/1       Running     0          5m
  logging-deployer-yb0dt        0/1       Completed   0          5m
  logging-es-qjyixjac-1-sa5kz   1/1       Running     0          5m
  logging-fluentd-5vae8         1/1       Running     0          5m
  logging-kibana-1-i6dtj        2/2       Running     2          5m

After the upgrade, the fluentd pod is missing even though the node labels were not changed:

  $ oc get po
  NAME                          READY     STATUS      RESTARTS   AGE
  logging-curator-2-3wk7c       1/1       Running     0          35m
  logging-deployer-85sda        0/1       Completed   0          43m
  logging-deployer-yb0dt        0/1       Completed   0          53m
  logging-es-qjyixjac-3-kb1zf   1/1       Running     0          36m
  logging-kibana-3-jqbs9        2/2       Running     0          35m

Checked on the master that I do have a node labeled with logging-infra-fluentd=true.

Images tested with:
  logging-deployer   3.4.1   3cfbb48d63f0   5 days ago   855.8 MB

  # openshift version
  openshift v3.4.1.44
  kubernetes v1.4.0+776c994
  etcd 3.1.0-rc.0

Created attachment 1292801 [details]
fluentd_daemonset_after_upgrade_Jun29_latest
There is something missing from the details of what is happening with this cluster. The log references the wrong version of fluentd if this is an upgrade to 3.4:

  image: registry.ops.openshift.com/openshift3/logging-fluentd:3.3.1

* Can you provide more details of the environment: https://github.com/openshift/origin-aggregated-logging/blob/master/docs/issues.md
* Have you tried 'oc describe ds logging-fluentd' to see if there are any hints there?

Created attachment 1297447 [details]
new upgrade log on July 13, 2017
Created attachment 1297448 [details]
output of command $ oc describe ds logging-fluentd
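When reading that describe output, the interesting comparison is the daemonset's node selector versus the labels actually present on the nodes (oc get nodes --show-labels), plus any scheduling events; a quick cross-check could look like this (a sketch, assuming default names):

  $ oc describe ds logging-fluentd | grep -i selector
  $ oc get events | grep -i fluentd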
The test env in comment #28 was recycled.
Created attachment 1274842 [details]
Upgrade log

Description of problem:
Not able to collect any log entries after upgrading to 3.4.1. The upgrade pod completed successfully, but the fluentd pod is not running.

Version-Release number of selected component (if applicable):
The latest deployer image on the brew registry:
  openshift3/logging-deployer   3.4.1   7ca4d5f34949   36 hours ago   857.5 MB

  # openshift version
  openshift v3.4.1.18
  kubernetes v1.4.0+776c994
  etcd 3.1.0-rc.0

How reproducible:
Always

Steps to Reproduce:
1. Install an OpenShift 3.4.1 env and deploy the logging 3.3.1 stacks onto it (bind Elasticsearch to persistent storage)
2. Create some user projects and wait for log entries to appear in Kibana
3. Upgrade the logging stacks from the 3.3.1 level to the 3.4.1 level
4. Visit Kibana after the upgrade

Actual results:
3. The upgrade pod completed successfully, but the fluentd pod is not running
4. Not able to collect any log entries after upgrading to 3.4.1

Expected results:
3. The fluentd pod should be running
4. Should be able to collect log entries after upgrading to 3.4.1

Additional info:
Upgrade log attached
Test env provided
This is a regression; it used to work fine previously.
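One way to make the next reproduction more diagnostic would be to snapshot the fluentd daemonset around step 3 and diff the results (a sketch; filenames are hypothetical):

  $ oc get ds logging-fluentd -o yaml > ds-before-upgrade.yaml    # before running the deployer with MODE=upgrade
  # (run the logging upgrade here)
  $ oc get ds logging-fluentd -o yaml > ds-after-upgrade.yaml
  $ diff ds-before-upgrade.yaml ds-after-upgrade.yaml             # a clobbered nodeSelector should stand out here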