Bug 1421623

| Field | Value |
|---|---|
| Summary: | Diagnostics for a healthy logging system failed via ansible installation |
| Product: | OpenShift Container Platform |
| Component: | Logging |
| Version: | 3.5.0 |
| Status: | CLOSED ERRATA |
| Severity: | low |
| Priority: | low |
| Reporter: | Junqi Zhao <juzhao> |
| Assignee: | Luke Meyer <lmeyer> |
| QA Contact: | Junqi Zhao <juzhao> |
| CC: | aos-bugs, knakayam, lmeyer, nhosoi, orhan.biyiklioglu, rmeggins, smunilla, xtian |
| Target Milestone: | --- |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Type: | Bug |
| Regression: | --- |
| Doc Type: | Bug Fix |
| Fixed In Version: | |
| Last Closed: | 2017-08-10 05:17:28 UTC |

Doc Text:

Cause: The AggregatedLogging diagnostic was not updated to reflect changes made to the logging deployment.
Consequence: The diagnostic incorrectly reported errors for an unnecessary ServiceAccount and (if present) for the mux deployment.
Fix: These errors are no longer reported. In addition, warnings about missing optional components were all downgraded to Info level.
Result: The diagnostic no longer needlessly alarms the user about these issues.
Description (Junqi Zhao, 2017-02-13 09:44:29 UTC)
Same issue with Logging 3.6.0

# oc version
oc v3.6.65
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

# docker images | grep logging
openshift3/logging-kibana          v3.6   dc571aa09d26   8 hours ago   342.4 MB
openshift3/logging-elasticsearch   v3.6   d2709cc1e16a   8 hours ago   404.5 MB
openshift3/logging-fluentd         v3.6   aafaf8787b29   8 hours ago   232.5 MB
openshift3/logging-auth-proxy      v3.6   11f731349ff9   2 days ago    229.6 MB
openshift3/logging-curator         v3.6   028e689a3276   6 days ago    211.1 MB

The problem is still observed in the latest origin/origin-aggregated-logging code.
# oadm diagnostics AggregatedLogging --diaglevel=0
debug: Checking ServiceAccounts in project 'logging'...
ERROR: [AGL0515 from diagnostic AggregatedLogging@openshift/origin/pkg/diagnostics/cluster/aggregated_logging/diagnostic.go:97]
Did not find ServiceAccounts: logging-deployer. The logging infrastructure will not function
properly without them. You may need to re-run the installer.
The problem is likely that the ServiceAccount name for the deployer is "deployer", not "logging-deployer".
# oc get serviceaccounts
NAME SECRETS AGE
aggregated-logging-curator 2 23h
aggregated-logging-elasticsearch 2 23h
aggregated-logging-fluentd 2 23h
aggregated-logging-kibana 2 23h
builder 2 23h
default 2 23h
deployer 2 23h
[origin]
diff --git a/pkg/diagnostics/cluster/aggregated_logging/serviceaccounts.go b/pkg/diagnostics/cluster/aggregated_logging/serviceaccounts.go
index 779ced8..a73e83d 100644
--- a/pkg/diagnostics/cluster/aggregated_logging/serviceaccounts.go
+++ b/pkg/diagnostics/cluster/aggregated_logging/serviceaccounts.go
@@ -8,7 +8,7 @@ import (
"k8s.io/apimachinery/pkg/util/sets"
)
-var serviceAccountNames = sets.NewString("logging-deployer", "aggregated-logging-kibana", "aggregated-logging-curator", "aggregated-logging-elasticsearch", fluentdServiceAccountName)
+var serviceAccountNames = sets.NewString("deployer", "aggregated-logging-kibana", "aggregated-logging-curator", "aggregated-logging-elasticsearch", fluentdServiceAccountName)
const serviceAccountsMissing = `
Did not find ServiceAccounts: %s. The logging infrastructure will not function
There is another error reported for logging-mux in the recent code.
# oadm diagnostics AggregatedLogging --diaglevel=0
[...]
debug: Checking for DeploymentConfigs in project 'logging' with selector 'logging-infra'
[...]
debug: Found DeploymentConfig 'logging-kibana-ops' for component 'kibana-ops'
debug: Found DeploymentConfig 'logging-mux' for component 'mux'
debug: Getting pods that match selector 'component in (curator,curator-ops,es,es-ops,kibana,kibana-ops),provider=openshift'
debug: Checking status of Pod 'logging-curator-1-dm8bf'...
[...]
ERROR: [AGL0095 from diagnostic AggregatedLogging@openshift/origin/pkg/diagnostics/cluster/aggregated_logging/diagnostic.go:97]
There were no Pods found for DeploymentConfig 'logging-mux'. Try running
the following commands for additional information:
$ oc describe dc logging-mux -n logging
$ oc get events -n logging
logging-mux is supposed to be generated from the fluentd daemonset(?)
But listing deploymentconfigs with --selector logging-infra does return logging-mux, as follows.
# oc get dc --selector logging-infra
NAME REVISION DESIRED CURRENT TRIGGERED BY
logging-curator 1 1 1 config
logging-curator-ops 1 1 1 config
logging-es-ok6sold4 1 1 1 config
logging-es-ops-7cmmv0rj 1 1 1 config
logging-kibana 1 1 1 config
logging-kibana-ops 1 1 1 config
logging-mux 1 1 1 config
Should this behaviour be fixed, or could we just add logging-mux to loggingComponents in origin/pkg/diagnostics/cluster/aggregated_logging/deploymentconfigs.go?
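For reference, a rough sketch of what that second option might look like. This is illustrative only, not the actual origin code: componentNameMux is a hypothetical constant, and the label values are inferred from the selector the diagnostic prints above. The rest of the thread discusses whether this is the right fix at all.

package loggingdiagsketch

import "k8s.io/apimachinery/pkg/util/sets"

// Hypothetical sketch only. The label values are inferred from the selector
// 'component in (curator,curator-ops,es,es-ops,kibana,kibana-ops)' shown in
// the diagnostic output; "mux" is the addition this question contemplates.
const (
	componentNameEs         = "es"
	componentNameEsOps      = "es-ops"
	componentNameKibana     = "kibana"
	componentNameKibanaOps  = "kibana-ops"
	componentNameCurator    = "curator"
	componentNameCuratorOps = "curator-ops"
	componentNameMux        = "mux" // hypothetical constant, not in deploymentconfigs.go today
)

// With the extra entry, the pod query would also cover the logging-mux dc.
var loggingComponents = sets.NewString(
	componentNameEs, componentNameEsOps,
	componentNameKibana, componentNameKibanaOps,
	componentNameCurator, componentNameCuratorOps,
	componentNameMux,
)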
In 3.5 and later there is no deployer any more. We should just get rid of all references to logging-deployer or logging-deployment or deployer.
> logging-mux is supposed to be generated from the fluentd daemonset(?)
Not exactly. setup-mux.sh will create a deploymentconfig (dc) based on the fluentd daemonset.
Is there a mux pod running? If so, then "There were no Pods found for DeploymentConfig 'logging-mux'." is correct.
I don't think we should add mux to the origin code.
(In reply to Rich Megginson from comment #4)
> In 3.5 and later there is no deployer any more. We should just get rid of
> all references to logging-deployer or logging-deployment or deployer.
>
> > logging-mux is supposed to be generated from the fluentd daemonset(?)
>
> Not exactly. setup-mux.sh will create a deploymentconfig (dc) based on the
> fluentd daemonset.

Ah, I see.

> Is there a mux pod running?

Yes, it is.

> If so, then "There were no Pods found for
> DeploymentConfig 'logging-mux'." is correct.
>
> I don't think we should add mux to the origin code.

Ok. Now, could there be any way to downgrade this "ERROR" to "INFO" or something less scary?

ERROR: [AGL0095 from diagnostic AggregatedLogging@openshift/origin/pkg/diagnostics/cluster/aggregated_logging/diagnostic.go:97]
       There were no Pods found for DeploymentConfig 'logging-mux'. Try running
       the following commands for additional information:
       $ oc describe dc logging-mux -n logging
       $ oc get events -n logging

Hmm - I would rather know why the code can find the mux dc, but cannot find the mux pod?

(In reply to Rich Megginson from comment #6)
> Hmm - I would rather know why the code can find the mux dc, but cannot find
> the mux pod?

Isn't it because the pods "that match selector 'component in (curator,curator-ops,es,es-ops,kibana,kibana-ops),provider=openshift' are retrieved? The mux pod is not in the list (the line 26 in deploymentconfigs.go below).

debug: Getting pods that match selector 'component in (curator,curator-ops,es,es-ops,kibana,kibana-ops),provider=openshift'
debug: Checking status of Pod 'logging-curator-1-dm8bf'...
debug: Checking status of Pod 'logging-curator-ops-1-bng6s'...
debug: Checking status of Pod 'logging-es-ok6sold4-1-xrx0l'...
debug: Checking status of Pod 'logging-es-ops-7cmmv0rj-1-hfbvv'...
debug: Checking status of Pod 'logging-kibana-1-358kz'...
debug: Checking status of Pod 'logging-kibana-ops-1-sxczs'...
ERROR: [AGL0095 from diagnostic AggregatedLogging@openshift/origin/pkg/diagnostics/cluster/aggregated_logging/diagnostic.go:97]
       There were no Pods found for DeploymentConfig 'logging-mux'.

"deploymentconfigs.go"
25 // loggingComponents are those 'managed' by rep controllers (e.g. fluentd is deployed with a DaemonSet)
26 var loggingComponents = sets.NewString(componentNameEs, componentNameEsOps, componentNameKibana, componentNameKibanaOps, componentNameCurator, componentNameCuratorOps)
27

> Isn't it because the pods "that match selector 'component in (curator,curator-ops,es,es-ops,kibana,kibana-ops),provider=openshift' are retrieved? The mux pod is not in the list (the line 26 in deploymentconfigs.go below).
OK. We need to add that to setup-mux.sh and the mux ansible code.
(In reply to Rich Megginson from comment #8)
> > Isn't it because the pods "that match selector 'component in (curator,curator-ops,es,es-ops,kibana,kibana-ops),provider=openshift' are retrieved? The mux pod is not in the list (the line 26 in deploymentconfigs.go below).
>
> OK. We need to add that to setup-mux.sh and the mux ansible code.

Do you have an idea how it could be done in setup-mux.sh / the mux ansible code? In the current diagnostic code in origin, it looks to me like the selector 'component in (curator,curator-ops,es,es-ops,kibana,kibana-ops),provider=openshift' is hardcoded in "deploymentconfigs.go"...

This is the oc get pods output when that selector is given. (Note: no mux)

$ oc get pods -l 'component in (curator,curator-ops,es,es-ops,kibana,kibana-ops),provider=openshift'
NAME                              READY     STATUS    RESTARTS   AGE
logging-curator-1-dm8bf           1/1       Running   0          2d
logging-curator-ops-1-bng6s       1/1       Running   0          2d
logging-es-ok6sold4-1-xrx0l       1/1       Running   0          2d
logging-es-ops-7cmmv0rj-1-hfbvv   1/1       Running   0          2d
logging-kibana-1-358kz            2/2       Running   7          2d
logging-kibana-ops-1-sxczs        2/2       Running   7          2d

The following selector returns what we want, but I'm not sure whether it is always correct, or whether this change could be made in setup-mux.sh, either...

$ oc get pods -l 'component!=fluentd,provider=openshift'
NAME                              READY     STATUS    RESTARTS   AGE
logging-curator-1-dm8bf           1/1       Running   0          2d
logging-curator-ops-1-bng6s       1/1       Running   0          2d
logging-es-ok6sold4-1-xrx0l       1/1       Running   0          2d
logging-es-ops-7cmmv0rj-1-hfbvv   1/1       Running   0          2d
logging-kibana-1-358kz            2/2       Running   7          2d
logging-kibana-ops-1-sxczs        2/2       Running   7          2d
logging-mux-1-fb981               1/1       Running   0          2d

Ok. This sounds like a bug in the go code. It shouldn't be looking for dcs using --selector logging-infra, then finding all of the pods in that list by using a different selector 'component in (curator,curator-ops,es,es-ops,kibana,kibana-ops),provider=openshift'. I don't know why it can't just do a query "give me all of the pods for dc $dc". It should not have a hard coded selector for the pod query.

Same error on logging 3.6.0:

Did not find ServiceAccounts: logging-deployer. The logging infrastructure will not function
properly without them. You may need to re-run the installer.

PR merged

Tested with the latest openshift-ansible version 3.6.139-1; the issue was not fixed, and the "Did not find ServiceAccounts: logging-deployer" error is still reported. See the attached file.

Created attachment 1295702 [details]
logging diagnostics info
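As an aside on the earlier selector discussion: the per-DC query Rich suggests ("give me all of the pods for dc $dc") could, in principle, reuse each DeploymentConfig's own spec.selector instead of the hardcoded component list. The sketch below is illustrative only and is not the actual origin diagnostic code; it assumes a client-go style clientset with the older, pre-context List signature, and the dcName/dcSelector values would come from the DeploymentConfigs already found via --selector logging-infra.

package loggingdiagsketch

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/kubernetes"
)

// checkPodsForDC sketches the "give me all of the pods for dc $dc" idea:
// the pod query reuses the DeploymentConfig's own spec.selector (passed in
// as dcSelector) instead of a hardcoded 'component in (...)' selector, so a
// dc like logging-mux would be covered without touching the diagnostic.
func checkPodsForDC(kc kubernetes.Interface, project, dcName string, dcSelector map[string]string) error {
	sel := labels.SelectorFromSet(labels.Set(dcSelector)).String()
	pods, err := kc.CoreV1().Pods(project).List(metav1.ListOptions{LabelSelector: sel})
	if err != nil {
		return err
	}
	if len(pods.Items) == 0 {
		// Roughly where an AGL0095-style message would be reported.
		return fmt.Errorf("there were no Pods found for DeploymentConfig '%s'", dcName)
	}
	for _, pod := range pods.Items {
		fmt.Printf("Checking status of Pod '%s'...\n", pod.Name)
	}
	return nil
}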
Output has:

[Note] Summary of diagnostics execution (version v3.6.126.14):

Seems like something bundled an older build of the openshift binaries. It's not clear to me that ose-ansible 3.6.139-1 ended up being an accepted build. It looks like it has the matching version of oc, but somehow you ended up running the 3.6.126.14 client. The fix for this bug isn't on that branch.

In any case 3.6.140-1 is available. Can you retest with that? Just make sure of the client version you end up running. You could even run a newer client against an older cluster if you have one handy. It's the client that runs the diagnostic, so that's the version that matters here.

Verified this issue with the same OCP version and openshift-ansible version; the issue was fixed. For the diagnostics info, see the attached file.

Testing env:

# openshift version
openshift v3.6.140
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

# rpm -qa | grep openshift-ansible
openshift-ansible-callback-plugins-3.6.140-1.git.0.4a02427.el7.noarch
openshift-ansible-playbooks-3.6.140-1.git.0.4a02427.el7.noarch
openshift-ansible-3.6.140-1.git.0.4a02427.el7.noarch
openshift-ansible-lookup-plugins-3.6.140-1.git.0.4a02427.el7.noarch
openshift-ansible-roles-3.6.140-1.git.0.4a02427.el7.noarch
openshift-ansible-docs-3.6.140-1.git.0.4a02427.el7.noarch
openshift-ansible-filter-plugins-3.6.140-1.git.0.4a02427.el7.noarch

Created attachment 1296007 [details]
issue is fixed, logging diagnostics info
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716