Bug 1359771
| Summary: | False error: oadm diagnostics reports emptyDir error when using S3 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ryan Cook <rcook> |
| Component: | Node | Assignee: | Luke Meyer <lmeyer> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | low | Docs Contact: | |
| Priority: | low | | |
| Version: | 3.2.1 | CC: | aos-bugs, erich, jokerman, lmeyer, mmccomas, tdawson |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | (see below) | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-09-27 09:41:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Doc Text:

Cause: Diagnostics reported an error whenever the registry pod was not backed by a persistent storage volume, without considering alternative storage methods.

Consequence: If the registry had been reconfigured to use S3 as storage, diagnostics reported an error anyway.

Fix: The diagnostic now checks whether the registry configuration has been customized and, if so, does not report the error; the administrator who did the configuration is assumed to know what they are doing (a rough sketch of this check follows immediately below).

Result: No more false alerts on an S3-configured registry.
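To make the Fix entry above concrete, here is a minimal sketch of the check it describes. This is not the code from the referenced pull request; the types and the two heuristics used (a REGISTRY_CONFIGURATION_PATH environment variable or a secret-mounted config file) are simplified assumptions about how a customized registry configuration might be detected.

```go
package main

import "fmt"

// Simplified stand-ins for the fields the real diagnostic inspects on the
// docker-registry deployment; these are NOT the actual openshift/origin types.
type Volume struct {
	Name     string
	EmptyDir bool   // true if the volume is an emptyDir
	Secret   string // name of a mounted secret, "" if none
}

type EnvVar struct {
	Name, Value string
}

type RegistryPodSpec struct {
	Volumes []Volume
	Env     []EnvVar
}

// configCustomized guesses whether an admin has replaced the default registry
// configuration (for example to use S3 storage). The two heuristics here are
// assumptions for illustration, not the exact checks in the real diagnostic.
func configCustomized(spec RegistryPodSpec) bool {
	for _, e := range spec.Env {
		if e.Name == "REGISTRY_CONFIGURATION_PATH" && e.Value != "" {
			return true
		}
	}
	for _, v := range spec.Volumes {
		if v.Secret != "" {
			return true // a secret-mounted config.yml usually means customization
		}
	}
	return false
}

// checkRegistryStorage mirrors the reported behavior: only complain about
// emptyDir-backed storage when no customization is detected.
func checkRegistryStorage(spec RegistryPodSpec) {
	if configCustomized(spec) {
		fmt.Println("Info: registry config appears customized; skipping storage check")
		return
	}
	for _, v := range spec.Volumes {
		if v.EmptyDir {
			fmt.Printf("ERROR: registry volume %q is an emptyDir; images are lost on restart\n", v.Name)
		}
	}
}

func main() {
	// An S3-backed registry typically still has an emptyDir volume in the pod,
	// but also carries a custom config; the check should stay quiet.
	s3Registry := RegistryPodSpec{
		Volumes: []Volume{
			{Name: "registry-storage", EmptyDir: true},
			{Name: "registry-config", Secret: "registry-config"},
		},
		Env: []EnvVar{{Name: "REGISTRY_CONFIGURATION_PATH", Value: "/etc/registryconfig/config.yml"}},
	}
	checkRegistryStorage(s3Registry) // prints only the "skipping" info line
}
```

The point is only the control flow: once a customization is detected, the diagnostic assumes the admin knows what they are doing and stays quiet instead of reporting the emptyDir error.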
Description
Ryan Cook
2016-07-25 12:29:46 UTC
This occurs with the registry scaled up and using backing S3 storage. The diagnostic should check for the presence of a user-replaced config, as described under https://docs.openshift.com/enterprise/3.2/install_config/install/docker_registry.html#storage-for-the-registry, and assume the user knows what they're doing if one is present (a sketch of such a config appears after the diagnostics output below).

I have a PR at https://github.com/openshift/origin/pull/10313

Commit pushed to master at https://github.com/openshift/origin
https://github.com/openshift/origin/commit/d7d58deba96d942f244490509b3b933ffe5659c5
diagnostics: fix bug 1359771

Confirmed with ami:devenv-rhel7_4801, the bug has been fixed:
openshift version
openshift v1.3.0-alpha.3+e1e7edb
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git
oadm diagnostics --config=openshift.local.config/master/admin.kubeconfig
[Note] Determining if client configuration exists for client/cluster diagnostics
Info: Successfully read a client config file at 'openshift.local.config/master/admin.kubeconfig'
Info: Successfully read a client config file at '/openshift.local.config/master/admin.kubeconfig'
Info: Using context for cluster-admin access: 'default/172-18-8-237:8443/system:admin'
[Note] Running diagnostic: ConfigContexts[default/172-18-8-237:8443/system:admin]
Description: Validate client config context is complete and has connectivity
Info: The current client config context is 'default/172-18-8-237:8443/system:admin':
The server URL is 'https://172.18.8.237:8443'
The user authentication is 'system:admin/172-18-8-237:8443'
The current project is 'default'
Successfully requested project list; has access to project(s):
[default kube-system openshift openshift-infra test zhouy]
[Note] Running diagnostic: ConfigContexts[default/ec2-54-196-94-236-compute-1-amazonaws-com:8443/system:admin]
Description: Validate client config context is complete and has connectivity
Info: For client config context 'default/ec2-54-196-94-236-compute-1-amazonaws-com:8443/system:admin':
The server URL is 'https://ec2-54-196-94-236.compute-1.amazonaws.com:8443'
The user authentication is 'system:admin/172-18-8-237:8443'
The current project is 'default'
Successfully requested project list; has access to project(s):
[kube-system openshift openshift-infra test zhouy default]
[Note] Running diagnostic: DiagnosticPod
Description: Create a pod to run diagnostics from the application standpoint
WARN: [DCli2006 from diagnostic DiagnosticPod@openshift/origin/pkg/diagnostics/client/run_diagnostics_pod.go:134]
Timed out preparing diagnostic pod logs for streaming, so this diagnostic cannot run.
It is likely that the image 'openshift/origin-deployer:v1.3.0-alpha.3' was not pulled and running yet.
Last error: (*errors.StatusError[2]) container "pod-diagnostics" in pod "pod-diagnostic-test-jxd7s" is waiting to start: ContainerCreating
[Note] Running diagnostic: ClusterRegistry
Description: Check that there is a working Docker registry
[Note] Running diagnostic: ClusterRoleBindings
Description: Check that the default ClusterRoleBindings are present and contain the expected subjects
[Note] Running diagnostic: ClusterRoles
Description: Check that the default ClusterRoles are present and contain the expected permissions
[Note] Running diagnostic: ClusterRouterName
Description: Check there is a working router
WARN: [DClu2001 from diagnostic ClusterRouter@openshift/origin/pkg/diagnostics/cluster/router.go:129]
There is no "router" DeploymentConfig. The router may have been named
something different, in which case this warning may be ignored.
A router is not strictly required; however it is needed for accessing
pods from external networks and its absence likely indicates an incomplete
installation of the cluster.
Use the 'oadm router' command to create a router.
[Note] Running diagnostic: MasterNode
Description: Check if master is also running node (for Open vSwitch)
Info: Found a node with same IP as master: ip-172-18-8-237.ec2.internal
[Note] Skipping diagnostic: MetricsApiProxy
Description: Check the integrated heapster metrics can be reached via the API proxy
Because: The heapster service does not exist in the openshift-infra project at this time,
so it is not available for the Horizontal Pod Autoscaler to use as a source of metrics.
[Note] Running diagnostic: NodeDefinitions
Description: Check node records on master
[Note] Skipping diagnostic: ServiceExternalIPs
Description: Check for existing services with ExternalIPs that are disallowed by master config
Because: No master config file was detected
[Note] Summary of diagnostics execution (version v1.3.0-alpha.3+e1e7edb):
[Note] Warnings seen: 2
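For context on what the "user-replaced config" mentioned in the description above looks like: the linked documentation has the registry read a custom config.yml (mounted into the registry pod) instead of its default configuration. Below is a rough sketch of such a file with S3 storage; the keys follow the docker/distribution S3 storage driver, every value is a placeholder, and the exact options for a given release should be taken from the linked docs rather than from this sketch.

```yaml
version: 0.1
log:
  level: info
http:
  addr: :5000
storage:
  cache:
    blobdescriptor: inmemory
  delete:
    enabled: true
  s3:
    accesskey: <AWS_ACCESS_KEY_ID>      # placeholder
    secretkey: <AWS_SECRET_ACCESS_KEY>  # placeholder
    region: us-east-1                   # placeholder
    bucket: <registry-bucket>           # placeholder
    encrypt: true
    secure: true
    rootdirectory: /registry
auth:
  openshift:
    realm: openshift
middleware:
  repository:
    - name: openshift
```

With a configuration like this in place, the registry's emptyDir volume no longer matters for image persistence, which is exactly why the old diagnostic's error was a false alarm.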
Moving to MODIFIED for enterprise to manage. This has been merged into ose and is in OSE v3.3.0.28 or newer.

Confirmed with the latest OCP, the issue has been fixed:
openshift version
openshift v3.3.0.28
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git
[root@ip-172-18-10-128 ~]# oadm diagnostics
[Note] Determining if client configuration exists for client/cluster diagnostics
Info: Successfully read a client config file at '/root/.kube/config'
Info: Using context for cluster-admin access: 'default/ip-172-18-10-128-ec2-internal:8443/system:admin'
[Note] Performing systemd discovery
[Note] Running diagnostic: ConfigContexts[default/ec2-54-161-124-51-compute-1-amazonaws-com:8443/system:admin]
Description: Validate client config context is complete and has connectivity
Info: For client config context 'default/ec2-54-161-124-51-compute-1-amazonaws-com:8443/system:admin':
The server URL is 'https://ec2-54-161-124-51.compute-1.amazonaws.com:8443'
The user authentication is 'system:admin/ip-172-18-10-128-ec2-internal:8443'
The current project is 'default'
Successfully requested project list; has access to project(s):
[logging management-infra openshift openshift-infra default install-test kube-system]
[Note] Running diagnostic: ConfigContexts[default/ip-172-18-10-128-ec2-internal:8443/system:admin]
Description: Validate client config context is complete and has connectivity
Info: The current client config context is 'default/ip-172-18-10-128-ec2-internal:8443/system:admin':
The server URL is 'https://ip-172-18-10-128.ec2.internal:8443'
The user authentication is 'system:admin/ip-172-18-10-128-ec2-internal:8443'
The current project is 'default'
Successfully requested project list; has access to project(s):
[management-infra openshift openshift-infra default install-test kube-system logging]
[Note] Running diagnostic: DiagnosticPod
Description: Create a pod to run diagnostics from the application standpoint
Info: Output from the diagnostic pod (image openshift3/ose-deployer:v3.3.0.28):
[Note] Running diagnostic: PodCheckAuth
Description: Check that service account credentials authenticate as expected
Info: Service account token successfully authenticated to master
Info: Service account token was authenticated by the integrated registry.
[Note] Running diagnostic: PodCheckDns
Description: Check that DNS within a pod works as expected
[Note] Summary of diagnostics execution (version v3.3.0.28):
[Note] Completed with no errors or warnings seen.
[Note] Running diagnostic: ClusterRegistry
Description: Check that there is a working Docker registry
[Note] Running diagnostic: ClusterRoleBindings
Description: Check that the default ClusterRoleBindings are present and contain the expected subjects
Info: clusterrolebinding/cluster-readers has more subjects than expected.
Use the `oadm policy reconcile-cluster-role-bindings` command to update the role binding to remove extra subjects.
Info: clusterrolebinding/cluster-readers has extra subject {ServiceAccount management-infra management-admin }.
[Note] Running diagnostic: ClusterRoles
Description: Check that the default ClusterRoles are present and contain the expected permissions
[Note] Running diagnostic: ClusterRouterName
Description: Check there is a working router
[Note] Running diagnostic: MasterNode
Description: Check if master is also running node (for Open vSwitch)
Info: Found a node with same IP as master: ip-172-18-10-128.ec2.internal
[Note] Skipping diagnostic: MetricsApiProxy
Description: Check the integrated heapster metrics can be reached via the API proxy
Because: The heapster service does not exist in the openshift-infra project at this time,
so it is not available for the Horizontal Pod Autoscaler to use as a source of metrics.
[Note] Running diagnostic: NodeDefinitions
Description: Check node records on master
WARN: [DClu0003 from diagnostic NodeDefinition@openshift/origin/pkg/diagnostics/cluster/node_definitions.go:112]
Node ip-172-18-10-128.ec2.internal is ready but is marked Unschedulable.
This is usually set manually for administrative reasons.
An administrator can mark the node schedulable with:
oadm manage-node ip-172-18-10-128.ec2.internal --schedulable=true
While in this state, pods should not be scheduled to deploy on the node.
Existing pods will continue to run until completed or evacuated (see
other options for 'oadm manage-node').
[Note] Running diagnostic: ServiceExternalIPs
Description: Check for existing services with ExternalIPs that are disallowed by master config
[Note] Running diagnostic: AnalyzeLogs
Description: Check for recent problems in systemd service logs
Info: Checking journalctl logs for 'atomic-openshift-master' service
Info: Checking journalctl logs for 'atomic-openshift-node' service
WARN: [DS2005 from diagnostic AnalyzeLogs@openshift/origin/pkg/diagnostics/systemd/analyze_logs.go:120]
Found 'atomic-openshift-node' journald log message:
W0901 21:43:32.203430 16664 subnets.go:236] Could not find an allocated subnet for node: ip-172-18-10-128.ec2.internal, Waiting...
This warning occurs when the node is trying to request the
SDN subnet it should be configured with according to the master,
but either can't connect to it or has not yet been assigned a subnet.
This can occur before the master becomes fully available and defines a
record for the node to use; the node will wait until that occurs,
so the presence of this message in the node log isn't necessarily a
problem as long as the SDN is actually working, but this message may
help indicate the problem if it is not working.
If the master is available and this log message persists, then it may
be a sign of a different misconfiguration. Check the master's URL in
the node kubeconfig.
* Is the protocol http? It should be https.
* Can you reach the address and port from the node using curl -k?
Info: Checking journalctl logs for 'docker' service
[Note] Running diagnostic: MasterConfigCheck
Description: Check the master config file
WARN: [DH0005 from diagnostic MasterConfigCheck@openshift/origin/pkg/diagnostics/host/check_master_config.go:52]
Validation of master config file '/etc/origin/master/master-config.yaml' warned:
assetConfig.loggingPublicURL: Invalid value: "": required to view aggregated container logs in the console
assetConfig.metricsPublicURL: Invalid value: "": required to view cluster metrics in the console
[Note] Running diagnostic: NodeConfigCheck
Description: Check the node config file
Info: Found a node config file: /etc/origin/node/node-config.yaml
[Note] Running diagnostic: UnitStatus
Description: Check status for related systemd units
[Note] Summary of diagnostics execution (version v3.3.0.28):
[Note] Warnings seen: 3
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1933