Bug 1484475 - Improve error messages for FailedMount
Summary: Improve error messages for FailedMount
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.7.0
Assignee: Seth Jennings
QA Contact: weiwei jiang
URL:
Whiteboard:
Depends On:
Blocks: 1724792
Reported: 2017-08-23 17:09 UTC by Thom Carlin
Modified: 2019-06-28 16:04 UTC
CC List: 3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-28 22:07:41 UTC
Target Upstream Version:
Embargoed:




Links
System: Red Hat Product Errata
ID: RHSA-2017:3188
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update
Last Updated: 2017-11-29 02:34:54 UTC

Description Thom Carlin 2017-08-23 17:09:59 UTC
Description of problem:

When mounts fail for containers, there is not enough information to easily understand what the cause is.

Version-Release number of selected component (if applicable):

3.5

How reproducible:

100% for a FailedMount

Steps to Reproduce:
1. Start a pod that results in a FailedMount.
2. Try to diagnose the failure using "oc" commands.

Actual results:

The mount fails, but no clear reason is given.

Expected results:

A clear explanation of why the mount failed.

Additional info:

This occurred for me with CNS (Gluster).  I believe that https://github.com/kubernetes/kubernetes/pull/42006 may help:
"This fixes the problem of mount errors being eaten and not displayed to users again. Specifically, errors caught in MountVolume.NewMounter (like missing endpoints, etc.).

Current behavior for any mount failure:

Events:
  FirstSeen    LastSeen    Count    From            SubObjectPath    Type        Reason        Message
  ---------    --------    -----    ----            -------------    --------    ------        -------
  12m        12m        1    default-scheduler            Normal        Scheduled    Successfully assigned glusterfs-bb-pod1 to 127.0.0.1
  10m        1m        5    kubelet, 127.0.0.1            Warning        FailedMount    Unable to mount volumes for pod "glusterfs-bb-pod1_default(67c9dfa7-f9f5-11e6-aee2-5254003a59cf)": timeout expired waiting for volumes to attach/mount for pod "default"/"glusterfs-bb-pod1". list of unattached/unmounted volumes=[glusterfsvol]
  10m        1m        5    kubelet, 127.0.0.1            Warning        FailedSync    Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "default"/"glusterfs-bb-pod1". list of unattached/unmounted volumes=[glusterfsvol]

New behavior:

For example, on GlusterFS with the endpoints deliberately not created, the correct message is now displayed:

Events:
  FirstSeen	LastSeen	Count	From			SubObjectPath	Type		Reason		Message
  ---------	--------	-----	----			-------------	--------	------		-------
  2m		2m		1	default-scheduler			Normal		Scheduled	Successfully assigned glusterfs-bb-pod1 to 127.0.0.1
  54s		54s		1	kubelet, 127.0.0.1			Warning		FailedMount	Unable to mount volumes for pod "glusterfs-bb-pod1_default(8edd2c25-fa09-11e6-92ae-5254003a59cf)": timeout expired waiting for volumes to attach/mount for pod "default"/"glusterfs-bb-pod1". With error timed out waiting for the condition. list of unattached/unmounted volumes=[glusterfsvol]
  54s		54s		1	kubelet, 127.0.0.1			Warning		FailedSync	Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "default"/"glusterfs-bb-pod1". With error timed out waiting for the condition. list of unattached/unmounted volumes=[glusterfsvol]
  2m		6s		814	kubelet, 127.0.0.1			Warning		FailedMount	MountVolume.NewMounter failed for volume "kubernetes.io/glusterfs/8edd2c25-fa09-11e6-92ae-5254003a59cf-glusterfsvol" (spec.Name: "glusterfsvol") pod "8edd2c25-fa09-11e6-92ae-5254003a59cf" (UID: "8edd2c25-fa09-11e6-92ae-5254003a59cf") with: endpoints "glusterfs-cluster" not found"
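The difference between the two event listings can be sketched as a small Python model. This is a hypothetical, simplified illustration only: `MounterError`, `EventRecorder`, `new_mounter`, and `mount_volumes` are invented names, not Kubernetes APIs, and the model reports the timeout immediately rather than after a real wait. It assumes, as in the example above, that a GlusterFS volume cannot get a mounter when the "glusterfs-cluster" endpoints object is missing, and contrasts dropping that error with surfacing it as a FailedMount event.

```python
# Hypothetical, simplified model of a kubelet-style mount path. None of
# these names are Kubernetes APIs; they only contrast swallowing a
# NewMounter error with reporting it as an event.
from dataclasses import dataclass, field


class MounterError(Exception):
    """Raised when a mounter cannot be built (e.g. missing endpoints)."""


@dataclass
class EventRecorder:
    events: list = field(default_factory=list)

    def warning(self, reason: str, message: str) -> None:
        self.events.append(("Warning", reason, message))


def new_mounter(volume: str, endpoints: set) -> str:
    # Stand-in for MountVolume.NewMounter: assume the GlusterFS volume
    # needs an endpoints object named "glusterfs-cluster" to exist.
    if "glusterfs-cluster" not in endpoints:
        raise MounterError('endpoints "glusterfs-cluster" not found')
    return f"mounter-for-{volume}"


def mount_volumes(volumes, endpoints, recorder, surface_errors: bool) -> None:
    unmounted = []
    for vol in volumes:
        try:
            new_mounter(vol, endpoints)
        except MounterError as err:
            unmounted.append(vol)
            if surface_errors:
                # New behavior: the root cause becomes a user-visible event.
                recorder.warning(
                    "FailedMount",
                    f'MountVolume.NewMounter failed for volume "{vol}" with: {err}',
                )
            # Old behavior: the error is silently dropped here.
    if unmounted:
        # Both behaviors still report the generic timeout message.
        recorder.warning(
            "FailedMount",
            "timeout expired waiting for volumes to attach/mount. "
            "list of unattached/unmounted volumes=[" + ", ".join(unmounted) + "]",
        )


old, new = EventRecorder(), EventRecorder()
mount_volumes(["glusterfsvol"], set(), old, surface_errors=False)
mount_volumes(["glusterfsvol"], set(), new, surface_errors=True)
for _, reason, message in new.events:
    print(reason, message)
```

In this sketch the old path leaves only the generic timeout, while the new path adds one event that names the missing endpoints, mirroring the extra FailedMount line in the new output.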

Comment 1 Seth Jennings 2017-08-25 19:43:05 UTC
This change is already in origin/master via the kube 1.7 rebase. The question is whether we should cherry-pick it to 3.6.

There has been a lot of focus in 3.6 on reducing API server requests and load. This change generates yet another event for every failing pod, although it does provide useful information to the user.
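To make the load concern concrete, here is a toy Python model of event deduplication. It assumes, based only on the Count column (814) in the sample output above, that repeats of an identical event are folded into one record whose count is incremented. `Event`, `DedupingRecorder`, and `api_writes` are hypothetical names, this is not the kubelet's real event recorder, and it ignores any rate limiting a real recorder may apply. Under these assumptions a failing pod adds only one new event object, but every retry still costs a write.

```python
# Toy model of event deduplication, suggested by the Count column in the
# sample output. Not the kubelet's real event recorder; it ignores any
# rate limiting a real recorder may apply.
from dataclasses import dataclass


@dataclass
class Event:
    reason: str
    message: str
    first_seen: int  # toy clock, arbitrary units
    last_seen: int
    count: int = 1


class DedupingRecorder:
    def __init__(self) -> None:
        self.store = {}      # (object, reason, message) -> Event
        self.api_writes = 0  # pretend create/update requests to the API server

    def record(self, obj: str, reason: str, message: str, now: int) -> Event:
        key = (obj, reason, message)
        ev = self.store.get(key)
        if ev is None:
            ev = Event(reason, message, first_seen=now, last_seen=now)
            self.store[key] = ev
        else:
            # Identical event: bump the counter instead of creating a new object.
            ev.count += 1
            ev.last_seen = now
        # Either way, this sketch assumes one write reaches the API server.
        self.api_writes += 1
        return ev


rec = DedupingRecorder()
msg = 'endpoints "glusterfs-cluster" not found'
for retry in range(814):
    ev = rec.record("glusterfs-bb-pod1", "FailedMount", msg, now=retry)
print(len(rec.store), ev.count, rec.api_writes)  # prints: 1 814 814
```

The design choice being modeled is the trade-off in the comment: deduplication keeps the number of event objects flat, but the per-retry updates are still API server traffic.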

Let me ask around and see what we want to do.

Comment 2 Seth Jennings 2017-08-25 19:56:33 UTC
Fixed in 3.7 via the kube 1.7 rebase, which includes https://github.com/kubernetes/kubernetes/pull/42006

Comment 3 weiwei jiang 2017-08-31 10:19:22 UTC
Checked with:
# oc version 
oc v3.7.0-0.123.0
kubernetes v1.7.0+695f48a16f
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server xxx
openshift v3.7.0-0.123.0
kubernetes v1.7.0+695f48a16f

The error messages for FailedMount are clearer now.

Comment 6 errata-xmlrpc 2017-11-28 22:07:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

