1565361 – atomic-openshift-node.service fails to start, /var mount options don't include grpquota on XFS FS

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	3.9.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	3.9.z
Assignee:	Scott Dodson
QA Contact:	Johnny Liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-04-09 22:16 UTC by Dan Yocum
Modified:	2018-06-21 13:20 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-06-21 13:20:04 UTC
Target Upstream Version:
Embargoed:

Description Dan Yocum 2018-04-09 22:16:56 UTC

Description of problem:

atomic-openshift-node service fails to (re)start on infra nodes

Version-Release number of selected component (if applicable):

3.9.14

How reproducible:

Every time

Steps to Reproduce:
1. Install OpenShift
2. Restart atomic-openshift-node on type=node nodes

Actual results:

Fails with this error:

-- Unit run-94204.scope has begun starting up.
Apr 09 22:14:03 ip-172-31-73-199.us-east-2.compute.internal atomic-openshift-node[94192]: I0409 22:14:03.911813   94192 mount_linux.go:208] Detected OS with systemd
Apr 09 22:14:03 ip-172-31-73-199.us-east-2.compute.internal atomic-openshift-node[94192]: W0409 22:14:03.911981   94192 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Apr 09 22:14:03 ip-172-31-73-199.us-east-2.compute.internal atomic-openshift-node[94192]: I0409 22:14:03.914041   94192 node.go:294] Starting openshift-sdn network plugin
Apr 09 22:14:03 ip-172-31-73-199.us-east-2.compute.internal atomic-openshift-node[94192]: F0409 22:14:03.920305   94192 node.go:108] Could not set up local quota, /var/lib/origin/openshift.local.volumes is not on a filesystem mounted with the grpquota option
Apr 09 22:14:03 ip-172-31-73-199.us-east-2.compute.internal systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Apr 09 22:14:03 ip-172-31-73-199.us-east-2.compute.internal dnsmasq[3290]: setting upstream servers from DBus
Apr 09 22:14:03 ip-172-31-73-199.us-east-2.compute.internal dnsmasq[3290]: using nameserver 172.31.0.2#53
Apr 09 22:14:03 ip-172-31-73-199.us-east-2.compute.internal systemd[1]: Failed to start OpenShift Node.
-- Subject: Unit atomic-openshift-node.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit atomic-openshift-node.service has failed.
-- 
-- The result is failed.
Apr 09 22:14:03 ip-172-31-73-199.us-east-2.compute.internal systemd[1]: Unit atomic-openshift-node.service entered failed state.
Apr 09 22:14:03 ip-172-31-73-199.us-east-2.compute.internal systemd[1]: atomic-openshift-node.service failed.


NB: "Could not set up local quota"

When did this become necessary? It wasn't an issue on versions <= 3.9.7.
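
To confirm what the fatal message is complaining about, the mount options of the filesystem holding /var/lib/origin/openshift.local.volumes can be inspected. The following is a minimal sketch, not the node's actual check: the function names are hypothetical, and treating 'gquota' as equivalent to 'grpquota' is an assumption based on the XFS mount option aliases. It parses text in /proc/mounts format and finds the mount that contains a given path.

```python
import os

# Mount options treated as "group quota enabled" (assumption: gquota is an alias).
QUOTA_OPTIONS = {"grpquota", "gquota"}


def parse_mounts(text):
    """Parse /proc/mounts-style text into (mount_point, fstype, options) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        entries.append((fields[1], fields[2], set(fields[3].split(","))))
    return entries


def find_mount(path, entries):
    """Return the entry whose mount point is the longest prefix of path."""
    path = os.path.normpath(path)
    best = None
    for mount_point, fstype, options in entries:
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fstype, options)
    return best


def has_group_quota(path, entries):
    """True if the filesystem containing path is mounted with a group quota option."""
    entry = find_mount(path, entries)
    return entry is not None and bool(entry[2] & QUOTA_OPTIONS)
```

On a node, this could be called as has_group_quota("/var/lib/origin/openshift.local.volumes", parse_mounts(open("/proc/mounts").read())).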

Expected results:

Succeeds!

Additional info:

The node-config.yaml file:

Comment 3 Dan Yocum 2018-04-10 15:55:53 UTC

Adding sdodson since openshift-ansible should ensure that the grpquota mount option is set in the AMI during provisioning.

The GCP provisioner already has this set correctly in roles/openshift_gcp/tasks/configure_gcp_base_image.yml

Comment 4 Scott Dodson 2018-04-17 12:35:27 UTC

We need to check whether /var is a separate mount:

if it is, set gquota on it.
else, set gquota on /

They're doing this for Azure too. I'm not sure we'll have a generic solution for all providers, as the filesystem mounts and types may vary.

https://github.com/openshift/openshift-ansible/pull/7783/files#diff-076e901ea50a9e0fc07272be1e6d035dR37
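
The "/var if it is a mount, otherwise /" decision described in this comment can be sketched as follows. This is only an illustration of the rule, not the code from the linked pull request; the function name quota_target and its parameters are hypothetical.

```python
import os


def quota_target(var_path="/var", ismount=os.path.ismount):
    """Return the mount point that should carry gquota.

    If var_path is its own mount point, the option belongs there;
    otherwise /var lives on the root filesystem, so the option belongs on /.
    The ismount parameter exists only so the rule can be exercised without
    touching the real filesystem.
    """
    return var_path if ismount(var_path) else "/"
```

Calling quota_target() with no arguments consults the running system through os.path.ismount.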

Comment 5 Scott Dodson 2018-04-17 13:21:13 UTC

https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_gcp/tasks/configure_gcp_base_image.yml#L7

Comment 6 Michael Gugino 2018-04-17 17:00:38 UTC

We don't specify any publicly available AMI or any publicly available steps for building the base AMI.

The installer doesn't have any insight into what is being mounted where.  If these options are required at install time, they should be built into the AMI, which I believe we discussed before.

GCE's steps are specific to an image which is part of the CI job.  Those steps might not be generally useful to people using other images, or, at the least, their images would have to conform to what the installer is doing.

In order to fix this, we need the steps to create the AMI in question published so others can consume our build process, or we need a public AMI.

Comment 7 Dan Yocum 2018-04-18 17:04:11 UTC

#
# /etc/fstab
# Created by anaconda on Thu Nov 16 21:00:13 2017
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/rootvg-rootvol /                       xfs     defaults        0 0
UUID=a8e4daf2-900b-4719-a468-ea60b590a38c /boot                   ext4    defaults        1 2
/dev/mapper/rootvg-var  /var                    xfs     defaults        0 0


Adding 'grpquota' to the 'defaults' mount options for /var does the right thing and allows atomic-openshift-node to start successfully.
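
The fstab change described above can be sketched as a small text transformation. This is a hedged example, not a provisioning tool: add_mount_option is a hypothetical helper that operates on fstab text passed in as a string, and it rewrites the matched line with tab separators rather than preserving the original column alignment.

```python
def add_mount_option(fstab_text, mount_point, option="grpquota"):
    """Return fstab_text with option appended to the options of mount_point's entry."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # An fstab entry has at least: device, mount point, fstype, options.
        is_entry = len(fields) >= 4 and not line.lstrip().startswith("#")
        if is_entry and fields[1] == mount_point:
            options = fields[3].split(",")
            if option not in options:
                options.append(option)
                fields[3] = ",".join(options)
                line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"
```

Applied to the /var entry shown above, the options field becomes 'defaults,grpquota'. Comment lines and other entries are passed through unchanged, and running it a second time makes no further changes.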

Comment 8 Dan Yocum 2018-05-22 15:28:11 UTC

I've just deployed a 3.9.25 cluster and this problem is resolved.  atomic-openshift-node restarts fine, and the 'grpquota' option on XFS-formatted /var partitions is NOT required:


/dev/mapper/rootvg-var on /var type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rootvg-var on /var/lib/docker/containers type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rootvg-var on /var/lib/docker/devicemapper type xfs (rw,relatime,seclabel,attr2,inode64,noquota)



# systemctl status atomic-openshift-node
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
           └─openshift-sdn-ovs.conf
   Active: active (running) since Tue 2018-05-22 15:24:09 UTC; 2min 39s ago
...

Comment 9 Scott Dodson 2018-06-21 13:20:04 UTC

This needs to be fixed in whatever mechanism is provisioning and mounting the filesystems. This is not handled by openshift-ansible.
