Bug 1423640 - Restart of atomic-openshift-node service terminates pod glusterfs mount
Summary: Restart of atomic-openshift-node service terminates pod glusterfs mount
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.4.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.6.z
Assignee: Jan Safranek
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On: 1424680
Blocks: 1462254 1466217 1472370 1472372
 
Reported: 2017-02-17 14:11 UTC by Takeshi Larsson
Modified: 2020-08-13 08:52 UTC
CC List: 24 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When the atomic-openshift-node service was restarted, all processes in its control group were terminated, including the glusterfs mount processes. Consequence: Each glusterfs volume in OpenShift corresponds to one mount point. If all mount points are lost, so are all the volumes. Fix: Set the control group mode to terminate only the main process, leaving the remaining glusterfs mount points untouched. Result: When the atomic-openshift-node service is restarted, no glusterfs mount point is terminated.
Clone Of:
Clones: 1462254 1466848 1472370 1472372
Environment:
Last Closed: 2017-09-08 03:15:23 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1473531 0 medium CLOSED atomic-openshift-node does not terminate journalctl child process cleanly 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHBA-2017:2642 0 normal SHIPPED_LIVE OpenShift Container Platform 3.6.1 bug fix and enhancement update 2017-09-08 07:14:52 UTC

Internal Links: 1473531

Description Takeshi Larsson 2017-02-17 14:11:23 UTC
Description of problem:
When the atomic-openshift-node.service is restarted, either manually or via ansible-playbook,
all pods that have an active PVC mounted lose that glusterfs mount; the pods must then be deleted to get the mount back.

A colleague tested this in his home lab and he was able to reproduce the same behaviour.

Version-Release number of selected component (if applicable):
3.4.1.2

How reproducible:
Always

Steps to Reproduce:
1. Create a pod (jenkins-persistent, for example) using a glusterfs-backed persistent volume.
2. Restart the atomic-openshift-node service on the node the pod is allocated to.
3. oc rsh into the pod and run df; it should return the following error (see the condensed transcript below):
df: '/var/lib/jenkins': Transport endpoint is not connected
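
A condensed shell version of the repro; the template and pod names are illustrative:

$ oc new-app jenkins-persistent              # pod with a glusterfs-backed PV
$ systemctl restart atomic-openshift-node    # run on the node hosting the pod
$ oc rsh jenkins-1-abcde df                  # fails with the error above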


Actual results:
df: '/var/lib/jenkins': Transport endpoint is not connected

Expected results:
No lost mount!

Additional info:

Comment 1 hchen 2017-02-17 19:57:54 UTC
The gluster mount is a FUSE mount; restarting the openshift node service kills the fuse daemon, and thus the mount inside the container fails with the "Transport endpoint is not connected" error.
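
This can be observed directly on the node; a rough sketch (PIDs and mount paths are illustrative placeholders):

$ pgrep -af glusterfs        # one /usr/sbin/glusterfs fuse daemon per gluster mount
$ kill <glusterfs-pid>       # killing the daemon reproduces the symptom:
$ ls <mount-point>
ls: ...: Transport endpoint is not connected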

Comment 2 Takeshi Larsson 2017-02-17 20:45:39 UTC
Is this the intended behaviour? Is there a workaround or is it supposed to be like this?

Comment 3 hchen 2017-02-17 21:59:43 UTC
The solution has to wait for the containerized mount feature. Mounting a FUSE filesystem (like glusterfs) inside the node container cannot survive a restart.

Containerized mount will run the mount in a separate container, so a node restart won't stop the mount container.

Comment 5 Bradley Childs 2017-02-18 18:23:39 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1424680

Comment 6 Eric Paris 2017-02-27 19:37:18 UTC
Is it the pod or the container that needs to be deleted and recreated? Or the running container on the node? What exactly needs to be restarted/recreated to recover?

Comment 9 Takeshi Larsson 2017-03-06 13:21:49 UTC
(In reply to Eric Paris from comment #6)
> Is it the pod or the container that needs to be deleted and recreated? Or
> the running container on the node? What exactly needs to be
> restarted/recreated to recover?

Hi, sorry for the late reply. Usually the pod will "restart", but this does not seem to fix the issue. I have to delete the pod and let the RC deploy a brand new pod.
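
For reference, that recovery boils down to the following (pod name illustrative):

$ oc delete pod jenkins-1-abcde   # drop the pod with the dead mount
$ oc get pods -w                  # the RC deploys a replacement, which mounts the volume afresh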

Comment 12 Bradley Childs 2017-04-19 22:17:43 UTC
The fix for this is in Gluster. Assigning to Michael Adam and adding UpcomingRelease.

Comment 20 Humble Chirammal 2017-06-16 16:32:17 UTC
OCP 3.4 backport PR # https://github.com/openshift/ose/pull/790

Comment 22 Wenqi He 2017-06-22 07:43:28 UTC
Tested on the version below:
openshift v3.4.1.42
kubernetes v1.4.0+776c994

This issue still exists after restarting the node service:
$ oc rsh gluster
/ $ ls /mnt/gluster
ls: /mnt/gluster: Transport endpoint is not connected

Comment 25 Wenqi He 2017-06-22 11:56:30 UTC
Re-assigning QA to Jianwei to help with a deeper investigation of the glusterfs version. Thanks.

Comment 28 Jianwei Hou 2017-06-28 08:50:06 UTC
After the glusterfs-fuse upgrade, auto_unmount became a valid mount option; tested with:

glusterfs-fuse-3.8.4-18.4.el7.x86_64
glusterfs-libs-3.8.4-18.4.el7.x86_64
glusterfs-3.8.4-18.4.el7.x86_64
glusterfs-client-xlators-3.8.4-18.4.el7.x86_64

[root@ip-172-18-7-60 ~]# grep auto_unmount /var/log/messages
Jun 28 04:36:45 ip-172-18-7-60 atomic-openshift-node: I0628 04:36:45.887638   49916 mount_linux.go:103] Mounting 172.18.12.63:vol_76b54eb86baf987d6955e6c2451fb813 /var/lib/origin/openshift.local.volumes/pods/eb480603-5bdc-11e7-bc93-0e76bec68f48/volumes/kubernetes.io~glusterfs/pvc-016e2938-5bdc-11e7-bc93-0e76bec68f48 glusterfs [log-level=ERROR log-file=/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/glusterfs/pvc-016e2938-5bdc-11e7-bc93-0e76bec68f48/gluster-glusterfs.log auto_unmount]
Jun 28 04:42:18 ip-172-18-7-60 atomic-openshift-node: I0628 04:42:18.789212   49916 mount_linux.go:103] Mounting 172.18.12.63:vol_ca1abfa950a9e790fdef002c6246e66c /var/lib/origin/openshift.local.volumes/pods/b1ad9c7b-5bdd-11e7-bc93-0e76bec68f48/volumes/kubernetes.io~glusterfs/pvc-94c8f1db-5bdd-11e7-bc93-0e76bec68f48 glusterfs [log-level=ERROR log-file=/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/glusterfs/pvc-94c8f1db-5bdd-11e7-bc93-0e76bec68f48/gluster1-glusterfs.log auto_unmount]


But even so, once atomic-openshift-node is restarted, the 'Transport endpoint is not connected' message is still shown.

# oc exec -it gluster1 -- ls /mnt/gluster
ls: /mnt/gluster: Transport endpoint is not connected

Comment 29 Jianwei Hou 2017-06-28 09:01:23 UTC
The glusterfs process:

root      54424      1  0 04:47 ?        00:00:00 /usr/sbin/glusterfs --log-level=ERROR --log-file=/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/glusterfs/pvc-016e2938-5bdc-11e7-bc93-0e76bec68f48/gluster-glusterfs.log --fuse-mountopts=auto_unmount --volfile-server=172.18.12.63 --volfile-id=vol_76b54eb86baf987d6955e6c2451fb813 --fuse-mountopts=auto_unmount /var/lib/origin/openshift.local.volumes/pods/eb480603-5bdc-11e7-bc93-0e76bec68f48/volumes/kubernetes.io~glusterfs/pvc-016e2938-5bdc-11e7-bc93-0e76bec68f48

Comment 30 Humble Chirammal 2017-06-28 09:32:42 UTC
(In reply to Jianwei Hou from comment #29)
> The glusterfs process:
> 
> root      54424      1  0 04:47 ?        00:00:00 /usr/sbin/glusterfs
> --log-level=ERROR
> --log-file=/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/
> glusterfs/pvc-016e2938-5bdc-11e7-bc93-0e76bec68f48/gluster-glusterfs.log
> --fuse-mountopts=auto_unmount --volfile-server=172.18.12.63
> --volfile-id=vol_76b54eb86baf987d6955e6c2451fb813
> --fuse-mountopts=auto_unmount
> /var/lib/origin/openshift.local.volumes/pods/eb480603-5bdc-11e7-bc93-
> 0e76bec68f48/volumes/kubernetes.io~glusterfs/pvc-016e2938-5bdc-11e7-bc93-
> 0e76bec68f48

The OCP/Kube side patch was designed to pass the 'auto_unmount' option to this mount. As you pointed out, and as this comment shows, that has happened and is working as expected.

The 'auto_unmount' option is supposed to take care of the reported issue/scenario. I am assigning this bug to the Gluster FUSE engineers for further investigation.

Comment 31 Amar Tumballi 2017-06-28 10:16:38 UTC
Some updates about the 'auto_unmount' option:

With 'auto_unmount', the glusterfs process starts another forked process that monitors the fuse-daemon process. The monitor takes care of clearing all the file descriptors on the mount point, so the mount point gets properly unmounted.

This helps when a gluster process crashes because of an internal bug: there won't be any stale mounts. It can be tested by killing 1 of the 2 glusterfs processes visible in `ps aux` output.

I see that it is confusing to tell which is the monitoring process and which is the mount process just from the 'ps' output.

Csaba, any suggestions / clarification on how to distinguish the glusterfs client process from the monitoring process here?


----

Again, if both glusterfs processes get killed, the feature won't be effective. If the customer wants the mount point to be present even after all glusterfs processes terminate, we need an external script/tool to support that behavior.
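
A sketch of that test (kill only the fuse client, not the monitor; PIDs are whatever ps reports):

$ pgrep -a glusterfs            # expect 2 glusterfs processes per mount: client + monitor
$ kill <client-pid>             # kill just the client
$ grep glusterfs /proc/mounts   # with auto_unmount, the dead mount entry is cleaned up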

Comment 32 Csaba Henk 2017-06-28 23:18:10 UTC
Eg. "ps --forest" can be used to display processes in their ancestry tree.

So if the pids of glusterfs client processes for the given mount have already been
identified,

$ ps --forest <pid1> <pid2>

can be used. Otherwise something like

$ pgrep glusterfs | xargs ps --forest

The parent is the actual client process, the child is the unmount agent.

While it's a bit hacky, this is also suitable for identifying the actual client
/ unmount agent in a script, by grepping the above output for the absence / presence of the child ASCII marker "\_".
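
A sketch of that script-level identification (plain ps/awk; nothing gluster-specific assumed):

$ pgrep glusterfs | xargs ps --forest -o pid,args | awk '
    /glusterfs/ && /\\_/  { print "unmount agent:", $1 }
    /glusterfs/ && !/\\_/ { print "client:", $1 }'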

Comment 35 Jan Safranek 2017-06-29 14:59:07 UTC
From the bug report it's not clear if the customer runs atomic-openshift-node service in a container or on the host.

I tested this on OSE 3.4.1.44 from RPM, i.e. atomic-openshift-node is running on the host and not in a container. When OpenShift mounts a gluster volume, it executes "/bin/mount -t glusterfs <what> <where>". /bin/mount spawns a fuse daemon /usr/sbin/glusterfs that handles the mount. So far so good.

Now the user restarts the atomic-openshift-node service. systemd kills the /usr/bin/openshift process and then also kills all its children as cleanup of the service cgroup. So the gluster fuse daemon gets killed too - it was spawned (indirectly) by the atomic-openshift-node service and systemd remembers that.

Adding this to /lib/systemd/system/atomic-openshift-node.service fixes the problem for me:

[Service]
KillMode=process

Note that it leaves all processes in openshift's cgroup running when the service restarts. Who knows what's running there; maybe some of those processes expected to be killed automatically when the service restarts. Someone smarter than me should approve this change.

I tested the current atomic-openshift-node-3.6.126.3-1.git.0.6168324.el7.x86_64; the behavior is the same.

====

Running atomic-openshift-node in a container suffers from a similar bug: we try to mount the gluster volume on the host via nsenter, but that does not escape the docker container cgroup. So the gluster fuse daemon runs in the host mount namespace and at the same time in the container's cgroup. And everything in the cgroup is killed when docker stops the container. So we need something stronger than nsenter here.
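
The cgroup relationship is easy to see on an affected node; a quick sketch:

$ systemctl status atomic-openshift-node     # the CGroup: tree lists the glusterfs fuse daemons
$ cat /proc/$(pgrep -f /usr/sbin/glusterfs | head -1)/cgroup
# ...shows the daemon inside the atomic-openshift-node.service cgroup, which
# systemd cleans up on restart unless KillMode=process is set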

Comment 38 Jan Safranek 2017-06-30 12:37:09 UTC
(In reply to Takeshi Larsson from comment #37)
> we were installing openshift using RPMS.

Cool, so we just need to change the service file as recommended in comment #35 (and test it properly!). Who is in charge of the service files and node process decomposition?

As a workaround, the customer can edit the service file and restart it:

$ cat <<EOF >/etc/systemd/system/atomic-openshift-node.service
.include /lib/systemd/system/atomic-openshift-node.service
[Service]
KillMode=process
EOF

$ systemctl daemon-reload
$ systemctl restart atomic-openshift-node

Please report any success or failure of this workaround.
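
One way to confirm the override took effect (a quick sketch):

$ systemctl show atomic-openshift-node -p KillMode
KillMode=process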

Comment 40 Matthew Robson 2017-06-30 13:15:32 UTC
The use case here is atomic-openshift-node running on a host (RPM install), but I would say both should work similarly.

I think the risk is that if the mounts are not properly terminated, there is an issue if that pod moves to a new node and cannot re-bind the gluster PV...

The goal of the gluster fix was to ensure that when the gluster process died (from a node restart), the unmount was cleanly handled, so that when the pod came back it would cleanly remount.

Comment 41 Scott Dodson 2017-06-30 13:38:12 UTC
Jan, we'll need to fix this for containerized too. What suggestions do you have for fixing this?

Once the exact changes necessary for the systemd units are defined I'm happy to have this move to OpenShift Installer component.

Comment 42 Jan Safranek 2017-06-30 14:18:34 UTC
(In reply to Scott Dodson from comment #41)
> Jan, we'll need to fix this for containerized too. What suggestions do you
> have for fixing this?

Step 1: create a separate bug so we don't mix fixes for OpenShift from RPM and in a container. The root causes of these bugs are different; they will have different fixes and probably also different testing.

I am going to clone this bug shortly.

Comment 43 Jan Safranek 2017-06-30 14:26:07 UTC
From now on, let's track progress for OpenShift Node service running from RPM here. I created bug #1466848 to track the same issue in containerized openshift.

Comment 49 Scott Dodson 2017-07-11 14:20:37 UTC
We need to add the following to the node service that openshift-ansible deploys for RPM-based installs, in the openshift_node and openshift_node_upgrade roles.

[Service]
KillMode=process
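
Equivalently, the roles could ship a systemd drop-in rather than templating the whole unit file; a sketch (the drop-in file name is an assumption):

$ mkdir -p /etc/systemd/system/atomic-openshift-node.service.d
$ cat <<EOF >/etc/systemd/system/atomic-openshift-node.service.d/kill-mode.conf
[Service]
KillMode=process
EOF
$ systemctl daemon-reload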

Comment 50 Jan Chaloupka 2017-07-13 14:54:37 UTC
Upstream PR: https://github.com/openshift/openshift-ansible/pull/4755

Comment 52 Jianwei Hou 2017-07-20 07:02:06 UTC
I did a quick test on openshift v3.6.153, this is still reproducible after adding KillMode=process.

# cat /lib/systemd/system/atomic-openshift-node.service
[Unit]
Description=Atomic OpenShift Node
After=docker.service
After=openvswitch.service
Wants=docker.service
Documentation=https://github.com/openshift/origin

[Service]
Type=notify
EnvironmentFile=/etc/sysconfig/atomic-openshift-node
Environment=GOTRACEBACK=crash
ExecStart=/usr/bin/openshift start node --config=${CONFIG_FILE} $OPTIONS
LimitNOFILE=65536
LimitCORE=infinity
WorkingDirectory=/var/lib/origin/
SyslogIdentifier=atomic-openshift-node
Restart=always
RestartSec=5s
OOMScoreAdjust=-999
KillMode=process

[Install]
WantedBy=multi-user.target

Comment 53 Jianwei Hou 2017-07-20 09:41:05 UTC
Sorry, I missed comment 47.

So I tested again; this time I just removed /etc/systemd/system/atomic-openshift-node.service and used the same /lib/systemd/system/atomic-openshift-node.service from the previous comment.

This time, after the restart, the problem is gone! This bug is good to verify as soon as we verify the installer has added KillMode=process.

Comment 54 Jianwei Hou 2017-07-20 09:52:13 UTC
Verified on v3.6.153

After installation, the "KillMode=process" is added to /etc/systemd/system/atomic-openshift-node.service. With this option added, the "Transport endpoint is not connected" problem is fixed!

```
# cat /etc/systemd/system/atomic-openshift-node.service
[Unit]
Description=OpenShift Node
After=docker.service
Wants=openvswitch.service
After=ovsdb-server.service
After=ovs-vswitchd.service
Wants=docker.service
Documentation=https://github.com/openshift/origin
Requires=dnsmasq.service
After=dnsmasq.service

[Service]
Type=notify
EnvironmentFile=/etc/sysconfig/atomic-openshift-node
Environment=GOTRACEBACK=crash
ExecStartPre=/usr/bin/cp /etc/origin/node/node-dnsmasq.conf /etc/dnsmasq.d/
ExecStartPre=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:/in-addr.arpa/127.0.0.1,/cluster.local/127.0.0.1
ExecStopPost=/usr/bin/rm /etc/dnsmasq.d/node-dnsmasq.conf
ExecStopPost=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:
ExecStart=/usr/bin/openshift start node --config=${CONFIG_FILE} $OPTIONS
LimitNOFILE=65536
LimitCORE=infinity
WorkingDirectory=/var/lib/origin/
SyslogIdentifier=atomic-openshift-node
Restart=always
RestartSec=5s
OOMScoreAdjust=-999
KillMode=process

[Install]
WantedBy=multi-user.target
```

Comment 56 Scott Dodson 2017-07-25 18:39:58 UTC
These changes have been reverted. Moving back to ASSIGNED and re-assigning to the storage team to come up with a new plan. I'll leave it up to the storage team to decide whether this is a 3.6 blocker.

Comment 57 Scott Dodson 2017-07-25 18:40:41 UTC
https://github.com/openshift/openshift-ansible/pull/4755 contains more discussion as to why this was reverted.

Comment 59 Bradley Childs 2017-07-26 13:56:25 UTC
The ansible fix was reverted upstream because it caused problems in other components. Jan is working on another fix that doesn't involve ansible.

Comment 60 Jan Safranek 2017-07-26 14:44:58 UTC
upstream PR: https://github.com/kubernetes/kubernetes/pull/49640

Comment 63 Jan Safranek 2017-08-10 10:16:45 UTC
Filed downstream PR for 3.6.x: https://github.com/openshift/ose/pull/829

Comment 67 Jianwei Hou 2017-08-30 08:27:57 UTC
Verified this is fixed on v3.6.173.0.21.
The fix resolved the issue even after the ansible PR was reverted.

Comment 71 errata-xmlrpc 2017-09-08 03:15:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2642

