Bug 1423640 - Restart of atomic-openshift-node service terminates pod glusterfs mount
Summary: Restart of atomic-openshift-node service terminates pod glusterfs mount
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.4.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.6.z
Assignee: Jan Safranek
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On: 1424680
Blocks: 1462254 1466217 1472370 1472372
 
Reported: 2017-02-17 14:11 UTC by Takeshi Larsson
Modified: 2020-08-13 08:52 UTC
CC List: 24 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When the atomic-openshift-node service was restarted, all processes in its control group were terminated, including the glusterfs mount processes. Consequence: Each glusterfs volume in OpenShift corresponds to one mount point. If all mount points are lost, so are all the volumes. Fix: Set the control group mode to terminate only the main process, leaving the remaining glusterfs mount points untouched. Result: When the atomic-openshift-node service is restarted, no glusterfs mount point is terminated.
Clone Of:
Clones: 1462254 1466848 1472370 1472372
Environment:
Last Closed: 2017-09-08 03:15:23 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1473531 0 medium CLOSED atomic-openshift-node does not terminate journalctl child process cleanly 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHBA-2017:2642 0 normal SHIPPED_LIVE OpenShift Container Platform 3.6.1 bug fix and enhancement update 2017-09-08 07:14:52 UTC

Internal Links: 1473531

Description Takeshi Larsson 2017-02-17 14:11:23 UTC
Description of problem:
When the atomic-openshift-node.service is restarted, either manually or via ansible-playbook,
all pods that have an active PVC mounted lose that glusterfs mount; the pods must then be deleted to get the mount back.

A colleague tested this in his home lab and he was able to reproduce the same behaviour.

Version-Release number of selected component (if applicable):
3.4.1.2

How reproducible:
Always

Steps to Reproduce:
1. Create a pod (jenkins-persistent, for example) using a glusterfs-backed persistent volume.
2. Restart the atomic-openshift-node service on the node the pod is allocated to.
3. oc rsh into the pod and run df; it should return the following error (see the condensed transcript below):
df: '/var/lib/jenkins': Transport endpoint is not connected
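
A condensed shell version of the repro; the template and pod names are illustrative:

$ oc new-app jenkins-persistent              # pod with a glusterfs-backed PV
$ systemctl restart atomic-openshift-node    # run on the node hosting the pod
$ oc rsh jenkins-1-abcde df                  # fails with the error above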


Actual results:
df: '/var/lib/jenkins': Transport endpoint is not connected

Expected results:
No lost mount!

Additional info:

Comment 1 hchen 2017-02-17 19:57:54 UTC
The gluster mount is a FUSE mount; restarting the openshift node service kills the fuse daemon, and thus the mount inside the container fails with the "Transport endpoint is not connected" error.
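
This can be observed directly on the node; a rough sketch (PIDs and mount paths are illustrative placeholders):

$ pgrep -af glusterfs        # one /usr/sbin/glusterfs fuse daemon per gluster mount
$ kill <glusterfs-pid>       # killing the daemon reproduces the symptom:
$ ls <mount-point>
ls: ...: Transport endpoint is not connected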

Comment 2 Takeshi Larsson 2017-02-17 20:45:39 UTC
Is this the intended behaviour? Is there a workaround or is it supposed to be like this?

Comment 3 hchen 2017-02-17 21:59:43 UTC
The solution has to wait for the containerized mount feature. Mounting a FUSE filesystem (like glusterfs) inside the node container cannot survive a restart.

Containerized mount will run the mount in a separate container, so a node restart won't stop the mount container.

Comment 5 Bradley Childs 2017-02-18 18:23:39 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1424680

Comment 6 Eric Paris 2017-02-27 19:37:18 UTC
Is it the pod or the container that needs to be deleted and recreated? Or the running container on the node? What exactly needs to be restarted/recreated to recover?

Comment 9 Takeshi Larsson 2017-03-06 13:21:49 UTC
(In reply to Eric Paris from comment #6)
> Is it the pod or the container that needs to be deleted and recreated? Or
> the running container on the node? What exactly needs to be
> restarted/recreated to recover?

Hi, sorry for the late reply. Usually the pod will "restart", but this does not seem to fix the issue. I have to delete the pod and let the RC deploy a brand new pod.
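
For reference, that recovery boils down to the following (pod name illustrative):

$ oc delete pod jenkins-1-abcde   # drop the pod with the dead mount
$ oc get pods -w                  # the RC deploys a replacement, which mounts the volume afresh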

Comment 12 Bradley Childs 2017-04-19 22:17:43 UTC
The fix for this is in Gluster. Assigning to Michael Adam and adding UpcomingRelease.

Comment 20 Humble Chirammal 2017-06-16 16:32:17 UTC
OCP 3.4 backport PR # https://github.com/openshift/ose/pull/790

Comment 22 Wenqi He 2017-06-22 07:43:28 UTC
Tested on the version below:
openshift v3.4.1.42
kubernetes v1.4.0+776c994

This issue still exists after restarting the node service:
$ oc rsh gluster
/ $ ls /mnt/gluster
ls: /mnt/gluster: Transport endpoint is not connected

Comment 25 Wenqi He 2017-06-22 11:56:30 UTC
Re-assigning QA to Jianwei to help with a deeper investigation of the glusterfs version. Thanks.

Comment 28 Jianwei Hou 2017-06-28 08:50:06 UTC
After the glusterfs-fuse upgrade, auto_unmount became a valid mount option; tested with:

glusterfs-fuse-3.8.4-18.4.el7.x86_64
glusterfs-libs-3.8.4-18.4.el7.x86_64
glusterfs-3.8.4-18.4.el7.x86_64
glusterfs-client-xlators-3.8.4-18.4.el7.x86_64

[root@ip-172-18-7-60 ~]# grep auto_unmount /var/log/messages
Jun 28 04:36:45 ip-172-18-7-60 atomic-openshift-node: I0628 04:36:45.887638   49916 mount_linux.go:103] Mounting 172.18.12.63:vol_76b54eb86baf987d6955e6c2451fb813 /var/lib/origin/openshift.local.volumes/pods/eb480603-5bdc-11e7-bc93-0e76bec68f48/volumes/kubernetes.io~glusterfs/pvc-016e2938-5bdc-11e7-bc93-0e76bec68f48 glusterfs [log-level=ERROR log-file=/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/glusterfs/pvc-016e2938-5bdc-11e7-bc93-0e76bec68f48/gluster-glusterfs.log auto_unmount]
Jun 28 04:42:18 ip-172-18-7-60 atomic-openshift-node: I0628 04:42:18.789212   49916 mount_linux.go:103] Mounting 172.18.12.63:vol_ca1abfa950a9e790fdef002c6246e66c /var/lib/origin/openshift.local.volumes/pods/b1ad9c7b-5bdd-11e7-bc93-0e76bec68f48/volumes/kubernetes.io~glusterfs/pvc-94c8f1db-5bdd-11e7-bc93-0e76bec68f48 glusterfs [log-level=ERROR log-file=/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/glusterfs/pvc-94c8f1db-5bdd-11e7-bc93-0e76bec68f48/gluster1-glusterfs.log auto_unmount]


But even so, once atomic-openshift-node is restarted, the 'Transport endpoint is not connected' message is still shown.

# oc exec -it gluster1 -- ls /mnt/gluster
ls: /mnt/gluster: Transport endpoint is not connected

Comment 29 Jianwei Hou 2017-06-28 09:01:23 UTC
The glusterfs process:

root      54424      1  0 04:47 ?        00:00:00 /usr/sbin/glusterfs --log-level=ERROR --log-file=/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/glusterfs/pvc-016e2938-5bdc-11e7-bc93-0e76bec68f48/gluster-glusterfs.log --fuse-mountopts=auto_unmount --volfile-server=172.18.12.63 --volfile-id=vol_76b54eb86baf987d6955e6c2451fb813 --fuse-mountopts=auto_unmount /var/lib/origin/openshift.local.volumes/pods/eb480603-5bdc-11e7-bc93-0e76bec68f48/volumes/kubernetes.io~glusterfs/pvc-016e2938-5bdc-11e7-bc93-0e76bec68f48

Comment 30 Humble Chirammal 2017-06-28 09:32:42 UTC
(In reply to Jianwei Hou from comment #29)
> The glusterfs process:
> 
> root      54424      1  0 04:47 ?        00:00:00 /usr/sbin/glusterfs
> --log-level=ERROR
> --log-file=/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/
> glusterfs/pvc-016e2938-5bdc-11e7-bc93-0e76bec68f48/gluster-glusterfs.log
> --fuse-mountopts=auto_unmount --volfile-server=172.18.12.63
> --volfile-id=vol_76b54eb86baf987d6955e6c2451fb813
> --fuse-mountopts=auto_unmount
> /var/lib/origin/openshift.local.volumes/pods/eb480603-5bdc-11e7-bc93-
> 0e76bec68f48/volumes/kubernetes.io~glusterfs/pvc-016e2938-5bdc-11e7-bc93-
> 0e76bec68f48

The OCP/Kube side patch was designed to pass the 'auto_unmount' option to this mount. As you pointed out, and as this comment shows, that has happened and is working as expected.

The 'auto_unmount' option is supposed to take care of the reported issue/scenario. I am assigning this bug to the Gluster FUSE engineers for further investigation.

Comment 31 Amar Tumballi 2017-06-28 10:16:38 UTC
Some updates about the 'auto_unmount' option:

With 'auto_unmount', the glusterfs process starts another forked process that monitors the fuse-daemon process. The monitor takes care of clearing all the file descriptors on the mount point, so the mount point gets properly unmounted.

This helps when a gluster process crashes because of an internal bug: there won't be any stale mounts. It can be tested by killing 1 of the 2 glusterfs processes visible in `ps aux` output.

I see that it is confusing to tell which is the monitoring process and which is the mount process just from the 'ps' output.

Csaba, any suggestions / clarification on how to distinguish the glusterfs client process from the monitoring process here?


----

Again, if both glusterfs processes get killed, the feature won't be effective. If the customer wants the mount point to be present even after all glusterfs processes terminate, we need an external script/tool to support that behavior.
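
A sketch of that test (kill only the fuse client, not the monitor; PIDs are whatever ps reports):

$ pgrep -a glusterfs            # expect 2 glusterfs processes per mount: client + monitor
$ kill <client-pid>             # kill just the client
$ grep glusterfs /proc/mounts   # with auto_unmount, the dead mount entry is cleaned up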

Comment 32 Csaba Henk 2017-06-28 23:18:10 UTC
Eg. "ps --forest" can be used to display processes in their ancestry tree.

So if the pids of glusterfs client processes for the given mount have already been
identified,

$ ps --forest <pid1> <pid2>

can be used. Otherwise something like

$ pgrep glusterfs | xargs ps --forest

The parent is the actual client process, the child is the unmount agent.

While it's a bit hacky, this is also suitable for identifying the actual client
/ unmount agent in a script, by grepping the above output for the absence / presence of the child ASCII marker "\_".
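
A sketch of that script-level identification (plain ps/awk; nothing gluster-specific assumed):

$ pgrep glusterfs | xargs ps --forest -o pid,args | awk '
    /glusterfs/ && /\\_/  { print "unmount agent:", $1 }
    /glusterfs/ && !/\\_/ { print "client:", $1 }'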

Comment 35 Jan Safranek 2017-06-29 14:59:07 UTC
From the bug report it's not clear if the customer runs atomic-openshift-node service in a container or on the host.

I tested this on OSE 3.4.1.44 from RPM, i.e. atomic-openshift-node is running on the host and not in a container. When OpenShift mounts a gluster volume, it executes "/bin/mount -t glusterfs <what> <where>". /bin/mount spawns a fuse daemon /usr/sbin/glusterfs that handles the mount. So far so good.

Now the user restarts the atomic-openshift-node service. systemd kills the /usr/bin/openshift process and then also kills all its children as cleanup of the service cgroup. So the gluster fuse daemon gets killed too - it was spawned (indirectly) by the atomic-openshift-node service and systemd remembers that.

Adding this to /lib/systemd/system/atomic-openshift-node.service fixes the problem for me:

[Service]
KillMode=process

Note that it leaves all processes in openshift's cgroup running when the service restarts. Who knows what's running there; maybe some of those processes expected to be killed automatically when the service restarts. Someone smarter than me should approve this change.

I tested the current atomic-openshift-node-3.6.126.3-1.git.0.6168324.el7.x86_64; the behavior is the same.

====

Running atomic-openshift-node in a container suffers from a similar bug: we try to mount the gluster volume on the host via nsenter, but that does not escape the docker container cgroup. So the gluster fuse daemon runs in the host mount namespace and at the same time in the container's cgroup. And everything in the cgroup is killed when docker stops the container. So we need something stronger than nsenter here.
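
The cgroup relationship is easy to see on an affected node; a quick sketch:

$ systemctl status atomic-openshift-node     # the CGroup: tree lists the glusterfs fuse daemons
$ cat /proc/$(pgrep -f /usr/sbin/glusterfs | head -1)/cgroup
# ...shows the daemon inside the atomic-openshift-node.service cgroup, which
# systemd cleans up on restart unless KillMode=process is set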

Comment 38 Jan Safranek 2017-06-30 12:37:09 UTC
(In reply to Takeshi Larsson from comment #37)
> we were installing openshift using RPMS.

Cool, so we just need to change the service file as recommended in comment #35 (and test it properly!). Who is in charge of the service files and node process decomposition?

As a workaround, the customer can edit the service file and restart it:

$ cat <<EOF >/etc/systemd/system/atomic-openshift-node.service
.include /lib/systemd/system/atomic-openshift-node.service
[Service]
KillMode=process
EOF

$ systemctl daemon-reload
$ systemctl restart atomic-openshift-node

Please report any success or failure of this workaround.
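
One way to confirm the override took effect (a quick sketch):

$ systemctl show atomic-openshift-node -p KillMode
KillMode=process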

Comment 40 Matthew Robson 2017-06-30 13:15:32 UTC
The use case here is atomic-openshift-node running on a host (RPM install), but I would say both should work similarly.

I think the risk is that if the mounts are not properly terminated, there is an issue if that pod moves to a new node and cannot re-bind the gluster PV...

The goal of the gluster fix was to ensure that when the gluster process died (from a node restart), the unmount was cleanly handled, so that when the pod came back it would cleanly remount.

Comment 41 Scott Dodson 2017-06-30 13:38:12 UTC
Jan, we'll need to fix this for containerized too. What suggestions do you have for fixing this?

Once the exact changes necessary for the systemd units are defined I'm happy to have this move to OpenShift Installer component.

Comment 42 Jan Safranek 2017-06-30 14:18:34 UTC
(In reply to Scott Dodson from comment #41)
> Jan, we'll need to fix this for containerized too. What suggestions do you
> have for fixing this?

Step 1: create a separate bug so we don't mix fixes for OpenShift from RPM and in a container. The root causes of these bugs are different; they will have different fixes and probably also different testing.

I am going to clone this bug shortly.

Comment 43 Jan Safranek 2017-06-30 14:26:07 UTC
From now on, let's track progress for OpenShift Node service running from RPM here. I created bug #1466848 to track the same issue in containerized openshift.

Comment 49 Scott Dodson 2017-07-11 14:20:37 UTC
We need to add the following to the node service that openshift-ansible deploys for RPM-based installs, in the openshift_node and openshift_node_upgrade roles.

[Service]
KillMode=process
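
Equivalently, the roles could ship a systemd drop-in rather than templating the whole unit file; a sketch (the drop-in file name is an assumption):

$ mkdir -p /etc/systemd/system/atomic-openshift-node.service.d
$ cat <<EOF >/etc/systemd/system/atomic-openshift-node.service.d/kill-mode.conf
[Service]
KillMode=process
EOF
$ systemctl daemon-reload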

Comment 50 Jan Chaloupka 2017-07-13 14:54:37 UTC
Upstream PR: https://github.com/openshift/openshift-ansible/pull/4755

Comment 52 Jianwei Hou 2017-07-20 07:02:06 UTC
I did a quick test on openshift v3.6.153, this is still reproducible after adding KillMode=process.

# cat /lib/systemd/system/atomic-openshift-node.service
[Unit]
Description=Atomic OpenShift Node
After=docker.service
After=openvswitch.service
Wants=docker.service
Documentation=https://github.com/openshift/origin

[Service]
Type=notify
EnvironmentFile=/etc/sysconfig/atomic-openshift-node
Environment=GOTRACEBACK=crash
ExecStart=/usr/bin/openshift start node --config=${CONFIG_FILE} $OPTIONS
LimitNOFILE=65536
LimitCORE=infinity
WorkingDirectory=/var/lib/origin/
SyslogIdentifier=atomic-openshift-node
Restart=always
RestartSec=5s
OOMScoreAdjust=-999
KillMode=process

[Install]
WantedBy=multi-user.target

Comment 53 Jianwei Hou 2017-07-20 09:41:05 UTC
Sorry, I missed comment 47.

So I tested again; this time I just removed /etc/systemd/system/atomic-openshift-node.service and used the same /lib/systemd/system/atomic-openshift-node.service from the previous comment.

This time, after the restart, the problem is gone! This bug is good to verify as soon as we verify the installer has added KillMode=process.

Comment 54 Jianwei Hou 2017-07-20 09:52:13 UTC
Verified on v3.6.153

After installation, the "KillMode=process" is added to /etc/systemd/system/atomic-openshift-node.service. With this option added, the "Transport endpoint is not connected" problem is fixed!

```
# cat /etc/systemd/system/atomic-openshift-node.service
[Unit]
Description=OpenShift Node
After=docker.service
Wants=openvswitch.service
After=ovsdb-server.service
After=ovs-vswitchd.service
Wants=docker.service
Documentation=https://github.com/openshift/origin
Requires=dnsmasq.service
After=dnsmasq.service

[Service]
Type=notify
EnvironmentFile=/etc/sysconfig/atomic-openshift-node
Environment=GOTRACEBACK=crash
ExecStartPre=/usr/bin/cp /etc/origin/node/node-dnsmasq.conf /etc/dnsmasq.d/
ExecStartPre=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:/in-addr.arpa/127.0.0.1,/cluster.local/127.0.0.1
ExecStopPost=/usr/bin/rm /etc/dnsmasq.d/node-dnsmasq.conf
ExecStopPost=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:
ExecStart=/usr/bin/openshift start node --config=${CONFIG_FILE} $OPTIONS
LimitNOFILE=65536
LimitCORE=infinity
WorkingDirectory=/var/lib/origin/
SyslogIdentifier=atomic-openshift-node
Restart=always
RestartSec=5s
OOMScoreAdjust=-999
KillMode=process

[Install]
WantedBy=multi-user.target
```

Comment 56 Scott Dodson 2017-07-25 18:39:58 UTC
These changes have been reverted. Moving back to ASSIGNED and re-assigning to the storage team to come up with a new plan. I'll leave it up to the storage team to decide whether this is a 3.6 blocker.

Comment 57 Scott Dodson 2017-07-25 18:40:41 UTC
https://github.com/openshift/openshift-ansible/pull/4755 contains more discussion as to why this was reverted.

Comment 59 Bradley Childs 2017-07-26 13:56:25 UTC
The ansible fix was reverted upstream because it caused problems in other components. Jan is working on another fix that doesn't involve ansible.

Comment 60 Jan Safranek 2017-07-26 14:44:58 UTC
upstream PR: https://github.com/kubernetes/kubernetes/pull/49640

Comment 63 Jan Safranek 2017-08-10 10:16:45 UTC
Filed downstream PR for 3.6.x: https://github.com/openshift/ose/pull/829

Comment 67 Jianwei Hou 2017-08-30 08:27:57 UTC
Verified this is fixed on v3.6.173.0.21.
The fix resolved the issue even after the ansible PR was reverted.

Comment 71 errata-xmlrpc 2017-09-08 03:15:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2642

