Description of problem:
When the atomic-openshift-node.service is restarted, either manually or via ansible-playbook, all pods with an active PVC mounted lose their glusterfs mount; the pods must then be deleted to get the mount back. A colleague tested this in his home lab and was able to reproduce the same behaviour.

Version-Release number of selected component (if applicable): 3.4.1.2

How reproducible: Always

Steps to Reproduce:
1. Create a pod (jenkins-persistent, for example) using a glusterfs-backed persistent volume.
2. Restart the atomic-openshift-node service on the node the pod is allocated to.
3. oc rsh into the pod and run df. It returns the following error:
   df: '/var/lib/jenkins': Transport endpoint is not connected

Actual results:
df: '/var/lib/jenkins': Transport endpoint is not connected

Expected results:
No lost mount!

Additional info:
The gluster mount is a FUSE mount; restarting the openshift node service kills the fuse daemon inside the container, which causes the "Transport endpoint is not connected" error.
Is this the intended behaviour? Is there a workaround or is it supposed to be like this?
The solution has to wait for the containerized mount feature. A FUSE filesystem (like glusterfs) mounted from inside the node container cannot survive a restart. Containerized mount will run the mount in a separate container, so a node restart won't stop the mount container.
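A quick way to see what happens to the mount on the node (a rough sketch; output and PIDs are illustrative):

# The fuse daemon that backs the gluster PV mount
pgrep -a glusterfs

# Its cgroup shows it belongs to the node service's cgroup,
# so it is killed when that service is stopped or restarted
cat /proc/$(pgrep -o glusterfs)/cgroup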
https://bugzilla.redhat.com/show_bug.cgi?id=1424680
Is it the pod or the container that needs to be deleted and recreated? Or the running container on the node? What exactly needs to be restarted/recreated to recover?
(In reply to Eric Paris from comment #6)
> Is it the pod or the container that needs deleted and recreated? Or the
> running container on the node? What exactly needs to be restart/recreated to
> recover?

Hi, sorry for the late reply. Usually the pod will "restart", but this does not seem to fix the issue. I have to delete the pod and let the RC deploy a brand new pod.
Fix for this is in gluster. Assigning to Michael Adam, and adding UpcomingRelease.
OCP 3.4 backport PR: https://github.com/openshift/ose/pull/790
Tested on the below version:
openshift v3.4.1.42
kubernetes v1.4.0+776c994

This issue still exists after restarting the node service:
$ oc rsh gluster
/ $ ls /mnt/gluster
ls: /mnt/gluster: Transport endpoint is not connected
Re-assigning QA to Jianwei to help with a deeper investigation of the glusterfs version. Thanks.
After the glusterfs-fuse upgrade, auto_unmount became a valid mount option. Tested with:
glusterfs-fuse-3.8.4-18.4.el7.x86_64
glusterfs-libs-3.8.4-18.4.el7.x86_64
glusterfs-3.8.4-18.4.el7.x86_64
glusterfs-client-xlators-3.8.4-18.4.el7.x86_64

[root@ip-172-18-7-60 ~]# grep auto_unmount /var/log/messages
Jun 28 04:36:45 ip-172-18-7-60 atomic-openshift-node: I0628 04:36:45.887638 49916 mount_linux.go:103] Mounting 172.18.12.63:vol_76b54eb86baf987d6955e6c2451fb813 /var/lib/origin/openshift.local.volumes/pods/eb480603-5bdc-11e7-bc93-0e76bec68f48/volumes/kubernetes.io~glusterfs/pvc-016e2938-5bdc-11e7-bc93-0e76bec68f48 glusterfs [log-level=ERROR log-file=/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/glusterfs/pvc-016e2938-5bdc-11e7-bc93-0e76bec68f48/gluster-glusterfs.log auto_unmount]
Jun 28 04:42:18 ip-172-18-7-60 atomic-openshift-node: I0628 04:42:18.789212 49916 mount_linux.go:103] Mounting 172.18.12.63:vol_ca1abfa950a9e790fdef002c6246e66c /var/lib/origin/openshift.local.volumes/pods/b1ad9c7b-5bdd-11e7-bc93-0e76bec68f48/volumes/kubernetes.io~glusterfs/pvc-94c8f1db-5bdd-11e7-bc93-0e76bec68f48 glusterfs [log-level=ERROR log-file=/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/glusterfs/pvc-94c8f1db-5bdd-11e7-bc93-0e76bec68f48/gluster1-glusterfs.log auto_unmount]

But even so, once atomic-openshift-node is restarted, the 'Transport endpoint is not connected' message is shown:
# oc exec -it gluster1 -- ls /mnt/gluster
ls: /mnt/gluster: Transport endpoint is not connected
The glusterfs process:

root 54424 1 0 04:47 ? 00:00:00 /usr/sbin/glusterfs --log-level=ERROR --log-file=/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/glusterfs/pvc-016e2938-5bdc-11e7-bc93-0e76bec68f48/gluster-glusterfs.log --fuse-mountopts=auto_unmount --volfile-server=172.18.12.63 --volfile-id=vol_76b54eb86baf987d6955e6c2451fb813 --fuse-mountopts=auto_unmount /var/lib/origin/openshift.local.volumes/pods/eb480603-5bdc-11e7-bc93-0e76bec68f48/volumes/kubernetes.io~glusterfs/pvc-016e2938-5bdc-11e7-bc93-0e76bec68f48
(In reply to Jianwei Hou from comment #29)
> The glusterfs process:
>
> root 54424 1 0 04:47 ? 00:00:00 /usr/sbin/glusterfs
> --log-level=ERROR
> --log-file=/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/
> glusterfs/pvc-016e2938-5bdc-11e7-bc93-0e76bec68f48/gluster-glusterfs.log
> --fuse-mountopts=auto_unmount --volfile-server=172.18.12.63
> --volfile-id=vol_76b54eb86baf987d6955e6c2451fb813
> --fuse-mountopts=auto_unmount
> /var/lib/origin/openshift.local.volumes/pods/eb480603-5bdc-11e7-bc93-
> 0e76bec68f48/volumes/kubernetes.io~glusterfs/pvc-016e2938-5bdc-11e7-bc93-
> 0e76bec68f48

The OCP/Kube side patch was designed to pass the 'auto_unmount' option to this mount. As you pointed out, and as this output shows, that part is working as expected. The 'auto_unmount' option is supposed to take care of the reported issue/scenario. I am assigning this bug to the Gluster FUSE engineers for further checking.
Some update about the 'auto_unmount' option:

With 'auto_unmount', the glusterfs process starts another forked process that monitors the fuse-daemon process and takes care of clearing all the file descriptors on the mount point, so the mount point is properly unmounted. This helps when a gluster process crashes because of an internal bug; there won't be any stale mounts. It can be tested by killing one of the two glusterfs processes visible in `ps aux` output.

I see that it is confusing to tell which is the monitoring process and which is the mount process just from the 'ps' output. Csaba, any suggestions/clarification on how to distinguish the glusterfs client process from the monitoring process here?

----

Again, if both glusterfs processes get killed, the feature won't be effective. If the customer wants the mount point to remain present even after all glusterfs processes terminate, we need an external script/tool to support that behavior.
E.g. "ps --forest" can be used to display processes in their ancestry tree. So if the pids of the glusterfs client processes for the given mount have already been identified,

$ ps --forest <pid1> <pid2>

can be used. Otherwise something like

$ pgrep glusterfs | xargs ps --forest

The parent is the actual client process, the child is the unmount agent. While it's a bit hacky, this is also suitable for identifying the actual client / unmount agent in a script, by grepping the above output for the absence / presence of the child ASCII marker "\_".
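A rough test sketch combining the above (assumes a single gluster fuse mount on the node; <client-pid> stands for the parent PID found below):

# Show client and unmount agent in their ancestry tree;
# the parent is the fuse client, the child is the unmount agent
pgrep glusterfs | xargs ps --forest -o pid,ppid,cmd

# Kill only the client to simulate a fuse daemon crash
kill <client-pid>

# With auto_unmount, the agent should clean up the mount point
# instead of leaving a stale "Transport endpoint is not connected" mount
mount | grep glusterfs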
From the bug report it's not clear if the customer runs the atomic-openshift-node service in a container or on the host. I tested this on OSE 3.4.1.44 from RPM, i.e. atomic-openshift-node running on the host and not in a container.

When OpenShift mounts a gluster volume, it executes "/bin/mount -t glusterfs <what> <where>". /bin/mount spawns a fuse daemon, /usr/sbin/glusterfs, that handles the mount. So far so good.

Now the user restarts the atomic-openshift-node service. systemd kills the /usr/bin/openshift process and then also kills all of its children as cleanup of the service cgroup. So the gluster fuse daemon gets killed too - it was spawned (indirectly) by the atomic-openshift-node service and systemd remembers it.

Adding this to /lib/systemd/system/atomic-openshift-node.service fixes the problem for me:

[Service]
KillMode=process

Note that it leaves all processes in openshift's cgroup running when the service is restarted. Who knows what's running there; maybe some processes expected to be killed automatically when the service restarts. Someone smarter than me should approve this change.

Tested with the current atomic-openshift-node-3.6.126.3-1.git.0.6168324.el7.x86_64, the behavior is the same.

====

Running atomic-openshift-node in a container suffers from a similar bug: we try to mount the gluster volume on the host via nsenter, but that does not escape the docker container cgroup. So the gluster fuse daemon runs in the host mount namespace and at the same time it runs in the container's cgroup. And everything in the cgroup is killed when docker stops the container. So we need something stronger than nsenter here.
(In reply to Takeshi Larsson from comment #37)
> we were installing openshift using RPMS.

Cool, so we just need to change the service file as recommended in comment #35 (and test it properly!). Who is in charge of service files and node process decomposition?

As a workaround, the customer can edit the service file + restart it:

$ cat <<EOF >/etc/systemd/system/atomic-openshift-node.service
.include /lib/systemd/system/atomic-openshift-node.service
[Service]
KillMode=process
EOF
$ systemctl daemon-reload
$ systemctl restart atomic-openshift-node

Please report any success or failure of this workaround.
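For what it's worth, the same override could also be applied as a systemd drop-in instead of replacing the whole unit file (a sketch; the drop-in file name is arbitrary):

$ mkdir -p /etc/systemd/system/atomic-openshift-node.service.d
$ cat <<EOF >/etc/systemd/system/atomic-openshift-node.service.d/killmode.conf
[Service]
KillMode=process
EOF
$ systemctl daemon-reload
$ systemctl restart atomic-openshift-node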
The use case here is atomic-openshift-node running on a host (RPM install), but I would say both should work similarly. I think the risk is that if the mounts are not properly terminated, there is an issue if that pod moves to a new node and cannot re-bind the gluster PV... The goal of the gluster fix was to ensure that when the gluster process died (from a node restart), the unmount was cleanly handled, so when the pod came back it would cleanly remount.
Jan, we'll need to fix this for containerized too. What suggestions do you have for fixing this? Once the exact changes necessary for the systemd units are defined I'm happy to have this move to OpenShift Installer component.
(In reply to Scott Dodson from comment #41)
> Jan, we'll need to fix this for containerized too. What suggestions do you
> have for fixing this?

Step 1: create a separate bug so we don't mix fixes for OpenShift from RPM and in a container. The root causes of these bugs are different; they will have different fixes and probably also different testing. I am going to clone this bug shortly.
From now on, let's track progress for OpenShift Node service running from RPM here. I created bug #1466848 to track the same issue in containerized openshift.
We need to add the following to the node service that openshift-ansible deploys for RPM-based installs, in the openshift_node and openshift_node_upgrade roles:

[Service]
KillMode=process
Upstream PR: https://github.com/openshift/openshift-ansible/pull/4755
I did a quick test on openshift v3.6.153; this is still reproducible after adding KillMode=process.

# cat /lib/systemd/system/atomic-openshift-node.service
[Unit]
Description=Atomic OpenShift Node
After=docker.service
After=openvswitch.service
Wants=docker.service
Documentation=https://github.com/openshift/origin

[Service]
Type=notify
EnvironmentFile=/etc/sysconfig/atomic-openshift-node
Environment=GOTRACEBACK=crash
ExecStart=/usr/bin/openshift start node --config=${CONFIG_FILE} $OPTIONS
LimitNOFILE=65536
LimitCORE=infinity
WorkingDirectory=/var/lib/origin/
SyslogIdentifier=atomic-openshift-node
Restart=always
RestartSec=5s
OOMScoreAdjust=-999
KillMode=process

[Install]
WantedBy=multi-user.target
Sorry, I missed comment 47. So I tested again; this time I just removed /etc/systemd/system/atomic-openshift-node.service and used the same /lib/systemd/system/atomic-openshift-node.service as in the previous comment. This time, after the restart, the problem is gone! This bug is good to verify as soon as we confirm the installer has added KillMode=process.
Verified on v3.6.153

After installation, "KillMode=process" is added to /etc/systemd/system/atomic-openshift-node.service. With this option added, the "Transport endpoint is not connected" problem is fixed!

```
# cat /etc/systemd/system/atomic-openshift-node.service
[Unit]
Description=OpenShift Node
After=docker.service
Wants=openvswitch.service
After=ovsdb-server.service
After=ovs-vswitchd.service
Wants=docker.service
Documentation=https://github.com/openshift/origin
Requires=dnsmasq.service
After=dnsmasq.service

[Service]
Type=notify
EnvironmentFile=/etc/sysconfig/atomic-openshift-node
Environment=GOTRACEBACK=crash
ExecStartPre=/usr/bin/cp /etc/origin/node/node-dnsmasq.conf /etc/dnsmasq.d/
ExecStartPre=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:/in-addr.arpa/127.0.0.1,/cluster.local/127.0.0.1
ExecStopPost=/usr/bin/rm /etc/dnsmasq.d/node-dnsmasq.conf
ExecStopPost=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:
ExecStart=/usr/bin/openshift start node --config=${CONFIG_FILE} $OPTIONS
LimitNOFILE=65536
LimitCORE=infinity
WorkingDirectory=/var/lib/origin/
SyslogIdentifier=atomic-openshift-node
Restart=always
RestartSec=5s
OOMScoreAdjust=-999
KillMode=process

[Install]
WantedBy=multi-user.target
```
These changes have been reverted. Moving back to assigned and re-assigning to storage team to come up with newer plan. I'll leave it up to storage team to decide if this is a 3.6 blocker or not.
https://github.com/openshift/openshift-ansible/pull/4755 contains more discussion as to why this was reverted.
The ansible fix was reverted upstream because it caused problems in other components. Jan is working on another fix that doesn't involve ansible.
upstream PR: https://github.com/kubernetes/kubernetes/pull/49640
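If I read the upstream PR correctly, the idea is to launch the mount helper in its own transient systemd scope so the resulting fuse daemon no longer lives in the node service's cgroup; roughly the manual equivalent would be (server, volume and target path are placeholders):

# The fuse daemon spawned by this mount ends up in a transient scope,
# not in atomic-openshift-node's cgroup, so it survives a service restart
systemd-run --scope -- mount -t glusterfs <server>:<volume> <target-dir>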
Filed downstream PR for 3.6.x: https://github.com/openshift/ose/pull/829
Verified this is fixed on v3.6.173.0.21. The fix resolved the issue after the ansible pull was reverted.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2642