Description of problem:
The upgrade fails when swap is on, because atomic-openshift-node.service does not start.
The failure happens when upgrading from 3.7 to 3.9. Upgrading from 3.6 to 3.7 works with swap on.
Running with swap on is not supported on OCP 3.9, but it is supported on OCP 3.7, so a check should be added before upgrading to OCP 3.9 to avoid the upgrade failure.

Version-Release number of the following components:
openshift-ansible-3.9.9-1.git.0.1a1f7d8.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Upgrade a cluster that has swap on, with openshift_disable_swap=false

Actual results:
Failure summary:

  1. Hosts:    wmengupgradeetcd36-master-1.0316-a6r.qe.rhcloud.com
     Play:     Drain and upgrade master nodes
     Task:     Wait for node to be ready
     Message:  Failed without returning a message.

Expected results:
Upgrade succeeds

Additional info:
# free -h
              total        used        free      shared  buff/cache   available
Mem:            25G        909M         17G        104M        7.2G         23G
Swap:          2.0G          0B        2.0G

# swapon -s
Filename                                Type            Size    Used    Priority
/var/swapfile                           file            2097148 0       -1

# oc get nodes
NAME                          STATUS                        ROLES     AGE       VERSION
wmengupgradeetcd36-master-1   NotReady,SchedulingDisabled   <none>    4h        v1.7.6+a08f5eeb62
wmengupgradeetcd36-nrr-1      Ready                         <none>    4h        v1.7.6+a08f5eeb62
wmengupgradeetcd36-nrr-2      Ready                         <none>    4h        v1.7.6+a08f5eeb62

# systemctl status atomic-openshift-node.service
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
           └─openshift-sdn-ovs.conf
   Active: activating (auto-restart) (Result: exit-code) since Fri 2018-03-16 01:55:37 EDT; 2s ago
     Docs: https://github.com/openshift/origin
  Process: 41812 ExecStopPost=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string: (code=exited, status=0/SUCCESS)
  Process: 41809 ExecStopPost=/usr/bin/rm /etc/dnsmasq.d/node-dnsmasq.conf (code=exited, status=0/SUCCESS)
  Process: 41761 ExecStart=/usr/bin/openshift start node --config=${CONFIG_FILE} $OPTIONS (code=exited, status=255)
  Process: 41758 ExecStartPre=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:/in-addr.arpa/127.0.0.1,/cluster.local/127.0.0.1 (code=exited, status=0/SUCCESS)
  Process: 41756 ExecStartPre=/usr/bin/cp /etc/origin/node/node-dnsmasq.conf /etc/dnsmasq.d/ (code=exited, status=0/SUCCESS)
 Main PID: 41761 (code=exited, status=255)

Mar 16 01:55:37 wmengupgradeetcd36-master-1 systemd[1]: Failed to start OpenShift Node.
Mar 16 01:55:37 wmengupgradeetcd36-master-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.
Mar 16 01:55:37 wmengupgradeetcd36-master-1 systemd[1]: atomic-openshift-node.service failed.

# journalctl -u atomic-openshift-node --no-pager
Mar 16 01:44:53 wmengupgradeetcd36-master-1 atomic-openshift-node[35430]: F0316 01:44:53.592524 35430 node.go:264] failed to run Kubelet: Running with swap on is not supported, please disable swap! or set --fail-swap-on flag to false. /proc/swaps contained: [Filename Type Size Used Priority /var/swapfile file 2097148 0 -1]
Mar 16 01:44:53 wmengupgradeetcd36-master-1 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Mar 16 01:44:53 wmengupgradeetcd36-master-1 systemd[1]: Failed to start OpenShift Node.
Mar 16 01:44:53 wmengupgradeetcd36-master-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.
Mar 16 01:44:53 wmengupgradeetcd36-master-1 systemd[1]: atomic-openshift-node.service failed.
After some investigation, this appears to be related to the default value of the kubelet "fail-swap-on" setting. In 3.9.9, the node service starts successfully when swap is on. In 3.8.34, the node service fails to start when swap is on, just as in the initial report. It seems that fail-swap-on defaults to false in 3.9.9 but to true in 3.8.34, and the bug is hit because the upgrade here goes 3.7 -> 3.8 -> 3.9.

The 3.9 documentation asks users to disable swap in several places, but does not mention it in the upgrade section. We could fix this bug in 3.9.z by adding a pre-check that asks the user to disable swap before upgrading. Based on this, I would set the target release to 3.9.z.

@wmeng, please make sure "disable swap" is listed as a requirement in the upgrade doc.
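For illustration only, a pre-check of the kind proposed here could look roughly like the sketch below. It is not taken from openshift-ansible. It assumes the check only needs to inspect /proc/swaps-style text (the same content the kubelet prints in its error message), and the function names active_swap_entries and check_swap_disabled are hypothetical.

```python
def active_swap_entries(proc_swaps_text):
    """Return the swap file/device names listed in /proc/swaps-style text."""
    lines = [ln for ln in proc_swaps_text.splitlines() if ln.strip()]
    # The first non-empty line is the column header (Filename Type Size ...);
    # every following line describes one active swap area.
    return [ln.split()[0] for ln in lines[1:]]


def check_swap_disabled(proc_swaps_text):
    """Raise RuntimeError if any swap area is active (hypothetical pre-check)."""
    entries = active_swap_entries(proc_swaps_text)
    if entries:
        raise RuntimeError(
            "swap is enabled on: " + ", ".join(entries)
            + "; disable swap before upgrading"
        )
```

On a node, the text would come from reading /proc/swaps. With the swap file from the initial report, check_swap_disabled raises and names /var/swapfile, so the upgrade could stop before the node is drained rather than fail at "Wait for node to be ready".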
@scott, is setting the target release to 3.9.z okay with you?
The doc issue is tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1557218
(In reply to Johnny Liu from comment #2)
> @scott, set target release to 3.9.z is okay for you?

Yes, as long as this doesn't disrupt the upgrade path from 3.7 to 3.9. The upgrade should be disabling swap while the node is drained. We need to figure out why this is not happening.
(In reply to Scott Dodson from comment #4)
> The upgrade should be disabling swap while the node is drained, we need to
> figure out why this is not happening.

I talked with the initial reporter. He installed the 3.7 environment with openshift_disable_swap=false in the inventory file, then triggered the upgrade to 3.9 with the same openshift_disable_swap=false setting. That is why openshift-ansible did not disable swap, and why the issue was hit.
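As a rough sketch of how such an inventory could be flagged before an upgrade, the snippet below scans inventory text for the openshift_disable_swap=false setting described above. It is an illustration, not part of openshift-ansible: the function name swap_override_lines is hypothetical, it assumes an INI-style inventory, and it only recognizes the literal value false (case-insensitive).

```python
import re

# Matches "openshift_disable_swap=<value>", allowing spaces around "=".
_SWAP_VAR = re.compile(r"\bopenshift_disable_swap\s*=\s*(\w+)")


def swap_override_lines(inventory_text):
    """Return 1-based line numbers where openshift_disable_swap is set to false."""
    hits = []
    for number, line in enumerate(inventory_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip INI comment lines.
        if stripped.startswith(("#", ";")):
            continue
        match = _SWAP_VAR.search(stripped)
        if match and match.group(1).lower() == "false":
            hits.append(number)
    return hits
```

A non-empty result means the inventory would keep swap enabled on the nodes, which is the configuration that led to the failure in this report.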
I think we should remove the ability to set openshift_disable_swap=false in 3.9.z.
The ability to override disabling swap has been removed in 3.9. Swap will be disabled during upgrade while the node is drained.

https://github.com/openshift/openshift-ansible/pull/10607

Fixed in openshift-ansible-3.9.51-1
Fixed.
openshift-ansible-3.9.54-1.git.0.8a67eb1.el7.noarch

Before upgrade:
# free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        1.1G        8.9G        1.4M        5.5G         14G
Swap:          2.0G          0B        2.0G

# cat /etc/fstab
#
# /etc/fstab
# Created by anaconda on Sun Nov 25 17:28:52 2018
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/rhel-root                     /      xfs   defaults 0 0
UUID=cd1d5cbd-93f3-4222-9596-b4f7f22e52d1 /boot  xfs   defaults 0 0
/var/swapfile                             swap   swap  defaults 0 0

Upgrade success.

After upgrade:
# free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        1.6G        3.8G        2.7M         10G         13G
Swap:            0B          0B          0B

# cat /etc/fstab
#
# /etc/fstab
# Created by anaconda on Sun Nov 25 17:28:52 2018
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/rhel-root                     /      xfs   defaults 0 0
UUID=cd1d5cbd-93f3-4222-9596-b4f7f22e52d1 /boot  xfs   defaults 0 0
#/var/swapfile                            swap   swap  defaults 0 0

Kernel Version: 3.10.0-957.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.6 (Maipo)
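The fstab change seen in this verification (the swap entry commented out after the upgrade) can be reproduced with a short sketch like the one below. This is only an illustration of the observed before/after state, not the openshift-ansible implementation. The function name comment_out_swap is hypothetical, and it assumes an entry is a swap entry when its third field is "swap".

```python
def comment_out_swap(fstab_text):
    """Return fstab text with every active swap entry commented out."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # Skip blank lines and lines that are already comments.
        is_entry = bool(fields) and not fields[0].startswith("#")
        # In fstab, the third field of an entry is the filesystem type.
        if is_entry and len(fields) >= 3 and fields[2] == "swap":
            line = "#" + line
        out.append(line)
    return "\n".join(out) + "\n"
```

Editing fstab only keeps swap from being re-enabled at the next boot. On a running node, swap also has to be turned off (for example with swapoff -a), which matches the "Swap: 0B" line in the free -h output after the upgrade.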
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3748