Bug 1557200

Summary: Need to add a check for swap configuration for upgrade
Product: OpenShift Container Platform
Reporter: Weihua Meng <wmeng>
Component: Cluster Version Operator
Assignee: Russell Teague <rteague>
Status: CLOSED ERRATA
QA Contact: Weihua Meng <wmeng>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.9.0
CC: aos-bugs, jiajliu, jialiu, jokerman, mmccomas, rteague, sdodson
Target Milestone: ---
Keywords: Triaged
Target Release: 3.9.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The ability to skip disabling swap by use of openshift_disable_swap=False has been removed from 3.9. This feature was undocumented and should not be used.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-12-13 19:26:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Weihua Meng 2018-03-16 07:42:06 UTC
Description of problem:
The upgrade fails when swap is on.
With swap on, atomic-openshift-node.service does not start.

The failure happens when upgrading from 3.7 to 3.9.
Upgrading from 3.6 to 3.7 works with swap on.

Swap is not supported on OCP 3.9, but it is supported on OCP 3.7,
so it would be better to add a check before upgrading to OCP 3.9.
That check should prevent the upgrade failure.
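
A check of this kind could look roughly like the sketch below. It is a minimal Python illustration, not openshift-ansible code: the function names active_swap_entries and check_swap_before_upgrade are hypothetical, and it assumes the usual /proc/swaps layout of one header line followed by one line per active swap area.

```python
def active_swap_entries(proc_swaps_text):
    """Return the filenames of active swap areas listed in /proc/swaps text."""
    lines = [ln for ln in proc_swaps_text.splitlines() if ln.strip()]
    # The first non-empty line is the column header (Filename, Type, ...).
    return [ln.split()[0] for ln in lines[1:]]


def check_swap_before_upgrade(proc_swaps_text):
    """Hypothetical pre-check: raise RuntimeError if any swap area is active."""
    entries = active_swap_entries(proc_swaps_text)
    if entries:
        raise RuntimeError(
            "swap is enabled on: %s; disable swap before upgrading"
            % ", ".join(entries)
        )
```

On a host, the input would be the contents of /proc/swaps. Failing before the upgrade starts avoids leaving a drained master in NotReady state.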
 
Version-Release number of the following components:
openshift-ansible-3.9.9-1.git.0.1a1f7d8.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Upgrade a cluster that has swap on, with the following set in the inventory:
openshift_disable_swap=false

Actual results:
Failure summary:


  1. Hosts:    wmengupgradeetcd36-master-1.0316-a6r.qe.rhcloud.com
     Play:     Drain and upgrade master nodes
     Task:     Wait for node to be ready
     Message:  Failed without returning a message.

Expected results:
Upgrade succeeds

Additional info:
# free -h
              total        used        free      shared  buff/cache   available
Mem:            25G        909M         17G        104M        7.2G         23G
Swap:          2.0G          0B        2.0G
# swapon -s
Filename				Type		Size	Used	Priority
/var/swapfile                          	file	2097148	0	-1


# oc get nodes
NAME                          STATUS                        ROLES     AGE       VERSION
wmengupgradeetcd36-master-1   NotReady,SchedulingDisabled   <none>    4h        v1.7.6+a08f5eeb62
wmengupgradeetcd36-nrr-1      Ready                         <none>    4h        v1.7.6+a08f5eeb62
wmengupgradeetcd36-nrr-2      Ready                         <none>    4h        v1.7.6+a08f5eeb62


# systemctl status atomic-openshift-node.service 
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
           └─openshift-sdn-ovs.conf
   Active: activating (auto-restart) (Result: exit-code) since Fri 2018-03-16 01:55:37 EDT; 2s ago
     Docs: https://github.com/openshift/origin
  Process: 41812 ExecStopPost=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string: (code=exited, status=0/SUCCESS)
  Process: 41809 ExecStopPost=/usr/bin/rm /etc/dnsmasq.d/node-dnsmasq.conf (code=exited, status=0/SUCCESS)
  Process: 41761 ExecStart=/usr/bin/openshift start node --config=${CONFIG_FILE} $OPTIONS (code=exited, status=255)
  Process: 41758 ExecStartPre=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:/in-addr.arpa/127.0.0.1,/cluster.local/127.0.0.1 (code=exited, status=0/SUCCESS)
  Process: 41756 ExecStartPre=/usr/bin/cp /etc/origin/node/node-dnsmasq.conf /etc/dnsmasq.d/ (code=exited, status=0/SUCCESS)
 Main PID: 41761 (code=exited, status=255)

Mar 16 01:55:37 wmengupgradeetcd36-master-1 systemd[1]: Failed to start OpenShift Node.
Mar 16 01:55:37 wmengupgradeetcd36-master-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.
Mar 16 01:55:37 wmengupgradeetcd36-master-1 systemd[1]: atomic-openshift-node.service failed.


# journalctl -u atomic-openshift-node --no-pager

Mar 16 01:44:53 wmengupgradeetcd36-master-1 atomic-openshift-node[35430]: F0316 01:44:53.592524   35430 node.go:264] failed to run Kubelet: Running with swap on is not supported, please disable swap! or set --fail-swap-on flag to false. /proc/swaps contained: [Filename                                Type                Size        Used        Priority /var/swapfile                           file                2097148        0        -1]
Mar 16 01:44:53 wmengupgradeetcd36-master-1 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Mar 16 01:44:53 wmengupgradeetcd36-master-1 systemd[1]: Failed to start OpenShift Node.
Mar 16 01:44:53 wmengupgradeetcd36-master-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.
Mar 16 01:44:53 wmengupgradeetcd36-master-1 systemd[1]: atomic-openshift-node.service failed.
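
To confirm that a failed node start is caused by swap rather than something else, the journal can be searched for the kubelet message shown above. The following is a small Python sketch under stated assumptions: find_swap_failures is a hypothetical helper, and the sample lines are abbreviated copies of the journal lines in this report.

```python
# Substring of the kubelet error shown in the journal output above.
SWAP_ERROR = "Running with swap on is not supported"


def find_swap_failures(journal_lines):
    """Return the journal lines where the kubelet refused to start because swap is on."""
    return [line for line in journal_lines if SWAP_ERROR in line]


# Abbreviated versions of two journal lines from this report.
sample = [
    "atomic-openshift-node[35430]: failed to run Kubelet: "
    "Running with swap on is not supported, please disable swap!",
    "systemd[1]: Failed to start OpenShift Node.",
]
print(len(find_swap_failures(sample)))  # prints 1
```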

Comment 1 Johnny Liu 2018-03-16 11:56:55 UTC
After some investigation, this seems to be related to the kube "fail-swap-on" default setting.

In 3.9.9, the node service starts successfully when swap is on.
In 3.8.34, however, the node service fails to start when swap is on, just as in the initial report.

It seems that fail-swap-on defaults to false in 3.9.9 but to true in 3.8.34, so the upgrade hits this bug because it goes 3.7 -> 3.8 -> 3.9.

Several places in the 3.9 documentation ask the user to disable swap, but the upgrade section does not mention it, so maybe we could fix this bug in 3.9.z by adding a pre-check that asks the user to disable swap before upgrading.

Based on this, I am setting the target release to 3.9.z.


@wmeng, please make sure the upgrade doc lists "disable swap" as a required step.

Comment 2 Johnny Liu 2018-03-16 11:57:49 UTC
@scott, is setting the target release to 3.9.z okay with you?

Comment 3 Johnny Liu 2018-03-16 12:27:56 UTC
The doc issue is tracked here:
https://bugzilla.redhat.com/show_bug.cgi?id=1557218

Comment 4 Scott Dodson 2018-03-21 19:19:06 UTC
(In reply to Johnny Liu from comment #2)
> @scott, is setting the target release to 3.9.z okay with you?

Yes, as long as this doesn't disrupt the upgrade path from 3.7 to 3.9. The upgrade should be disabling swap while the node is drained; we need to figure out why this is not happening.

Comment 5 Johnny Liu 2018-03-22 03:07:29 UTC
(In reply to Scott Dodson from comment #4)
> The upgrade should be disabling swap while the node is drained; we need to
> figure out why this is not happening.

I talked with the initial reporter. He installed the 3.7 environment with openshift_disable_swap=false in the inventory file, then triggered the upgrade to 3.9 with the same openshift_disable_swap=false setting. That is why openshift-ansible did not disable swap, and the issue was hit.
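
An inventory can be scanned for this override before an upgrade is triggered. The sketch below is a rough Python illustration, not part of openshift-ansible: swap_disable_overridden is a hypothetical helper, and it assumes an INI-style inventory where the variable appears as key=value and "#" starts a comment.

```python
import re

# Matches openshift_disable_swap=<value>, with optional spaces and quotes.
_PATTERN = re.compile(r"\bopenshift_disable_swap\s*=\s*['\"]?(\w+)")


def swap_disable_overridden(inventory_text):
    """Return True if the inventory sets openshift_disable_swap to a false value."""
    for raw in inventory_text.splitlines():
        line = raw.split("#", 1)[0]  # ignore anything after a comment marker
        match = _PATTERN.search(line)
        if match and match.group(1).lower() in ("false", "no", "0"):
            return True
    return False


print(swap_disable_overridden("[OSEv3:vars]\nopenshift_disable_swap=false\n"))  # prints True
```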

Comment 6 Scott Dodson 2018-03-22 12:31:01 UTC
I think we should remove this ability in 3.9.z.

Comment 7 Russell Teague 2018-11-20 15:24:38 UTC
The ability to override disabling swap has been removed in 3.9.  Swap will be disabled during upgrade while the node is drained.

https://github.com/openshift/openshift-ansible/pull/10607

Fixed in openshift-ansible-3.9.51-1

Comment 8 Weihua Meng 2018-11-26 03:31:17 UTC
Fixed.

openshift-ansible-3.9.54-1.git.0.8a67eb1.el7.noarch

Before upgrade:
# free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        1.1G        8.9G        1.4M        5.5G         14G
Swap:          2.0G          0B        2.0G

# cat /etc/fstab

#
# /etc/fstab
# Created by anaconda on Sun Nov 25 17:28:52 2018
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/rhel-root	/	xfs	defaults	0 0
UUID=cd1d5cbd-93f3-4222-9596-b4f7f22e52d1	/boot	xfs	defaults	0 0
/var/swapfile  swap swap  defaults  0 0 

Upgrade succeeded.
After upgrade:
# free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        1.6G        3.8G        2.7M         10G         13G
Swap:            0B          0B          0B

# cat /etc/fstab

#
# /etc/fstab
# Created by anaconda on Sun Nov 25 17:28:52 2018
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/rhel-root	/	xfs	defaults	0 0
UUID=cd1d5cbd-93f3-4222-9596-b4f7f22e52d1	/boot	xfs	defaults	0 0
#/var/swapfile  swap swap  defaults  0 0 


Kernel Version: 3.10.0-957.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.6 (Maipo)
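
The two /etc/fstab listings above differ only in the swap entry, which is commented out after the upgrade. That transformation can be sketched in Python as follows. This is an illustration only, not the task from the openshift-ansible fix: comment_out_swap is a hypothetical name, and it assumes whitespace-separated fstab fields with the filesystem type in the third column.

```python
def comment_out_swap(fstab_text):
    """Return fstab text with every uncommented swap entry commented out."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # fstab fields: device, mount point, fs type, options, dump, pass.
        is_swap = (
            len(fields) >= 3
            and not line.lstrip().startswith("#")
            and fields[2] == "swap"
        )
        out.append("#" + line if is_swap else line)
    trailing = "\n" if fstab_text.endswith("\n") else ""
    return "\n".join(out) + trailing
```

This covers only the persistent configuration. Swap that is already active has to be turned off separately, for example with swapoff -a.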

Comment 11 errata-xmlrpc 2018-12-13 19:26:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3748