Bug 1557200 - Need add a check for swap configuration for upgrade
Summary: Need add a check for swap configuration for upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.9.z
Assignee: Russell Teague
QA Contact: Weihua Meng
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-03-16 07:42 UTC by Weihua Meng
Modified: 2018-12-13 19:27 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The ability to skip disabling swap by use of openshift_disable_swap=False has been removed from 3.9. This feature was undocumented and should not be used.
Clone Of:
Environment:
Last Closed: 2018-12-13 19:26:59 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:3748 0 None None None 2018-12-13 19:27:10 UTC

Description Weihua Meng 2018-03-16 07:42:06 UTC
Description of problem:
The upgrade fails when swap is on.
The failure leaves atomic-openshift-node.service not running.

The failure happens when upgrading from 3.7 to 3.9.
Upgrading from 3.6 to 3.7 works with swap on.

Running with swap on is not supported on OCP 3.9, but it is supported on OCP 3.7.
A check should therefore be added before the upgrade to OCP 3.9,
so that the upgrade does not fail.
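
A minimal sketch of the kind of pre-upgrade check requested here, assuming the check reads text in the format of /proc/swaps. The function names `active_swap_devices` and `check_swap_before_upgrade` are hypothetical and are not part of openshift-ansible.

```python
from typing import List


def active_swap_devices(proc_swaps_text: str) -> List[str]:
    """Return the swap file/device names listed in /proc/swaps-style text."""
    lines = [ln for ln in proc_swaps_text.splitlines() if ln.strip()]
    # The first non-empty line is the header (Filename Type Size Used Priority).
    return [ln.split()[0] for ln in lines[1:]]


def check_swap_before_upgrade(proc_swaps_text: str) -> None:
    """Raise RuntimeError if any swap area is active; otherwise return None."""
    devices = active_swap_devices(proc_swaps_text)
    if devices:
        raise RuntimeError(
            "swap is enabled on: " + ", ".join(devices)
            + "; disable it before upgrading"
        )
```

On a host, the input would typically be the contents of /proc/swaps, for example `check_swap_before_upgrade(open("/proc/swaps").read())`.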
 
Version-Release number of the following components:
openshift-ansible-3.9.9-1.git.0.1a1f7d8.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Upgrade a cluster that has swap enabled, with the inventory setting
openshift_disable_swap=false

Actual results:
Failure summary:


  1. Hosts:    wmengupgradeetcd36-master-1.0316-a6r.qe.rhcloud.com
     Play:     Drain and upgrade master nodes
     Task:     Wait for node to be ready
     Message:  Failed without returning a message.

Expected results:
Upgrade succeeds

Additional info:
# free -h
              total        used        free      shared  buff/cache   available
Mem:            25G        909M         17G        104M        7.2G         23G
Swap:          2.0G          0B        2.0G
# swapon -s
Filename				Type		Size	Used	Priority
/var/swapfile                          	file	2097148	0	-1


# oc get nodes
NAME                          STATUS                        ROLES     AGE       VERSION
wmengupgradeetcd36-master-1   NotReady,SchedulingDisabled   <none>    4h        v1.7.6+a08f5eeb62
wmengupgradeetcd36-nrr-1      Ready                         <none>    4h        v1.7.6+a08f5eeb62
wmengupgradeetcd36-nrr-2      Ready                         <none>    4h        v1.7.6+a08f5eeb62


# systemctl status atomic-openshift-node.service 
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
           └─openshift-sdn-ovs.conf
   Active: activating (auto-restart) (Result: exit-code) since Fri 2018-03-16 01:55:37 EDT; 2s ago
     Docs: https://github.com/openshift/origin
  Process: 41812 ExecStopPost=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string: (code=exited, status=0/SUCCESS)
  Process: 41809 ExecStopPost=/usr/bin/rm /etc/dnsmasq.d/node-dnsmasq.conf (code=exited, status=0/SUCCESS)
  Process: 41761 ExecStart=/usr/bin/openshift start node --config=${CONFIG_FILE} $OPTIONS (code=exited, status=255)
  Process: 41758 ExecStartPre=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:/in-addr.arpa/127.0.0.1,/cluster.local/127.0.0.1 (code=exited, status=0/SUCCESS)
  Process: 41756 ExecStartPre=/usr/bin/cp /etc/origin/node/node-dnsmasq.conf /etc/dnsmasq.d/ (code=exited, status=0/SUCCESS)
 Main PID: 41761 (code=exited, status=255)

Mar 16 01:55:37 wmengupgradeetcd36-master-1 systemd[1]: Failed to start OpenShift Node.
Mar 16 01:55:37 wmengupgradeetcd36-master-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.
Mar 16 01:55:37 wmengupgradeetcd36-master-1 systemd[1]: atomic-openshift-node.service failed.


# journalctl -u atomic-openshift-node --no-pager

Mar 16 01:44:53 wmengupgradeetcd36-master-1 atomic-openshift-node[35430]: F0316 01:44:53.592524   35430 node.go:264] failed to run Kubelet: Running with swap on is not supported, please disable swap! or set --fail-swap-on flag to false. /proc/swaps contained: [Filename                                Type                Size        Used        Priority /var/swapfile                           file                2097148        0        -1]
Mar 16 01:44:53 wmengupgradeetcd36-master-1 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Mar 16 01:44:53 wmengupgradeetcd36-master-1 systemd[1]: Failed to start OpenShift Node.
Mar 16 01:44:53 wmengupgradeetcd36-master-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.
Mar 16 01:44:53 wmengupgradeetcd36-master-1 systemd[1]: atomic-openshift-node.service failed.
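
The journal output above identifies the cause through one fatal kubelet message. As a rough sketch, assuming the journal text has already been captured as a string, the relevant lines can be picked out by matching that message. The helper name `swap_failures` is hypothetical.

```python
from typing import List

# Message text copied from the kubelet error in the journal output above.
SWAP_ERROR = "Running with swap on is not supported"


def swap_failures(journal_text: str) -> List[str]:
    """Return the journal lines that report the kubelet swap failure."""
    return [ln for ln in journal_text.splitlines() if SWAP_ERROR in ln]
```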

Comment 1 Johnny Liu 2018-03-16 11:56:55 UTC
After some investigation, this seems to be related to the default value of the kube "fail-swap-on" setting.

In 3.9.9, the node service starts successfully when swap is on.
In 3.8.34, the node service fails to start when swap is on, just like in the initial report.

It seems that fail-swap-on defaults to false in 3.9.9 but to true in 3.8.34, so the upgrade hits this bug because it goes through 3.7 -> 3.8 -> 3.9.

Several sections of the 3.9 docs ask the user to disable swap, but the upgrade section does not mention it. So maybe we could fix this bug in 3.9.z by adding a pre-check that asks the user to disable swap before upgrading.

Based on this, I am setting the target release to 3.9.z.


@wmeng, please make sure "disable swap" is listed as a requirement in the upgrade doc.
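
The reasoning in this comment can be sketched as a small lookup. This is only an illustration: the defaults in `OBSERVED_FAIL_SWAP_ON` are the values observed in this comment on 3.8.34 and 3.9.9, not values taken from release documentation, and both names are hypothetical.

```python
from typing import Dict, List, Optional

# Assumption: fail-swap-on defaults as observed in this comment only.
OBSERVED_FAIL_SWAP_ON: Dict[str, bool] = {"3.8": True, "3.9": False}


def first_failing_hop(hops: List[str], swap_enabled: bool) -> Optional[str]:
    """Return the first version in the upgrade path whose node would refuse to start."""
    if not swap_enabled:
        return None
    for version in hops:
        # Versions without an observed default are treated as not failing.
        if OBSERVED_FAIL_SWAP_ON.get(version, False):
            return version
    return None
```

Under these assumptions, a 3.7 to 3.9 upgrade with swap on stops at the intermediate 3.8 hop, which matches the failure in the initial report.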

Comment 2 Johnny Liu 2018-03-16 11:57:49 UTC
@scott, is setting the target release to 3.9.z okay with you?

Comment 3 Johnny Liu 2018-03-16 12:27:56 UTC
The doc issue is tracked here:
https://bugzilla.redhat.com/show_bug.cgi?id=1557218

Comment 4 Scott Dodson 2018-03-21 19:19:06 UTC
(In reply to Johnny Liu from comment #2)
> @scott, is setting the target release to 3.9.z okay with you?

Yes, as long as this doesn't disrupt the upgrade path from 3.7 to 3.9. The upgrade should be disabling swap while the node is drained, we need to figure out why this is not happening.

Comment 5 Johnny Liu 2018-03-22 03:07:29 UTC
(In reply to Scott Dodson from comment #4)
> The upgrade should be disabling swap while the node is drained, we need to
> figure out why this is not happening.

After talking with the initial reporter: he installed the 3.7 environment with openshift_disable_swap=false in the inventory file, then triggered the upgrade to 3.9 with the same openshift_disable_swap=false setting. That is why openshift-ansible did not disable swap, and the issue was hit.
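
A minimal sketch for spotting this override before an upgrade, assuming an INI-style inventory passed in as text. The helper name `swap_override` is hypothetical, and the parsing is deliberately simplistic (it does not handle YAML inventories or group_vars files).

```python
import re
from typing import Optional

_SETTING = re.compile(r"\bopenshift_disable_swap\s*=\s*(\S+)")


def swap_override(inventory_text: str) -> Optional[bool]:
    """Return the last openshift_disable_swap value found, or None if it is not set."""
    value = None
    for line in inventory_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and INI comments.
        if not stripped or stripped.startswith(("#", ";")):
            continue
        match = _SETTING.search(stripped)
        if match:
            raw = match.group(1).strip("'\"").lower()
            value = raw in ("true", "yes", "1")
    return value
```

A result of `False` means the inventory asks the installer to leave swap enabled, which is the configuration that led to this failure.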

Comment 6 Scott Dodson 2018-03-22 12:31:01 UTC
I think we should remove this ability in 3.9.z.

Comment 7 Russell Teague 2018-11-20 15:24:38 UTC
The ability to override disabling swap has been removed in 3.9.  Swap will be disabled during upgrade while the node is drained.

https://github.com/openshift/openshift-ansible/pull/10607

Fixed in openshift-ansible-3.9.51-1

Comment 8 Weihua Meng 2018-11-26 03:31:17 UTC
Fixed.

openshift-ansible-3.9.54-1.git.0.8a67eb1.el7.noarch

before upgrade:
# free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        1.1G        8.9G        1.4M        5.5G         14G
Swap:          2.0G          0B        2.0G

# cat /etc/fstab

#
# /etc/fstab
# Created by anaconda on Sun Nov 25 17:28:52 2018
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/rhel-root	/	xfs	defaults	0 0
UUID=cd1d5cbd-93f3-4222-9596-b4f7f22e52d1	/boot	xfs	defaults	0 0
/var/swapfile  swap swap  defaults  0 0 

Upgrade succeeded.
After upgrade:
# free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        1.6G        3.8G        2.7M         10G         13G
Swap:            0B          0B          0B

# cat /etc/fstab

#
# /etc/fstab
# Created by anaconda on Sun Nov 25 17:28:52 2018
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/rhel-root	/	xfs	defaults	0 0
UUID=cd1d5cbd-93f3-4222-9596-b4f7f22e52d1	/boot	xfs	defaults	0 0
#/var/swapfile  swap swap  defaults  0 0 


Kernel Version: 3.10.0-957.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.6 (Maipo)
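
The before and after /etc/fstab listings show the swap entry being commented out. A rough sketch of that transformation follows, assuming the fstab contents are handled as text. This is not the openshift-ansible implementation, and the function name `comment_out_swap` is hypothetical.

```python
def comment_out_swap(fstab_text: str) -> str:
    """Prefix every active swap entry in fstab-style text with '#'."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_comment = line.lstrip().startswith("#")
        # fstab fields: device, mount point, filesystem type, options, dump, pass.
        if not is_comment and len(fields) >= 3 and fields[2] == "swap":
            line = "#" + line
        out.append(line)
    # Preserve a trailing newline if the input had one.
    return "\n".join(out) + ("\n" if fstab_text.endswith("\n") else "")
```

Editing fstab only keeps swap off after a reboot; on a running host, active swap would also need to be turned off, for example with `swapoff -a`.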

Comment 11 errata-xmlrpc 2018-12-13 19:26:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3748

