Bug 1378693 - [RHEL73] OSE 3.2 does not work correctly after upgrading RHEL 7.2 to RHEL 7.3
Summary: [RHEL73] OSE 3.2 does not work correctly after upgrading RHEL 7.2 to RHEL 7.3
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.2.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Ben Bennett
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks: 1375561
 
Reported: 2016-09-23 06:17 UTC by liujia
Modified: 2017-02-27 06:41 UTC
CC: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-07 15:56:23 UTC
Target Upstream Version:
Embargoed:


Attachments
docker failure (13.72 KB, text/x-vhdl), 2016-09-26 19:16 UTC, Scott Dodson
logs (195.73 KB, application/x-gzip), 2016-09-27 01:24 UTC, Scott Dodson
pre_reboot (43.89 KB, application/x-gzip), 2016-09-27 05:40 UTC, liujia
post_reboot (32.48 KB, application/x-gzip), 2016-09-27 05:41 UTC, liujia
Logs and debug.sh output pre and post upgrade (990.22 KB, application/x-gzip), 2016-09-27 14:40 UTC, Scott Dodson


Links
Red Hat Bugzilla 1380141 (high, CLOSED): Updating iptables-services breaks working systems, last updated 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) 2755871, last updated 2016-11-28 12:42:24 UTC

Description liujia 2016-09-23 06:17:30 UTC
Description of problem:
After upgrading RHEL 7.2 to RHEL 7.3 without upgrading the OSE 3.2 and docker packages:
1) new-app fails with the error:
F0922 22:54:28.190693       1 builder.go:204] Error: build error: fatal: unable to access 'https://github.com/openshift/django-ex.git/': Could not resolve host: github.com; Unknown error
2) Pods cannot reach the external network (github.com), but the host can (see the check sketched after the pod listing below).
3) Restarting the node service makes 1) and 2) work correctly again.
4) After rebooting the host, no pods can be deployed:
# oc get pods
NAME                       READY     STATUS              RESTARTS   AGE
docker-registry-1-deploy   0/1       DeadlineExceeded    0          1h
docker-registry-2-gacfr    0/1       ContainerCreating   0          1h
router-1-eezqe             0/1       ContainerCreating   0          1h
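One quick way to confirm point 2) is to compare name resolution from the host with a lookup from inside a running pod. The commands below are only a sketch, not output from the original report, and the pod name is illustrative:

# From the host (resolution works even after the upgrade):
curl -sI https://github.com/ | head -n1

# From inside a pod (fails with "Could not resolve host" in this state):
oc exec docker-registry-2-gacfr -- getent hosts github.com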
 

Version-Release number of selected component (if applicable):
atomic-openshift-3.2.1.15-1.git.0.d84be7f.el7.x86_64
docker-1.10.3-46.el7.14.x86_64

before:
Red Hat Enterprise Linux Server release 7.2 (Maipo)
Linux openshift-197.lab.eng.nay.redhat.com 3.10.0-327.18.2.el7.x86_64 #1 SMP Fri Apr 8 05:09:53 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
after:
Red Hat Enterprise Linux Server release 7.3 Beta (Maipo)
Linux openshift-197.lab.eng.nay.redhat.com 3.10.0-506.el7.x86_64 #1 SMP Mon Sep 12 23:31:02 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
always

Steps to Reproduce:
1. Install OSE 3.2 on RHEL 7.2.

2. Run new-app and curl the service; everything works correctly.

3. Update the rhel-7 and RHEL-7-extra repos so that only RHEL is upgraded.

4. Run "yum -y update" to upgrade RHEL 7.2 to RHEL 7.3.
For the list of updated packages, please refer to the attached file (a command sketch follows these steps).

5. Reboot the host and check the pod status.
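A condensed sketch of steps 3-5, assuming the OSE 3.2 and docker packages are meant to stay at their current versions; the excludes are an assumption for illustration, not the exact command used in the report:

# Upgrade the OS only; keep OSE 3.2 and docker packages at their current versions.
yum -y update --exclude='atomic-openshift*' --exclude='docker*'

# Reboot and check pod status.
systemctl reboot
oc get pods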

Actual results:
After step 4:
- RHEL 7.2 has been updated to 7.3 successfully.
- The master and node services are running, but "oc new-app" fails with the error above.
- Pods cannot reach the external network (github.com), but the host can.

After step 5:
- "oc describe docker-registry-2-gacfr" shows:
<---snip-->
   2h        12s        684        {kubelet 192.168.0.16}                 Warning        FailedSync        Error syncing pod, skipping: failed to  "StartContainer" for "POD" with RunContainerError: "runContainer: API  error (500): Container command could not be invoked.\n"

Expected results:
Everything works correctly after the upgrade, just as it did before.

Additional info:

Comment 1 liujia 2016-09-23 06:19:11 UTC
Created attachment 1203992 [details]
update pkgs

Comment 2 liujia 2016-09-23 06:19:36 UTC
Created attachment 1203993 [details]
service status

Comment 3 Ben Bennett 2016-09-23 17:39:46 UTC
Do you have any saved iptables rules in /etc/sysconfig/iptables ?

This sounds like what happens when the docker rules are missing from iptables.  Can you please attach the output from iptables-save to the bug.
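For reference, a minimal sketch of gathering what is asked for above (not part of the original comment):

# Saved rules that the iptables service applies on start, if the file exists:
cat /etc/sysconfig/iptables

# Current live ruleset, suitable for attaching to the bug:
iptables-save > iptables-save.txt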

Comment 4 liujia 2016-09-26 07:19:35 UTC
Created attachment 1204689 [details]
iptables

Comment 5 liujia 2016-09-26 07:21:41 UTC
(In reply to Ben Bennett from comment #3)
> Do you have any saved iptables rules in /etc/sysconfig/iptables ?
> 
> This sounds like what happens when the docker rules are missing from
> iptables.  Can you please attach the output from iptables-save to the bug.

Please see the attachments.

Comment 6 Ben Bennett 2016-09-26 18:03:37 UTC
The post-upgrade rules do not have any entries for OpenShift.

And the service logs you posted for atomic-openshift-node show all sorts of horrible errors, e.g.:
  Error syncing pod a5d46296-8135-11e6-b985-fa163ea49727, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: API error (500): Container command could not be invoked.\n"

Can you please get more logs for the three services?  Ideally all of whatever journalctl -u <service> spits out for each.
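A minimal collection sketch, assuming the three services are docker, atomic-openshift-node, and atomic-openshift-master (the comment does not name them explicitly):

for svc in docker atomic-openshift-node atomic-openshift-master; do
    journalctl -u "$svc" --no-pager > "${svc}.log"
done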

Comment 7 Scott Dodson 2016-09-26 19:16:56 UTC
Created attachment 1204943 [details]
docker failure

After upgrading from 7.2 to 7.3 I get the same errors mentioned in comment 6, even for something as simple as `docker run -it rhel7`. I've masked all OpenShift processes and rebooted, and this is what I get in the logs.

Comment 8 Scott Dodson 2016-09-26 19:40:16 UTC
SELinux AVCs:

type=SYSCALL msg=audit(1474918662.450:140): arch=c000003e syscall=56 success=yes exit=2872 a0=6c020011 a1=0 a2=0 a3=0 items=0 ppid=1 pid=2739 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="docker-current" exe="/usr/bin/docker-current" subj=system_u:system_r:unconfined_service_t:s0 key=(null)
type=AVC msg=audit(1474918662.580:141): avc:  denied  { transition } for  pid=2872 comm="exe" path="/usr/bin/bash" dev="dm-4" ino=27263397 scontext=system_u:system_r:unconfined_service_t:s0 tcontext=system_u:system_r:svirt_lxc_net_t:s0:c222,c346 tclass=process
type=SYSCALL msg=audit(1474918662.580:141): arch=c000003e syscall=59 success=no exit=-13 a0=c8205ca900 a1=c8205ca910 a2=c8205446c0 a3=0 items=0 ppid=2710 pid=2872 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts1 ses=4294967295 comm="exe" exe="/usr/bin/docker-current" subj=system_u:system_r:unconfined_service_t:s0 key=(null)
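For anyone reproducing this, a sketch of pulling the same denials out of the audit log with standard tooling (not part of the original comment):

# Recent AVC denials:
ausearch -m avc -ts recent

# Human-readable analysis, if setroubleshoot-server is installed:
sealert -a /var/log/audit/audit.log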

Comment 9 Daniel Walsh 2016-09-26 20:09:21 UTC
You are not running the correct docker/selinux-policy packages.

docker-1.10.3-55.el7
selinux-policy-3.13.1-100.el7
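A quick sketch for checking what is installed and, if needed, pulling in the versions listed above; the yum invocation is an assumption and presumes the 7.3 repos already carry these builds:

# Check the currently installed versions:
rpm -q docker selinux-policy

# Pull in the expected builds (assumed available in the enabled repos):
yum -y install docker-1.10.3-55.el7 selinux-policy-3.13.1-100.el7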

Comment 11 Scott Dodson 2016-09-27 01:13:44 UTC
When I ensure that we get docker-1.10.3-55.el7 and selinux-policy-3.13.1-100.el7 during the upgrade, existing pods lose network connectivity. This is not limited to DNS resolution; I cannot reach the kubernetes service IP address either. Restarting `atomic-openshift-node` restores networking to existing pods, and builds work properly after that point.

Assigning back to the networking team; however, I believe it's reasonable to expect that hosts are rebooted, and a reboot also resolves the problem.

Comment 12 Scott Dodson 2016-09-27 01:24:13 UTC
Created attachment 1205010 [details]
logs

docker.log, atomic-openshift-node.log, and complete journal.log

After waiting for the iptables sync I restarted docker, which did not resolve the networking problem. I then restarted atomic-openshift-node and everything started working.
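The sequence described above, as a sketch (restarting docker alone was not enough; restarting the node service was):

systemctl restart docker                  # did not restore pod networking
systemctl restart atomic-openshift-node   # networking recovered after this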

Comment 14 liujia 2016-09-27 05:40:19 UTC
Created attachment 1205053 [details]
pre_reboot

Comment 15 liujia 2016-09-27 05:41:07 UTC
Created attachment 1205054 [details]
post_reboot

Comment 16 liujia 2016-09-27 05:41:38 UTC
(In reply to Ben Bennett from comment #6)
> The post upgrade rules do not have any entries for OpenShift.
> 
> And the service logs you posted for atomic-openshift-node show all sorts of
> horrible errors, e.g.:
>   Error syncing pod a5d46296-8135-11e6-b985-fa163ea49727, skipping: failed
> to "StartContainer" for "POD" with RunContainerError: "runContainer: API
> error (500): Container command could not be invoked.\n"
> 
> Can you please get more logs for the three services?  Ideally all of
> whatever journalctl -u <service> spits out for each.

Hi Ben

I saw that Scott had already done some verification steps with logs.
I have added my logs for the three services as well.
pre_reboot covers the operations "update rhel -> restart node service";
post_reboot covers "reboot host".
Hope this helps.

Comment 17 Scott Dodson 2016-09-27 14:40:51 UTC
Created attachment 1205255 [details]
Logs and debug.sh output pre and post upgrade

I rolled back ose3-node1.example.com to 7.2, performed some builds to ensure I had a pod running on that host, and then performed the upgrade again. The openshift-sdn-debug* files are from the master; ose3-node1.tar.gz contains logs of OVS flows, iptables, and the journal from ose3-node1.example.com.

Comment 18 Ben Bennett 2016-09-28 14:32:35 UTC
The conclusion is that there are saved iptables rules that are applied when the iptables service is restarted as part of the upgrade.

Restarting openshift-node seems to fix the problem, and rebooting definitely does.

So this isn't really something we can do anything about in OpenShift, but I'm not sure what the path forward is.

I think the best we can do is to document this clearly somewhere.  But I am not sure where.  Eric, do you have any thoughts?

Comment 19 Eric Paris 2016-09-28 16:22:39 UTC
What is triggering the restart?

rpm -qa --scripts | grep -C 50 iptables.service

Might help find it...

That seems like the real bug here, no?

Comment 20 Ben Bennett 2016-09-28 17:24:51 UTC
That's a useful command. sdodson used it and found that iptables-services triggers the reload when it is upgraded.
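For reference, the scriptlets of the specific package can also be inspected directly; a sketch, not from the original comment:

# Show the install/upgrade scriptlets shipped by iptables-services:
rpm -q --scripts iptables-services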

Comment 21 Eric Paris 2016-09-28 18:38:50 UTC
I filed https://bugzilla.redhat.com/show_bug.cgi?id=1380141, which I think is the real bug. But there is likely some reason it needs to do what it is doing.

Comment 22 Eric Rich 2016-09-28 19:09:57 UTC
Is it possible for OpenShift to provide system scripts/triggers that could combat this? I.e., watch for restarts of iptables and restart docker/openshift?

Comment 23 Ben Bennett 2016-09-28 19:20:18 UTC
We could add a reload dependency to openshift-node so that when iptables-services is reloaded, we reload.

There are obvious advantages and disadvantages to that, so I'm not sure if I want to advocate for it as a solution.  FWIW, docker can sometimes be broken by missing iptables rules too... so to fix everything we would need to add that dependency to docker as well.  But I think it would be surprising to me as an admin that restarting the iptables "service" would restart docker and kill all running containers.
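For illustration only, a systemd drop-in is one way such a dependency could be expressed; the path and the PartOf= approach are assumptions here, not the fix that was implemented:

# Create a drop-in that ties the node service to iptables.service (assumed path):
mkdir -p /etc/systemd/system/atomic-openshift-node.service.d
cat > /etc/systemd/system/atomic-openshift-node.service.d/iptables.conf <<'EOF'
[Unit]
# Stops/restarts of iptables.service are propagated to atomic-openshift-node.
PartOf=iptables.service
EOF
systemctl daemon-reload

Note that PartOf= only propagates stops and restarts; a plain reload of iptables.service would not be propagated, which is part of the tradeoff discussed in the following comments.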

Comment 24 Eric Paris 2016-09-28 20:03:21 UTC
@ben, while it may be surprising, would it at least 'not ever be broken'? I'm trying to decide whether I think it is a good tradeoff...

Comment 25 Eric Rich 2016-09-29 13:45:34 UTC
(In reply to Ben Bennett from comment #23)
> But I think it would be surprising to me as an admin that restarting the iptables "service" would restart docker and kill all running containers.

I agree, but restarting the service is better IMO than having a system that does not behave the way you expect. Is there a way to have it notify the admin that other services will be restarted?

With that said, what is really needed is a "graceful" restart, where the daemon restarts but leaves the containers running and updates them afterwards as needed.

Comment 26 Ben Bennett 2016-10-07 14:44:01 UTC
The decision was to document what needs to be done:
  https://github.com/openshift/openshift-docs/pull/3006

Then to write a kbase article and get the docs link into the OS release notes, and the OSE 3.3 release notes.

Comment 28 Ben Bennett 2016-11-07 15:56:23 UTC
With the kbase article and the docs in 3.3, I think this is adequately addressed.

