Description of problem:
After upgrading RHEL 7.2 to RHEL 7.3 without upgrading the OSE 3.2 and docker packages:
1) new-app failed with error:
F0922 22:54:28.190693 1 builder.go:204] Error: build error: fatal: unable to access 'https://github.com/openshift/django-ex.git/': Could not resolve host: github.com; Unknown error
2) The pod cannot reach the external network (github.com), but the host can reach it.
3) After restarting the node service, 1) and 2) work correctly.
4) After rebooting the host, no pods can be deployed:
# oc get pods
NAME                       READY     STATUS              RESTARTS   AGE
docker-registry-1-deploy   0/1       DeadlineExceeded    0          1h
docker-registry-2-gacfr    0/1       ContainerCreating   0          1h
router-1-eezqe             0/1       ContainerCreating   0          1h

Version-Release number of selected component (if applicable):
atomic-openshift-3.2.1.15-1.git.0.d84be7f.el7.x86_64
docker-1.10.3-46.el7.14.x86_64

Before:
Red Hat Enterprise Linux Server release 7.2 (Maipo)
Linux openshift-197.lab.eng.nay.redhat.com 3.10.0-327.18.2.el7.x86_64 #1 SMP Fri Apr 8 05:09:53 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

After:
Red Hat Enterprise Linux Server release 7.3 Beta (Maipo)
Linux openshift-197.lab.eng.nay.redhat.com 3.10.0-506.el7.x86_64 #1 SMP Mon Sep 12 23:31:02 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Always

Steps to Reproduce:
1. Install OSE 3.2 on RHEL 7.2.
2. Run new-app and curl the service; everything works correctly.
3. Update the rhel-7 and RHEL-7-extra repos so that only RHEL is upgraded.
4. Run "yum -y update" to upgrade RHEL 7.2 to 7.3. For the updated packages, please refer to the attached file.
5. Reboot the host and check the pod status.

Actual results:
4) RHEL 7.2 was updated to 7.3 successfully.
4) The master and node services are running, but "oc new-app" fails with the error above.
4) The pod cannot reach the external network (github.com), but the host can reach it.
5) oc describe docker-registry-2-gacfr
<---snip-->
2h 12s 684 {kubelet 192.168.0.16} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: API error (500): Container command could not be invoked.\n"

Expected results:
It works correctly, just as it did before the upgrade.

Additional info:
Created attachment 1203992 [details] update pkgs
Created attachment 1203993 [details] service status
Do you have any saved iptables rules in /etc/sysconfig/iptables ? This sounds like what happens when the docker rules are missing from iptables. Can you please attach the output from iptables-save to the bug.
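The check suggested above can be sketched like this. The ruleset in the heredoc is a hypothetical stand-in for `iptables-save` output (the exact chain names on an affected host may differ); on a real system you would pipe the live output instead of the sample.

```shell
# Quick check for the docker iptables rules mentioned above.
# sample_rules is an illustrative, made-up iptables-save fragment;
# replace the heredoc with `iptables-save` output on the real host.
sample_rules=$(cat <<'EOF'
*nat
:PREROUTING ACCEPT [0:0]
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
COMMIT
EOF
)
if printf '%s\n' "$sample_rules" | grep -q '^:DOCKER'; then
  status="docker chain present"
else
  status="docker chain missing"
fi
echo "$status"
```

If the DOCKER (and OpenShift) chains are absent after the upgrade, container networking will fail in exactly the way described in this bug.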
Created attachment 1204689 [details] iptables
(In reply to Ben Bennett from comment #3)
> Do you have any saved iptables rules in /etc/sysconfig/iptables ?
>
> This sounds like what happens when the docker rules are missing from
> iptables. Can you please attach the output from iptables-save to the bug.

Please see the attachments.
The post-upgrade rules do not have any entries for OpenShift.

And the service logs you posted for atomic-openshift-node show all sorts of horrible errors, e.g.:
Error syncing pod a5d46296-8135-11e6-b985-fa163ea49727, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: API error (500): Container command could not be invoked.\n"

Can you please get more logs for the three services? Ideally all of whatever journalctl -u <service> spits out for each.
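The log collection requested above can be sketched as follows. The unit names are the ones discussed in this thread (adjust if yours differ); the loop is printed as a dry run, so remove the `echo` to actually collect the logs (running as root on the affected host).

```shell
# Dry-run sketch of gathering `journalctl -u <service>` output for each
# of the three services discussed in this bug. The service names are
# taken from the thread; adjust them for your installation.
services="docker atomic-openshift-node atomic-openshift-master"
for svc in $services; do
  # Remove the `echo` to actually write the logs to /tmp.
  echo "journalctl -u ${svc} --no-pager > /tmp/${svc}.log"
done
```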
Created attachment 1204943 [details]
docker failure

After upgrading from 7.2 to 7.3 I get the same errors mentioned in comment 6, even for something as simple as `docker run -it rhel7`. I've masked all openshift processes and rebooted, and this is what I get in the logs.
selinux avcs:

type=SYSCALL msg=audit(1474918662.450:140): arch=c000003e syscall=56 success=yes exit=2872 a0=6c020011 a1=0 a2=0 a3=0 items=0 ppid=1 pid=2739 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="docker-current" exe="/usr/bin/docker-current" subj=system_u:system_r:unconfined_service_t:s0 key=(null)

type=AVC msg=audit(1474918662.580:141): avc: denied { transition } for pid=2872 comm="exe" path="/usr/bin/bash" dev="dm-4" ino=27263397 scontext=system_u:system_r:unconfined_service_t:s0 tcontext=system_u:system_r:svirt_lxc_net_t:s0:c222,c346 tclass=process

type=SYSCALL msg=audit(1474918662.580:141): arch=c000003e syscall=59 success=no exit=-13 a0=c8205ca900 a1=c8205ca910 a2=c8205446c0 a3=0 items=0 ppid=2710 pid=2872 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts1 ses=4294967295 comm="exe" exe="/usr/bin/docker-current" subj=system_u:system_r:unconfined_service_t:s0 key=(null)
You are not running the correct docker/selinux-policy packages. The required versions are:

docker-1.10.3-55.el7
selinux-policy-3.13.1-100.el7
When I ensure that we get docker-1.10.3-55.el7 and selinux-policy-3.13.1-100.el7 during the upgrade, existing pods lose network connectivity. This is not limited to DNS resolution; I cannot reach the kubernetes IP address either. Restarting `atomic-openshift-node` restores networking to existing pods, and builds work properly after that point. Assigning back to the networking team; however, I believe it is reasonable to expect hosts to be rebooted, and a reboot also resolves the problem.
Created attachment 1205010 [details]
logs: docker.log, atomic-openshift-node.log, and the complete journal.log

After waiting for the iptables sync I restarted docker, which did not resolve the networking problem; I then restarted atomic-openshift-node and everything started working.
Created attachment 1205053 [details] pre_reboot
Created attachment 1205054 [details] post_reboot
(In reply to Ben Bennett from comment #6)
> The post upgrade rules do not have any entries for OpenShift.
>
> And the service logs you posted for atomic-openshift-node show all sorts of
> horrible errors, e.g.:
> Error syncing pod a5d46296-8135-11e6-b985-fa163ea49727, skipping: failed
> to "StartContainer" for "POD" with RunContainerError: "runContainer: API
> error (500): Container command could not be invoked.\n"
>
> Can you please get more logs for the three services? Ideally all of
> whatever journalctl -u <service> spits out for each.

Hi Ben, I saw that Scott has done some verification steps with logs. I have added my logs for the three services as well: "pre_reboot" covers the operations "update rhel -> restart node service", and "post_reboot" covers "reboot host". Hope this helps.
Created attachment 1205255 [details]
Logs and debug.sh output pre and post upgrade

I rolled back ose3-node1.example.com to 7.2, performed some builds to ensure I had a pod running on that host, and then performed the upgrade again. The openshift-sdn-debug* files are from the master; ose3-node1.tar.gz contains logs of the ovs flows, iptables, and the journal from ose3-node1.example.com.
The conclusion is that there are saved iptables rules that are applied when the iptables service is restarted as part of the upgrade. Restarting openshift-node seems to fix the problem, and rebooting definitely does.

So, this isn't really something we can do anything about in OpenShift, but I'm not sure what the path forward is. I think the best we can do is to document this clearly somewhere, but I am not sure where. Eric, do you have any thoughts?
What is triggering the restart? rpm -qa --scripts | grep -C 50 iptables.service Might help find it... That seems like the real bug here, no?
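To illustrate what the command above would surface: you are looking for a package scriptlet that (try-)restarts iptables.service. The scriptlet text below is made up for illustration, not captured from this system; on the affected host you would run the `rpm -qa --scripts` pipeline as suggested.

```shell
# Illustrative only: a scriptlet of the kind `rpm -qa --scripts` prints.
# This sample is hypothetical; the tell-tale sign is a systemctl
# (try-)restart of iptables.service inside a package scriptlet.
sample_scriptlet='postuninstall scriptlet (using /bin/sh):
/usr/bin/systemctl try-restart iptables.service ip6tables.service >/dev/null 2>&1 || :'
hits=$(printf '%s\n' "$sample_scriptlet" | grep -c 'iptables.service')
echo "lines referencing iptables.service: $hits"
```

Any package whose scriptlet matches is a candidate for what reloaded iptables (and thereby wiped the docker/OpenShift rules) during the upgrade.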
That's a useful command. sdodson used it and found that the iptables-services package triggers the reload when it is upgraded.
I filed https://bugzilla.redhat.com/show_bug.cgi?id=1380141, which I think is the real bug. But there is likely some reason it needs to do what it is doing.
Is it possible for OpenShift to provide system scripts/triggers that could combat this? I.e., watch for restarts of iptables, and restart docker/openshift?
We could add a reload dependency to openshift-node so that when iptables-services is reloaded, we reload too. There are obvious advantages and disadvantages to that, so I'm not sure I want to advocate for it as a solution. FWIW, docker can sometimes be broken by missing iptables rules too, so to fix everything we would need to add that dependency to docker as well. But as an admin, I think it would surprise me that restarting the iptables "service" would restart docker and kill all running containers.
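A minimal sketch of one way to express such a dependency, assuming a systemd drop-in is acceptable: PartOf= propagates stop and restart (though not reload) from iptables.service to the node service. The drop-in is written to a temp directory here for illustration; on a real host it would live under /etc/systemd/system/atomic-openshift-node.service.d/ followed by `systemctl daemon-reload`. This is a sketch of the idea discussed above, not a shipped configuration.

```shell
# Hypothetical drop-in tying atomic-openshift-node to iptables.service.
# With PartOf=, restarting iptables.service also restarts the node
# service (PartOf does not propagate `systemctl reload`).
# Written to a temp dir for illustration only.
dropin_dir=$(mktemp -d)
dropin="${dropin_dir}/iptables.conf"
cat > "$dropin" <<'EOF'
[Unit]
PartOf=iptables.service
EOF
cat "$dropin"
```

This carries exactly the tradeoff raised above: an admin restarting the iptables "service" would also bounce the node service (and, if extended to docker, kill running containers).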
@ben: while it may be surprising, would it at least "never be broken"? I'm trying to decide whether I think it is a good tradeoff...
(In reply to Ben Bennett from comment #23)
> But I think it would be surprising to me as an admin that restarting the iptables "service" would restart docker and kill all running containers.

I agree, but restarting the service is better, IMO, than having a system that does not behave the way you expect. Is there a way to have it message to the admin that other services will be restarted?

That said, what is really needed is a "graceful" restart, where the daemon restarts but leaves the containers running and updates them after the restart as needed.
The decision made was to document what needs to be done: https://github.com/openshift/openshift-docs/pull/3006 Then to write a kbase article and get the docs link into the OS release notes, and the OSE 3.3 release notes.
With the kbase article and the docs in 3.3, I think this is adequately addressed.