Description of problem:
We've been seeing periodic but pretty consistent problems both pushing to and pulling from the registry, with the error 'No route to host'.

Version-Release number of selected component (if applicable):
3.7.23-1

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
us-west-2 should now be working fine. We are still root-causing the problem and working out the final solution, and will update with how we got it working.
Created attachment 1395746
check-sdn.sh

I run this script with:

ansible 'starter*infra*' -u root -m script -a check-sdn.sh

If the script exits with a 'FAIL', that means the OVS rules are messed up. It can affect other communication paths, but since the most common is infra<->making sure those stay pretty clean is more important. On us-west-2 we saw compute nodes unable to pull from the registry because the infra nodes' rule sets were messed up.
Created attachment 1395747
flush-infra.sh

When run from a master with affected infra nodes, this script drains the infra node, deletes all of the containers and cruft left behind, and then starts the infra node again. This results in a new, clean OVS ruleset.
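For anyone reading without access to the attachment, here is a rough sketch of what a drain/clean/restart cycle for a single infra node could look like. This is not the attached flush-infra.sh. The function names (run, flush_infra_node), the atomic-openshift-node service name, root SSH access, and the Docker runtime are all assumptions for illustration. It defaults to a dry run that only prints the commands.

```shell
#!/usr/bin/env bash
# Hypothetical sketch only; NOT the attached flush-infra.sh.
# Assumptions: root SSH access to the node, the Docker runtime, and a node
# service named atomic-openshift-node (Origin installs may use origin-node).

# Print the command when DRY_RUN=1 (the default); otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

flush_infra_node() {
    local node="${1:-}"
    if [ -z "$node" ]; then
        echo "usage: flush_infra_node <node-name>" >&2
        return 2
    fi

    # 1. Evict pods and mark the node unschedulable.
    run oc adm drain "$node" --ignore-daemonsets --delete-local-data --force

    # 2. Stop the node service (service name is an assumption).
    run ssh "root@$node" systemctl stop atomic-openshift-node

    # 3. Remove leftover containers (assumes Docker is the runtime).
    run ssh "root@$node" 'docker ps -aq | xargs -r docker rm -f'

    # 4. Start the node service again; the expectation is that it
    #    reprograms the OVS rules on startup.
    run ssh "root@$node" systemctl start atomic-openshift-node

    # 5. Allow pods to be scheduled on the node again.
    run oc adm uncordon "$node"
}
```

Calling `flush_infra_node infra-1.example.com` (a made-up hostname) prints the five commands in order. Setting DRY_RUN=0 would actually execute them, so review the list against your own environment first.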
Fixed by https://github.com/openshift/origin/pull/18617
Tested on v3.9.3. There is no replay of the DeleteHostSubnetRules event when deleting the node.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0489