Bug 1688317 - sporadic timeouts in opening TCP connection prevents Satellite upgrades
Summary: sporadic timeouts in opening TCP connection prevents Satellite upgrades
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite 6
Classification: Red Hat
Component: Upgrades
Version: 6.4.2
Hardware: x86_64
OS: Linux
high
high vote
Target Milestone: 6.4.3
Assignee: satellite6-bugs
QA Contact: Nikhil Kathole
URL:
Whiteboard:
Depends On: 1687190
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-03-13 14:28 UTC by sthirugn@redhat.com
Modified: 2019-10-30 17:14 UTC (History)
8 users (show)

Fixed In Version: foreman-1.18.0.41-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1687190
Environment:
Last Closed: 2019-04-29 18:15:37 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0903 None None None 2019-04-29 18:15:39 UTC

Description sthirugn@redhat.com 2019-03-13 14:28:21 UTC
+++ This bug was initially created as a clone of Bug #1687190 +++

Description of problem:
Upgrading Satellite (e.g. from 6.3 to 6.4) can fail at random rake script with backtrace like:

RestClient::Exceptions::OpenTimeout: Timed out connecting to server
/opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.1/lib/restclient/request.rb:731:in `rescue in transmit'
/opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.1/lib/restclient/request.rb:642:in `transmit'
/opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.1/lib/restclient/request.rb:145:in `execute'
/opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.1/lib/restclient/request.rb:52:in `execute'
/opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.1/lib/restclient/resource.rb:67:in `post'
/opt/theforeman/tfm/root/usr/share/gems/gems/runcible-2.8.1/lib/runcible/base.rb:92:in `get_response'
/opt/theforeman/tfm/root/usr/share/gems/gems/runcible-2.8.1/lib/runcible/base.rb:72:in `call'
/opt/theforeman/tfm/root/usr/share/gems/gems/runcible-2.8.1/lib/runcible/resources/repository.rb:147:in `unit_search'
/opt/theforeman/tfm/root/usr/share/gems/gems/runcible-2.8.1/lib/runcible/extensions/repository.rb:199:in `distributions'
/opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.7.0.46/app/models/katello/repository.rb:620:in `import_distribution_data'
/opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.7.0.46/app/models/katello/repository.rb:615:in `block in import_distributions'
/opt/theforeman/tfm-ror51/root/usr/share/gems/gems/activerecord-5.1.6/lib/active_record/relation/delegation.rb:39:in `each'
/opt/theforeman/tfm-ror51/root/usr/share/gems/gems/activerecord-5.1.6/lib/active_record/relation/delegation.rb:39:in `each'

(now it depends per use case / rake script failed, e.g. few times:

/opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.7.0.46/app/models/katello/repository.rb:614:in `import_distributions'
/opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.7.0.46/lib/katello/tasks/upgrades/3.7/make_all_ks_repos_bootable.rake:7:in `block (4 levels) in <top (required)>'
/opt/rh/rh-ruby24/root/usr/share/gems/gems/rake-12.0.0/exe/rake:27:in `<top (required)>'

)

Usually it happens in a rake script that calls REST requests in a tight loop. Very random occurrences, more often on 6.5.


inecas++ ++ found out a fix:

diff --git a/app/models/setting.rb b/app/models/setting.rb
index 72a094024aa135fa41e0f6454d07c1d79fa3ac0a..e19ad55b24afb74c83f9e3033db2756541830fab 100644
--- a/app/models/setting.rb
+++ b/app/models/setting.rb
@@ -93,12 +93,8 @@ class Setting < ApplicationRecord
   def self.[](name)
     name = name.to_s
     cache_value = Setting.cache.read(name)
-    if cache_value.nil?
-      value = find_by(:name => name).try(:value)
-      Setting.cache.write(name, value)
-      return value
-    else
-      return cache_value
+    cache.fetch(name) do
+      find_by(:name => name).try(:value)
     end
   end

that really works.

This BZ is rather a tracker to ensure we fix it in downstream / to decide where all to fix it.


Version-Release number of selected component (if applicable):
Sat 6.4.2


How reproducible:
very sporadicaly during upgrade. Bit artificial reproducer found by Ivan is provided below.


Steps to Reproduce:
1. Decrease katello rest timeout to see the timeouts much sooner than after 1 hour:
/etc/foreman/plugins/katello.yaml :

  :rest_client_timeout: 30

2. Run either (or all) scripts that fires REST requests in a tight loop:

a) requests to 127.0.0.1/pub :
mkdir -p ~/repro
cat <<EOF | foreman-rake console | tee ~/repro/pub-$(date -Isec).log
   conf.return_format = ""
   log = Logger.new(STDOUT)
   log.info "=== process pid #{\$\$}"
   url = URI.parse('http://127.0.0.1/pub/')
   threads = 1.times.map do |i|
     Thread.new do
       log.info "Start thread no. #{i+1}"
         iteration = 0
         cycle = 0
         while true
         begin
           cycle += 1
           if cycle == 1000
             iteration += 1
             cycle = 0
             log.info "=== Thread #{i+1}: iteration #{iteration}k"
           end
           Setting['http_cache']
           Net::HTTP.get_response(url)
         rescue Exception => e
           log.info "Thread #{i+1}: error at iteration #{iteration * 1000 + cycle}"
           log.info "#{e.class} #{e.message}"
           log.info e.backtrace.join("\n")
           sleep(1)
         end
       end
     end
   end
   threads.each(&:join)
EOF

b) requests to pulp repo details:
cat <<EOF | foreman-rake console | tee ~/repro/pulp-$(date -Isec).log
   conf.return_format = ""
   log = Logger.new(STDOUT)
   pulp_id = Katello::Repository.first.pulp_id
   Katello.pulp_server.extensions.repository
   threads = 4.times.map do |i|
     Thread.new do
       log.info "Start thread no. #{i+1}"
       iteration = 0
       cycle = 0
       while true        
         begin
           cycle += 1
           if cycle == 1000
             iteration += 1
             cycle = 0
             log.info "=== iteration #{iteration}k"
           end
           Katello.pulp_server.extensions.repository.distributions(pulp_id)
         rescue Exception => e
           log.info "Error at iteration #{iteration * 1000 + cycle}"
           log.info "#{e.class} #{e.message}"
           log.info e.backtrace.join("\n")
         end
       end
     end
   end
   threads.each(&:join)
EOF

c) optionally, run requests against candlepin like (in rake console):

  4.times do
    Thread.new do
      begin
      org = Organization.first
      label = org.label
      iteration = 0
      cycle = 0
      while true        
        cycle += 1
        if cycle == 1000
          iteration += 1
          cycle = 0
          puts "=== iteration #{iteration}k"
        end
        Katello::Resources::Candlepin::Owner.get_ueber_cert(label)
      end
      rescue Exception => e
        puts e.message
        puts e.backtrace.join("\n")
        raise e
      end
    end
  end

3. Wait few hours and count number of timeouts


Actual results:
3. VERY depending on system, few to many timeouts happen


Expected results:
3. no timeouts


Additional info:

--- Additional comment from RHEL Product and Program Management on 2019-03-10 19:18:08 UTC ---

Since this bug report was entered in Red Hat Bugzilla, the 'sat-backlog' flag has been set to ? to ensure that it is properly evaluated for release.

--- Additional comment from RHEL Product and Program Management on 2019-03-10 19:18:08 UTC ---

Since this issue was entered in Red Hat Bugzilla, the pm_ack has been set to + automatically for the next planned release.

Comment 3 Nikhil Kathole 2019-04-11 09:22:47 UTC
VERIFIED

Version tested:
# rpm -q satellite
satellite-6.4.3-1.el7sat.noarch

No timeout seen with provided reproducer

I, [2019-04-11T05:06:38.925736 #19214]  INFO -- : === Thread 1: iteration 100k
I, [2019-04-11T05:06:41.913713 #19214]  INFO -- : === Thread 1: iteration 101k
I, [2019-04-11T05:06:45.022606 #19214]  INFO -- : === Thread 1: iteration 102k
I, [2019-04-11T05:06:48.103804 #19214]  INFO -- : === Thread 1: iteration 103k
I, [2019-04-11T05:06:51.326653 #19214]  INFO -- : === Thread 1: iteration 104k


I, [2019-04-11T05:19:39.623569 #1336]  INFO -- : === iteration 16k
I, [2019-04-11T05:19:40.514961 #1336]  INFO -- : === iteration 16k
I, [2019-04-11T05:20:22.603960 #1336]  INFO -- : === iteration 17k
I, [2019-04-11T05:20:24.864712 #1336]  INFO -- : === iteration 17k
I, [2019-04-11T05:20:25.063272 #1336]  INFO -- : === iteration 17k
I, [2019-04-11T05:20:26.603654 #1336]  INFO -- : === iteration 17k
I, [2019-04-11T05:21:08.424344 #1336]  INFO -- : === iteration 18k

Comment 5 errata-xmlrpc 2019-04-29 18:15:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0903


Note You need to log in before you can comment on or make changes to this bug.