Bug 1687190

Summary: sporadic timeouts in opening TCP connection prevent Satellite upgrades
Product: Red Hat Satellite
Reporter: Pavel Moravec <pmoravec>
Component: Upgrades
Assignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED ERRATA
QA Contact: Nikhil Kathole <nkathole>
Severity: high
Priority: high
Version: 6.4.2
CC: egolov, inecas, ktordeur, mbacovsk, sthirugn
Target Milestone: 6.5.0
Keywords: FieldEngineering, PrioBumpField, Triaged
Target Release: Unused
Hardware: x86_64
OS: Linux
Fixed In Version: foreman-1.20.1.25-1
Doc Type: If docs needed, set a value
Clone Of:
: 1688317 (view as bug list)
Last Closed: 2019-05-14 12:40:24 UTC
Type: Bug
Bug Blocks: 1688317    

Description Pavel Moravec 2019-03-10 19:18:00 UTC
Description of problem:
Upgrading Satellite (e.g. from 6.3 to 6.4) can fail in a random rake script with a backtrace like:

RestClient::Exceptions::OpenTimeout: Timed out connecting to server
/opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.1/lib/restclient/request.rb:731:in `rescue in transmit'
/opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.1/lib/restclient/request.rb:642:in `transmit'
/opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.1/lib/restclient/request.rb:145:in `execute'
/opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.1/lib/restclient/request.rb:52:in `execute'
/opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.1/lib/restclient/resource.rb:67:in `post'
/opt/theforeman/tfm/root/usr/share/gems/gems/runcible-2.8.1/lib/runcible/base.rb:92:in `get_response'
/opt/theforeman/tfm/root/usr/share/gems/gems/runcible-2.8.1/lib/runcible/base.rb:72:in `call'
/opt/theforeman/tfm/root/usr/share/gems/gems/runcible-2.8.1/lib/runcible/resources/repository.rb:147:in `unit_search'
/opt/theforeman/tfm/root/usr/share/gems/gems/runcible-2.8.1/lib/runcible/extensions/repository.rb:199:in `distributions'
/opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.7.0.46/app/models/katello/repository.rb:620:in `import_distribution_data'
/opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.7.0.46/app/models/katello/repository.rb:615:in `block in import_distributions'
/opt/theforeman/tfm-ror51/root/usr/share/gems/gems/activerecord-5.1.6/lib/active_record/relation/delegation.rb:39:in `each'
/opt/theforeman/tfm-ror51/root/usr/share/gems/gems/activerecord-5.1.6/lib/active_record/relation/delegation.rb:39:in `each'

(the remainder of the backtrace depends on the use case / which rake script failed, e.g. a few times it was:

/opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.7.0.46/app/models/katello/repository.rb:614:in `import_distributions'
/opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.7.0.46/lib/katello/tasks/upgrades/3.7/make_all_ks_repos_bootable.rake:7:in `block (4 levels) in <top (required)>'
/opt/rh/rh-ruby24/root/usr/share/gems/gems/rake-12.0.0/exe/rake:27:in `<top (required)>'

)

It usually happens in a rake script that issues REST requests in a tight loop. Occurrences are very random, and more frequent on 6.5.


inecas++ found a fix:

diff --git a/app/models/setting.rb b/app/models/setting.rb
index 72a094024aa135fa41e0f6454d07c1d79fa3ac0a..e19ad55b24afb74c83f9e3033db2756541830fab 100644
--- a/app/models/setting.rb
+++ b/app/models/setting.rb
@@ -93,12 +93,8 @@ class Setting < ApplicationRecord
   def self.[](name)
     name = name.to_s
     cache_value = Setting.cache.read(name)
-    if cache_value.nil?
-      value = find_by(:name => name).try(:value)
-      Setting.cache.write(name, value)
-      return value
-    else
-      return cache_value
+    cache.fetch(name) do
+      find_by(:name => name).try(:value)
     end
   end

that really works.
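
For illustration, here is a minimal standalone sketch of what the change buys, assuming a plain ActiveSupport::Cache::MemoryStore in place of Foreman's actual Setting.cache backend and a fake lookup that returns nil: with the old read-then-write pattern a nil setting value never looks like a cache hit, so every Setting[] call performs a DB find_by plus a cache write, whereas cache.fetch stores the nil entry once and serves it from the cache afterwards.

# minimal sketch; MemoryStore and the nil-returning lookup are stand-ins, not Foreman's real backend
require 'active_support'
require 'active_support/cache'

cache = ActiveSupport::Cache::MemoryStore.new
db_lookups = 0
find_value = lambda { db_lookups += 1; nil }  # simulates find_by(:name => name).try(:value) returning nil

# old pattern: a cached nil is indistinguishable from a miss, so every call "queries the DB" and rewrites the cache
3.times do
  value = cache.read('some_setting')
  if value.nil?
    value = find_value.call
    cache.write('some_setting', value)
  end
end
puts db_lookups   # => 3

# new pattern: fetch writes the nil entry once and returns it on later calls without running the block
db_lookups = 0
cache.clear
3.times { cache.fetch('some_setting') { find_value.call } }
puts db_lookups   # => 1

With many threads hammering Setting[] in a tight loop, skipping the per-call DB query and cache write is presumably what removes the stalls that let the TCP connect time out.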

This BZ is rather a tracker to ensure we fix it downstream and to decide in which places it needs to be fixed.


Version-Release number of selected component (if applicable):
Sat 6.4.2


How reproducible:
Very sporadically during an upgrade. A somewhat artificial reproducer found by Ivan is provided below.


Steps to Reproduce:
1. Decrease the Katello REST client timeout to hit the timeouts much sooner than after 1 hour:
/etc/foreman/plugins/katello.yaml :

  :rest_client_timeout: 30

2. Run any (or all) of the following scripts, which fire REST requests in a tight loop:

a) requests to 127.0.0.1/pub:
mkdir -p ~/repro
cat <<EOF | foreman-rake console | tee ~/repro/pub-$(date -Isec).log
   conf.return_format = ""
   log = Logger.new(STDOUT)
   log.info "=== process pid #{\$\$}"
   url = URI.parse('http://127.0.0.1/pub/')
   threads = 1.times.map do |i|
     Thread.new do
       log.info "Start thread no. #{i+1}"
       iteration = 0
       cycle = 0
       while true
         begin
           cycle += 1
           if cycle == 1000
             iteration += 1
             cycle = 0
             log.info "=== Thread #{i+1}: iteration #{iteration}k"
           end
           Setting['http_cache']
           Net::HTTP.get_response(url)
         rescue Exception => e
           log.info "Thread #{i+1}: error at iteration #{iteration * 1000 + cycle}"
           log.info "#{e.class} #{e.message}"
           log.info e.backtrace.join("\n")
           sleep(1)
         end
       end
     end
   end
   threads.each(&:join)
EOF

b) requests to pulp repo details:
cat <<EOF | foreman-rake console | tee ~/repro/pulp-$(date -Isec).log
   conf.return_format = ""
   log = Logger.new(STDOUT)
   pulp_id = Katello::Repository.first.pulp_id
   Katello.pulp_server.extensions.repository
   threads = 4.times.map do |i|
     Thread.new do
       log.info "Start thread no. #{i+1}"
       iteration = 0
       cycle = 0
       while true        
         begin
           cycle += 1
           if cycle == 1000
             iteration += 1
             cycle = 0
             log.info "=== iteration #{iteration}k"
           end
           Katello.pulp_server.extensions.repository.distributions(pulp_id)
         rescue Exception => e
           log.info "Error at iteration #{iteration * 1000 + cycle}"
           log.info "#{e.class} #{e.message}"
           log.info e.backtrace.join("\n")
         end
       end
     end
   end
   threads.each(&:join)
EOF

c) optionally, run requests against Candlepin like this (in a rake console):

  4.times do
    Thread.new do
      begin
        org = Organization.first
        label = org.label
        iteration = 0
        cycle = 0
        while true
          cycle += 1
          if cycle == 1000
            iteration += 1
            cycle = 0
            puts "=== iteration #{iteration}k"
          end
          Katello::Resources::Candlepin::Owner.get_ueber_cert(label)
        end
      rescue Exception => e
        puts e.message
        puts e.backtrace.join("\n")
        raise e
      end
    end
  end

3. Wait a few hours and count the number of timeouts.


Actual results:
3. Highly dependent on the system; anywhere from a few to many timeouts happen.


Expected results:
3. No timeouts.


Additional info:

Comment 10 Nikhil Kathole 2019-03-29 08:37:59 UTC
VERIFIED

Version tested: 
Satellite 6.5 snap 21

Able to reproduce with 6.4, but could not see any timeout on 6.5 snap 21 with the above reproducer steps.

I, [2019-03-29T04:18:54.060583 #5203]  INFO -- : === Thread 1: iteration 163k
I, [2019-03-29T04:18:56.548624 #5203]  INFO -- : === Thread 1: iteration 164k
I, [2019-03-29T04:18:59.073265 #5203]  INFO -- : === Thread 1: iteration 165k
I, [2019-03-29T04:19:01.414366 #5203]  INFO -- : === Thread 1: iteration 166k
I, [2019-03-29T04:19:03.754766 #5203]  INFO -- : === Thread 1: iteration 167k

I, [2019-03-29T04:35:39.720092 #12961]  INFO -- : === iteration 19k
I, [2019-03-29T04:35:40.208662 #12961]  INFO -- : === iteration 19k
I, [2019-03-29T04:35:41.687506 #12961]  INFO -- : === iteration 19k
I, [2019-03-29T04:36:22.321864 #12961]  INFO -- : === iteration 20k
I, [2019-03-29T04:36:22.376873 #12961]  INFO -- : === iteration 20k
I, [2019-03-29T04:36:23.768408 #12961]  INFO -- : === iteration 20k
I, [2019-03-29T04:36:25.250421 #12961]  INFO -- : === iteration 20k

Comment 13 errata-xmlrpc 2019-05-14 12:40:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:1222