Bug 809234 - Pulp::Repository: Request Timeout during promotion
Summary: Pulp::Repository: Request Timeout during promotion
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Content Management
Version: 6.0.0
Hardware: Unspecified
OS: Unspecified
unspecified
medium vote
Target Milestone: Unspecified
Assignee: Lukas Zapletal
QA Contact: Katello QA List
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2012-04-02 19:43 UTC by James Laska
Modified: 2019-09-26 13:33 UTC (History)
10 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When installing and preparing a CloudForms System Engine environment, re-using any previously cached non-Library pulp data may introduce failures during content promotion. Administrators are advised to start with a fresh file-system and not re-use any previously synced pulp data. Prior to installation, the directory /var/lib/pulp should not exist.
Clone Of:
Environment:
Last Closed: 2014-09-11 12:19:26 UTC
Target Upstream Version:


Attachments (Terms of Use)
katello-debug-20120402193318.tar.gz (405.34 KB, application/x-gzip)
2012-04-02 19:43 UTC, James Laska
no flags Details
katello-debug-20120405183708.tar.gz (425.45 KB, application/x-gzip)
2012-04-05 18:39 UTC, James Laska
no flags Details
Patch that adds Content-Length HTTP header to all Pulp controllers (1.00 KB, patch)
2012-04-11 12:23 UTC, Lukas Zapletal
no flags Details | Diff

Description James Laska 2012-04-02 19:43:06 UTC
Created attachment 574607 [details]
katello-debug-20120402193318.tar.gz

Description of problem:

Getting a pulp request timeout when attempting to promote a changeset that includes RHEL 5Server, 6Server and CloudForms.  See attached katello-debug log file for details.

It appears that a createrepo process is still running, even after the pulp timeout was received. 


Version-Release number of selected component (if applicable):
 * candlepin-0.5.26-1.el6.src.rpm
 * katello-0.1.307-1.el6.src.rpm
 * katello-candlepin-cert-key-pair-1.0-1.src.rpm
 * katello-certs-tools-1.0.4-1.el6.src.rpm
 * katello-cli-0.1.107-1.el6.src.rpm
 * katello-configure-0.1.107-1.el6.src.rpm
 * katello-qpid-broker-key-pair-1.0-1.src.rpm
 * katello-qpid-client-key-pair-1.0-1.src.rpm
 * katello-selinux-0.1.10-1.el6.src.rpm
 * pulp-1.0.2-1.el6.src.rpm


How reproducible:
 * I did *not* see this when running the *same* script on a system with 5G memory
 * I did see the problem when running the *same* script on a system with 8G memory
 * I did *not* see this when running the *same* script on a system with 24G memory


Steps to Reproduce:
1. katello -u admin -p admin provider synchronize --name "Red Hat"
Provider [ Red Hat ] synchronized

2. katello -u admin -p admin changeset create --name "02 - rht repos"
Successfully created changeset [ 02 - rht repos ] for environment [ Dev ]

3. katello -u admin -p admin changeset update --name "02 - rht repos" --add_product "Red Hat Enterprise Linux Server"
Successfully updated changeset [ 02 - rht repos ]

4. katello -u admin -p admin changeset update --name "02 - rht repos" --add_product "Red Hat CloudForms"
Successfully updated changeset [ 02 - rht repos ]

5. katello -u admin -p admin changeset promote --name "02 - rht repos"
  
Actual results:

Changeset [ 02 - rht repos ] promotion failed: Pulp::Repository: Request Timeout  (POST /pulp/api/repositories/redhat-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_5_Server_RPMs_x86_64_5Server/clone/)
Error (65): command failed.

Expected results:

Changeset [ 02 - rht repos ] promotion successfully

Additional info:

katello> changeset info --name "02 - rht repos"
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>                                                                                                 Changeset Info
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
> Id:               1
> Name:             02 - rht repos
> Description:
>     None
> Updated At:       2012/04/02 19:27:46
> State:            failed
> Environment Id:   3
> Environment Name: Dev
> Errata:
>     
> Products:
>     Red Hat Enterprise Linux Server
>     Red Hat CloudForms
> Packages:
>     
> Repositories:
>     
> System Templates:
>     
> Distributions:


katello> repo list --org redhat --environment Library
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>                                                                                  Repo List For Org redhat Environment Library
> 
>  Id     Name                                                           Package Count  
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  656    Red Hat Enterprise Linux 5 Server RPMs x86_64 5Server          13437          
>  677    Red Hat CloudForms Tools for RHEL 6 Beta RPMs x86_64 6Server   53             
>  760    Red Hat Enterprise Linux 6 Server RPMs x86_64 6Server          7061           
>  978    Red Hat CloudForms Tools for RHEL 6 RPMs x86_64 6Server        0              
>  1050   Red Hat CloudForms Tools for RHEL 5 RPMs x86_64 5Server        0              
>  1087   Red Hat CloudForms Tools for RHEL 5 Beta RPMs x86_64 5Server   54             
>  654    Red Hat Enterprise Linux 5 Server RPMs i386 5Server            10298          
>  1048   Red Hat CloudForms Tools for RHEL 5 RPMs i386 5Server          0              
>  1085   Red Hat CloudForms Tools for RHEL 5 Beta RPMs i386 5Server     54             
>  762    Red Hat Enterprise Linux 6 Server RPMs i386 6Server            5434           
>  980    Red Hat CloudForms Tools for RHEL 6 RPMs i386 6Server          0              
>  680    Red Hat CloudForms Tools for RHEL 6 Beta RPMs i386 6Server     50             
>  505    Red Hat CloudForms System Engine Beta RPMs x86_64 6Server      432            
>  601    Red Hat CloudForms Cloud Engine Beta RPMs x86_64 6Server       474            
>  592    Red Hat CloudForms Cloud Engine RPMs x86_64 6Server            0              
>  550    Red Hat CloudForms System Engine RPMs x86_64 6Server           0

Comment 1 James Laska 2012-04-02 20:17:59 UTC
Attempting to retry the failed promotion also fails ...

# katello -u admin -p admin changeset promote --name "02 - rht repos" 
Changeset [ 02 - rht repos ] promotion failed: undefined method `sync' for nil:NilClass

Comment 2 James Laska 2012-04-02 20:40:35 UTC
The re-promotion failure in comment#1 looks similar to bug#804610.  However, this changeset does not include repositories with non-ASCII characters in the name.

Comment 3 Lukas Zapletal 2012-04-03 14:41:06 UTC
So this is problem in our default setting - our Katello REST client has default timeout set to 30 seconds, and I guess in very overloaded scenarios this is not enough.

I suggest you to set

  rest_client_timeout: 150

in the katello.yml file and try to reproduce it.

Comment 4 Lukas Zapletal 2012-04-03 15:51:28 UTC
So this is what we have in our katello codebase:

vim /usr/share/katello/app/models/glue/pulp/repo.rb +145

    if self.is_cloned_in?(to_env)
      #repo is already cloned, so lets just re-sync it from its parent
      return self.get_clone(to_env).sync
    else
      #repo is not in the next environment yet, we have to clone it there
      ...
    end

While is_cloned_in? returns true, get_clone method returns nil. So Pulp reports a clone exists, but when Katello asks for it, it does not find it. In this case its:

redhat-Dev-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_5_Server_RPMs_x86_64_5Server

So for some reason Pulp has inconsitency - when repo was cloned its id was stored in cloned_ids array, but it was not created. I would like to ask pulp guys to confirm if this is done in one "transaction" so there will not be inconsistency like this.

Comment 5 James Laska 2012-04-03 15:56:17 UTC
cc'ing Pradeep.  Pradeep, can you answer the question from comment#4?

Comment 6 Pradeep Kilambi 2012-04-03 16:18:17 UTC
Which cloned_ids array are you referring to? the one on the parent repo? if so, pulp only stores the clone id in the cloned array after the clone repo id is created, so dont see a clone id exists in that errata and not exists. Unless when we delete the cloned repo we're not cleaning up the clone id of the parent repo's clone list? just a guess..

Comment 9 Lukas Zapletal 2012-04-04 09:10:01 UTC
So the issue is Pulp did not acceptour clone repository request and our REST client timed out after 150 seconds. 

Pulp::Repository: Request Timeout  (POST /pulp/api/repositories/redhat-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_5_Server_RPMs_x86_64_5Server/clone/)
[ERROR: 2012-04-02 19:27:46 #9421] /usr/lib/ruby/gems/1.8/gems/rest-client-1.6.1/lib/restclient/request.rb:174:in `transmit'

Please note from the comment above I can see Pulp actually cloned the repository. So we had a timeout, but it cloned it.

The issue is this should be asynchronous in Pulp, we are expecting immediate response with a job id. What is the reason it timed out? There is no error log in pulp logs. It could be a deadlock, or overload. So the solution would be:

Pulp team:

Investigate why Pulp locked for 150 seconds on an asynchronous REST request. I mean from the code perpective, is there anything that could block the request? Is it possible to return valid response instead letting it to timeout when this scenario is reached (e.g. mongo write block is catched)?

Katello team:

Write a script that would compare pulp and katello repositories called "katello-check". It's too difficuld to compare backend engines and katello database states.

Comment 11 Lukas Zapletal 2012-04-04 13:36:28 UTC
Maybe this one is relevant: https://bugzilla.redhat.com/show_bug.cgi?id=807720

Comment 12 James Laska 2012-04-05 18:39:34 UTC
Created attachment 575524 [details]
katello-debug-20120405183708.tar.gz

Retesting on a fresh installed system using:
 * pulp-1.0.4-1.el6.noarch
 * katello-0.1.309-1.el6.noarch

I am still seeing:

> katello -u admin -p admin provider synchronize --name "Red Hat"
> Provider [ Red Hat ] synchronized
> 
> katello -u admin -p admin changeset create --name "02 - rht repos"
> Successfully created changeset [ 02 - rht repos ] for environment [ Dev ]
> 
> katello -u admin -p admin changeset update --name "02 - rht repos" --add_product "Red Hat Enterprise Linux Server"
> Successfully updated changeset [ 02 - rht repos ]
> 
> katello -u admin -p admin changeset update --name "02 - rht repos" --add_product "Red Hat CloudForms"
> Successfully updated changeset [ 02 - rht repos ]
> 
> katello -u admin -p admin changeset promote --name "02 - rht repos"
> Changeset [ 02 - rht repos ] promotion failed: Pulp::Repository: Request Timeout  (POST /pulp/api/repositories/redhat-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_5_Server_RPMs_x86_64_5Server/clone/)


I see a genpkgmetadata.py (createrepo) is still inprogress, even after the failure.

> root     31299     1  0 17:01 ?        00:00:00 /usr/sbin/httpd
> apache   31755 31299 44 17:03 ?        00:42:19  \_ (wsgi:pulp)    
> apache    5125 31755 70 18:31 ?        00:04:40  |   \_ /usr/bin/python -t /usr/share/createrepo/genpkgmetadata.py --database --checksum sha -g /var/lib/pulp//repos/redhat/Dev/content/dist/rhel/server/5/5Server/x86_64/os/repodata/comps.xml --update /var/lib/pulp//repos/redhat/Dev/content/dist/rhel/server/5/5Server/x86_64/os

The system is available for remote debugging: flatline-katello.usersys.redhat.com (root/redhat)

Comment 13 Lukas Zapletal 2012-04-06 12:31:30 UTC
So the root cause is the same as the existing bug. I am marking this as a duplicate and reopening the original.

*** This bug has been marked as a duplicate of bug 773088 ***

Comment 15 Lukas Zapletal 2012-04-10 08:16:43 UTC
Clearing the clone flag.

Comment 16 Lukas Zapletal 2012-04-10 09:46:30 UTC
I thought it's a clone of a different (similar) bug, but it is not.

SUMMARY: After the content synchronization, new promotion is started. It seems for some reason something gets wrong in Pulp.

- 19:15 - the sync is finished:
- 19:25 - repodata generation starting (this is the last pulp log)
- 19:27 - katello is trying to start promotion and it fails

Unable to reproduce with smaller repos, it seems Pulp gets "overloaded". Must be reproduced with more content as James recommends.

Comment 18 Lukas Zapletal 2012-04-11 12:22:38 UTC
My first finding is Katello timed out AFTER httpd reported the response was returned. My wild guess would be Apache httpd, for some reason, kept the connection alive. Pulp does not set HTTP header field Content-Length, so Katello REST client must be waiting until the connection is closed.

I prepared a patch that adds this content-length header to all pulp controller responses. We will see if this helps.

@James - Please apply the patch and try to reproduce on your setup, thanks.

Comment 19 Lukas Zapletal 2012-04-11 12:23:34 UTC
Created attachment 576756 [details]
Patch that adds Content-Length HTTP header to all Pulp controllers

Comment 20 James Laska 2012-04-11 17:17:46 UTC
A few comments ...

1)  I'd like input from Lukas on whether this bug qualifies as an RC blocker.  I'm able to consistently hit this race condition on my setup.  Unfortunately, I don't know the parameters of the race condition, or how common it may be.  That makes me think this is worth reviewing in a broader context.

2) Regarding the pulp patch in comment#19 ... 

After installing the system, installing katello-all, applying the patch, starting my test (begins with katello-configure) ... I'm still seeing the failure.

> :: [   FAIL   ] :: Running 'katello -u admin -p admin changeset promote --name "Promote Red Hat provider"' (Expected 0, got 65)
> Changeset [ Promote Red Hat provider ] promotion failed: Pulp::Repository: Request Timeout  (POST /pulp/api/repositories/redhat-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_5_Server_RPMs_x86_64_5Server/clone/)
> Error (65): command failed.  Retry (r), Continue (c), stop (s)? s
> [root@flatline-katello ~]# date
> Wed Apr 11 17:12:56 EDT 2012


3) Is "CloudForms Common" "pymongo" the proper place for this bug?  Should it be assigned to pulp?

Comment 21 James Laska 2012-04-11 20:29:37 UTC
I'm rerunning my test on several previously affected systems ...

Comment 22 James Laska 2012-04-11 20:37:38 UTC
On flatline-katello.usersys.redhat.com ...

> ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
> :: [   LOG    ] :: Sync and promote 'Red Hat' provider
> ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
> 
> :: [   PASS   ] :: Running 'katello -u admin -p admin provider synchronize --name "Red Hat"'
> :: [   PASS   ] :: Running 'katello -u admin -p admin changeset create --name "Promote Red Hat provider"'
> :: [   PASS   ] :: Running 'katello -u admin -p admin changeset update --name "Promote Red Hat provider" --add_product "Red Hat Enterprise Linux Server"'
> :: [   PASS   ] :: Running 'katello -u admin -p admin changeset update --name "Promote Red Hat provider" --add_product "Red Hat CloudForms"'
> :: [   FAIL   ] :: Running 'katello -u admin -p admin changeset promote --name "Promote Red Hat provider"' (Expected 0, got 65)
Changeset [ Promote Red Hat provider ] promotion failed: Pulp::Repository: Request Timeout  (POST /pulp/api/repositories/redhat-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_5_Server_RPMs_x86_64_5Server/clone/)
> :: [   LOG    ] :: Duration: 1h 9m 45s
> :: [   LOG    ] :: Assertions: 4 good, 1 bad
> :: [   FAIL   ] :: RESULT: Sync and promote 'Red Hat' provider
> :: [   LOG    ] :: JOURNAL XML: /tmp/beakerlib-4dqVhGV/journal.xml
> :: [   LOG    ] :: JOURNAL TXT: /tmp/beakerlib-4dqVhGV/journal.txt

Comment 23 Lukas Zapletal 2012-04-12 08:15:37 UTC
I think this is a blocker, major feature is not working and you are able to reproduce it anytime we want. Correcting component, setting to "pulp". Thanks for retesting.

Comment 25 Lukas Zapletal 2012-04-19 06:58:44 UTC
I am not sure if this is related, but we are suffering from the same bug as here on the testing box:

[root@flatline-katello ~]# service httpd restart
Stopping httpd:                                            [FAILED]
Starting httpd: Syntax error on line 10 of /etc/httpd/conf.d/katello.conf:
SSLCertificateFile: file '/etc/candlepin/certs/candlepin-ca.crt' does not exist or is empty
                                                           [FAILED]
[root@flatline-katello ~]# ll /etc/candlepin/certs/candlepin-ca.crt
ls: cannot access /etc/candlepin/certs/candlepin-ca.crt: No such file or directory

https://bugzilla.redhat.com/show_bug.cgi?id=813812

Comment 26 Lukas Zapletal 2012-04-19 08:49:05 UTC
Another finding, during the long script run I noticed there are some errors in the pulp.log:

PulpException: Architecture must be one of [noarch, i386, i686, ppc64, ppc, s390x, x86_64, ia64]

Comment 28 Lukas Zapletal 2012-04-19 12:52:55 UTC
Tried to sync this manually using pulp-admin and it worked. We should do the same with REST API.

Sync for repository test_lzap2 started
Sync: Finished
0/13514 new items downloaded
13514/13514 existing items processed

Item Details: 
Tree Files: 15/15
RPMs: 13499/13499

Comment 30 James Laska 2012-04-20 22:43:22 UTC
(In reply to comment #29)
> James will be  updating the BZ later tonight after a test run of his completes
> but most likely this will get moved to 1.1

Turns out the cause of this bug was using a cached /var/lib/pulp directory structure.  In my testing, since I'm downloading >100G of data from the CDN, I keep /var/lib/pulp as a separate disk partition.  During re-installs, I leave  /var/lib/pulp unchanged.  This helps reduce the network lag downloading the content.

Unfortunately, this also introduces this bug.  The presence of non-Library environment files on disk was causing problems.  Thankfully, I'm able to continue to cache content to reduce network download delays.  However, I need to remove all non-Library environment files before sync and promotion.  The following commands will remove the offending previously cached content...

> shopt -s extglob
> for DIR in $(echo /var/lib/pulp/repos/redhat/!(Library) /var/lib/pulp/published/repos/redhat/!(Library))
> do
>     test -d "$DIR" && rm -rf "$DIR"
> done

This problem should not present on a fresh installed system with a fully formatted disk.

I'm adjusting the target release accordingly.  Pradeep suggested there was a difference in how pulp and katello were responding to the promotion request.  I'm leaving this bug open to further review the difference in case some additional polling is needed by katello to handle this situation.

Comment 31 Lukas Zapletal 2012-04-21 09:04:03 UTC
Uh. James please report all deviations from the standard installation in advance, because I really thought this is "clean" instance (without any re-sharing or something). ;-) I guess this worth sharing on list, because more folks may be doing this. Thanks!

Comment 34 James Laska 2012-04-23 14:20:13 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
When installing and preparing a CloudForms System Engine environment, re-using any previously cached non-Library pulp data may introduce failures during content promotion.  Administrators are advised to start with a fresh file-system and not re-use any previously synced pulp data.  Prior to installation, the directory /var/lib/pulp should not exist.

Comment 41 Bryan Kearney 2014-01-08 22:11:12 UTC
I did not see this in the training setup. Moving to ON_QA to verify.

Comment 43 Bryan Kearney 2014-09-11 12:19:26 UTC
This was delivered with Satellite 6.0 which was released on 10 September 2014.


Note You need to log in before you can comment on or make changes to this bug.