Bug 677735 - Pulp process becomes unresponsive after a few days
Summary: Pulp process becomes unresponsive after a few days
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Pulp
Classification: Retired
Component: z_other
Version: unspecified
Hardware: All
OS: Linux
unspecified
high
Target Milestone: ---
: Sprint 26
Assignee: James Slagle
QA Contact: Preethi Thomas
URL:
Whiteboard:
Depends On:
Blocks: 688298
TreeView+ depends on / blocked
 
Reported: 2011-02-15 17:36 UTC by Jason Connor
Modified: 2012-02-24 20:18 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:


Attachments (Terms of Use)
pulp server log file (12.35 KB, text/plain)
2011-02-15 17:36 UTC, Jason Connor
no flags Details
httpd access logs (8.49 KB, application/octet-stream)
2011-02-15 17:44 UTC, Mike McCune
no flags Details
Full traceback from corruption error in comment #10 (68.46 KB, application/octet-stream)
2011-06-16 13:01 UTC, Jay Dobies
no flags Details

Description Jason Connor 2011-02-15 17:36:15 UTC
Created attachment 478934 [details]
pulp server log file

Description of problem:
After a few days of running, the pulp wsgi daemon process becomes unresponsive

Version-Release number of selected component (if applicable):
commit 3487ce737bb937e7c010fdae16f1edb8f3748fe0

How reproducible:
Seemingly very (more than 1 complaint)

Steps to Reproduce:
1. install pulp in developer mode
2. let run for a few days w/ no operations
3. try any query to the pulp api
  
Actual results:
response times out and is logged by apache

Expected results:
valid queries to return results, invalid to return errors

Additional info:
even querying an invalid resource such as https://localhost/pulp/api/some-invalid-handler/ does not return the expected 404

the process seems to be alive:
$ ps aux | grep pulp
apache    1637  0.6  1.6 512952 16600 ?        Sl   10:07   0:00 (wsgi:pulp)

apache log entry:
[Tue Feb 15 07:58:29 2011] [error] [client 127.0.0.1] Script timed out before returning headers: webservices.wsgi

verified on fedora 13

Comment 1 Mike McCune 2011-02-15 17:43:56 UTC
Note: Restarting qpidd doesn't help when the server is in this stuck state

Comment 2 Mike McCune 2011-02-15 17:44:39 UTC
Created attachment 478937 [details]
httpd access logs

Comment 3 Jeff Ortel 2011-02-21 15:55:56 UTC
Seeing this on Preethi's server that went unresponsive:

[Sun Feb 20 03:13:02 2011] [warn] RSA server certificate is a CA certificate (BasicConstraints: CA == TRUE !?)
[Sun Feb 20 03:16:29 2011] [error] [client 10.11.231.129] Script timed out before returning headers: webservices.wsgi
[Sun Feb 20 07:16:30 2011] [error] [client 10.11.231.129] Script timed out before returning headers: webservices.wsgi
[Sun Feb 20 11:16:30 2011] [error] [client 10.11.231.129] Script timed out before returning headers: webservices.wsgi
[Sun Feb 20 15:16:31 2011] [error] [client 10.11.231.129] Script timed out before returning headers: webservices.wsgi
[Sun Feb 20 19:16:31 2011] [error] [client 10.11.231.129] Script timed out before returning headers: webservices.wsgi
[Sun Feb 20 23:16:31 2011] [error] [client 10.11.231.129] Script timed out before returning headers: webservices.wsgi
[Mon Feb 21 03:16:32 2011] [error] [client 10.11.231.129] Script timed out before returning headers: webservices.wsgi
[Mon Feb 21 07:16:32 2011] [error] [client 10.11.231.129] Script timed out before returning headers: webservices.wsgi
[Mon Feb 21 09:37:20 2011] [error] [client 10.11.231.129] Script timed out before returning headers: webservices.wsgi

Comment 4 Preethi Thomas 2011-03-07 14:26:26 UTC
Saw this today on my server 
error_log

7f371f139000-7f371f1bb000 rw-p 00000000 00:00 0 
7f371f1bb000-7f371f236000 rw-s 00000000 00:04 4314012                    /dev/zero (deleted)
7f371f236000-7f371f23d000 rw-p 00000000 00:00 0 
7f371f24b000-7f371f24e000 rw-p 00000000 00:00 0 
7f371f24e000-7f371f24f000 rw-s 00000000 fd:00 1703966                    /tmp/ffiHdgwve (deleted)
7f371f24f000-7f371f250000 r-xs 00000000 fd:00 1703966                    /tmp/ffiHdgwve (deleted)
7f371f250000-7f371f251000 rw-p 00000000 00:00 0 
7f371f251000-7f371f252000 r--p 0001e000 fd:00 268775                     /lib64/ld-2.13.so
7f371f252000-7f371f253000 rw-p 0001f000 fd:00 268775                     /lib64/ld-2.13.so
7f371f253000-7f371f254000 rw-p 00000000 00:00 0 
7f371f254000-7f371f2a7000 r-xp 00000000 fd:00 2501636                    /usr/sbin/httpd
7f371f4a6000-7f371f4a9000 r--p 00052000 fd:00 2501636                    /usr/sbin/httpd
7f371f4a9000-7f371f4aa000 rw-p 00055000 fd:00 2501636                    /usr/sbin/httpd
7f371f4aa000-7f371f4ae000 rw-p 00000000 00:00 0 
7f371f896000-7f37201ea000 rw-p 00000000 00:00 0                          [heap]
7fffc9ad8000-7fffc9b15000 rw-p 00000000 00:00 0                          [stack]
7fffc9bff000-7fffc9c00000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
[Sun Mar 06 03:14:52 2011] [notice] child pid 15759 exit signal Aborted (6)
*** glibc detected *** (wsgi:pulp)    : malloc(): memory corruption: 0x00007f37201a2c10 ***

Comment 5 Preethi Thomas 2011-03-07 14:28:14 UTC
in ssl_error_log

==> /var/log/httpd/ssl_access_log <==
10.11.231.129 - - [06/Mar/2011:05:38:44 -0500] "POST /pulp/api//consumers/preethi/profile/ HTTP/1.1" 500 627
10.11.231.129 - - [06/Mar/2011:09:38:44 -0500] "POST /pulp/api//consumers/preethi/profile/ HTTP/1.1" 500 627
10.11.231.129 - - [06/Mar/2011:13:38:45 -0500] "POST /pulp/api//consumers/preethi/profile/ HTTP/1.1" 500 627
10.11.231.129 - - [06/Mar/2011:17:38:45 -0500] "POST /pulp/api//consumers/preethi/profile/ HTTP/1.1" 500 627
10.11.231.129 - - [06/Mar/2011:21:38:45 -0500] "POST /pulp/api//consumers/preethi/profile/ HTTP/1.1" 500 627
10.11.231.129 - - [07/Mar/2011:01:38:45 -0500] "POST /pulp/api//consumers/preethi/profile/ HTTP/1.1" 500 627
10.11.231.129 - - [07/Mar/2011:05:38:46 -0500] "POST /pulp/api//consumers/preethi/profile/ HTTP/1.1" 500 627
10.11.231.129 - - [07/Mar/2011:09:05:13 -0500] "PUT /pulp/api//roles/ HTTP/1.1" 500 627
10.11.231.129 - - [07/Mar/2011:09:08:13 -0500] "GET /pulp/api//repositories/ HTTP/1.1" 500 627

Comment 7 Jason Connor 2011-04-04 16:02:36 UTC
pulp 0.0.151
fedora 13

[Sun Apr 03 03:33:12 2011] [notice] Digest: generating secret for digest authentication ...
[Sun Apr 03 03:33:12 2011] [notice] Digest: done
[Sun Apr 03 03:33:13 2011] [error] python_init: Python version mismatch, expected '2.6', found '2.6.4'.
[Sun Apr 03 03:33:13 2011] [error] python_init: Python executable found '/usr/bin/python'.
[Sun Apr 03 03:33:13 2011] [error] python_init: Python path being used '/usr/lib64/python26.zip:/usr/lib64/python2.6/:/usr/lib64/python2.6/plat-linux2:/usr/lib64/python2.6/lib-tk:/usr/lib64/python2.6/lib-old:/usr/lib64/python2.6/lib-dynload'.
[Sun Apr 03 03:33:13 2011] [notice] mod_python: Creating 4 session mutexes based on 256 max processes and 0 max threads.
[Sun Apr 03 03:33:13 2011] [notice] mod_python: using mutex_directory /tmp 
[Sun Apr 03 03:33:13 2011] [notice] SSL FIPS mode disabled
[Sun Apr 03 03:33:13 2011] [notice] Apache/2.2.17 (Unix) DAV/2 mod_auth_kerb/5.4 mod_auth_pgsql/2.0.3 PHP/5.3.5 mod_python/3.3.1 Python/2.6.4 mod_ssl/2.2.17 OpenSSL/1.0.0d-fips mod_wsgi/3.1 mod_perl/2.0.4 Perl/v5.10.1 configured -- resuming normal operations
[Sun Apr 03 03:33:16 2011] [notice] child pid 1988 exit signal Segmentation fault (11)
[Sun Apr 03 03:33:17 2011] [notice] child pid 2016 exit signal Segmentation fault (11)
Fatal Python error: GC object already tracked
[Sun Apr 03 03:33:18 2011] [error] mod_wsgi (pid=2023): Target WSGI script '/srv/pulp/bootstrap.wsgi' cannot be loaded as Python module.
[Sun Apr 03 03:33:18 2011] [error] mod_wsgi (pid=2023): Exception occurred processing WSGI script '/srv/pulp/bootstrap.wsgi'.
[Sun Apr 03 03:33:18 2011] [notice] child pid 2023 exit signal Aborted (6)
[Sun Apr 03 03:33:19 2011] [notice] child pid 2031 exit signal Segmentation fault (11)
*** glibc detected *** (wsgi:pulp)    : malloc(): memory corruption: 0x00007f01236078d0 ***

Comment 8 Jason Connor 2011-04-04 16:03:53 UTC
ssl_error_log

[Mon Apr 04 09:31:06 2011] [error] [client 127.0.0.1] Script timed out before returning headers: webservices.wsgi

Comment 9 Jason Connor 2011-04-05 15:57:36 UTC
abrtd log


Apr  3 03:30:01 jrist python: abrt: detected unhandled Python exception in /usr/lib/python2.6/site-packages/pulp-0.0.6-py2.6.egg/pulp/auditing.pyc
Apr  3 03:30:01 jrist abrtd: dumpsocket: New client connected
Apr  3 03:30:01 jrist abrtd: dumpsocket: Saved Python crash dump of pid 1676 to /var/spool/abrt/pyhook-1301823001-1676
Apr  3 03:30:01 jrist abrtd: dumpsocket: Socket client disconnected
Apr  3 03:30:01 jrist abrtd: Directory 'pyhook-1301823001-1676' creation detected
Apr  3 03:30:01 jrist abrtd: Executable '/usr/lib/python2.6/site-packages/pulp-0.0.6-py2.6.egg/pulp/auditing.pyc' doesn't belong to any package
Apr  3 03:30:01 jrist abrtd: Corrupted or bad crash /var/spool/abrt/pyhook-1301823001-1676 (res:4), deleting
Apr  3 03:30:01 jrist python: abrt: detected unhandled Python exception in /workspace/pulp/src/pulp/server/auditing.pyc
Apr  3 03:30:01 jrist abrtd: dumpsocket: New client connected
Apr  3 03:30:01 jrist abrtd: dumpsocket: Saved Python crash dump of pid 1678 to /var/spool/abrt/pyhook-1301823001-1678
Apr  3 03:30:01 jrist abrtd: dumpsocket: Socket client disconnected
Apr  3 03:30:01 jrist abrtd: Directory 'pyhook-1301823001-1678' creation detected
Apr  3 03:30:01 jrist abrtd: Executable '/workspace/pulp/src/pulp/server/auditing.pyc' doesn't belong to any package
Apr  3 03:30:01 jrist abrtd: Corrupted or bad crash /var/spool/abrt/pyhook-1301823001-1678 (res:4), deleting

Comment 10 Jay Dobies 2011-06-16 13:00:34 UTC
Not sure if it's related or not, but throwing this in here since it looks similar. This was seen on Pulp builds around .190 (maybe a few back by the download service team) by two different people:


Backtrace in /var/log/httpd/error_log.

[Thu Jun 16 17:50:55 2011] [error]   import md5
*** glibc detected *** (wsgi:pulp)    : double free or corruption (!prev):
0x00007fa1e17a8d10 ***

The full traceback is in the attachment comment_10_full_trace

Comment 11 Jay Dobies 2011-06-16 13:01:20 UTC
Created attachment 505032 [details]
Full traceback from corruption error in comment #10

Comment 12 Kedar Bidarkar 2011-06-27 08:59:34 UTC
Still persists with pulp - 194 and rhui 2.0.32


/var/log/httpd/ssl_error_log
[Mon Jun 27 03:47:55 2011] [error] [client 10.122.70.115] Script timed out before returning headers: webservices.wsgi

Connecting to RHUA [ip-10-122-70-115.ec2.internal]...
Successfully connected to [ip-10-122-70-115.ec2.internal]
Unexpected error caught at the shell level
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/rhui/tools/shell.py", line 75, in safe_listen
    self.listen(clear=first_run)
  File "/usr/lib/python2.6/site-packages/rhui/tools/shell.py", line 94, in listen
    Shell.listen(self)
  File "/usr/lib/python2.6/site-packages/rhui/common/shell.py", line 191, in listen
    item.func(*args, **item.kwargs)
  File "/usr/lib/python2.6/site-packages/rhui/tools/screens/sync.py", line 89, in cds_display
    self._sync_display('-= CDS Synchronization Status =-', self.pulp.cds_with_sync_status, True)
  File "/usr/lib/python2.6/site-packages/rhui/tools/screens/sync.py", line 268, in _sync_display
    data_sets = retrieve_func()
  File "/usr/lib/python2.6/site-packages/rhui/tools/pulp_api.py", line 518, in cds_with_sync_status
    cds_list = self.cds_list()
  File "/usr/lib/python2.6/site-packages/rhui/tools/pulp_api.py", line 509, in cds_list
    raise e
ServerRequestError: (500, '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>500 Internal Server Error</title>\n</head><body>\n<h1>Internal Server Error</h1>\n<p>The server encountered an internal error or\nmisconfiguration and was unable to complete\nyour request.</p>\n<p>Please contact the server administrator,\n root@localhost and inform them of the time the error occurred,\nand anything you might have done that may have\ncaused the error.</p>\n<p>More information about this error may be available\nin the server error log.</p>\n<hr>\n<address>Apache/2.2.15 (Red Hat) Server at ip-10-122-70-115.ec2.internal Port 443</address>\n</body></html>\n', None)

Comment 13 Jay Dobies 2011-07-12 17:46:05 UTC
James - this should be fixed with the mod_python removal you did. Can you comment on that and pitch this to MODIFIED?

Comment 14 James Slagle 2011-07-12 20:23:01 UTC
The memory corruption and segmentation fault errors were tracked down to a logrotate cron process that was sending a SIGHUP signal to apache.  Unfortunately, when both mod_python and mod_wsgi are loaded modules in apache, these errors ocurred.  There is further information provided at http://code.google.com/p/modwsgi/wiki/FrequentlyAskedQuestions

Now that we've moved off mod_python and rely solely on mod_wsgi, this issue is resolved.

Comment 15 Jeff Ortel 2011-07-14 20:34:30 UTC
Build 0.206, CR14.

Comment 16 Sachin Ghai 2011-07-19 07:08:56 UTC
I think following logs are related to this defect, so reporting this here.

I'm using following build on fresh installed setup:
pulp 0.208
rh-rhui-tools 2.0.40

The system was idle since last night and today when I started the rhui-manager, it was unresponsive for a long time and finally got unexpected error on UI. I got following in rhui.log:


rhui.log:
============
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/rhui/tools/shell.py", line 85, in safe_listen
    self.listen(clear=first_run)
  File "/usr/lib/python2.6/site-packages/rhui/tools/shell.py", line 111, in listen
    Shell.listen(self)
  File "/usr/lib/python2.6/site-packages/rhui/common/shell.py", line 191, in listen
    item.func(*args, **item.kwargs)
  File "/usr/lib/python2.6/site-packages/rhui/tools/screens/sync.py", line 79, in repo_display
    self._sync_display('-= Repository Synchronization Status =-', self.pulp.repos_with_sync_status, False)
  File "/usr/lib/python2.6/site-packages/rhui/tools/screens/sync.py", line 272, in _sync_display
    data_sets = retrieve_func()
  File "/usr/lib/python2.6/site-packages/rhui/tools/pulp_api.py", line 219, in repos_with_sync_status
    redhat_repos = self.redhat_repo_list()
  File "/usr/lib/python2.6/site-packages/rhui/tools/pulp_api.py", line 164, in redhat_repo_list
    raise e
ServerRequestError: (500, '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>500 Internal Server Error</title>\n</head><body>\n<h1>Internal Server Error</h1>\n<p>The server encountered an internal error or\nmisconfiguration and was unable to complete\nyour request.</p>\n<p>Please contact the server administrator,\n root@localhost and inform them of the time the error occurred,\nand anything you might have done that may have\ncaused the error.</p>\n<p>More information about this error may be available\nin the server error log.</p>\n<hr>\n<address>Apache/2.2.15 (Red Hat) Server at dhcp201-175.englab.pnq.redhat.com Port 443</address>\n</body></html>\n', None)


I checked the "pulp-server" status on rhua and found following:

[root@dhcp201-175 ~]# service pulp-server status
mongod dead but subsys locked
[root@dhcp201-175 ~]# 

[root@dhcp201-175 ~]# service mongod status
mongod dead but subsys locked
[root@dhcp201-175 ~]# 


error_log says:
=====================
[root@dhcp201-175 ~]# tail -n 100 /var/log/httpd/error_log
IOError: [Errno 32] Broken pipe
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/usr/lib64/python2.6/runpy.py", line 122, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.6/runpy.py", line 34, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.6/site-packages/grinder/activeobject.py", line 341, in <module>
    main()
  File "/usr/lib/python2.6/site-packages/grinder/activeobject.py", line 338, in main
    process()
  File "/usr/lib/python2.6/site-packages/grinder/activeobject.py", line 313, in process
    sys.stdout.flush()
Traceback (most recent call last):
  File "/usr/lib64/python2.6/runpy.py", line 122, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
IOError  File "/usr/lib64/python2.6/runpy.py", line 34, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.6/site-packages/grinder/activeobject.py", line 341, in <module>
    main()
  File "/usr/lib/python2.6/site-packages/grinder/activeobject.py", line 338, in main
    process()
  File "/usr/lib/python2.6/site-packages/grinder/activeobject.py", line 313, in process
    sys.stdout.flush()
: [Errno 32] Broken pipe
IOError: [Errno 32] Broken pipe


ssl_error_log says:
======================
[Tue Jul 19 11:46:47 2011] [error] [client 10.65.201.133] Script timed out before returning headers: webservices.wsgi

Pulp 0.208 build is installed and without mod_python package.:
==============================================================

[root@dhcp201-175 ~]# rpm -qa | grep pulp
pulp-common-0.0.208-1.el6.noarch
pulp-0.0.208-1.el6.noarch
pulp-client-0.0.208-1.el6.noarch
[root@dhcp201-175 ~]# rpm -qa | grep mod_python
[root@dhcp201-175 ~]# rpm -qa | grep mod_wsgi
mod_wsgi-3.2-3.sslpatch.el6.x86_64
[root@dhcp201-175 ~]# 


[root@dhcp201-133 ~]# rpm -qa | grep rh-rhui-tools
rh-rhui-tools-2.0.40-1.el6.noarch


James, could you please take a look ? Not sure if its related to original defect.

Comment 17 James Slagle 2011-07-26 19:29:15 UTC
The traceback from https://bugzilla.redhat.com/show_bug.cgi?id=677735#c16 is not related to this bug.  A new bug was filed for that issue (https://bugzilla.redhat.com/show_bug.cgi?id=723534).

Comment 18 Preethi Thomas 2011-09-22 18:30:34 UTC
moving to verified as I have not seen this error

root@preethi ~]# rpm -q pulp
pulp-0.0.233-1.fc15.noarch

Comment 19 Preethi Thomas 2012-02-24 20:17:01 UTC
Pulp v1.0 is released
Closed Current Release.

Comment 20 Preethi Thomas 2012-02-24 20:18:08 UTC
Pulp v1.0 is released.


Note You need to log in before you can comment on or make changes to this bug.