Bug 1287925

Summary:	/bin/sh /etc/cron.daily/rhsmd does not stop.
Product:	Red Hat Enterprise Linux 6	Reporter:	Yoshinori Takahashi <hkim>
Component:	subscription-manager	Assignee:	Chris Snyder <csnyder>
Status:	CLOSED ERRATA	QA Contact:	John Sefler <jsefler>
Severity:	high	Docs Contact:
Priority:	high
Version:	6.7	CC:	alikins, bcourt, csnyder, hkim, jgalipea, redakkan, tmraz, vrjain
Target Milestone:	rc	Keywords:	Reopened, Triaged
Target Release:	6.9
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-03-21 10:54:07 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1269194, 1355878

Comment 12 Adrian Likins 2016-01-07 16:06:39 UTC

root     10096  0.0  0.3 358240 19184 ?        SN   Oct14   0:00 /usr/bin/python /usr/libexec/rhsmd -s
root     15435  0.0  0.3 297812 15680 ?        S    Oct13   0:00 /usr/bin/python /usr/libexec/rhsmcertd-worker


I think pid 15435 (rhsmcertd-worker) may be causing a deadlock, trying to get the lock on /var/run/rhsm/cert.pid

If possible, could the customer take a look at /var/run/rhsm/cert.pid (whether it exists, and what its content is [it should just be a pid])?

Then kill 15435 (the rhsmcertd-worker process). That should allow rhsmd to reap the stale lock and get a new one, which should get rhsmd going again (at least enough to finish).
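
The check requested above could be scripted roughly as follows. This is only a sketch, assuming the lock file holds a single pid as described; the helper name lock_status is hypothetical and is not part of subscription-manager.

```shell
#!/bin/sh
# Hypothetical helper: report the state of a pid-file lock.
# Prints one of: "missing", "invalid", "stale <pid>", "held <pid>".
lock_status() (
    pidfile=$1
    [ -f "$pidfile" ] || { echo "missing"; exit 0; }
    # The file is expected to contain just a pid; strip all whitespace.
    pid=$(tr -d '[:space:]' < "$pidfile")
    case $pid in
        ''|*[!0-9]*) echo "invalid"; exit 0 ;;
    esac
    # kill -0 sends no signal, it only tests whether the process exists.
    # The /proc check covers processes we are not allowed to signal.
    if kill -0 "$pid" 2>/dev/null || [ -d "/proc/$pid" ]; then
        echo "held $pid"
    else
        echo "stale $pid"
    fi
)

lock_status /var/run/rhsm/cert.pid
```

A "held" result naming the rhsmcertd-worker pid would be consistent with the deadlock theory; "stale" would mean the owner is already gone.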

Comment 13 Adrian Likins 2016-01-07 16:36:08 UTC

That implies that there is still a blocking/locking bug, but it may be rhsmcertd-worker and not rhsmd.

Comment 21 vritant 2016-06-15 18:44:26 UTC

Yoshinori,
Apologies that this bug has not received attention recently. We have a potential fix, but we cannot be sure that it would solve the customer's issue, and it would not be a good customer experience to make them wait for it only to find out then whether it was a reliable fix. Unfortunately, the attached logs also do not give us enough information to be sure what the issue is.

0. Is this still an issue?

1. Did we try killing the rhsmcertd-worker process as Adrian suggested in comment 12?

2. Is this a recurring issue? That is, does this happen every time rhsmcertd starts, or was this a one-time occurrence?

3. If the answer to all of the above questions is yes, please provide an strace so we can find out why that process hangs every time. You could use the command:

strace -p `ps -C rhsmcertd -o pid=` -o rhsmcertd_trace.txt
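
If more than one rhsmcertd process is running, the backtick substitution above expands to several pids and strace would treat the extra ones as a command to run. A possible variant, as a sketch only: the helper strace_pid_args is hypothetical, and the -f (follow children) and -tt (timestamps) flags are additions that were not part of the original request. It prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: turn a whitespace-separated pid list
# into one "-p PID" pair per pid.
strace_pid_args() (
    out=""
    for pid in $1; do
        out="$out -p $pid"
    done
    # Drop the leading space left by the loop.
    echo "${out# }"
)

pids=$(ps -C rhsmcertd -o pid= 2>/dev/null)
if [ -n "$pids" ]; then
    # Printed for review; run the printed line as root to attach.
    echo strace -f -tt $(strace_pid_args "$pids") -o rhsmcertd_trace.txt
else
    echo "no rhsmcertd process found"
fi
```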

Comment 29 Chris Snyder 2016-09-14 20:26:40 UTC

I believe this to be fixed in versions of subscription-manager greater than or equal to 1.17.15-1 and python-rhsm version greater than or equal to 1.17.4-1.

If you are still running into this bug using these versions (or newer) of subscription-manager and python-rhsm please reopen this bug.

Thank you.
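
A rough way to check an installed system against the minimum versions named above. This is a sketch: version_ge is a hypothetical helper, and it relies on GNU `sort -V`, which approximates but is not identical to rpm's own version comparison.

```shell
#!/bin/sh
# Hypothetical helper: succeed when version $1 >= version $2.
# After a version sort, the smaller of the two comes first, so $1 >= $2
# exactly when $2 is the first line.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

check_pkg() {
    # $1 = package name, $2 = minimum version-release
    if installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' "$1" 2>/dev/null); then
        if version_ge "$installed" "$2"; then
            echo "$1 $installed: at or above $2"
        else
            echo "$1 $installed: older than $2"
        fi
    else
        echo "$1: not installed (or rpm unavailable)"
    fi
}

check_pkg subscription-manager 1.17.15-1
check_pkg python-rhsm 1.17.4-1
```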

Comment 31 Chris Snyder 2016-09-28 14:31:12 UTC

Subscription-manager 1.17.X and python-rhsm 1.17.X are being released with RHEL 7.3, and a few fixes have gone into those 1.17.X versions. Our next release for EL6 will be rebased from the previous release (1.17.X), so the fix for this bug should be included when we build subscription-manager 1.18.X for RHEL 6.9.

Comment 33 Chris Snyder 2016-10-20 19:44:00 UTC

Here are links to the bugs mentioned in comment 31:

- https://bugzilla.redhat.com/show_bug.cgi?id=1351370 - A fix to ensure rhsmd exits when an exception occurs during a call to a method exposed over dbus.

- https://bugzilla.redhat.com/show_bug.cgi?id=1346417


Other possibly related upstream issues / prs:
- https://github.com/candlepin/subscription-manager/issues/1006
    - See comment 5

- https://github.com/candlepin/python-rhsm/pull/170
    - A fix from awood allowing the socket timeout to be set in rhsm.conf (and elsewhere in the codebase).


The PR above added a default socket timeout of 180 seconds.
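
To see which timeout a client would actually use, something like the following could read it back from the config file. This is a sketch: the function name server_timeout is hypothetical, and it assumes the option is the server_timeout key in the [server] section of /etc/rhsm/rhsm.conf, with 180 used when the key is absent.

```shell
#!/bin/sh
# Hypothetical helper: print server_timeout from the [server] section of an
# rhsm.conf-style file, falling back to the default of 180 when it is unset.
server_timeout() (
    conf=$1
    value=$(awk -F= '
        # Track which section we are in.
        /^\[/ { in_server = ($0 ~ /^\[server\]/); next }
        # Match the key only inside [server]; commented lines do not match.
        in_server && $1 ~ /^[ \t]*server_timeout[ \t]*$/ {
            gsub(/[ \t]/, "", $2); print $2; exit
        }' "$conf" 2>/dev/null)
    echo "${value:-180}"
)

server_timeout /etc/rhsm/rhsm.conf
```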

Hopefully this is helpful to QA!

Cheers.

Comment 34 Rehana 2017-01-06 12:49:43 UTC

Since this bug does not have a direct reproducer, we believe that the fixes for the following issues resolved it:
1) Bug 1351370 - [ERROR] subscription-manager:31276 @dbus_interface.py:60 - org.freedesktop.DBus.Python.OSError: Traceback
2) Bug 1346417 - [RFE] Allow users to set socket timeout.


1) Demonstrating that the "OSError" no longer happens on RHEL 6.9 with the build:
python-rhsm-1.18.6-1.el6.x86_64 
python-rhsm-certificates-1.18.6-1.el6.x86_64
subscription-manager-firstboot-1.18.6-1.el6.x86_64
subscription-manager-migration-data-2.0.32-1.el6.noarch
subscription-manager-debuginfo-1.18.6-1.el6.x86_64
subscription-manager-1.18.6-1.el6.x86_64
subscription-manager-plugin-container-1.18.6-1.el6.x86_64
subscription-manager-migration-1.18.6-1.el6.x86_64
subscription-manager-gui-1.18.6-1.el6.x86_64

#  cp -R /etc/pki/product-default/ /tmp/
# ls -R /tmp/product-default/
/tmp/product-default/:
69.pem

[root@dhcp35-181 tmp]# subscription-manager config --rhsm.productcertdir=/tmp/product-default/
[root@dhcp35-181 tmp]# subscription-manager clean
All local data removed

rhsm.log:
======
2017-01-06 07:14:02,817 [INFO] subscription-manager:5292:MainThread @managercli.py:389 - Client Versions: {'python-rhsm': '1.18.6-1.el6', 'subscription-manager': '1.18.6-1.el6'}
2017-01-06 07:14:02,820 [INFO] subscription-manager:5292:MainThread @managerlib.py:879 - Cleaned local data
2017-01-06 07:14:02,996 [INFO] rhsmd:5294:MainThread @rhsmd:261 - rhsmd started
2017-01-06 07:14:03,729 [INFO] subscription-manager-gui:32419:CertMonitorThread @connection.py:758 - Connection built: host=F21-candlepin.usersys.redhat.com port=8443 handler=/candlepin auth=identity_cert ca_dir=/etc/rhsm/ca/ insecure=False
2017-01-06 07:14:03,857 [INFO] rhsmd:5296:MainThread @rhsmd:261 - rhsmd started

# ps -aux | grep "rhsmd"
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root      5584  0.0  0.0 103332   816 pts/1    R+   07:36   0:00 grep rhsmd

^^ Verified that NO OSError was occurring and "rhsmd" service was running

2) Verifying with an existing non-responsive entitlement server setup (please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1346417#c11 to set up a non-responsive server)

[root@auto-services ncat_listener]# systemctl is-active ncat_listener.service
active
 
^^ making sure that the service is active on the server

Retesting with the version:
# subscription-manager version
server type: This system is currently not registered.
subscription management server: 0.9.51.20-1
subscription management rules: 5.15.1
subscription-manager: 1.18.6-1.el6
python-rhsm: 1.18.6-1.el6

Now copying the server cert and trying to register the client against the non-responsive server; the expected result is a timeout within the specified period.
 
# scp root.redhat.com:/root/ncat_listener/ncat_listener.pem /etc/rhsm/ca/
root.redhat.com's password: 
ncat_listener.pem    100% 1935     1.9KB/s   00:00

# chmod 0644 /etc/rhsm/ca/ncat_listener.pem

# subscription-manager config --server.hostname=auto-services.usersys.redhat.com --server.port=8884

# subscription-manager config --server.server_timeout=20

# time subscription-manager register --username=foo --password=bar
Registering to: auto-services.usersys.redhat.com:8884/subscription
Unable to verify server's identity: 

real	0m21.397s
user	0m0.224s
sys	0m0.055s


After a real time of 21.397s, the subscription-manager command timed out against the non-responsive server (auto-services.usersys.redhat.com:8884).
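
The manual timing check above could be automated along these lines. This is only a sketch: elapsed_seconds and within_timeout are hypothetical helpers, and the 5-second slack is an arbitrary assumption rather than a documented tolerance.

```shell
#!/bin/sh
# Hypothetical helper: run a command, discard its output, and print
# how many whole seconds it took.
elapsed_seconds() (
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
)

# Hypothetical helper: succeed when the elapsed time ($1) is at least the
# timeout ($2) and at most the timeout plus some slack ($3).
within_timeout() {
    [ "$1" -ge "$2" ] && [ "$1" -le $(($2 + $3)) ]
}

# Example with the figures from this comment (21s observed, 20s configured):
if within_timeout 21 20 5; then
    echo "timed out within the expected window"
fi
```

On a test client the first argument would come from something like `elapsed_seconds subscription-manager register --username=foo --password=bar`.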


Conclusion :
With bugs 1351370 and 1346417 verified, it is confirmed that the "rhsmd" service no longer hangs after any error
on:
subscription management server: 0.9.51.20-1
subscription management rules: 5.15.1
subscription-manager: 1.18.6-1.el6
python-rhsm: 1.18.6-1.el6

Marking as Verified!!

Comment 36 errata-xmlrpc 2017-03-21 10:54:07 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0698.html