Description of problem:
Test automation runs have started hitting this error recently.

Version-Release number of selected component (if applicable):
[root@jsefler-os6candlepin candlepin]# git log | head -1
commit c04ffb6b2716644c44355f88f03087c33da21d87

[root@jsefler-os6server ~]# subscription-manager version
server type: Red Hat Subscription Management
subscription management server: 0.9.47-1
subscription management rules: 5.14
subscription-manager: 1.14.2-1.git.1.6470ef0.el6
python-rhsm: 1.14.2-1.git.0.fafb02d.el6

How reproducible:
Not often - I do not know a repeatable use case.

Steps to Reproduce:
Pasted from a test automation log...

201503310319:14.101 - FINE: ssh root.redhat.com subscription-manager unregister
201503310319:17.043 - FINE: Stdout:
201503310319:17.043 - FINE: Stderr: Runtime Error could not execute statement at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse:2,102
201503310319:17.043 - FINE: ExitCode: 70
201503310319:17.044 - FINE: ssh root.redhat.com LINE_NUMBER=$(grep --line-number 'Making request:' /var/log/rhsm/rhsm.log | tail --lines=1 | cut --delimiter=':' --field=1); if [ -n "$LINE_NUMBER" ]; then tail -n +$LINE_NUMBER /var/log/rhsm/rhsm.log; fi;
201503310319:17.127 - WARNING: Last request from /var/log/rhsm/rhsm.log:
2015-03-31 03:19:17,873 [DEBUG] rhsmd:22939 @connection.py:494 - Making request: GET /candlepin/consumers/589555ff-42bf-45e4-9799-1419bc945006/compliance
2015-03-31 03:19:18,270 [DEBUG] rhsmd:22939 @connection.py:521 - Response: status=200, requestUuid=eabfdf97-edf7-44ac-91ee-815e82c350be
2015-03-31 03:19:18,339 [DEBUG] rhsmd:22939 @cache.py:272 - Started thread to write cache: /var/lib/rhsm/cache/entitlement_status.json
2015-03-31 03:19:19,130 [DEBUG] subscription-manager:22933 @connection.py:521 - Response: status=500
2015-03-31 03:19:19,130 [ERROR] subscription-manager:22933 @managercli.py:161 - Unregister failed
2015-03-31 03:19:19,130 [ERROR] subscription-manager:22933 @managercli.py:162 - Runtime Error
could not execute statement at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse:2,102
Traceback (most recent call last):
  File "/usr/share/rhsm/subscription_manager/managercli.py", line 1240, in _do_command
    managerlib.unregister(self.cp, self.identity.uuid)
  File "/usr/share/rhsm/subscription_manager/managerlib.py", line 788, in unregister
    uep.unregisterConsumer(consumer_uuid)
  File "/usr/lib64/python2.6/site-packages/rhsm/connection.py", line 990, in unregisterConsumer
    return self.conn.request_delete(method)
  File "/usr/lib64/python2.6/site-packages/rhsm/connection.py", line 614, in request_delete
    return self._request("DELETE", method, params)
  File "/usr/lib64/python2.6/site-packages/rhsm/connection.py", line 530, in _request
    self.validateResponse(result, request_type, handler)
  File "/usr/lib64/python2.6/site-packages/rhsm/connection.py", line 572, in validateResponse
    raise RestlibException(response['status'], error_msg)
RestlibException: Runtime Error could not execute statement at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse:2,102
John, are there candlepin logs available for this?
From the candlepin logs this appears to always be something like:

2015-03-31 07:23:14,999 [req=7d53040c-29a9-4250-b22e-bc76968affef, org=] INFO org.candlepin.common.filter.LoggingFilter - Request: verb=DELETE, uri=/candlepin/consumers/512a05fc-6d1c-4e10-a626-0f8e2a91cd1f
2015-03-31 07:23:15,052 [req=d144fe22-045b-4bee-90f6-ee65827c6d8b, org=] INFO org.candlepin.common.filter.LoggingFilter - Request: verb=GET, uri=/candlepin/consumers/512a05fc-6d1c-4e10-a626-0f8e2a91cd1f/compliance
2015-03-31 07:23:15,131 [req=7d53040c-29a9-4250-b22e-bc76968affef, org=admin] INFO org.candlepin.controller.CandlepinPoolManager - Revoked entitlement: 2c90af964c6dffde014c6eb644ff459d
2015-03-31 07:23:15,141 [req=d144fe22-045b-4bee-90f6-ee65827c6d8b, org=admin] INFO org.candlepin.common.filter.LoggingFilter - Response: status=200, content-type="application/json", time=88
2015-03-31 07:23:15,149 [req=7d53040c-29a9-4250-b22e-bc76968affef, org=admin] INFO org.candlepin.controller.CandlepinPoolManager - Revoked entitlement: 2c90af964c6dffde014c6eb643b64598
2015-03-31 07:23:15,168 [req=1e234b82-cfbc-4f7f-8cfc-c3a54d04a2c9, org=] INFO org.candlepin.common.filter.LoggingFilter - Request: verb=GET, uri=/candlepin/consumers/512a05fc-6d1c-4e10-a626-0f8e2a91cd1f/compliance
2015-03-31 07:23:15,219 [req=7d53040c-29a9-4250-b22e-bc76968affef, org=admin] WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper - SQL Error: 0, SQLState: 23503
2015-03-31 07:23:15,219 [req=7d53040c-29a9-4250-b22e-bc76968affef, org=admin] ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper - ERROR: update or delete on table "cp_consumer" violates foreign key constraint "fk_consumer_checkin_consumer" on table "cp_consumer_checkin"
  Detail: Key (id)=(2c90af964c6dffde014c6eb569a544e3) is still referenced from table "cp_consumer_checkin".
2015-03-31 07:23:15,224 [req=7d53040c-29a9-4250-b22e-bc76968affef, org=admin] ERROR org.candlepin.common.exceptions.mappers.CandlepinExceptionMapper - Runtime Error could not execute statement at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse:2,102
org.hibernate.exception.ConstraintViolationException: could not execute statement

Looking closely at this, the failing request UUID is a DELETE consumer, but before the DELETE request can complete there is a simultaneous GET request which inserts another consumer checkin timestamp into the db, while the DELETE request has already loaded the consumer for deletion. The delete then fails because the db sees a foreign key still in place.

John, this feels like tests that should be adjusted. Can you think of any way you'd be deleting a test consumer while it is still calculating its compliance? Can you ensure all API calls for the consumer complete before trying to delete it somehow?

This would be exceedingly difficult to hit in production, and a re-try on the delete would succeed. Arguably everything is working as expected, but waiting to hear more from jsefler.
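The foreign-key failure described above can be reproduced in miniature. This is a hypothetical sketch, not Candlepin code: it uses sqlite3 as a stand-in for the PostgreSQL/Hibernate stack, with simplified versions of the cp_consumer and cp_consumer_checkin tables. A checkin row inserted by the concurrent GET makes the consumer DELETE violate the constraint, exactly as in the Hibernate error above.

```python
# Hypothetical minimal reproduction of the FK violation (sqlite3 stand-in
# for the real PostgreSQL schema; table/column names are simplified).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite disables FK checks by default
conn.execute("CREATE TABLE cp_consumer (id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE cp_consumer_checkin "
    "(consumer_id TEXT REFERENCES cp_consumer(id))"
)

conn.execute(
    "INSERT INTO cp_consumer VALUES ('2c90af964c6dffde014c6eb569a544e3')")
# The concurrent GET /compliance records a checkin row for the consumer...
conn.execute(
    "INSERT INTO cp_consumer_checkin VALUES ('2c90af964c6dffde014c6eb569a544e3')")

# ...so the DELETE issued by unregister now violates the constraint.
try:
    conn.execute(
        "DELETE FROM cp_consumer WHERE id = '2c90af964c6dffde014c6eb569a544e3'")
except sqlite3.IntegrityError as e:
    print("delete failed:", e)
```

In the real system the two statements run in separate transactions, which is why the DELETE transaction cannot see the checkin row until it tries to commit.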
(In reply to Devan Goodwin from comment #2)
> John this feels like tests that should be adjusted, can you think of any way
> you'd be deleting a test consumer but it would still be calculating it's
> compliance?

This question makes me think that bug 1207403 (which is easy to reproduce) may be the instigator of this bug.
That looks very likely, good find.
(In reply to John Sefler from comment #3)
> (In reply to Devan Goodwin from comment #2)
> > John this feels like tests that should be adjusted, can you think of any way
> > you'd be deleting a test consumer but it would still be calculating it's
> > compliance?
>
> This question makes me think that think that bug 1207403 (which is easy to
> reproduce) may be the instigator of this bug.

The logs here look like the post-DELETE request for the consumer's compliance still knows the consumer's UUID though, so the ordering is different. Is there any more of the rhsm.log available? Curious when the DELETE request started.

My guess would be the test registers, attaches (auto or otherwise), then immediately unregisters. The subman register and/or attach triggers rhsmd (pid 22939), then subman exits. Then an immediate 'unregister' starts another subman process, calling the DELETE. In the meantime the rhsmd process has made the GET compliance request:

subman register --autoattach
rhsmd starts
subman unregister  <DELETE, req=7d53040c-29a9-4250-b22e-bc76968affef>
<GET compliance with the correct id, req=d144fe22-045b-4bee-90f6-ee65827c6d8b>
<cp does delete request>
rhsmd triggered because of change
<makes req=1e234b82-cfbc-4f7f-8cfc-c3a54d04a2c9 with id cert, still valid at the moment, exists on clients>
<cp attempts to update the now-deleted consumer for req=1e234b82-cfbc-4f7f-8cfc-c3a54d04a2c9>
<The DB gets grumpy>
<DELETE fails>
<req=1e234b82-cfbc-4f7f-8cfc-c3a54d04a2c9 fails?>
rhsmd
We believe this will be fixed by reverting the dbus code that triggers events on compliance status changes.

subscription-manager.git: cf23450fc1d519a7d0e90611499047ce2e7c3334

Will appear in subscription-manager-1.15.3-1
Some more analysis: I believe the issue is the rhsm-icon being triggered. We send out a dbus request to update status and potentially pop up the icon on many CLI commands: unregister (not likely our candidate here though), attach, remove, repos, etc.

This is likely caused by automated tests where we run one of these commands; the dbus request goes out but is handled outside the process. The test then runs unregister, very shortly after which dbus kicks in and the icon tries to update its status. This has to happen after the unregister request is sent, but before it completes.

This is only showing up now because the insertion of a "last updated" row in the db is a new change to fix the lock wait timeouts and contention of rows in the database. Unfortunately it means that if you request to delete a consumer, and something makes requests while the delete is running, the delete will fail because the delete transaction couldn't see the last checkin row in the db when it was started.

The server is behaving well here I think, and I believe the issue is limited to automated situations where we're doing subscription events and then immediately unregistering. It's quite difficult to reproduce even then; my best bet was putting a sleep into unregister, at which point it becomes quite easy.

At present I can't really see any solution here. The client is correct to trigger the icon when it does so, and everything is perfectly valid when that happens; the consumer still exists, as the unregister hasn't completed yet. The only option I can think of is attempting to catch the exception server side and re-try (which will be difficult across three databases). Re-tries are generally something we try to avoid. Or, we could perhaps try to signal on the client that an unregister is underway, and have the icon no-op when it sees this...
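For illustration, the server-side re-try option mentioned above could look something like the following. This is only a sketch of the idea, not actual Candlepin or subscription-manager code; the function and parameter names are invented. The point is that a second delete attempt runs in a fresh transaction that can see the new checkin row, so it would succeed.

```python
# Hypothetical retry wrapper for a delete that can fail on a transient
# constraint violation (names are illustrative, not real Candlepin APIs).
import time

def delete_with_retry(delete_fn, is_constraint_error, attempts=3, delay=0.1):
    """Call delete_fn(), retrying when it raises a constraint-violation error.

    delete_fn            -- zero-argument callable performing the delete
    is_constraint_error  -- predicate deciding whether an exception is retryable
    """
    for attempt in range(attempts):
        try:
            return delete_fn()
        except Exception as e:
            # Give up on the last attempt, or if the error isn't the
            # transient FK-violation case.
            if attempt == attempts - 1 or not is_constraint_error(e):
                raise
            time.sleep(delay)
```

As the comment notes, re-tries like this are generally avoided because they mask ordering problems and behave differently across the three supported databases.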
For the effort, and given this is probably not possible in normal usage (only automated testing), we're going to propose at the very least deferring, but it's probably worth a close. The effort to fix will be quite a mess, and the benefit will be very small. Requesting needinfo from John for his thoughts.
We did think of a possible solution to this server side. We could investigate making the last checkin rows not have a foreign key to the consumer, and instead replace it with an index to make sure lookups remain fast. We'd need provisions to clean up orphaned rows, but this way a GET request during a delete would not cause the delete to fail.

Steps to reproduce:
- Put a Thread.sleep somewhere in ConsumerResource.deleteConsumer and deploy.
- subscription-manager unregister
- Switch to another terminal and run: subscription-manager status

Moving this to Candlepin and off 6.7.
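The proposed schema change can be sketched as follows. This is an assumed schema using sqlite3, not the actual Candlepin migration: the checkin table keeps a plain indexed column instead of a foreign key, so the consumer delete succeeds even with a checkin row in place, and a separate cleanup removes orphaned rows afterwards.

```python
# Sketch of the proposed fix: index instead of FK, plus orphan cleanup
# (sqlite3 stand-in; table names simplified, not the real migration).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cp_consumer (id TEXT PRIMARY KEY)")
# No REFERENCES clause -- just a plain column plus an index for fast lookups.
conn.execute("CREATE TABLE cp_consumer_checkin (consumer_id TEXT)")
conn.execute(
    "CREATE INDEX ix_checkin_consumer ON cp_consumer_checkin (consumer_id)")

conn.execute("INSERT INTO cp_consumer VALUES ('abc')")
conn.execute("INSERT INTO cp_consumer_checkin VALUES ('abc')")

# The delete no longer fails, even with a checkin row still present...
conn.execute("DELETE FROM cp_consumer WHERE id = 'abc'")

# ...and a periodic job removes the now-orphaned checkin rows.
orphans = conn.execute(
    "DELETE FROM cp_consumer_checkin WHERE consumer_id NOT IN "
    "(SELECT id FROM cp_consumer)"
).rowcount
print("orphans removed:", orphans)
```

The trade-off is losing referential integrity on that table in exchange for deletes that cannot be blocked by concurrent checkin inserts.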
Created attachment 1181700 [details]
foreman-debug

Hit this bug with sat6.2 snap20.2 on creating repositories simultaneously. Attaching a foreman-debug (happened at 2016-07-19 11:14:40,556).
Roman, I think your bug is unrelated to this one. This bug concerns logic around cp_consumer_checkin; that table is not present in the Candlepin that is integrated in Sat 6.2. Even looking at your logs, the error [1] is in a different code path, on a different table.

Your bug is caused by concurrent demotions/promotions of content, which is a programming error on the client's side (in your case Katello is the client). Not a Candlepin server bug.

[1]
Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "cp_env_cnt_env_id_cntid_ky"
  at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2102) ~[postgresql-9.0-801.jdbc4.jar:n
(In reply to Filip Nguyen from comment #12)
> Roman I think your bug is unrelated to this. This bug concerns logic around
> cp_consumer_checkin. That table is not present in Candlepin that is
> integrated in Sat 6.2. Even looking at your logs, the error [1] is in
> different code path, on different table.
>
> Your bug is caused by concurrent demotions/promotions of content, which is
> programming error on client's side (in your case Katello is the client). Not
> Candlepin server bug.
>
> [1]
> Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value
> violates unique constraint "cp_env_cnt_env_id_cntid_ky"
> at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2102)
> ~[postgresql-9.0-801.jdbc4.jar:n

Thanks Filip. I'll report this as a katello issue then.
As per comment 13, closing as NOT_A_BUG.