Description of problem:
The userspace processes seem to be slow enough to continually cause timeout messages during most cmirror operations.

Jan 21 15:05:35 hayes-03 kernel: device-mapper: dm-log-clustered: Timed out waiting for cluster log server [DM_CLOG_GET_SYNC_COUNT]
Jan 21 15:05:35 hayes-03 kernel: device-mapper: dm-log-clustered: Retrying request [DM_CLOG_GET_SYNC_COUNT]
Jan 21 15:05:35 hayes-03 kernel: device-mapper: dm-log-clustered: [77938][0][0][0][0][0] [10]

Version-Release number of selected component (if applicable):
cmirror-1.1.5-4.el5

How reproducible:
Often
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
This is a real problem if it repeats over and over... it means processing isn't getting done for some reason. If it only happens once in a while and things keep working, then it is probably a minor issue. I can see that you have had 10 such timeouts... how often do you see this message? I've probably done most of the optimizing I can do in this release. So, if it's timing out due to slowness, then we may have to skip this bug. If it's timing out due to a userspace failure to process requests, then this bug must be fixed.
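For what it's worth, below is a minimal, generic sketch of the timeout-and-retry pattern the kernel messages imply. It is an illustration only, not the actual dm-log-clustered code; send_request() and wait_for_reply() are hypothetical stubs. The "minor issue" case is when a retry eventually gets a reply; the "must be fixed" case is when the loop keeps timing out because userspace never processes the request.

/*
 * Illustration only -- not the actual dm-log-clustered code.  A request is
 * handed to the cluster log server and, if no reply arrives within the
 * timeout, the same request is reissued and a "retrying" message logged.
 */
#include <stdio.h>

static void send_request(const char *type, unsigned int seq)
{
        /* Stub: a real implementation would pass the request to userspace. */
        (void)type;
        (void)seq;
}

static int wait_for_reply(unsigned int seq, int timeout_secs)
{
        /* Stub: returning non-zero simulates a reply that never arrives. */
        (void)seq;
        (void)timeout_secs;
        return -1;
}

static void do_log_request(const char *type, unsigned int seq, int max_retries)
{
        int attempt;

        for (attempt = 0; attempt <= max_retries; attempt++) {
                send_request(type, seq);
                if (wait_for_reply(seq, 15) == 0)
                        return;    /* reply received; processing continues */
                printf("Request timed out on %s:%u - retrying\n", type, seq);
        }
}

int main(void)
{
        do_log_request("DM_CLOG_GET_SYNC_COUNT", 151238, 3);
        return 0;
}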
I'm still at times seeing these timeout messages cause problems, like in bz 430296.
Saw a similar issue while testing for bz 429708.

device-mapper: dm-log-clustered: Request timed out on DM_CLOG_FLUSH:81746 - retrying
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_FLUSH:87824 - retrying
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_RESYNC_WORK:87826 - retrying
device-mapper: dm-log-clustered: Stray request returned: <NULL>, 0
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_CLEAR_REGION:91049 - retrying

The sync is either stuck or so slow that it no longer moves:

[root@taft-01 ~]# date
Mon Feb 18 14:33:47 CST 2008
[root@taft-01 ~]# lvs
  LV            VG          Attr   LSize   Origin Snap% Move Log                Copy%  Convert
  taft-02.28775 lock_stress mwi-a- 500.00M                  taft-02.28775_mlog  24.00
  taft-03.28789 lock_stress mwi-a- 500.00M                  taft-03.28789_mlog  79.20
[root@taft-01 ~]# date
Mon Feb 18 14:42:15 CST 2008
[root@taft-01 ~]# lvs
  LV            VG          Attr   LSize   Origin Snap% Move Log                Copy%  Convert
  taft-02.28775 lock_stress mwi-a- 500.00M                  taft-02.28775_mlog  24.00
  taft-03.28789 lock_stress mwi-a- 500.00M                  taft-03.28789_mlog  79.20
I believe that I'm still seeing this timing issue cause problems during lvm operations.

qarshd[25822]: Running cmdline: lvconvert -m 2 lock_stress/taft-02.5326

Down converting cmirror from 3 legs to 2 on taft-02
Error locking on node taft-03: LV lock_stress/taft-02.5326_mimage_3 in use: not deactivating

Taft-03:
Mar 3 16:49:51 taft-03 lvm[6770]: Monitoring mirror device lock_stress-taft--02.5326 for events
Mar 3 16:49:51 taft-03 clogd[6682]: Invalid request_type
On taft-03, you can see that the third image didn't get deleted when the down conversion happened.

  taft-02.5326            lock_stress mwi-a- 500.00M taft-02.5326_mlog 100.00 taft-02.5326_mimage_0(0),taft-02.5326_mimage_1(0),taft-02.5326_mimage_2(0)
  [taft-02.5326_mimage_0] lock_stress iwi-ao 500.00M /dev/sdb1(125)
  [taft-02.5326_mimage_1] lock_stress iwi-ao 500.00M /dev/sdd1(125)
  [taft-02.5326_mimage_2] lock_stress iwi-ao 500.00M /dev/sde1(125)
  taft-02.5326_mimage_3   lock_stress -wi--- 500.00M /dev/sdf1(125)
I saw another instance where an lvm cmd failed and the only thing in the logs was this dm-log-clustered time out message.

Mar 4 19:05:55 taft-04 qarshd[17728]: Running cmdline: lvchange --resync -y mirror_sanity/resync_regular
Mar 4 19:05:46 taft-02 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_STATUS_INFO:836712 - retrying

Resyncing resync_regular mirror
Error locking on node taft-02: LV mirror_sanity/resync_regular_mlog in use: not deactivating
Unable to deactivate log LV resync_regular_mlog after wiping for resync
Making this bz a little more visible.
Just an FYI: with enough lvm processes, you'll see these messages repeat even if the cmds aren't failing or stuck.

2.6.18-90.el5
lvm2-2.02.32-4.el5
lvm2-cluster-2.02.32-4.el5
kmod-cmirror-0.1.8-1.el5

[...]
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_SYNC_COUNT:151238 - retrying
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_CLEAR_REGION:161349 - retrying
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_CLEAR_REGION:161609 - retrying
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_CLEAR_REGION:173434 - retrying
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_CLEAR_REGION:194022 - retrying
[...]
I believe this is now fixed. I have seen "Request timed out..." messages in further testing, but that is due to an AIS failure. If you see a message from OpenAIS like "disconnecting from library", then that is a different bug.
This should also help (at least in the generic sense of preventing 'timed out' messages):

commit a21d1c03600e011cf1ee132100ac823bf3478877
Author: Jonathan Brassow <jbrassow>
Date:   Thu May 15 13:02:19 2008 -0500

    clogd + dm-log-clustered: Improve communication error handling (K <-> U)

    Connector (netlink) is an unreliable protocol.  From time to time, I get a
    connector header in userspace that has all the right fields (seq #, size of
    data, etc), but the data is all zeros.  When this happens, I use the seq
    number in the header to request the kernel to resend the request.  This
    should avoid those nasty 'Invalid request' messages... which are then
    followed by 'Request timed out' (because the invalid request was ignored).
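For reference, here is a minimal userspace-side sketch of the check that commit message describes, assuming the cn_msg header from <linux/connector.h>. It is an illustration only; request_resend() is a hypothetical stand-in, not the actual clogd function.

/*
 * Sketch: a connector (netlink) message arrives with a valid header but an
 * all-zero payload, and the seq number from the header is used to ask the
 * kernel to resend the request.
 */
#include <stdio.h>
#include <stdint.h>
#include <linux/connector.h>

static int payload_is_zeroed(const struct cn_msg *msg)
{
        const uint8_t *data = (const uint8_t *)msg->data;
        uint16_t i;

        if (msg->len == 0)
                return 0;
        for (i = 0; i < msg->len; i++)
                if (data[i] != 0)
                        return 0;
        return 1;
}

static void request_resend(uint32_t seq)
{
        /* Stand-in: a real daemon would send a "resend" reply to the kernel. */
        printf("zeroed payload, asking kernel to resend seq %u\n",
               (unsigned int)seq);
}

/* Returns 0 if the message can be processed, -1 if a resend was requested. */
static int check_cn_msg(const struct cn_msg *msg)
{
        if (payload_is_zeroed(msg)) {
                request_resend(msg->seq);
                return -1;
        }
        return 0;
}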
I was able to hit this while running the latest 5.2.Z kernel while running the test cmirror_lock_stress. All user space cmds (lvs) are hung. The following is scrolling over and over on all nodes:

Jun 26 14:42:28 hayes-02 kernel: device-mapper: dm-log-clustered: [cE1hohMV] Request timed out: [DM_CLOG_CTR/726] - retrying
Jun 26 14:42:28 hayes-02 clogd[5954]: Invalid log request received, ignoring.
device-mapper: dm-log-clustered: [cE1hohMV] Request timed out: [DM_CLOG_CTR/726] - retrying
Jun 26 14:42:43 hayes-02 kernel: device-mapper: dm-log-clustered: [cE1hohMV] Request timed out: [DM_CLOG_CTR/726] - retrying
Jun 26 14:42:43 hayes-02 clogd[5954]: Invalid log request received, ignoring.

I'll include a dump of all the running processes.

2.6.18-92.1.5.el5
cmirror-1.1.18-1.el5       BUILT: Tue 20 May 2008 02:53:58 PM CDT
kmod-cmirror-0.1.10-1.el5  BUILT: Tue 20 May 2008 02:55:48 PM CDT
Created attachment 310372 [details] log from hayes-01
Created attachment 310373 [details] log from hayes-02
Created attachment 310374 [details] log from hayes-03
I would be _very_ surprised to see you get this anymore... "clogd[5954]: Invalid log request received, ignoring." I believe the bug has been fixed and that this was an artifact of the build issue we were having, which caused things to not work properly.
I don't see the 'clogd Invalid log request received' messages anymore. I am however still seeing a potential deadlock when running cmirror_lock_stress. I'll file a new bz for that and mark this one verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2009-0158.html