| Summary: | Out of memory, self heal troubles, random errors and crashes with 3.0.3 |
|---|---|
| Product: | [Community] GlusterFS |
| Reporter: | Dushyanth Harinath <dushyanth.h> |
| Component: | replicate |
| Assignee: | Pavan Vilas Sondur <pavan> |
| Status: | CLOSED CURRENTRELEASE |
| QA Contact: | |
| Severity: | medium |
| Docs Contact: | |
| Priority: | low |
| Version: | 3.0.3 |
| CC: | amarts, gluster-bugs |
| Target Milestone: | --- |
| Target Release: | --- |
| Hardware: | x86_64 |
| OS: | Linux |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | Bug Fix |
| Doc Text: | |
| Story Points: | --- |
| Clone Of: | |
| Environment: | |
| Last Closed: | |
| Type: | --- |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Attachments: | |
Description
Dushyanth Harinath
2010-04-30 16:13:40 UTC
Thanks for the detailed information you have provided. You obviously seem to be running into multiple problems:

* The crash you've reported has been seen before and the fix has been sent for review. It should be in the next 3.0.x release. See http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=803
* About the memory usage - can you run glusterfs without io-cache and check the memory usage with the same use case?
* I need a little more information on dovecot getting stuck trying to access the folder during self-heal:
  - It looks like lock requests have bailed out while one or more servers went down and came up. Is it possible to tell us when the server went down, whether self-heal completed, etc.?
  - Does this happen readily? Does dovecot hang each time a 'down' server is brought up and self-heal is in play?
  - Can you post the whole client/server logs, run at 'TRACE' log level?
  - There is an option to the locks translator, 'option trace on'. Please turn it on and post the log files when the issue is hit.
* Also, can you let us know in brief how you have set up dovecot with glusterfs, so that we can try to reproduce the problems in-house?

Running the additional tests you have requested - will keep you posted. Thx

I was successful in replicating this issue with dovecot. Here are the details, plus the entire log files in TRACE mode and with the trace option enabled in the locks translator.

1. We have two email accounts with home directories set to folders under glusterfs mountpoints. mail1 & mail2 are the two nodes and mail2 is passive.

   /etc/glusterfs/glusterfs-home51.vol on /gluster/home51 type fuse.glusterfs (rw,allow_other,default_permissions,max_read=131072)
   /etc/glusterfs/glusterfs-home61.vol on /gluster/home61 type fuse.glusterfs (rw,allow_other,default_permissions,max_read=131072)

2. First we tested by sending a few mails manually to these accounts to make sure gluster works fine. I then ran a postal test to send continuous mails to these accounts.
   The postal command line was `postal -p LMTP_PORT -t 1 -c 1 MAIL1_IP email.txt`. The file email.txt contains the below two rows:

   gluster5
   gluster6

   With both gluster servers running I wasn't seeing any errors. After a while, we unmounted the gluster disks and stopped the gluster server on mail2. Note that mails are being delivered to mail1 only and mail2 is passive.

   [2010-05-10 16:38:56] W [glusterfsd.c:965:cleanup_and_exit] glusterfs: shutting down

3. We then changed the postal command line to the below to attempt multiple mail deliveries at the same time: `postal -p LMTP_PORT -t 10 -c 1 mail1_IP email.txt`. This invokes 10 postal threads and delivers one email per connection. The test continues till it is killed.

4. We immediately started seeing the below errors in the dovecot logs:

   May 10 16:50:19 mail dovecot: lmtp(30603, gluster6): Error: Corrupted transaction log file /gluster/home61/gluster6/dbox/mailboxes/INBOX/dbox-Mails/dovecot.index.log seq 8: Unexpected garbage at EOF (sync_offset=99760)
   May 10 16:50:19 mail dovecot: lmtp(30603, gluster6): Warning: fscking index file /gluster/home61/gluster6/dbox/mailboxes/INBOX/dbox-Mails/dovecot.index
   May 10 16:50:19 mail dovecot: lmtp(30596, gluster6): Warning: mdbox /gluster/home61/gluster6/dbox/storage: Inconsistency in map index (8,37032 != 8,37144)
   May 10 16:50:19 mail dovecot: lmtp(30596, gluster6): Warning: mdbox /gluster/home61/gluster6/dbox/storage: rebuilding indexes
   May 10 16:50:19 mail dovecot: lmtp(30598, gluster6): Error: Transaction log file /gluster/home61/gluster6/dbox/mailboxes/INBOX/dbox-Mails/dovecot.index.log: marked corrupted
   May 10 16:50:19 mail dovecot: lmtp(24892, gluster6): Error: open() failed with file /gluster/home61/gluster6/dbox/mailboxes/INBOX/dbox-Mails/dovecot.index.log: Transport endpoint is not connected
   May 10 16:50:19 mail dovecot: lmtp(24892, gluster6): 6qweBiDp50s8YQAATEXDjg: msgid=<4BE7EBEE.1FB.002.gluster5:directi.com.pw> from=<gluster5@SYNTAX_ERROR> size=9465: save failed to INBOX: Internal
   error occurred. Refer to server log for more information. [2010-05-10 16:50:14]
   May 10 16:50:19 mail dovecot: lmtp(24892, gluster6): Error: BUG: Saving failed to unknown storage

5. At this time, the errors on mail1 were as below. The lmtp processes which were delivering mail to this home dir got stuck after the first error we encountered above - refer to the attached lsof file.

   [2010-05-10 16:40:02] T [socket.c:176:__socket_disconnect] mail2-home61-brick: shutdown() returned -1. set connection state to -1
   [2010-05-10 16:40:02] T [client-protocol.c:6374:protocol_client_cleanup] mail2-home61-brick: cleaning up state in transport object 0x6151f0
   [2010-05-10 16:40:02] T [socket.c:91:__socket_rwv] mail2-home61-brick: EOF from peer 172.16.2.206:6997
   [2010-05-10 16:40:02] T [socket.c:580:__socket_proto_state_machine] mail2-home61-brick: read (Transport endpoint is not connected) in state 1 (172.16.2.206:6997)
   [2010-05-10 16:40:02] T [client-protocol.c:6374:protocol_client_cleanup] mail2-home61-brick: cleaning up state in transport object 0x6151f0
   [2010-05-10 16:40:02] T [fuse-bridge.c:702:fuse_attr_cbk] glusterfs-fuse: 399891: LOOKUP() / => 1
   [2010-05-10 16:40:02] T [fuse-bridge.c:600:fuse_lookup] glusterfs-fuse: 399892: LOOKUP /gluster6(407699638)
   [2010-05-10 16:40:02] T [fuse-bridge.c:468:fuse_entry_cbk] glusterfs-fuse: 399892: LOOKUP() /gluster6 => 407699638 (407699638)
   [2010-05-10 16:40:02] T [fuse-bridge.c:600:fuse_lookup] glusterfs-fuse: 399893: LOOKUP /gluster6/dbox(407699640)
   [2010-05-10 16:40:02] T [afr-self-heal-common.c:1557:afr_self_heal] home61-mirror-1: performing self heal on /gluster6/dbox (metadata=0 data=0 entry=1)

   Why is it doing self-heal when the remote volume is still down?

6. Glusterfsd was started at the below time on node2:

   [2010-05-10 17:08:35] N [glusterfsd.c:1408:main] glusterfs: Successfully started

7.
   The self-heal also seems to be taking a lot of time.

   # On mail1 - the number of mails delivered to these mailboxes
   # grep ^Received /gluster/home51/gluster5/dbox/storage/m.* | wc -l ; grep ^Received /gluster/home61/gluster6/dbox/storage/m.* | wc -l
   3831
   3653

   # On mail2 - the number of mails delivered to these mailboxes
   # grep ^Received /gluster/home51/gluster5/dbox/storage/m.* | wc -l ; grep ^Received /gluster/home61/gluster6/dbox/storage/m.* | wc -l
   3831
   3277

   When glusterfsd was started on mail2, the gluster6 mailbox on mail had about 2.6K mails. It's been nearly 2 hrs since then and it seems to be catching up very slowly. The files open for the account gluster6 as of now (2 hrs after the mail2 volumes came up) are also attached in the file lsof-files-open-now.txt.

I am attaching the below files:

* logs.tar.gz contains the full logs with TRACE mode on both nodes
* lsof outputs
* Dovecot configuration file to replicate this issue in-house
* Dovecot RPM
* Postal binary - works under CentOS 5.x 64-bit

Haven't got around to testing the out-of-mem issue with io-cache turned off yet. Will do and let you know.

Thanks
Dushyanth

Created attachment 193 [details]
md5sums
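For reference, enabling the lock tracing requested earlier ('option trace on' in the locks translator) would look roughly like the fragment below in a server-side volfile. This is a hedged sketch: the volume and subvolume names are hypothetical and not taken from the reporter's actual configuration.

```
volume home61-locks
  type features/locks
  option trace on           # log lock requests/grants/releases for debugging
  subvolumes home61-posix   # hypothetical storage/posix volume beneath it
end-volume
```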
Created attachment 197 [details]
Patched postal for LMTP delivery
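The dovecot "Unexpected garbage at EOF" errors reported above are the classic signature of a writer whose cached idea of a log file's end offset goes stale while another writer (or, in the replicated case, another client's cache view) appends behind its back. The following is a minimal, glusterfs-free Python sketch of that detection logic; the file name and the `check_log` helper are illustrative inventions, not dovecot's actual code.

```python
import os
import tempfile

def check_log(path, sync_offset):
    """Dovecot-style sanity check: if the file is now longer than the
    offset we last synced to, someone else appended behind our back."""
    size = os.path.getsize(path)
    if size != sync_offset:
        return f"Unexpected garbage at EOF (sync_offset={sync_offset})"
    return "ok"

with tempfile.TemporaryDirectory() as d:
    log = os.path.join(d, "dovecot.index.log")
    with open(log, "wb") as writer_a:        # writer A appends a record
        writer_a.write(b"record-1\n")
    sync_offset = os.path.getsize(log)       # A caches the EOF it saw
    with open(log, "ab") as writer_b:        # writer B appends meanwhile
        writer_b.write(b"record-2\n")
    print(check_log(log, sync_offset))       # A's cached offset is now stale
```

On a coherent filesystem the second writer's append is at least visible; with a replica pair serving different file lengths during self-heal, both the "garbage at EOF" and the "Inconsistency in map index" variants seen in the logs can occur.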
Hi, any updates on this issue? Have you been able to test & verify this bug? Do let me know if you need any other information. I am able to successfully reproduce this issue every time even without the io-cache, write-behind, read-ahead & stat-prefetch translators. I currently have only the replicate translator enabled.

Thanks
Dushyanth

Hi Dushyanth, thank you for the detailed info. I am looking into the logs and isolating multiple bugs. We're using your dovecot configuration to reproduce in-house. Will update soon with the root causes of the problems seen.

>Haven't got around to test the out of mem issue with iocache turned off yet.
>Will do and let you know.
Apart from the other things you've reported, did you get a chance to run the tests without io-cache and specifically have you seen memory usage problems?
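A low-effort way to answer the memory-usage question is to sample the glusterfs client's resident set size periodically during the postal run, with and without io-cache loaded. A hedged sketch (Linux-only, reads /proc; the `rss_kb` helper name is made up for illustration):

```python
import os
import re

def rss_kb(pid):
    """Return the process's resident set size in kB, parsed from
    the VmRSS line of /proc/<pid>/status (Linux-specific)."""
    with open(f"/proc/{pid}/status") as f:
        match = re.search(r"VmRSS:\s+(\d+)\s+kB", f.read())
    return int(match.group(1)) if match else None

# Example: sample our own process. In practice, point `pid` at the
# glusterfs client process (e.g. found via pidof) and log rss_kb(pid)
# once a minute while postal runs to see whether memory grows unbounded.
print(rss_kb(os.getpid()))
```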
I haven't done an extensive test with io-cache turned off. Will do that this week or early next week.

* Testing without the io-cache & quick-read translators does not show any OOM issues.

Most of the self-heal (replicate-related) bugs are now fixed in the 3.1.0 branch. As we are just a week behind the GA release time, we would like you to test this particular bug in the 3.1.0RC releases and let us know if it's fixed.

With 3.1 the code where the crash happened is removed. Marking it as fixed for 3.1.