Bug 865812
Summary: | glusterd stops responding after ext4 errors | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | [Community] GlusterFS | Reporter: | Christian Wittwer <christian.wittwer> | ||||||
Component: | glusterd | Assignee: | krishnan parthasarathi <kparthas> | ||||||
Status: | CLOSED DEFERRED | QA Contact: | |||||||
Severity: | high | Docs Contact: | |||||||
Priority: | medium | ||||||||
Version: | 3.3.0 | CC: | bugs, diankenneth, gluster-bugs, nsathyan | ||||||
Target Milestone: | --- | ||||||||
Target Release: | --- | ||||||||
Hardware: | x86_64 | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2014-12-14 19:40:29 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Attachments: |
|
Any updates on that? We're having the error over and over again (more or less every month). It's very annoying because the resync is blocking all virtual machines running backed by that gluster cluster. What can we do to provide you with more information? Any updates here? Maybe an idea, or do you need more information? Christian, Sorry for not having responded to this bug in a long time. I am not an ext4 expert, so I am not sure why you are seeing those errors you see in /var/log/kern.log. Was your disk almost full whenever you saw these messages on kern.log? Are you sure you see a spike in CPU for the glusterd or was it the glusterfsd (of the brick whose ext4 backend has failed)? Could you attach all the logs under /var/log/glusterfs, so that we could get a complete picture? By the way, which version of glusterfs are you using? No, the brick has plenty of space available (7 TB). The cpu is used by the glusterfsd, you're right. root 2578 215 0.0 1381588 29344 ? Rsl Jun24 6305:32 /usr/sbin/glusterfsd <snip> I'm using gluster 3.3.1, you'll find all logs attached to this ticket. Thanks for your support! Created attachment 765790 [details]
/var/log/glusterfs
The version that this bug has been reported against, does not get any updates from the Gluster Community anymore. Please verify if this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug. If there has been no update before 9 December 2014, this bug will get automatocally closed. |
Created attachment 626005 [details] CPU usage of a gluster brick Description of problem: We had some strange outage on our Gluster volume during the past weeks. Our setup: * Distributed-Replicate with 4 bricks (ext4). * Uubuntu 12.04 (kernel 3.2.0-26-generic) From time to time randomly on of our bricks goes offline. The cpu usage increases dramatically (see attached Zabbix graph), a top show that the glusterd is consuming all the cpu. Example logs from the 2012-10-07, outage starts at 4:30am. /var/log/glusterfs/mnt-vol0.log ---------------------------------------------------------------------------------------------------------------------------------------------------------- [2012-10-07 04:32:55.370503] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-vol0-client-3: server 127.0.0.1:24009 has not responded in the last 42 seconds, disconnecting. [2012-10-07 04:32:55.471492] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0xd0) [0x7f2e6d0c35b0] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f2e6d0c 3220] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f2e6d0c314e]))) 0-vol0-client-3: forced unwinding frame type(GlusterFS 3.1) op(FINODELK(30)) called at 2012-10-07 04:32:13.177609 (xid=0x238418 76x) [2012-10-07 04:32:55.471522] W [client3_1-fops.c:1545:client3_1_finodelk_cbk] 0-vol0-client-3: remote operation failed: Transport endpoint is not connected [2012-10-07 04:32:55.471583] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0xd0) [0x7f2e6d0c35b0] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f2e6d0c3220] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f2e6d0c314e]))) 0-vol0-client-3: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2012-10-07 04:32:13.177623 (xid=0x23841877x) [2012-10-07 04:32:55.475059] W [client-handshake.c:275:client_ping_cbk] 0-vol0-client-3: timer must have expired [2012-10-07 04:32:55.475079] I [client.c:2090:client_rpc_notify] 0-vol0-client-3: disconnected [2012-10-07 04:32:55.479697] W [client3_1-fops.c:5267:client3_1_finodelk] 0-vol0-client-3: (f879cc43-5107-4937-9505-89752f06d8f3) remote_fd is -1. EBADFD [2012-10-07 04:33:48.846301] W [client3_1-fops.c:5267:client3_1_finodelk] 0-vol0-client-3: (b3c78591-8588-4e6a-8fab-7b2c71a1ec3f) remote_fd is -1. EBADFD [2012-10-07 04:33:48.848424] W [client3_1-fops.c:5267:client3_1_finodelk] 0-vol0-client-3: (a670c9bc-7d60-4319-99df-cccd1f4af368) remote_fd is -1. EBADFD [2012-10-07 05:03:16.529583] E [rpc-clnt.c:208:call_bail] 0-vol0-client-3: bailing out frame type(GF-DUMP) op(DUMP(1)) xid = 0x23841878x sent = 2012-10-07 04:33:06.421023. timeout = 1800 [2012-10-07 05:03:16.529619] W [client-handshake.c:1819:client_dump_version_cbk] 0-vol0-client-3: received RPC status error ---------------------------------------------------------------------------------------------------------------------------------------------------------- More or less 30min before (a timeout in gluster?) I have some strange error about ext4 in my kern.log /var/log/kern.log ---------------------------------------------------------------------------------------------------------------------------------------------------------- Oct 7 04:02:10 localhost kernel: [995105.183616] JBD: sdb1-8: jh->b_next_transaction (1639457215, (null), 0) != transaction (ffff8824066f7f00, 95295038) Oct 7 04:02:10 localhost kernel: [995105.194676] ------------[ cut here ]------------ Oct 7 04:02:10 localhost kernel: [995105.205113] WARNING: at /build/buildd/linux-3.2.0/fs/jbd2/transaction.c:1175 jbd2_journal_dirty_metadata+0x199/0x230() Oct 7 04:02:10 localhost kernel: [995105.205115] Hardware name: PowerEdge R710 Oct 7 04:02:10 localhost kernel: [995105.205116] Modules linked in: xt_multiport ipmi_si ebt_arp ebt_ip mpt2sas scsi_transport_sas raid_class mptctl mptbase ipmi_devintf ipmi_msghandler ip6table_filter ip6 _tables ebtable_nat ebtables dell_rbu ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tabl es kvm_intel kvm vesafb nbd nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc ext2 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bridge 8021q garp stp joydev usbhid hid i7core_edac dm_multipath psmouse edac_core dcdbas serio_raw lp parport mac_hid acpi_power_meter aacraid usb_storage ses enclosure bnx2x libcrc32c mdio megaraid_sas bnx2 [last unlo aded: ipmi_si] Oct 7 04:02:10 localhost kernel: [995105.205163] Pid: 3255, comm: glusterfsd Not tainted 3.2.0-26-generic #41-Ubuntu Oct 7 04:02:10 localhost kernel: [995105.205164] Call Trace: Oct 7 04:02:10 localhost kernel: [995105.205172] [<ffffffff810672af>] warn_slowpath_common+0x7f/0xc0 Oct 7 04:02:10 localhost kernel: [995105.205174] [<ffffffff8106730a>] warn_slowpath_null+0x1a/0x20 Oct 7 04:02:10 localhost kernel: [995105.205177] [<ffffffff8125f579>] jbd2_journal_dirty_metadata+0x199/0x230 Oct 7 04:02:10 localhost kernel: [995105.205181] [<ffffffff81241f1b>] __ext4_handle_dirty_metadata+0x8b/0x130 Oct 7 04:02:10 localhost kernel: [995105.205184] [<ffffffff812519e1>] ext4_xattr_release_block+0xe1/0x190 Oct 7 04:02:10 localhost kernel: [995105.205186] [<ffffffff81251f1b>] ext4_xattr_block_set+0x21b/0x670 Oct 7 04:02:10 localhost kernel: [995105.205188] [<ffffffff81252822>] ext4_xattr_set_handle+0x382/0x4f0 Oct 7 04:02:10 localhost kernel: [995105.205190] [<ffffffff81252a22>] ext4_xattr_set+0x92/0x100 Oct 7 04:02:10 localhost kernel: [995105.205192] [<ffffffff81250cf0>] ? ext4_xattr_find_entry+0x90/0x100 Oct 7 04:02:10 localhost kernel: [995105.205194] [<ffffffff812534fd>] ext4_xattr_trusted_set+0x2d/0x30 Oct 7 04:02:10 localhost kernel: [995105.205200] [<ffffffff8119afcb>] generic_setxattr+0x6b/0x90 Oct 7 04:02:10 localhost kernel: [995105.205203] [<ffffffff8119b82b>] __vfs_setxattr_noperm+0x7b/0x1c0 Oct 7 04:02:10 localhost kernel: [995105.205208] [<ffffffff812dbc3e>] ? evm_inode_setxattr+0xe/0x10 Oct 7 04:02:10 localhost kernel: [995105.205210] [<ffffffff8119ba2c>] vfs_setxattr+0xbc/0xc0 Oct 7 04:02:10 localhost kernel: [995105.205212] [<ffffffff8119baf6>] setxattr+0xc6/0x120 Oct 7 04:02:10 localhost kernel: [995105.205215] [<ffffffff816599ce>] ? _raw_spin_lock+0xe/0x20 Oct 7 04:02:10 localhost kernel: [995105.205219] [<ffffffff8109efa3>] ? futex_wake+0x113/0x130 Oct 7 04:02:10 localhost kernel: [995105.205236] [<ffffffffa03a95c4>] ? kvm_on_user_return+0x74/0x80 [kvm] Oct 7 04:02:10 localhost kernel: [995105.205238] [<ffffffff8119bf0b>] sys_fsetxattr+0xbb/0xe0 Oct 7 04:02:10 localhost kernel: [995105.205241] [<ffffffff81661fc2>] system_call_fastpath+0x16/0x1b Oct 7 04:02:10 localhost kernel: [995105.205242] ---[ end trace d9e36cec5630a987 ]--- Oct 7 04:02:10 localhost kernel: [995105.205245] EXT4-fs error (device sdb1) in ext4_xattr_release_block:496: error 22 Oct 7 04:02:10 localhost kernel: [995105.226757] EXT4-fs error (device sdb1) in ext4_xattr_release_block:504: error 22 Oct 7 04:02:10 localhost kernel: [995105.250132] JBD: <D0><C0><A3>^B$<88><FF><FF><C0><A6>^c: jh->b_next_transaction (1639448617, (null), 0) != transaction (ffff882402a3bd68, 1) Oct 7 04:02:10 localhost kernel: [995105.261521] ------------[ cut here ]------------ Oct 7 04:02:10 localhost kernel: [995105.273394] WARNING: at /build/buildd/linux-3.2.0/fs/jbd2/transaction.c:1175 jbd2_journal_dirty_metadata+0x199/0x230() Oct 7 04:02:10 localhost kernel: [995105.273396] Hardware name: PowerEdge R710 Oct 7 04:02:10 localhost kernel: [995105.273398] Modules linked in: xt_multiport ipmi_si ebt_arp ebt_ip mpt2sas scsi_transport_sas raid_class mptctl mptbase ipmi_devintf ipmi_msghandler ip6table_filter ip6 _tables ebtable_nat ebtables dell_rbu ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tabl es kvm_intel kvm vesafb nbd nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc ext2 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bridge 8021q garp stp joydev usbhid hid i7core_edac dm_multipath psmouse edac_core dcdbas serio_raw lp parport mac_hid acpi_power_meter aacraid usb_storage ses enclosure bnx2x libcrc32c mdio megaraid_sas bnx2 [last unlo aded: ipmi_si] ---------------------------------------------------------------------------------------------------------------------------------------------------------- If you need more information, let me know what and how I can collect it. Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info: