Bug 764283 (GLUSTER-2551)

Summary: CPU usage 100% on client and brick...
Product: [Community] GlusterFS Reporter: Amar Tumballi <amarts>
Component: replicate    Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED WORKSFORME QA Contact:
Severity: medium Docs Contact:
Priority: medium    
Version: 3.1.3    CC: gluster-bugs, masta, vraman
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: --- Mount Type: fuse
Documentation: --- CRM:
Attachments (Description / Flags):
  glusterdump brick01 (flags: none)
  glusterdump brick02 (flags: none)
  glusterdump client (flags: none)
  graphs from bricks and client (flags: none)

Description Aleks3Y 2011-03-17 11:23:45 UTC
Created attachment 454


Here is the dump from brick01.
CPU usage was around 200-300%.

Comment 1 Aleks3Y 2011-03-17 11:28:03 UTC
Created attachment 455


CPU usage was around 100-150%

Comment 2 Aleks3Y 2011-03-17 11:32:02 UTC
Created attachment 456


On the client, CPU usage was 60-120%.

Comment 3 Amar Tumballi 2011-03-17 13:55:58 UTC
The user's CPU usage is at 100% on both the brick and the client machines while using software called 'r1soft' to back up data, on a volume configured as below:

gluster> volume info

Volume Name: Volume01
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: brick01:/exp1
Brick2: brick02:/exp2
Options Reconfigured:
performance.flush-behind: on
auth.reject: 127.0.0.2
auth.allow: 192.168.254.*,127.0.0.1,172.16.48.74
diagnostics.brick-log-level: WARNING
diagnostics.client-log-level: WARNING

The same info is available at: http://pastebin.com/97PweKFw  http://pastebin.com/pzm5G24e http://pastebin.com/7LaBRT4F

Log on the client:

[2011-03-16 09:41:25.941350] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.941509] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.941630] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.941760] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.941782] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2011-03-16 09:41:25.942150] D [afr-lk-common.c:987:afr_lock_blocking] Volume01-replicate-0: we're done locking
[2011-03-16 09:41:25.942176] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] Volume01-replicate-0: Blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.942284] D [afr-transaction.c:979:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.942362] D [afr-lk-common.c:987:afr_lock_blocking] Volume01-replicate-0: we're done locking
[2011-03-16 09:41:25.942383] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] Volume01-replicate-0: Blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.942447] D [afr-transaction.c:979:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.943252] D [afr-transaction.c:979:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.943433] D [afr-transaction.c:979:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.943551] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.943610] D [afr-transaction.c:979:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.943732] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.943848] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.943922] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.943979] D [afr-transaction.c:979:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.944054] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.944114] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.944237] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.944326] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.944347] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2011-03-16 09:41:25.944606] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.944922] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.944940] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2011-03-16 09:41:25.945037] D [afr-lk-common.c:987:afr_lock_blocking] Volume01-replicate-0: we're done locking
[2011-03-16 09:41:25.945052] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] Volume01-replicate-0: Blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.945256] D [afr-transaction.c:979:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.945350] D [afr-lk-common.c:987:afr_lock_blocking] Volume01-replicate-0: we're done locking
[2011-03-16 09:41:25.945372] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] Volume01-replicate-0: Blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.945502] D [afr-transaction.c:979:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.946050] D [afr-transaction.c:979:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.946267] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.946412] D [afr-transaction.c:979:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.946495] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.946564] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.946579] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2011-03-16 09:41:25.946682] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.946763] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.946860] D [afr-transaction.c:979:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.946952] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.947051] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.947152] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.947185] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2011-03-16 09:41:25.947505] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.947661] D [afr-lk-common.c:987:afr_lock_blocking] Volume01-replicate-0: we're done locking
[2011-03-16 09:41:25.947686] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] Volume01-replicate-0: Blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.947862] D [afr-lk-common.c:987:afr_lock_blocking] Volume01-replicate-0: we're done locking
[2011-03-16 09:41:25.947885] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] Volume01-replicate-0: Blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.948060] D [afr-transaction.c:979:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.948586] D [afr-transaction.c:979:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.948654] D [afr-transaction.c:979:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.948763] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.948920] D [afr-transaction.c:979:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.949070] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.949127] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.949216] D [afr-transaction.c:979:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.949358] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.949470] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.949547] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.949562] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2011-03-16 09:41:25.949862] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.949966] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.949989] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2011-03-16 09:41:25.950149] D [afr-transaction.c:979:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2011-03-16 09:41:25.950273] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction
[2011-03-16 09:41:25.950350] D [afr-lk-common.c:987:afr_lock_blocking] Volume01-replicate-0: we're done locking
[2011-03-16 09:41:25.950390] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] Volume01-replicate-0: Blocking inodelks done. Proceeding to FOP

Server log:

[2011-03-16 05:35:24.708136] D [inodelk.c:293:__inode_unlock_lock] Volume01-locks:  Matching lock found for unlock
[2011-03-16 05:35:24.708253] D [inodelk.c:293:__inode_unlock_lock] Volume01-locks:  Matching lock found for unlock
[2011-03-16 05:35:24.708344] D [inodelk.c:293:__inode_unlock_lock] Volume01-locks:  Matching lock found for unlock
[2011-03-16 05:35:24.708821] D [inodelk.c:293:__inode_unlock_lock] Volume01-locks:  Matching lock found for unlock
[2011-03-16 05:35:24.710140] D [inodelk.c:293:__inode_unlock_lock] Volume01-locks:  Matching lock found for unlock
[2011-03-16 05:35:24.710306] D [inodelk.c:293:__inode_unlock_lock] Volume01-locks:  Matching lock found for unlock
[2011-03-16 05:35:24.710544] D [inodelk.c:293:__inode_unlock_lock] Volume01-locks:  Matching lock found for unlock
[2011-03-16 05:35:24.710798] D [inodelk.c:293:__inode_unlock_lock] Volume01-locks:  Matching lock found for unlock
[2011-03-16 05:35:24.711046] D [inodelk.c:293:__inode_unlock_lock] Volume01-locks:  Matching lock found for unlock
[2011-03-16 05:35:24.711271] D [inodelk.c:293:__inode_unlock_lock] Volume01-locks:  Matching lock found for unlock
[2011-03-16 05:35:24.711339] D [inodelk.c:293:__inode_unlock_lock] Volume01-locks:  Matching lock found for unlock
[2011-03-16 05:35:24.711648] D [inodelk.c:293:__inode_unlock_lock] Volume01-locks:  Matching lock found for unlock
[2011-03-16 05:35:24.712345] D [inodelk.c:293:__inode_unlock_lock] Volume01-locks:  Matching lock found for unlock
[2011-03-16 05:35:24.712543] D [inodelk.c:293:__inode_unlock_lock] Volume01-locks:  Matching lock found for unlock
[2011-03-16 05:35:24.713543] D [inodelk.c:293:__inode_unlock_lock] Volume01-locks:  Matching lock found for unlock
[2011-03-16 05:35:24.713745] D [inodelk.c:293:__inode_unlock_lock] Volume01-locks:  Matching lock found for unlock
[2011-03-16 05:35:24.713955] D [inodelk.c:293:__inode_unlock_lock] Volume01-locks:  Matching lock found for unlock
[2011-03-16 05:35:24.714046] D [inodelk.c:293:__inode_unlock_lock] Volume01-locks:  Matching lock found for unlock
[Repeats tens of thousands more times - removed due to comment size limit]
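The log excerpts above are dominated by a handful of repeated AFR lock messages. To quantify how often non-blocking inodelks fail and fall back to the blocking path (the pattern suspected of driving the CPU spin), one could tally the log lines per emitting function. This is a minimal, hypothetical sketch, not part of the bug report; the `sample` list stands in for a real log file:

```python
import re
from collections import Counter

# Match the function name inside the "[file.c:line:function]" tag of a
# glusterfs debug log line.
TAG_RE = re.compile(r"\[\S+\.c:\d+:(\w+)\]")

def tally(lines):
    """Count glusterfs log lines per emitting function, and separately
    count non-blocking inodelk failures vs. successes."""
    counts = Counter()
    for line in lines:
        m = TAG_RE.search(line)
        if not m:
            continue
        func = m.group(1)
        counts[func] += 1
        if func == "afr_post_nonblocking_inodelk_cbk":
            key = "nonblocking_failed" if "failed" in line else "nonblocking_done"
            counts[key] += 1
    return counts

# Three representative lines copied from the client log above.
sample = [
    "[2011-03-16 09:41:25.941350] D [afr-lk-common.c:410:transaction_lk_op] Volume01-replicate-0: lk op is for a transaction",
    "[2011-03-16 09:41:25.941782] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks failed. Proceeding to blocking",
    "[2011-03-16 09:41:25.942284] D [afr-transaction.c:979:afr_post_nonblocking_inodelk_cbk] Volume01-replicate-0: Non blocking inodelks done. Proceeding to FOP",
]

counts = tally(sample)
print(counts["transaction_lk_op"])   # 1
print(counts["nonblocking_failed"])  # 1
print(counts["nonblocking_done"])    # 1
```

A high ratio of `nonblocking_failed` to `nonblocking_done` over a real log would point at sustained lock contention on the replica.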

Comment 4 Aleks3Y 2011-03-17 14:09:36 UTC
Also, the client's average speed was 20Mbit/s incoming on the public interface and around 70Mbit/s outgoing to the gluster bricks.

I've moved to NFS, and now I'm getting ~300Mbit/s incoming and outgoing to one brick; the bricks then sync with each other over the cross-connect.

CPU usage is still high, but I think that's due to the syncing.

Guys, I can provide graphs if you need them.

Comment 5 Pranith Kumar K 2011-03-22 07:14:06 UTC
hi,
   According to the glusterdumps provided in the attachments, the brick that
produced glusterdump.1179 does not hold any locks, for some reason. I tried to
look through the logs, but the client and server logs have different
timestamps. Would it be possible to zip and post the log files of the two
bricks and the client?
   I am also a bit confused by the second comment. Could you please clarify
why the data is going to only one brick instead of both bricks in the replica
pair?
Do post the graphs as well.

Pranith.

Comment 6 Aleks3Y 2011-03-24 11:27:52 UTC
Created attachment 460

Comment 7 Aleks3Y 2011-03-24 11:30:43 UTC
It's getting worse and worse.
No errors show up in the logs at all, but CPU usage is around 300%.

About the previous comment:
I was able to get only 20Mbit/s with the gluster native client, versus 300Mbit/s with NFS.

Is there any way we can figure this out? I've put it in production, and if it keeps working like this, I'll be asked to remove it :(

Comment 8 Pranith Kumar K 2011-03-24 23:38:46 UTC
(In reply to comment #7)
> It's getting worser and worser.
> NO errors are showing in logs at all, but cpu usage is around 300%.
> 
> About previous comment:
> I was able to get only 20Mbit/s with gluster native client and 300Mbit/s with
> NFS.
> 
> Is there any way we can figure it out? I've put it on production and if it will
> work like that, i'll be asked to remove it :(

Could you please tell me why the traffic/data is going to only one brick instead of both bricks?

Pranith

Comment 9 Aleks3Y 2011-03-25 00:10:40 UTC
Because I'm using NFS right now.
As I said, when I'm using the native client, CPU usage is 100% on the client server and the max speed is 20Mbit/s.

I'm using backup software on the client...

Comment 10 Pranith Kumar K 2011-03-25 00:32:00 UTC
(In reply to comment #9)
> Because i'm using NFS right now.
> As I said, when i'm using native client, CPU usage is 100% on client server and
> max speed is 20Mbit/s
> 
> I'm using backup software on the client...

I am not sure I understand. When you have the bricks Brick1: brick01:/exp1 and
Brick2: brick02:/exp2 in a replica pair, mounting with either fuse or nfs should send data to both brick01:/exp1 and brick02:/exp2. Why is it going to only one of these two bricks?

Pranith

Comment 11 Aleks3Y 2011-03-25 00:43:17 UTC
Yes, with fuse it goes to both servers; in netstat I can see 3 (three) active connections, two to the main brick and one to the 2nd.
When I'm using nfs, it goes to only one, and then the bricks sync with each other via the local connection.

I was sure that's how it's supposed to be.

Comment 12 Pranith Kumar K 2011-03-25 01:03:36 UTC
(In reply to comment #11)
> Yes, with fuse it's going to both servers, in netstat i can see 3 (three)
> active connections, two to main brick and one to 2nd.
> When i'm using nfs, it's only going to one and then bricks are syncing via
> local connection between each other.
> 
> I was sure it's how it's supposed to be.

When you configure the volume as replicate, syncing between the two bricks brick01:/exp1 and brick02:/exp2 is taken care of for you. You don't need to sync them again explicitly over another local connection. How are you achieving the sync between these two bricks over the local connection? With some other software? I don't understand how you managed to mount only one brick with nfs. What command did you use to do this? Could you paste the output of "mount" from the machine where you mounted the volume as fuse and as nfs.
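To illustrate what the requested "mount" output distinguishes: a GlusterFS native mount shows up with filesystem type `fuse.glusterfs` (the client itself writes to every brick in the replica), while an NFS mount of the same volume shows type `nfs` (traffic goes only to the NFS server host, whose built-in gluster client then replicates to its peer, which matches the reporter's observation). A small, hypothetical sketch classifying such lines; the sample strings are illustrative assumptions, not the reporter's actual output:

```python
def classify_mount_line(line):
    """Return 'fuse', 'nfs', or None for one line of `mount` output.

    A glusterfs fuse mount typically reads:
        brick01:/Volume01 on /mnt type fuse.glusterfs (rw,...)
    while an NFS mount of the same volume reads:
        brick01:/Volume01 on /mnt type nfs (rw,vers=3,...)
    """
    parts = line.split()
    if "type" not in parts:
        return None
    fstype = parts[parts.index("type") + 1]
    if fstype == "fuse.glusterfs":
        return "fuse"  # native client: replicates to all bricks itself
    if fstype.startswith("nfs"):
        return "nfs"   # traffic reaches one server; that server replicates
    return None

print(classify_mount_line(
    "brick01:/Volume01 on /mnt type fuse.glusterfs (rw,allow_other)"))  # fuse
print(classify_mount_line(
    "brick01:/Volume01 on /mnt type nfs (rw,vers=3)"))                  # nfs
```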

Pranith

Comment 13 Pranith Kumar K 2011-11-22 07:19:12 UTC
(In reply to comment #12)
> (In reply to comment #11)
> > Yes, with fuse it's going to both servers, in netstat i can see 3 (three)
> > active connections, two to main brick and one to 2nd.
> > When i'm using nfs, it's only going to one and then bricks are syncing via
> > local connection between each other.
> > 
> > I was sure it's how it's supposed to be.
> 
> When you configure the volume as replicate, syncing between the two bricks
> brick01:/exp1 brick02:/exp2 is taken care. You don't need to sync them again
> using another local connection explicitly. How are you achieving the sync
> between these two bricks using local connection? using some other software?. I
> dont understand how you managed to mount only one brick with nfs. What command
> did you use to do this?. Could you paste the output of "mount" from the machine
> where you mounted the volume as fuse and nfs.
> 
> Pranith

There has been no response on this bug from you. It's been a long time, and we haven't seen similar problems from anyone else. Are you still hitting this issue?

Comment 14 Aleks3Y 2011-11-22 21:06:48 UTC
Unfortunately, we stopped using it :(

(In reply to comment #13)
> (In reply to comment #12)
> > (In reply to comment #11)
> > > Yes, with fuse it's going to both servers, in netstat i can see 3 (three)
> > > active connections, two to main brick and one to 2nd.
> > > When i'm using nfs, it's only going to one and then bricks are syncing via
> > > local connection between each other.
> > > 
> > > I was sure it's how it's supposed to be.
> > 
> > When you configure the volume as replicate, syncing between the two bricks
> > brick01:/exp1 brick02:/exp2 is taken care. You don't need to sync them again
> > using another local connection explicitly. How are you achieving the sync
> > between these two bricks using local connection? using some other software?. I
> > dont understand how you managed to mount only one brick with nfs. What command
> > did you use to do this?. Could you paste the output of "mount" from the machine
> > where you mounted the volume as fuse and nfs.
> > 
> > Pranith
> 
> There is no response on this bug from you. Its been a long time and We havent
> faced similar problems from any other. Are you still hitting this issue?.

Comment 15 Pranith Kumar K 2011-12-07 09:09:56 UTC
We don't have enough info to debug this issue, and no one else has reported a similar one. It is working fine for us, so I am closing this for now. Feel free to reopen it if you hit it again.