[SASAG] Fwd: NFS server stumbling

Stuart Kendrick skendric at fhcrc.org
Fri Nov 4 04:42:18 PDT 2011

I'm working with an NFS server which is stumbling.  And I'm looking for
insights from our collective expertise.

The problem presents itself as slow 'ls' output -- sometimes, 'ls' can
spend seconds or minutes (1, 5, 20! minutes) to produce a listing of a
directory containing tens or hundreds of files.  Happens maybe a dozen
times a day, across a handful of users on a handful of NFS clients (all
OpenSuse 11.3) mounting their volumes from one particular NFS server
(Solaris 10 9/10 s10x_u9wos_14a X86, assembled 2010-08-11) (which in
turn builds its LUNs from any and all of three back-end systems -- 3Par,
Hitachi, and NetApp storage arrays).  The sole emacs user experiences
similar hangs on opening files.  Most of the time, these boxes are
reading/writing large files (multi gigabyte output files from genome
sequencing runs), running multi-hour compute jobs on them; we don't know
if performance is affected on these.

This is an infrequent issue -- the vast majority of the time,
performance of 'ls' and 'emacs' feel reasonable.  And the majority of
the clients (scores) and users (several hundred) don't report a problem...

Have any recommends for books which explain NFS, and ONC-RPC in
particular, in detail?  Which explain how to use 'nfsstat' and others to
trouble-shoot?  I have these on-order:
	NFS Illustrated by Callaghan
	Managing NFS and NIS (O'Reilly)

What tools do you use to shed light on NFS issues?  'nfsstat' and
'wireshark' ... others?

Packet traces show WRITEs and COMMITs which can take .0001 seconds to
complete ... or as long as 35 minutes! to complete.

[BTW:  packet traces between Client and Server show no TCP pathology:
no TCP retransmissions or RSTs.  Which is why I'm focusing on the NFS /
storage portion of the stack.]

Here is a WRITE which takes ~5 minutes to complete:

filter:  rpc.xid == 0x1d207259

No.   Abs Time DeltaT Bytes Src     Dst    Info
55808 10:23:46 14.563 2962  Client  Server V3 WRITE Call (Reply In
62624 10:25:46 14.572 2962  Client  Server [RPC retransmission of
                          #55808] V3 WRITE Call (Reply In 85295)
69434 10:26:46 14.579 2962  Client  Server [RPC retransmission of
                          #55808] V3 WRITE Call (Reply In 85295)
76235 10:28:46 14.577 2962  Client  Server [RPC retransmission of
                          #55808] V3 WRITE Call (Reply In 85295)
85295 10:29:22 0.860  206   Server  Client V3 WRITE Reply (Call In

Looks like some WRITEs needed ~20 minutes to complete, while some
COMMITs needed ~35 minutes.

ONC-RPC Server Response Time statistics for NFSv3:
Procedure     Calls     Min SRT     Max SRT     Avg SRT
WRITE            89     .000180     1188.59     260.127
COMMIT            5     328.581     2106.62     1163.34
FSSTAT           14     .000180     .000441     .000260
ACCESS            3     .000187     .000456     .000282

ONC-RPC Program Statistics:
Program   Version   Calls  Min SRT  Max SRT  Avg SRT
NFS       3         111    .000180  2106.62  260.973

That '2106.62' seconds under Max SRT for a COMMIT is where I pulled the
'~35 minutes' figure.


Server:/home/test-user> nfsstat

Server rpc:
Connection oriented:
calls      badcalls nullrecv badlen xdrcall dupchecks dupreqs
6565848596 0        0        0      0       2161188143 128335

Now, I'm new to interpreting nfsstat output.  Anyone have enough
confidence to speak to this?  Here's my thinking:

'calls' are the total number of RPC calls.  'dupchecks' tells us how
many times the NFS server ran one of those calls through cache.  And
'dupreqs' tells us how many retransmits, or, more precisely, how many
times the NFS server realized that it was /already/ working on that call.

dupreqs / calls = 128334 / 6547537906 =~ .000019600%  Not bad at all

i.e. only a tiny fraction of all the RPC calls are being retransmitted.

But then, I start to doubt my interpretation.  I think I'm
misunderstanding the use of the term 'cache' as it relates to dupchecks
.. are we using the term cache in the sense of "If it's in my cache,
then I'll return the answer to you quicker than if I had to retrieve
blocks from storage".  Or are we using the term cache in the way I would
use the term 'queue' ... meaning, if I'm already working on this request
(in my queue), then I won't bother adding your most recent request to my
queue. If the latter, then:

dupchecks / calls = 2161188143 / 6565848596 = ~.33 =~ 33%

Which would suggest that ~33% of all RPCs are retransmitted ... a
different story entirely.


-We've spent effort analyzing health (load, CPU utilization, network IO,
others) on the clients during these events; so far, we believe that the
clients are underutlized while 'ls' is being slow.

-A wrapper around 'ls' logs information:  precisely what 'ls' command
was run, on what volume, by what user, with what flags, and how long it
took to complete

-We're building a table listing which volumes are involved with slow
'ls' commands, and which LUNs comprise those volumes.  [Looks like some
of the volumes are built from spindles hosted on multiple storage
systems ... using concats, not stripes, meaning some files probably live
on a single spindle, rather than being striped across multiple spindles
... more investigation needed here.]

-We've started to chart IO latency on the relevant volumes


Stuart Kendrick

More information about the Members mailing list