Discussion:
state recovery failed on NFSv4 server with error 2
Thomas Garner
2008-03-17 05:00:48 UTC
Permalink
I have a Nexenta file server (1.0RC2 --- b80), serving my home
directory over nfs4 to several linux machines running Debian. On my
main machine (currently the stock Debian kernel 2.6.18-6-k7, though I had
the same problem under 2.6.23), after anywhere from 3 minutes to
several days, the load average starts creeping up as the following
message fills syslog at ~3k lines per second:

Error: state recovery failed on NFSv4 server 192.168.0.10 with error 2

Searching around the web has shed little light on this issue. While it
is repeatable (mostly via a combination of Firefox and JEdit), I have
yet to determine a surefire way to get it to start, or to stop.
Tweaking the nfs mount options, I am now able to force-umount the nfs
mount with some luck, but that puts only the tiniest band-aid on the
problem, as it normally reverts quickly into the failed recovery state
(though it does tend to ward off a hard reboot for a little longer).
It's getting exceptionally frustrating to have to kill X and every
process with an open file on the mount, or worse, to do an emergency
sync / remount-ro / reboot.
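
For the record, the forced unmount is something along these lines (the
mount point here is just an example):

umount -f /home/thomas    # forced unmount; can still hang while recovery is spinning
umount -l /home/thomas    # lazy detach as a fallback, so new accesses stop piling up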

Looking forward to debugging and resolving,
Thomas
Thomas Garner
2008-03-24 17:34:41 UTC
Permalink
Can no one help me debug this?
Post by Thomas Garner
I have a Nexenta file server (1.0RC2 --- b80), serving my home
directory over nfs4 to several linux machines running Debian. [...]
Error: state recovery failed on NFSv4 server 192.168.0.10 with error 2
[...]
Looking forward to debugging and resolving,
Thomas
Benjamin Coddington
2008-03-26 12:17:00 UTC
Permalink
This error is thrown when the client's lease has expired and the client
is attempting to reclaim it (the "2" is most likely just errno 2, ENOENT).

You may have a poorly behaved server. Any debug output from there? Have
you tried a network capture while this happens?
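
On the Solaris side a capture would be something like this (the
interface name is a guess, adjust to your hardware):

snoop -d e1000g0 -o /tmp/nfs.snoop port 2049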

Ben
Post by Thomas Garner
Can no one help me debug this?
[...]
Thomas Garner
2008-03-27 06:15:20 UTC
Permalink
Thanks, Benjamin, for looking into this.
Post by Benjamin Coddington
You may have a poorly behaved server? Any debug from there? Tried a
network capture while this happens?
Hopefully not a poorly behaved server, since it's OpenSolaris and Sun's
nfsd (perhaps it's not a guaranteed assumption that Sun has the best nfs
implementation?). I'm not exactly sure what you're looking for in a
packet dump, but here's a tcpdump from the client (192.168.0.99) to the
server (192.168.0.10), taken while Linux is throwing the error. I have
the full hex dump from the client and a snoop dump from the server, so
let me know if you'd prefer more/different dumps.

01:22:35.043981 IP 192.168.0.10.2049 > 192.168.0.99.667: . ack 2529206211
win 47784 <nop,nop,timestamp 17427540 21670900>
01:22:35.044072 IP 192.168.0.10.2049 > 192.168.0.99.607590372: reply ok 56
01:22:35.044188 IP 192.168.0.99.691476452 > 192.168.0.10.2049: 1448 getattr
fh 0,0/22
01:22:35.044196 IP 192.168.0.99.1954115685 > 192.168.0.10.2049: 1448
proc-1634886504
01:22:35.044203 IP 192.168.0.99.1953703521 > 192.168.0.10.2049: 1448
proc-1046836323
01:22:35.044208 IP 192.168.0.99.543255660 > 192.168.0.10.2049: 1448
proc-1953066862
01:22:35.044214 IP 192.168.0.99.1914846835 > 192.168.0.10.2049: 1448
proc-1042294894
01:22:35.044220 IP 192.168.0.99.1835284020 > 192.168.0.10.2049: 1168
proc-1886353253
01:22:35.044303 IP 192.168.0.10.2049 > 192.168.0.99.667: . ack 4345 win
43440 <nop,nop,timestamp 17427540 21670900>
01:22:35.044399 IP 192.168.0.10.2049 > 192.168.0.99.667: . ack 6961 win
49232 <nop,nop,timestamp 17427540 21670900>
01:22:35.044500 IP 192.168.0.10.2049 > 192.168.0.99.624367588: reply ok 56
01:22:35.044528 IP 192.168.0.10.2049 > 192.168.0.99.641144804: reply ok 56
01:22:35.044570 IP 192.168.0.99.708253668 > 192.168.0.10.2049: 1448 getattr
fh 0,0/22
01:22:35.044576 IP 192.168.0.99.1685415200 > 192.168.0.10.2049: 1448
proc-1701671722
01:22:35.044582 IP 192.168.0.99.975203951 > 192.168.0.10.2049: 1448
proc-1936026725
01:22:35.044588 IP 192.168.0.99.2064255332 > 192.168.0.10.2049: 1448
proc-170093669
01:22:35.044594 IP 192.168.0.99.540024880 > 192.168.0.10.2049: 1448
proc-1886920819
01:22:35.044599 IP 192.168.0.99.1179004976 > 192.168.0.10.2049: 1168
proc-1667591796
01:22:35.044607 IP 192.168.0.99.725030884 > 192.168.0.10.2049: 196 getattr
fh 0,0/22
01:22:35.044692 IP 192.168.0.10.2049 > 192.168.0.99.667: . ack 10053 win
46336 <nop,nop,timestamp 17427540 21670900>
01:22:35.044797 IP 192.168.0.10.2049 > 192.168.0.99.667: . ack 12949 win
43440 <nop,nop,timestamp 17427540 21670900>
01:22:35.045565 IP 192.168.0.10.2049 > 192.168.0.99.667: . ack 15565 win
49232 <nop,nop,timestamp 17427540 21670900>
01:22:35.045620 IP 192.168.0.10.2049 > 192.168.0.99.667: . ack 18461 win
46336 <nop,nop,timestamp 17427540 21670900>
01:22:35.045730 IP 192.168.0.10.2049 > 192.168.0.99.657922020: reply ok 56
01:22:35.045747 IP 192.168.0.10.2049 > 192.168.0.99.674699236: reply ok 56
01:22:35.045790 IP 192.168.0.99.741808100 > 192.168.0.10.2049: 1448 getattr
fh 0,0/22
01:22:35.045796 IP 192.168.0.99.946544749 > 192.168.0.10.2049: 1448
proc-1885544563
01:22:35.045802 IP 192.168.0.99.842413105 > 192.168.0.10.2049: 1448
proc-1852272425
01:22:35.045808 IP 192.168.0.99.775696225 > 192.168.0.10.2049: 1448
proc-1752462448
01:22:35.045813 IP 192.168.0.99.1485393161 > 192.168.0.10.2049: 1448
proc-2833983756
01:22:35.045827 IP 192.168.0.99.1140865536 > 192.168.0.10.2049: 1168
proc-1811966976
01:22:35.045836 IP 192.168.0.99.758585316 > 192.168.0.10.2049: 1448 getattr
fh 0,0/22
01:22:35.045842 IP 192.168.0.99.1718578804 > 192.168.0.10.2049: 1448
proc-174144108
01:22:35.045848 IP 192.168.0.99.1952981091 > 192.168.0.10.2049: 1416
proc-1852272425
01:22:35.045885 IP 192.168.0.10.2049 > 192.168.0.99.667: . ack 22773 win
46336 <nop,nop,timestamp 17427540 21670901>
01:22:35.046034 IP 192.168.0.10.2049 > 192.168.0.99.667: . ack 25669 win
43440 <nop,nop,timestamp 17427540 21670901>
01:22:35.046209 IP 192.168.0.10.2049 > 192.168.0.99.667: . ack 28285 win
49232 <nop,nop,timestamp 17427540 21670901>
01:22:35.046261 IP 192.168.0.10.2049 > 192.168.0.99.667: . ack 31181 win
46336 <nop,nop,timestamp 17427540 21670901>
01:22:35.046294 IP 192.168.0.10.2049 > 192.168.0.99.691476452: reply ok 56
getattr ERROR: unk 10011
01:22:35.046347 IP 192.168.0.99.775362532 > 192.168.0.10.2049: 1448 getattr
fh 0,0/22
01:22:35.046353 IP 192.168.0.99.1954115685 > 192.168.0.10.2049: 1448
proc-1634886504
01:22:35.046359 IP 192.168.0.99.1953703521 > 192.168.0.10.2049: 1448
proc-1046836323
01:22:35.046365 IP 192.168.0.99.543255660 > 192.168.0.10.2049: 1448
proc-1953066862
01:22:35.046370 IP 192.168.0.99.1914846835 > 192.168.0.10.2049: 1448
proc-1042294894
01:22:35.046377 IP 192.168.0.99.1835284020 > 192.168.0.10.2049: 1168
proc-1886353253
01:22:35.046621 IP 192.168.0.10.2049 > 192.168.0.99.667: . ack 34077 win
43440 <nop,nop,timestamp 17427540 21670901>
01:22:35.047209 IP 192.168.0.10.2049 > 192.168.0.99.667: . ack 36693 win
49232 <nop,nop,timestamp 17427540 21670901>
01:22:35.047320 IP 192.168.0.10.2049 > 192.168.0.99.667: . ack 38337 win
47784 <nop,nop,timestamp 17427540 21670901>
01:22:35.047352 IP 192.168.0.10.2049 > 192.168.0.99.708253668: reply ok 56
getattr ERROR: unk 10011
01:22:35.047375 IP 192.168.0.10.2049 > 192.168.0.99.725030884: reply ok 56
getattr ERROR: unk 10011

Thomas
Benjamin Coddington
2008-03-27 12:42:28 UTC
Permalink
Post by Thomas Garner
Thanks, Benjamin, for looking into this.
Well, I barely know what I'm doing, but I'm willing to help.

You could also turn up your client debugging and send that along:

rpcdebug -m rpc -s all
rpcdebug -m nfs -s all
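
(And once you've captured enough, the same tool clears the flags again
-- this logging is very noisy:)

rpcdebug -m rpc -c all    # clear all RPC debug flags
rpcdebug -m nfs -c all    # clear all NFS client debug flags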
J. Bruce Fields
2008-03-27 13:46:37 UTC
Permalink
Post by Thomas Garner
Thanks, Benjamin, for looking into this.
Post by Benjamin Coddington
You may have a poorly behaved server? Any debug from there? Tried a
network capture while this happens?
Hopefully not a poorly behaved server, since it's OpenSolaris and Sun's
nfsd (perhaps it's not a guaranteed assumption that Sun has the best nfs
implementation?). I'm not exactly sure what you're looking for in a
packet dump, but here's a tcpdump from the client (192.168.0.99) to the
server (192.168.0.10), taken while Linux is throwing the error. I have
the full hex dump from the client and a snoop dump from the server, so
let me know if you'd prefer more/different dumps.
Yeah, there's not quite enough information here. What I usually ask
people to do is run something like

tcpdump -s0 -wtmp.pcap

(and add whatever filter might help get just the nfsv4 traffic) and then
send me tmp.pcap.
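
For example, with your server that might be (2049 being the standard
NFS port):

tcpdump -s0 -w tmp.pcap host 192.168.0.10 and port 2049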

--b.
Post by Thomas Garner
[...]
Thomas Garner
2008-03-27 15:32:35 UTC
Permalink
Since it'll be a while before the error is triggered again, this is
the first 15M of my first 2.6G dump during the error, created using
the following tcpdump options on the client side:

tcpdump -s 1500 -w ~/nfs_dump_argento -x -i eth1

http://s120158928.onlinehome.us/nfs_dump_argento.001.bz2

Thomas
Post by J. Bruce Fields
Yeah, there's not quite enough information here. What I usually ask
people to do is run something like
tcpdump -s0 -wtmp.pcap
(and add whatever filter might help get just the nfsv4 traffic) and then
send me tmp.pcap.
Trond Myklebust
2008-03-27 18:13:16 UTC
Permalink
Post by Thomas Garner
Since it'll be a while before the error is triggered again, this is
the first 15M of my first 2.6G dump during the error, created using
tcpdump -s 1500 -w ~/nfs_dump_argento -x -i eth1
http://s120158928.onlinehome.us/nfs_dump_argento.001.bz2
Thomas
It looks as if the client is spewing new READ and WRITE requests without
handling the NFS4ERR_EXPIRED error returned by the server (that's the
"getattr ERROR: unk 10011" in your dump: 10011 is NFS4ERR_EXPIRED, which
tcpdump doesn't know how to decode). AFAICS, that can only happen if the
recovery thread Oopses.

Does 'ps -efww' show a thread with a name that starts with the
IP-address of the server?
Do the syslogs, or does 'dmesg -s 90000' show an Oops involving that
process?
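
Concretely, something like this (the grep patterns are only illustrative):

ps -efww | grep 192.168.0.10     # the state recovery thread is named after the server IP
dmesg -s 90000 | grep -i -A 20 oops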

Cheers
Trond
Thomas Garner
2008-03-31 06:00:20 UTC
Permalink
Ok, this finally re-occurred.
Post by Trond Myklebust
Does 'ps -efww' show a thread with a name that starts with the
IP-address of the server?
This time, no, but I have seen a process named like that before.
Post by Trond Myklebust
Do the syslogs, or does 'dmesg -s 90000' show an Oops involving that
process?
No oopses seen in dmesg, /var/log/messages, or /var/log/kern.log.

No lockd running.

Log files: first, a tcpdump from the client, with these command-line args:

tcpdump -s 0 -w ~/nfs_dump_argento2 -x -i eth1

http://s120158928.onlinehome.us/nfs_dump_argento2.001.bz2

And then a section of /var/log/messages captured with both of these
debug settings turned on:

rpcdebug -m rpc -s all
rpcdebug -m nfs -s all

http://s120158928.onlinehome.us/messages.bz2

Again, I really appreciate you guys looking into this.

Thomas
Thomas Garner
2008-04-04 18:16:07 UTC
Permalink
Has anyone had any time to look into this?

Thanks!
Thomas
Post by Thomas Garner
Ok, this finally re-occurred.
[...]
Trond Myklebust
2008-04-05 20:21:16 UTC
Permalink
Post by Thomas Garner
Has anyone had any time to look into this?
Thanks!
Thomas
[...]
The only explanation I can find for this is that the calls to
nfs4_get_renew_cred() and nfs4_get_setclientid_cred() are failing. That
in turn would mean that we've called nfs4_drop_state_owner() in a
situation where we shouldn't...

Could you see if the attached patch helps?

Cheers
Trond



Thomas Garner
2008-04-06 19:10:45 UTC
Permalink
I applied the patch last night and recompiled. We'll see if the
problem resurfaces.
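
For the record, the cycle was roughly the following (the tree location
and patch filename here are just examples):

cd /usr/src/linux
patch -p1 < ~/nfs4-state-owner.diff    # Trond's one-hunk change to fs/nfs/nfs4proc.c
make oldconfig && make && make modules_install install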

Thomas

On Sat, Apr 5, 2008 at 4:21 PM, Trond Myklebust
Post by Trond Myklebust
[...]
The only explanation I can find for this is that the calls to
nfs4_get_renew_cred() and nfs4_get_setclientid_cred() are failing. [...]
Could you see if the attached patch helps?
Cheers
Trond
---------- Forwarded message ----------
From: Trond Myklebust <Trond.Myklebust at netapp.com>
Date: Sat, 5 Apr 2008 15:54:17 -0400
Subject: No Subject
There should be no need to invalidate a perfectly good state owner just
because of a stale filehandle. Doing so can cause the state recovery code
to break, since nfs4_get_renew_cred() and nfs4_get_setclientid_cred() rely
on finding active state owners.
Signed-off-by: Trond Myklebust <Trond.Myklebust at netapp.com>
---
fs/nfs/nfs4proc.c | 5 +----
1 files changed, 1 insertions(+), 4 deletions(-)
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index f38d057..424aa20 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -982,11 +982,8 @@ static int _nfs4_open_expired(struct nfs_open_context *ctx, struct nfs4_state *s
 	if (IS_ERR(opendata))
 		return PTR_ERR(opendata);
 	ret = nfs4_open_recover(opendata, state);
-	if (ret == -ESTALE) {
-		/* Invalidate the state owner so we don't ever use it again */
-		nfs4_drop_state_owner(state->owner);
+	if (ret == -ESTALE)
 		d_drop(ctx->path.dentry);
-	}
 	nfs4_opendata_put(opendata);
 	return ret;
 }
Robin Bowes
2008-04-14 14:36:01 UTC
Permalink
Just adding another data point...

I'm seeing the same error with OpenSolaris SXCE b85 as the server, and
CentOS 5.1 i386 as a xen guest on a CentOS 5.1 x86_64 host.

The problem manifests itself when I run a largish rsync process, for
example, syncing the CentOS 5 x86_64 updates repo locally.
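
Something like the following, where the mirror URL and local path are
placeholders:

rsync -av rsync://mirror.example.org/centos/5/updates/x86_64/ /srv/mirror/centos-5-updates/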

Is there anything I can do to provide more information?

Thanks,

R.
Thomas Garner
2008-06-28 01:21:37 UTC
Permalink
I know this was like forever ago, but I've been running this patch
against 2.6.24, and I don't think I've seen this resurface. Though I
do still have lockups, I have no reason to believe they're nfs-related,
as I don't have any debug info for them and am currently chalking them
up to unknown hardware faults.

Thomas
Post by Thomas Garner
I applied the patch last night and recompiled. We'll see if the
problem resurfaces.
[...]