deliverable/linux.git
7 years agoNFSv4: Cap the transport reconnection timer at 1/2 lease period
Trond Myklebust [Fri, 5 Aug 2016 23:03:31 +0000 (19:03 -0400)] 
NFSv4: Cap the transport reconnection timer at 1/2 lease period

We don't want to miss a lease period renewal due to the TCP connection
failing to reconnect in a timely fashion. To ensure this doesn't happen,
cap the reconnection timer so that we retry the connection attempt
at least every 1/2 lease period.
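
As a rough illustration of the cap described above (a user-space sketch, not the kernel change itself; the helper name and the use of plain seconds are assumptions made for the example):

```c
#include <assert.h>

/*
 * Hypothetical helper: clamp a reconnect delay to half the lease period.
 * Both arguments are in seconds; the name and units are illustrative only.
 */
static unsigned long cap_reconnect_delay(unsigned long delay,
                                         unsigned long lease_period)
{
    unsigned long cap = lease_period / 2;

    assert(lease_period > 0);
    if (cap == 0)
        cap = 1;    /* keep a non-zero cap for very short leases */
    return delay < cap ? delay : cap;
}
```

With a 90 second lease, a 300 second backoff would be cut to 45 seconds, while a 10 second backoff is left alone.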

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFSv4: Cleanup the setting of the nfs4 lease period
Trond Myklebust [Fri, 5 Aug 2016 23:13:08 +0000 (19:13 -0400)] 
NFSv4: Cleanup the setting of the nfs4 lease period

Make a helper function nfs4_set_lease_period() and have
nfs41_setup_state_renewal() and nfs4_do_fsinfo() use it.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoSUNRPC: Limit the reconnect backoff timer to the max RPC message timeout
Trond Myklebust [Thu, 4 Aug 2016 04:08:45 +0000 (00:08 -0400)] 
SUNRPC: Limit the reconnect backoff timer to the max RPC message timeout

...and ensure that we propagate it to new transports on the same
client.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoSUNRPC: Fix reconnection timeouts
Trond Myklebust [Thu, 4 Aug 2016 04:00:33 +0000 (00:00 -0400)] 
SUNRPC: Fix reconnection timeouts

When the connect attempt fails and backs off, we should start the clock
at the last connection attempt, not the time at which we queue up the
reconnect job.
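
The idea can be sketched in user space as follows. The function name and the flat time unit are assumptions for illustration, and counter wraparound is deliberately ignored:

```c
#include <assert.h>

/*
 * Hypothetical sketch: how long to wait before the next connect attempt.
 * All values share one time unit (for example jiffies or milliseconds).
 */
static unsigned long reconnect_wait(unsigned long now,
                                    unsigned long last_attempt,
                                    unsigned long backoff)
{
    unsigned long elapsed;

    assert(now >= last_attempt);    /* sketch ignores counter wraparound */
    elapsed = now - last_attempt;
    /* Time already spent since the last attempt counts against the backoff. */
    return elapsed >= backoff ? 0 : backoff - elapsed;
}
```

If the reconnect job is queued 20 units after the last attempt with a backoff of 30, only the remaining 10 units are waited rather than a fresh 30.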

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFSv4.2: LAYOUTSTATS may return NFS4ERR_ADMIN/DELEG_REVOKED
Trond Myklebust [Fri, 5 Aug 2016 16:16:19 +0000 (12:16 -0400)] 
NFSv4.2: LAYOUTSTATS may return NFS4ERR_ADMIN/DELEG_REVOKED

We should handle those errors in the same way we handle the other
stateid errors: by invalidating the faulty layout stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoSUNRPC: disable the use of IPv6 temporary addresses.
NeilBrown [Thu, 4 Aug 2016 06:24:28 +0000 (16:24 +1000)] 
SUNRPC: disable the use of IPv6 temporary addresses.

If the net.ipv6.conf.*.use_tempaddr sysctl is set to '2',
then TCP connections over IPv6 will prefer a 'private' source
address.
These eventually expire and become invalid, typically after a week,
but the time is configurable.

When the local address becomes invalid the client will not be able to
receive replies from the server.  Eventually the connection will time out
or break and a new connection will be established, but this can take
half an hour (typically TCP connection break time).

RFC 4941, which describes private IPv6 addresses, acknowledges that some
applications might not work well with them and that the application may
explicitly request a non-temporary (i.e. "public") address.

I believe this is correct for SUNRPC clients.  Without this change, a
client will occasionally experience a long delay if private addresses
have been enabled.

The privacy offered by private addresses is of little value for an NFS
server which requires client authentication.

For NFSv3 this will often not be a problem because idle connections are
closed after 5 minutes.  For NFSv4, connections never go idle due to the
periodic RENEW (or equivalent) request.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoSUNRPC: allow for upcalls for same uid but different gss service
Olga Kornievskaia [Thu, 4 Aug 2016 00:19:48 +0000 (20:19 -0400)] 
SUNRPC: allow for upcalls for same uid but different gss service

It's possible to have simultaneous upcalls for the same UID but
different GSS services. In that case, we need to allow the
upcall to gssd to proceed so that the same context is not used
by two different GSS services. Some servers lock the use of a context
to the GSS service.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Cc: stable@vger.kernel.org # v3.9+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoSUNRPC: Fix up socket autodisconnect
Trond Myklebust [Tue, 2 Aug 2016 17:47:43 +0000 (13:47 -0400)] 
SUNRPC: Fix up socket autodisconnect

Ensure that we don't forget to set up the disconnection timer for the
case when a connect request is fulfilled after the RPC request that
initiated it has timed out or been interrupted.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoSUNRPC: Handle EADDRNOTAVAIL on connection failures
Trond Myklebust [Mon, 1 Aug 2016 17:36:08 +0000 (13:36 -0400)] 
SUNRPC: Handle EADDRNOTAVAIL on connection failures

If the connect attempt immediately fails with an EADDRNOTAVAIL error, then
that means our choice of source port number was bad.
This error is expected when we set the SO_REUSEPORT socket option and we
have 2 sockets sharing the same source and destination address and port
combinations.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fixes: 402e23b4ed9ed ("SUNRPC: Fix stupid typo in xs_sock_set_reuseport")
Cc: stable@vger.kernel.org # v4.0+
7 years agopNFS: Actively set attributes as invalid if LAYOUTCOMMIT is outstanding
Benjamin Coddington [Thu, 28 Jul 2016 18:41:10 +0000 (14:41 -0400)] 
pNFS: Actively set attributes as invalid if LAYOUTCOMMIT is outstanding

A LAYOUTCOMMIT and a subsequent GETATTR may both return the same attributes,
and in that case NFS_INO_INVALID_ATTR is never set on the second pass
through nfs_update_inode().  The existing check to skip the clearing of
NFS_INO_INVALID_ATTR if a LAYOUTCOMMIT is outstanding does not help in this
case (see commit 10b7e9ad4488: "pNFS: Don't mark the inode as revalidated
if a LAYOUTCOMMIT is outstanding").  We know that if a LAYOUTCOMMIT is
outstanding then attributes will need updating, so always set
NFS_INO_INVALID_ATTR.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFSv4: Clean up lookup of SECINFO_NO_NAME
Trond Myklebust [Mon, 25 Jul 2016 17:31:14 +0000 (13:31 -0400)] 
NFSv4: Clean up lookup of SECINFO_NO_NAME

Use the minor version ops cached in struct nfs_client instead of looking
them up again.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFSv4.2: Fix warning "variable ‘stateids’ set but not used"
Trond Myklebust [Sun, 24 Jul 2016 21:17:16 +0000 (17:17 -0400)] 
NFSv4.2: Fix warning "variable ‘stateids’ set but not used"

Replace it with a test for whether or not the server sent a stateid in
violation of what we asked for.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFSv4: Fix warning "no previous prototype for ‘nfs4_listxattr’"
Trond Myklebust [Sun, 24 Jul 2016 21:10:52 +0000 (17:10 -0400)] 
NFSv4: Fix warning "no previous prototype for ‘nfs4_listxattr’"

Make it static

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoMerge branch 'nfs-rdma'
Trond Myklebust [Sun, 24 Jul 2016 21:09:02 +0000 (17:09 -0400)] 
Merge branch 'nfs-rdma'

7 years agoMerge branch 'pnfs'
Trond Myklebust [Sun, 24 Jul 2016 21:08:59 +0000 (17:08 -0400)] 
Merge branch 'pnfs'

7 years agoMerge branch 'writeback'
Trond Myklebust [Sun, 24 Jul 2016 21:08:31 +0000 (17:08 -0400)] 
Merge branch 'writeback'

7 years agoMerge branch 'sunrpc'
Trond Myklebust [Sun, 24 Jul 2016 21:08:31 +0000 (17:08 -0400)] 
Merge branch 'sunrpc'

7 years agoSUNRPC: Fix a compiler warning in fs/nfs/clnt.c
Trond Myklebust [Sun, 24 Jul 2016 21:06:28 +0000 (17:06 -0400)] 
SUNRPC: Fix a compiler warning in fs/nfs/clnt.c

Fix the report:

net/sunrpc/clnt.c:2580:1: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration]

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Remove redundant smp_mb() from pnfs_init_lseg()
Trond Myklebust [Sun, 24 Jul 2016 19:14:44 +0000 (15:14 -0400)] 
pNFS: Remove redundant smp_mb() from pnfs_init_lseg()

The layout segment is not visible yet, and won't be until after we grab the inode->i_lock.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Cleanup - do layout segment initialisation in one place
Trond Myklebust [Sun, 24 Jul 2016 19:10:12 +0000 (15:10 -0400)] 
pNFS: Cleanup - do layout segment initialisation in one place

...instead of splitting the initialisation over init_lseg() and
pnfs_layout_process().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Remove redundant stateid invalidation
Trond Myklebust [Thu, 21 Jul 2016 18:45:19 +0000 (14:45 -0400)] 
pNFS: Remove redundant stateid invalidation

The layout stateid will be invalidated once it holds no more layout
segments anyway.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Remove redundant pnfs_mark_layout_returned_if_empty()
Trond Myklebust [Sun, 24 Jul 2016 16:45:47 +0000 (12:45 -0400)] 
pNFS: Remove redundant pnfs_mark_layout_returned_if_empty()

That's already being taken care of in pnfs_layout_remove_lseg().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Clear the layout metadata if the server changed the layout stateid
Trond Myklebust [Sun, 24 Jul 2016 19:04:07 +0000 (15:04 -0400)] 
pNFS: Clear the layout metadata if the server changed the layout stateid

If the server changed the layout stateid's "other" field, then
we should treat the old layout as being completely gone. In that
case, we want to clear the metadata such as scheduled layoutreturns.

Do this by calling pnfs_mark_layout_stateid_invalid().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Cleanup - don't open code pnfs_mark_layout_stateid_invalid()
Trond Myklebust [Fri, 22 Jul 2016 15:25:27 +0000 (11:25 -0400)] 
pNFS: Cleanup - don't open code pnfs_mark_layout_stateid_invalid()

Ensure nfs42_layoutstat_done() and layoutget don't open code layout stateid
invalidation.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: pnfs_mark_matching_lsegs_return() should match the layout sequence id
Trond Myklebust [Fri, 22 Jul 2016 15:13:22 +0000 (11:13 -0400)] 
NFS: pnfs_mark_matching_lsegs_return() should match the layout sequence id

When determining which layout segments to return, we do want
pnfs_mark_matching_lsegs_return to check that they match the layout
sequence id. This ensures that we don't waste time if the server
is replaying a layout recall that has already been satisfied.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Do not set plh_return_seq for non-callback related layoutreturns
Trond Myklebust [Thu, 21 Jul 2016 17:06:18 +0000 (13:06 -0400)] 
pNFS: Do not set plh_return_seq for non-callback related layoutreturns

In cases where we need to send a layoutreturn in order to propagate
an error, we should not tie that to a specific layout stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Ensure layoutreturn acts as a completion for layout callbacks
Trond Myklebust [Thu, 21 Jul 2016 16:44:15 +0000 (12:44 -0400)] 
pNFS: Ensure layoutreturn acts as a completion for layout callbacks

When we return NFS_OK to the CB_LAYOUTRECALL, we are required to
send a layoutreturn that "completes" that layout recall request, using
the correct stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Fix CB_LAYOUTRECALL stateid verification
Trond Myklebust [Sun, 24 Jul 2016 01:11:43 +0000 (21:11 -0400)] 
pNFS: Fix CB_LAYOUTRECALL stateid verification

We want to evaluate in this order:

If the client holds no layout for this inode, then return
NFS4ERR_NOMATCHING_LAYOUT; it probably forgot the layout.

If the client finds the inode among the list of layouts, but the corresponding
stateid has not yet been initialised, then return NFS4ERR_DELAY to ask the
server to retry once the outstanding LAYOUTGET is complete.

If the current layout stateid's "other" field does not match the recalled
stateid, return NFS4ERR_BAD_STATEID.

If already processing a layout recall with a newer stateid, return
NFS4ERR_OLD_STATEID. This can only happen for servers that are
non-compliant with the NFSv4.1 protocol.

If already processing a layout recall with an older stateid, return
NFS4ERR_DELAY to ask the server to retry once the outstanding
LAYOUTRETURN is complete. Again, this is technically non-compliant with
the NFSv4.1 protocol.

If the current layout sequence id is newer than the recalled stateid's
sequence id, return NFS4ERR_OLD_STATEID. This too implies protocol
non-compliance.

If the current layout sequence id is older than the recalled stateid's
sequence id+1, return NFS4ERR_DELAY.
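
The evaluation order above can be sketched as a simple decision ladder. This is a user-space illustration, not the kernel function: the struct layout, enum values and helper names are hypothetical. Three assumptions are made here: sequence ids are compared in a wraparound-safe way, a recall whose seqid equals the one already being processed falls through to the remaining checks, and the last rule is read as "the recalled seqid is more than one ahead of the current layout seqid".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative status codes standing in for the NFS4ERR_* values. */
enum recall_status {
    RECALL_OK,
    ERR_NOMATCHING_LAYOUT,
    ERR_DELAY,
    ERR_BAD_STATEID,
    ERR_OLD_STATEID
};

struct stateid {
    uint32_t seqid;
    unsigned char other[12];
};

/* Hypothetical per-inode layout state. */
struct layout_state {
    bool stateid_valid;         /* false while the first LAYOUTGET is outstanding */
    struct stateid stateid;     /* current layout stateid */
    bool recall_in_progress;    /* already handling a layout recall */
    uint32_t recall_seqid;      /* seqid of the recall being handled */
};

/* Assumed wraparound-safe test for "a is newer than b". */
static bool seqid_newer(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}

static enum recall_status check_recall(const struct layout_state *lo,
                                       const struct stateid *recalled)
{
    if (lo == NULL)                     /* no layout held for this inode */
        return ERR_NOMATCHING_LAYOUT;
    if (!lo->stateid_valid)             /* stateid not initialised yet */
        return ERR_DELAY;
    if (memcmp(lo->stateid.other, recalled->other,
               sizeof(recalled->other)) != 0)
        return ERR_BAD_STATEID;
    if (lo->recall_in_progress) {
        if (seqid_newer(lo->recall_seqid, recalled->seqid))
            return ERR_OLD_STATEID;     /* already on a newer recall */
        if (seqid_newer(recalled->seqid, lo->recall_seqid))
            return ERR_DELAY;           /* older recall still in flight */
    }
    if (seqid_newer(lo->stateid.seqid, recalled->seqid))
        return ERR_OLD_STATEID;         /* our layout is newer than the recall */
    if (seqid_newer(recalled->seqid, lo->stateid.seqid + 1))
        return ERR_DELAY;               /* recall is ahead of what we have seen */
    return RECALL_OK;
}
```

With a current layout seqid of 5, a recall for seqid 6 is accepted, one for seqid 4 yields the old-stateid error, and one for seqid 8 is asked to retry later.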

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Always update the layout barrier seqid on LAYOUTGET
Trond Myklebust [Sun, 24 Jul 2016 15:46:06 +0000 (11:46 -0400)] 
pNFS: Always update the layout barrier seqid on LAYOUTGET

Currently, pnfs_set_layout_stateid() will update the layout sequence
id barrier only if the stateid itself is newer than the current
layout stateid. However, in a situation where multiple LAYOUTGET calls
and a LAYOUTRETURN race, it is entirely possible for one of the
LAYOUTGET calls to set the current stateid to something newer than the
LAYOUTRETURN that needs to set the barrier.

The fix is to allow the "update_barrier" flag to force a check as to
whether or not the barrier needs to be updated.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Always update the layout stateid if NFS_LAYOUT_INVALID_STID is set
Trond Myklebust [Sun, 24 Jul 2016 15:39:03 +0000 (11:39 -0400)] 
pNFS: Always update the layout stateid if NFS_LAYOUT_INVALID_STID is set

If the layout stateid is invalid, then pnfs_set_layout_stateid() must
always initialise it.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Clear the layout return tracking on layout reinitialisation
Trond Myklebust [Thu, 21 Jul 2016 15:53:29 +0000 (11:53 -0400)] 
pNFS: Clear the layout return tracking on layout reinitialisation

Ensure that we don't carry over layoutreturn info from a previous
incarnation of this layout.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: LAYOUTRETURN should only update the stateid if the layout is valid
Trond Myklebust [Sun, 24 Jul 2016 16:26:34 +0000 (12:26 -0400)] 
pNFS: LAYOUTRETURN should only update the stateid if the layout is valid

If the layout was completely returned, then ignore the returned layout
stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoMerge commit 'e7bdea7750eb'
Trond Myklebust [Sun, 24 Jul 2016 16:51:10 +0000 (12:51 -0400)] 
Merge commit 'e7bdea7750eb'

Needed in order to work on top of pNFS changes in Linus' upstream kernel.

7 years agonfs: don't create zero-length requests
Benjamin Coddington [Mon, 18 Jul 2016 14:41:57 +0000 (10:41 -0400)] 
nfs: don't create zero-length requests

NFS doesn't expect requests with wb_bytes set to zero and may make
unexpected decisions about how to handle that request at the page IO layer.
Skip request creation if we won't have any wb_bytes in the request.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Weston Andros Adamson <dros@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoFix NULL pointer dereference in bl_free_device().
Artem Savkov [Thu, 21 Jul 2016 11:32:04 +0000 (13:32 +0200)] 
Fix NULL pointer dereference in bl_free_device().

When bl_parse_deviceid() fails in bl_alloc_deviceid_node() at the
blkdev_get_by_*() step, we get a pnfs_block_dev struct that is
uninitialized except for the bdev field, which is set to whatever error
blkdev_get_by_*() returns.  bl_free_device() then tries to call
blkdev_put() if bdev is not NULL, resulting in an invalid pointer dereference.

Fixing this by setting bdev in struct pnfs_block_dev only if we didn't
get an error from blkdev_get_by_*().
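
The pattern can be illustrated in user space roughly as follows. The error-pointer helpers below only imitate the kernel's ERR_PTR()/IS_ERR() convention, and the struct and function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* User-space imitations of the kernel's error-pointer helpers. */
static void *err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

static int is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)(-MAX_ERRNO);
}

/* Hypothetical stand-in for the device structure. */
struct fake_dev {
    void *bdev;
};

/* Store the open result only when it is a real pointer, not an error. */
static int open_dev(struct fake_dev *d, void *result)
{
    if (is_err(result))
        return (int)(intptr_t)result;   /* d->bdev stays NULL */
    d->bdev = result;
    return 0;
}

/* The free path only needs to release the device if bdev was ever set. */
static int needs_put(const struct fake_dev *d)
{
    return d->bdev != NULL;
}
```

Because an error value is never stored in the bdev field, the "is it non-NULL" test on the free path remains a safe way to decide whether a put is required.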

Signed-off-by: Artem Savkov <asavkov@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS/files: filelayout_write_done_cb must call nfs_writeback_update_inode()
Trond Myklebust [Thu, 21 Jul 2016 13:43:43 +0000 (09:43 -0400)] 
pNFS/files: filelayout_write_done_cb must call nfs_writeback_update_inode()

All write callbacks are required to call nfs_writeback_update_inode() upon
success to ensure that file size changes are recorded, and the attribute
cache is invalidated.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoMerge tag 'nfs-rdma-4.8-2' of git://git.linux-nfs.org/projects/anna/nfs-rdma
Trond Myklebust [Tue, 19 Jul 2016 21:03:36 +0000 (17:03 -0400)] 
Merge tag 'nfs-rdma-4.8-2' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFSoRDMA Cleanup

Fixes an unnecessary semicolon warning found by the kbuild robot.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: fix semicolon.cocci warnings
kbuild test robot [Fri, 15 Jul 2016 22:02:05 +0000 (06:02 +0800)] 
xprtrdma: fix semicolon.cocci warnings

net/sunrpc/xprtrdma/verbs.c:798:2-3: Unneeded semicolon

 Remove unneeded semicolon.

Generated by: scripts/coccinelle/misc/semicolon.cocci

CC: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agosunrpc: Prevent resvport min/max inversion via sysfs and module parameter
Frank Sorenson [Fri, 8 Jul 2016 21:35:25 +0000 (16:35 -0500)] 
sunrpc: Prevent resvport min/max inversion via sysfs and module parameter

The current min/max resvport settings are independently limited
by the entire range of allowed ports, so max_resvport can be
set to a port lower than min_resvport.

Prevent inversion of min/max values when set through sysfs and
module parameter by setting the limits dependent on each other.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agosunrpc: Prevent resvport min/max inversion via sysctl
Frank Sorenson [Fri, 8 Jul 2016 21:35:24 +0000 (16:35 -0500)] 
sunrpc: Prevent resvport min/max inversion via sysctl

The current min/max resvport settings are independently limited
by the entire range of allowed ports, so max_resvport can be
set to a port lower than min_resvport.

Prevent inversion of min/max values when set through sysctl by
setting the limits dependent on each other.
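
A minimal sketch of mutually dependent limits, assuming hypothetical setter functions and a plain struct in place of the real sysctl handlers:

```c
#include <assert.h>

/* Illustrative bounds of the whole allowed port range. */
#define RESVPORT_FLOOR 1u
#define RESVPORT_CEIL  65535u

struct resvport_range {
    unsigned int min;
    unsigned int max;
};

/* Returns 0 on success, -1 if the value would invert or leave the range. */
static int set_min_resvport(struct resvport_range *r, unsigned int val)
{
    if (val < RESVPORT_FLOOR || val > r->max)
        return -1;      /* the minimum may not exceed the current maximum */
    r->min = val;
    return 0;
}

static int set_max_resvport(struct resvport_range *r, unsigned int val)
{
    if (val < r->min || val > RESVPORT_CEIL)
        return -1;      /* the maximum may not drop below the current minimum */
    r->max = val;
    return 0;
}
```

Starting from a range of 665 to 1023, an attempt to set the maximum to 600 is rejected and the range is left unchanged.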

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agosunrpc: Fix reserved port range calculation
Frank Sorenson [Fri, 8 Jul 2016 21:35:23 +0000 (16:35 -0500)] 
sunrpc: Fix reserved port range calculation

The range calculation for choosing the random reserved port will panic
with divide-by-zero when min_resvport == max_resvport, a range of one
port, not zero.

Fix the reserved port range calculation by adding one to the difference.
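
The arithmetic can be sketched as below. The helper name is hypothetical and the random number is passed in as a parameter so the example stays deterministic:

```c
#include <assert.h>

/* Hypothetical helper: map a random number onto [min, max], inclusive. */
static unsigned int pick_resvport(unsigned int rnd,
                                  unsigned int min, unsigned int max)
{
    unsigned int range;

    assert(min <= max);
    range = max - min + 1;      /* one port when min == max, never zero */
    return min + rnd % range;
}
```

Without the "+ 1", min == max would give a range of zero and the modulo would divide by zero; with it, the only possible result is that single port.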

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agosunrpc: Fix bit count when setting hashtable size to power-of-two
Frank Sorenson [Mon, 27 Jun 2016 19:17:19 +0000 (15:17 -0400)] 
sunrpc: Fix bit count when setting hashtable size to power-of-two

Author: Frank Sorenson <sorenson@redhat.com>
Date:   2016-06-27 13:55:48 -0500

    sunrpc: Fix bit count when setting hashtable size to power-of-two

    The hashtable size is incorrectly calculated as the next higher
    power-of-two when being set to a power-of-two.  fls() returns the
    bit number of the most significant set bit, with the least
    significant bit being numbered '1'.  For a power-of-two, fls()
    will return a bit number which is one higher than the number of bits
    required, leading to a hashtable which is twice the requested size.

    In addition, the value of (1 << nbits) will always be at least num,
    so the test will never be true.

    Fix the hash table size calculation to correctly set hashtable
    size, and eliminate the unnecessary check.
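
The off-by-one can be seen with a portable stand-in for fls(). This is a sketch of one way to get the right bit count (applying fls() to num - 1), offered as an assumption rather than a quote of the patch; the function names are illustrative:

```c
#include <assert.h>

/*
 * Portable stand-in for the kernel's fls(): 1-based index of the most
 * significant set bit, or 0 when x == 0.
 */
static int fls_sketch(unsigned int x)
{
    int bit = 0;

    while (x) {
        bit++;
        x >>= 1;
    }
    return bit;
}

/* Smallest nbits such that (1u << nbits) >= num. */
static int hashtbl_bits(unsigned int num)
{
    assert(num >= 1);
    return fls_sketch(num - 1);
}
```

For num = 16, fls(16) is 5, which would give a 32-entry table, whereas fls(15) is 4 and gives the requested 16 entries.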

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agonfs4: flexfiles: respect noresvport when establishing connections to DSes
Tigran Mkrtchyan [Mon, 13 Jun 2016 18:52:00 +0000 (20:52 +0200)] 
nfs4: flexfiles: respect noresvport when establishing connections to DSes

Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agonfs4: clnt: respect noresvport when establishing connections to DSes
Tigran Mkrtchyan [Mon, 13 Jun 2016 17:57:35 +0000 (19:57 +0200)] 
nfs4: clnt: respect noresvport when establishing connections to DSes

result:

$ mount -o vers=4.1 dcache-lab007:/ /pnfs
$ cp /etc/profile /pnfs
tcp        0      0 131.169.185.68:1005     131.169.191.141:32049   ESTABLISHED
tcp        0      0 131.169.185.68:751      131.169.191.144:2049    ESTABLISHED
$

$ mount -o vers=4.1,noresvport dcache-lab007:/ /pnfs
$ cp /etc/profile /pnfs
tcp        0      0 131.169.185.68:34894    131.169.191.141:32049   ESTABLISHED
tcp        0      0 131.169.185.68:35722    131.169.191.144:2049    ESTABLISHED
$

Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopnfs/blocklayout: put deviceid node after releasing bl_ext_lock
Benjamin Coddington [Fri, 10 Jun 2016 20:37:35 +0000 (16:37 -0400)] 
pnfs/blocklayout: put deviceid node after releasing bl_ext_lock

The last put of deviceid nodes for SCSI layouts may sleep, so we shouldn't
hold any spinlocks.  Make sure we put them outside the bl_ext_lock.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agosunrpc: move NO_CRKEY_TIMEOUT to the auth->au_flags
Scott Mayhew [Tue, 7 Jun 2016 19:14:48 +0000 (15:14 -0400)] 
sunrpc: move NO_CRKEY_TIMEOUT to the auth->au_flags

A generic_cred can be used to look up a unx_cred or a gss_cred, so it's
not really safe to use the generic_cred->acred->ac_flags to store
the NO_CRKEY_TIMEOUT flag.  A lookup for a unx_cred triggered while the
KEY_EXPIRE_SOON flag is already set will cause both NO_CRKEY_TIMEOUT and
KEY_EXPIRE_SOON to be set in the ac_flags, leaving the user associated
with the auth_cred to be in a state where they're perpetually doing 4K
NFS_FILE_SYNC writes.

This can be reproduced as follows:

1. Mount two NFS filesystems, one with sec=krb5 and one with sec=sys.
They do not need to be the same export, nor do they even need to be from
the same NFS server.  Also, v3 is fine.
$ sudo mount -o v3,sec=krb5 server1:/export /mnt/krb5
$ sudo mount -o v3,sec=sys server2:/export /mnt/sys

2. As the normal user, before accessing the kerberized mount, kinit with
a short lifetime (but not so short that renewing the ticket would leave
you within the 4-minute window again by the time the original ticket
expires), e.g.
$ kinit -l 10m -r 60m

3. Do some I/O to the kerberized mount and verify that the writes are
wsize, UNSTABLE:
$ dd if=/dev/zero of=/mnt/krb5/file bs=1M count=1

4. Wait until you're within 4 minutes of key expiry, then do some more
I/O to the kerberized mount to ensure that RPC_CRED_KEY_EXPIRE_SOON gets
set.  Verify that the writes are 4K, FILE_SYNC:
$ dd if=/dev/zero of=/mnt/krb5/file bs=1M count=1

5. Now do some I/O to the sec=sys mount.  This will cause
RPC_CRED_NO_CRKEY_TIMEOUT to be set:
$ dd if=/dev/zero of=/mnt/sys/file bs=1M count=1

6. Writes for that user will now be permanently 4K, FILE_SYNC,
regardless of which mount is being written to, until you reboot
the client.  Renewing the kerberos ticket (assuming it hasn't already
expired) will have no effect.  Grabbing a new kerberos ticket at this
point will have no effect either.

Move the flag to the auth->au_flags field (which is currently unused)
and rename it slightly to reflect that it's no longer associated with
the auth_cred->ac_flags.  Add the rpc_auth to the arg list of
rpcauth_cred_key_to_expire and check the au_flags there too.  Finally,
add the inode to the arg list of nfs_ctx_key_to_expire so we can
determine the rpc_auth to pass to rpcauth_cred_key_to_expire.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agomount: use sec= that was specified on the command line
Steve Dickson [Wed, 25 May 2016 14:36:50 +0000 (10:36 -0400)] 
mount: use sec= that was specified on the command line

When older servers return RPC_AUTH_NULL, it means the
RPC creds will be ignored. In that case, use the sec=
that was specified instead of setting sec=null.

Fixes Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1112983
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agopNFS: Fix LAYOUTGET handling of NFS4ERR_BAD_STATEID and NFS4ERR_EXPIRED
Trond Myklebust [Thu, 14 Jul 2016 19:14:02 +0000 (15:14 -0400)] 
pNFS: Fix LAYOUTGET handling of NFS4ERR_BAD_STATEID and NFS4ERR_EXPIRED

We want to recover the open stateid if there is no layout stateid
and/or the stateid argument matches an open stateid.
Otherwise throw out the existing layout and recover from scratch, as
the layout stateid is bad.

Fixes: 183d9e7b112aa ("pnfs: rework LAYOUTGET retry handling")
Cc: stable@vger.kernel.org # 4.7
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
7 years agopNFS: Handle NFS4ERR_RECALLCONFLICT correctly in LAYOUTGET
Trond Myklebust [Thu, 14 Jul 2016 18:28:31 +0000 (14:28 -0400)] 
pNFS: Handle NFS4ERR_RECALLCONFLICT correctly in LAYOUTGET

Instead of giving up altogether and falling back to doing I/O
through the MDS, which may make the situation worse, wait for
2 lease periods for the callback to resolve itself, and then
try destroying the existing layout.

Only if this was an attempt at getting a first layout, do we
give up altogether, as the server is clearly crazy.

Fixes: 183d9e7b112aa ("pnfs: rework LAYOUTGET retry handling")
Cc: stable@vger.kernel.org # 4.7
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
7 years agopNFS: Separate handling of NFS4ERR_LAYOUTTRYLATER and RECALLCONFLICT
Trond Myklebust [Thu, 14 Jul 2016 22:46:24 +0000 (18:46 -0400)] 
pNFS: Separate handling of NFS4ERR_LAYOUTTRYLATER and RECALLCONFLICT

They are not the same error, and need to be handled differently.

Fixes: 183d9e7b112aa ("pnfs: rework LAYOUTGET retry handling")
Cc: stable@vger.kernel.org # 4.7
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
7 years agopNFS: Fix post-layoutget error handling in pnfs_update_layout()
Trond Myklebust [Thu, 14 Jul 2016 22:34:12 +0000 (18:34 -0400)] 
pNFS: Fix post-layoutget error handling in pnfs_update_layout()

The non-retry error path is currently broken and ends up releasing the
reference to the layout twice. It also can end up clearing the
NFS_LAYOUT_FIRST_LAYOUTGET flag twice, causing a race.

In addition, the retry path will fail to decrement the plh_outstanding
counter.

Fixes: 183d9e7b112aa ("pnfs: rework LAYOUTGET retry handling")
Cc: stable@vger.kernel.org # 4.7
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
7 years agopNFS: Don't mark the inode as revalidated if a LAYOUTCOMMIT is outstanding
Trond Myklebust [Mon, 18 Jul 2016 04:51:01 +0000 (00:51 -0400)] 
pNFS: Don't mark the inode as revalidated if a LAYOUTCOMMIT is outstanding

We know that the attributes will need updating if there is still a
LAYOUTCOMMIT outstanding.

Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoSUNRPC: Fix infinite looping in rpc_clnt_iterate_for_each_xprt
Trond Myklebust [Sat, 16 Jul 2016 15:47:00 +0000 (11:47 -0400)] 
SUNRPC: Fix infinite looping in rpc_clnt_iterate_for_each_xprt

If there were fewer than 2 entries in the multipath list, then
xprt_iter_next_entry_multiple() would never advance beyond the
first entry, which is correct for round robin behaviour, but not
for the list iteration.

The end result would be infinite looping in rpc_clnt_iterate_for_each_xprt()
as we would never see the xprt == NULL condition fulfilled.
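
The difference between the two kinds of "next" can be sketched on a plain NULL-terminated list. The struct and function names here are hypothetical and much simpler than the real iterator:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical transport entry in a NULL-terminated singly linked list. */
struct xprt {
    int id;
    struct xprt *next;
};

/* Round robin: wrap to the head after the last entry, never NULL. */
static struct xprt *next_roundrobin(struct xprt *head, struct xprt *cur)
{
    if (cur == NULL || cur->next == NULL)
        return head;
    return cur->next;
}

/* List iteration: return NULL after the last entry so loops terminate. */
static struct xprt *next_listall(struct xprt *head, struct xprt *cur)
{
    return cur == NULL ? head : cur->next;
}

/* Visit every entry exactly once using the list-iteration step. */
static int count_xprts(struct xprt *head)
{
    int n = 0;
    struct xprt *x;

    for (x = next_listall(head, NULL); x != NULL; x = next_listall(head, x))
        n++;
    return n;
}
```

On a single-entry list the round-robin step keeps returning the same entry, so a loop that waits for NULL would spin forever, while the list-iteration step reaches NULL after one visit.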

Reported-by: Oleg Drokin <green@linuxhacker.ru>
Fixes: 80b14d5e61ca ("SUNRPC: Add a structure to track multiple transports")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoMerge tag 'nfs-rdma-4.8-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma
Trond Myklebust [Fri, 15 Jul 2016 21:05:52 +0000 (17:05 -0400)] 
Merge tag 'nfs-rdma-4.8-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFSoRDMA Client Side Changes

New Features:
- Add kerberos support

Bugfixes and cleanups:
- Remove ALLPHYSICAL memory registration mode
- Fix FMR disconnect recovery
- Reduce memory usage

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agonfs/blocklayout: Check max uuids and devices before decoding
Kinglong Mee [Thu, 14 Jul 2016 04:02:01 +0000 (12:02 +0800)] 
nfs/blocklayout: Check max uuids and devices before decoding

Guard against an NFS server returning more uuids/devices than the maximum.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agonfs/blocklayout: Make sure calculate signature length aligned
Kinglong Mee [Thu, 14 Jul 2016 04:01:28 +0000 (12:01 +0800)] 
nfs/blocklayout: Make sure calculate signature length aligned

Guard against a bad NFS server returning an unaligned signature length.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agonfs/blocklayout: support RH/Fedora dm-mpath device nodes
Christoph Hellwig [Fri, 8 Jul 2016 09:41:30 +0000 (18:41 +0900)] 
nfs/blocklayout: support RH/Fedora dm-mpath device nodes

Instead of reusing the wwn-* names for multipath device nodes, RHEL and
Fedora introduce new dm-mpath-uuid-* nodes with a slightly different
naming scheme.  Try these names first to ensure we always get a
multipath-capable device if it exists.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agonfs/blocklayout: refactor open-by-wwn
Christoph Hellwig [Fri, 8 Jul 2016 09:41:29 +0000 (18:41 +0900)] 
nfs/blocklayout: refactor open-by-wwn

The current code works with the standard udev/systemd names, but we'll have
to add another method in the next patch.  Refactor it into a separate helper
to make room for the new variant.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agonfs/blocklayout: use proper fmode for opening block devices
Christoph Hellwig [Fri, 8 Jul 2016 09:41:28 +0000 (18:41 +0900)] 
nfs/blocklayout: use proper fmode for opening block devices

This was fixed for the original block layout code a while ago, but also
needs to be fixed for the SCSI layout path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFSv4: Revert "Truncating file opens should also sync O_DIRECT writes"
Trond Myklebust [Thu, 14 Jul 2016 16:42:40 +0000 (12:42 -0400)] 
NFSv4: Revert "Truncating file opens should also sync O_DIRECT writes"

We're not holding any locks, so both nfs_wb_all() and inode_dio_wait()
are unenforceable and have livelock potential. Just limit ourselves to
flushing out the data.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
7 years agoNFS: Don't drop CB requests with invalid principals
Chuck Lever [Wed, 29 Jun 2016 17:55:22 +0000 (13:55 -0400)] 
NFS: Don't drop CB requests with invalid principals

Before commit 778be232a207 ("NFS do not find client in NFSv4
pg_authenticate"), the Linux callback server replied with
RPC_AUTH_ERROR / RPC_AUTH_BADCRED, instead of dropping the CB
request. Let's restore that behavior so the server has a chance to
do something useful about it, and provide a warning that helps
admins correct the problem.

Fixes: 778be232a207 ("NFS do not find client in NFSv4 ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agosvc: Avoid garbage replies when pc_func() returns rpc_drop_reply
Chuck Lever [Wed, 29 Jun 2016 17:55:14 +0000 (13:55 -0400)] 
svc: Avoid garbage replies when pc_func() returns rpc_drop_reply

If an RPC program does not set vs_dispatch and pc_func() returns
rpc_drop_reply, the server sends a reply anyway containing a single
word containing the value RPC_DROP_REPLY (in network byte-order, of
course). This is a nonsense RPC message.
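
A minimal sketch of the behaviour described above, using a simplified
user-space model rather than the real svc code. All names here
(model_status, model_process, and so on) are hypothetical and the status
values are placeholders, not the kernel's constants. The point is only
that a "drop" status must suppress the reply instead of being encoded
as if it were reply data.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical status values; the real kernel constants differ. */
enum model_status { MODEL_SUCCESS = 0, MODEL_DROP_REPLY = 1 };

typedef enum model_status (*model_proc_fn)(void);

/*
 * Run the procedure and decide whether a reply should be sent.
 * Returns true if a reply word was stored in *reply_word, false if
 * the request must be dropped without sending anything.
 */
static bool model_process(model_proc_fn pc_func, uint32_t *reply_word)
{
	enum model_status status = pc_func();

	if (status == MODEL_DROP_REPLY)
		return false;	/* nothing goes on the wire */

	*reply_word = (uint32_t)status;
	return true;
}

/* Two stand-in procedures for exercising the dispatcher. */
static enum model_status model_proc_ok(void)   { return MODEL_SUCCESS; }
static enum model_status model_proc_drop(void) { return MODEL_DROP_REPLY; }
```

Calling model_process() with model_proc_drop leaves the reply word
untouched and reports that no reply should be sent.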

Fixes: 9e701c610923 ("svcrpc: simpler request dropping")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: No direct data placement with krb5i and krb5p
Chuck Lever [Wed, 29 Jun 2016 17:55:06 +0000 (13:55 -0400)] 
xprtrdma: No direct data placement with krb5i and krb5p

Direct data placement is not allowed when using flavors that
guarantee integrity or privacy. When such security flavors are in
effect, don't allow the use of Read and Write chunks for moving
individual data items. All messages larger than the inline threshold
are sent via Long Call or Long Reply.

On my systems (CX-3 Pro on FDR), for small I/O operations, the use
of Long messages adds only around 5 usecs of latency in each
direction.

Note that when integrity or encryption is used, the host CPU touches
every byte in these messages. Even if it could be used, data
movement offload doesn't buy much in this case.
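
The selection rule described above might be modelled roughly as
follows. This is an illustrative sketch only: the enum names, the
model_pick_xfer() helper and its parameters are assumptions made for
the example and do not correspond to identifiers in xprtrdma.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical security flavors and transfer modes for illustration. */
enum model_flavor { MODEL_SYS, MODEL_KRB5, MODEL_KRB5I, MODEL_KRB5P };
enum model_xfer { MODEL_INLINE, MODEL_DATA_CHUNK, MODEL_LONG_MSG };

/* Direct data placement is ruled out for integrity and privacy flavors. */
static bool model_ddp_allowed(enum model_flavor f)
{
	return f != MODEL_KRB5I && f != MODEL_KRB5P;
}

/*
 * Pick how a message is moved:
 *  - small messages go inline,
 *  - large messages with a bulk data item use a Read or Write chunk
 *    when direct data placement is allowed,
 *  - everything else is sent as a Long Call or Long Reply.
 */
static enum model_xfer model_pick_xfer(enum model_flavor f, size_t msg_len,
				       size_t inline_threshold,
				       bool has_bulk_data)
{
	if (msg_len <= inline_threshold)
		return MODEL_INLINE;
	if (has_bulk_data && model_ddp_allowed(f))
		return MODEL_DATA_CHUNK;
	return MODEL_LONG_MSG;
}
```

With krb5p, an 8192-byte message against a 1024-byte threshold comes
out as a Long message even though it carries a bulk data item.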

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Clean up fixup_copy_count accounting
Chuck Lever [Wed, 29 Jun 2016 17:54:58 +0000 (13:54 -0400)] 
xprtrdma: Clean up fixup_copy_count accounting

fixup_copy_count should count only the number of bytes copied to the
page list. The head and tail are now always handled without a data
copy.

And the debugging at the end of rpcrdma_inline_fixup() is also no
longer necessary, since copy_len will be non-zero when there is reply
data in the tail (a normal and valid case).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Update only specific fields in private receive buffer
Chuck Lever [Wed, 29 Jun 2016 17:54:49 +0000 (13:54 -0400)] 
xprtrdma: Update only specific fields in private receive buffer

Now that rpcrdma_inline_fixup() updates only two fields in
rq_rcv_buf, a full memcpy of that structure to rq_private_buf is
unwarranted. Updating rq_private_buf fields only where needed also
better documents what is going on.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup()
Chuck Lever [Wed, 29 Jun 2016 17:54:41 +0000 (13:54 -0400)] 
xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup()

While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ
operations failed. After the client unwrapped the NFS READ reply
message, the NFS READ XDR decoder was not able to decode the reply.
The message was "Server cheating in reply", with the reported
number of received payload bytes being zero. Applications reported
a read(2) that returned -1/EIO.

The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero
when the incoming reply fits entirely in the head iovec. The zero
tail.iov_len confused xdr_buf_trim(), which then mangled the actual
reply data instead of simply removing the trailing GSS checksum.

As near as I can tell, RPC transports are not supposed to update the
head.iov_len, page_len, or tail.iov_len fields in the receive XDR
buffer when handling an incoming RPC reply message. These fields
contain the length of each component of the XDR buffer, and hence
the maximum number of bytes of reply data that can be stored in each
XDR buffer component. I've concluded this because:

- This is how xdr_partial_copy_from_skb() appears to behave
- rpcrdma_inline_fixup() already does not alter page_len
- call_decode() compares rq_private_buf and rq_rcv_buf and WARNs
   if they are not exactly the same

Unfortunately, as soon as I tried the simple fix to just remove the
line that sets tail.iov_len to zero, I saw that the logic that
appends the implicit Write chunk pad inline depends on inline_fixup
setting tail.iov_len to zero.

To address this, re-organize the tail iovec handling logic to use
the same approach as with the head iovec: simply point tail.iov_base
to the correct bytes in the receive buffer.

While I remember all this, write down the conclusion in documenting
comments.
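
The approach described above can be sketched with a simplified
user-space model. Everything here is an assumption made for
illustration: struct model_xdr_buf flattens the page list into a single
buffer, and model_inline_fixup() is not the kernel function. The sketch
shows the head and tail being pointed at bytes in the receive buffer
while the iov_len and page_len capacity fields are left alone.

```c
#include <stddef.h>
#include <string.h>

struct model_kvec {
	char *iov_base;
	size_t iov_len;		/* capacity; never modified below */
};

struct model_xdr_buf {
	struct model_kvec head;
	char *pages;		/* flattened stand-in for the page list */
	size_t page_len;	/* capacity of the page area */
	struct model_kvec tail;
};

static size_t model_min(size_t a, size_t b)
{
	return a < b ? a : b;
}

/*
 * Distribute copy_len received bytes starting at src across the
 * buffer. Returns the number of bytes copied into the page area.
 */
static size_t model_inline_fixup(struct model_xdr_buf *buf, char *src,
				 size_t copy_len)
{
	size_t curlen, copied = 0;

	/* Head: point at the received bytes, no data copy. */
	buf->head.iov_base = src;
	curlen = model_min(copy_len, buf->head.iov_len);
	src += curlen;
	copy_len -= curlen;

	/* Page area: copy at most page_len bytes. */
	curlen = model_min(copy_len, buf->page_len);
	if (curlen) {
		memcpy(buf->pages, src, curlen);
		src += curlen;
		copy_len -= curlen;
		copied = curlen;
	}

	/* Tail: whatever remains stays in the receive buffer. */
	if (copy_len)
		buf->tail.iov_base = src;

	return copied;
}
```

Only iov_base changes for the head and tail, so a later consumer that
relies on the lengths describing buffer capacity still sees the
original values.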

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: rpcrdma_inline_fixup() overruns the receive page list
Chuck Lever [Wed, 29 Jun 2016 17:54:33 +0000 (13:54 -0400)] 
xprtrdma: rpcrdma_inline_fixup() overruns the receive page list

When the remaining length of an incoming reply is longer than the
XDR buf's page_len, switch over to the tail iovec instead of
copying more than page_len bytes into the page list.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Chunk list encoders no longer share one rl_segments array
Chuck Lever [Wed, 29 Jun 2016 17:54:25 +0000 (13:54 -0400)] 
xprtrdma: Chunk list encoders no longer share one rl_segments array

Currently, all three chunk list encoders each use a portion of the
one rl_segments array in rpcrdma_req. This is because the MWs for
each chunk list were preserved in rl_segments so that ro_unmap could
find and invalidate them after the RPC was complete.

However, now that MWs are placed on a per-req linked list as they
are registered, there is no longer any information in rpcrdma_mr_seg
that is shared between ro_map and ro_unmap_{sync,safe}, and thus
nothing in rl_segments needs to be preserved after
rpcrdma_marshal_req is complete.

Thus the rl_segments array can be used now just for the needs of
each rpcrdma_convert_iovs call. Once each chunk list is encoded, the
next chunk list encoder is free to re-use all of rl_segments.

This means all three chunk lists in one RPC request can now each
encode a full size data payload with no increase in the size of
rl_segments.

This is a key requirement for Kerberos support, since both the Call
and Reply for a single RPC transaction are conveyed via Long
messages (RDMA Read/Write). Both can be large.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Place registered MWs on a per-req list
Chuck Lever [Wed, 29 Jun 2016 17:54:16 +0000 (13:54 -0400)] 
xprtrdma: Place registered MWs on a per-req list

Instead of placing registered MWs sparsely into the rl_segments
array, place these MWs on a per-req list.

ro_unmap_{sync,safe} can then simply pull those MWs off the list
instead of walking through the array.

This change significantly reduces the size of struct rpcrdma_req
by removing nsegs and rl_mw from every array element.

As an additional clean-up, chunk co-ordinates are returned in the
"*mw" output argument so they are no longer needed in every
array element.
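
As a rough illustration of the per-req list idea, the following sketch
uses a plain singly linked list. The struct and function names are
hypothetical, and the real code uses the kernel's list primitives and
far richer MW state.

```c
#include <stddef.h>

/* Hypothetical, stripped-down memory window. */
struct model_mw {
	int handle;
	int registered;
	struct model_mw *next;
};

/* Hypothetical request carrying a list of its registered MWs. */
struct model_req {
	struct model_mw *registered;
};

/* Registration: mark the MW and push it onto the request's list. */
static void model_register(struct model_req *req, struct model_mw *mw)
{
	mw->registered = 1;
	mw->next = req->registered;
	req->registered = mw;
}

/*
 * Unmap: pull every MW off the list instead of scanning a sparse
 * array. Returns the number of MWs that were released.
 */
static unsigned int model_unmap_all(struct model_req *req)
{
	unsigned int count = 0;
	struct model_mw *mw;

	while ((mw = req->registered) != NULL) {
		req->registered = mw->next;
		mw->next = NULL;
		mw->registered = 0;
		count++;
	}
	return count;
}
```

The unmap path touches exactly the MWs that were registered for the
request, with no per-segment bookkeeping left behind.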

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Release orphaned MRs immediately
Chuck Lever [Wed, 29 Jun 2016 17:54:08 +0000 (13:54 -0400)] 
xprtrdma: Release orphaned MRs immediately

Instead of leaving orphaned MRs to be released when the transport
is destroyed, release them immediately. The MR free list can now be
replenished if it becomes exhausted.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Allocate MRs on demand
Chuck Lever [Wed, 29 Jun 2016 17:54:00 +0000 (13:54 -0400)] 
xprtrdma: Allocate MRs on demand

Frequent MR list exhaustion can impact I/O throughput, so enough MRs
are always created during transport set-up to prevent running out.
This means more MRs are created than most workloads need.

Commit 94f58c58c0b4 ("xprtrdma: Allow Read list and Reply chunk
simultaneously") introduced support for sending two chunk lists per
RPC, which consumes more MRs per RPC.

Instead of trying to provision more MRs, introduce a mechanism for
allocating MRs on demand. A few MRs are allocated during transport
set-up to kick things off.

This significantly reduces the average number of MRs per transport
while allowing the MR count to grow for workloads or devices that
need more MRs.

FRWR with mlx4 allocated almost 400 MRs per transport before this
patch. Now it starts with 32.
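
A small model of on-demand allocation, assuming a free list that starts
with 32 entries as mentioned above. The growth increment
(MODEL_GROW_BY) and the choice to grow synchronously inside
model_pool_get() are assumptions for the sake of a short example; the
commit message does not say how or when the real transport replenishes
its list.

```c
#include <stdlib.h>

#define MODEL_INITIAL_MRS 32	/* starting count quoted above */
#define MODEL_GROW_BY 32	/* assumed increment, illustration only */

struct model_mr {
	struct model_mr *next;
};

struct model_pool {
	struct model_mr *free;
	unsigned int total;	/* MRs currently allocated */
};

/* Allocate count more MRs and add them to the free list. */
static int model_pool_grow(struct model_pool *p, unsigned int count)
{
	unsigned int i;

	for (i = 0; i < count; i++) {
		struct model_mr *mr = malloc(sizeof(*mr));

		if (!mr)
			return -1;
		mr->next = p->free;
		p->free = mr;
		p->total++;
	}
	return 0;
}

/* Take an MR, growing the pool first if the free list is empty. */
static struct model_mr *model_pool_get(struct model_pool *p)
{
	struct model_mr *mr;

	if (!p->free && model_pool_grow(p, MODEL_GROW_BY) < 0)
		return NULL;
	mr = p->free;
	p->free = mr->next;
	mr->next = NULL;
	return mr;
}

/* Return an MR to the free list. */
static void model_pool_put(struct model_pool *p, struct model_mr *mr)
{
	mr->next = p->free;
	p->free = mr;
}

/* Free every MR on the free list. */
static void model_pool_destroy(struct model_pool *p)
{
	while (p->free) {
		struct model_mr *mr = p->free;

		p->free = mr->next;
		free(mr);
		p->total--;
	}
}
```

A workload that never needs more than the initial 32 MRs never pays for
more, while the 33rd concurrent request triggers growth.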

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Chunk list encoders must not return zero
Chuck Lever [Wed, 29 Jun 2016 17:53:52 +0000 (13:53 -0400)] 
xprtrdma: Chunk list encoders must not return zero

Clean up, based on code audit: Remove the possibility that the
chunk list XDR encoders can return zero, which would be interpreted
as a NULL.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Honor ->send_request API contract
Chuck Lever [Wed, 29 Jun 2016 17:53:43 +0000 (13:53 -0400)] 
xprtrdma: Honor ->send_request API contract

Commit c93c62231cf5 ("xprtrdma: Disconnect on registration failure")
added a disconnect for some RPC marshaling failures. This is needed
only in a handful of cases, but it was triggering for simple stuff
like temporary resource shortages. Try to straighten this out.

Fix up the lower layers so they don't return -ENOMEM or other error
codes that the RPC client's FSM doesn't explicitly recognize.

Also fix up the places in the send_request path that do want a
disconnect. For example, when ib_post_send or ib_post_recv fail,
this is a sign that there is a send or receive queue resource
miscalculation. That should be rare, and is a sign of a software
bug. But xprtrdma can recover: disconnect to reset the transport and
start over.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Reply buffer exhaustion can be catastrophic
Chuck Lever [Wed, 29 Jun 2016 17:53:35 +0000 (13:53 -0400)] 
xprtrdma: Reply buffer exhaustion can be catastrophic

Not having an rpcrdma_rep at call_allocate time can be a problem.
It means that send_request can't post a receive buffer to catch
the RPC's reply. Possible consequences are RPC timeouts or even
transport deadlock.

Instead of allowing an RPC to proceed if an rpcrdma_rep is
not available, return NULL to force call_allocate to wait and
try again.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Clean up device capability detection
Chuck Lever [Wed, 29 Jun 2016 17:53:27 +0000 (13:53 -0400)] 
xprtrdma: Clean up device capability detection

Clean up: Move device capability detection into memreg-specific
source files.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Remove rpcrdma_map_one() and friends
Chuck Lever [Wed, 29 Jun 2016 17:53:19 +0000 (13:53 -0400)] 
xprtrdma: Remove rpcrdma_map_one() and friends

Clean up: ALLPHYSICAL is gone and FMR has been converted to use
scatterlists. There are no more users of these functions.

This patch shrinks the size of struct rpcrdma_req by about 3500
bytes on x86_64. There is one of these structs for each RPC credit
(128 credits per transport connection).
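
As a back-of-the-envelope check of what the figures above imply per
transport, taking "about 3500 bytes" at face value: 3500 bytes times
128 credits is 448,000 bytes, or roughly 437 KiB. The helper name below
is hypothetical.

```c
#include <stddef.h>

/* Estimated memory saved per transport: one request struct per credit. */
static size_t model_savings_per_transport(size_t bytes_per_req,
					  size_t credits)
{
	return bytes_per_req * credits;
}
```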

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Remove ALLPHYSICAL memory registration mode
Chuck Lever [Wed, 29 Jun 2016 17:53:11 +0000 (13:53 -0400)] 
xprtrdma: Remove ALLPHYSICAL memory registration mode

No HCA or RNIC in the kernel tree requires the use of ALLPHYSICAL.

ALLPHYSICAL advertises in the clear on the network fabric an R_key
that is good for all of the client's memory. No known exploit
exists, but theoretically any user on the server can use that R_key
on the client's QP to read or update any part of the client's memory.

ALLPHYSICAL exposes the client to server bugs, including:
 o base/bounds errors causing data outside the i/o buffer to be
   accessed
 o RDMA access after reply causing data corruption and/or integrity
   fail

ALLPHYSICAL can't protect application memory regions from server
update after a local signal or soft timeout has terminated an RPC.

ALLPHYSICAL chunks are no larger than a page. Special cases to
handle small chunks and long chunk lists have been a source of
implementation complexity and bugs.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Do not leak an MW during a DMA map failure
Chuck Lever [Wed, 29 Jun 2016 17:53:02 +0000 (13:53 -0400)] 
xprtrdma: Do not leak an MW during a DMA map failure

Based on code audit.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Refactor MR recovery work queues
Chuck Lever [Wed, 29 Jun 2016 17:52:54 +0000 (13:52 -0400)] 
xprtrdma: Refactor MR recovery work queues

I found that commit ead3f26e359e ("xprtrdma: Add ro_unmap_safe
memreg method"), which introduces ro_unmap_safe, never wired up the
FMR recovery worker.

The FMR and FRWR recovery work queues both do the same thing.
Instead of setting up separate individual work queues for this,
schedule a delayed worker to deal with them, since recovering MRs is
not performance-critical.

Fixes: ead3f26e359e ("xprtrdma: Add ro_unmap_safe memreg method")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Use scatterlist for DMA mapping and unmapping under FMR
Chuck Lever [Wed, 29 Jun 2016 17:52:45 +0000 (13:52 -0400)] 
xprtrdma: Use scatterlist for DMA mapping and unmapping under FMR

The use of a scatterlist for handling DMA mapping and unmapping
was recently introduced in frwr_ops.c in commit 4143f34e01e9
("xprtrdma: Port to new memory registration API"). That commit did
not make a similar update to xprtrdma's FMR support because the
core ib_map_phys_fmr() and ib_unmap_fmr() APIs have not been changed
to take a scatterlist argument.

However, FMR still needs to do DMA mapping and unmapping. It appears
that RDS, for example, uses a scatterlist for this, then builds the
DMA addr array for the ib_map_phys_fmr call separately. I see that
SRP also utilizes a scatterlist for DMA mapping. xprtrdma can do
something similar.

This modernization is used immediately to properly defer DMA
unmapping during fmr_unmap_safe (a FIXME). It separates the DMA
unmapping coordinates from the rl_segments array. This array, being
part of an rpcrdma_req, is always re-used immediately when an RPC
exits. A scatterlist is allocated in memory independent of the
rl_segments array, so it can be preserved indefinitely (ie, until
the MR invalidation and DMA unmapping can actually be done by a
worker thread).

The FRWR and FMR DMA mapping code are slightly different from each
other now, and will diverge further when the "Check for holes" logic
can be removed from FRWR (support for SG_GAP MRs). So I chose not to
create helpers for the common-looking code.

Fixes: ead3f26e359e ("xprtrdma: Add ro_unmap_safe memreg method")
Suggested-by: Sagi Grimberg <sagi@lightbits.io>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Rename fields in rpcrdma_fmr
Chuck Lever [Wed, 29 Jun 2016 17:52:37 +0000 (13:52 -0400)] 
xprtrdma: Rename fields in rpcrdma_fmr

Clean up: Use the same naming convention used in other
RPC/RDMA-related data structures.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Move init and release helpers
Chuck Lever [Wed, 29 Jun 2016 17:52:29 +0000 (13:52 -0400)] 
xprtrdma: Move init and release helpers

Clean up: Moving these helpers in a separate patch makes later
patches more readable.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Create common scatterlist fields in rpcrdma_mw
Chuck Lever [Wed, 29 Jun 2016 17:52:21 +0000 (13:52 -0400)] 
xprtrdma: Create common scatterlist fields in rpcrdma_mw

Clean up: FMR is about to replace the rpcrdma_map_one code with
scatterlists. Move the scatterlist fields out of the FRWR-specific
union and into the generic part of rpcrdma_mw.

One minor change: -EIO is now returned if FRWR registration fails.
The RPC is terminated immediately, since the problem is likely due
to a software bug, thus retrying likely won't help.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoxprtrdma: Remove FMRs from the unmap list after unmapping
Chuck Lever [Wed, 29 Jun 2016 17:52:12 +0000 (13:52 -0400)] 
xprtrdma: Remove FMRs from the unmap list after unmapping

ib_unmap_fmr() takes a list of FMRs to unmap. However, it does not
remove the FMRs from this list as it processes them. Other
ib_unmap_fmr() call sites are careful to remove FMRs from the list
after ib_unmap_fmr() returns.

Since commit 7c7a5390dc6c8 ("xprtrdma: Add ro_unmap_sync method for FMR")
fmr_op_unmap_sync passes more than one FMR to ib_unmap_fmr(), but
it didn't bother to remove the FMRs from that list once the call was
complete.

I've noticed some instability that could be related to list
tangling by the new fmr_op_unmap_sync() logic. In an abundance
of caution, add some defensive logic to clean up properly after
ib_unmap_fmr().
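
The defensive clean-up described above can be pictured with this
simplified model. model_batch_unmap() stands in for a batch call that
processes a list without unlinking anything; it is not the verbs API,
and all names are hypothetical.

```c
#include <stddef.h>

struct model_fmr {
	int mapped;
	struct model_fmr *next;
};

/* Stand-in for a batch unmap that walks the list but leaves it intact. */
static void model_batch_unmap(struct model_fmr *head)
{
	struct model_fmr *f;

	for (f = head; f != NULL; f = f->next)
		f->mapped = 0;
}

/*
 * Unmap everything on *list, then unlink each entry so that no stale
 * linkage survives the call. Returns the number of entries unlinked.
 */
static unsigned int model_unmap_sync(struct model_fmr **list)
{
	unsigned int n = 0;
	struct model_fmr *f;

	model_batch_unmap(*list);

	while ((f = *list) != NULL) {
		*list = f->next;
		f->next = NULL;
		n++;
	}
	return n;
}
```

After model_unmap_sync() returns, the list head is empty and no entry
still points at a neighbour, so an entry can be put on another list
without tangling.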

Fixes: 7c7a5390dc6c8 ("xprtrdma: Add ro_unmap_sync method for FMR")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
7 years agoLinux 4.7-rc7
Linus Torvalds [Mon, 11 Jul 2016 03:24:59 +0000 (20:24 -0700)] 
Linux 4.7-rc7

7 years agotmpfs: fix regression hang in fallocate undo
Hugh Dickins [Sun, 10 Jul 2016 23:46:32 +0000 (16:46 -0700)] 
tmpfs: fix regression hang in fallocate undo

The well-spotted fallocate undo fix is good in most cases, but not when
fallocate failed on the very first page.  index 0 then passes lend -1
to shmem_undo_range(), and that has two bad effects: (a) that it will
undo every fallocation throughout the file, unrestricted by the current
range; but more importantly (b) it can cause the undo to hang, because
lend -1 is treated as truncation, which makes it keep on retrying until
every page has gone, but those already fully instantiated will never go
away.  Big thank you to xfstests generic/269 which demonstrates this.
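
The arithmetic behind the "index 0 then passes lend -1" case can be
checked with a small model. The 4 KiB page size (MODEL_PAGE_SHIFT of
12) and both helper names are assumptions for illustration, and the
guard shown is one plausible way to avoid the bad call, not a quotation
of the actual patch.

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Last byte offset to undo: the byte just before page 'index'. */
static int64_t model_undo_lend(uint64_t index)
{
	return (int64_t)(index << MODEL_PAGE_SHIFT) - 1;
}

/* Only undo when at least one page past 'start' was instantiated. */
static int model_should_undo(uint64_t start, uint64_t index)
{
	return index > start;
}
```

For index 0 the computed end offset is -1, the value that is treated as
truncation; for index 3 it is 12287, the last byte of page 2.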

Fixes: b9b4bb26af01 ("tmpfs: don't undo fallocate past its last page")
Cc: stable@vger.kernel.org
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agoMerge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Linus Torvalds [Sun, 10 Jul 2016 16:13:02 +0000 (09:13 -0700)] 
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus

Pull MIPS fix from Ralf Baechle:
 "Another week with just a single 4.7 fix.

  This fixes a possible 'loss' of the huge page bit from pmd on
  permission change"

* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
  MIPS: Fix page table corruption on THP permission changes.

7 years agoMerge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Linus Torvalds [Sat, 9 Jul 2016 01:59:46 +0000 (18:59 -0700)] 
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "Three fixes.  One is the qla24xx MSI regression, one is a theoretical
  problem over blacklist matching, which would bite USB badly if it ever
  triggered and one is a system hang with a particular type of IPR
  device"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  qla2xxx: Fix NULL pointer deref in QLA interrupt
  SCSI: fix new bug in scsi_dev_info_list string matching
  ipr: Clear interrupt on croc/crocodile when running with LSI

7 years agoMerge tag 'ecryptfs-4.7-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 8 Jul 2016 16:48:28 +0000 (09:48 -0700)] 
Merge tag 'ecryptfs-4.7-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs

Pull eCryptfs fixes from Tyler Hicks:
 "Provide a more concise fix for CVE-2016-1583:
   - Additionally fixes linux-stable regressions caused by the
     cherry-picking of the original fix

  Some very minor changes that have queued up:
   - Fix typos in code comments
   - Remove unnecessary check for NULL before destroying kmem_cache"

* tag 'ecryptfs-4.7-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  ecryptfs: don't allow mmap when the lower fs doesn't support it
  Revert "ecryptfs: forbid opening files without mmap handler"
  ecryptfs: fix spelling mistakes
  eCryptfs: fix typos in comment
  ecryptfs: drop null test before destroy functions

7 years agoMerge tag 'iommu-fixes-v4.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 8 Jul 2016 16:35:23 +0000 (09:35 -0700)] 
Merge tag 'iommu-fixes-v4.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull IOMMU fixes from Joerg Roedel:
 "Two Fixes:

   - Intel VT-d fix for a suspend/resume issue, introduced with the
     scalability improvements in this cycle.

   - AMD IOMMU fix for systems that have unity mappings defined.  There
     was a race where translation got enabled before the unity mappings
     were in place.  This issue was seen on some HP servers"

* tag 'iommu-fixes-v4.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/amd: Fix unity mapping initialization race
  iommu/vt-d: Fix infinite loop in free_all_cpu_cached_iovas

7 years agoMerge tag 'for-linus-4.7b-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 8 Jul 2016 16:12:41 +0000 (09:12 -0700)] 
Merge tag 'for-linus-4.7b-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen bug fixes from David Vrabel:

 - Fix two bugs in the handling of xenbus transactions.

 - Make the xen acpi driver compatible with Xen 4.7.

* tag 'for-linus-4.7b-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/acpi: allow xen-acpi-processor driver to load on Xen 4.7
  xenbus: simplify xenbus_dev_request_and_reply()
  xenbus: don't bail early from xenbus_dev_request_and_reply()
  xenbus: don't BUG() on user mode induced condition

7 years agoMerge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Linus Torvalds [Fri, 8 Jul 2016 16:08:27 +0000 (09:08 -0700)] 
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Will Deacon:
 "A couple of late fixes here, but one that we've been sitting on for a
  few weeks while the details were worked out.  Specifically, we now
  enforce USER_DS on taking exceptions whilst in the kernel, which
  avoids leaking kernel data to userspace through things like perf.  The
  other patch is an update to a workaround for a hardware erratum on
  some Cavium SoCs.

  Summary:

   - Enforce USER_DS on exception entry from EL1

   - Apply workaround for Cavium errata #27456 on Thunderx-81xx parts"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: Enable workaround for Cavium erratum 27456 on thunderx-81xx
  arm64: kernel: Save and restore UAO and addr_limit on exception entry

7 years agoMerge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 8 Jul 2016 16:06:52 +0000 (09:06 -0700)] 
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:
 "Three fixes:

   - A boot crash fix with certain configs
   - a MAINTAINERS entry update
   - Documentation typo fixes"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/Documentation: Fix various typos in Documentation/x86/ files
  x86/amd_nb: Fix boot crash on non-AMD systems
  MAINTAINERS: Update the Calgary IOMMU entry

7 years agoMerge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 8 Jul 2016 16:04:34 +0000 (09:04 -0700)] 
Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
 "Two load-balancing fixes for cgroups-intense workloads"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix calc_cfs_shares() fixed point arithmetics width confusion
  sched/fair: Fix effective_load() to consistently use smoothed load

7 years agoMerge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 8 Jul 2016 16:02:16 +0000 (09:02 -0700)] 
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Ingo Molnar:
 "Various fixes:

   - 32-bit callgraph bug fix
   - suboptimal event group scheduling bug fix
   - event constraint fixes for Broadwell/Skylake
   - RAPL module name collision fix"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix pmu::filter_match for SW-led groups
  x86/perf/intel/rapl: Fix module name collision with powercap intel-rapl
  perf/x86: Fix 32-bit perf user callgraph collection
  perf/x86/intel: Update event constraints when HT is off

7 years agoMerge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 8 Jul 2016 15:59:33 +0000 (08:59 -0700)] 
Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Ingo Molnar:
 "Two MIPS-GIC irqchip driver fixes to unbreak certain MIPS boards"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/mips-gic: Match IPI IRQ domain by bus token only
  irqchip/mips-gic: Map to VPs using HW VPNum

7 years agoMerge tag 'gpio-v4.7-5' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux...
Linus Torvalds [Fri, 8 Jul 2016 15:57:03 +0000 (08:57 -0700)] 
Merge tag 'gpio-v4.7-5' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio

Pull GPIO fixes from Linus Walleij:
 "I don't like to toss in last minute patches, but these are all for
  things that are broken, and have bitten people for real.  Two of them
  go into stable.  Maybe all of them if the compile test problem is a
  pain in the ass also for stable folks.

  Final (hopefully) GPIO fixes for v4.7:

   - Fix an oops on the Asus Eee PC 1201

   - Revert a patch trying to split GPIO parsing and GPIO configuration

   - Revert a too liberal compile testing thing"

* tag 'gpio-v4.7-5' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
  Revert "gpio: gpiolib-of: Allow compile testing"
  Revert "gpiolib: Split GPIO flags parsing and GPIO configuration"
  gpio: sch: Fix Oops on module load on Asus Eee PC 1201

7 years agoMerge tag 'drm-fixes-for-v4.7-rc7' of git://people.freedesktop.org/~airlied/linux
Linus Torvalds [Fri, 8 Jul 2016 15:55:27 +0000 (08:55 -0700)] 
Merge tag 'drm-fixes-for-v4.7-rc7' of git://people.freedesktop.org/~airlied/linux

Pull drm fixes from Dave Airlie:
 "One nouveau fix, and a few AMD Polaris fixes and some Allwinner fixes.

  I've got some vmware fixes that I might send separate over the
  weekend, they fix some black screens, but I'm still debating them"

* tag 'drm-fixes-for-v4.7-rc7' of git://people.freedesktop.org/~airlied/linux:
  drm/amd/powerplay: Update CKS on/ CKS off voltage offset calculation.
  drm/amd/powerplay: fix bug that get wrong polaris evv voltage.
  drm/amd/powerplay: incorrectly use of the function return value
  drm/amd/powerplay: fix incorrect voltage table value for tonga
  drm/amd/powerplay: fix incorrect voltage table value for polaris10
  drm/nouveau/disp/sor/gf119: select correct sor when poking training pattern
  gpu: drm: sun4i_drv: add missing of_node_put after calling of_parse_phandle
  drm/sun4i: Send vblank event when the CRTC is disabled
  drm/sun4i: Report proper vblank

7 years agoecryptfs: don't allow mmap when the lower fs doesn't support it
Jeff Mahoney [Tue, 5 Jul 2016 21:32:30 +0000 (17:32 -0400)] 
ecryptfs: don't allow mmap when the lower fs doesn't support it

There are legitimate reasons to disallow mmap on certain files, notably
in sysfs or procfs.  We shouldn't emulate mmap support on file systems
that don't offer support natively.

CVE-2016-1583

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: stable@vger.kernel.org
[tyhicks: clean up f_op check by using ecryptfs_file_to_lower()]
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
7 years agoxen/acpi: allow xen-acpi-processor driver to load on Xen 4.7
Jan Beulich [Fri, 8 Jul 2016 12:15:07 +0000 (06:15 -0600)] 
xen/acpi: allow xen-acpi-processor driver to load on Xen 4.7

As of Xen 4.7 PV CPUID doesn't expose either of CPUID[1].ECX[7] and
CPUID[0x80000007].EDX[7] anymore, causing the driver to fail to load on
both Intel and AMD systems. Doing any kind of hardware capability
checks in the driver as a prerequisite was wrong anyway: With the
hypervisor being in charge, all such checking should be done by it. If
ACPI data gets uploaded despite some missing capability, the hypervisor
is free to ignore part or all of that data.

Ditch the entire check_prereq() function, and do the only valid check
(xen_initial_domain()) in the caller in its place.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>