deliverable/linux.git
10 years agoMerge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch
David S. Miller [Fri, 16 May 2014 21:21:51 +0000 (17:21 -0400)] 
Merge branch 'master' of git://git./linux/kernel/git/jesse/openvswitch

Jesse Gross says:

====================
A set of OVS changes for net-next/3.16.

The major change here is a switch from per-CPU to per-NUMA flow
statistics. This improves scalability by reducing kernel overhead
in flow setup and maintenance.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'dt_fixed_phy'
David S. Miller [Fri, 16 May 2014 21:19:44 +0000 (17:19 -0400)] 
Merge branch 'dt_fixed_phy'

Thomas Petazzoni says:

====================
Add DT support for fixed PHYs

Here is a fourth version of the patch set that adds a Device Tree
binding and the related code to support fixed PHYs. I'm hoping to get
this merged in 3.16.

Changes since v3:

 * Rebased on top of v3.15-rc5

 * In patch "net: phy: decouple PHY id and PHY address in fixed PHY
   driver", changed the PHY ID of fixed PHYs from 0xdeadbeef to 0x0,
   as suggested by Grant Likely.

 * Fixed the !CONFIG_PHY_FIXED case in patch "net: phy: extend fixed
   driver with fixed_phy_register()". Noticed by Florian Fainelli.

 * Added Acked-by from Grant Likely and Florian Fainelli on patch
   "net: phy: extend fixed driver with fixed_phy_register()".

 * Reworked the new fixed-link DT binding to be just a sub-node of the
   Ethernet MAC node, and not a node referenced by the 'phy'
   property. This was requested by Grant Likely.

 * Reworked the code implementing the new DT binding to also make it
   accept the old, single property based, DT binding.

 * Added a patch that actually uses the new fixed link DT binding for
   the Armada XP Matrix board.

Changes since v2:

 * Rebased on top of v3.14-rc1, and re-tested on hardware.

 * Removed the RFC tag, since there seems to be some real interest in
   this feature, and the code has gone through several iterations
   already.

 * The error handling in fixed_phy_register() has been fixed.

Changes since v1:

 * Instead of using a 'fixed-link' property inside the Ethernet device
   DT node, with a fairly cryptic succession of integer values, we now
   use a PHY subnode under the Ethernet device DT node, with explicit
   properties to configure the duplex, speed, pause and other PHY
   properties.

 * The PHY address is automatically allocated by the kernel and no
   longer visible in the Device Tree binding.

 * The PHY device is created directly when the network driver calls
   of_phy_connect_fixed_link(), and associated to the PHY DT node,
   which allows the existing of_phy_connect() function to work,
   without the need to use the deprecated of_phy_connect_fixed_link().

Posts of previous versions:

  RFCv1:   http://www.spinics.net/lists/netdev/msg243253.html
  RFCv2:   http://lists.infradead.org/pipermail/linux-arm-kernel/2013-September/196919.html
  PATCHv3: http://www.spinics.net/lists/netdev/msg273117.html
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoARM: mvebu: use the fixed-link PHY DT binding for the Armada XP Matrix board
Thomas Petazzoni [Fri, 16 May 2014 14:14:07 +0000 (16:14 +0200)] 
ARM: mvebu: use the fixed-link PHY DT binding for the Armada XP Matrix board

The Armada XP Matrix board has an Ethernet PHY that isn't configurable
through the MDIO bus, so we use the newly introduced fixed-link PHY DT
binding to represent the PHY of this platform and get network working.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: mvneta: add support for fixed links
Thomas Petazzoni [Fri, 16 May 2014 14:14:06 +0000 (16:14 +0200)] 
net: mvneta: add support for fixed links

Following the introduction of of_phy_register_fixed_link(), this patch
introduces fixed link support in the mvneta driver, for Marvell Armada
370/XP SOCs.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoof: provide a binding for fixed link PHYs
Thomas Petazzoni [Fri, 16 May 2014 14:14:05 +0000 (16:14 +0200)] 
of: provide a binding for fixed link PHYs

Some Ethernet MACs have a "fixed link", and are not connected to a
normal MDIO-managed PHY device. For those situations, a Device Tree
binding allows to describe a "fixed link" using a special PHY node.

This patch adds:

 * A documentation for the fixed PHY Device Tree binding.

 * An of_phy_is_fixed_link() function that an Ethernet driver can call
   on its PHY phandle to find out whether it's a fixed link PHY or
   not. It should typically be used to know if
   of_phy_register_fixed_link() should be called.

 * An of_phy_register_fixed_link() function that instantiates the
   fixed PHY into the PHY subsystem, so that when the driver calls
   of_phy_connect(), the PHY device associated to the OF node will be
   found.

These two additional functions also support the old fixed-link Device
Tree binding used on PowerPC platforms, so that ultimately, the
network device drivers for those platforms could be converted to use
of_phy_is_fixed_link() and of_phy_register_fixed_link() instead of
of_phy_connect_fixed_link(), while keeping compatibility with their
respective Device Tree bindings.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: phy: extend fixed driver with fixed_phy_register()
Thomas Petazzoni [Fri, 16 May 2014 14:14:04 +0000 (16:14 +0200)] 
net: phy: extend fixed driver with fixed_phy_register()

The existing fixed_phy_add() function has several drawbacks that
prevents it from being used as is for OF-based declaration of fixed
PHYs:

 * The address of the PHY on the fake bus needs to be passed, while a
   dynamic allocation is desired.

 * Since the phy_device instantiation is post-poned until the next
   mdiobus scan, there is no way to associate the fixed PHY with its
   OF node, which later prevents of_phy_connect() from finding this
   fixed PHY from a given OF node.

To solve this, this commit introduces fixed_phy_register(), which will
allocate an available PHY address, add the PHY using fixed_phy_add()
and instantiate the phy_device structure associated with the provided
OF node.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Grant Likely <grant.likely@linaro.org>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: phy: decouple PHY id and PHY address in fixed PHY driver
Thomas Petazzoni [Fri, 16 May 2014 14:14:03 +0000 (16:14 +0200)] 
net: phy: decouple PHY id and PHY address in fixed PHY driver

Until now, the fixed_phy_add() function was taking as argument
'phy_id', which was used both as the PHY address on the fake fixed
MDIO bus, and as the PHY id, as available in the MII_PHYSID1 and
MII_PHYSID2 registers. However, those two informations are completely
unrelated.

This patch decouples them. The PHY id of fixed PHYs is hardcoded to be
0x0. Ideally, a really reserved value would be nicer, but there
doesn't seem to be an easy of making sure a dummy value can be
assigned to the Linux kernel for such usage.

The PHY address remains passed by the caller of phy_fixed_add().

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'bridge-non-promisc'
David S. Miller [Fri, 16 May 2014 21:06:40 +0000 (17:06 -0400)] 
Merge branch 'bridge-non-promisc'

Vlad Yasevich says:

====================
bridge: Non-promisc bridge ports support

This series adds functionality to the bridge device to enable
operations without setting all ports to promiscuous mode.

The basic concept is this.  The bridge keeps track of the ports
that support learning and flooding packets to unknown destinations.
We call these ports auto-discovery ports since they automatically
discover who is behind them through learning and flooding.

If flooding and learning are disabled via flags, then the port
requires static configuration to tell it which mac addresses
are behind it.  This is accomplished through adding of fdbs.
These fdbs should be static as dynamic fdbs can expire and systems
will become unreachable due to lack of flooding.

If the user marks all ports as needing static configuration then
we can safely make them non-promiscuous since we will know all the
information about them.

If the user leaves only 1 port as automatic, then we can mark
that port as not-promiscuous as well.  One could think of
this a edge relay similar to what's support by embedded switches
in SRIOV devices.  Since we have all the information about the
other ports, we can just program the mac addresses into the
single automatic port to receive all necessary traffic.
More information about this is patch 6.

In other cases, we keep all ports promiscuous as before.

There are some other cases when promiscuous mode has to be turned
back on.  One is when the bridge itself if placed in promiscuous
mode (user sets promisc flag).  The other is if vlan filtering is
turned off.  Since this is the default configuration, the default
bridge operation is not changed.

Changes since v2:
 - White space and spelling fixes from Michael Tsirkin
 - Squash patches 6, 7 and 8 to prevent bisect breakage.

Changes since v1:
 - Address issues rasied by Stephen Heminger
 - Address initializer comments raised by Sergey Shtylyov
 - Rebased recent net-next.

Changes since rfc v2:
 - Better description of in the commit logs
 - Leave port in promiscuous mode if IFF_UNICAST_FLT is disabled on the
   device.
 - Fix issue with flag masking
 - Rework patch ordering a bit.

Changes since rfc v1:
 - Removed private list.  We now traverse the fdb hashtable itself
   to write necessary addresses to the ports (Stephen's concern)
 - Add learning flag to the mask for flags that decides if the port
   is 'auto' or not (suggest by MST and Jamal).
 - Simplified tracking of such ports at the cost of a loop over all
   ports (suggested by MST)

I've played with quite a large number of ports and the current approach
seems to work fairly well.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agobridge: Automatically manage port promiscuous mode.
Vlad Yasevich [Fri, 16 May 2014 13:59:20 +0000 (09:59 -0400)] 
bridge: Automatically manage port promiscuous mode.

There exist configurations where the administrator or another management
entity has the foreknowledge of all the mac addresses of end systems
that are being bridged together.

In these environments, the administrator can statically configure known
addresses in the bridge FDB and disable flooding and learning on ports.
This makes it possible to turn off promiscuous mode on the interfaces
connected to the bridge.

Here is why disabling flooding and learning allows us to control
promiscuity:
 Consider port X.  All traffic coming into this port from outside the
bridge (ingress) will be either forwarded through other ports of the
bridge (egress) or dropped.  Forwarding (egress) is defined by FDB
entries and by flooding in the event that no FDB entry exists.
In the event that flooding is disabled, only FDB entries define
the egress.  Once learning is disabled, only static FDB entries
provided by a management entity define the egress.  If we provide
information from these static FDBs to the ingress port X, then we'll
be able to accept all traffic that can be successfully forwarded and
drop all the other traffic sooner without spending CPU cycles to
process it.
 Another way to define the above is as following equations:
    ingress = egress + drop
 expanding egress
    ingress = static FDB + learned FDB + flooding + drop
 disabling flooding and learning we a left with
    ingress = static FDB + drop

By adding addresses from the static FDB entries to the MAC address
filter of an ingress port X, we fully define what the bridge can
process without dropping and can thus turn off promiscuous mode,
thus dropping packets sooner.

There have been suggestions that we may want to allow learning
and update the filters with learned addresses as well.  This
would require mac-level authentication similar to 802.1x to
prevent attacks against the hw filters as they are limited
resource.

Additionally, if the user places the bridge device in promiscuous mode,
all ports are placed in promiscuous mode regardless of the changes
to flooding and learning.

Since the above functionality depends on full static configuration,
we have also require that vlan filtering be enabled to take
advantage of this.  The reason is that the bridge has to be
able to receive and process VLAN-tagged frames and the there
are only 2 ways to accomplish this right now: promiscuous mode
or vlan filtering.

Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agobridge: Add addresses from static fdbs to non-promisc ports
Vlad Yasevich [Fri, 16 May 2014 13:59:19 +0000 (09:59 -0400)] 
bridge: Add addresses from static fdbs to non-promisc ports

When a static fdb entry is created, add the mac address
from this fdb entry to any ports that are currently running
in non-promiscuous mode.  These ports need this data so that
they can receive traffic destined to these addresses.
By default ports start in promiscuous mode, so this feature
is disabled.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agobridge: Introduce BR_PROMISC flag
Vlad Yasevich [Fri, 16 May 2014 13:59:18 +0000 (09:59 -0400)] 
bridge: Introduce BR_PROMISC flag

Introduce a BR_PROMISC per-port flag that will help us track if the
current port is supposed to be in promiscuous mode or not.  For now,
always start in promiscuous mode.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agobridge: Add functionality to sync static fdb entries to hw
Vlad Yasevich [Fri, 16 May 2014 13:59:17 +0000 (09:59 -0400)] 
bridge: Add functionality to sync static fdb entries to hw

Add code that allows static fdb entires to be synced to the
hw list for a specified port.  This will be used later to
program ports that can function in non-promiscuous mode.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agobridge: Keep track of ports capable of automatic discovery.
Vlad Yasevich [Fri, 16 May 2014 13:59:16 +0000 (09:59 -0400)] 
bridge: Keep track of ports capable of automatic discovery.

By default, ports on the bridge are capable of automatic
discovery of nodes located behind the port.  This is accomplished
via flooding of unknown traffic (BR_FLOOD) and learning the
mac addresses from these packets (BR_LEARNING).
If the above functionality is disabled by turning off these
flags, the port requires static configuration in the form
of static FDB entries to function properly.

This patch adds functionality to keep track of all ports
capable of automatic discovery.  This will later be used
to control promiscuity settings.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agobridge: Turn flag change macro into a function.
Vlad Yasevich [Fri, 16 May 2014 13:59:15 +0000 (09:59 -0400)] 
bridge: Turn flag change macro into a function.

Turn the flag change macro into a function to allow
easier updates and to reduce space.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: pch_gbe depends on x86_32
Jean Delvare [Fri, 16 May 2014 09:29:04 +0000 (11:29 +0200)] 
net: pch_gbe depends on x86_32

The pch_gbe driver is for a companion chip to the Intel Atom E600
series processors. These are 32-bit x86 processors so the driver is
only needed on X86_32.

Signed-off-by: Jean Delvare <jdelvare@suse.de>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoip_tunnel: don't add tunnel twice
Duan Jiong [Thu, 15 May 2014 05:07:02 +0000 (13:07 +0800)] 
ip_tunnel: don't add tunnel twice

When using command "ip tunnel add" to add a tunnel, the tunnel will be added twice,
through ip_tunnel_create() and ip_tunnel_update().

Because the second is unnecessary, so we can just break after adding tunnel
through ip_tunnel_create().

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotools: bpf_jit_disasm: increase image buffer size
Alexei Starovoitov [Thu, 15 May 2014 22:56:39 +0000 (15:56 -0700)] 
tools: bpf_jit_disasm: increase image buffer size

JITed seccomp filters can be quite large if they check a lot of syscalls
Simply increase buffer size

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotools: bpf_jit_disasm: ignore image address for disasm
Alexei Starovoitov [Thu, 15 May 2014 22:56:38 +0000 (15:56 -0700)] 
tools: bpf_jit_disasm: ignore image address for disasm

seccomp filters use kernel JIT image addresses, so bpf_jit_enable=2 prints
[ 20.146438] flen=3 proglen=82 pass=0 image=0000000000000000
[ 20.146442] JIT code: 00000000: 55 48 89 e5 48 81 ec 28 02 00 00 ...

ignore image address, so that seccomp filters can be disassembled

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'systemport-next'
David S. Miller [Fri, 16 May 2014 20:41:53 +0000 (16:41 -0400)] 
Merge branch 'systemport-next'

Florian Fainelli says:

====================
net: systemport: DMA and MAC fixes

This patch series contains a critical fix in how the DMA unmapping of packet
is done, as well as a less critical fix in how we disable the Ethernet MAC
RX/TX functions.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: systemport: wait for packet in umac_enable_set()
Florian Fainelli [Thu, 15 May 2014 21:33:53 +0000 (14:33 -0700)] 
net: systemport: wait for packet in umac_enable_set()

When umac_enable_set() is used to disable the UniMAC receiver or
transmitter, we need to make sure that we wait for a full-sized packet
to be processed because the UniMAC hardware stops on a packet boundary,
not immediately.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: systemport: fix dma_unmap_single() len
Florian Fainelli [Thu, 15 May 2014 21:33:52 +0000 (14:33 -0700)] 
net: systemport: fix dma_unmap_single() len

dma_unmap_single() was called with dma_unmap_len(cb, dma_len),
unfortunately we failed to assign this length field in
bcm_sysport_rx_refill() or bcm_sysport_alloc_rx_bufs() using
dma_unmap_len_set().

This causes packet contents corruption because are we not invoking the
cache invalidation routines with the proper length.  Fix this by using
the full RX buffer size (RX_BUF_LENGTH) because the mappings for the RX
bufers are created with that size.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet/openvswitch: Use with RCU_INIT_POINTER(x, NULL) in vport-gre.c
Monam Agarwal [Sun, 23 Mar 2014 19:22:43 +0000 (00:52 +0530)] 
net/openvswitch: Use with RCU_INIT_POINTER(x, NULL) in vport-gre.c

This patch replaces rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL)

The rcu_assign_pointer() ensures that the initialization of a structure
is carried out before storing a pointer to that structure.
And in the case of the NULL pointer, there is no structure to initialize.
So, rcu_assign_pointer(p, NULL) can be safely converted to RCU_INIT_POINTER(p, NULL)

Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
10 years agoopenvswitch: Use TCP flags in the flow key for stats.
Jarno Rajahalme [Thu, 27 Mar 2014 19:51:49 +0000 (12:51 -0700)] 
openvswitch: Use TCP flags in the flow key for stats.

We already extract the TCP flags for the key, might as well use that
for stats.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
10 years agoopenvswitch: Fix output of SCTP mask.
Jarno Rajahalme [Thu, 27 Mar 2014 19:47:11 +0000 (12:47 -0700)] 
openvswitch: Fix output of SCTP mask.

The 'output' argument of the ovs_nla_put_flow() is the one from which
the bits are written to the netlink attributes.  For SCTP we
accidentally used the bits from the 'swkey' instead.  This caused the
mask attributes to include the bits from the actual flow key instead
of the mask.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
10 years agoopenvswitch: Per NUMA node flow stats.
Jarno Rajahalme [Thu, 27 Mar 2014 19:42:54 +0000 (12:42 -0700)] 
openvswitch: Per NUMA node flow stats.

Keep kernel flow stats for each NUMA node rather than each (logical)
CPU.  This avoids using the per-CPU allocator and removes most of the
kernel-side OVS locking overhead otherwise on the top of perf reports
and allows OVS to scale better with higher number of threads.

With 9 handlers and 4 revalidators netperf TCP_CRR test flow setup
rate doubles on a server with two hyper-threaded physical CPUs (16
logical cores each) compared to the current OVS master.  Tested with
non-trivial flow table with a TCP port match rule forcing all new
connections with unique port numbers to OVS userspace.  The IP
addresses are still wildcarded, so the kernel flows are not considered
as exact match 5-tuple flows.  This type of flows can be expected to
appear in large numbers as the result of more effective wildcarding
made possible by improvements in OVS userspace flow classifier.

Perf results for this test (master):

Events: 305K cycles
+   8.43%     ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
+   5.64%     ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
+   4.75%     ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
+   3.32%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
+   2.61%     ovs-vswitchd  [kernel.kallsyms]   [k] pcpu_alloc_area
+   2.19%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
+   2.03%          swapper  [kernel.kallsyms]   [k] intel_idle
+   1.84%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
+   1.64%     ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
+   1.58%     ovs-vswitchd  libc-2.15.so        [.] 0x7f4e6
+   1.07%     ovs-vswitchd  [kernel.kallsyms]   [k] memset
+   1.03%          netperf  [kernel.kallsyms]   [k] __ticket_spin_lock
+   0.92%          swapper  [kernel.kallsyms]   [k] __ticket_spin_lock
...

And after this patch:

Events: 356K cycles
+   6.85%     ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
+   4.63%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
+   3.06%     ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
+   2.81%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
+   2.51%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
+   2.27%     ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
+   1.84%     ovs-vswitchd  libc-2.15.so        [.] 0x15d30f
+   1.74%     ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
+   1.47%          swapper  [kernel.kallsyms]   [k] intel_idle
+   1.34%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask
+   1.33%     ovs-vswitchd  ovs-vswitchd        [.] rule_actions_unref
+   1.16%     ovs-vswitchd  ovs-vswitchd        [.] hindex_node_with_hash
+   1.16%     ovs-vswitchd  ovs-vswitchd        [.] do_xlate_actions
+   1.09%     ovs-vswitchd  ovs-vswitchd        [.] ofproto_rule_ref
+   1.01%          netperf  [kernel.kallsyms]   [k] __ticket_spin_lock
...

There is a small increase in kernel spinlock overhead due to the same
spinlock being shared between multiple cores of the same physical CPU,
but that is barely visible in the netperf TCP_CRR test performance
(maybe ~1% performance drop, hard to tell exactly due to variance in
the test results), when testing for kernel module throughput (with no
userspace activity, handful of kernel flows).

On flow setup, a single stats instance is allocated (for the NUMA node
0).  As CPUs from multiple NUMA nodes start updating stats, new
NUMA-node specific stats instances are allocated.  This allocation on
the packet processing code path is made to never block or look for
emergency memory pools, minimizing the allocation latency.  If the
allocation fails, the existing preallocated stats instance is used.
Also, if only CPUs from one NUMA-node are updating the preallocated
stats instance, no additional stats instances are allocated.  This
eliminates the need to pre-allocate stats instances that will not be
used, also relieving the stats reader from the burden of reading stats
that are never used.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
10 years agoopenvswitch: Remove 5-tuple optimization.
Jarno Rajahalme [Thu, 27 Mar 2014 19:35:23 +0000 (12:35 -0700)] 
openvswitch: Remove 5-tuple optimization.

The 5-tuple optimization becomes unnecessary with a later per-NUMA
node stats patch.  Remove it first to make the changes easier to
grasp.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
10 years agoopenvswitch: Use ether_addr_copy
Joe Perches [Tue, 18 Feb 2014 19:15:45 +0000 (11:15 -0800)] 
openvswitch: Use ether_addr_copy

It's slightly smaller/faster for some architectures.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
10 years agoopenvswitch: flow_netlink: Use pr_fmt to OVS_NLERR output
Joe Perches [Tue, 4 Feb 2014 01:18:21 +0000 (17:18 -0800)] 
openvswitch: flow_netlink: Use pr_fmt to OVS_NLERR output

Add "openvswitch: " prefix to OVS_NLERR output
to match the other OVS_NLERR output of datapath.c

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
10 years agoopenvswitch: Use net_ratelimit in OVS_NLERR
Joe Perches [Tue, 4 Feb 2014 01:06:46 +0000 (17:06 -0800)] 
openvswitch: Use net_ratelimit in OVS_NLERR

Each use of pr_<level>_once has a per-site flag.

Some of the OVS_NLERR messages look as if seeing them
multiple times could be useful, so use net_ratelimit()
instead of pr_info_once.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
10 years agoopenvswitch: Added (unsigned long long) cast in printf
Daniele Di Proietto [Mon, 3 Feb 2014 22:09:01 +0000 (14:09 -0800)] 
openvswitch: Added (unsigned long long) cast in printf

This is necessary, since u64 is not unsigned long long
in all architectures: u64 could be also uint64_t.

Signed-off-by: Daniele Di Proietto <daniele.di.proietto@gmail.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
10 years agoopenvswitch: avoid cast-qual warning in vport_priv
Daniele Di Proietto [Mon, 3 Feb 2014 22:08:29 +0000 (14:08 -0800)] 
openvswitch: avoid cast-qual warning in vport_priv

This function must cast a const value to a non const value.
By adding an uintptr_t cast the warning is suppressed.
To avoid the cast (proper solution) several function signatures
must be changed.

Signed-off-by: Daniele Di Proietto <daniele.di.proietto@gmail.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
10 years agoopenvswitch: avoid warnings in vport_from_priv
Daniele Di Proietto [Mon, 3 Feb 2014 22:07:43 +0000 (14:07 -0800)] 
openvswitch: avoid warnings in vport_from_priv

This change, firstly, avoids declaring the formal parameter const,
since it is treated as non const. (to avoid -Wcast-qual)
Secondly, it cast the pointer from void* to u8*, since it is used
in arithmetic (to avoid -Wpointer-arith)

Signed-off-by: Daniele Di Proietto <daniele.di.proietto@gmail.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
10 years agoopenvswitch: use const in some local vars and casts
Daniele Di Proietto [Thu, 23 Jan 2014 18:56:49 +0000 (10:56 -0800)] 
openvswitch: use const in some local vars and casts

In few functions, const formal parameters are assigned or cast to
non-const.
These changes suppress warnings if compiled with -Wcast-qual.

Signed-off-by: Daniele Di Proietto <daniele.di.proietto@gmail.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
10 years agoMerge branch 'bonding-next'
David S. Miller [Fri, 16 May 2014 20:34:43 +0000 (16:34 -0400)] 
Merge branch 'bonding-next'

Veaceslav Falico says:

====================
bonding: simple macro cleanup

Trivial patchset that converts most of the bonding's macros into inline
functions. It introduces only one macro, BOND_MODE(), which is just
bond->params.mode, better to write/understand/remember.

The only real change is the removal of IFF_UP verification, which always
came in pair with && netif_running(), and is though useless, as it's always
IFF_UP when LINK_STATE_RUNNING.

v2->v3: fix 3/9 to actually invert bond_mode_uses_arp() and add
bond_uses_arp() alongside bond_mode_uses_arp()
v1->v2: use inlined functions instead of macros.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agobonding: replace SLAVE_IS_OK() with bond_slave_can_tx()
Veaceslav Falico [Thu, 15 May 2014 19:39:59 +0000 (21:39 +0200)] 
bonding: replace SLAVE_IS_OK() with bond_slave_can_tx()

They're verifying the same thing (except of IFF_UP, which is implied for
netif_running(), which is also a prerequisite).

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agobonding: rename {, bond_}slave_can_tx and clean it up
Veaceslav Falico [Thu, 15 May 2014 19:39:58 +0000 (21:39 +0200)] 
bonding: rename {, bond_}slave_can_tx and clean it up

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agobonding: convert IS_UP(slave->dev) to inline function
Veaceslav Falico [Thu, 15 May 2014 19:39:57 +0000 (21:39 +0200)] 
bonding: convert IS_UP(slave->dev) to inline function

Also, remove the IFF_UP verification cause we can't be netif_running() with
being also IFF_UP.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agobonding: make IS_IP_TARGET_UNUSABLE_ADDRESS an inline function
Veaceslav Falico [Thu, 15 May 2014 19:39:56 +0000 (21:39 +0200)] 
bonding: make IS_IP_TARGET_UNUSABLE_ADDRESS an inline function

Also, use standard IP primitives to check the address.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agobonding: create a macro for bond mode and use it
Veaceslav Falico [Thu, 15 May 2014 19:39:55 +0000 (21:39 +0200)] 
bonding: create a macro for bond mode and use it

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agobonding: make USES_PRIMARY inline functions
Veaceslav Falico [Thu, 15 May 2014 19:39:54 +0000 (21:39 +0200)] 
bonding: make USES_PRIMARY inline functions

Change the name a bit to better reflect its scope, and update some
comments. Two functions added - one which takes bond as a param and the
other which takes the mode.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agobonding: make BOND_NO_USES_ARP an inline function
Veaceslav Falico [Thu, 15 May 2014 19:39:53 +0000 (21:39 +0200)] 
bonding: make BOND_NO_USES_ARP an inline function

Also, change its name to better reflect its scope, and skip the "no"
part.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agobonding: make TX_QUEUE_OVERRIDE() macro an inline function
Veaceslav Falico [Thu, 15 May 2014 19:39:52 +0000 (21:39 +0200)] 
bonding: make TX_QUEUE_OVERRIDE() macro an inline function

Also, make it accept bonding as a parameter and change the name a bit to
better reflect its scope.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agobonding: remove BOND_MODE_IS_LB macro
Veaceslav Falico [Thu, 15 May 2014 19:39:51 +0000 (21:39 +0200)] 
bonding: remove BOND_MODE_IS_LB macro

It's used only in an inline function and is useless.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: unix: Align send data_len up to PAGE_SIZE
Kirill Tkhai [Thu, 15 May 2014 15:56:28 +0000 (19:56 +0400)] 
net: unix: Align send data_len up to PAGE_SIZE

Using whole of allocated pages reduces requested skb->data size.
This is just a little more thriftily allocation.

netperf does not show difference with the current performance.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agomacvlan: simplify the structure port
dingtianhong [Thu, 15 May 2014 11:11:36 +0000 (19:11 +0800)] 
macvlan: simplify the structure port

The port->count was used to count the number of macvlan devs
in the same port, but the list vlans could play the same role
to do that, so free the port if the list vlans is empty and
no need to use the parameter count.

Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agovti6: delete unneeded call to netdev_priv
Julia Lawall [Thu, 15 May 2014 03:43:21 +0000 (05:43 +0200)] 
vti6: delete unneeded call to netdev_priv

Netdev_priv is an accessor function, and has no purpose if its result is
not used.

A simplified version of the semantic match that fixes this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@ local idexpression x; @@
-x = netdev_priv(...);
... when != x
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoip_tunnel: delete unneeded call to netdev_priv
Julia Lawall [Thu, 15 May 2014 03:43:20 +0000 (05:43 +0200)] 
ip_tunnel: delete unneeded call to netdev_priv

Netdev_priv is an accessor function, and has no purpose if its result is
not used.

A simplified version of the semantic match that fixes this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@ local idexpression x; @@
-x = netdev_priv(...);
... when != x
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet/ariadne: delete unneeded call to netdev_priv
Julia Lawall [Thu, 15 May 2014 03:43:19 +0000 (05:43 +0200)] 
net/ariadne: delete unneeded call to netdev_priv

Netdev_priv is an accessor function, and has no purpose if its result is
not used.

A simplified version of the semantic match that fixes this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@ local idexpression x; @@
-x = netdev_priv(...);
... when != x
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agodrivers/net/wan: delete unneeded call to netdev_priv
Julia Lawall [Thu, 15 May 2014 03:43:18 +0000 (05:43 +0200)] 
drivers/net/wan: delete unneeded call to netdev_priv

Netdev_priv is an accessor function, and has no purpose if its result is
not used.

A simplified version of the semantic match that fixes this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@ local idexpression x; @@
-x = netdev_priv(...);
... when != x
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: systemport: pad packets to a minimum of 68 bytes
Florian Fainelli [Thu, 15 May 2014 02:32:14 +0000 (19:32 -0700)] 
net: systemport: pad packets to a minimum of 68 bytes

Packets need to be at least 64 bytes to enter the switch port logic,
including the FCS, otherwise they will be discarded as RUNT packets.

With packets having Broadcom tags, the 4-bytes tag is first stripped
off the packet, and the packet length is then checked, so we need to
make sure that the packet length with FCS is at least 64 bytes.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: systemport: only update UMAC_CMD if something changed
Florian Fainelli [Thu, 15 May 2014 02:32:13 +0000 (19:32 -0700)] 
net: systemport: only update UMAC_CMD if something changed

The link adjustment callback can be called as frequently as desired by
the PHY library, as such, let's avoid doing a Read/Modify/Write sequence
if nothing changed, which is more than likely since we are interfaced
with a switch device.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoti: Remove trailing semicolon from do {...} while (0) macro
Joe Perches [Wed, 14 May 2014 19:15:13 +0000 (12:15 -0700)] 
ti: Remove trailing semicolon from do {...} while (0) macro

These should not have trailing semicolons so remove them.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'filter-next'
David S. Miller [Thu, 15 May 2014 20:32:20 +0000 (16:32 -0400)] 
Merge branch 'filter-next'

Alexei Starovoitov says:

====================
internal BPF jit for x64 and JITed seccomp

Internal BPF JIT compiler for x86_64 replaces classic BPF JIT.
Use it in seccomp and in tracing filters (sent as separate patch)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoseccomp: JIT compile seccomp filter
Alexei Starovoitov [Wed, 14 May 2014 02:50:47 +0000 (19:50 -0700)] 
seccomp: JIT compile seccomp filter

Take advantage of internal BPF JIT

05-sim-long_jumps.c of libseccomp was used as micro-benchmark:

 seccomp_rule_add_exact(ctx,...
 seccomp_rule_add_exact(ctx,...

 rc = seccomp_load(ctx);

 for (i = 0; i < 10000000; i++)
    syscall(...);

$ sudo sysctl net.core.bpf_jit_enable=1
$ time ./bench
real 0m2.769s
user 0m1.136s
sys 0m1.624s

$ sudo sysctl net.core.bpf_jit_enable=0
$ time ./bench
real 0m5.825s
user 0m1.268s
sys 0m4.548s

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: filter: x86: internal BPF JIT
Alexei Starovoitov [Wed, 14 May 2014 02:50:46 +0000 (19:50 -0700)] 
net: filter: x86: internal BPF JIT

Maps all internal BPF instructions into x86_64 instructions.
This patch replaces original BPF x64 JIT with internal BPF x64 JIT.
sysctl net.core.bpf_jit_enable is reused as on/off switch.

Performance:

1. old BPF JIT and internal BPF JIT generate equivalent x86_64 code.
  No performance difference is observed for filters that were JIT-able before

Example assembler code for BPF filter "tcpdump port 22"

original BPF -> old JIT:            original BPF -> internal BPF -> new JIT:
   0:   push   %rbp                      0:     push   %rbp
   1:   mov    %rsp,%rbp                 1:     mov    %rsp,%rbp
   4:   sub    $0x60,%rsp                4:     sub    $0x228,%rsp
   8:   mov    %rbx,-0x8(%rbp)           b:     mov    %rbx,-0x228(%rbp) // prologue
                                        12:     mov    %r13,-0x220(%rbp)
                                        19:     mov    %r14,-0x218(%rbp)
                                        20:     mov    %r15,-0x210(%rbp)
                                        27:     xor    %eax,%eax         // clear A
   c:   xor    %ebx,%ebx                29:     xor    %r13,%r13         // clear X
   e:   mov    0x68(%rdi),%r9d          2c:     mov    0x68(%rdi),%r9d
  12:   sub    0x6c(%rdi),%r9d          30:     sub    0x6c(%rdi),%r9d
  16:   mov    0xd8(%rdi),%r8           34:     mov    0xd8(%rdi),%r10
                                        3b:     mov    %rdi,%rbx
  1d:   mov    $0xc,%esi                3e:     mov    $0xc,%esi
  22:   callq  0xffffffffe1021e15       43:     callq  0xffffffffe102bd75
  27:   cmp    $0x86dd,%eax             48:     cmp    $0x86dd,%rax
  2c:   jne    0x0000000000000069       4f:     jne    0x000000000000009a
  2e:   mov    $0x14,%esi               51:     mov    $0x14,%esi
  33:   callq  0xffffffffe1021e31       56:     callq  0xffffffffe102bd91
  38:   cmp    $0x84,%eax               5b:     cmp    $0x84,%rax
  3d:   je     0x0000000000000049       62:     je     0x0000000000000074
  3f:   cmp    $0x6,%eax                64:     cmp    $0x6,%rax
  42:   je     0x0000000000000049       68:     je     0x0000000000000074
  44:   cmp    $0x11,%eax               6a:     cmp    $0x11,%rax
  47:   jne    0x00000000000000c6       6e:     jne    0x0000000000000117
  49:   mov    $0x36,%esi               74:     mov    $0x36,%esi
  4e:   callq  0xffffffffe1021e15       79:     callq  0xffffffffe102bd75
  53:   cmp    $0x16,%eax               7e:     cmp    $0x16,%rax
  56:   je     0x00000000000000bf       82:     je     0x0000000000000110
  58:   mov    $0x38,%esi               88:     mov    $0x38,%esi
  5d:   callq  0xffffffffe1021e15       8d:     callq  0xffffffffe102bd75
  62:   cmp    $0x16,%eax               92:     cmp    $0x16,%rax
  65:   je     0x00000000000000bf       96:     je     0x0000000000000110
  67:   jmp    0x00000000000000c6       98:     jmp    0x0000000000000117
  69:   cmp    $0x800,%eax              9a:     cmp    $0x800,%rax
  6e:   jne    0x00000000000000c6       a1:     jne    0x0000000000000117
  70:   mov    $0x17,%esi               a3:     mov    $0x17,%esi
  75:   callq  0xffffffffe1021e31       a8:     callq  0xffffffffe102bd91
  7a:   cmp    $0x84,%eax               ad:     cmp    $0x84,%rax
  7f:   je     0x000000000000008b       b4:     je     0x00000000000000c2
  81:   cmp    $0x6,%eax                b6:     cmp    $0x6,%rax
  84:   je     0x000000000000008b       ba:     je     0x00000000000000c2
  86:   cmp    $0x11,%eax               bc:     cmp    $0x11,%rax
  89:   jne    0x00000000000000c6       c0:     jne    0x0000000000000117
  8b:   mov    $0x14,%esi               c2:     mov    $0x14,%esi
  90:   callq  0xffffffffe1021e15       c7:     callq  0xffffffffe102bd75
  95:   test   $0x1fff,%ax              cc:     test   $0x1fff,%rax
  99:   jne    0x00000000000000c6       d3:     jne    0x0000000000000117
                                        d5:     mov    %rax,%r14
  9b:   mov    $0xe,%esi                d8:     mov    $0xe,%esi
  a0:   callq  0xffffffffe1021e44       dd:     callq  0xffffffffe102bd91 // MSH
                                        e2:     and    $0xf,%eax
                                        e5:     shl    $0x2,%eax
                                        e8:     mov    %rax,%r13
                                        eb:     mov    %r14,%rax
                                        ee:     mov    %r13,%rsi
  a5:   lea    0xe(%rbx),%esi           f1:     add    $0xe,%esi
  a8:   callq  0xffffffffe1021e0d       f4:     callq  0xffffffffe102bd6d
  ad:   cmp    $0x16,%eax               f9:     cmp    $0x16,%rax
  b0:   je     0x00000000000000bf       fd:     je     0x0000000000000110
                                        ff:     mov    %r13,%rsi
  b2:   lea    0x10(%rbx),%esi         102:     add    $0x10,%esi
  b5:   callq  0xffffffffe1021e0d      105:     callq  0xffffffffe102bd6d
  ba:   cmp    $0x16,%eax              10a:     cmp    $0x16,%rax
  bd:   jne    0x00000000000000c6      10e:     jne    0x0000000000000117
  bf:   mov    $0xffff,%eax            110:     mov    $0xffff,%eax
  c4:   jmp    0x00000000000000c8      115:     jmp    0x000000000000011c
  c6:   xor    %eax,%eax               117:     mov    $0x0,%eax
  c8:   mov    -0x8(%rbp),%rbx         11c:     mov    -0x228(%rbp),%rbx // epilogue
  cc:   leaveq                         123:     mov    -0x220(%rbp),%r13
  cd:   retq                           12a:     mov    -0x218(%rbp),%r14
                                       131:     mov    -0x210(%rbp),%r15
                                       138:     leaveq
                                       139:     retq

On fully cached SKBs both JITed functions take 12 nsec to execute.
BPF interpreter executes the program in 30 nsec.

The difference in generated assembler is due to the following:

Old BPF imlements LDX_MSH instruction via sk_load_byte_msh() helper function
inside bpf_jit.S.
New JIT removes the helper and does it explicitly, so ldx_msh cost
is the same for both JITs, but generated code looks longer.

New JIT has 4 registers to save, so prologue/epilogue are larger,
but the cost is within noise on x64.

Old JIT checks whether first insn clears A and if not emits 'xor %eax,%eax'.
New JIT clears %rax unconditionally.

2. old BPF JIT doesn't support ANC_NLATTR, ANC_PAY_OFFSET, ANC_RANDOM
  extensions. New JIT supports all BPF extensions.
  Performance of such filters improves 2-4 times depending on a filter.
  The longer the filter the higher performance gain.
  Synthetic benchmarks with many ancillary loads see 20x speedup
  which seems to be the maximum gain from JIT

Notes:

. net.core.bpf_jit_enable=2 + tools/net/bpf_jit_disasm is still functional
  and can be used to see generated assembler

. there are two jit_compile() functions and code flow for classic filters is:
  sk_attach_filter() - load classic BPF
  bpf_jit_compile() - try to JIT from classic BPF
  sk_convert_filter() - convert classic to internal
  bpf_int_jit_compile() - JIT from internal BPF

  seccomp and tracing filters will just call bpf_int_jit_compile()

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: filter: x86: split bpf_jit_compile()
Alexei Starovoitov [Wed, 14 May 2014 02:50:45 +0000 (19:50 -0700)] 
net: filter: x86: split bpf_jit_compile()

Split bpf_jit_compile() into two functions to improve readability
of for(pass++) loop. The change follows similar style of JIT compilers
for arm, powerpc, s390

The body of new do_jit() was not reformatted to reduce noise
in this patch, since the following patch replaces most of it.

Tested with BPF testsuite.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'ieee802154-next'
David S. Miller [Thu, 15 May 2014 19:51:52 +0000 (15:51 -0400)] 
Merge branch 'ieee802154-next'

Phoebe Buckheister says:

====================
802154: some cleanups and fixes

This series adds some definitions for 802.15.4 header fields that were missing,
changes 6lowpan fragmentation to be aware of security headers and fixes
802.15.4 datagram socket sendmsg(), which was entirely incompliant to date.
Also a few minor changes to mac_cb handling, mark a single-use function static,
and correctly check for EMSGSIZE conditions during wpan_header_create.

Changes since v1:
  * rename mac_cb_alloc to mac_cb_init
  * catch all error cases of sendmsg() instead of only !conn && msg_name
  * redo 6lowpan fragmentation to not clone lower layer headers
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agomac802154: make mac802154_wpan_open static
Phoebe Buckheister [Wed, 14 May 2014 15:43:11 +0000 (17:43 +0200)] 
mac802154: make mac802154_wpan_open static

This function is only used within the same translation unit, so mark it
static.

Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoieee802154: fix dgram socket sendmsg()
Phoebe Buckheister [Wed, 14 May 2014 15:43:10 +0000 (17:43 +0200)] 
ieee802154: fix dgram socket sendmsg()

802.15.4 datagram sockets do not currently have a compliant sendmsg().
The destination address supplied is always ignored, and in unconnected
mode, packets are broadcast instead of dropped with -EDESTADDRREQ. This
patch fixes 802.15.4 dgram sockets to be compliant, i.e.

 !conn && !msg_name => -EDESTADDRREQ
 !conn &&  msg_name => send to msg_name
  conn && !msg_name => send to connected
  conn &&  msg_name => -EISCONN

Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years ago6lowpan: fix fragmentation
Phoebe Buckheister [Wed, 14 May 2014 15:43:09 +0000 (17:43 +0200)] 
6lowpan: fix fragmentation

Currently, 6lowpan creates one 802.15.4 MAC header for the original
packet the device was given by upper layers and reuses this header for
all fragments, if fragmentation is required. This also reuses frame
sequence numbers, which must not happen. 6lowpan also has issues with
fragmentation in the presence of security headers, since those may imply
the presence of trailing fields that are not accounted for by the
fragmentation code right now.

Fix both of these issues by properly allocating fragment skbs with
headromm and tailroom as specified by the underlying device, create one
header for each skb instead of reusing the original header, let the
underlying device do the rest.

Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoieee802154: change _cb handling slightly
Phoebe Buckheister [Wed, 14 May 2014 15:43:08 +0000 (17:43 +0200)] 
ieee802154: change _cb handling slightly

The current mac_cb handling of ieee802154 is rather awkward and limited.
Decompose the single flags field into multiple fields with the meanings
of each subfield of the flags field to make future extensions (for
example, link-layer security) easier. Also don't set the frame sequence
number in upper layers, since that's a thing the MAC is supposed to set
on frame transmit - we set it on header creation, but assuming that
upper layers do not blindly duplicate our headers, this is fine.

Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agomac802154: account for all header parts during wpan header creationg
Phoebe Buckheister [Wed, 14 May 2014 15:43:07 +0000 (17:43 +0200)] 
mac802154: account for all header parts during wpan header creationg

The current WPAN header creation code checks for EMSGSIZE conditions,
but does not account for the MIC field that link layer security may add
at the end of the frame. Now that we can accurately calculate the
maximum payload size of packets, use that to check for EMSGSIZE
conditions.

Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoieee802154: add definitions for link-layer security and header functions
Phoebe Buckheister [Wed, 14 May 2014 15:43:06 +0000 (17:43 +0200)] 
ieee802154: add definitions for link-layer security and header functions

When dealing with 802.15.4, one often has to know the maximum payload
size for a given packet. This depends on many factors, one of which is
whether or not a security header is present in the frame. These
definitions and functions provide an easy way for any upper layer to
calculate the maximum payload size for a packet. The first obvious user
for this is 6lowpan, which duplicates this calculation and gets it
partially wrong because it ignores security headers.

Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agodrivers: net: Register Micrel ksz884x network devices in PCI device tree.
Markus Lottmann [Wed, 14 May 2014 12:02:04 +0000 (14:02 +0200)] 
drivers: net: Register Micrel ksz884x network devices in PCI device tree.

This unifies the behaviour with other network device drivers and
allows for a matching of the PCI device path in UDEV rules.

Signed-off-by: Markus Lottmann <markus.lottmann1986@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: Fix CONFIG_SYSCTL ifdef test.
David S. Miller [Thu, 15 May 2014 17:43:14 +0000 (13:43 -0400)] 
net: Fix CONFIG_SYSCTL ifdef test.

> include/net/ip.h:211:5: warning: "CONFIG_SYSCTL" is not defined [-Wundef]
>  #if CONFIG_SYSCTL
>      ^

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'cpsw_cleanups'
David S. Miller [Thu, 15 May 2014 17:42:20 +0000 (13:42 -0400)] 
Merge branch 'cpsw_cleanups'

George Cherian says:

====================
TI CPSW Cleanup

This series does some minimal cleanups.
-Conversion of pr_*() to dev_*()
-Convert kzalloc to devm_kzalloc.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agodrivers: net: davinci_cpdma: Convert kzalloc() to devm_kzalloc().
George Cherian [Mon, 12 May 2014 04:51:21 +0000 (10:21 +0530)] 
drivers: net: davinci_cpdma: Convert kzalloc() to devm_kzalloc().

Convert kzalloc() to devm_kzalloc().

Signed-off-by: George Cherian <george.cherian@ti.com>
Reviewed-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: davinci_mdio: Convert pr_err() to dev_err() call
George Cherian [Mon, 12 May 2014 04:51:20 +0000 (10:21 +0530)] 
net: davinci_mdio: Convert pr_err() to dev_err() call

Convert the lone pr_err() to dev_err() call.

Signed-off-by: George Cherian <george.cherian@ti.com>
Reviewed-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agodriver net: cpsw: Convert pr_*() to dev_*() calls
George Cherian [Mon, 12 May 2014 04:51:19 +0000 (10:21 +0530)] 
driver net: cpsw: Convert pr_*() to dev_*() calls

Convert all pr_*() calls to dev_*() calls.
No functional changes.

Signed-off-by: George Cherian <george.cherian@ti.com>
Reviewed-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agodriver/net/ethernet/ec_bhf.c: fix sparse warnings
Darek Marcinkiewicz [Wed, 14 May 2014 06:04:32 +0000 (08:04 +0200)] 
driver/net/ethernet/ec_bhf.c: fix sparse warnings

Sparse was reporting quite a few warnings for the driver.
Those get fixed by this patch.

Signed-off-by: Dariusz Marcinkiewicz <reksio@newterm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: Use a more standard macro for INET_ADDR_COOKIE
Joe Perches [Wed, 14 May 2014 03:30:07 +0000 (20:30 -0700)] 
net: Use a more standard macro for INET_ADDR_COOKIE

Missing a colon on definition use is a bit odd so
change the macro for the 32 bit case to declare an
__attribute__((unused)) and __deprecated variable.

The __deprecated attribute will cause gcc to emit
an error if the variable is actually used.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: systemport: Use devm_ioremap_resource()
Jingoo Han [Wed, 14 May 2014 03:15:42 +0000 (12:15 +0900)] 
net: systemport: Use devm_ioremap_resource()

Use devm_ioremap_resource() because devm_request_and_ioremap() is
obsoleted by devm_ioremap_resource().

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'mlx4-next'
David S. Miller [Wed, 14 May 2014 19:41:05 +0000 (15:41 -0400)] 
Merge branch 'mlx4-next'

Or Gerlitz says:

====================
Mellanox driver update 2014-05-12

This patchset introduce some small bug fixes:

Eyal fixed some compilation and syntactic checkers warnings. Ido fixed a
coruption in user priority mapping when changing number of channels. Shani
fixed some other problems when modifying MAC address. Yuval fixed a problem
when changing IRQ affinity during high traffic - IRQ changes got effective
only after the first pause in traffic.

This patchset was tested and applied over commit 93dccc5: "mdio_bus: fix
devm_mdiobus_alloc_size export"

Changes from V1:
- applied feedback from Dave to use true/false and not 0/1 in patch 1/9
- removed the patch from Noa which adddressed a bug in flow steering table
  when using a bond device, as the fix might need to be in the bonding driver,
  this is now dicussed in the netdev thread "bonding directly changes
  underlying device address"

Changes from V0:
- Patch 1/9 - net/mlx4_core: Enforce irq affinity changes immediatly
  - Moved the new members to a hot cache line as Eric suggested
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet/mlx4_core: Fix inaccurate return value of mlx4_flow_attach()
Eyal Perry [Wed, 14 May 2014 09:15:17 +0000 (12:15 +0300)] 
net/mlx4_core: Fix inaccurate return value of mlx4_flow_attach()

Adopt the "info: why not propagate 'ret' from parse_trans_rule()..."
suggestion made by the smatch semantic checker on:
drivers/net/ethernet/mellanox/mlx4/mcg.c:867 mlx4_flow_attach()

Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet/mlx4_en: Using positive error value for unsigned
Eyal Perry [Wed, 14 May 2014 09:15:16 +0000 (12:15 +0300)] 
net/mlx4_en: Using positive error value for unsigned

Using a positive value for error: MLX4_NET_TRANS_RULE_NUM instead
of -EPROTONOSUPPORT, to remove compilation warning.

Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet/mlx4_en: Protect MAC address modification with the state_lock mutex
Shani Michaelli [Wed, 14 May 2014 09:15:15 +0000 (12:15 +0300)] 
net/mlx4_en: Protect MAC address modification with the state_lock mutex

This Patches solves an issue that could raise when modifying the
device's MAC. It occurs due to a simultaneous access to priv->mac_hash
from two contexts. The buggy scenario described below:
Context 1: copy the new mac address to the dev->dev_addr field.
Context 2: mlx4_en_do_uc_filter removes prev_mac entry from the mac_hash
           db since it is not in dev->uc and not equal to dev->dev_addr.
Context 1: mlx4_en_do_set_mac() calls mlx4_en_replace_mac() to replace
           prev_mac with dev_addr but it fails to update the mac_hash db
           since it no longer contains prev_mac, therefore it returns
           with an error.

The fix is to prevent mlx4_en_do_uc_filter from being executed by both
of the context 1 calls described above, This is done by putting them
both under the mdev->state_lock lock, it will solve this issue since
mlx4_en_do_uc_filter is already protected by the mdev->state_lock.

Reviewed-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet/mlx4_core: Removed unnecessary bit operation condition
Eyal Perry [Wed, 14 May 2014 09:15:14 +0000 (12:15 +0300)] 
net/mlx4_core: Removed unnecessary bit operation condition

Fix the "warn: suspicious bitop condition" made by the smatch semantic
checker on:
drivers/net/ethernet/mellanox/mlx4/main.c:509 mlx4_slave_cap()

Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet/mlx4_core: Fix smatch error - possible access to a null variable
Eyal Perry [Wed, 14 May 2014 09:15:13 +0000 (12:15 +0300)] 
net/mlx4_core: Fix smatch error - possible access to a null variable

Fix the "error: we previously assumed 'out_param' could be null" found
by smatch semantic checker on:
drivers/net/ethernet/mellanox/mlx4/cmd.c:506 mlx4_cmd_poll()
drivers/net/ethernet/mellanox/mlx4/cmd.c:578 mlx4_cmd_wait()

Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet/mlx4_en: Fix errors in MAC address changing when port is down
Shani Michaelli [Wed, 14 May 2014 09:15:12 +0000 (12:15 +0300)] 
net/mlx4_en: Fix errors in MAC address changing when port is down

This patch fix an issue that happen when changing the MAC address when
the port is down, described as follows:
1. Set the port down.
2. Change the MAC address - mlx4_en_set_mac() will change dev->dev_addr.
3. Set the port up - will result in mlx4_en_do_uc_filter that will
   remove the prev_mac entry from the mac_hash db.
4. Changing the MAC address again will eventually trigger the call to
   mlx4_en_replace_mac() in order to replace prev_mac with dev_addr but
   the prev_mac entry is already not exist in the mac_hash db therefore
   the operation fails.

The fix is to set the prev_mac with the new MAC address so in step 3
above, after setting the port up mlx4_en_get_qp() is updating the
mac_hash with the entry of dev_addr which is equal to prev_mac.
Therefore in step 4, when calling mlx4_en_replace_mac, the entry related
to prev_mac exist in mac_hash and the replace operation succeed.

Reviewed-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet/mlx4_en: User prio mapping gets corrupted when changing number of channels
Ido Shamay [Wed, 14 May 2014 09:15:11 +0000 (12:15 +0300)] 
net/mlx4_en: User prio mapping gets corrupted when changing number of channels

When using ethtool set_channels, mlx4_en_setup_tc is always called, even
when it was not configured. Fixed code to call mlx4_en_setup_tc() only
if needed.

Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet/mlx4_core: Enforce irq affinity changes immediatly
Yuval Atias [Wed, 14 May 2014 09:15:10 +0000 (12:15 +0300)] 
net/mlx4_core: Enforce irq affinity changes immediatly

During heavy traffic, napi is constatntly polling the complition queue
and no interrupt is fired. Because of that, changes to irq affinity are
ignored until traffic is stopped and resumed.

By registering to the irq notifier mechanism, and forcing interrupt when
affinity is changed, irq affinity changes will be immediatly enforced.

Signed-off-by: Yuval Atias <yuvala@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agomacvlan: Propagate lowerdev MTU changes
dingtianhong [Tue, 13 May 2014 06:39:27 +0000 (14:39 +0800)] 
macvlan: Propagate lowerdev MTU changes

When the physical MTU changes we should ensure that all existing MACVLAN
dev MTU do not exceed the new lowerdev MTU. This patch adds that
propagation.

Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agodccp: make the request_retries minimum is 1
wangweidong [Tue, 13 May 2014 02:57:58 +0000 (10:57 +0800)] 
dccp: make the request_retries minimum is 1

In Documentation/networking/dccp.txt points that request_retries
should be greater than 0. So make the extra1 to be &one instead
of &zero.

Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agosnmp: fix some left over of snmp stats
WANG Cong [Mon, 12 May 2014 23:52:02 +0000 (16:52 -0700)] 
snmp: fix some left over of snmp stats

Fengguang reported the following sparse warning:

>> net/ipv6/proc.c:198:41: sparse: incorrect type in argument 1 (different address spaces)
   net/ipv6/proc.c:198:41:    expected void [noderef] <asn:3>*mib
   net/ipv6/proc.c:198:41:    got void [noderef] <asn:3>**pcpumib

Fixes: commit 698365fa1874aa7635d51667a3 (net: clean up snmp stats code)
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoipv4: make ip_local_reserved_ports per netns
WANG Cong [Mon, 12 May 2014 23:04:53 +0000 (16:04 -0700)] 
ipv4: make ip_local_reserved_ports per netns

ip_local_port_range is already per netns, so should ip_local_reserved_ports
be. And since it is none by default we don't actually need it when we don't
enable CONFIG_SYSCTL.

By the way, rename inet_is_reserved_local_port() to inet_is_local_reserved_port()

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoirda: sh_irda: Enable driver compilation with COMPILE_TEST
Laurent Pinchart [Mon, 12 May 2014 23:04:36 +0000 (01:04 +0200)] 
irda: sh_irda: Enable driver compilation with COMPILE_TEST

This helps increasing build testing coverage.

Signed-off-by: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'tipc-next'
David S. Miller [Wed, 14 May 2014 19:20:19 +0000 (15:20 -0400)] 
Merge branch 'tipc-next'

Jon Maloy says:

====================
tipc: bug fixes and improvements

Intensive and extensive testing has revealed some rather infrequent
problems related to flow control, buffer handling and link
establishment. Commits ##1 to 4 deal with these problems.

The remaining four commits are just code improvments, aiming at
making the code more comprehensible and maintainable. There are
no functional enhancements in this series.

v2: Fixed a typo in commit log #2. Otherwise no changes from v1.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotipc: merge port message reception into socket reception function
Jon Paul Maloy [Wed, 14 May 2014 09:39:15 +0000 (05:39 -0400)] 
tipc: merge port message reception into socket reception function

In order to reduce complexity and save a call level during message
reception at port/socket level, we remove the function tipc_port_rcv()
and merge its functionality into tipc_sk_rcv().

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotipc: clean up neigbor discovery message reception
Jon Paul Maloy [Wed, 14 May 2014 09:39:14 +0000 (05:39 -0400)] 
tipc: clean up neigbor discovery message reception

The function tipc_disc_rcv(), which is handling received neighbor
discovery messages, is perceived as messy, and it is hard to verify
its correctness by code inspection. The fact that the task it is set
to resolve is fairly complex does not make the situation better.

In this commit we try to take a more systematic approach to the
problem. We define a decision machine which takes three state flags
 as input, and produces three action flags as output. We then walk
through all permutations of the state flags, and for each of them we
describe verbally what is going on, plus that we set zero or more of
the action flags. The action flags indicate what should be done once
the decision machine has finished its job, while the last part of the
function deals with performing those actions.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotipc: improve and extend media address conversion functions
Jon Paul Maloy [Wed, 14 May 2014 09:39:13 +0000 (05:39 -0400)] 
tipc: improve and extend media address conversion functions

TIPC currently handles two media specific addresses: Ethernet MAC
addresses and InfiniBand addresses. Those are kept in three different
formats:

1) A "raw" format as obtained from the device. This format is known
   only by the media specific adapter code in eth_media.c and
   ib_media.c.
2) A "generic" internal format, in the form of struct tipc_media_addr,
   which can be referenced and passed around by the generic media-
   unaware code.
3) A serialized version of the latter, to be conveyed in neighbor
   discovery messages.

Conversion between the three formats can only be done by the media
specific code, so we have function pointers for this purpose in
struct tipc_media. Here, the media adapters can install their own
conversion functions at startup.

We now introduce a new such function, 'raw2addr()', whose purpose
is to convert from format 1 to format 2 above. We also try to as far
as possible uniform commenting, variable names and usage of these
functions, with the purpose of making them more comprehensible.

We can now also remove the function tipc_l2_media_addr_set(), whose
job is done better by the new function.

Finally, we expand the field for serialized addresses (format 3)
in discovery messages from 20 to 32 bytes. This is permitted
according to the spec, and reduces the risk of problems when we
add new media in the future.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotipc: rename and move message reassembly function
Jon Paul Maloy [Wed, 14 May 2014 09:39:12 +0000 (05:39 -0400)] 
tipc: rename and move message reassembly function

The function tipc_link_frag_rcv() is in reality a re-entrant generic
message reassemby function that has nothing in particular to do with
the link, where it is defined now. This becomes obvious when we see
the need to call the function from other places in the code.

In this commit rename it to tipc_buf_append() and move it to the file
msg.c. We also simplify its signature by moving the tail pointer to
the control block of the head buffer, hence making the head buffer
self-contained.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotipc: mark head of reassembly buffer as non-linear
Jon Paul Maloy [Wed, 14 May 2014 09:39:11 +0000 (05:39 -0400)] 
tipc: mark head of reassembly buffer as non-linear

The message reassembly function does not update the 'len' and 'data_len'
fields of the head skbuff correctly when fragments are chained to it.
This may sometimes lead to obsure errors, such as fragment reordering
when we receive fragments which are cloned buffers.

This commit fixes this, by ensuring that the two fields are updated
correctly.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotipc: don't record link RESET or ACTIVATE messages as traffic
Jon Paul Maloy [Wed, 14 May 2014 09:39:10 +0000 (05:39 -0400)] 
tipc: don't record link RESET or ACTIVATE messages as traffic

In the current code, all incoming LINK_PROTOCOL messages, irrespective
of type, nudge the "last message received" checkpoint, informing the
link state machine that a message was received from the peer since last
supervision timeout event. This inhibits the link from starting probing
the peer unnecessarily.

However, not only STATE messages are recorded as legitimate incoming
traffic this way, but even RESET and ACTIVATE messages, which in
reality are there to inform the link that the peer endpoint has been
reset. At the same time, some RESET messages may be dropped instead
of causing a link reset. This happens when the link endpoint thinks
it is fully up and working, and the session number of the RESET is
lower than or equal to the current link session. In such cases the
RESET is perceived as a delayed remnant from an earlier session, or
the current one, and dropped.

Now, if a TIPC module is removed and then immediately reinserted, e.g.
when using a script, RESET messages may arrive at the peer link endpoint
before this one has had time to discover the failure. The RESET may be
dropped because of the session number, but only after it has been
recorded as a legitimate traffic event. Hence, the receiving link will
not start probing, and not discover that the peer endpoint is down, at
the same time ignoring the periodic RESET messages coming from that
endpoint. We have ended up in a stale state where a failed link cannot
be re-established.

In this commit, we remedy this by nudging the checkpoint only for
received STATE messages, not for RESET or ACTIVATE messages.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotipc: compensate for double accounting in socket rcv buffer
Jon Paul Maloy [Wed, 14 May 2014 09:39:09 +0000 (05:39 -0400)] 
tipc: compensate for double accounting in socket rcv buffer

The function net/core/sock.c::__release_sock() runs a tight loop
to move buffers from the socket backlog queue to the receive queue.

As a security measure, sk_backlog.len of the receiving socket
is not set to zero until after the loop is finished, i.e., until
the whole backlog queue has been transferred to the receive queue.
During this transfer, the data that has already been moved is counted
both in the backlog queue and the receive queue, hence giving an
incorrect picture of the available queue space for new arriving buffers.

This leads to unnecessary rejection of buffers by sk_add_backlog(),
which in TIPC leads to unnecessarily broken connections.

In this commit, we compensate for this double accounting by adding
a counter that keeps track of it. The function socket.c::backlog_rcv()
receives buffers one by one from __release_sock(), and adds them to the
socket receive queue. If the transfer is successful, it increases a new
atomic counter 'tipc_sock::dupl_rcvcnt' with 'truesize' of the
transferred buffer. If a new buffer arrives during this transfer and
finds the socket busy (owned), we attempt to add it to the backlog.
However, when sk_add_backlog() is called, we adjust the 'limit'
parameter with the value of the new counter, so that the risk of
inadvertent rejection is eliminated.

It should be noted that this change does not invalidate the original
purpose of zeroing 'sk_backlog.len' after the full transfer. We set an
upper limit for dupl_rcvcnt, so that if a 'wild' sender (i.e., one that
doesn't respect the send window) keeps pumping in buffers to
sk_add_backlog(), he will eventually reach an upper limit,
(2 x TIPC_CONN_OVERLOAD_LIMIT). After that, no messages can be added
to the backlog, and the connection will be broken. Ordinary, well-
behaved senders will never reach this buffer limit at all.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotipc: decrease connection flow control window
Jon Paul Maloy [Wed, 14 May 2014 09:39:08 +0000 (05:39 -0400)] 
tipc: decrease connection flow control window

Memory overhead when allocating big buffers for data transfer may
be quite significant. E.g., truesize of a 64 KB buffer turns out
to be 132 KB, 2 x the requested size.

This invalidates the "worst case" calculation we have been
using to determine the default socket receive buffer limit,
which is based on the assumption that 1024x64KB = 67MB buffers
may be queued up on a socket.

Since TIPC connections cannot survive hitting the buffer limit,
we have to compensate for this overhead.

We do that in this commit by dividing the fix connection flow
control window from 1024 (2*512) messages to 512 (2*256). Since
older version nodes send out acks at 512 message intervals,
compatibility with such nodes is guaranteed, although performance
may be non-optimal in such cases.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agobonding: alloc the structure ad_info dynamically in per slave
dingtianhong [Mon, 12 May 2014 07:08:43 +0000 (15:08 +0800)] 
bonding: alloc the structure ad_info dynamically in per slave

The struct ad_slave_info is very huge, and only be used for 802.3ad mode,
so alloc the structure dynamically could save 356 Bits for every slave in
non 802.3ad mode.

Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agosh_eth: replace devm_kzalloc() with devm_kmalloc_array()
Sergei Shtylyov [Mon, 12 May 2014 22:30:14 +0000 (02:30 +0400)] 
sh_eth: replace devm_kzalloc() with devm_kmalloc_array()

When I was converting the driver to the managed device API, only devm_kzalloc()
was available for memory allocation, so I had to use it, despite zeroing out the
PHY IRQ array right before initializing all  its entries to PHY_POLL was quite
stupid.   Now that devm_kmalloc_array() has become available, we can avoid the
needless zeroing out...

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'tg3-next'
David S. Miller [Tue, 13 May 2014 22:39:01 +0000 (18:39 -0400)] 
Merge branch 'tg3-next'

Michael Chan says:

====================
tg3: TSO related enhancements to prevent memory allocation failure

Michael Chan (3):
  tg3: Don't modify ip header fields when doing GSO
  tg3: Prevent page allocation failure during TSO workaround
  tg3: Update copyright and version to 3.137
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotg3: Update copyright and version to 3.137
Michael Chan [Mon, 12 May 2014 03:22:55 +0000 (20:22 -0700)] 
tg3: Update copyright and version to 3.137

Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotg3: Prevent page allocation failure during TSO workaround
Michael Chan [Mon, 12 May 2014 03:22:54 +0000 (20:22 -0700)] 
tg3: Prevent page allocation failure during TSO workaround

If any TSO fragment hits hardware bug conditions (e.g. 4G boundary), the
driver will workaround by calling skb_copy() to copy to a linear SKB.  Users
have reported page allocation failures as the TSO packet can be up to 64K.
Copying such a large packet is also very inefficient.  We fix this by using
existing tg3_tso_bug() to transmit the packet using GSO.

Signed-off-by: Prashant Sreedharan <prashant@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This page took 0.052467 seconds and 5 git commands to generate.