OpenBSD CVS

CVS log for src/sys/net/ifq.c


[BACK] Up to [local] / src / sys / net

Request diff between arbitrary revisions


Default branch: MAIN


Revision 1.53 / (download) - annotate - [select for diffs], Fri Nov 10 15:51:24 2023 UTC (6 months, 3 weeks ago) by bluhm
Branch: MAIN
CVS Tags: OPENBSD_7_5_BASE, OPENBSD_7_5, HEAD
Changes since 1.52: +9 -2 lines
Diff to previous 1.52 (colored)

Make ifq and ifiq interface MP safe.

Rename ifq_set_maxlen() to ifq_init_maxlen().  This function neither
uses WRITE_ONCE() nor a mutex and is called before the ifq mutex
is initialized.  The new name expresses that it should be used only
during interface attach when there is no concurrency.

Protect ifq_len(), ifq_empty(), ifiq_len(), and ifiq_empty() with
READ_ONCE().  They can be used without lock as they only read a
single integer.

OK dlg@

Revision 1.52 / (download) - annotate - [select for diffs], Sun Oct 8 07:44:52 2023 UTC (7 months, 3 weeks ago) by claudio
Branch: MAIN
Changes since 1.51: +1 -9 lines
Diff to previous 1.51 (colored)

Revert commitid: KtmyJEoS0WWxmlZ5
---
Protect interface queues with read once and mutex.

Reading atomic values need at least read once and writing values
should have a mutex.  This is what mbuf queues already do.  Add
READ_ONCE() to ifq and ifiq macros for len and empty.  Convert
ifq_set_maxlen() to a function that grabs ifq_mtx.

OK mvs@
---

ifq_set_maxlen() is called before the ifq_mtx is initalized and this at
least crashes WITNESS kernels on boot.

Reported-by: syzbot+7b218ef53432b5d56d7d@syzkaller.appspotmail.com

Revision 1.51 / (download) - annotate - [select for diffs], Thu Oct 5 11:08:56 2023 UTC (7 months, 4 weeks ago) by bluhm
Branch: MAIN
Changes since 1.50: +9 -1 lines
Diff to previous 1.50 (colored)

Protect interface queues with read once and mutex.

Reading atomic values need at least read once and writing values
should have a mutex.  This is what mbuf queues already do.  Add
READ_ONCE() to ifq and ifiq macros for len and empty.  Convert
ifq_set_maxlen() to a function that grabs ifq_mtx.

OK mvs@

Revision 1.50 / (download) - annotate - [select for diffs], Sun Jul 30 05:39:52 2023 UTC (10 months ago) by dlg
Branch: MAIN
CVS Tags: OPENBSD_7_4_BASE, OPENBSD_7_4
Changes since 1.49: +20 -3 lines
Diff to previous 1.49 (colored)

count the number of times a ring was marked as oactive.

this is interesting as an indicator of how busy or overloaded a
transmit queue is before the next indicator which is the number of
qdrops.

Revision 1.49 / (download) - annotate - [select for diffs], Mon Jan 9 03:39:14 2023 UTC (16 months, 3 weeks ago) by dlg
Branch: MAIN
CVS Tags: OPENBSD_7_3_BASE, OPENBSD_7_3
Changes since 1.48: +28 -1 lines
Diff to previous 1.48 (colored)

flesh out ifiq_enqueue

Revision 1.48 / (download) - annotate - [select for diffs], Mon Jan 9 03:37:44 2023 UTC (16 months, 3 weeks ago) by dlg
Branch: MAIN
Changes since 1.47: +12 -3 lines
Diff to previous 1.47 (colored)

count the number times a packet was dropped by bpf as fdrops.

Revision 1.47 / (download) - annotate - [select for diffs], Tue Nov 22 03:40:53 2022 UTC (18 months, 1 week ago) by dlg
Branch: MAIN
Changes since 1.46: +16 -2 lines
Diff to previous 1.46 (colored)

count how many times ifiqs enqueue and dequeue packets.

network cards try to enqueue a list of packets on an ifiq once per
interrupt and ifiqs already count how many packets they're handling.
this let's us see how well interrupt mitigation is working on a
ring or interface. ifiqs are supposed to provide backpressure
signalling to a driver if it enqueues a lot more work than it's
able to process in softnet, so recording dequeues let's us see this
ratio.

Revision 1.46 / (download) - annotate - [select for diffs], Sat Apr 30 21:13:57 2022 UTC (2 years, 1 month ago) by bluhm
Branch: MAIN
CVS Tags: OPENBSD_7_2_BASE, OPENBSD_7_2
Changes since 1.45: +3 -3 lines
Diff to previous 1.45 (colored)

Run IP input and forwarding with shared netlock.  Also distribute
packets from the interface receive rings into multiple net task
queues.
Note that we still have only one softnet task.  So there will be
no concurrency yet, but we can notice wrong exclusive lock assertions.
Soon the final step will be to increase the NET_TASKQ define.
lots of testing Hrvoje Popovski; OK sashan@

Revision 1.45 / (download) - annotate - [select for diffs], Tue Jan 18 10:54:05 2022 UTC (2 years, 4 months ago) by dlg
Branch: MAIN
CVS Tags: OPENBSD_7_1_BASE, OPENBSD_7_1
Changes since 1.44: +2 -2 lines
Diff to previous 1.44 (colored)

return EIO, not ENXIO, when the interface underneath ifq_deq_sleep dies.

this is consistent with other drivers when they report their
underlying device being detached.

Revision 1.44 / (download) - annotate - [select for diffs], Fri Jul 9 01:22:05 2021 UTC (2 years, 10 months ago) by dlg
Branch: MAIN
CVS Tags: OPENBSD_7_0_BASE, OPENBSD_7_0
Changes since 1.43: +4 -1 lines
Diff to previous 1.43 (colored)

ifq_hdatalen can return 0 if ifq_empty is true, which avoids locks.

Revision 1.43 / (download) - annotate - [select for diffs], Sat Feb 20 04:37:26 2021 UTC (3 years, 3 months ago) by dlg
Branch: MAIN
CVS Tags: OPENBSD_6_9_BASE, OPENBSD_6_9
Changes since 1.42: +2 -2 lines
Diff to previous 1.42 (colored)

default interfaces to bpf_mtap_ether for their if_bpf_mtap handler.

call (*ifp->if_bpf_mtap) instead of bpf_mtap_ether in ifiq_input
and if_vinput.

Revision 1.42 / (download) - annotate - [select for diffs], Sat Feb 20 01:11:44 2021 UTC (3 years, 3 months ago) by dlg
Branch: MAIN
Changes since 1.41: +7 -5 lines
Diff to previous 1.41 (colored)

add a MONITOR flag to ifaces to say they're only used for watching packets.

an example use of this is when you have a span port on a switch and
you want to be able to see the packets coming out of it with tcpdump,
but do not want these packets to enter the network stack for
processing. this is particularly important if the span port is
pushing a copy of any packets related to the machine doing the
monitoring as it will confuse pf states and the stack.

ok benno@

Revision 1.41 / (download) - annotate - [select for diffs], Tue Jul 7 00:00:03 2020 UTC (3 years, 10 months ago) by dlg
Branch: MAIN
CVS Tags: OPENBSD_6_8_BASE, OPENBSD_6_8
Changes since 1.40: +123 -1 lines
Diff to previous 1.40 (colored)

add kstats for rx queues (ifiqs) and transmit queues (ifqs).

this means you can observe what the network stack is trying to do
when it's working with a nic driver that supports multiple rings.
a nic with only one set of rings still gets queues though, and this
still exports their stats.

here is a small example of what kstat(8) currently outputs for these
stats:

em0:0:rxq:0
         packets: 2292 packets
           bytes: 229846 bytes
          qdrops: 0 packets
          errors: 0 packets
            qlen: 0 packets
em0:0:txq:0
         packets: 1297 packets
           bytes: 193413 bytes
          qdrops: 0 packets
          errors: 0 packets
            qlen: 0 packets
         maxqlen: 511 packets
         oactive: false

Revision 1.40 / (download) - annotate - [select for diffs], Wed Jun 17 06:45:22 2020 UTC (3 years, 11 months ago) by dlg
Branch: MAIN
Changes since 1.39: +3 -3 lines
Diff to previous 1.39 (colored)

make ph_flowid in mbufs 16bits by storing whether it's set in csum_flags.

i've been wanting to do this for a while, and now that we've got
stoeplitz and it gives us 16 bits, it seems like the right time.

Revision 1.39 / (download) - annotate - [select for diffs], Thu May 21 00:06:16 2020 UTC (4 years ago) by dlg
Branch: MAIN
Changes since 1.38: +2 -6 lines
Diff to previous 1.38 (colored)

back out 1.38. some bits of the stack aren't ready for it yet.

mark patruck found significant packet drops with trunk(4), and there's
some reports that pppx or pipex relies on some implicit locking
that it shouldn't.

i can fix those without this diff being in the tree.

Revision 1.38 / (download) - annotate - [select for diffs], Wed May 20 01:28:59 2020 UTC (4 years ago) by dlg
Branch: MAIN
Changes since 1.37: +7 -3 lines
Diff to previous 1.37 (colored)

defer calling !IFXF_MPSAFE driver start routines to the systq

this reuses the tx mitigation machinery, but instead of deferring
some start calls to the nettq, it defers all calls to the systq.
this is to avoid taking the KERNEL_LOCK while processing packets
in the stack.

i've been running this in production for 6 or so months, and the
start of a release is a good time to get more people trying it too.

ok jmatthew@

Revision 1.37 / (download) - annotate - [select for diffs], Tue Mar 10 08:45:28 2020 UTC (4 years, 2 months ago) by tobhe
Branch: MAIN
CVS Tags: OPENBSD_6_7_BASE, OPENBSD_6_7
Changes since 1.36: +2 -2 lines
Diff to previous 1.36 (colored)

Make sure return value 'error' is initialized to '0'.

ok dlg@ deraadt@

Revision 1.36 / (download) - annotate - [select for diffs], Sat Jan 25 06:31:32 2020 UTC (4 years, 4 months ago) by dlg
Branch: MAIN
Changes since 1.35: +40 -1 lines
Diff to previous 1.35 (colored)

tweaks sleeping for an mbuf so it's more mpsafe.

the stack puts an mbuf on the tun ifq, and ifqs protect themselves
with a mutex. rather than invent another lock that tun can wrap
these ifq ops with and also coordinate it's conditionals (reading
and dying) with, try and reuse the ifq mtx for the tun stuff too.

because ifqs are more special than tun, this adds a special
ifq_deq_sleep to ifq code that tun can call. tun just passes the
reading and dying variables to ifq to check, but the tricky stuff
about ifqs are kept in the right place.

with this, tun_dev_read should be callable without the kernel lock.

Revision 1.35 / (download) - annotate - [select for diffs], Tue Oct 8 04:18:00 2019 UTC (4 years, 7 months ago) by dlg
Branch: MAIN
CVS Tags: OPENBSD_6_6_BASE, OPENBSD_6_6
Changes since 1.34: +11 -9 lines
Diff to previous 1.34 (colored)

back out the use of ifiq pressure, and go back to using a packet count.

the pressure thresholds were too low in a lot of situations, and
still produced hard to understand interactions at high thresholds.
until we understand the numbers better, and for release, we're going
back counting the length of the per interface input queues.

this was originally based on a report of bad tcp performance with
em(4) by mlarkin, but is very convincingly demonstrated by a bunch
of work procter@ has been doing. deraadt@ is keen on the pressure
backout so he can cut a release.

Revision 1.34 / (download) - annotate - [select for diffs], Fri Aug 16 04:09:02 2019 UTC (4 years, 9 months ago) by dlg
Branch: MAIN
Changes since 1.33: +2 -2 lines
Diff to previous 1.33 (colored)

ifq_hdatalen should keep the mbuf it's looking at, not leak it.

ie, use ifq_deq_rollback after looking at the head mbuf instead of
ifq_deq_commit.

this is used in tun/tap, where it had the effect that you'd get the
datalen for the packet, and then when you try to read that many
bytes it had gone. cool and normal.

this was found by a student who was trying to do just that. i've
always just read the packet into a large buffer.

Revision 1.33 / (download) - annotate - [select for diffs], Wed Jul 3 10:19:45 2019 UTC (4 years, 11 months ago) by dlg
Branch: MAIN
Changes since 1.32: +39 -1 lines
Diff to previous 1.32 (colored)

add the kernel side of net.link.ifrxq.pressure_return and pressure_drop

these values are used as the backpressure thresholds in the interface
rx q processing code. theyre being exposed as tunables to userland
while we are figuring out what the best values for them are.

ok visa@ deraadt@

Revision 1.32 / (download) - annotate - [select for diffs], Mon Jul 1 00:44:29 2019 UTC (4 years, 11 months ago) by dlg
Branch: MAIN
Changes since 1.31: +8 -7 lines
Diff to previous 1.31 (colored)

reintroduce ifiq_input counting backpressure

instead of counting the number of packets on an ifiq, count the
number of times a nic has tried to queue packets before the stack
processes them.

this new semantic interacted badly with virtual interfaces like
vlan and trunk, but these interfaces have been tweaked to call
if_vinput instead of if_input so their packets are processed directly
because theyre already running inside the stack.

im putting this in so we can see what the effect is. if it goes
badly i'll back it out again.

ok cheloha@ proctor@ visa@

Revision 1.31 / (download) - annotate - [select for diffs], Tue Apr 16 04:04:19 2019 UTC (5 years, 1 month ago) by dlg
Branch: MAIN
Changes since 1.30: +34 -1 lines
Diff to previous 1.30 (colored)

have another go at tx mitigation

the idea is to call the hardware transmit routine less since in a
lot of cases posting a producer ring update to the chip is (very)
expensive. it's better to do it for several packets instead of each
packet, hence calling this tx mitigation.

this diff defers the call to the transmit routine to a network
taskq, or until a backlog of packets has built up. dragonflybsd
uses 16 as the size of it's backlog, so i'm copying them for now.

i've tried this before, but previous versions caused deadlocks. i
discovered that the deadlocks in the previous version was from
ifq_barrier calling taskq_barrier against the nettq. interfaces
generally hold NET_LOCK while calling ifq_barrier, but the tq might
already be waiting for the lock we hold.

this version just doesnt have ifq_barrier call taskq_barrier. it
instead relies on the IFF_RUNNING flag and normal ifq serialiser
barrier to guarantee the start routine wont be called when an
interface is going down. the taskq_barrier is only used during
interface destruction to make sure the task struct wont get used
in the future, which is already done without the NET_LOCK being
held.

tx mitigation provides a nice performanace bump in some setups. up
to 25% in some cases.

tested by tb@ and hrvoje popovski (who's running this in production).
ok visa@

Revision 1.30 / (download) - annotate - [select for diffs], Fri Mar 29 04:21:55 2019 UTC (5 years, 2 months ago) by dlg
Branch: MAIN
CVS Tags: OPENBSD_6_5_BASE, OPENBSD_6_5
Changes since 1.29: +2 -3 lines
Diff to previous 1.29 (colored)

while here, drop ifq_is_serialized and IFQ_ASSERT_SERIALIZED

nothing uses them, and they can generate false positives if the
serialiser is running at a lower IPL on the same cpu as a call to
ifq_is_serialiazed.

Revision 1.29 / (download) - annotate - [select for diffs], Fri Mar 29 04:12:55 2019 UTC (5 years, 2 months ago) by dlg
Branch: MAIN
Changes since 1.28: +1 -8 lines
Diff to previous 1.28 (colored)

deprecate ifiq_barrier.

drivers don't need to call it because the stack runs work in ifiqs.
again, only the stack has to care about waiting for pending work
when shutting down, not drivers. ifiq_destroy already does a task_del
and task_barrier dance, so we don't need ifiq_barrier.

Revision 1.28 / (download) - annotate - [select for diffs], Mon Mar 4 21:57:16 2019 UTC (5 years, 3 months ago) by dlg
Branch: MAIN
Changes since 1.27: +7 -8 lines
Diff to previous 1.27 (colored)

move back to ifiq_input counting packets instead of queue operations.

the backpressure seems to have kicked in too early, introducing a
lot of packet loss where there wasn't any before. secondly, counting
operations interacted extremely badly with pseudo-interfaces. for
example, if you have a physical interface that rxes 100 vlan
encapsulated packets, it will call ifiq_input once for all 100
packets. when the network stack is running vlan_input against thes
packets, vlan_input will take the packet and call ifiq_input against
each of them. because the stack is running packets on the parent
interface, it can't run the packets on the vlan interface, so you
end up with ifiq_input being called 100 times, and we dropped packets
after 16 calls to ifiq_input without a matching run of the stack.

chris cappuccio hit some weird stuff too.

discussed with claudio@

Revision 1.27 / (download) - annotate - [select for diffs], Mon Mar 4 21:34:08 2019 UTC (5 years, 3 months ago) by dlg
Branch: MAIN
Changes since 1.26: +1 -3 lines
Diff to previous 1.26 (colored)

don't need to initialise qdrops twice when setting up ifqs and ifiqs.

Revision 1.26 / (download) - annotate - [select for diffs], Fri Mar 1 04:47:33 2019 UTC (5 years, 3 months ago) by dlg
Branch: MAIN
Changes since 1.25: +12 -8 lines
Diff to previous 1.25 (colored)

rework how ifiq_input decides the stack is busy and whether it should drop

previously ifiq_input uses the traditional backpressure or defense
mechanism and counts packets to decide when to shed load by dropping.
currently it ends up waiting for 10240 packets to get queued on the
stack before it would decide to drop packets. this may be ok for
some machines, but for a lot this was too much.

this diff reworks how ifiqs measure how busy the stack is by
introducing an ifiq_pressure counter that is incremented when
ifiq_input is called, and cleared when ifiq_process calls the network
stack to process the queue. if ifiq_input is called multiple times
before ifiq_process in a net taskq runs, ifiq_pressure goes up, and
ifiq_input uses a high value to decide the stack is busy and it
should drop.

i was hoping there would be no performance impact from this change,
but hrvoje popovski notes a slight bump in forwarding performance.
my own testing shows that the ifiq input list length grows to a
fraction of the 10240 it used to get to, which means the maximum
burst of packets through the stack is smoothed out a bit. instead
of big lists of packets followed by big periods of drops, we get
relatively small bursts of packets with smaller gaps where we drop.

the follow-on from this is to make drivers implementing rx ring
moderation to use the return value of ifiq_input to scale the ring
allocation down, allowing the hardware to drop packets so software
doesnt have to.

Revision 1.25 / (download) - annotate - [select for diffs], Sun Dec 16 03:36:02 2018 UTC (5 years, 5 months ago) by dlg
Branch: MAIN
Changes since 1.24: +1 -3 lines
Diff to previous 1.24 (colored)

add task_pending

jsg@ wants this for drm, and i've had a version of it in diffs sine
2016, but obviously havent needed to use it just yet.

task_pending is modelled on timeout_pending, and tells you if the
task is on a list waiting to execute.

ok jsg@

Revision 1.24 / (download) - annotate - [select for diffs], Tue Dec 11 01:36:42 2018 UTC (5 years, 5 months ago) by dlg
Branch: MAIN
Changes since 1.23: +0 -0 lines
Diff to previous 1.23 (colored)

provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.

Revision 1.23 / (download) - annotate - [select for diffs], Tue Dec 11 01:33:05 2018 UTC (5 years, 5 months ago) by dlg
Branch: MAIN
Changes since 1.22: +16 -1 lines
Diff to previous 1.22 (colored)

add ifq_hdatalen for getting the size of the packet at the head of an ifq

this gets the locks right, and returns 0 if there's no packet available.

ok stsp@

Revision 1.22 / (download) - annotate - [select for diffs], Thu Jan 25 14:04:36 2018 UTC (6 years, 4 months ago) by mpi
Branch: MAIN
CVS Tags: OPENBSD_6_4_BASE, OPENBSD_6_4, OPENBSD_6_3_BASE, OPENBSD_6_3
Changes since 1.21: +2 -9 lines
Diff to previous 1.21 (colored)

Assert that ifiq_destroy() is not called with the NET_LOCK() held.

Calling taskq_barrier() on a softnet thread while holding the lock
is clearly a deadlock.

ok visa@, dlg@, bluhm@

Revision 1.21 / (download) - annotate - [select for diffs], Thu Jan 4 11:02:57 2018 UTC (6 years, 4 months ago) by tb
Branch: MAIN
Changes since 1.20: +0 -41 lines
Diff to previous 1.20 (colored)

Back out tx mitigation again because it breaks suspend and resume at
least on x230 and x240. Problem noted by claudio on icb.

ok dlg

Revision 1.20 / (download) - annotate - [select for diffs], Tue Jan 2 07:08:10 2018 UTC (6 years, 5 months ago) by dlg
Branch: MAIN
Changes since 1.19: +42 -1 lines
Diff to previous 1.19 (colored)

reintroduce tx mitigation

to quote the previous commit:

> this replaces ifq_start with code that waits until at least 4 packets
> have been queued on the ifq before calling the drivers start routine.
> if less than 4 packets get queued, the start routine is called from
> a task in a softnet tq.
>
> 4 packets was chosen this time based on testing sephe did in dfly
> which showed no real improvement when bundling more packets.  hrvoje
> popovski tested this on several nics and found an improvement of
> 10 to 20 percent when forwarding across the board.
>
> because some of the ifq's work could be sitting on a softnet tq,
> ifq_barrier now calls taskq_barrier to guarantee any work that was
> pending there has finished.
>
> ok mpi@ visa@

this was backed out because of a race in the net80211 stack that
anton@ hit. mpi@ committed a workaround for it in revision 1.30 of
src/sys/net80211/ieee80211_pae_output.c.

im putting this in again so we can see what breaks next.

Revision 1.19 / (download) - annotate - [select for diffs], Fri Dec 15 01:40:39 2017 UTC (6 years, 5 months ago) by dlg
Branch: MAIN
Changes since 1.18: +1 -4 lines
Diff to previous 1.18 (colored)

ifq_barrier should be callable by any nic, not just MPSAFE ones.

if (when) tx mitigation goes in again, all nics will have deferred
work that will need a barrier to ensure isn't running anymore.

found by bluhm@ when tx mit was in.

Revision 1.18 / (download) - annotate - [select for diffs], Fri Dec 15 01:37:30 2017 UTC (6 years, 5 months ago) by dlg
Branch: MAIN
Changes since 1.17: +165 -1 lines
Diff to previous 1.17 (colored)

add ifiqueues for mp safety and nics with multiple rx rings.

currently there is a single mbuf_queue per interface, which all
rings on a nic shove packets onto. while the list inside this queue
is protected by a mutex, the counters around it (ie, ipackets,
ibytes, idrops) are not. this means updates can be lost, and reading
the statistics is also inconsistent. having a single queue means
that busy rx rings can dominate and then starve the others.

ifiqueue structs are like ifqueue structs. they provide per ring
queues, and independent counters for each ring. when ifdata is read
for userland, these counters are aggregated. having a queue per
ring now allows for per ring backpressure to be applied. MCLGETI
will have it's day again.

right now we assume every interface wants an input queue and
unconditionally provide one. individual interfaces can opt into
more.

im not completely happy about the shape of this atm, but shuffling
it around more makes the diff bigger.

ok visa@

Revision 1.17 / (download) - annotate - [select for diffs], Thu Dec 14 02:40:51 2017 UTC (6 years, 5 months ago) by dlg
Branch: MAIN
Changes since 1.16: +3 -4 lines
Diff to previous 1.16 (colored)

i forgot to convert ifq_barrier_task to cond_signal.

Revision 1.16 / (download) - annotate - [select for diffs], Thu Dec 14 00:45:16 2017 UTC (6 years, 5 months ago) by dlg
Branch: MAIN
Changes since 1.15: +4 -8 lines
Diff to previous 1.15 (colored)

replace the bare sleep state handling in barriers with wait cond code

Revision 1.15 / (download) - annotate - [select for diffs], Tue Nov 14 08:44:11 2017 UTC (6 years, 6 months ago) by dlg
Branch: MAIN
Changes since 1.14: +1 -42 lines
Diff to previous 1.14 (colored)

anton@ reports that ifq tx mitigation breaks iwm somehow.

back it out until i can figure the problem out.

Revision 1.14 / (download) - annotate - [select for diffs], Tue Nov 14 04:08:11 2017 UTC (6 years, 6 months ago) by dlg
Branch: MAIN
Changes since 1.13: +13 -1 lines
Diff to previous 1.13 (colored)

move the adding of an ifqs counters in if_getdata to ifq.c

this keeps the knowledge of ifq locking in ifq.c

ok visa@

Revision 1.13 / (download) - annotate - [select for diffs], Tue Nov 14 00:00:35 2017 UTC (6 years, 6 months ago) by dlg
Branch: MAIN
Changes since 1.12: +42 -1 lines
Diff to previous 1.12 (colored)

reintroduce tx mitigation, like dfly does and like we used to do.

this replaces ifq_start with code that waits until at least 4 packets
have been queued on the ifq before calling the drivers start routine.
if less than 4 packets get queued, the start routine is called from
a task in a softnet tq.

4 packets was chosen this time based on testing sephe did in dfly
which showed no real improvement when bundling more packets.  hrvoje
popovski tested this on several nics and found an improvement of
10 to 20 percent when forwarding across the board.

because some of the ifq's work could be sitting on a softnet tq,
ifq_barrier now calls taskq_barrier to guarantee any work that was
pending there has finished.

ok mpi@ visa@

Revision 1.12 / (download) - annotate - [select for diffs], Fri Jun 2 00:07:12 2017 UTC (7 years ago) by dlg
Branch: MAIN
CVS Tags: OPENBSD_6_2_BASE, OPENBSD_6_2
Changes since 1.11: +27 -19 lines
Diff to previous 1.11 (colored)

be less tricky about when ifq_free is handled.

instead of assuming start routines only run inside the ifq serialiser,
only rely on the serialisation provided by the ifq mtx which is
explicitly used during ifq_deq ops.

ie, free the mbufs in ifq_free at the end of ifq_deq ops instead
of in the ifq_serialiser loop. ifq deq ops arent necessarily called
within the serialiser.

this should fix panics caused by fq codel on top of bce (which calls
bce_start from it's tx completion path instead of ifq_restart).

ok mikeb@

Revision 1.11 / (download) - annotate - [select for diffs], Wed May 3 20:55:29 2017 UTC (7 years, 1 month ago) by mikeb
Branch: MAIN
Changes since 1.10: +12 -1 lines
Diff to previous 1.10 (colored)

Provide a function to dispose of a list of mbufs on dequeue

ifq_mfreeml() is similar to the ifq_mfreem(), but takes an mbuf list
as an argument.  This also lets these functions subtract the number
of packets to be disposed of from the ifq length.

OK dlg

Revision 1.10 / (download) - annotate - [select for diffs], Wed May 3 03:14:32 2017 UTC (7 years, 1 month ago) by dlg
Branch: MAIN
Changes since 1.9: +23 -1 lines
Diff to previous 1.9 (colored)

add ifq_mfreem() so ifq backends can free packets during dequeue.

a goal of the ifq api is to avoid freeing an mbuf while holding a
lock. to acheive this it allowed the backend enqueue operation to
return a single mbuf to be freed. however, mikeb@ is working on a
backend that wants to free packets during dequeue. to support this,
ifq_mfreem queues a packet during dequeue for freeing at the end
of the ifq serialiser.

there's some doco in ifq.h about it.

requested by mikeb@

Revision 1.9 / (download) - annotate - [select for diffs], Tue Mar 7 15:42:02 2017 UTC (7 years, 2 months ago) by mikeb
Branch: MAIN
CVS Tags: OPENBSD_6_1_BASE, OPENBSD_6_1
Changes since 1.8: +22 -6 lines
Diff to previous 1.8 (colored)

Change priq enqueue policy to drop lower priority packets

The new priority queueing enqueue policy is such that when the
aggregate queue depth of an outgoing queue is exceeded we attempt
to find a non-empty queue of packets with lower priority than the
priority of a packet we're trying to enqueue and if there's such
queue, we drop the first packet from it.

This ensures that high priority traffic will almost always find
the place on the queue and low priority bulk traffic gets a better
chance at regulating its throughput.  There's no change in the
behavior if altered priorities are not used (e.g. via "set prio"
Pf directive, VLAN priorities and so on).

With a correction from dlg@, additional tests by dhill@
OK bluhm, mpi

Revision 1.8 / (download) - annotate - [select for diffs], Tue Mar 7 15:16:01 2017 UTC (7 years, 2 months ago) by mikeb
Branch: MAIN
Changes since 1.7: +18 -33 lines
Diff to previous 1.7 (colored)

Convert priority queue lists to mbuf_lists

This simplifies the code quite a bit making it easier to reason about.
dlg@ has begrudgingly submitted to populism, OK bluhm, mpi

Revision 1.7 / (download) - annotate - [select for diffs], Tue Mar 7 01:29:53 2017 UTC (7 years, 2 months ago) by dlg
Branch: MAIN
Changes since 1.6: +19 -25 lines
Diff to previous 1.6 (colored)

deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.

mikeb@ wants priq to be able to drop lower priority packets if the
current one is high. because ifq avoids freeing an mbuf while an
ifq mutex is held, he needs a way for a backend to return an arbitrary
mbuf to drop rather than signal that the current one needs to be
dropped.

this lets the backends return the mbuf to be dropped, which may or
may not be the current one.

to support this ifq_enqueue_try has to be dropped because it can
only signal about the current mbuf. nothing uses it (except
ifq_enqueue), so we can get rid of it. it wasnt even documented.

this diff includes some tweaks by mikeb@ around the statistics
gathered in ifq_enqueue when an mbuf is dropped.

Revision 1.6 / (download) - annotate - [select for diffs], Tue Jan 24 03:57:35 2017 UTC (7 years, 4 months ago) by dlg
Branch: MAIN
Changes since 1.5: +37 -14 lines
Diff to previous 1.5 (colored)

add support for multiple transmit ifqueues per network interface.

an ifq to transmit a packet is picked by the current traffic
conditioner (ie, priq or hfsc) by providing an index into an array
of ifqs. by default interfaces get a single ifq but can ask for
more using if_attach_queues().

the vast majority of our drivers still think there's a 1:1 mapping
between interfaces and transmit queues, so their if_start routines
take an ifnet pointer instead of a pointer to the ifqueue struct.
instead of changing all the drivers in the tree, drivers can opt
into using an if_qstart routine and setting the IFXF_MPSAFE flag.
the stack provides a compatability wrapper from the new if_qstart
handler to the previous if_start handlers if IFXF_MPSAFE isnt set.

enabling hfsc on an interface configures it to transmit everything
through the first ifq. any other ifqs are left configured as priq,
but unused, when hfsc is enabled.

getting this in now so everyone can kick the tyres.

ok mpi@ visa@ (who provided some tweaks for cnmac).

Revision 1.5 / (download) - annotate - [select for diffs], Fri Jan 20 03:48:03 2017 UTC (7 years, 4 months ago) by dlg
Branch: MAIN
Changes since 1.4: +12 -7 lines
Diff to previous 1.4 (colored)

keep output packet counters on the ifq structure.

these copy what is counted on the output path on the ifnet struct,
except ifq counts both packets and bytes when a packet is queued
instead of just the bytes.

all the counters are protected by the ifq mutex except for ifq_errors,
which can be updated safely from inside a start routine because the
ifq machinery serialises them.

ok mpi@

Revision 1.4 / (download) - annotate - [select for diffs], Tue Dec 29 12:35:43 2015 UTC (8 years, 5 months ago) by dlg
Branch: MAIN
CVS Tags: OPENBSD_6_0_BASE, OPENBSD_6_0, OPENBSD_5_9_BASE, OPENBSD_5_9
Changes since 1.3: +12 -6 lines
Diff to previous 1.3 (colored)

store curcpu() in ifq_serializer so we can check it.

this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED()
macro.

ok mpi@

Revision 1.3 / (download) - annotate - [select for diffs], Wed Dec 9 12:07:42 2015 UTC (8 years, 5 months ago) by dlg
Branch: MAIN
Changes since 1.2: +16 -43 lines
Diff to previous 1.2 (colored)

rework ifq_serialise to avoid some atomic ops.

now both the list of work and the flag saying if something is
running the list are protected by a single mutex. it cuts the
number of interlocked ops for an uncontended run of the queue from
5 down to 2.

jmatthew likes it.

Revision 1.2 / (download) - annotate - [select for diffs], Wed Dec 9 03:22:39 2015 UTC (8 years, 5 months ago) by dlg
Branch: MAIN
Changes since 1.1: +92 -27 lines
Diff to previous 1.1 (colored)

rework the if_start mpsafe serialisation so it can serialise arbitrary work

work is represented by struct task.

the start routine is now wrapped by a task which is serialised by the
infrastructure. if_start_barrier has been renamed to ifq_barrier and
is now implemented as a task that gets serialised with the start
routine.

this also adds an ifq_restart() function. it serialises a call to
ifq_clr_oactive and calls the start routine again. it exists to
avoid a race that kettenis@ identified in between when a start
routine discovers theres no space left on a ring, and when it calls
ifq_set_oactive. if the txeof side of the driver empties the ring
and calls ifq_clr_oactive in between the above calls in start, the
queue will be marked oactive and the stack will never call the start
routine again.

by serialising the ifq_set_oactive call in the start routine and
ifq_clr_oactive calls we avoid that race.

tested on various nics
ok mpi@

Revision 1.1 / (download) - annotate - [select for diffs], Tue Dec 8 10:06:12 2015 UTC (8 years, 5 months ago) by dlg
Branch: MAIN

split the interface send queue (struct ifqueue) implementation out.

the intention is to make it more clear what belongs to a transmit
queue and what belongs to an interface.

suggested by and ok mpi@

This form allows you to request diff's between any two revisions of a file. You may select a symbolic revision name using the selection box or you may type in a numeric name using the type-in text box.