Networking applications use the Berkeley socket model to interface with a networking stack that resides in the operating system kernel. This model requires costly context switches between the application and the kernel, as well as memory copies on both the sending and receiving paths. Context switches pollute the TLB and caches and can severely degrade instructions per cycle (IPC) for tens of thousands of cycles. The resulting limit on performance becomes more apparent as network bandwidth doubles every 17-18 months, whereas CPU and DRAM performance doubles only every 26-27 months. For example, the Memcached application spends over 80% of CPU time in the kernel networking stack while using less than 5% of the available network bandwidth.
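To make the cost of the socket model concrete, the following is a minimal sketch of one UDP round trip over the loopback interface using Python's standard `socket` module. The function name `udp_echo_once`, the loopback address, and the 2048-byte buffer are illustrative assumptions and not part of any system evaluated here. Each `sendto` and `recvfrom` call crosses into the kernel at least once and copies the payload between user and kernel buffers.

```python
import socket


def udp_echo_once(payload: bytes) -> bytes:
    """Send one datagram through the kernel stack and echo it back.

    Hypothetical helper for illustration only: both endpoints live in
    the same process and talk over the loopback interface.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Port 0 asks the kernel to choose a free port.
        server.bind(("127.0.0.1", 0))
        server.settimeout(2.0)
        client.settimeout(2.0)
        server_addr = server.getsockname()

        # Kernel crossing: payload is copied from user space into a kernel buffer.
        client.sendto(payload, server_addr)
        # Kernel crossing: payload is copied from the kernel back into user space.
        data, peer = server.recvfrom(2048)

        # The echo repeats both crossings and both copies in the other direction.
        server.sendto(data, peer)
        reply, _ = client.recvfrom(2048)
        return reply
    finally:
        server.close()
        client.close()


if __name__ == "__main__":
    print(udp_echo_once(b"ping"))
```

Even this trivial exchange enters the kernel at least four times on the data path, which is the per-packet overhead that the approaches discussed in this work try to avoid.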
Applications using this model also suffer from a lack of connection locality, because the kernel may process a connection's packets on a different core from the one running the application. Multicore scalability is limited by this lack of locality and by the synchronisation overhead of sharing networking state across cores. To achieve multicore scalability, different parallelisation techniques can be used, such as a run-to-completion model, in which each packet is processed on a single core, or a streaming model, in which application and network cores are separate and communicate by message passing. The streaming model can achieve parallelism within a request, whereas the run-to-completion model attempts to improve temporal locality by processing packets as early as possible.
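The difference between the two models can be sketched in a few lines of Python. This is a structural illustration only: `parse` and `handle` are hypothetical stand-ins for network-stack and application work, and Python threads are not pinned to cores, so the sketch shows the control flow rather than the performance behaviour.

```python
import queue
import threading


def parse(packet: bytes) -> str:
    """Hypothetical stand-in for network-stack processing."""
    return packet.decode("ascii")


def handle(request: str) -> str:
    """Hypothetical stand-in for application logic."""
    return request.upper()


def run_to_completion(packets):
    # One worker carries each packet through both stages before it
    # picks up the next packet, so the data stays with that worker.
    return [handle(parse(p)) for p in packets]


def streaming(packets):
    # The network stage and the application stage run on separate
    # threads and communicate only by passing messages over a queue.
    requests = queue.Queue()
    responses = []

    def network_stage():
        for p in packets:
            requests.put(parse(p))
        requests.put(None)  # sentinel: no more requests

    def application_stage():
        while True:
            req = requests.get()
            if req is None:
                break
            responses.append(handle(req))

    threads = [
        threading.Thread(target=network_stage),
        threading.Thread(target=application_stage),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return responses


if __name__ == "__main__":
    sample = [b"get a", b"set b"]
    print(run_to_completion(sample))
    print(streaming(sample))
```

Both functions produce the same responses. In the streaming version the two stages can overlap, so one request can be parsed while another is handled, at the cost of a queue hand-off between stages.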
This research aims to evaluate the impact of using the kernel bypass technologies listed above to accelerate network-bound applications. Some research questions include: