A Scheduling Method for Avoiding Kernel Lock Thrashing on Multi-cores.

However, for the guest operating system on a virtual machine (VM), such an assumption cannot be guaranteed, since virtual CPUs of VMs share limited physical cores. ... Such a virtual time discontinuity problem leads to significant inefficiency for lock and interrupt handling, which rely on the immediate availability of CPUs whenever the operating system requires computation ... The dual insertion policies were originally proposed by Qureshi et al. for a single core cache [16], and later extended for multi-core shared caches [8]. ...

doi:10.1109/micro.2014.49 dblp:conf/micro/AhnPH14 fatcat:tncat37lxbayploevptpnnaejq

Designing efficient and scalable sparse linear algebra kernels on modern multi-GPU based HPC systems is a daunting task due to significant irregular memory references and workload imbalance across the ... By applying these techniques, our experiments on the NVIDIA multi-GPU supernode V100-DGX-1 and DGX-2 systems demonstrate that our design can achieve on average 3.53x (up to 9.86x) speedup on a DGX-1 system ... In this case, determining the optimal number of tasks becomes a new trade-off between fine-grained task scheduling and long kernel scheduling overhead. ...

arXiv:2012.06959v1 fatcat:am7guw7i5fchxafrkp34plwvky

We opportunistically leverage the existence of such locks by modifying the lock admission policy so as to intentionally limit the number of threads circulating over the lock in a given period. ... We borrow the concept of swapping from the field of memory management and intentionally impose concurrency restriction (CR) if a lock is oversubscribed. ... Bishop for useful discussions. ...

arXiv:1511.06035v7 fatcat:led5wcfcfnaoznmdxjrbhrxrwa

For example, most existing system VMs resort to gang scheduling a guest OS's virtual processors (VCPUs) to avoid OS synchronization overhead. ... We propose one such policy that logically partitions the CMP cores between guest VMs. ... We also thank Saisanthosh Balakrishnan and the anonymous reviewers for their comments on this paper. ...

doi:10.1145/1152154.1152176 dblp:conf/IEEEpact/WellsCS06 fatcat:32z2wvc7kzcubpzikzwqf2iptu

Recent research on multi-core database architectures has made the argument that, when possible, database systems should abandon the use of latches in favor of latch-free algorithms. ... Our findings indicate that the argument for latch-free algorithms' superior scalability is far more nuanced than the current state-of-the-art in multi-core database architectures suggests. ... We thank Joseph Hellerstein, Hideaki Kimura, Justin Levandoski, Ippokratis Pandis, Julian Shun, and the anonymous CIDR 2017 reviewers for their insightful comments on earlier versions of this paper. ...

dblp:conf/cidr/FaleiroA17 fatcat:ph7hhvys65fbjo3pp5ft3h7y6u

Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. ... The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor ... on a multi-chip module. ...

doi:10.1016/j.parco.2011.02.001 fatcat:dc6yotxbj5htteiv2z43o4ooje

to the Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation is sponsored by USENIX. ... Thanks to Collin Lee for giving feedback on design, and to Yilong Li for help with debugging the RAMCloud networking stack. ... This work was supported by C-FAR (one of six centers of STARnet, a Semiconductor Research Corporation program, sponsored by MARCO and DARPA) and by the industrial affiliates of the Stanford Platform Lab ...

dblp:conf/osdi/QinLSKO18 fatcat:x32cf2z37vaxrcdhirnihdcily

To bridge this gap, we have significantly upgraded the LTE-Sim framework by implementing a concurrent scheduling algorithm, namely the Multi-Master Scheduler, aimed at efficiently handling events in a ... In this context, multi-core/multi-processor simulation tools can accelerate their activities by drastically reducing the time required to simulate complex scenarios. ... Each processor has 8 CPU-cores (for a total of 32 CPU-cores) that share a 10MB L3 cache (5118KB per each 4-cores set), and each core has a 512KB private L2 cache. ...

doi:10.1109/waina.2013.202 dblp:conf/aina/PellegriniP13 fatcat:7jzewxnat5clpkgr3btb42ccpi

Since Moore's law is now delivering multiple cores instead of faster processors, future systems must either bear a relatively higher cost for abstractions or use some cores to help tolerate abstraction ... It introduces Cache-friendly Asymmetric Buffering (CAB), a lock-free ring-buffer that implements efficient communication between application and analysis threads. ... It uses a sentinel value (NULL) to avoid concurrent access of the queue head and tail indices, and forces a delay between the consumer and producer to avoid cache line thrashing. ...

doi:10.1145/1640089.1640101 dblp:conf/oopsla/HaABM09 fatcat:lzkkrc3w4ne4bjznhiwaie36jm

Since Moore's law is now delivering multiple cores instead of faster processors, future systems must either bear a relatively higher cost for abstractions or use some cores to help tolerate abstraction ... It introduces Cache-friendly Asymmetric Buffering (CAB), a lock-free ring-buffer that implements efficient communication between application and analysis threads. ... It uses a sentinel value (NULL) to avoid concurrent access of the queue head and tail indices, and forces a delay between the consumer and producer to avoid cache line thrashing. ...

doi:10.1145/1639949.1640101 fatcat:i7dgkfveqnex3ms5bvefsqn3u4

We will create a multiuser environment on a uniprocessor system; this can be achieved when there is a separate kernel for each user in the same operating system. ... At present, a personal computer has far more processing power than a single-user system requires. ... On a multiprocessor (including a multi-core system), the threads or tasks will actually run at the same time, with each processor or core running a particular thread or task. ...

doi:10.9790/0661-0744652 fatcat:klqyxqbvyncpbcm4gh6rui4uvq

Powered by our newly designed kernel-level monitoring module and page migration engine, memos can dynamically optimize the data placement at the memory hierarchy in terms of the on-line memory patterns ... In this paper, we introduce memos, which can schedule memory resources over the entire memory hierarchy including cache, channels, main memory comprising DRAM and NVM simultaneously. ... Furthermore, we deploy a practical emulation platform for hybrid DRAM-NVM on a real multi-core machine by using the channel-partitioning approach [35]. ...

arXiv:1703.07725v1 fatcat:wzdseqs4pbc7nozfejrllqszl4

We present a comprehensive survey on real-time issues in virtualization for embedded systems, covering popular virtualization systems including KVM, Xen, L4 and others. ... that are not specific to any virtualization approach, but we believe they are of sufficient importance to dedicate separate ... [45] proposed a method for improving I/O performance of Xen on a multicore processor. ...

doi:10.4236/jsea.2012.54033 fatcat:iiqszwe3brhjxl4ycyf6ynbp2u

The next decade will afford us computer chips with 100's to 1,000's of cores on a single piece of silicon. ... Contemporary operating systems have been designed to operate on a single core or small number of cores and hence are not well suited to manage and provide operating system services at such large scale. ... We thank Robert Morris and Frans Kaashoek for feedback on this work. We also thank Charles Gruenwald for help on fos. ...

doi:10.1145/1531793.1531805 fatcat:vdak4y4dt5cavlcqj7s7q4p3bu

Finally, an efficient parallel run-time system is built by employing lock-free design principles from top to bottom to support multi-threaded execution on multi-core processors. ... Multi-core architectures provide a viable solution for building high-performance parsers for application protocols. ... on a multi-core server. ...

doi:10.1109/icpads.2011.37 dblp:conf/icpads/ZhangWHT11 fatcat:t2vg3k5i5rbuvodtuqd4voipm4

Micro-Sliced Virtual Processors to Hide the Effect of Discontinuous CPU Availability for Consolidated Systems

Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures [article]

Malthusian Locks [article]

Hardware support for spin management in overcommitted virtual machines

Latch-free Synchronization in Database Systems: Silver Bullet or Fool's Gold?

Gyrokinetic particle-in-cell optimization on emerging multi- and manycore platforms

Arachne: Core-Aware Thread Management

Multi-threaded Simulation of 4G Cellular Systems within the LTE-Sim Framework

A concurrent dynamic analysis framework for multicore hardware

A concurrent dynamic analysis framework for multicore hardware

Performance analysis of N-computing device under various load conditions

Memos: Revisiting Hybrid Memory Management in Modern Operating System [article]

A State-of-the-Art Survey on Real-Time Issues in Embedded Systems Virtualization

Factored operating systems (fos)

Building High-Performance Application Protocol Parsers on Multi-core Architectures
