Compiler and Runtime Support for Big Data Systems
The availability of enormous amounts of data has led to the proliferation of large-scale, data-intensive platforms. Big Data analytics has quickly become a key player in transforming science and society, and scalable systems that can efficiently analyze data at massive scale are in high demand.
The mainstream approach to scalability is to enable distributed processing using a large number of machines in clusters or in the cloud. An input dataset is split among machines so that many processors can work on a computing task simultaneously. These Big Data systems are often developed in managed languages, such as Java, C#, or Scala, primarily because these languages 1) enable fast development cycles thanks to easy use and automatic memory management and 2) provide abundant library suites and community support.
However, a managed runtime comes at a cost: memory management in Big Data systems is often prohibitively expensive. These systems commonly suffer from severe memory problems that lead to poor performance and scalability. Allocating and deallocating a sea of data objects puts a severe strain on the runtime system, resulting in high memory management overhead, prolonged execution time, and an inability to process datasets of even moderate sizes.
While a large body of techniques exists to improve Big Data performance, they focus on horizontal scaling, i.e., scaling data processing out to a large cluster of machines under the assumption that the processing task on each machine already performs satisfactorily. In many cases, however, the data-processing program running on a worker machine suffers from extensive memory pressure: the execution hits the heap's limit soon after it starts, and the system struggles to find allocation space for new objects throughout the execution. Consequently, data-intensive applications crash with out-of-memory errors at a surprisingly early stage. Even if a program runs successfully to the end, its execution is usually dominated by garbage collection (GC), which can take 40-60% of a job's end-to-end execution time.
Unlike the mainstream approach, which tackles the scalability problem by adding resources (e.g., machines or memory), we approach it from a different perspective. My research focuses on systematically improving the performance of each data-processing program, i.e., enabling vertical scaling. Optimizing the program running on each worker node can yield large performance gains across a data center: with reduced data-processing time on each node, the data center achieves higher throughput and lower energy consumption, leading to economic benefits.
Data Plane for Resource-Disaggregated Data Centers
Advances in hardware and networking technologies have paved the road for the rise of a new datacenter architecture: resource disaggregation. As remote access reaches microsecond latency, it is now practical to segregate resources of the same type (e.g., CPU, memory, storage) into dedicated pools. In a resource-disaggregated cluster, any device can access any resource, regardless of whether the device itself carries that resource. Compared to the traditional monolithic-server architecture, disaggregating hardware resources offers benefits such as higher resource utilization, finer-grained fault tolerance, and improved hardware elasticity. However, this paradigm shift poses a new challenge to achieving satisfactory performance: cloud applications running in a memory-disaggregated environment suffer significant slowdowns (e.g., higher than 10×).
The slowdown is due to the current data plane, which stacks up independently designed modules and does not account for application semantics, such as objects and data structures, or the distinction between application accesses and garbage-collection accesses. As a result, it incurs an excessive number of data movements between compute nodes and memory pools, magnifying the data transfer cost. The key insight is that, to significantly reduce the cost of data round-trips, data plane modules must coordinate with each other: low-level kernel modules such as the paging and swap system must explicitly account for application semantics, while high-level applications and runtimes must be aware of the underlying disaggregation architecture. Hence, it is critical to redesign the data plane holistically, with the primary goal of reducing the number of unnecessary remote accesses, as opposed to reducing the latency of each access, as mainstream approaches in hardware and networking do. This research spans layers of the computing stack, including but not limited to applications, runtime systems, and kernel modules for memory disaggregation, to embrace the new architecture and move beyond the status quo.
High Performance Memory Allocator
Modern workloads with large heap sizes put significant pressure on the CPU to perform virtual-to-physical address translation, especially when using traditional 4KB hardware pages. Modern OSes now transparently support huge pages (e.g., 2MB or 1GB), which improve system performance through higher TLB hit rates thanks to the larger address-space coverage of each TLB entry. To fully reap the benefits of huge pages, user-level memory allocators must be huge-page-aware.
However, this shift creates a dilemma for memory allocators. On the one hand, requesting memory from the OS at huge-page granularity yields higher performance (i.e., more TLB hits and fewer page-table walks) but risks high fragmentation (i.e., wasted memory on a huge page). On the other hand, limiting fragmentation by breaking a huge page into multiple 4KB pages reduces huge-page coverage, lowering performance. Finding a sweet spot that delivers high performance while limiting fragmentation is a key challenge for memory allocator designers.
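The two sides of the dilemma can be quantified with simple page arithmetic. The sketch below uses illustrative numbers (not from the text): for a span holding a given amount of live data, it computes the internal fragmentation (bytes rounded up to the page boundary but unused) and the number of TLB entries needed to cover the live data, under 4KB pages versus one 2MB huge page.

```c
/* Sketch quantifying the huge-page dilemma with illustrative numbers.
 * Backing a span with a 2MB huge page maximizes TLB coverage but rounds
 * memory usage up to 2MB; backing it with 4KB pages wastes at most one
 * 4KB page but needs up to 512 TLB entries for the same range. */
#include <assert.h>
#include <stddef.h>

#define PAGE_4K (4UL * 1024)
#define PAGE_2M (2UL * 1024 * 1024)

/* Bytes wasted when `live` bytes are backed by pages of size `page`. */
static size_t wasted(size_t live, size_t page) {
    return (live + page - 1) / page * page - live;
}

/* TLB entries needed to cover `live` bytes at page size `page`. */
static size_t tlb_entries(size_t live, size_t page) {
    return (live + page - 1) / page;
}
```

For example, a span with 100KB of live objects wastes nothing under 4KB pages but occupies 25 TLB entries; the same span on one 2MB huge page needs a single TLB entry at the cost of roughly 1.9MB of internal fragmentation. A huge-page-aware allocator must decide, per span, which side of this tradeoff to take.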