
ASPLOS 2022 Compiler and Programming Language Technology Insights

ASPLOS (Architectural Support for Programming Languages and Operating Systems) is the premier forum for interdisciplinary systems research, intersecting computer architecture, hardware and emerging technologies, programming languages and compilers, operating systems, and networking. The 27th edition of the conference was held in Lausanne, Switzerland.

This report, prepared by technical experts (Rayson Ho, Tomasz Czajkowski, Maziar Goudarzi, Reza Azimi) from the Toronto Heterogeneous Compiler Lab, summarizes key insights into the trends and development of compiler and programming-language technologies presented at ASPLOS 2022.

DOTA: detect and omit weak attentions for scalable transformer acceleration [1]

Paper PDF:

https://dl.acm.org/doi/pdf/10.1145/3503222.3507738

Affiliation:

UC Santa Barbara

Abstract:



Transformer Neural Networks have demonstrated leading performance in many applications spanning over language understanding, image processing, and generative modeling. Despite the impressive performance, long-sequence Transformer processing is expensive due to quadratic computation complexity and memory consumption of self-attention. In this paper, we present DOTA, an algorithm architecture co-design that effectively addresses the challenges of scalable Transformer inference. Based on the insight that not all connections in an attention graph are equally important, we propose to jointly optimize a lightweight Detector with the Transformer model to accurately detect and omit weak connections during runtime. Furthermore, we design a specialized system architecture for end-to-end Transformer acceleration using the proposed attention detection mechanism. Experiments on a wide range of benchmarks demonstrate the superior performance of DOTA over other solutions. In summary, DOTA achieves 152.6× and 4.5× performance speedup and orders of magnitude energy-efficiency improvements over GPU and customized hardware, respectively.
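
The detect-and-omit idea is easy to state in code. Below is a minimal NumPy sketch — our illustration only: the paper's Detector is a lightweight network trained jointly with the Transformer, whereas here a crude low-precision score estimate stands in for it, and `keep_ratio`/`low_bits` are hypothetical knobs:

```python
import numpy as np

def dota_style_attention(q, k, v, keep_ratio=0.25, low_bits=4):
    # 1. Detector pass: estimate attention scores in low precision.
    step = 2.0 ** (low_bits - 1)
    quant = lambda x: np.round(x * step) / step          # crude quantization
    approx = quant(q) @ quant(k).T
    # 2. Keep only the strongest connections per query row.
    n, d = q.shape
    k_keep = max(1, int(n * keep_ratio))
    kept = np.argsort(-approx, axis=1)[:, :k_keep]
    # 3. Exact attention restricted to the detected connections.
    out = np.zeros_like(v)
    for i in range(n):
        idx = kept[i]
        s = (q[i] @ k[idx].T) / np.sqrt(d)               # exact scores, sparse
        w = np.exp(s - s.max())
        w /= w.sum()                                     # softmax over kept entries
        out[i] = w @ v[idx]                              # weighted sum of V rows
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
print(dota_style_attention(q, k, v).shape)               # (16, 8)
```

The weak connections never enter the exact score or softmax computation, which is where the quadratic cost lives; the hardware co-design then exploits this sparsity end to end.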


BiSon-e: a lightweight and high-performance accelerator for narrow integer linear algebra computing on the edge [2]

Paper PDF:

https://dl.acm.org/doi/pdf/10.1145/3503222.3507746

Affiliation:

Barcelona Supercomputing Center

Abstract:



Linear algebra computational kernels based on byte and sub-byte integer data formats are at the base of many classes of applications, ranging from Deep Learning to Pattern Matching. Porting the computation of these applications from cloud to edge and mobile devices would enable significant improvements in terms of security, safety, and energy efficiency. However, despite their low memory and energy demands, their intrinsically high computational intensity makes the execution of these workloads challenging on highly resource-constrained devices. In this paper, we present BiSon-e, a novel RISC-V based architecture that accelerates linear algebra kernels based on narrow integer computations on edge processors by performing Single Instruction Multiple Data (SIMD) operations on off-the-shelf scalar Functional Units (FUs). Our novel architecture is built upon the binary segmentation technique, which allows to significantly reduce the memory footprint and the arithmetic intensity of linear algebra kernels requiring narrow data sizes. We integrate BiSon-e into a complete System-on-Chip (SoC) based on RISC-V, synthesized and Place&Routed in 65nm and 22nm technologies, introducing a negligible 0.07% area overhead with respect to the baseline architecture. Our experimental evaluation shows that, when computing the Convolution and Fully-Connected layers of the AlexNet and VGG-16 Convolutional Neural Networks (CNNs) with 8-, 4-, and 2-bit, our solution gains up to 5.6×, 13.9× and 24× in execution time compared to the scalar implementation of a single RISC-V core, and improves the energy efficiency of string matching tasks by 5× when compared to a RISC-V-based Vector Processing Unit (VPU).
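
Binary segmentation is the enabling trick: several narrow operands are packed into one wide scalar word so that a single multiply on an off-the-shelf FU behaves like a SIMD operation. A minimal Python sketch of the arithmetic (illustrative 16-bit fields, not BiSon-e's actual datapath):

```python
# Pack several narrow unsigned integers into one 64-bit word so that a
# single scalar multiply computes all element-by-scalar products at once.
FIELD = 16                       # bits per segment: wide enough for a_i * b
MASK = (1 << FIELD) - 1

def pack(values):
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * FIELD)
    return word

def simd_mul_scalar(values, b):
    # One wide multiply replaces len(values) narrow multiplies, provided
    # each product a_i * b fits in FIELD bits (no carry into the next field).
    product = pack(values) * b
    return [(product >> (i * FIELD)) & MASK for i in range(len(values))]

a = [7, 42, 100, 255]            # 8-bit operands
b = 123                          # 8-bit scalar: every product fits in 16 bits
assert simd_mul_scalar(a, b) == [x * b for x in a]
```

Because the packed word also occupies a single register and a single memory access, the technique cuts both the arithmetic intensity and the memory footprint of narrow-integer kernels, which is where the paper's speedups come from.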



Critical Slice Prefetching [3]

Paper PDF:

https://people.ucsc.edu/~hlitz/papers/crisp.pdf

Abstract:



The high access latency of DRAM continues to be a performance challenge for contemporary microprocessor systems. Prefetching is a well-established technique to address this problem, however, existing implemented designs fail to provide any performance benefits in the presence of irregular memory access patterns. The hardware complexity of prior techniques that can predict irregular memory accesses such as runahead execution has proven untenable for implementation in real hardware. We propose a lightweight mechanism to hide the high latency of irregular memory access patterns by leveraging criticality-based scheduling. In particular, our technique executes delinquent loads and their load slices as early as possible, hiding a significant fraction of their latency. Furthermore, we observe that the latency induced by branch mispredictions and other high latency instructions can be hidden with a similar approach. Our proposal only requires minimal hardware modifications by performing memory access classification, load and branch slice extraction, as well as priority analysis exclusively in software. As a result, our technique is feasible to implement, introducing only a simple new instruction prefix while requiring minimal modifications of the instruction scheduler. Our technique increases the IPC of memory-latency-bound applications by up to 38% and by 8.4% on average.
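
The software side of the proposal hinges on extracting the backward slice of a delinquent load so it can be scheduled early. A minimal sketch of slice extraction over a toy instruction list (our simplification; the paper operates on real binaries with profile-guided load classification):

```python
# Given a linear sequence of instructions with def/use information, compute
# the backward slice of a delinquent load: the minimal set of producers that
# must execute early so the load's address is available as soon as possible.
def backward_slice(instrs, target_idx):
    """instrs: list of (dest, sources) tuples; returns indices in the slice."""
    needed = set(instrs[target_idx][1])     # registers the load depends on
    slice_idx = {target_idx}
    for i in range(target_idx - 1, -1, -1):
        dest, sources = instrs[i]
        if dest in needed:                  # producer of a needed value
            slice_idx.add(i)
            needed.discard(dest)
            needed.update(sources)          # chase its inputs in turn
    return sorted(slice_idx)

program = [
    ("r1", ["r0"]),          # 0: r1 = base pointer arithmetic
    ("r2", []),              # 1: unrelated work
    ("r3", ["r1"]),          # 2: r3 = index computation
    ("r4", ["r2"]),          # 3: unrelated work
    ("r5", ["r3", "r1"]),    # 4: delinquent load: r5 = mem[r1 + r3]
]
print(backward_slice(program, 4))   # [0, 2, 4] -- hoist these, skip 1 and 3
```

Since classification, slice extraction, and prioritization all happen in software, the hardware only needs to honor the new priority prefix in its scheduler, which is why the authors argue the design is implementable.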


SparseCore: Stream ISA and Processor Specialization for Sparse Computation [4]

Paper PDF:

https://dl.acm.org/doi/pdf/10.1145/3503222.3507705

Affiliation:

University of Southern California

Abstract:



Computation on sparse data is becoming increasingly important for many applications. Recent sparse computation accelerators are designed for specific algorithm/application, making them inflexible with software optimizations. This paper proposes SparseCore, the first general-purpose processor extension for sparse computation that can flexibly accelerate complex code patterns and fast-evolving algorithms. We extend the instruction set architecture (ISA) to make stream or sparse vector first-class citizens, and develop efficient architectural components to support the stream ISA. The novel ISA extension intrinsically operates on streams, realizing both efficient data movement and computation. The simulation results show that SparseCore achieves significant speedups for sparse tensor computation and graph pattern computation.
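
For a concrete feel of what a stream ISA accelerates: the inner kernel of sparse tensor algebra and graph pattern matching is merging or intersecting sorted index streams. A minimal Python sketch of the intersection primitive (our illustration of the operation itself, not SparseCore's hardware interface):

```python
# Intersect two sorted index streams -- the core primitive behind sparse
# dot products (matching nonzero coordinates) and graph pattern matching
# (finding common neighbors).
def stream_intersect(a, b):
    """a, b: sorted index lists (e.g., nonzero columns or adjacency lists)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Common neighbors of vertices u and v (e.g., for triangle counting):
neighbors_u = [1, 3, 5, 8, 9]
neighbors_v = [2, 3, 8, 10]
print(stream_intersect(neighbors_u, neighbors_v))   # [3, 8]
```

Making such streams architecturally first-class lets the hardware overlap the pointer-chasing data movement with the comparison logic, instead of serializing both through a general-purpose pipeline.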



AStitch: Enabling a New Multi-dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures [5]

论文PDF链接:

https://dl.acm.org/doi/pdf/10.1145/3503222.3507723

Affiliation:

Alibaba, Tsinghua University, University of Sydney

Abstract:



This work reveals that memory-intensive computation is a rising performance-critical factor in recent machine learning models. Due to a unique set of new challenges, existing ML optimizing compilers cannot perform efficient fusion under complex two-level dependencies combined with just-in-time demand. They face the dilemma of either performing costly fusion due to heavy redundant computation, or skipping fusion, which results in a massive number of kernels. Furthermore, they often suffer from low parallelism due to the lack of support for real-world production workloads with irregular tensor shapes. To address these rising challenges, we propose AStitch, a machine learning optimizing compiler that opens a new multi-dimensional optimization space for memory-intensive ML computations. It systematically abstracts four operator-stitching schemes while considering multi-dimensional optimization objectives, tackles complex computation graph dependencies with novel hierarchical data reuse, and efficiently processes various tensor shapes via adaptive thread mapping. Finally, AStitch provides just-in-time support incorporating our proposed optimizations for both ML training and inference. Although AStitch serves as a stand-alone compiler engine that is portable to any version of TensorFlow, its basic ideas can be generally applied to other ML frameworks and optimizing compilers. Experimental results show that AStitch can achieve an average of 1.84× speedup (up to 2.73×) over the state-of-the-art Google XLA solution across five production workloads. We also deploy AStitch onto a production cluster for ML workloads with thousands of GPUs. The system has been in operation for more than 10 months and saves about 20,000 GPU hours for 70,000 tasks per week.
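
The dilemma the paper targets shows up even in a toy example: a chain of memory-intensive elementwise operators either launches one kernel per op, materializing every intermediate tensor, or is stitched into a single pass. A minimal NumPy sketch (our toy model only; AStitch's actual contribution is choosing among its four stitching schemes with hierarchical data reuse and adaptive thread mapping):

```python
import numpy as np

x = np.random.randn(1 << 16).astype(np.float32)

def unfused(x):
    t1 = x * 1.5          # "kernel" 1: writes t1 to memory
    t2 = np.tanh(t1)      # "kernel" 2: re-reads t1, writes t2
    return t2 + 1.0       # "kernel" 3: re-reads t2, writes the result

def stitched(x):
    # One traversal of the data; intermediates live only in per-element
    # locals (standing in for registers/shared memory on a GPU).
    out = np.empty_like(x)
    for i in range(x.size):
        t = x[i] * 1.5
        out[i] = np.tanh(t) + 1.0
    return out

assert np.allclose(unfused(x), stitched(x), atol=1e-6)
```

For memory-bound operators the unfused chain pays two extra round trips to memory per element; stitching removes them, which is why the gains grow with the length of the elementwise chain.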


TaskStream: Accelerating Task-Parallel Workloads by Recovering Program Structure [6]

Paper PDF:

https://dl.acm.org/doi/pdf/10.1145/3503222.3507706

Affiliation:

University of California

Abstract:



Reconfigurable accelerators, like CGRAs and dataflow architectures, have come to prominence for addressing data-processing problems. However, they are largely limited to workloads with regular parallelism, precluding their applicability to prevalent task-parallel workloads. Reconfigurable architectures and task parallelism seem to be at odds, as the former requires repetitive and simple program structure, and the latter breaks program structure to create small, individually scheduled program units. Our insight is that if tasks and their potential for communication structure are first-class primitives in the hardware, it is possible to recover program structure with extremely low overhead. We propose a task execution model for accelerators called TaskStream, which annotates task dependences with information sufficient to recover inter-task structure. TaskStream enables work-aware load balancing, recovery of pipelined inter-task dependences, and recovery of inter-task read sharing through multicasting. We apply TaskStream to a reconfigurable dataflow architecture, creating a seamless hierarchical dataflow model for task-parallel workloads. We compare our accelerator, Delta, with an equivalent static-parallel design. Overall, we find that our execution model can improve performance by 2.2× with only 3.6% area overhead, while alleviating the programming burden of managing task distribution.
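
To make the execution model concrete, here is a minimal Python sketch in which tasks carry dependence annotations and a scheduler assigns ready tasks to the least-loaded worker. This is our simplification of work-aware load balancing; the `Task` fields and `schedule` function are hypothetical, not the paper's interface:

```python
import heapq

class Task:
    def __init__(self, tid, work, deps=(), shared_input=None):
        self.tid, self.work, self.deps = tid, work, set(deps)
        self.shared_input = shared_input   # annotation that could drive multicast

def schedule(tasks, n_workers=2):
    loads = [(0, w) for w in range(n_workers)]   # (accumulated work, worker id)
    heapq.heapify(loads)
    done, assignment = set(), {}
    pending = list(tasks)
    while pending:
        ready = [t for t in pending if t.deps <= done]
        for t in ready:
            load, w = heapq.heappop(loads)       # pick the least-loaded worker
            assignment[t.tid] = w
            heapq.heappush(loads, (load + t.work, w))
            done.add(t.tid)                      # treat scheduled as complete
            pending.remove(t)                    # (a toy-model simplification)
    return assignment

tasks = [Task("a", 5), Task("b", 1, deps=["a"]), Task("c", 4, deps=["a"]),
         Task("d", 2, deps=["b", "c"], shared_input="graph")]
print(schedule(tasks))
```

The point of the annotations is that the hardware can recover structure — here, that "b" and "c" pipeline behind "a" and that "d" could multicast-read a shared input — without the program being statically parallel.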



Software-Defined Address Mapping: A Case on 3D Memory [7]

Paper PDF:

https://dl.acm.org/doi/pdf/10.1145/3503222.3507774

Affiliation:

University of Pennsylvania, University of Wisconsin-Madison

Abstract:


3D-stacking memory such as High-Bandwidth Memory (HBM) and Hybrid Memory Cube (HMC) provides orders of magnitude more bandwidth and significantly increased channel-level parallelism (CLP) due to its new parallel memory architecture. However, it is challenging to fully exploit the abundant CLP for performance as the bandwidth utilization is highly dependent on address mapping in the memory controller. Unfortunately, CLP is very sensitive to a program's data access pattern, which is not made available to OS/hardware by existing mechanisms. In this work, we address these challenges with software-defined address mapping (SDAM) that, for the first time, enables user programs to obtain direct control of the low-level memory hardware in a more intelligent and fine-grained manner. In particular, we develop new mechanisms that can effectively communicate a program's data access properties to the OS and hardware and to use it to control data placement in hardware. To guarantee correctness and reduce overhead in storage and performance, we extend the Linux kernel and C-language memory allocators to support multiple address mappings. For advanced system optimization, we develop machine learning methods that can automatically identify access patterns of major variables in a program and cluster those with similar access patterns to reduce the overhead for SDAM. We demonstrate the benefits of our design on a real system prototype, comprising (1) a RISC-V processor, near-memory accelerators, and HBM modules using a Xilinx FPGA platform, and (2) modified Linux and glibc. Our evaluation on standard CPU benchmarks and data-intensive benchmarks (for both CPU and accelerators) demonstrates 1.41×, 1.84× speedup on CPU and 2.58× on near-memory accelerators in our system with SDAM compared to a baseline system that uses a fixed address mapping.
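
The sensitivity that motivates SDAM is easy to demonstrate: which address bits select the channel determines whether a strided access pattern spreads across channels or serializes on one. A toy Python model (our illustration only; the real system changes mappings per allocation through the kernel and allocator):

```python
# Toy address-to-channel mapping: 8 channels selected by 3 address bits
# starting at position `bit`.
def channel(addr, bit):
    return (addr >> bit) & 0x7

def channels_touched(stride, n, bit):
    """How many distinct channels a strided stream of n accesses hits."""
    return len({channel(i * stride, bit) for i in range(n)})

stride = 4096                                 # e.g., walking a matrix column
print(channels_touched(stride, 64, bit=6))    # low bits select: 1 channel (serialized)
print(channels_touched(stride, 64, bit=12))   # stride-aligned bits: 8 channels (full CLP)
```

A fixed hardware mapping must pick one `bit` position for every program; SDAM's point is that the right choice depends on each variable's access pattern, which only the program knows.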



A Full-Stack Search Technique for Domain Optimized Deep Learning Accelerators [8]

Paper PDF:

https://dl.acm.org/doi/pdf/10.1145/3503222.3507767

Affiliation:

Google Brain

Abstract:



The rapidly-changing deep learning landscape presents a unique opportunity for building inference accelerators optimized for specific datacenter-scale workloads. We propose Full-stack Accelerator Search Technique (FAST), a hardware accelerator search framework that defines a broad optimization environment covering key design decisions within the hardware-software stack, including hardware datapath, software scheduling, and compiler passes such as operation fusion and tensor padding. In this paper, we analyze bottlenecks in state-of-the-art vision and natural language processing (NLP) models, including EfficientNet [91] and BERT [19], and use FAST to design accelerators capable of addressing these bottlenecks. FAST-generated accelerators optimized for single workloads improve Perf/TDP by 3.7× on average across all benchmarks compared to TPU-v3. A FAST-generated accelerator optimized for serving a suite of workloads improves Perf/TDP by 2.4× on average compared to TPU-v3. Our return on investment analysis shows that FAST-generated accelerators can potentially be practical for moderate-sized datacenter deployments.
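
The essence of full-stack search is that hardware parameters and compiler decisions are optimized jointly against a Perf/TDP objective. A minimal search-loop sketch — the cost model and every parameter name and constant below are made up for illustration; FAST evaluates candidates with real performance models over a far larger space:

```python
import itertools

SPACE = {
    "pe_array":    [(16, 16), (32, 32), (64, 64)],   # hardware datapath shape
    "sram_kb":     [256, 512, 1024],                 # on-chip buffer size
    "fuse_ops":    [False, True],                    # compiler pass: op fusion
    "pad_tensors": [False, True],                    # compiler pass: padding
}

def perf_per_tdp(cfg):
    pes = cfg["pe_array"][0] * cfg["pe_array"][1]
    perf = (pes
            * (1.3 if cfg["fuse_ops"] else 1.0)      # fusion cuts memory stalls
            * (1.1 if cfg["pad_tensors"] else 1.0)   # padding keeps PEs busy
            * min(1.0, cfg["sram_kb"] / 512))        # starved below 512 KB
    tdp = 10 + 0.02 * pes + 0.01 * cfg["sram_kb"]    # watts
    return perf / tdp

best = max((dict(zip(SPACE, c)) for c in itertools.product(*SPACE.values())),
           key=perf_per_tdp)
print(best, round(perf_per_tdp(best), 2))
```

Note how the best datapath shape depends on whether fusion is enabled and how much SRAM backs it: searching hardware and compiler knobs separately would miss exactly these interactions.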



Understanding and Exploiting Optimal Function Inlining [9]

Paper PDF:

https://dl.acm.org/doi/pdf/10.1145/3503222.3507744

Abstract:



Inlining is a core transformation in optimizing compilers. It replaces a function call (call site) with the body of the called function (callee). It helps reduce function call overhead and binary size, and more importantly, enables other optimizations. The problem of inlining has been extensively studied, but it is far from being solved; predicting which inlining decisions are beneficial is nontrivial due to interactions with the rest of the compiler pipeline. Previous work has mainly focused on designing heuristics for better inlining decisions and has not investigated optimal inlining, i.e., exhaustively finding the optimal inlining decisions. Optimal inlining is necessary for identifying and exploiting missed opportunities and evaluating the state of the art. This paper fills this gap through an extensive empirical analysis of optimal inlining using the SPEC2017 benchmark suite. Our novel formulation drastically reduces the inlining search space size (from 2^349 down to 2^25) and allows us to exhaustively evaluate all inlining choices on 1,135 SPEC2017 files. We show a significant gap between the state-of-the-art strategy in LLVM and optimal inlining when optimizing for binary size, an important, deterministic metric independent of workload (in contrast to performance, another important metric). Inspired by our analysis, we introduce a simple, effective autotuning strategy for inlining that outperforms the state of the art by 7% on average (and up to 28%) on SPEC2017, 15% on the source code of LLVM itself, and 10% on the source code of SQLite. This work highlights the importance of exploring optimal inlining by providing new, actionable insight and an effective autotuning strategy that is of practical utility.
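
The combinatorics are the crux: with k independent call sites there are 2^k inlining configurations, which is why shrinking the exponent from 349 to 25 makes exhaustive evaluation feasible. A toy Python sketch of exhaustive search under a made-up size model:

```python
import itertools

# Each call site is a binary decision, so k sites yield 2**k configurations.
CALL_SIZE = 5                       # bytes for a call instruction sequence
BODIES = (40, 12, 3)                # callee body sizes at three call sites

def binary_size(decisions):
    size = 100                      # caller's own code
    for inline, body in zip(decisions, BODIES):
        # Inlining removes the call but duplicates the callee body here
        # (the model ignores the shared out-of-line copy for simplicity).
        size += body if inline else CALL_SIZE
    return size

best = min(itertools.product([False, True], repeat=len(BODIES)),
           key=binary_size)
print(best, binary_size(best))      # inline only callees smaller than a call
```

Even this toy model shows why greedy heuristics can miss the optimum: each decision changes the sizes that later decisions see, and only exhaustive (or cleverly pruned) search sees the whole space.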




CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation [10]

Paper PDF:

https://dl.acm.org/doi/pdf/10.1145/3503222.3507771

Affiliation:

Northwestern University, USA

Abstract:


Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE, or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management.
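
A heavily simplified sketch of the idea: protection comes from compiler-inserted guards backed by a kernel runtime that tracks every allocation, rather than from page tables. The class and method names below are hypothetical, and a real implementation would statically prove most guards away:

```python
# Toy model of compiler/kernel-cooperative protection in a single physical
# address space: the compiler would inline a guard like this at each load
# or store (or eliminate it when the access is provably safe); no TLB or
# page-table walk is involved.
class CaratRuntime:
    def __init__(self):
        self.regions = []                      # (lo, hi, writable) extents

    def allow(self, lo, hi, writable):
        self.regions.append((lo, hi, writable))

    def guard(self, addr, is_write):
        for lo, hi, writable in self.regions:
            if lo <= addr < hi and (writable or not is_write):
                return                         # access permitted
        raise MemoryError(f"protection fault at {addr:#x}")

rt = CaratRuntime()
rt.allow(0x1000, 0x2000, writable=True)    # heap region
rt.allow(0x8000, 0x9000, writable=False)   # read-only region
rt.guard(0x1800, is_write=True)            # ok
rt.guard(0x8010, is_write=False)           # ok
try:
    rt.guard(0x8010, is_write=True)        # write to read-only: fault
except MemoryError as e:
    print(e)
```

Because the runtime knows every live allocation, it can also relocate data and patch the corresponding pointers, recovering the mapping and movement properties that paging normally provides.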




References

[1] Qu Z, Liu L, Tu F, et al. DOTA: detect and omit weak attentions for scalable transformer acceleration[C]//Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 2022: 14-26.

[2] Reggiani E, Lazo C R, Bagué R F, et al. BiSon-e: a lightweight and high-performance accelerator for narrow integer linear algebra computing on the edge[C]//Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 2022: 56-69.

[3] Litz H, Ayers G, Ranganathan P. CRISP: critical slice prefetching[C]//Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 2022.

[4] Rao G, Chen J, Yik J, et al. SparseCore: stream ISA and processor specialization for sparse computation[C]//Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 2022: 186-199.

[5] Zheng Z, Yang X, Zhao P, et al. AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures[C]//Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 2022: 359-373.

[6] Dadu V, Nowatzki T. TaskStream: accelerating task-parallel workloads by recovering program structure[C]//Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 2022: 1-13.

[7] Zhang J, Swift M, Li J. Software-defined address mapping: a case on 3D memory[C]//Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 2022: 70-83.

[8] Zhang D, Huda S, Songhori E, et al. A full-stack search technique for domain optimized deep learning accelerators[C]//Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 2022: 27-42.

[9] Theodoridis T, Grosser T, Su Z. Understanding and exploiting optimal function inlining[C]//Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 2022: 977-989.

[10] Suchy B, Ghosh S, Kersnar D, et al. CARAT CAKE: Replacing paging via compiler/kernel cooperation[C]//Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 2022: 98-114.

