Workshops and Tutorials | 2015 International Symposium on Code Generation and Optimization

Workshops:

2/7/2015 Saturday AM	2/7/2015 Saturday PM	2/8/2015 Sunday AM	2/8/2015 Sunday PM
AutoTune: International Workshop on Code Auto-Tuning		COSMIC: Code optimization for multi and many cores	COSMIC: Code optimization for multi and many cores
AMAS-BT: Workshop on Architectural and Microarchitectural Support for Binary Translation

Tutorials:

2/7/2015 Saturday AM	2/7/2015 Saturday PM	2/8/2015 Sunday AM	2/8/2015 Sunday PM
LLVM: An Intro to LLVM: IR, optimizations, backends and more.	LLVM: An Intro to LLVM: IR, optimizations, backends and more.	AlteraOpenCL: Compiling OpenCL to a streaming dataflow architecture on FPGAs	DynamoRIO: Building Dynamic Tools with DynamoRIO on x86 and ARM
HPDSLs: Scala, LMS and Delite for High-Performance DSLs and Program Generators	Periscope: Code Auto-Tuning with the Periscope Tuning Framework	OpenTuner: Autotuning programs with OpenTuner	Graal: A research platform for dynamic compilation and managed languages
	Halide: Code generation for image processing and stencil computation in Halide	Pin++: Using Pin++ To Author Highly Configurable Pintools for the Pin	SnuCL: A Unified OpenCL Framework
	SYCL @ CGO: A hands-on introduction to Khronos SYCL for OpenCL

Workshops:

Saturday AM

Title: International Workshop on Code Auto-Tuning (http://www.autotune-project.eu/CGO_2015_workshop)

Organizers: Renato Miceli (SENAI CIMATEC), Michael Gerndt (Technische Universität München), and Siegfried Benkner (Universität Wien)

In this workshop, attendees will have the opportunity to delve into the topic of application auto-tuning, presented by developers and performance engineers from the AutoTune project. The workshop will present the theory behind auto-tuning, focusing on the conceptual basis and discussing the latest advancements in the field within the AutoTune project. The related tutorial in the afternoon (Code Auto-Tuning with the Periscope Tuning Framework) will provide a practical perspective on auto-tuning, showing through use cases how to best harness and tailor performance analysers to tune real applications.

Title: 7th Workshop on Architectural and Microarchitectural Support for Binary Translation (AMAS-BT)

Organizers: Mauricio Breternitz (AMD), Vijay Janapa Reddi (University of Texas/Austin), and Youfeng Wu (Intel)

The main goal of this half-day workshop is to bring together researchers and practitioners with the aim of stimulating the exchange of ideas and experiences on the potential and limits of Architectural and MicroArchitectural Support for Binary Translation (hence the acronym AMAS-BT). The key focus is on challenges and opportunities for such assistance and opening new avenues of research. A secondary goal is to enable dissemination of hitherto unpublished ...

Sunday AM & PM

Title: Code optimisation for multi and many cores (COSMIC)

Organizers: Pavlos Petoumenos (University of Edinburgh) and Zheng Wang (University of Lancaster)

Many-core architectures such as mobile SoCs or GPGPUs are quickly becoming the norm in computing devices and consumer electronics. The community sees this development as essential in sustaining the exponential growth of performance in an energy-efficient way, but at present there is no consensus on how software can make best use of it. Developing parallel applications often starts with an existing sequential implementation. A key problem is how to discover the parallelism potentially available and then convert it into a form that can be exploited. Once we have a parallel implementation, its performance and energy efficiency largely depend on how it is mapped to the available hardware. Given that hardware is increasingly diverse and heterogeneous and that, in the era of dark silicon, energy efficiency affects the availability of hardware, how can this re-mapping be best achieved? Solutions to these two problems form the core topic of the workshop. With novel research papers and expert invited speakers from both industry and academia, this workshop aims at examining different solutions to these problems and includes (but is not limited to):

programming languages and models
compilers and tools
runtime systems
operating systems
binary translation
combinations of the above

for homogeneous, heterogeneous multi-core and many-core based systems.

Tutorials:

Saturday AM & PM

Title: An Intro to LLVM: IR, optimizations, backends and more (llvm.org)

Organizers: Chandler Carruth (Google) and Tanya Lattner (Apple)

Topic Overview

High-level overview of LLVM & Clang
- Will include how to get started coding on LLVM & Clang
- Overview of core design elements, data structures, APIs, and patterns used in the codebase
- High-level testing strategy for LLVM & Clang using tools like Clang’s ‘-verify’, opt, llc, FileCheck, and GoogleTest
- Process of submitting a patch, code review, and community interactions
How to add an optimization pass to LLVM
- Tutorial on the LLVM IR both in the abstract and at the level of internal APIs
- Basic APIs and data structures needed to implement, test, and wire a new pass into the compiler.
- Overview of the relationship between transform and analysis passes.
- Overview of the different kinds of transformation passes, how they interact, and what they can and can’t do
- Actually add a transformation pass and an analysis pass to the compiler that depend on each other and exercise this machinery.
  - Includes authoring relevant tests for each component
High-level overview of the architecture of an LLVM backend, with an emphasis on modifying or enhancing existing backends rather than adding a new one
- Detailed review of where things are: from SelectionDAG to FastISel to the register allocator
- Detailed review of exactly how a backend’s tablegen works, and how to make changes there and debug things
Add a target-independent SelectionDAG combine to the code generator
- Includes a detailed walk-through of the relevant DAG combine interfaces.
Add a target-specific DAG combine with special consideration of legalization
Add support for a new instruction pattern to a backend
Every bit of performance matters, and how the LLVM coding standard helps here

Saturday AM

Title: Scala, LMS and Delite for High-Performance DSLs and Program Generators

Organizers: Tiark Rompf (Purdue University), Kunle Olukotun (Stanford University), and Markus Püschel (ETH Zürich)

This tutorial is targeted at researchers and practitioners interested in building efficient domain specific languages (DSLs) and program generators. Lightweight Modular Staging (LMS) is a pragmatic approach to runtime code generation in Scala, and Delite is a compiler framework for embedded DSLs that simplifies the process of implementing DSLs for parallel computation and heterogeneous targets. This tutorial provides an overview of the technology stack, demonstrates use-cases where it has been successfully applied and guides the attendees step-by-step through creation of simple generators and DSLs.
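
As a rough illustration of the staging idea behind program generators, the following Python sketch specializes a power function for a fixed exponent by generating source code at run time. It is not LMS's Scala API; the names `gen_power` and `power_<n>` are hypothetical and chosen only for this example.

```python
def gen_power(n: int):
    """Generate a function computing x**n for a fixed, non-negative integer n.

    The repetition over n happens at generation time, so the generated
    code contains only a chain of multiplications (no loop remains).
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    expr = "1" if n == 0 else " * ".join(["x"] * n)
    source = f"def power_{n}(x):\n    return {expr}\n"
    namespace = {}
    exec(source, namespace)  # compile the generated source into a function
    return namespace[f"power_{n}"], source


power_3, src = gen_power(3)
print(src)         # shows the generated, specialized source
print(power_3(2))  # evaluates the specialized function
```

The generator runs once per exponent; the specialized function it returns can then be called many times without re-examining `n`.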

Saturday PM

Title: Code Auto-Tuning with the Periscope Tuning Framework (Periscope)

Organizers: Renato Miceli (SENAI CIMATEC), Michael Gerndt (Technische Universität München), and Siegfried Benkner (Universität Wien)

In this tutorial, attendees will have the opportunity to delve into the topic of application auto-tuning, presented by developers and performance engineers from the AutoTune project. The tutorial will provide a practical perspective on auto-tuning, showing through use cases how to best harness and tailor performance analysers to tune real applications.

Title: Code generation for image processing and stencil computation in Halide (halide-lang.org)

Organizers: Jonathan Ragan-Kelley (Stanford) and Saman Amarasinghe (MIT)

This tutorial will cover the design and implementation of Halide, a domain-specific language and compiler for image processing and stencil computation, for people interested in using and building on it as a highly configurable code generator. As a language now in widespread production use, Halide is an interesting and high-impact platform for research on program transformation and code generation; as a language with explicit algebraic control over a wide range of loop synthesis and code generation strategies, it is a powerful backend for other languages and systems, especially those including stencil computation.

Topics:

The Halide programming model
Halide’s model of scheduling for loop synthesis
Examples of program transformation and synthesis via Halide schedules
Code generation in Halide
Mapping to the GPU and heterogeneous parallel execution via Halide schedules
Hands-on session with Halide, focussed on scheduling and code generation
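
The separation between what is computed and the loop order used to compute it can be sketched in plain Python. This is not Halide syntax; `blur_at`, `schedule_serial`, `schedule_split_reordered`, and `realize` are hypothetical names used only to show that two different loop orders produce the same result for one stencil definition.

```python
def blur_at(img, x):
    """Algorithm: 3-point box blur at position x, clamped at the borders."""
    n = len(img)
    left = img[max(x - 1, 0)]
    right = img[min(x + 1, n - 1)]
    return (left + img[x] + right) / 3.0


def schedule_serial(n):
    """Schedule 1: visit output positions from left to right."""
    return range(n)


def schedule_split_reordered(n, factor=4):
    """Schedule 2: split x into (outer, inner) and iterate inner outermost."""
    outer = (n + factor - 1) // factor
    for xi in range(factor):
        for xo in range(outer):
            x = xo * factor + xi
            if x < n:  # guard the ragged last tile
                yield x


def realize(img, schedule):
    """Evaluate the algorithm at every position, in the order the schedule gives."""
    out = [0.0] * len(img)
    for x in schedule(len(img)):
        out[x] = blur_at(img, x)
    return out


img = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0]
a = realize(img, schedule_serial)
b = realize(img, schedule_split_reordered)
print(a == b)  # the visiting order changes, the values do not
```

Only the schedule functions differ between the two runs, which is the property that makes the loop order something a compiler or autotuner can vary freely.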

Sunday AM

Title: Compiling OpenCL to a streaming dataflow architecture on FPGAs

Organizers: Deshanand Singh (Altera) and Doris Chen (Altera)

In recent years, Field-Programmable Gate Arrays have become extremely powerful computational platforms that can efficiently solve many complex problems. Modern FPGAs comprise effectively millions of programmable elements, signal processing elements and high-speed interfaces, all of which are necessary to deliver a complete solution. The power of FPGAs is unlocked via low-level programming languages such as VHDL and Verilog, which allow designers to explicitly specify the behavior of each programmable element. While these languages provide a means to create highly efficient logic circuits, they are akin to “assembly language” programming for modern processors. This is a serious limiting factor for both productivity and the adoption of FPGAs on a wider scale.

In this tutorial, we use the OpenCL language to explore techniques that allow us to program FPGAs at a level of abstraction closer to traditional software-centric approaches. OpenCL is an industry standard parallel language based on ‘C’ that offers numerous advantages that enable designers to take full advantage of the capabilities offered by FPGAs, while providing a high-level design entry language that is familiar to a wide range of programmers.

The challenge of mapping a ‘C’ based language to FPGAs is that these languages all have implicit assumptions that the underlying architecture executing these programs is a processor based architecture. Processors are characterized by a sequence of instructions that control a datapath that manipulates data values stored in a memory. Conversely, FPGA architectures are more suited to implementing spatial computing circuits where data flows in a pipelined fashion from one functional unit to the next until computations are complete. Data can be transferred efficiently by wires, registers or FIFOs without always resorting to external storage. This tutorial will explore compiler optimizations and code generation techniques that can transform sequential programs into efficient streaming dataflow circuits for FPGAs. We will examine specific case studies of DSP filters, image processing and mathematical computations to demonstrate how these techniques can be applied to real world examples.

Title: Autotuning programs with OpenTuner (http://opentuner.org/tutorial/cgo2015/)

Organizers: Jason Ansel (MIT), Saman Amarasinghe (MIT), and Jonathan Ragan-Kelley (Stanford)

This tutorial will cover the usage of OpenTuner, an open-source framework for building domain-specific multi-objective program autotuners. OpenTuner supports fully customizable configuration representations, an extensible technique representation to allow for domain-specific techniques, and an easy-to-use interface for communicating with the tuned program. A key capability inside OpenTuner is the use of ensembles of disparate search techniques simultaneously. Techniques which perform well will receive larger testing budgets and techniques which perform poorly will be disabled. OpenTuner has been used by a number of different projects to build domain-specific autotuners.
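
The idea of giving more of the testing budget to techniques that perform well can be sketched with a simple epsilon-greedy selector. This is an assumption-laden toy, not OpenTuner's API or its actual allocation strategy: the objective `cost`, the two techniques, and the function `tune` are all hypothetical, and poorly performing techniques here merely receive less budget rather than being disabled.

```python
import random


def cost(x):
    """Toy objective to minimize over one integer parameter in [0, 100]."""
    return abs(x - 37)


def random_search(best_x, rng):
    """Technique 1: ignore the current best and sample uniformly."""
    return rng.randint(0, 100)


def local_step(best_x, rng):
    """Technique 2: perturb the current best by a small step."""
    return min(100, max(0, best_x + rng.choice([-3, -2, -1, 1, 2, 3])))


def tune(budget=200, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    techniques = {"random_search": random_search, "local_step": local_step}
    names = sorted(techniques)
    wins = {name: 0 for name in names}  # proposals that improved the best cost
    uses = {name: 0 for name in names}  # budget spent on each technique
    best_x = rng.randint(0, 100)
    best_cost = cost(best_x)
    for _ in range(budget):
        if rng.random() < epsilon:
            name = rng.choice(names)  # explore: pick any technique
        else:
            # exploit: pick the technique with the best smoothed success rate
            name = max(names, key=lambda t: (wins[t] + 1) / (uses[t] + 2))
        candidate = techniques[name](best_x, rng)
        uses[name] += 1
        candidate_cost = cost(candidate)
        if candidate_cost < best_cost:
            best_x, best_cost = candidate, candidate_cost
            wins[name] += 1
    return best_x, best_cost, uses


best_x, best_cost, uses = tune()
print(best_x, best_cost, uses)
```

The `uses` dictionary shows how the budget was split; which technique ends up with the larger share depends on the objective and the random seed.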

The topics covered in the tutorial will be:

Overview of autotuning: including a history of past autotuning projects and how autotuning is used today
Machine learning primer: empirical search, model based techniques, and which technique is right for you
OpenTuner framework: how is it designed and how you should use it
Examples of using OpenTuner: presentations by current users of OpenTuner
What makes a good search space representation: the secret sauce of autotuning
How to go about autotuning your system with OpenTuner
Hands-on session with OpenTuner

Title: Using Pin++ To Author Highly Configurable Pintools for the Pin (https://github.com/SEDS/PinPP)

Organizers: James H. Hill (Indiana University-Purdue University Indianapolis)

This tutorial will discuss Pin++, an open-source framework for creating Pintools, which are analysis tools for the dynamic binary instrumentation tool Pin. Pin++ is an object-oriented framework that uses template meta-programming to implement Pintools. The goal of Pin++ is to simplify programming a Pintool and promote reuse of its components across different Pintools. Our results show that Pintools implemented using Pin++ can have a 54% reduction in complexity, an increase in modularity, and up to a 60% reduction in instrumentation overhead.
This tutorial will focus on the following key concepts in Pin++:

It will discuss the challenges of implementing a Pintool using the traditional approach.
It will discuss how Pin++ addresses existing challenges when authoring Pintools.
Using hands-on examples, it will discuss how to implement basic Pintools using Pin++ so the audience can begin exploring how to apply Pin++ to their existing problems.

Sunday PM

Title: Building Dynamic Tools with DynamoRIO on x86 and ARM (http://dynamorio.org/tutorial.html)

Organizers: Derek Bruening (Google) and Qin Zhao (Google)

This tutorial will present the DynamoRIO tool platform and describe how to use its API to build custom tools that utilize dynamic code manipulation for instrumentation, profiling, analysis, optimization, introspection, security, and more. The DynamoRIO tool platform was first released to the public in June 2002 and has since been used by many researchers to develop systems ranging from taint tracking to prefetch optimization. DynamoRIO is publicly available in open source form and targets Windows, Linux, and Mac on x86 and Linux on ARM.

The tutorial will cover the following topics:

DynamoRIO API: an overview of the full range of DynamoRIO’s powerful API, which abstracts away the details of the underlying infrastructure and allows the tool builder to concentrate on analyzing or modifying the application’s runtime code stream. It includes both high-level features for quick prototyping and low-level features for full control over instrumentation.
DynamoRIO system overview: a brief description of how DynamoRIO works under the covers.
Description of tools provided with the DynamoRIO package, including the Dr. Memory memory debugging tool, the DrCov code coverage tool, and the DrStrace Windows system call tracing tool.
Sample tool starting points for building new tools
Advanced topics when building sophisticated tools

Title: Graal: A research platform for dynamic compilation and managed languages (https://wiki.openjdk.java.net/display/Graal/Main)

Organizers: Christian Wimmer (Oracle Labs)

The tutorial will cover the following topics:

Graal: a new high-performance dynamic compiler for Java written in Java
Introduction to the Graal intermediate representation, and how it simplifies speculative optimizations
Graal API: Separation of the compiler from the VM
Snippets: expressing high-level semantics in low-level Java code
Integration of the compiler with an application/library – and how that can help your research project.
Using Graal for static analysis
Graal as a compiler for dynamic programming languages
Project Sumatra: Compiling for the GPU

Title: SnuCL: A Unified OpenCL Framework (http://snucl.snu.ac.kr/tutorials.html)

Organizers: Jaejin Lee (Seoul National University)

OpenCL is a programming model for heterogeneous parallel computing systems. OpenCL provides a common abstraction layer across different multicore architectures, such as CPUs, GPUs, FPGAs, and Xeon Phi processors. The OpenCL ICD (installable client driver) enables different OpenCL platforms from different vendors to coexist in the same system (a single operating system instance). However, current OpenCL has two major limitations. First, to use different processors from different vendors in the same application, programmers need to explicitly specify a vendor-specific OpenCL platform for each processor in the application, and OpenCL objects (buffers, events, etc.) cannot be shared across different vendor platforms without explicit copying. Second, OpenCL is restricted to a heterogeneous system running a single operating system instance. To target a heterogeneous cluster running multiple operating system instances, programmers must use an OpenCL framework together with a communication library, such as MPI. This tutorial introduces a unified OpenCL framework, called SnuCL, which overcomes the limitations of current OpenCL. SnuCL naturally extends the original OpenCL semantics to the heterogeneous cluster environment. In addition, programmers do not need to explicitly specify different vendor platforms to use different processors. OpenCL objects can be shared across different platforms in the same application without explicit copying. SnuCL is freely available, open-source software developed at Seoul National University. SnuCL provides the programmer with an illusion of a single OpenCL platform image.
The first part of this tutorial consists of an introduction to OpenCL programming and accelerator architectures, such as GPUs and Intel Xeon Phi. The second part of the tutorial covers the SnuCL framework.