RE-gem5: Building Sustainable Research Infrastructure
Note: this post originally appeared on Computer Architecture Today.
RE-gem5 is a directed effort to rejuvenate the underlying infrastructure of gem5. RE-gem5 is not a new simulator or a new project; it is a project to enhance and support the current gem5 infrastructure.
The community-developed gem5 infrastructure is one of the most popular and widely known cycle-level computer architecture simulation systems. It is used for research in academia and industry and for teaching at universities all over the world. The gem5 paper has received over [3100 citations](https://scholar.google.com/scholar?q=gem5) in the past 8 years. However, much of the software was written 15 years ago, and it is currently maintained by students and researchers. Without a concerted effort to improve the underlying infrastructure, gem5’s software will become increasingly fragile, and its users’ research output will gradually diminish as time goes to ad hoc maintenance instead of research.
Thus, we are embarking on the RE-gem5 project to reinvigorate the gem5 infrastructure and enable the next 15 years of computer architecture research. This project will enhance and rejuvenate the aging gem5 community infrastructure by improving the underlying gem5 components, expanding community outreach and user services, and developing new models for the emerging devices required to evaluate important applications. It will create a sustainable path for the future of the gem5 infrastructure by improving code testing coverage and increasing the robustness of hardware device models. Additionally, community outreach efforts will broaden the community, increasing the impact of gem5 and providing more contributors to continue development of this infrastructure after this project completes.
Some of the highlights of our plan include a full-time programmer to help with gem5 community development; stable, tagged, rigorously tested, and documented releases multiple times per year; community outreach; known-good configurations; and support for full machine learning stacks. Many more details can be found in the roadmap, on which we are currently soliciting feedback.
One of our main goals is to make gem5 easier to use and accessible for a broader range of users, both to increase the impact of gem5 and provide more contributors to continue development of the infrastructure. To this end, we will work on expanding gem5’s documentation and creating and running two gem5 summer school sessions over the next three years. These summer schools will closely follow the content in the Learning gem5 book, with new modules to cover additional topics such as CPU-GPU programs, debugging complex coherence protocols, and best practices when using gem5.
Most computer architecture research, and computer systems research more broadly, asks the question “How does my new idea compare to the current state-of-the-art systems?” Models which accurately represent the current state of the art are necessary to investigate these questions. As part of this project, we will provide the community with a set of publicly validated models of current architectural system components including CPU cores, GPU compute units (CUs), caches, main memory systems, and devices. We will strive for accuracy compared to real systems; however, since most systems are proprietary and complex, perfect accuracy for all workloads will be difficult. Thus, we will broadly advertise the relative performance, power, and other metrics when providing these models so users can make an informed decision when choosing their baseline configurations.
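One way such accuracy figures could be reported is as a per-workload relative error between a simulated metric and the same metric measured on real hardware. The following is a minimal sketch under that assumption; the function names are hypothetical and are not part of gem5, and any numbers passed in would have to come from actual measurements.

```python
from typing import Dict, Tuple


def relative_error(simulated: float, measured: float) -> float:
    """Signed relative error of a simulated metric against a hardware measurement."""
    if measured == 0:
        raise ValueError("measured value must be non-zero")
    return (simulated - measured) / measured


def summarize(results: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Map each workload name to its signed relative error.

    `results` maps workload name -> (simulated value, measured value).
    """
    return {name: relative_error(sim, meas) for name, (sim, meas) in results.items()}


def mean_absolute_error(errors: Dict[str, float]) -> float:
    """Average magnitude of the per-workload relative errors."""
    if not errors:
        raise ValueError("no workloads to summarize")
    return sum(abs(e) for e in errors.values()) / len(errors)
```

Keeping the sign of each error lets a user see whether a model tends to be optimistic or pessimistic for a given workload, while the mean absolute error gives a single summary number for choosing a baseline.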
Known-good configurations will provide three main benefits:
- They will allow researchers to concentrate on only one part of the system and spend more time on their innovations and less time fighting to configure a baseline.
- They will enable systems researchers outside of computer architecture to use gem5 more easily.
- They will improve the reproducibility of research by making it easier to write good methodology sections in papers (e.g., “Our baseline is the desktop proxy from gem5’s known-good configuration library.”).
We will work with the computer architecture and broader computer systems communities to choose and prioritize the specific state-of-the-art systems to use as baselines, and we will solicit feedback on which workloads and systems are most important. We will include validated configurations for both real hardware systems (e.g., a proxy for an Intel Core i7 processor) and high-impact research proposals.
Supporting machine learning stacks
To properly analyze how ML workloads perform on future architectures and the benefits of co-designed hardware-software solutions, the community needs completely open-source simulation platforms. Widely used high-level frameworks like PyTorch, Caffe, and TensorFlow build on top of optimized ML libraries like MIOpen. We will integrate these frameworks into the gem5 ecosystem by improving gem5’s full system support and providing gem5 users with a set of Docker containers to quickly get started with these complex frameworks. We will explore multiple methods of reducing simulation time, including [SimPoints](http://cseweb.ucsd.edu/~calder/simpoint/), statistical simulation, checkpointing the state of the system to be reloaded, and virtualized fast forwarding, and we will make these frameworks public and open source. We will also integrate accelerator models into the mainline gem5 code so that a rich set of realistic ML workloads can also be run on and optimized for accelerators, including the TPU.
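To illustrate why sampling methods such as SimPoints reduce simulation time, the sketch below shows the basic estimation step: only a few representative intervals are simulated in detail, and a whole-program metric is estimated as the weighted average of the per-interval results. This is a simplified illustration, not gem5 code; the `SimulationPoint` class and `estimate_cpi` function are hypothetical names, and the weights are assumed to come from a prior clustering step.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SimulationPoint:
    interval_id: int  # index of the representative interval in the program
    weight: float     # fraction of execution represented by this interval
    cpi: float        # cycles per instruction from detailed simulation


def estimate_cpi(points: Sequence[SimulationPoint]) -> float:
    """Estimate whole-program CPI as the weight-averaged CPI of the points."""
    total_weight = sum(p.weight for p in points)
    if not points or total_weight <= 0:
        raise ValueError("need at least one point with positive weight")
    # Normalize by the total weight in case the weights do not sum to 1.
    return sum(p.weight * p.cpi for p in points) / total_weight
```

For example, three points with weights 0.5, 0.25, and 0.25 and CPIs of 1.0, 2.0, and 4.0 give an estimate of 0.5·1.0 + 0.25·2.0 + 0.25·4.0 = 2.0, while only three intervals need detailed simulation.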
As the initial step in this project, we are looking for community feedback on our near- and medium-term roadmap. A draft of the roadmap, on which anyone can comment and make suggestions, is available at the following link.
The purpose of this roadmap is twofold:
- We want to get feedback to see what other features are important to the community and whether our current priorities are aligned with the community.
- We want to broadcast to the community that changes and improvements are coming. gem5 is widely used across the computer architecture community.
We would love to hear your feedback on the roadmap! We truly believe gem5 is a community project, and that requires the community (everyone reading this blog post) to contribute! Giving feedback on gem5’s direction is an easy way to have a big impact on gem5.
RE-gem5 is currently sponsored primarily by an NSF CCRI grant. We are actively looking for additional sponsors to expand the impact of this project and enable sustainable gem5 development for many years to come. If you or your company is interested in sponsoring gem5 development to further enhance its research potential, please reach out to Jason Lowe-Power (firstname.lastname@example.org).
gem5 is an open source community project, and many of the ideas in this blog post and the roadmap have come from the community at large. Details about gem5’s governance can be found on our website. If you would like to join the discussion, please subscribe to the gem5 mailing lists.