Writing and running software is now as much a part of science as telescopes and test tubes, but most researchers are never taught how to do either well. As a result, it takes them longer to accomplish simple tasks than it should, and it is harder for them to share their work with others than it needs to be.

This book introduces the concepts, tools, and skills that researchers need to get more done in less time and with less pain. Based on the practical experiences of its authors, who collectively have spent several decades teaching software skills to scientists, it covers everything graduate-level researchers need to automate their workflows, collaborate with colleagues, ensure that their results are trustworthy, and publish what they have built so that others can build on it. The book assumes only a basic knowledge of Python as a starting point, and shows readers how Python, the Unix shell, Git, Make, and related tools can give them more time to focus on the research they actually want to do.

Research Software Engineering with Python can be used as the main text in a one-semester course or for self-guided study. A running example shows how to organize a small research project step by step; over a hundred exercises give readers a chance to practice these skills themselves, while a glossary defining over two hundred terms will help readers find their way through the terminology. All of the material can be re-used under a Creative Commons license, and all royalties from sales of the book will be donated to The Carpentries, an organization that teaches foundational coding and data science skills to researchers worldwide.

Topic

Computer science

Subtopic

General computer science

1 Getting Started

Everything starts somewhere, though many physicists disagree.

— Terry Pratchett

As with many research projects, the first step in our Zipf’s Law analysis is to download the research data and install the required software. Before doing that, it’s worth taking a moment to think about how we are going to organize everything. We will soon have a number of books from Project Gutenberg¹ in the form of a series of text files, plots we’ve produced showing the word frequency distribution in each book, as well as the code we’ve written to produce those plots and to document and release our software package. If we aren’t organized from the start, things could get messy later on.

1.1 Project Structure

Project organization is like a diet: everyone has one, it’s just a question of whether it’s healthy or not. In the case of a project, “healthy” means that people can find what they need and do what they want without becoming frustrated. This depends on how well organized the project is and how familiar people are with that style of organization.

As with good coding style, small pieces in predictable places with readable names are easier to find and use than large chunks that vary from project to project and have names like “stuff.” We can be messy while we are working and then tidy up later, but experience teaches that we will be more productive if we make tidiness a habit.

In building the Zipf’s Law project, we’ll follow a widely used template for organizing small and medium-sized data analysis projects (Noble 2009). The project will live in a directory called zipf, which will also be a Git repository stored on GitHub (Chapter 6). The following is an abbreviated version of the project directory tree as it appears toward the end of the book:

¹ https://www.gutenberg.org/

The full, final directory tree is documented in Appendix D.

1.1.1 Standard information

Our project will contain a few standard files that should be present in every research software project, open source or otherwise:

README includes basic information on our project. We’ll create it in Chapter 7, and extend it in Chapter 14.
LICENSE is the project’s license. We’ll add it in Section 8.4.
CONTRIBUTING explains how to contribute to the project. We’ll add it in Section 8.11.
CONDUCT is the project’s Code of Conduct. We’ll add it in Section 8.3.
CITATION explains how to cite the software. We’ll add it in Section 14.7.

Some projects also include a CONTRIBUTORS or AUTHORS file that lists everyone who has contributed to the project, while others include that information in the README (we do this in Chapter 7) or make it a section in CITATION. These files are often called boilerplate, meaning they are copied without change from one use to the next.

1.1.2 Organizing project content

Following Noble (2009), the directories in the repository’s root are organized according to purpose:

Runnable programs go in bin/ (an old Unix abbreviation for “binary,” meaning “not text”). This will include both shell scripts, e.g., book_summary.sh developed in Chapter 4, and Python programs, e.g., countwords.py, developed in Chapter 5.
Raw data goes in data/ and is never modified after being stored. You’ll set up this directory and its contents in Section 1.2.
Results are put in results/. This includes cleaned-up data, figures, and everything else created using what’s in bin and data. In this project, we’ll describe exactly how bin and data are used with a Makefile created in Chapter 9.
Finally, documentation and manuscripts go in docs/. In this project, docs will contain automatically generated documentation for the Python package, created in Section 14.6.2.

This structure works well for many computational research projects and we encourage its use beyond just this book. We will add some more folders and files not directly addressed by Noble (2009) when we talk about testing (Chapter 11), provenance (Chapter 13), and packaging (Chapter 14).
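
The layout described in this section can be set up with a few lines of Python. The following is only a minimal sketch, not the authors' own setup: the function names create_skeleton and describe are hypothetical, the standard files are created empty, and they are named without an extension because the section names them that way (many projects add one, such as .md).

```python
from pathlib import Path

# Directory and file names are taken from the section text.
# The files are created empty; their contents are written in later chapters.
DIRECTORIES = ["bin", "data", "results", "docs"]
STANDARD_FILES = ["README", "LICENSE", "CONTRIBUTING", "CONDUCT", "CITATION"]


def create_skeleton(root):
    """Create the project directories and empty standard files under root.

    Existing directories and files are left untouched, so the function
    is safe to run more than once.
    """
    root = Path(root)
    for name in DIRECTORIES:
        (root / name).mkdir(parents=True, exist_ok=True)
    for name in STANDARD_FILES:
        (root / name).touch(exist_ok=True)
    return root


def describe(root):
    """Return the sorted top-level entries, with directories marked by '/'."""
    root = Path(root)
    return sorted(
        entry.name + "/" if entry.is_dir() else entry.name
        for entry in root.iterdir()
    )


if __name__ == "__main__":
    # Assumption: the project lives in a directory called zipf,
    # created relative to the current working directory.
    project = create_skeleton("zipf")
    for entry in describe(project):
        print(entry)
```

Because mkdir and touch are called with exist_ok=True, re-running the sketch on an existing project does not overwrite anything that is already there.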

1.2 Downloading the Data

The data files used in the book are archived at an online repository called Figshare (which we discuss in detail in Section 13.1.2) and can be accessed at:

We can download a zip file containing the data files by clicking “download all” at this URL and then unzipping the contents into a new zipf/data directory (also called a folder) that follows the project structure described above. Here’s how things look once we’re done:
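
The unzipping step can also be scripted rather than done by hand. This is a minimal sketch under stated assumptions: the archive is assumed to have already been downloaded and saved under the hypothetical name zipf-data.zip, and unpack_data is an illustrative helper rather than part of the book's code.

```python
import zipfile
from pathlib import Path


def unpack_data(archive, data_dir="zipf/data"):
    """Extract every member of archive into data_dir.

    Returns the sorted names of the files found in data_dir afterwards.
    """
    data_dir = Path(data_dir)
    # Create zipf/data (and zipf itself) if they do not exist yet.
    data_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(data_dir)
    return sorted(p.name for p in data_dir.iterdir() if p.is_file())


if __name__ == "__main__":
    # Hypothetical file name: use whatever name the downloaded archive has.
    archive = Path("zipf-data.zip")
    if archive.exists():
        for name in unpack_data(archive):
            print(name)
    else:
        print(f"{archive} not found; download it first")
```

Printing the returned names gives a quick check that the files landed in zipf/data and not in the current directory.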

1.3 Installing the Software

In order to conduct our analysis, we need to install the following software:

A Bash shell
Git version control
A text editor
Python 3² (v...

Cover
Title Page
Copyright Page
Dedication
Contents
Welcome
1 Getting Started
2 The Basics of the Unix Shell
3 Building Tools with the Unix Shell
4 Going Further with the Unix Shell
5 Building Command-Line Tools with Python
6 Using Git at the Command Line
7 Going Further with Git
8 Working in Teams
9 Automating Analyses with Make
10 Configuring Programs
11 Testing Software
12 Handling Errors
13 Tracking Provenance
14 Creating Packages with Python
15 Finale
Appendix
Index


Research Software Engineering with Python is written by Damien Irving, Kate Hertweck, Luke Johnston, Joel Ostblom, Charlotte Wickham, and Greg Wilson, and is available in PDF and/or ePUB format.

About this book