Sharing Data and Models in Software Engineering
eBook - ePub

  1. 406 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About this book

Data Science for Software Engineering: Sharing Data and Models presents guidance and procedures for reusing data and models between projects to produce results that are useful and relevant. Starting with a background section of practical lessons and warnings for beginning data scientists in software engineering, this edited volume proceeds to identify critical questions of contemporary software engineering related to data and models. Learn how to adapt data from other organizations to local problems, mine privatized data, prune spurious information, simplify complex results, update models for new platforms, and more. Chapters share broadly applicable experimental results, discussed with a blend of practitioner-focused domain expertise and commentary that highlights the methods that are most useful and applicable to the widest range of projects. Each chapter is written by a prominent expert and offers a state-of-the-art solution to an identified problem facing data scientists in software engineering. Throughout, the editors share best practices collected from their experience training software engineering students and practitioners to master data science.
  • Shares the specific experience of leading researchers and the techniques developed to handle data problems in the realm of software engineering
  • Explains how to start a data science project for software engineering, as well as how to identify and avoid likely pitfalls
  • Provides a wide range of useful qualitative and quantitative principles, ranging from the very simple to cutting-edge research
  • Addresses current challenges with software engineering data, such as lack of local data, access issues due to data privacy, and the need to increase data quality by cleaning spurious chunks from the data

Authors: Tim Menzies, Ekrem Kocaguneli, Burak Turhan, Leandro Minku, and Fayola Peters. Subject: Computer Science, Data Modelling & Design.
Chapter 1

Introduction

Before we begin: for the very impatient (or very busy) reader, we offer an executive summary in Section 1.3 and a statement on next directions in Chapter 25.

1.1 Why read this book?

NASA used to run a Metrics Data Program (MDP) to analyze data from software projects. In 2003, the research lead, Kenneth McGill, asked: “What can you learn from all that data?” McGill's challenge (and funding support) resulted in much work. The MDP is no more but its data was the seed for the PROMISE repository (Figure 1.1). At the time of this writing (2014), that repository is the focal point for many researchers exploring data science and software engineering. The authors of this book are long-time members of the PROMISE community.
Figure 1.1 The PROMISE repository of SE data: http://openscience.us/repo.
When a team has been working at something for a decade, it is fitting to ask, “What do you know now that you did not know before?” In short, we think that sharing needs to be studied much more, so this book is about sharing ideas and how data mining can help that sharing. As we shall see:
  • Sharing can be very useful and insightful.
  • But sharing ideas is not a simple matter.
The bad news is that, usually, ideas are shared very badly. The good news is that, based on much recent research, it is now possible to offer much guidance on how to use data miners to share.
This book offers that guidance. Because it is drawn from our experiences (and we are all software engineers), its case studies all come from that field (e.g., data mining for software defect prediction or software effort estimation). That said, the methods of this book are very general and should be applicable to many other domains.

1.2 What do we mean by “sharing”?

To understand “sharing,” we start with a story. Suppose two managers of different projects meet for lunch. They discuss books, movies, the weather, and the latest political/sporting results. After all that, their conversation turns to a shared problem: how to better manage their projects.
Why are our managers talking? They might be friends and this is just a casual meeting. On the other hand, they might be meeting in order to gain the benefit of the other's experience. If so, then their discussions will try to share their experience. But what might they share?

1.2.1 Sharing insights

Perhaps they wish to share their insights about management. For example, our diners might have just read Fred Brooks's book The Mythical Man-Month [59]. This book documents many aspects of software project management, including the famous Brooks's law, which says “adding staff to a late software project makes it later.”
To share such insights about management, our managers might share war stories on (e.g.) how upper management tried to save late projects by throwing more staff at them. Shaking their heads ruefully, they remind each other that often the real problems are the early lifecycle decisions that crippled the original concept.

1.2.2 Sharing models

Perhaps they are reading the software engineering literature and want to share models about software development. Now “models” can mean different things to different people. For example, to some object-oriented design people, a “model” is some elaborate class diagram. But models can be smaller, much more focused statements. For example, our lunch buddies might have read Barry Boehm's Software Engineering Economics book. That book documents a power law of software that states that development effort grows faster than linearly with project size, so larger software projects take disproportionately longer to complete than smaller projects [34].
Accordingly, they might discuss whether development effort for larger projects can be tamed with some well-designed information hiding.
(Just as an aside, by model we mean any succinct description of a domain that someone wants to pass to someone else. For this book, our models are mostly quantitative equations or decision trees. Other models may be more qualitative, such as the rules of thumb that one manager might want to offer to another, but in the terminology of this chapter, we would call that more insight than model.)
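As a rough illustration of how small such a quantitative model can be, the power law mentioned above can be sketched in a few lines. This is only a sketch under stated assumptions: the functional form effort = a * size^b is a common way to write a power law, and the coefficient values a = 3.0 and b = 1.12 are hypothetical placeholders, not values taken from Boehm's book.

```python
def effort(kloc: float, a: float = 3.0, b: float = 1.12) -> float:
    """Estimate effort (person-months) as a power law of size (KLOC).

    The defaults for a and b are illustrative placeholders, not
    calibrated values from any published model.
    """
    if kloc <= 0:
        raise ValueError("size must be positive")
    return a * kloc ** b


small = effort(10)     # a hypothetical 10 KLOC project
large = effort(100)    # a hypothetical project ten times larger

# With b > 1, ten times the size costs more than ten times the effort.
ratio = large / small  # equals 10 ** b, independent of a
print(f"10x the size needs {ratio:.2f}x the effort")
```

The point of the sketch is that the whole "model" is one equation with two numbers, which is exactly what makes it easy to pass from one manager to another.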

1.2.3 Sharing data

Perhaps our managers know that general models often need tuning with local data. Hence, they might offer to share specific project data with each other. This data sharing is particularly useful if one team is using a technology that is new to them, but has long been used by the other. Also, such data sharing has become fashionable amongst data-driven decision makers such as Nate Silver [399], or the evidence-based software engineering community [217].
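One way to picture "tuning with local data" is to refit the coefficients of a general model against a project's own records. The sketch below assumes a power-law model of the form effort = a * size^b and fits it by ordinary least squares on log-transformed values. The function name `tune_power_law` and the data points are invented for illustration; this is one simple tuning approach, not the method used in this book.

```python
import math


def tune_power_law(sizes, efforts):
    """Fit effort = a * size**b by least squares in log-log space.

    Taking logs turns the power law into a straight line:
    log(effort) = log(a) + b * log(size).
    """
    if len(sizes) != len(efforts) or len(sizes) < 2:
        raise ValueError("need at least two (size, effort) pairs")
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in efforts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope of the fitted line
    a = math.exp(mean_y - b * mean_x)    # intercept, mapped back from logs
    return a, b


# Hypothetical local records: size in KLOC, effort in person-months.
# The efforts are synthetic and noise-free so the fit is easy to check.
local_sizes = [5, 12, 30, 64, 120]
local_efforts = [2.5 * s ** 1.2 for s in local_sizes]

a, b = tune_power_law(local_sizes, local_efforts)
print(f"locally tuned model: effort = {a:.2f} * size ** {b:.2f}")
```

Real project data is noisy, so a fit on a handful of local records would be far less clean than this synthetic example suggests; that scarcity is one reason a team might want to borrow data from another team.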

1.2.4 Sharing analysis methods

Finally, if our managers are very experienced, they know that it is not enough just to share data in order to share ideas. This data has to be summarized into actionable statements, which is the task of the data scientist. When two such scientists meet for lunch, they might spend some time discussing the tricks they use for different kinds of data mining problems. That is, they might share analysis methods for turning data into models.

1.2.5 Types of sharing

In summary, when two smart people talk, there are four things they can share. They might want to:
  • share models;
  • share data;
  • share insight;
  • share analysis methods for turning data into models.
This book is about sharing data and sharing models. We do not discuss sharing insight because, to date, it is not clear what can be said on that point. As to sharing analysis methods, that is a very active area of current research; so much so that it would be premature to write a book on that topic. However, for some state-of-the-art results in sharing analysis methods, the reader is referred to two recent articles by Tom Zimmermann and his colleagues at Microsoft Research. They discuss the very wide range of questions that are asked of data scientists [27, 64] (and many of those queries are about exploring data before any conclusions are made).

1.2.6 Challenges with sharing

It turns out that sharing data and models is not a simple matter. To illustrate that point, we review the limitations of the models learned from the first generation of analytics in software engineering.
As soon as people started programming, it became apparent that programming was an inherently buggy process. As recalled by Maurice Wilkes [443] speaking of his programming experiences from the early 1950s:
It was on one of my journeys between the EDSAC room and the punching equipment that hesitating at the angles of stairs the realization came over me with full force that a good part of the remainder of my life was going to be spent in finding errors in my own programs.
It took several decades to find the experience required to build a size/defect relationship. In 1971, Fumio Akiyama described the first known “size” law, saying the number of defects D was a ...

Table of contents

  1. Cover image
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Why this book?
  6. Foreword
  7. List of Figures
  8. Chapter 1: Introduction
  9. Part I: Data Mining for Managers
  10. Part II: Data Mining: A Technical Tutorial
  11. Part III: Sharing Data
  12. Part IV: Sharing Models
  13. Bibliography