PART 1
Data Science and Research Libraries – Perspectives
CHAPTER 1
Sustainability and Success Models for Informal Data Science Training within Libraries
Elizabeth Wickes
Introduction
Data science, in its stubborn refusal to be defined or constrained around a coherent conceptual node, can be seen as a collection of skills, approaches and methods that have become relevant across a variety of research domains. While the departmental home for data science within the academic context continues to be debated (Donoho, 2017), the need for data science-related skills within the broader scientific research community has grown across domains and those researchers are actively searching for help (Osborne et al., 2014).
Data science is also a quickly growing academic degree and certificate area. Undergraduate degree programs and academic units hosting these programs are being strongly urged to develop faculty specializing in data science research and education (National Academies of Sciences, Engineering, and Medicine, 2018).
As much as the field of statistics may lament not becoming data science's de facto home, the reality is that this new domain has grown beyond the boundaries of any single department. This ownership debate will likely continue unabated as many academic disciplines recognize missed strategic opportunities and attempt to assert political control on their campuses. Looking beyond organizational chart intrigue, these new students and scholars will exist no matter how contentious their position is. Waiting for the debates to settle before acting to provide them with library and information services puts them at strong risk of being unserved or underserved for technical, data and other information services.
This entirely new subject domain and service population represent an exciting engagement opportunity for librarianship and information services. Not only are there patrons working directly in this new subject area, like undergraduates majoring in data science, but there are also indirect members sitting as affiliated faculty and other researchers seeking out data science-aligned training. They have unique research needs around data discovery, technical services, research data management and scientific reproducibility. Libraries resistant to engaging with these new scholars and students face a similar missed strategic opportunity.
Given the complicated and bespoke nature of where these data science research, educational and service units live within academic campuses, hosting campus-level data science training opportunities and consultations within a university's library or a research unit within a library can be one of the most efficient methods of distributing that service. While the work of data science may seem out of scope for the work of the library, looking deeper at the advisory and technical service needs of these scholars can help us see that the work of data scientists is strongly aligned with many previous and future service areas of librarianship.
Many libraries and librarians are equipped to serve some data science training needs and many have the organizational frameworks in place to expand related training services to meet the needs of this growing research and education sector. Creating something completely new is not always necessary. This chapter will attempt to unpack the various activities within the field of data science and contextualize where librarianship is already building capacity to provide relevant training services. It will cover the variety of blockers facing scholars attempting to acquire computational research skills and the challenges for providing training for such skills. It will also offer a variety of solutions intended to fit across many levels of existing and planned support for research data management services within a campus. No single solution fits all. Each library will need to assess its own available skills, capacity and strategic growth plans before building or supporting these services. However, it is my hope that every library unit can see how providing elements of data science training can be both possible and beneficial.
Existing relevant data services within libraries
Research data management services
Research data management (RDM) services remain new for many institutions. Born from a variety of mandates that research data must be preserved and made publicly available, these units represent a large growth area within many libraries, data centers and research institutions. Cox et al. (2017) found that RDM units across institutions have each taken a unique growth and partnership model, largely driven by which campus units were providing institutional support.
Many stewardship services offered within an RDM unit are driven by the skills available within the existing staff and partner network, and 'stewardship' is often used synonymously with data or digital curation. This means that many services are sensitive to staff turnover or student graduation, making their existence and service offerings precarious to include in formal outreach.
Staffing RDM units
Crafting sustainable and robust menus of data stewardship activities within RDM units is a growing research area for these units. RDM units have a few options for how to configure service staffing, but most options require either an extensive network of people with single specific skills or finding and retaining people with hybrid skills across several categories. Required skills generally fall into three categories: data curation librarianship expertise; technical skills for research data processing; and domain expertise around the research being conducted. These three represent the scholarly expertise, technical experience and domain knowledge necessary to communicate across the many stakeholders and understand the unique challenges of data-intensive research.
The common configurations of data science and other data-oriented research teams run parallel to how many successful research data management units are organized. As early data science roles were created, many organizations attempted to hire 'unicorn' staff members who held expertise in all facets of the roles, but have found this strategy to be time consuming and expensive (Press, 2015). Data journalism teams (Hermida and Young, 2017) and data science teams (Baškarada and Koronios, 2017) have opted to develop team-based strategies that blend technologists and domain specialists to compensate for the lack of data science 'unicorns' being readily available. Instead, they focus on crafting teams composed of those strongly skilled in individual knowledge areas. This is effective in commercial enterprises that have the budgets and administrative flexibility for hiring into many key roles, but it may not translate well to libraries.
Academic units often face difficulties in getting multiple positions created as well as finding individuals with strong skills in some technical and domain areas, with many experiencing difficulties in filling and retaining staff in these RDM roles (Fearon et al., 2013). Roles that are already filled face the additional problem of becoming active within the support network of researchers. Lyon (2012) describes a variety of ways that current subject specialists could be embedded more within research projects. However, Lyon observed that informatics skills relevant to computational research are necessary but commonly missing from library and information science (LIS) training programs and thus from many of the current and incoming information professionals.
In a large systematic literature review of the roles of professionals within LIS service organizations, Vassilakaki and Moniarou-Papaconstantinou (2015) found that these LIS professionals are being positioned as educators both within the library and the classroom, technology specialists, information consultants, knowledge managers and subject specialists. This indicates that libraries already utilize this team-based approach to internal organization, but some units and specialists may be so siloed or purposefully size limited that these approaches may not be immediately translatable.
Partnerships and collaboration
Failure to make use of the collaborative traditions within librarianship that embed information professionals within the research and education systems of a university, combined with the difficulty of identifying skilled professionals, could spell a collapse of awareness and thus usage of library services. Frank et al. (2001) warned of a stronger need for librarians to become more involved within the teaching and research process by serving as information consultants within university systems. Vassilakaki and Moniarou-Papaconstantinou (2015) also concluded that many outreach efforts to establish and promote use of new services, particularly information literacy programs, were strongly dependent on establishing this collaborative relationship between librarians and faculty members. Yet despite the positive awareness-boosting benefits of these intensive collaborations, using the word 'collaborator' seemed to devalue the perception of impact and value of these librarians as educators. This makes team approaches even more problematic within the library context. However, the growing needs around research data management may mean a new door is opening for academic library patrons to re-enter our service domain and refresh their perceptions about the role of librarians.
Data science training presents an opportunity for information specialists to be presented as scholars and consultants for the data needs within modern computational research. Yet the problematic lack of commonly available expertise and experience in computational research among LIS professionals must be resolved via creative partnership solutions and increased training. Davis and Cross (2015) observed that partnerships required to develop robust RDM services can be an extremely valuable method of on-the-job training to skill up our current information professional population. This means that building out broad partnership networks as workarounds to problems with hiring and retention may serve as a solution to sustainable training. These partnerships may take the form of cross-institutional networks to share domain expertise or cross-campus networks creating a robust network of service providers and consultants.
These networks represent the creativity of LIS professionals responding to the time-sensitive pressure of the mandate and bounded by the accessibility of local resources. RDM has always been defined by partnership building. At first it was to ensure the survival of the service and then later to expand the quality, depth and type of services offered. Partnerships have also been valuable even after initial assessment, roll out and capacity-building phases have been completed. Some institutions, such as Purdue University, have found that the inter-library and cross-campus partnerships formed during the creation of their Purdue University Research Repository (PURR) and associated services have continued to the benefit of other projects (Dearborn, 2018). Many other units have also reported on the unexpected learning opportunities that these increased, cross-skilled collaborations have created. Examples of the positive benefits of showcasing expertise within RDM services operating at the level 1 and level 2 stages, where data management plan advising and data management workshops happen, point to the longer-term value of this exposure of library staff to the research process of more faculty across an institution.
Many librarians develop deep collaborative instructional and scholarly partnerships with faculty members to provide information literacy content for their classes or for the research process where they participate in gathering datasets, background research, etc. However, situating librarians within the active data management and collection process shifts this relationship away from a purely service role of finding and gathering to a production role of researcher and scholar. Davis and Cross (2015) found that embedding their RDM committee members within the data management plan (DMP) review had a longer-term effect: it changed how these librarians were viewed and even led a faculty member to write a librarian into a grant for assistance with data. They also found that their ability to build out capacity for data analysis and other technical advising within the DMP reviews was greatly reduced by the limited number of staff members with related experience and skills in those areas. This means that representing librarians as expert scholars will be important to grow partnerships, but these relationships are sensitive to failure if the staff is weak in the necessary skills.
As the maturity model progresses for RDM units and the mandated deposit stage becomes sufficiently compliant with government and funder requirements for data preservation and access, more staff time is freed for advisory services and professional development. This moves the RDM service coverage into the previous data lifecycle stage where active analysis is wrapping up and publication materials are being created. The Cox et al. (2017) maturity model describes this as level 2, featuring capacity building and outreach for the advisory services.
Data science training is out of bounds for early stage RDM units, but the capacity and partnership-building features of later maturity stages open up opportunities. Research data management units are able to consider growth areas and begin to build out staff time and skills that expand their technical and advisory services. Technical services that go a step beyond a consultation or workshop may include things like metadata editing or broadly working with data during the active research process. This stage, while mature for our current perception of curation services, can actually be seen as recreating previous models of research data curation. Dataset integration for large scale analysis is not new, but many of the more modern programmatic methods are more readily available because of more specialized programming languages and the availability of skilled labor. Research data curators have been working with active research datasets since long before mandates appeared requiring the deposit of this final research data.
Challenges for scholars
Expanding our perception of what data science is and can be, we can begin to see why the excitement bubble has spread so far across academic and research organizations. Machine learning methodologies like data and text mining are becoming more accessible to those outside of computer science and are creating new opportunities for data classification, prediction modeling and even topic modeling of large text corpora. Activities like feature extraction are commonly used to prepare these machine learning models and to generate data by-products that can provide valuable insights into unexplored aspects of datasets. Some projects integrate additional data observations to make large samples for analysis a norm, expanding the possibilities for data aggregation. Growing computer bandwidth has eliminated some needs for reducing the size of data before analysis. Powerful and freely available open source toolkits expand the commonly available statistical approaches that had previously only been possible via custom scripts within a proprietary analysis platform.
Open source communities
As these methods grow out of open source communities, many authors of new tools contribute their codebases back to the community, sometimes creating an entirely new community around it. This makes the tools available to other researchers and further expands their use and reuse value. This represents a new publishing model for scholars working through the tenure track faculty pathway. Some universities and academic units have begun recognizing contributions to open source projects as valuable scholarly work and consider it during tenure and promotion decisions.
Skill sets
While this growth has begun reaping value for those with programming skills and access to computing power, the ability to consider or begin using these methods remains unequal. Those without pro...