Next-generation sequencing (NGS) technologies for DNA have produced an ever-larger deluge of data. Researchers are learning that analyzing these data efficiently requires building sophisticated pipelines, typically from command-line tools in a Linux or other open-source Unix-like compute environment. Many researchers have created such pipelines and used them to analyze their data successfully. They are now faced with the challenge of making these pipelines available to their colleagues. Reproducibility has emerged as a major issue (TODO REF), as researchers, peer reviewers, and even pharmaceutical companies discover that the software and data used to produce a particular research finding are unavailable, poorly documented, or targeted at specific compute infrastructures that the wider research community cannot access. To remedy this, funding agencies and journals are creating policies to promote software reproducibility. In this brief workshop we will establish several best practices for reproducibility in the (comparative) analysis of data obtained by NGS. Along the way we will encounter the commonly used technologies that enable these best practices, working through use cases that illustrate the underlying principles.
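To give a flavor of the kind of scripted analysis step such pipelines are built from, here is a minimal sketch using only standard UNIX tools. The file names, the toy FASTQ data, and the length cutoff are illustrative assumptions, not course materials; a real pipeline would call dedicated tools (aligners, variant callers) in the same scripted style.

```shell
#!/usr/bin/env bash
# Sketch of a scripted analysis step: count the reads in a FASTQ file
# and keep only reads above a minimum length. All names and values here
# are hypothetical, chosen only to illustrate command-line scripting.
set -euo pipefail

# Create a tiny example FASTQ file (two reads).
cat > sample.fastq <<'EOF'
@read1
ACGTACGTACGT
+
IIIIIIIIIIII
@read2
ACGT
+
IIII
EOF

# A FASTQ record is 4 lines, so read count = line count / 4.
n_reads=$(( $(wc -l < sample.fastq) / 4 ))
echo "reads: $n_reads"

# Keep records whose sequence (line 2 of each 4-line record) is >= 10 bases.
awk 'NR % 4 == 1 {h=$0} NR % 4 == 2 {s=$0} NR % 4 == 3 {p=$0}
     NR % 4 == 0 { if (length(s) >= 10) print h "\n" s "\n" p "\n" $0 }' \
    sample.fastq > filtered.fastq

echo "filtered reads: $(( $(wc -l < filtered.fastq) / 4 ))"
```

Because each step is a plain command reading and writing files, the whole analysis can be versioned, rerun, and handed to a colleague as a single script.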
Building on an existing pipeline of command-line utilities, we will illustrate how the entire compute environment used to run the pipeline can be packaged into a unit that can be shared with other researchers, so that they can make full use of the environment on their own machines or on standard cloud compute platforms such as Amazon or Google.

Best practices
- Command-line scripting of analysis steps
- Provisioning systems to standardize software environment requirements
- Packaging of compute environments into static, portable units
- Sharing of compute environment packages

Technologies
- Next-generation sequencing platforms
- Command-line executables, command-line scripting and batching
- Provisioning systems: Puppet, Dockerfile
- Virtualization with VirtualBox and Vagrant
- Containerization with Docker

Target audience
This course is aimed at researchers who have developed pipelines to analyze NGS data and who, faced with new reproducibility requirements, would like to learn how to package their analysis pipelines in a reproducible (and shareable) way. The course will start with a very basic NGS pipeline that runs in a Linux command-line environment and develop it into two packages that can be shared with, and used by, other researchers. The ideal attendee is a scientist who is already comfortable developing scripted pipelines on the command line, or who is not afraid to get their hands dirty acquiring the computer-literacy skills needed for the informatics side of data analysis.

Pre-requisites
The course assumes that attendees are not intimidated by the prospect of gaining experience with UNIX-like operating systems (including the shell and shell scripting). Attendees should understand some of the science behind high-throughput DNA sequencing and sequence analysis, as we will not go deeply into the underlying theory (or the mechanics of particular algorithms). What will be taught are technical solutions for automating such analyses and sharing them in reusable compute environments, including (but not limited to) beginner-level programming and basic Linux provisioning. General computer literacy (e.g. editing plain-text data files, navigating the command line) will be assumed.
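The "packaging of compute environments into static, portable units" described above can be sketched as a Dockerfile. This is an illustrative assumption only: the base image, the tool list, and the script name (run_pipeline.sh) are hypothetical stand-ins, not the course's actual materials.

```dockerfile
# Illustrative sketch: package a command-line pipeline and its
# dependencies into a shareable image. The base image, tools, and
# script name below are assumptions, not course materials.
FROM ubuntu:22.04

# Install the command-line tools the pipeline depends on.
RUN apt-get update && apt-get install -y --no-install-recommends \
        bwa samtools bcftools \
    && rm -rf /var/lib/apt/lists/*

# Copy the pipeline script into the image and make it the entry point.
COPY run_pipeline.sh /usr/local/bin/run_pipeline.sh
RUN chmod +x /usr/local/bin/run_pipeline.sh

ENTRYPOINT ["/usr/local/bin/run_pipeline.sh"]
```

A colleague could then build the image with `docker build -t my-pipeline .` and run it against their own data with `docker run -v "$PWD/data:/data" my-pipeline`, without installing any of the pipeline's dependencies themselves.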