[BBC] COURSE - Bioinformatics using LINUX

Thu Apr 28 18:45:48 CEST 2016

Course:  Introduction to bioinformatics using LINUX

Instructor: Dr. Martin Jones

This course will run from 15th to 19th August 2016 at SCENE (the 
Scottish Centre for Ecology and the Natural Environment), Loch Lomond 
National Park, Glasgow.

Course overview: Most high-throughput bioinformatics work these days 
takes place on the Linux command line. The programs which do the 
majority of the computational heavy lifting — genome assemblers, read 
mappers, and annotation tools — are designed to work best when used 
with a command-line interface. Because the command line can be an 
intimidating environment, many biologists learn the bare minimum needed 
to get their analysis tools working. This means that they miss out on 
the power of Linux to customize their environment and automate many 
parts of the bioinformatics workflow. This course will introduce the 
Linux command line environment from scratch and teach students how to 
make the most of its tools to achieve a high level of productivity when 
working with biological data.

Availability: 15 places total.

Course programme

Monday 15th - Classes from 09:00 to 17:00 (approximately)
● Session 1 - The design of Linux
In the first session we briefly cover the design of Linux: how is it 
different from Windows/OS X and how is it best used? We'll then jump 
straight onto the command line and learn about the layout of the Linux 
filesystem and how to navigate it. We'll describe Linux's file 
permission system (which often trips up beginners), how paths work, and 
how we actually run programs on the command line. We'll learn a few 
tricks for using the command line more efficiently, and how to deal with 
programs that are misbehaving. We'll finish this session by looking at 
the built-in help system and how to read and interpret manual pages.
● Session 2 - System management
We'll first look at a few command line tools for monitoring the status 
of the system and keeping track of what's happening to processor power, 
memory, and disk space. We'll go over the process of installing new 
software from the built-in repositories (which is easy) and from source 
code downloads (which is trickier). We'll also introduce some tools for 
benchmarking software (measuring the time/memory requirements of 
processing large datasets).

Tuesday 16th - Classes from 09:00 to 17:00 (approximately)
● Session 3 - Manipulating tabular data
Many data types we want to work with in bioinformatics are stored as 
tabular plain text files, and here we learn all about manipulating 
tabular data on the command line. We'll start with simple things like 
extracting columns, filtering, sorting, and searching for text before 
moving on to more complex tasks like searching for duplicated values, 
summarizing large files, and combining simple tools into long commands.
● Session 4 - Constructing pipelines
In this session we will look at the various tools Linux has for 
constructing pipelines out of individual commands. Aliases, shell 
redirection, pipes, and shell scripting will all be introduced here. 
We'll also look at a couple of specific tools to help with running 
programs on multiple processors and with monitoring the progress of 
long-running tasks.

Wednesday 17th - Classes from 09:00 to 17:00 (approximately)
● Session 5 – EMBOSS
EMBOSS is a suite of bioinformatics command-line tools explicitly 
designed to work in the Linux paradigm. We'll get an overview of the 
different sequence data formats that we might expect to work with, and 
put what we learned about shell scripting to biological use by building 
a pipeline to compare codon usage across two collections of DNA 
sequences.
● Session 6 – Using a Linux server
Often in bioinformatics we'll be working on a Linux server rather than 
our own computer, typically because we need access to more computing 
power, or to specialized tools and datasets. In this session we'll learn 
how to connect to a Linux server and how to manage sessions. We'll also 
consider the various ways of moving data between a server and your 
own computer, and finish with a discussion of the considerations we have 
to make when working on a shared computer.

Thursday 18th - Classes from 09:00 to 17:00 (approximately)
● Session 7 – Combining methods
In the next two sessions (i.e. one full day) we'll put everything 
we have learned together and implement a workflow for next-gen sequence 
analysis. In this first session we'll carry out quality control on some 
paired-end Illumina data and map these reads to a reference genome. 
We'll then look at various approaches to automating this pipeline, 
allowing us to quickly do the same for a second dataset.
● Session 8 – Combining methods
The second part of the next-gen workflow is to call variants to identify 
SNPs between our two samples and the reference genome. We'll look at the 
VCF file format and figure out how to filter SNPs for read coverage and 
quality. By counting the number of SNPs between each sample and the 
reference we will try to figure out something about the biology of the 
two samples. We'll attempt to automate this analysis in various ways so 
that we could easily repeat the pipeline for additional samples.

Friday 19th - Classes from 09:00 to 16:00 (approximately)
● Session 9 – Customization
Part of the Linux design is that everything can be customized. This can 
be intimidating at first but, given that bioinformatics work is often 
fairly repetitive, can be used to good effect. Here we'll learn about 
environment variables, custom prompts, soft links, and ssh configuration: 
a collection of tools with modest capabilities, but which together 
can make life on the command line much more pleasant. In this last 
session there will also be time to continue working on the next-gen 
sequencing pipeline.

The afternoon of Friday 19th is reserved for finishing off the next-gen 
workflow exercise, working on your own datasets, or leaving early for 
travel.

The cost is £530 (+VAT) including lunches and course materials. An 
all-inclusive option is also available at £660 (+VAT); this includes 
breakfast, lunch, dinner, refreshments, accommodation and course 
materials. Participants will need a laptop with a recent version of 
LINUX installed.

Please send inquiries to oliverhooker at prstatistics.com or visit the 
website www.prstatistics.com

Please feel free to distribute this information anywhere you think 
suitable.

Other related courses - email for details oliverhooker at prstatistics.com
INTRODUCTION TO PYTHON FOR BIOLOGISTS (May)
ADVANCES IN DNA TAXONOMY USING R (August)
GENETIC DATA ANALYSIS USING R (August)
PHYLOGENETIC DATA ANALYSIS USING R (October)
LANDSCAPE (POPULATION) GENETIC DATA ANALYSIS USING R (October)

Upcoming courses - email for details oliverhooker at prstatistics.com
ADVANCING IN STATISTICAL MODELLING USING R (May)
TIME SERIES DATA ANALYSIS FOR ECOLOGISTS AND CLIMATOLOGISTS (May)
ADVANCES IN SPATIAL ANALYSIS OF MULTIVARIATE ECOLOGICAL DATA: THEORY AND 
PRACTICE (July)
INTRODUCTION TO BAYESIAN HIERARCHICAL MODELLING (August)
MODEL BASED MULTIVARIATE ANALYSIS OF ECOLOGICAL DATA USING R (October)
APPLIED BAYESIAN MODELLING FOR ECOLOGISTS AND EPIDEMIOLOGISTS (October)
SPATIAL ANALYSIS OF ECOLOGICAL DATA USING R (November)

Dates still to be confirmed - email for details 
oliverhooker at prstatistics.com
STABLE ISOTOPE MIXING MODELS USING SIAR, SIBER AND MIXSIAR USING R
INTRODUCTION TO R AND STATISTICS FOR BIOLOGISTS
BIOINFORMATICS FOR GENETICISTS AND BIOLOGISTS

Oliver Hooker

PR~Statistics
3/1
128 Brunswick Street
Glasgow
G1 1TF

+44 (0) 7966500340

www.prstatistics.com
www.prstatistics.com/organiser/oliver-hooker/