00:00:00
Hello everybody! My name is Mehrbod Estaki.
I’m a postdoctoral researcher in the Knight Lab
00:00:06
at the University of California San Diego. In this
video I’m going to give an overview of the process
00:00:12
of importing data into QIIME 2. We'll be mainly
focusing on importing raw sequence data and also
00:00:20
demultiplexing of these raw reads. In this video I
assume that you're already familiar with the core
00:00:26
concepts of QIIME 2 that were covered in a previous
video, especially with regards to the semantic
00:00:32
types and the file formats of QIIME 2 artifacts,
as these become pretty important components of
00:00:40
importing files into QIIME 2. Okay. This is a
basic overview diagram of a simple workflow in
00:00:47
QIIME 2. This workflow starts at the top left with
raw sequences and goes through the demultiplexing
00:00:55
of sequences, denoising of our reads to form a
feature table and representative sequence files.
00:01:02
Then it creates a phylogenetic tree, as well
as a taxonomy file, and finally it runs some
00:01:08
diversity analysis to produce some nice results
and visualizations. What is nice about QIIME 2 is
00:01:16
that you can import your data at virtually any
of these steps in the pipeline and continue to
00:01:21
use the various available plugins downstream
without the need for the upstream files.
00:01:28
So, for example, if a collaborator gives you a
BIOM table that they produced elsewhere - say,
00:01:34
in R - using a new in-house pipeline, then you
can simply import that BIOM table without the
00:01:40
need to access the original raw sequence files,
and you can just work your way downstream through
00:01:47
the available plugins and analyze the data
using the new BIOM table you've imported.
00:01:54
However, before you can do that, you need to
import your data into QIIME 2 in the correct
00:01:59
format. That is to say, you need to
know the type of file that you have,
00:02:04
as well as the format (if
needed) in which it was made.
00:02:12
In this series of videos, we're covering the
entire process starting from raw sequences,
00:02:17
so in this video I’m going to mainly focus on
describing the import process at the raw sequence
00:02:23
level - but I just wanted to re-emphasize that
importing can happen at any of these steps.
00:02:32
So before we start I just wanted
to give a disclaimer here - and
00:02:36
this is purely my opinion, based on my
experience working with a variety of data types,
00:02:43
previous QIIME 2 workshops, and answering
questions on the QIIME 2 forum for the last
00:02:49
few years - and I have found that importing raw
data is often the most confusing part of the
00:02:55
QIIME 2 pipeline for new users. I personally
found this to be true with any bioinformatics
00:03:01
software that I've used, so this is definitely
not exclusive just to QIIME 2 by any means. The
00:03:07
main reason why this step can be confusing is that
there are many, many forms your data can take, with
00:03:14
tens to hundreds of variations in existence to date,
and only one of those is the correct one for any
00:03:21
given dataset. So you really need to know what type
of data you have, and what format it's in. And so
00:03:29
this is often a difficult concept for new users
who may not be familiar with this type of data.
00:03:39
And just so you don't think I’m exaggerating when
I say there can be hundreds of different data
00:03:43
formats, this is a list of 46 importable types and
75 importable formats which my QIIME 2 environment
00:03:54
currently recognizes - and of course, there can be
even more depending on whether you have any additional
00:04:00
third-party QIIME 2 plugins installed. I show this
not to intimidate you, but rather to reassure you
00:04:10
that if you do find yourself struggling
with this portion of the pipeline
00:04:14
when you're analyzing your own data for the
first time, just know that you're not alone.
00:04:19
The good news is that once you have imported
data into QIIME 2, basically the hardest
00:04:24
part is over and everything downstream is
much simpler and easier to find help with.
00:04:33
Now while there's no automatic way of detecting
your data type and format, there is at least
00:04:40
one resource that I know of that may be useful in
helping you make sense of your data. This is an
00:04:47
example of an excellent quick reference flow
chart made by Nick Bokulich. This can be found
00:04:54
on the QIIME 2 forum in the link provided
here. This can help identify which type
00:05:02
and data format your input files
are in for the majority of cases.
00:05:08
Perhaps not all of them, but it will do
a pretty good job for most of the cases.
00:05:18
Okay, now let's take a closer look here at the
process involved in importing your raw sequences
00:05:26
into QIIME 2. So far up to this point you have
completed your carefully designed experiment,
00:05:33
you have collected your samples and extracted
DNA from them, you've amplified your target
00:05:39
gene of interest (for example, in this case the
V4 region of the 16S rRNA gene), and you've added
00:05:46
unique barcodes to the reads from each sample -
and of course, ever so carefully recorded those
00:05:54
per-sample barcodes in your metadata file.
00:05:59
Next you'll pool all of these different
samples together and run them through your
00:06:05
sequencing machine. The sequencing machine
then performs its magic and gives you
00:06:10
some outputs in the form of FASTQ files. I’ll
describe FASTQ files in more detail a bit later,
00:06:18
but essentially these are the files which contain
the actual sequence information of your reads.
00:06:24
You'll have one FASTQ file that holds
information about the actual sequences,
00:06:29
and another FASTQ file that is
specific to the barcode sequences.
00:06:34
For simplicity's sake, this example is just
demonstrating sequences of the forward reads only,
00:06:41
but if you have paired-end data - that is to say,
if you sequenced the reverse reads as well -
00:06:46
then you will receive an additional FASTQ file
that holds information on the reverse read.
00:06:53
So what is a FASTQ file? Well, you can think
of a FASTQ file as essentially a text file that
00:07:01
holds various information about your sequences
in a somewhat standardized format. These files
00:07:08
are built on the older FASTA format that has been
around for many years (most popularly perhaps used
00:07:15
with the 454 pyrosequencing platform), and the
major difference between them is that FASTQ files
00:07:22
hold additional information about the quality of
each base call. This is in fact where we get the
00:07:29
"Q" in the FASTQ name, which tells us that this is
a FASTA file with quality scores. In a FASTQ file,
00:07:39
each read is described in exactly four lines.
The first line, that starts with the @ symbol,
00:07:48
is a sequence identifier. This line is not
really something that is well standardized,
00:07:55
so what is written here can vary
between sequencing facilities,
00:07:59
but it generally holds some information about the
run: the equipment ID, the run ID, the lane number,
00:08:08
perhaps the date, and so on. The second line
is your actual sequence. In this example,
00:08:16
this is the DNA sequence of one of our amplicons.
But of course if this was our barcode FASTQ file,
00:08:23
for example, then this would simply correspond to
the seven or eight nucleotide long unique barcode.
00:08:34
The third line, denoted by the + sign here, is
a placeholder line, which can technically
00:08:41
hold a variety of information that you may
want to include. But these days you mainly see
00:08:49
just a plus sign indicating that this is a
placeholder line. Finally, the fourth line
00:08:57
holds quality scores corresponding to the
sequences. These quality scores are coded
00:09:04
using a series of ASCII characters, which are then
translated into numerical values downstream. These
00:09:13
quality scores, also known as Phred scores or Q
scores, are calculated by the sequencing machine
00:09:21
and they tell us about the quality of our
nucleotides in terms of error probabilities.
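[Supplementary note] The four-line record layout just described can be sketched as a small parser. This is only an illustration, not how QIIME 2 itself reads FASTQ files; the names `FastqRecord` and `read_fastq` and the two example reads are made up for this sketch.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading "@"
    sequence: str    # line 2, the nucleotides
    plus: str        # line 3, usually just "+"
    quality: str     # line 4, one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record for every group of four lines."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        header = header.rstrip("\n")
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, plus, quality)


# Two made-up reads, purely for illustration.
example = (
    "@read1 made-up header\nACGTACGT\n+\n????IIII\n"
    "@read2 made-up header\nTTGACCAA\n+\nIIII????\n"
)
records = list(read_fastq(io.StringIO(example)))
for record in records:
    print(record.identifier, record.sequence, record.quality)
```

Real FASTQ files are usually gzip-compressed, so in practice the handle would typically come from `gzip.open(path, "rt")` rather than an in-memory string.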
00:09:28
So for example, the question mark character
(corresponding to the ASCII code 63 here)
00:09:35
translates to a quality score of 30. A quality
score of 30 indicates that the probability
00:09:44
of the corresponding nucleotide being incorrect is
one in one thousand, or is 99.9 percent accurate.
00:09:55
In other words, if we saw, for example, a G
nucleotide in our reads, the likelihood of that
00:10:03
G having been called in error - when in
fact it should have been, let's say, a C -
00:10:10
is one in one thousand. These quality scores are
very important and become a crucial component of
00:10:19
our quality control and filtering steps
that you'll learn more about in later videos.
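[Supplementary note] The arithmetic behind the question mark example can be sketched in a few lines. The offset of 33 is an assumption (the common Phred+33 encoding), chosen because it is consistent with ASCII code 63 mapping to a score of 30; the error probability uses the standard Phred relationship P = 10^(-Q/10). The function names are made up for this sketch.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 encoding, so "?" (ASCII 63) -> 30


def phred_score(char: str) -> int:
    """Translate one quality character into its numerical Q score."""
    return ord(char) - PHRED_OFFSET


def error_probability(q: int) -> float:
    """Probability that a base call with score q is incorrect."""
    return 10 ** (-q / 10)


q = phred_score("?")
p = error_probability(q)
print(q, p)

# Decoding a whole quality string, one score per base.
scores = [phred_score(c) for c in "?I5"]
print(scores)
```

Here the question mark gives Q = 30 and an error probability of one in one thousand, matching the example above; the characters "I" and "5" decode to 40 and 20 under the same assumed offset.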
00:10:26
Okay. Now that we have a better understanding
of our FASTQ files, let's go back to our example
00:10:33
data. So again, here we have on the left
our sample metadata file, which contains
00:10:39
information about each of our samples, including
their unique barcode. In the middle we have a
00:10:46
FASTQ file for our barcodes. And on the right,
we have another FASTQ file for our actual
00:10:55
sequences. What is important to emphasize here is
that at this point our data is still multiplexed,
00:11:03
meaning that all of our sequencing data is
contained in one location and in one file,
00:11:09
and the reads are not linked to their original sample
source yet. And what we ultimately want is to
00:11:17
take all of our sequences from the multiplexed
FASTQ file and demultiplex them so that
00:11:26
each sequence is paired with the sample it
originally came from. In other words, we want
00:11:32
all of the orange sequences to be paired with
our orange sample, the blue sequences
00:11:38
with the blue sample, and so on. The way we can
achieve this is by simply mapping the sequences
00:11:47
back to their sample of origin using those
unique barcodes that we added at the beginning.
00:11:54
So this is a very simple overview of how the
demultiplexing process actually works. We
00:12:01
start from the right with our sequence FASTQ
file. We take the first read we see there,
00:12:08
and now we move to the barcode FASTQ file
and match it with the first read we see
00:12:14
in that file. It's worth pointing out that the
order of sequences between these two FASTQ files
00:12:22
is paired, and they are always matched when they
are produced by the sequencing machine. So what I
00:12:29
mean is that read number one in our sequence file
will always correspond to read number one in the
00:12:37
barcode file. Same thing with read number 2 in
the sequence file, it will always correspond to
00:12:44
read number 2 in the barcode file, and so on.
Now in our barcode file, we read the unique
00:12:52
barcode identifier that is associated with
that read, and finally we can map that barcode
00:12:59
using our metadata file and identify exactly
which sample that barcode corresponds
00:13:06
to - meaning what sample our original read came
from. In this case, the orange sample was our
00:13:18
original sample source. So we repeat
this demultiplexing process for each read
00:13:24
until all of our reads have been
assigned to one of our samples.
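[Supplementary note] The read-by-read matching just described can be sketched as follows. This is a simplified illustration and not QIIME 2's implementation: it assumes exact barcode matches, holds everything in memory, and uses made-up barcodes, reads, and a hypothetical `demultiplex` function. It relies on the point made above that read number N in the sequence file corresponds to read number N in the barcode file.

```python
from collections import defaultdict


def demultiplex(sequence_reads, barcode_reads, barcode_to_sample):
    """Assign each read to a sample using the barcode at the same position."""
    if len(sequence_reads) != len(barcode_reads):
        raise ValueError("sequence and barcode files must hold the same number of reads")
    per_sample = defaultdict(list)
    unassigned = []  # reads whose barcode is not listed in the metadata
    for sequence, barcode in zip(sequence_reads, barcode_reads):
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            unassigned.append(sequence)
        else:
            per_sample[sample].append(sequence)
    return dict(per_sample), unassigned


# Made-up metadata (barcode -> sample) and reads, purely for illustration.
barcode_to_sample = {"ACGTACGT": "orange", "TTGCATGC": "blue"}
sequence_reads = ["GGCTAACT", "TTAGGCAT", "CCATGGTA", "AAGTCCGT"]
barcode_reads = ["ACGTACGT", "TTGCATGC", "ACGTACGT", "GGGGGGGG"]

per_sample, unassigned = demultiplex(sequence_reads, barcode_reads, barcode_to_sample)
print(per_sample)
print(unassigned)
```

Each key of `per_sample` plays the role of one per-sample FASTQ file, and reads with an unrecognized barcode are set aside rather than assigned to a sample.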
00:13:30
When the demultiplexing process is complete,
00:13:34
instead of having one FASTQ file for all of our
samples, we'll now have one FASTQ file per sample.
00:13:43
We no longer need our barcode file because we
have already extracted the information that we
00:13:49
needed from it. So this is what we refer to
as a demultiplexed file. Of course, when you
00:13:56
are working within the QIIME 2 environment, you
will only actually see a single QIIME 2 artifact.
00:14:04
However, we now know that the
underlying structure of that artifact
00:14:09
is a series of individual FASTQ files. In fact,
if you were to export this artifact at this point,
00:14:17
you would see that each of these FASTQ files
exists as a separate file within it.
00:14:25
And of course again if you have sequenced
paired-end data then you will have two FASTQ
00:14:31
files per sample: one corresponding to the
forward reads, and the other to your reverse
00:14:36
reads. It is also worth mentioning here that,
at this point, if you have paired-end reads,
00:14:44
your forward and reverse reads are still not
joined together. That is something that happens
00:14:50
at a later step. At this point they are simply
kept as a pair, but in separate files.
00:15:02
Now depending on the sequencing facility where
your data came from, you may receive your data
00:15:08
in either multiplexed form such as in our
example here, or demultiplexed. So if you
00:15:16
receive a multiplexed file, then of course
you will need to demultiplex your reads,
00:15:21
but if you receive them already demultiplexed then
you can simply skip this demultiplexing process
00:15:28
in QIIME 2 and you can just move forward
to the next step, which will be denoising
00:15:34
or clustering. The easiest way to know if you
have received multiplexed or demultiplexed data
00:15:41
from your facility is by simply looking
to see if you have received one FASTQ file
00:15:48
that contains all of your samples, or if you
have separate FASTQ files for each sample.
00:15:56
So this concludes the lecture on
importing and demultiplexing reads in QIIME 2.
00:16:03
In the next section, you will actually
get hands-on experience importing
00:16:08
your raw FASTQ files into QIIME 2. We'll see you
again in the next video in this tutorial series
00:16:16
which will be about denoising or clustering
your data. Thank you very much for joining!