QIIME 2 is an open-source bioinformatics tool used to analyze microbial community data.

What does demultiplexing mean in QIIME 2?

Demultiplexing is the process of assigning each sequence read back to the sample it came from, based on the unique barcodes added to each sample.

What is a FASTQ file?

A FASTQ file is a text file that holds sequence data together with a quality score for every base in each sequence.
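The four-line record layout described in the transcript (identifier line starting with `@`, sequence, `+` placeholder, quality string) can be read with a few lines of Python. This is only a minimal sketch, assuming uncompressed text and exactly four lines per read; the names `FastqRecord` and `read_fastq` are hypothetical and are not part of QIIME 2.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading "@"
    sequence: str    # line 2, the nucleotides
    quality: str     # line 4, one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Two made-up reads, used only to exercise the parser.
example = io.StringIO("@read1\nACGT\n+\n????\n@read2\nGGCA\n+\nIIII\n")
records = list(read_fastq(example))
```

Real sequencing output is usually gzip-compressed, so in practice the handle would more likely come from `gzip.open(path, "rt")`.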

Which steps make up the QIIME 2 workflow?

The workflow starts from raw sequences and proceeds through demultiplexing and denoising (which yields a feature table and representative sequences), then builds a phylogenetic tree, assigns taxonomy, and runs diversity analyses.

What should you consider when importing data into QIIME 2?

You need to know the type of file you have and the format it is in, because data can exist in many variations and there is no automatic way to detect them.

How can you tell whether data is multiplexed or demultiplexed?

If all samples are in a single FASTQ file, the data is multiplexed; if there is a separate FASTQ file per sample, it is demultiplexed.

What are quality scores in FASTQ files?

Quality scores (also called Phred or Q scores) express the error probability of each base call and are central to quality control and filtering.
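The transcript's example (the `?` character, ASCII 63, meaning Q30 and a 1-in-1000 error probability) is consistent with the common Phred+33 encoding, where Q = ASCII code - 33 and the error probability is 10^(-Q/10). The sketch below assumes that offset; older data may use a different one, and the function names are hypothetical.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to numeric scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a base call with quality score q is wrong."""
    return 10 ** (-q / 10)


scores = phred_scores("?5I")   # ASCII codes 63, 53, 73
p_q30 = error_probability(30)  # about 0.001, i.e. 1 in 1000
```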

What happens after import and demultiplexing in QIIME 2?

After these steps, you move on to denoising and clustering the data.

Importing and demultiplexing

00:16:22

https://www.youtube.com/watch?v=QMqKd7HGBbQ

Summary

TL;DR: The video gives an overview of importing sequence data into QIIME 2, focusing on raw data and demultiplexing. Because data can arrive in many file types and formats, you need to know which ones you have in order to import correctly. The presenter describes importing as often the most confusing step for new users, since there is no automatic way to detect the data type and format. Data can be imported at virtually any step of the pipeline, after which downstream analyses no longer need the upstream files. The video explains how to recognize and work with FASTQ files, including their structure and quality scores, and how demultiplexing pairs sequences with their original samples using barcodes. Once the data is imported correctly, the hardest part is over, and the following steps, such as denoising, are more straightforward.

Key takeaways

📊 QIIME 2 is used for microbial data analysis.
🔀 Demultiplexing assigns sequences back to their original samples.
📁 FASTQ files contain sequence and quality data.
🧬 The workflow starts with raw sequences and ends with diversity analysis.
🎛️ Knowing the correct data type and format is essential for import.
🔍 Demultiplexed data means a separate FASTQ file per sample.
⚠️ Importing is often the most confusing step for new users.
🗂️ Once the data is imported, downstream analysis is simpler.
🔢 Quality scores in FASTQ files are essential for quality control and filtering.
📖 The video walkthrough supports hands-on practice with QIIME 2.

Timeline

00:00:00 - 00:05:00
Mehrbod Estaki introduces the process of importing data into QIIME 2, focusing on raw sequence data and demultiplexing. He stresses the importance of understanding QIIME 2's semantic types and file formats, and explains that data can be imported at virtually any step of the workflow and then analyzed with the available downstream plugins. He notes that importing raw data is often the most confusing part for new users because of the many formats data can exist in.
00:05:00 - 00:10:00
To explain FASTQ files, he describes how each read has a sequence identifier, the sequence itself, and quality scores, which are crucial for quality control and filtering. These scores indicate the accuracy of each nucleotide. He also describes how a run yields one FASTQ file with the sequences and another with the corresponding barcode sequences, and how reads are matched to their original sample via the unique barcodes.
00:10:00 - 00:16:22
Demultiplexing is explained as matching sequences to their original sample using barcodes. Once the data is demultiplexed, each sample has its own FASTQ file. He points out that data may arrive either multiplexed or demultiplexed, which determines how to proceed in QIIME 2. Import and demultiplexing prepare the data for further processing such as denoising and clustering.

Subtitles

00:00:00
Hello everybody! My name is Mehrbod Estaki. I’m a postdoctoral researcher in the Knight Lab
00:00:06
at the University of California San Diego. In this video I’m going to give an overview of the process
00:00:12
of importing data into QIIME 2. We'll be mainly focusing on importing raw sequence data and also
00:00:20
demultiplexing of these raw reads. In this video I assume that you're already familiar with the core
00:00:26
concepts of QIIME 2 that were covered in a previous video, especially with regards to the semantic
00:00:32
types and the file formats of QIIME 2 artifacts, as these become pretty important components of
00:00:40
importing files into QIIME 2. Okay. This is a basic overview diagram of a simple workflow in
00:00:47
QIIME 2. This workflow starts at the top left with raw sequences and goes through the demultiplexing
00:00:55
of sequences, denoising of our reads to form a feature table and representative sequence files.
00:01:02
Then it creates a phylogenetic tree, as well as a taxonomy file, and finally it runs some
00:01:08
diversity analysis to produce some nice results and visualizations. What is nice about QIIME 2 is
00:01:16
that you can import your data at virtually any of these steps in the pipeline and continue to
00:01:21
use the various available plugins downstream without the need for the upstream files.
00:01:28
So, for example, if a collaborator gives you a BIOM table that they produced elsewhere (say,
00:01:34
in R) using a new in-house pipeline, then you can simply import that BIOM table without the
00:01:40
need to access the original raw sequence files, and you can just work your way downstream through
00:01:47
the available plugins and analyze the data using the new BIOM table you've imported.
00:01:54
However, before you can do that, you need to import your data into QIIME 2 in the correct
00:01:59
format. That is to say, you need to know the type of file that you have,
00:02:04
as well as the format (if needed) in which it was made.
00:02:12
In this series of videos, we're covering the entire process starting from raw sequences,
00:02:17
so in this video I’m going to mainly focus on describing the importing process from raw sequence
00:02:23
levels - but I just wanted to re-emphasize that importing can happen at any of these steps.
00:02:32
So before we start I just wanted to give a disclaimer here -and
00:02:36
this is purely my opinion, based on my experience working with a variety of data types,
00:02:43
previous QIIME 2 workshops, and answering questions on the QIIME 2 forum for the last
00:02:49
few years - and I have found that the step of importing raw data is often the most confusing part of the
00:02:55
QIIME 2 pipeline for new users. I personally found this to be true with any bioinformatics
00:03:01
software that I've used, so this is definitely not exclusive just to QIIME 2 by any means. The
00:03:07
main reason why this step can be confusing is that there are many many ways your data can exist, in
00:03:14
tens to hundreds of variations that exist to date, and only one of those is the correct one for any
00:03:21
given data. So you really need to know what type of data you have, and what format it's in. And so
00:03:29
this is often a difficult concept for new users who may not be familiar with this type of data.
00:03:39
And just so you don't think I’m exaggerating when I say there can be hundreds of different data
00:03:43
formats, this is a list of 46 importable types and 75 importable formats which my QIIME 2 environment
00:03:54
currently recognizes - and of course, there can be even more depending on if you have any additional
00:04:00
third party QIIME 2 plugins installed. I show this not to intimidate you, but rather to reassure you
00:04:10
that if you do find yourself struggling with this portion of the pipeline
00:04:14
when you're analyzing your own data for the first time, just know that you're not alone.
00:04:19
The good news is that once you have imported data into QIIME 2, basically the hardest
00:04:24
part is over and everything downstream is much simpler and easier to find help with.
00:04:33
Now while there's no automatic way of detecting your data type and format, there is at least
00:04:40
one resource that I know of that may be useful in helping you make sense of your data. This is an
00:04:47
example of an excellent quick reference flow chart made by Nick Bokulich. This can be found
00:04:54
on the QIIME 2 forum in the link provided here. This can help identify which type
00:05:02
and data format your input files may be in the majority of cases.
00:05:08
Perhaps not all of them, but it will do a pretty good job for most of the cases.
00:05:18
Okay, now let's take a closer look here at the process involved in importing your raw sequences
00:05:26
into QIIME 2. So far up to this point you have completed your carefully designed experiment,
00:05:33
you have collected your samples and extracted DNA from them, you've amplified your target
00:05:39
gene of interest (for example, in this case the V4 region of the 16S rRNA gene), and you've added
00:05:46
unique barcodes to the reads from each sample - and of course, ever so carefully recorded those
00:05:54
per sample barcodes in your metadata file.
00:05:59
Next you'll pool all of these different samples together and run them through your
00:06:05
sequencing machine. The sequencing machine then performs its magic and gives you
00:06:10
some outputs in the form of FASTQ files. I’ll describe FASTQ files in more detail a bit later,
00:06:18
but essentially these are the files which contain the actual sequence information of your reads.
00:06:24
You'll have one FASTQ file that holds information about the actual sequences,
00:06:29
and another FASTQ file that is specific to the barcode sequences.
00:06:34
For simplicity's sake, this example is just demonstrating sequences of the forward reads only,
00:06:41
but if you have paired-end data (that is to say, if you sequenced the reverse reads as well)
00:06:46
then you will receive an additional FASTQ file that holds information on the reverse read.
00:06:53
So what is a FASTQ file? Well you can think of FASTQ files as essentially a text file that
00:07:01
holds various information about your sequences in a somewhat standardized format. These files
00:07:08
are built on the older FASTA format that has been around for many years (most popularly perhaps used
00:07:15
with the 454 pyrosequencing platform), and the major difference between them is that FASTQ files
00:07:22
hold additional information about the quality of each base call. This is in fact where we get the
00:07:29
"Q" in the FASTQ name, which tells us that this is a FASTA file with quality scores. In a FASTQ file,
00:07:39
each read is described in exactly four lines. The first line, that starts with the @ symbol,
00:07:48
is a sequence identifier. This line is not really something that is well standardized,
00:07:55
so what is written here can vary from different sequencing facilities,
00:07:59
but it generally holds some information about the run, the equipment ID, the run ID, lane number,
00:08:08
perhaps the date, and so on. The second line is your actual sequences. In this example,
00:08:16
this is our DNA sequences from our amplicons. But of course if this was our barcode FASTQ file,
00:08:23
for example, then this would simply correspond to the seven or eight nucleotide long unique barcode.
00:08:34
The third line denoted by the + sign here, is a plus placeholder line, which can technically
00:08:41
hold a variety of information that you may want to include. But these days you mainly see
00:08:49
just a plus sign indicating that this is a placeholder line. Finally, the fourth line
00:08:57
holds quality scores corresponding to the sequences. These quality scores are coded
00:09:04
using a series of ASCII characters, which are then translated into numerical values downstream. These
00:09:13
quality scores, also known as Phred scores or Q scores, are calculated by the sequencing machine
00:09:21
and they tell us about the quality of our nucleotides in terms of error probabilities.
00:09:28
So for example, the question mark character (corresponding to the ASCII code 63 here)
00:09:35
translates to a quality score of 30. A quality score of 30 indicates that the probability
00:09:44
of the corresponding nucleotide being incorrect is one in one thousand, or is 99.9 percent accurate.
00:09:55
In other words, if we saw, for example, a G nucleotide in our reads, the likelihood of that
00:10:03
G having been called G by error -and in fact it was meant to be let's say a C-
00:10:10
is one in one thousand. These quality scores are very important and become a crucial component of
00:10:19
our quality control and filtering steps that you'll learn more about in later videos.
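To give a feel for how quality scores can drive filtering, here is a minimal sketch of one simple rule: cut a read at the first base whose score falls below a threshold, and drop the read if too little remains. This rule, the thresholds `min_q` and `min_length`, the Phred+33 offset, and the function name are all assumptions made for illustration; this is not the filter that QIIME 2 or its denoising plugins actually apply.

```python
from typing import Optional


def truncate_at_low_quality(
    sequence: str,
    quality: str,
    min_q: int = 20,      # hypothetical quality threshold
    min_length: int = 3,  # hypothetical minimum length to keep a read
    offset: int = 33,     # Phred+33 encoding assumed
) -> Optional[str]:
    """Cut the read at the first base below min_q; return None if too short."""
    for position, ch in enumerate(quality):
        if ord(ch) - offset < min_q:
            sequence = sequence[:position]
            break
    return sequence if len(sequence) >= min_length else None


kept = truncate_at_low_quality("ACGTAC", "IIII#I")  # '#' is Q2, so cut there
dropped = truncate_at_low_quality("ACGT", "I#II")   # only one base survives
untouched = truncate_at_low_quality("ACGT", "IIII")
```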
00:10:26
Okay. Now that we have a better understanding of our FASTQ files, let's go back to our example
00:10:33
data. So again, here we have on the left our sample metadata file, which contains
00:10:39
information about each of our samples, including their unique barcode. In the middle we have a
00:10:46
FASTQ file for our barcodes. And on the right hand, we have another FASTQ file for our actual
00:10:55
sequences. What is important to emphasize here is that at this point our data is still multiplexed,
00:11:03
meaning that all of our sequencing data is contained in one location and in one file,
00:11:09
and they are not linked to their original sample source yet. And what we ultimately want is to
00:11:17
group all of our sequences from the multiplexed FASTQ file and we want to demultiplex them so that
00:11:26
each sequence is paired with the sample it originally came from. In other words, we want
00:11:32
all of the orange sequences to be paired with our orange sample, the blue sequences
00:11:38
with the blue sample, and so on. The way we can achieve this is by simply mapping the sequences
00:11:47
back to their sample of origin using those unique barcodes that we added at the beginning.
00:11:54
So this is a very simple overview of how the demultiplexing process actually works. We
00:12:01
start from the right with our sequence FASTQ files. We take the first read we see there,
00:12:08
and now we move to the barcode FASTQ file and match it with the first read we see
00:12:14
in that file. It's worth pointing out that the order of sequences between these two FASTQ files
00:12:22
is paired, and they are always matched when they are produced by the sequencing machine. So what I
00:12:29
mean is that read number one in our sequence file will always correspond to read number one in the
00:12:37
barcode file. Same thing with read number 2 in the sequence file, it will always correspond to
00:12:44
read number 2 in the barcode file, and so on. Now in our barcode file, we read the unique
00:12:52
barcode identifier that is associated with that read, and finally we can map that barcode
00:12:59
using our metadata file and identify exactly which sample that barcode corresponds
00:13:06
to - meaning what sample our original read came from. In this case, the orange sample was our
00:13:18
original sample source. So we repeat this demultiplexing process for each read
00:13:24
until all of our reads have been assigned to one of our samples.
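The walk-through above can be sketched in Python: step through the sequence file and the barcode file in the same order, look up each barcode in the metadata, and collect the read under the matching sample. This is a minimal illustration rather than how QIIME 2 implements demultiplexing. The function name, the tuple layout, and the example barcodes are hypothetical, and setting aside reads whose barcode is not in the metadata is an added assumption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Read = Tuple[str, str, str]  # (identifier, sequence, quality)


def demultiplex(
    sequence_reads: List[Read],
    barcode_reads: List[Read],
    barcode_to_sample: Dict[str, str],
) -> Tuple[Dict[str, List[Read]], List[Read]]:
    """Group reads by sample, relying on both files being in the same order."""
    if len(sequence_reads) != len(barcode_reads):
        raise ValueError("sequence and barcode files must have equal read counts")
    per_sample: Dict[str, List[Read]] = defaultdict(list)
    unassigned: List[Read] = []
    # Read number i in the sequence file pairs with read number i in the
    # barcode file, so a plain zip is enough to line them up.
    for read, barcode_read in zip(sequence_reads, barcode_reads):
        barcode = barcode_read[1]
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            unassigned.append(read)  # barcode not listed in the metadata
        else:
            per_sample[sample].append(read)
    return dict(per_sample), unassigned


# Made-up data: two samples, four reads, one unknown barcode.
metadata = {"AAAC": "orange", "CCGT": "blue"}
sequences = [("r1", "ACGTAC", "IIIIII"), ("r2", "TTGACA", "IIIIII"),
             ("r3", "GGCATG", "IIIIII"), ("r4", "CATCAT", "IIIIII")]
barcodes = [("r1", "AAAC", "IIII"), ("r2", "CCGT", "IIII"),
            ("r3", "AAAC", "IIII"), ("r4", "GGGG", "IIII")]
by_sample, leftover = demultiplex(sequences, barcodes, metadata)
```

Each list in `by_sample` corresponds to what would become one per-sample FASTQ file.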
00:13:30
When the demultiplexing process is complete,
00:13:34
instead of having one FASTQ for all of our samples, we'll now have one FASTQ file per sample.
00:13:43
We no longer need our barcode file because we have already extracted the information that we
00:13:49
used from them. So this is what we refer to as a demultiplexed file. Of course, when you
00:13:56
are working within the QIIME 2 environment, you will only actually see a single QIIME 2 artifact.
00:14:04
However, we now know that the underlying structure of that artifact
00:14:09
is a series of individual FASTQ files. In fact, if you were to export this artifact at this point,
00:14:17
you'll see that all of these FASTQ files exist as a separate file within it.
00:14:25
And of course again if you have sequenced paired end data then you will have two FASTQ
00:14:31
files per sample: one corresponding to the forward reads, and the other to your reverse
00:14:36
reads. It is also worth mentioning here that, at this point, if you have paired end reads,
00:14:44
your forward and reverse reads are still not joined together. That is something that happens
00:14:50
at a later step. At this point they are simply held paired together but as separate files.
00:15:02
Now depending on the sequencing facility where your data came from, you may receive your data
00:15:08
in either multiplexed form such as in our example here, or demultiplexed. So if you
00:15:16
receive the multiplexed file, then of course you will need to demultiplex your reads,
00:15:21
but if you receive them already demultiplexed then you can simply skip this demultiplexing process
00:15:28
in QIIME 2 and you can just move forward to the next step, which will be denoising
00:15:34
and clustering. The easiest way to know if you have received multiplexed or demultiplexed data
00:15:41
from your facility is by simply looking to see if you have received one FASTQ file
00:15:48
that contains all of your samples, or if you have separate FASTQ files for each sample.
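That check can be written down as a rough heuristic: compare the number of sequence FASTQ files you received with the number of samples in your metadata. The sketch below is only an illustration under stated assumptions. It treats any file with "barcode" in its name as a barcode file, which is a naming convention assumed here and not a rule, and the function name is hypothetical.

```python
from typing import List


def describe_layout(
    fastq_files: List[str], n_samples: int, paired_end: bool = False
) -> str:
    """Guess whether a delivery looks multiplexed or demultiplexed."""
    files_per_unit = 2 if paired_end else 1  # forward only, or forward + reverse
    # Assumption: barcode files can be recognized by their file name.
    sequence_files = [n for n in fastq_files if "barcode" not in n.lower()]
    if n_samples > 1 and len(sequence_files) == files_per_unit:
        return "multiplexed"
    if len(sequence_files) == n_samples * files_per_unit:
        return "demultiplexed"
    return "unclear"


pooled = describe_layout(["sequences.fastq.gz", "barcodes.fastq.gz"], n_samples=3)
split = describe_layout(
    ["orange.fastq.gz", "blue.fastq.gz", "green.fastq.gz"], n_samples=3
)
odd = describe_layout(["a.fastq.gz", "b.fastq.gz"], n_samples=3)
```

Anything that comes back as "unclear" is worth checking with the sequencing facility before importing.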
00:15:56
So this concludes the lecture tutorial on importing and demultiplexing reads in QIIME 2.
00:16:03
In the next section, you will get hands-on experience with how to import
00:16:08
your raw FASTQ files in QIIME 2. We'll see you again at the next video in this tutorial series
00:16:16
which will be about denoising or clustering your data. Thank you very much for joining!