Importing and demultiplexing

00:16:22
https://www.youtube.com/watch?v=QMqKd7HGBbQ

Summary

TL;DR: The video gives an overview of importing sequence data into QIIME 2, focusing on raw data and demultiplexing. Because data can come in many file types and formats, you need to know which ones you have in order to import correctly. Importing is described as one of the most challenging steps for new users, since there is no automatic way to detect file type and format. Once data is imported, downstream analyses can be run without the original files. The video explains how to identify and work with FASTQ files, including their structure and quality scores, and how demultiplexing matches sequences to their original samples using barcodes. Once the data is correctly imported, the hard part is over, and subsequent steps such as denoising are more straightforward.

Takeaways

  • 📊 QIIME 2 is used for microbial data analysis.
  • 🔀 Demultiplexing assigns sequences to the samples they came from.
  • 📁 FASTQ files contain sequence and quality data.
  • 🧬 The workflow starts with raw sequences and ends with diversity analysis.
  • 🎛️ Knowing the correct data type and format is essential for import.
  • 🔍 Demultiplexed data means separate files per sample.
  • ⚠️ Importing is often the most confusing part for new users.
  • 🗂️ After demultiplexing, downstream analysis is easier.
  • 🔢 Quality scores in FASTQ files are essential for assessing base-call accuracy.
  • 📖 The video walkthrough prepares you for hands-on work in QIIME 2.

Timeline

  • 00:00:00 - 00:05:00

    Mehrbod Estaki introduces the process of importing data into QIIME 2, focusing on raw sequence data and its demultiplexing. He stresses the importance of understanding QIIME 2's semantic types and file formats, and explains that data can be imported at virtually any step of the QIIME 2 workflow and carried forward with the available plugins. He notes that importing raw data is often the most confusing part for new users because of the many formats data can exist in.

  • 00:05:00 - 00:10:00

    To give a better understanding of FASTQ files, he explains that they contain sequence identifiers, the sequences themselves, and quality scores, which are crucial for quality control and filtering. These scores indicate the accuracy of each nucleotide call. He describes how the sequencer produces one FASTQ file of sequence reads and an accompanying FASTQ file of barcode sequences, and how reads are matched to their original sample via unique barcodes.

  • 00:10:00 - 00:16:22

    Demultiplexing is explained as the process of matching sequences to their original sample using barcodes. Once the data is demultiplexed, each sample has its own FASTQ file. He points out that it matters whether the data arrives multiplexed or demultiplexed, since that determines how to proceed in QIIME 2. Importing and demultiplexing prepare the data for further processing such as denoising and clustering.

Frequently Asked Questions

  • What is QIIME 2?

    QIIME 2 is an open-source bioinformatics tool used to analyze microbial community data.

  • What does demultiplexing mean in QIIME 2?

    Demultiplexing is the process of assigning sequence reads to the sample they originally came from, based on unique barcodes.

  • What is a FASTQ file?

    A FASTQ file is a text file that contains sequence data together with a quality score for each base in the sequence.

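  • Which steps are involved in the QIIME 2 workflow?

    The workflow starts from raw sequences, proceeds through demultiplexing and denoising (which yields a feature table and representative sequences), builds a phylogenetic tree and a taxonomy file, and ends with diversity analysis.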

  • What should you consider when importing data into QIIME 2?

    You need to know the file type and format of your data, because many variations exist and there is no automatic detection.

  • How can you tell whether data is multiplexed or demultiplexed?

    If all samples are in one FASTQ file, the data is multiplexed; if there are separate files per sample, it is demultiplexed.

  • What are quality scores in FASTQ files?

    Quality scores express the accuracy of each base call as an error probability and are important for quality control.

  • What happens after import and demultiplexing in QIIME 2?

    After these steps, you move on to denoising and clustering the data.
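
The FASTQ answers above can be made concrete with a small sketch. It reads the four-line records described in the video and converts quality characters into scores and error probabilities. It assumes the Phred+33 encoding, which matches the video's example of "?" (ASCII 63) meaning Q30; the names read_fastq, phred_scores, and error_probability and the example read are hypothetical and are not part of QIIME 2.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO

# Assumption: Phred+33 encoding, consistent with '?' (ASCII 63) meaning Q30.
PHRED_OFFSET = 33


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading '@'
    sequence: str    # line 2
    quality: str     # line 4, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four-line block of a FASTQ text stream."""
    while True:
        header = handle.readline()
        if header == "":          # end of file
            return
        if header.strip() == "":  # tolerate blank lines between records
            continue
        sequence = handle.readline().strip()
        plus = handle.readline().strip()
        quality = handle.readline().strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:].strip(), sequence, quality)


def phred_scores(quality: str) -> List[int]:
    """Translate each quality character into its numerical score."""
    return [ord(char) - PHRED_OFFSET for char in quality]


def error_probability(score: int) -> float:
    """Invert Q = -10 * log10(P) to get the base-call error probability P."""
    return 10 ** (-score / 10)


# Hypothetical single-read example, not taken from the video's data.
example = io.StringIO("@read1 run=demo lane=1\nGATTACA\n+\n??????5\n")
records = list(read_fastq(example))
scores = phred_scores(records[0].quality)
print(records[0].identifier, scores, [error_probability(q) for q in scores])
```

A score of 30 maps to an error probability of about one in one thousand, as stated in the video. Older data may use a different offset, so the constant would need checking against the sequencing facility's documentation.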

Subtitles
  • 00:00:00
    Hello everybody! My name is Mehrbod Estaki.  I’m a postdoctoral researcher in the Knight Lab
  • 00:00:06
    at the University of California San Diego. In this  video I’m going to give an overview of the process
  • 00:00:12
    of importing data into QIIME 2. We'll be mainly  focusing on importing raw sequence data and also
  • 00:00:20
    demultiplexing of these raw reads. In this video I  assume that you're already familiar with the core
  • 00:00:26
    concepts of QIIME 2 that was covered in a previous  video, especially with regards to the semantic
  • 00:00:32
    types and the file formats of QIIME 2 artifacts,  as these become pretty important components of
  • 00:00:40
    importing files into QIIME 2. Okay. This is a  basic overview diagram of a simple workflow in
  • 00:00:47
    QIIME 2. This workflow starts at the top left with  raw sequences and goes through the demultiplexing
  • 00:00:55
    of sequences, denoising of our reads to form a  feature table and representative sequence files.
  • 00:01:02
    Then it creates a phylogenetic tree, as well  as a taxonomy file, and finally it runs some
  • 00:01:08
    diversity analysis to produce some nice results  and visualizations. What is nice about QIIME 2 is
  • 00:01:16
    that you can import your data at virtually any  of these steps in the pipeline and continue to
  • 00:01:21
    use the various available plugins downstream  without the need for the upstream files.
  • 00:01:28
    So, for example, if a collaborator gives you a BIOM table that they produced elsewhere -say,
  • 00:01:34
    in R- using a new in-house pipeline, then you can simply import that BIOM table without the
  • 00:01:40
    need to access the original raw sequence files,  and you can just work your way downstream through
  • 00:01:47
    the available plugins and analyze the data using the new BIOM table you've imported.
  • 00:01:54
    However, before you can do that, you need to  import your data into QIIME 2 in the correct
  • 00:01:59
    format. That is to say, you need to  know the type of file that you have,
  • 00:02:04
    as well as the format (if  needed) in which it was made.
  • 00:02:12
    In this series of videos, we're covering the  entire process starting from raw sequences,
  • 00:02:17
    so in this video I’m going to mainly focus on  describing the importing process from raw sequence
  • 00:02:23
    levels - but I just wanted to re-emphasize that  importing can happen at any of these steps.
  • 00:02:32
    So before we start I just wanted  to give a disclaimer here -and
  • 00:02:36
    this is purely my opinion, based on my  experience working with a variety of data types,
  • 00:02:43
    previous QIIME 2 workshops, and answering  questions on the QIIME 2 forum for the last
  • 00:02:49
    few years - and I found out that the step of importing raw data is often the most confusing part of the
  • 00:02:55
    QIIME 2 pipeline for new users. I personally  found this to be true with any bioinformatics
  • 00:03:01
    software that I've used, so this is definitely  not exclusive just to QIIME 2 by any means. The
  • 00:03:07
    main reason why this step can be confusing is that  there are many many ways your data can exist, in
  • 00:03:14
    tens to hundreds of variations that exist to date,  and only one of those is the correct one for any
  • 00:03:21
    given data. So you really need to know what type  of data you have, and what format it's in. And so
  • 00:03:29
    this is often a difficult concept for new users who may not be familiar with this type of data.
  • 00:03:39
    And just so you don't think I’m exaggerating when  I say there can be hundreds of different data
  • 00:03:43
    formats, this is a list of 46 importable types and 75 importable formats which my QIIME 2 environment
  • 00:03:54
    currently recognizes - and of course, there can be  even more depending on if you have any additional
  • 00:04:00
    third party QIIME 2 plugins installed. I show this  not to intimidate you, but rather to reassure you
  • 00:04:10
    that if you do find yourself struggling  with this portion of the pipeline
  • 00:04:14
    when you're analyzing your own data for the  first time, just know that you're not alone.
  • 00:04:19
    The good news is that once you have imported  data into QIIME 2, basically the hardest
  • 00:04:24
    part is over and everything downstream is  much simpler and easier to find help with.
  • 00:04:33
    Now while there's no automatic way of detecting  your data type and format, there is at least
  • 00:04:40
    one resource that I know of that may be useful in  helping you make sense of your data. This is an
  • 00:04:47
    example of an excellent quick reference flow  chart made by Nick Bokulich. This can be found
  • 00:04:54
    on the QIIME 2 forum in the link provided  here. This can help identify which type
  • 00:05:02
    and data format your input files  may be in the majority of cases.
  • 00:05:08
    Perhaps not all of them, but it will do  a pretty good job for most of the cases.
  • 00:05:18
    Okay, now let's take a closer look here at the  process involved in importing your raw sequences
  • 00:05:26
    into QIIME 2. So far up to this point you have  completed your carefully designed experiment,
  • 00:05:33
    you have collected your samples and extracted  DNA from them, you've amplified your target
  • 00:05:39
    gene of interest (for example, in this case the V4 region of the 16S rRNA gene), and you've added
  • 00:05:46
    unique barcodes to the reads from each sample -  and of course, ever so carefully recorded those
  • 00:05:54
    per sample barcodes in your metadata file.
  • 00:05:59
    Next you'll pool all of these different samples together and run it through your
  • 00:06:05
    sequencing machine. The sequencing machine  then performs its magic and gives you
  • 00:06:10
    some outputs in the form of FASTQ files. I’ll  describe FASTQ files in more detail a bit later,
  • 00:06:18
    but essentially these are the files which contain  the actual sequence information of your reads.
  • 00:06:24
    You'll have one FASTQ file that holds  information about the actual sequences,
  • 00:06:29
    and another FASTQ file that is  specific to the barcode sequences.
  • 00:06:34
    For simplicity's sake, this example is just  demonstrating sequences of the forward reads only,
  • 00:06:41
    but if you have paired end data -that is to say, if you sequenced the reverse reads as well-
  • 00:06:46
    then you will receive an additional FASTQ file  that holds information on the reverse read.
  • 00:06:53
    So what is a FASTQ file? Well you can think  of FASTQ files as essentially a text file that
  • 00:07:01
    holds various information about your sequences  in a somewhat standardized format. These files
  • 00:07:08
    are built on the older FASTA format that has been  around for many years (most popularly perhaps used
  • 00:07:15
    with the 454 pyrosequencing platform), and the  major difference between them is that FASTQ files
  • 00:07:22
    hold additional information about the quality of  each base call. This is in fact where we get the
  • 00:07:29
    "Q" in the FASTQ name, which tells us that this is  a FASTA file with quality scores. In a FASTQ file,
  • 00:07:39
    each read is described in exactly four lines.  The first line, that starts with the @ symbol,
  • 00:07:48
    is a sequence identifier. This line is not  really something that is well standardized,
  • 00:07:55
    so what is written here can vary  from different sequencing facilities,
  • 00:07:59
    but it generally holds some information about the  run, the equipment ID, their run ID, lane number,
  • 00:08:08
    perhaps the date, and so on. The second line  is your actual sequences. In this example,
  • 00:08:16
    this is our DNA sequences from our amplicons.  But of course if this was our barcode FASTQ file,
  • 00:08:23
    for example, then this would simply correspond to  the seven or eight nucleotide long unique barcode.
  • 00:08:34
    The third line denoted by the + sign here, is  a plus placeholder line, which can technically
  • 00:08:41
    hold a variety of information that you may  want to include. But these days you mainly see
  • 00:08:49
    just a plus sign indicating that this is a  placeholder line. Finally, the fourth line
  • 00:08:57
    holds quality scores corresponding to the  sequences. These quality scores are coded
  • 00:09:04
    using a series of ASCII characters, which are then  translated into numerical values downstream. These
  • 00:09:13
    quality scores, also known as phred scores or q  scores, are calculated by the sequencing machine
  • 00:09:21
    and they tell us about the quality of our  nucleotides in terms of error probabilities.
  • 00:09:28
    So for example, the question mark character  (corresponding to the ASCII code 63 here)
  • 00:09:35
    translates to a quality score of 30. A quality  score of 30 indicates that the probability
  • 00:09:44
    of the corresponding nucleotide being incorrect is  one in one thousand, or is 99.9 percent accurate.
  • 00:09:55
    In other words, if we saw a G for example  nucleotide in our reads, the likelihood of that
  • 00:10:03
    G having been called G by error -and in  fact it was meant to be let's say a C-
  • 00:10:10
    is one in one thousand. These quality scores are  very important and become a crucial component of
  • 00:10:19
    our quality control steps and filtering step  that you'll learn more about in later videos.
  • 00:10:26
    Okay. Now that we have a better understanding  of our FASTQ files, let's go back to our example
  • 00:10:33
    data. So again, here we have on the left  our sample metadata file, which contains
  • 00:10:39
    information about each of our samples, including  their unique barcode. In the middle we have a
  • 00:10:46
    FASTQ file for our barcodes. And on the right  hand, we have another FASTQ file for our actual
  • 00:10:55
    sequences. What is important to emphasize here is  that at this point our data is still multiplexed,
  • 00:11:03
    meaning that all of our sequencing data is  contained in one location and in one file,
  • 00:11:09
    and they are not linked to their original sample  source yet. And what we ultimately want is to
  • 00:11:17
    group all of our sequences from the multiplexed  FASTQ file and we want to demultiplex them so that
  • 00:11:26
    each sequence is paired with the sample it  originally came from. In other words, we want
  • 00:11:32
    all of the orange sequences to be paired with our orange sample, the blue sequences
  • 00:11:38
    with the blue sample, and so on. The way we can  achieve this is by simply mapping the sequences
  • 00:11:47
    back to their sample of origin using those  unique barcodes that we added at the beginning.
  • 00:11:54
    So this is a very simple overview of how the  demultiplexing process actually works. We
  • 00:12:01
    start from the right with our sequence FASTQ  files. We take the first read we see there,
  • 00:12:08
    and now we move to the barcode FASTQ file  and match it with the first read we see
  • 00:12:14
    in that file. It's worth pointing out that the  order of sequences between these two FASTQ files
  • 00:12:22
    is paired, and they are always matched when they  are produced by the sequencing machine. So what I
  • 00:12:29
    mean is that read number one in our sequence file  will always correspond to read number one in the
  • 00:12:37
    barcode file. Same thing with read number 2 in  the sequence file, it will always correspond to
  • 00:12:44
    read number 2 in the barcode file, and so on.  Now in our barcode file, we read the unique
  • 00:12:52
    barcode identifier that is associated with  that read, and finally we can map that barcode
  • 00:12:59
    using our metadata file and identify exactly  which sample that barcode corresponds
  • 00:13:06
    to - meaning what sample our original read came  from. In this case, the orange sample was our
  • 00:13:18
    original sample source. So we repeat  this demultiplexing process for each read
  • 00:13:24
    until all of our reads have been  assigned to one of our samples.
  • 00:13:30
    When the demultiplexing process is complete,
  • 00:13:34
    instead of having one FASTQ for all of our  samples, we'll now have one FASTQ file per sample.
  • 00:13:43
    We no longer need our barcode file because we  have already extracted the information that we
  • 00:13:49
    used from them. So this is what we refer to  as a demultiplexed file. Of course, when you
  • 00:13:56
    are working within the QIIME 2 environment, you  will only actually see a single QIIME 2 artifact.
  • 00:14:04
    However, we now know that the  underlying structure of that artifact
  • 00:14:09
    is a series of individual FASTQ files. In fact,  if you were to export this artifact at this point,
  • 00:14:17
    you'll see that all of these FASTQ files  exist as a separate file within it.
  • 00:14:25
    And of course again if you have sequenced  paired end data then you will have two FASTQ
  • 00:14:31
    files per sample: one corresponding to the  forward reads, and the other to your reverse
  • 00:14:36
    reads. It is also worth mentioning here that,  at this point, if you have paired end reads,
  • 00:14:44
    your forward and reverse reads are still not  joined together. That is something that happens
  • 00:14:50
    at a later step. At this point they are simply  held paired together but as separate files.
  • 00:15:02
    Now depending on the sequencing facility where  your data came from, you may receive your data
  • 00:15:08
    in either multiplexed form such as in our  example here, or demultiplexed. So if you
  • 00:15:16
    receive the multiplexed file, then of course you will need to demultiplex your reads,
  • 00:15:21
    but if you receive them already demultiplexed then  you can simply skip this demultiplexing process
  • 00:15:28
    in QIIME 2 and you can just move forward  to the next step, which will be denoising
  • 00:15:34
    and clustering. The easiest way to know if you  have received multiplexed or demultiplexed data
  • 00:15:41
    from your facility is by simply looking  to see if you have received one FASTQ file
  • 00:15:48
    that contains all of your samples, or if you  have separate FASTQ files for each sample.
  • 00:15:56
    So this concludes the lecture tutorial on importing and demultiplexing reads in QIIME 2.
  • 00:16:03
    In the next section, you will get to actually  get a hands-on experience on how to import
  • 00:16:08
    your raw FASTQ files in QIIME 2. We'll see you  again at the next video in this tutorial series
  • 00:16:16
    which will be about denoising or clustering  your data. Thank you very much for joining!
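
The matching procedure described in the transcript can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how QIIME 2 implements demultiplexing: the barcodes, sample names, reads, and the demultiplex function are all hypothetical, and reads are held as simple (identifier, sequence, quality) tuples. It relies only on the property stated in the video, namely that read N in the sequence file corresponds to read N in the barcode file.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample map, as it might be read from a metadata file.
barcode_to_sample = {"ACGTACGT": "orange", "TTGGCCAA": "blue"}

# Hypothetical reads as (identifier, sequence, quality) tuples.
sequence_reads = [
    ("read1", "GATTACAGATTACA", "??????????????"),
    ("read2", "CCGGTTAACCGGTT", "??????????????"),
    ("read3", "TTTTGGGGCCCCAA", "??????????????"),
]
barcode_reads = [
    ("read1", "ACGTACGT", "????????"),
    ("read2", "TTGGCCAA", "????????"),
    ("read3", "GGGGGGGG", "????????"),  # barcode absent from the metadata
]


def demultiplex(sequence_reads, barcode_reads, barcode_to_sample):
    """Assign each read to a sample using the barcode read at the same position."""
    if len(sequence_reads) != len(barcode_reads):
        raise ValueError("sequence and barcode files must hold the same number of reads")
    per_sample = defaultdict(list)
    unassigned = []
    for read, barcode_read in zip(sequence_reads, barcode_reads):
        barcode = barcode_read[1]                 # sequence line of the barcode record
        sample = barcode_to_sample.get(barcode)   # look the barcode up in the metadata
        if sample is None:
            unassigned.append(read)               # no sample carries this barcode
        else:
            per_sample[sample].append(read)
    return dict(per_sample), unassigned


per_sample, unassigned = demultiplex(sequence_reads, barcode_reads, barcode_to_sample)
for sample, reads in per_sample.items():
    print(sample, [identifier for identifier, _, _ in reads])
print("unassigned:", [identifier for identifier, _, _ in unassigned])
```

Each key of per_sample stands in for one per-sample FASTQ file. Real demultiplexing tools also deal with details this sketch ignores, such as barcode orientation and sequencing errors within the barcode.
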
Tags
  • QIIME 2
  • data import
  • demultiplexing
  • FASTQ file
  • sequence analysis
  • bioinformatics