202 - Two ways to read HAM10000 dataset into python for skin cancer lesion classification

00:26:43
https://www.youtube.com/watch?v=qB6h5CohLbs

Summary

TL;DR: This video by Sreeni on the DigitalSreeni YouTube channel covers loading the HAM10000 dataset for image classification. The dataset consists of 10,015 dermatoscopic images plus a metadata CSV file, hosted on Harvard Dataverse. The goal is to show how to load the images, whether you work with Keras or PyTorch, so that they can later be classified into one of seven classes. Sreeni keeps dataset loading separate from the classification video because the same steps apply to other datasets with a similar structure. The images are stored in two folders (part 1 and part 2), and the metadata file gives the class label for each image. The classes are heavily imbalanced: the NV class alone holds roughly 6,700 of the images. Two loading methods are shown. The first reads every image into a pandas DataFrame, which keeps everything in one place but needs a lot of RAM. The second sorts the images into one subfolder per class so that Keras's ImageDataGenerator (flow_from_directory) or PyTorch's ImageFolder can read them from disk.
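
The class imbalance mentioned in the summary can be put into numbers from the class counts alone. The sketch below is only an illustration: the per-class counts are assumed from the public metadata file (they sum to 10,015), and the weighting rule, total / (number of classes × class count), is a common "balanced" heuristic rather than anything the video prescribes. The function name `balanced_class_weights` is hypothetical.

```python
# Per-class image counts in HAM10000 (assumed from the public metadata file).
class_counts = {
    "nv": 6705, "mel": 1113, "bkl": 1099, "bcc": 514,
    "akiec": 327, "vasc": 142, "df": 115,
}


def balanced_class_weights(counts):
    """Return weight = total / (n_classes * count) for every class."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights(class_counts)
total_images = sum(class_counts.values())

# Print classes from most to least frequent (smallest weight first).
for label, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    share = class_counts[label] / total_images
    print(f"{label:>5}: {share:6.1%} of images, weight {weight:.2f}")
```

Once the labels are mapped to integer class indices, a dictionary like this is the kind of value Keras accepts through the `class_weight` argument of `model.fit`.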

Key takeaways

  • 📁 The HAM10000 dataset comprises 10,015 skin images divided into seven classes.
  • 🔗 The images are stored in two folders, and a separate metadata CSV file describes each image.
  • 🗂️ The dataset can be handled using both Keras and PyTorch, with specific loading techniques for each.
  • 💼 Two methods are described: loading data into a pandas dataframe or using data generators.
  • 🧠 Managing class imbalance is crucial, as the NV class represents the majority.
  • 📝 Practical steps include organizing images into labeled folders or loading them into RAM efficiently.
  • 🖼️ In the first method, each image is resized to 32×32 pixels on load and stored as a NumPy array.
  • 🔧 The video explains the use of lambda functions, Keras's ImageDataGenerator, and PyTorch's ImageFolder.
  • 🔍 Metadata includes the lesion ID, image ID, diagnosis (dx), how the diagnosis was confirmed (dx_type), and the patient's age, sex, and lesion localization.
  • 📊 Sreeni demonstrates a script for each method and walks through the steps of the data setup.
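
The first method from the takeaways (image paths and pixel data in one pandas DataFrame) might look roughly like the sketch below. It is not the video's exact script. It assumes the metadata CSV has an `image_id` column matching the file names without extension, and that the JPEG files sit in immediate subfolders of a root directory. The function names are hypothetical, and the conversion to RGB is an added safeguard.

```python
import os
from glob import glob

import numpy as np
import pandas as pd
from PIL import Image


def build_path_lookup(root_dir):
    """Map image_id (file name without extension) to the full file path.

    Every immediate subfolder of root_dir is searched for .jpg files, so it
    does not matter which of the two image folders holds a given file.
    """
    pattern = os.path.join(root_dir, "*", "*.jpg")
    return {os.path.splitext(os.path.basename(p))[0]: p for p in glob(pattern)}


def read_resized(path, size=(32, 32)):
    """Open one image, force 3 channels, resize it, return a NumPy array."""
    with Image.open(path) as img:
        return np.asarray(img.convert("RGB").resize(size))


def load_images_into_frame(metadata_csv, root_dir, size=(32, 32)):
    """Return the metadata DataFrame with 'path' and 'image' columns added."""
    skin_df = pd.read_csv(metadata_csv)
    # Look up each image_id in the path dictionary.
    skin_df["path"] = skin_df["image_id"].map(build_path_lookup(root_dir))
    # The lambda replaces an explicit for loop over all rows.
    skin_df["image"] = skin_df["path"].map(lambda p: read_resized(p, size))
    return skin_df
```

A row whose image file is missing ends up with a NaN path and makes `read_resized` fail, so it is worth checking `skin_df["path"].isna().sum()` before loading pixels on a real dataset.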

Timeline

  • 00:00:00 - 00:05:00

    The video introduces the task of classifying HAM10000 images into one of seven classes and focuses on loading the data, with either Keras or PyTorch, in a way that carries over to similarly structured datasets. It explains that HAM10000 consists of 10,015 dermatoscopic images plus a metadata file, and that each label was confirmed by one of four methods: histopathology (more than 50% of cases), follow-up examination, expert consensus, or in vivo confocal microscopy.

  • 00:05:00 - 00:10:00

    The speaker explains the file structure of HAM10000, including two folders containing images and a metadata file. He emphasizes the importance of referencing images by their name for loading and discusses the numerical imbalance across the seven classes in the dataset. A preview of images shows how each is tagged into one of the classes.

  • 00:10:00 - 00:15:00

    The process of organizing the dataset into a Pandas DataFrame is detailed, explaining how to add image paths and metadata into the dataframe while highlighting the challenges with memory usage due to the large number of images. The next step involves class balance visualization and initial impressions of the metadata.

  • 00:15:00 - 00:20:00

    An alternative loading method using Keras's ImageDataGenerator is discussed for systems with limited RAM. The speaker describes sorting the images into one subfolder per class so that the folder name serves as the class label, and notes that the generator can also apply augmentation, such as random rotation and zoom, while loading.

  • 00:20:00 - 00:26:43

    The video concludes with a comparison between using Keras and PyTorch for handling image data, recommending reorganizing datasets into subdirectories to optimize loading. It suggests balancing the data with augmentation and outlines that the next video will focus on the classification aspect.
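
The reorganization step described in the timeline can be sketched with the standard library alone. This is an illustration under stated assumptions, not the video's script: it assumes all images have already been gathered into one source folder, that the metadata CSV has `image_id` and `dx` columns, and it copies files instead of moving them so the source folder stays intact. The function name is hypothetical.

```python
import csv
import os
import shutil


def sort_into_class_folders(metadata_csv, src_dir, dest_dir,
                            id_col="image_id", label_col="dx", ext=".jpg"):
    """Copy each image into dest_dir/<label>/ and return per-class counts."""
    counts = {}
    with open(metadata_csv, newline="") as f:
        for row in csv.DictReader(f):
            label = row[label_col]
            src = os.path.join(src_dir, row[id_col] + ext)
            if not os.path.isfile(src):
                continue  # skip metadata rows whose image file is missing
            class_dir = os.path.join(dest_dir, label)
            os.makedirs(class_dir, exist_ok=True)
            shutil.copy2(src, class_dir)
            counts[label] = counts.get(label, 0) + 1
    return counts
```

The resulting layout, one subfolder per class, is the structure that Keras's `flow_from_directory` and torchvision's `ImageFolder` read, with the folder name used as the class label.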

Video Q&A

  • How does this video help in image classification projects?

    The video helps by teaching how to properly load and organize datasets before classification, which is crucial for accurate model training.

  • What is the HAM10000 dataset?

    HAM10000 consists of 10,015 dermatoscopic images in seven classes, used to train machine learning models in dermatology.

  • How does the video propose loading images for classification?

    Images are read from disk in one of two ways: either looked up by file name across the original folders and loaded all at once, or first sorted into per-class subfolders and then read from those.

  • What is the first method for handling dataset images in the video?

    The first method adds each image's file path and pixel data to a single pandas DataFrame alongside the metadata, keeping everything in one place.

  • Can the methods shown in the video be applied to other datasets?

    Yes, they can be applied to any dataset with a similar folder and metadata structure.

  • What is the second method discussed for handling images?

    The second method sorts images into one subfolder per class and then reads them with Keras's ImageDataGenerator (flow_from_directory) or PyTorch's ImageFolder.

  • How does the video address class imbalance?

    Imbalance is handled by recognizing which classes have more images and potentially applying techniques to balance the dataset, like adding class weights.

  • Why is RAM management important when handling datasets?

    Loading every image into a DataFrame keeps the whole dataset in RAM at once, which is convenient but can leave little memory for anything else. On systems with limited RAM, reading images from disk in batches with a data generator avoids holding everything in memory.
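
A rough estimate shows how the in-memory footprint of the pixel data depends on image size and data type. The sketch counts only the raw array bytes, so the overhead of pandas, Python objects, and image decoding comes on top. The 450×600 figure assumes the original images are 600×450 pixels, which is not stated in the video. The function name is hypothetical.

```python
def array_mebibytes(n_images, height, width, channels=3, bytes_per_value=1):
    """Size in MiB of n_images stored as one dense pixel array."""
    return n_images * height * width * channels * bytes_per_value / 2**20


N_IMAGES = 10015

# Resized to 32x32 as in the first method, stored as 8-bit integers.
print(f"32x32 uint8:   {array_mebibytes(N_IMAGES, 32, 32):10.1f} MiB")
# The same images after conversion to 32-bit floats for training.
print(f"32x32 float32: {array_mebibytes(N_IMAGES, 32, 32, bytes_per_value=4):10.1f} MiB")
# Kept at the assumed original resolution instead of being resized.
print(f"450x600 uint8: {array_mebibytes(N_IMAGES, 450, 600):10.1f} MiB")
```

The comparison suggests that the resize step, more than the number of images, decides whether the first method fits in memory.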

Transcript
  • 00:00:00
    hey guys i'm sreeni and welcome to my
  • 00:00:02
    channel digitalsreeni on youtube
  • 00:00:05
    in this video i'm going to cover a topic
  • 00:00:07
    that's heavily heavily requested by
  • 00:00:09
    our viewers of this channel and this is
  • 00:00:13
    basically
  • 00:00:13
    classification of ham 10000 dataset into
  • 00:00:17
    one of the seven classes
  • 00:00:19
    and this video is dedicated to loading
  • 00:00:22
    these images whether you would like to
  • 00:00:24
    work with keras or pytorch
  • 00:00:26
    you obviously have to focus on loading
  • 00:00:28
    the data set first
  • 00:00:29
    so i'm separating this from the actual
  • 00:00:32
    classification video because this video
  • 00:00:34
    by itself could be useful for you
  • 00:00:36
    in loading other data sets that have
  • 00:00:38
    very similar structure
  • 00:00:39
    okay now first let's understand what
  • 00:00:41
    this data set is all about
  • 00:00:43
    first of all ham 10 000 stands for
  • 00:00:47
    human against machine with 10 000
  • 00:00:48
    training images
  • 00:00:50
    so as you can imagine this data set has
  • 00:00:52
    about 10 000 training images
  • 00:00:54
    and these are publicly available
  • 00:00:56
    courtesy of harvard dataverse
  • 00:00:58
    data set i'll leave a link to this in
  • 00:01:01
    the description
  • 00:01:02
    and i'll also display the link in a
  • 00:01:04
    minute on the screen
  • 00:01:05
    it contains 10 015
  • 00:01:09
    dermatoscopic images basically the skin
  • 00:01:12
    images showing the lesion and it also
  • 00:01:15
    contains a metadata file right i mean
  • 00:01:18
    you have all these images dumped into
  • 00:01:19
    two folders we'll look at the folder
  • 00:01:21
    structure in a minute
  • 00:01:22
    and for each of this image there it
  • 00:01:24
    actually tells you which class it
  • 00:01:26
    belongs to now how did they know what
  • 00:01:28
    class it belongs to i'll get to that in
  • 00:01:30
    a second
  • 00:01:30
    uh and more than 50 percent of these are
  • 00:01:34
    confirmed through uh histopathology
  • 00:01:36
    okay and the remaining 50 it's either
  • 00:01:40
    follow-up examination or it's either
  • 00:01:42
    expert consensus or
  • 00:01:43
    confirmation by in vivo confocal
  • 00:01:45
    microscopy okay so they're basically
  • 00:01:47
    four different methods these are all
  • 00:01:49
    confirmed and the
  • 00:01:50
    each image is tagged with the
  • 00:01:53
    appropriate metadata
  • 00:01:54
    which is contained in the csv file now
  • 00:01:57
    even if you're working with any other
  • 00:01:59
    type of data sets you'll probably have
  • 00:02:00
    very similar structure all these images
  • 00:02:02
    are probably dumped into a couple of
  • 00:02:04
    folders or a single folder
  • 00:02:05
    and you have a json file or csv file
  • 00:02:08
    that tells exactly
  • 00:02:09
    uh what each image name is
  • 00:02:12
    and then what other metadata that goes
  • 00:02:14
    along with this name
  • 00:02:16
    so the technique here is we're going to
  • 00:02:18
    use that name
  • 00:02:19
    find that file in a folder load it okay
  • 00:02:22
    that's basically what we're trying to do
  • 00:02:24
    okay now if you open these uh images
  • 00:02:27
    they kind of look like these these are
  • 00:02:28
    all low resolution images but
  • 00:02:30
    it kind of gives you an idea of how
  • 00:02:32
    these images look like and each of this
  • 00:02:35
    image or you know it it's tagged into
  • 00:02:37
    one of these seven classes
  • 00:02:40
    okay and uh we'll later on see that
  • 00:02:43
    these classes are
  • 00:02:44
    all not perfectly balanced and i'll also
  • 00:02:47
    try to include how to work with
  • 00:02:49
    imbalance data in the next video so
  • 00:02:51
    please stay tuned for the next video
  • 00:02:52
    where we talk about classification of
  • 00:02:54
    these
  • 00:02:55
    uh but these are the seven classes we'd
  • 00:02:57
    like to classify each of this
  • 00:02:59
    image into and it's not an easy task uh
  • 00:03:02
    just by looking at this if we get
  • 00:03:04
    anything above 50 60 percent
  • 00:03:05
    of accuracy we are good but uh uh it's
  • 00:03:08
    up to you how you put together
  • 00:03:10
    a nice uh uh you know a nice model that
  • 00:03:13
    gives you
  • 00:03:14
    even higher accuracies like 70 80 90
  • 00:03:18
    okay uh let's let's stay focused on the
  • 00:03:20
    data set itself
  • 00:03:21
    finally where can you get more
  • 00:03:23
    information about this first of all to
  • 00:03:24
    download this here is a link i know
  • 00:03:26
    watching this video that's that's a
  • 00:03:28
    difficult task so i'll include the link
  • 00:03:30
    please look at the description
  • 00:03:32
    down below and this is how the web page
  • 00:03:34
    looks like you can access the data set
  • 00:03:35
    from here you can download it it's an
  • 00:03:37
    archive it's a zip file
  • 00:03:38
    that contains uh you know all these
  • 00:03:41
    images again we'll look at that in a
  • 00:03:43
    minute
  • 00:03:43
    now here is a link that also provides
  • 00:03:46
    you more information about this data set
  • 00:03:48
    if you really want to know more about
  • 00:03:50
    these images and
  • 00:03:51
    uh you know this is basically the
  • 00:03:53
    information about this data
  • 00:03:55
    okay now let's jump onto the code and
  • 00:03:59
    quickly have a look at how what are
  • 00:04:01
    these two ways okay to
  • 00:04:02
    load this data set now before again as
  • 00:04:06
    usual jumping into the code
  • 00:04:08
    let me show you the folder structure i
  • 00:04:10
    actually downloaded the zip file when
  • 00:04:12
    you unzip it it actually in this case
  • 00:04:15
    i placed everything into this folder
  • 00:04:16
    called ham 10000
  • 00:04:18
    okay so when i go into this folder
  • 00:04:20
    you'll see that it has two sub
  • 00:04:21
    directories okay
  • 00:04:23
    and uh first one has 5000 images so
  • 00:04:26
    instead of putting all 10015 images in
  • 00:04:28
    one folder or one directory
  • 00:04:30
    they actually placed the first 5000 in
  • 00:04:32
    this folder
  • 00:04:33
    and the second one in part two folder
  • 00:04:35
    okay you can combine all of this into
  • 00:04:36
    one if you want
  • 00:04:38
    or you can write code such a way that it
  • 00:04:39
    just looks at these two folders for the
  • 00:04:41
    file name
  • 00:04:42
    okay uh so it's these two and then the
  • 00:04:45
    metadata
  • 00:04:47
    here so if i open this metadata file
  • 00:04:49
    again i'm using notepad plus plus
  • 00:04:51
    because i do not have microsoft
  • 00:04:52
    office or anything on the system uh you
  • 00:04:55
    can see this is how
  • 00:04:56
    the metadata looks like okay if i go all
  • 00:04:58
    the way down it should be there should
  • 00:04:59
    be 10015 of these you see 10016 first one
  • 00:05:02
    being the header okay so total 10015
  • 00:05:05
    images and the first one stands for
  • 00:05:08
    lesion id okay this we're not going to
  • 00:05:11
    use it
  • 00:05:12
    the second one is image id this is
  • 00:05:14
    basically if i if you look at this
  • 00:05:16
    and let's go back to this folder so two
  • 00:05:19
    three four
  • 00:05:20
    two four three zero six okay is the
  • 00:05:22
    image id in this case two seven four one
  • 00:05:24
    nine if i actually go down there should
  • 00:05:26
    be two seven four
  • 00:05:27
    one nine somewhere okay that's it's
  • 00:05:30
    important let's go ahead and look at it
  • 00:05:31
    so two seven four one nine two seven
  • 00:05:34
    four one three
  • 00:05:35
    we're almost there there two seven four
  • 00:05:37
    one nine that
  • 00:05:38
    image okay has a lesion id of this
  • 00:05:42
    and it's labeled as bkl what's bkl again
  • 00:05:46
    or bkl stands for benign keratosis like
  • 00:05:49
    lesions okay so this one is classified
  • 00:05:52
    as bkl
  • 00:05:54
    and using histopathology and uh
  • 00:05:57
    it came from someone with an age of 80
  • 00:05:59
    years male
  • 00:06:01
    and from the scalp location so there is
  • 00:06:03
    a lot of metadata here so if you just
  • 00:06:04
    want
  • 00:06:05
    to leverage all of this other metadata
  • 00:06:08
    uh you know in in classifying
  • 00:06:10
    great but what we are going to do is
  • 00:06:12
    load these images
  • 00:06:14
    okay and then classify them into one of
  • 00:06:16
    these uh
  • 00:06:17
    one of these classes one of these seven
  • 00:06:19
    classes in the next video okay now let's
  • 00:06:21
    jump into
  • 00:06:22
    uh the actual part of this video where
  • 00:06:24
    we see how to load this okay
  • 00:06:26
    now there are many ways i'm just going
  • 00:06:28
    to show you two ways
  • 00:06:30
    again part of this is inspired by me
  • 00:06:32
    going through a bunch of
  • 00:06:34
    uh code that's out there on kaggle and
  • 00:06:36
    everywhere else and then i'm digesting
  • 00:06:38
    it and regurgitating it for you guys
  • 00:06:40
    okay
  • 00:06:41
    uh so first of all again at a high level
  • 00:06:45
    what am i trying to do here
  • 00:06:46
    like i said take the image file name
  • 00:06:49
    image name
  • 00:06:50
    read it and place it into a pandas data
  • 00:06:52
    frame that's step
  • 00:06:53
    that's one way of doing it this requires
  • 00:06:56
    a lot of ram because now we are going to
  • 00:06:57
    load all 10 000
  • 00:06:59
    uh you know into our pandas data frame
  • 00:07:01
    the other method
  • 00:07:02
    is using image data generator if you're
  • 00:07:04
    a keras person and if you are a
  • 00:07:06
    pycharm person you uh probably should
  • 00:07:08
    know pycharm
  • 00:07:10
    if you are a uh if you are a i'm going
  • 00:07:13
    blank in my mind if you are a pytorch
  • 00:07:15
    person then you can actually use
  • 00:07:17
    pytorch's imagefolder which is
  • 00:07:21
    very similar to keras's flow from
  • 00:07:22
    directory right i mean
  • 00:07:24
    image data generator which is uh
  • 00:07:27
    it just looks at your folder structure
  • 00:07:29
    and assigns the class label as the
  • 00:07:31
    folder name
  • 00:07:32
    so if that's the case then you need to
  • 00:07:34
    organize your folder such a way that
  • 00:07:37
    that subdirectories have these folder
  • 00:07:40
    names
  • 00:07:40
    that doesn't make any sense let's get to
  • 00:07:42
    the point here okay
  • 00:07:44
    step number one method number one first
  • 00:07:45
    of all let's go ahead and import the
  • 00:07:47
    libraries that we need uh nothing to
  • 00:07:49
    explain here
  • 00:07:51
    matplotlib for plotting numpy pandas uh
  • 00:07:54
    os glob and pil to read images so you
  • 00:07:57
    should be familiar with all of these by
  • 00:07:59
    now
  • 00:08:01
    then or uh our our metadata file i'm
  • 00:08:05
    calling this skin
  • 00:08:06
    data frame okay is this ham 10000
  • 00:08:09
    metadata.csv correct let me make sure
  • 00:08:12
    that's the right uh
  • 00:08:13
    way yeah that one which is in ham 10000
  • 00:08:16
    data set
  • 00:08:16
    so we are going to use pandas to load
  • 00:08:18
    that and i've already done that
  • 00:08:21
    so my skin df when i open it
  • 00:08:24
    you should see uh yeah you should see
  • 00:08:27
    that okay i have lesion id
  • 00:08:28
    image id this is dx is uh what is it
  • 00:08:32
    classified as or label if you want you
  • 00:08:34
    can rename this as label
  • 00:08:36
    and dx type is this from histo or is
  • 00:08:39
    this from consensus where is it if i go
  • 00:08:41
    down yeah there is consensus
  • 00:08:43
    i keep scrolling through it there is
  • 00:08:45
    consensus right here what process was
  • 00:08:46
    used
  • 00:08:47
    okay and age sex and localization right
  • 00:08:49
    so this is the stuff that we already
  • 00:08:51
    we already uh know so as soon as you
  • 00:08:54
    load your data
  • 00:08:55
    that's what you're going to see now to
  • 00:08:57
    this data frame let's go ahead and add
  • 00:08:59
    our images
  • 00:09:00
    this is the plan how do you do that
  • 00:09:02
    first of all
  • 00:09:03
    let's go ahead and there's a reason why
  • 00:09:06
    we imported os
  • 00:09:07
    so now let's go ahead and split our text
  • 00:09:10
    right here
  • 00:09:11
    and all we are trying to do here is
  • 00:09:13
    basically
  • 00:09:14
    look at this sub directory and look at
  • 00:09:16
    all the jpeg files so this entire thing
  • 00:09:19
    you can actually get rid of that if you
  • 00:09:21
    copy
  • 00:09:23
    these images into one single folder
  • 00:09:26
    and then just look for all jpeg files
  • 00:09:28
    okay the reason why we have this line
  • 00:09:30
    is to basically hey go to this directory
  • 00:09:34
    first of all get to the directory
  • 00:09:36
    subdirectory names and then go to the
  • 00:09:38
    first one and then go to the second one
  • 00:09:39
    and look at all the jpeg images
  • 00:09:41
    okay so that's what my image path is so
  • 00:09:44
    if i go to image path
  • 00:09:46
    right here so my image path is this data
  • 00:09:49
    slash ham 10 000
  • 00:09:50
    right ham 10 thousand slash
  • 00:09:54
    now it's looking for where is it is it
  • 00:09:56
    in part 1 or part 2
  • 00:09:58
    and what is the file name that's exactly
  • 00:10:00
    what this part of the
  • 00:10:01
    code is doing so once we have the path
  • 00:10:03
    we can just go ahead and read it
  • 00:10:05
    right so that's exactly what we are
  • 00:10:06
    going to do here so once we have this
  • 00:10:09
    path
  • 00:10:10
    okay first of all this line is adding
  • 00:10:12
    that path
  • 00:10:13
    to the data frame so if i open the data
  • 00:10:15
    frame again
  • 00:10:16
    okay so the next thing is
  • 00:10:20
    we are going to add this path to this
  • 00:10:23
    data frame
  • 00:10:24
    okay you don't need to do that but let's
  • 00:10:27
    go ahead and add it
  • 00:10:29
    because it makes it easy for us to go do
  • 00:10:31
    the next step which is
  • 00:10:32
    once you have that okay use that path
  • 00:10:36
    okay to read the image and how do you do
  • 00:10:39
    that i mean
  • 00:10:40
    again watch my video on lambda functions
  • 00:10:43
    so now i'm using lambda function
  • 00:10:45
    which is uh again instead of writing a
  • 00:10:47
    for loop you can just do this in one
  • 00:10:48
    line right i mean that's what a lambda
  • 00:10:50
    function is here
  • 00:10:51
    so for every x go ahead and open this
  • 00:10:53
    image
  • 00:10:54
    right open this image resize it to 32 by
  • 00:10:56
    32
  • 00:10:57
    i can do this because i'm using my my
  • 00:11:00
    pillow
  • 00:11:00
    library right here to open this image
  • 00:11:03
    you can do exactly the same
  • 00:11:05
    using opencv or scikit image if you want
  • 00:11:07
    to load it that way
  • 00:11:08
    but in this case because resizing is
  • 00:11:10
    easy with this
  • 00:11:12
    pillow in a single line this is exactly
  • 00:11:14
    what i'm trying to do image dot
  • 00:11:16
    open so you're opening the image and
  • 00:11:17
    then resize it to 32 by 32
  • 00:11:19
    and then convert that into a numpy array
  • 00:11:22
    that's it right that's exactly what we
  • 00:11:25
    are doing
  • 00:11:25
    for each x in this path
  • 00:11:29
    so for each of this image path it opens
  • 00:11:31
    this image and converts that into numpy
  • 00:11:33
    array and add that as a
  • 00:11:35
    separate column in my skin df
  • 00:11:38
    uh pandas data frame and the column name
  • 00:11:42
    is image that's exactly what this line
  • 00:11:44
    is
  • 00:11:44
    okay so if i open my skin df again this
  • 00:11:47
    part
  • 00:11:48
    if once you run this it it takes almost
  • 00:11:51
    on my system it took almost five minutes
  • 00:11:53
    to load the entire thing and you need
  • 00:11:55
    at least 32 gb of ram i didn't try this
  • 00:11:57
    on 16 gb maybe try it out
  • 00:12:00
    but you need a pretty high amount of ram
  • 00:12:03
    if you don't have that if you're just
  • 00:12:05
    working on 8gb
  • 00:12:06
    or much smaller ram probably the next
  • 00:12:09
    method that i'm going to show you
  • 00:12:10
    is the best one for you okay so let me
  • 00:12:12
    scroll all the way to the right
  • 00:12:14
    in fact let's make this a bit bigger so
  • 00:12:17
    you can see my image right there
  • 00:12:19
    and each of this image it's you know
  • 00:12:22
    it's not showing
  • 00:12:22
    what's in there but
  • 00:12:26
    there you go now you can see all the
  • 00:12:28
    numbers in there okay so this is my
  • 00:12:30
    image it's already a numpy array
  • 00:12:32
    and it's going to be of the shape let's
  • 00:12:35
    not edit anything
  • 00:12:36
    it's going to be of the shape 32 by 32
  • 00:12:40
    okay so that is uh in fact let's go
  • 00:12:43
    ahead and confirm that so once you have
  • 00:12:45
    this
  • 00:12:46
    uh now let's go ahead and see how many
  • 00:12:49
    different types of images we have or how
  • 00:12:50
    many different types of dx
  • 00:12:52
    do we have or ids do we have right so
  • 00:12:54
    let's go ahead and plot it
  • 00:12:55
    so uh label nv
  • 00:12:58
    whatever that nv stands for uh
  • 00:13:01
    let's see melanocytic nevi okay so
  • 00:13:05
    six thousand seven hundred of these and
  • 00:13:07
    then one thousand
  • 00:13:08
    one thousand of each of these melanoma i
  • 00:13:11
    think
  • 00:13:11
    mel and then this bkl the rest of these
  • 00:13:14
    are 500 300 142 and 115 this is heavily
  • 00:13:17
    imbalanced
  • 00:13:18
    data set so if you don't do much of
  • 00:13:21
    anything there's a good chance that you
  • 00:13:23
    may get high
  • 00:13:24
    accuracy but most of the images that are
  • 00:13:26
    being classified as high accurate are
  • 00:13:28
    uh you'll end up with nv label nv right
  • 00:13:32
    so you have to do something to balance
  • 00:13:33
    this whether
  • 00:13:35
    you add weights during training or you
  • 00:13:38
    kind of do the balancing beforehand
  • 00:13:39
    by by down scaling the number of images
  • 00:13:43
    of nv or upscaling the number of
  • 00:13:45
    df images okay okay so
  • 00:13:48
    far so good now i hope again if you want
  • 00:13:51
    you can go ahead and print them
  • 00:13:52
    or show these images on the screen
  • 00:13:55
    so let's pick five by five so here you
  • 00:13:58
    have these images
  • 00:13:59
    uh up here and if you zoom in there is
  • 00:14:01
    like a label up here
  • 00:14:02
    i should change the the figure size so
  • 00:14:05
    you can read them but
  • 00:14:06
    hopefully you can read them right there
  • 00:14:08
    okay
  • 00:14:09
    okay so uh after about 13 to 14 minutes
  • 00:14:13
    what did we learn
  • 00:14:14
    well there are seven different classes
  • 00:14:16
    and the images are in two different
  • 00:14:18
    folders and there is a metadata file
  • 00:14:20
    and we are basically extracting the path
  • 00:14:23
    of each image
  • 00:14:24
    by walking through these folders and
  • 00:14:26
    each image and capturing that into a
  • 00:14:28
    data frame
  • 00:14:29
    and using that path and applying a
  • 00:14:32
    lambda function
  • 00:14:33
    to each of those entries and loading the
  • 00:14:37
    image and adding that as a separate
  • 00:14:39
    column to your data frame i like this
  • 00:14:41
    approach because everything is at one
  • 00:14:43
    place now i have
  • 00:14:44
    all the images everything loaded into
  • 00:14:45
    one data frame the reason i dislike this
  • 00:14:48
    approach is now you're loading tens of
  • 00:14:50
    thousands of images
  • 00:14:51
    into your ram it may not leave much room
  • 00:14:54
    to do other stuff
  • 00:14:55
    if you are limited by ram okay if that's
  • 00:14:57
    the case let's go to this next step
  • 00:14:59
    what do you do okay this is the second
  • 00:15:00
    method now okay
  • 00:15:02
    so let's clear the screen uh the second
  • 00:15:05
    one is uh it's it's it's a great trick
  • 00:15:08
    to actually
  • 00:15:09
    do uh even if you have uh if you're
  • 00:15:12
    working with other types of data sets
  • 00:15:13
    first i like i'd like to sort images
  • 00:15:16
    into respective subfolders what if you
  • 00:15:18
    can go to each of these like bkls and
  • 00:15:20
    nvs and melanoma and all of these seven
  • 00:15:22
    classes and create
  • 00:15:24
    seven subfolders and extract the images
  • 00:15:26
    that belong to each of this subfolder
  • 00:15:28
    and dump them there
  • 00:15:29
    okay that's exactly what we're trying to
  • 00:15:31
    do here
  • 00:15:32
    so in fact the first step here this one
  • 00:15:35
    is sorting these images into these
  • 00:15:38
    subfolders so what i did for that
  • 00:15:40
    is i actually dumped all these images
  • 00:15:42
    into single folder
  • 00:15:43
    first of all okay called all images so
  • 00:15:46
    if i go to all images you should see
  • 00:15:47
    10015 of these images in this folder
  • 00:15:50
    now i mean i could have written similar
  • 00:15:52
    lines as before but
  • 00:15:54
    it's easier this way so now i have all
  • 00:15:56
    images in one folder
  • 00:15:58
    which is my data directory and then my
  • 00:16:00
    destination directory is
  • 00:16:02
    uh data slash reorganized so i went back
  • 00:16:05
    yeah right there
  • 00:16:06
    reorganized this is my destination
  • 00:16:07
    directory so take images from all images
  • 00:16:10
    put them in destination directory
  • 00:16:12
    into respective subfolders based on the
  • 00:16:14
    class names
  • 00:16:15
    so how do we do that again so first of
  • 00:16:17
    all we are reading our metadata again
  • 00:16:19
    right
  • 00:16:20
    we are reading the metadata and then and
  • 00:16:23
    then uh
  • 00:16:24
    basically uh let's get to the meat of it
  • 00:16:27
    down here what we are trying to do is
  • 00:16:29
    first of all make a directory right for
  • 00:16:31
    i
  • 00:16:31
    in label what is my label let's run a
  • 00:16:34
    couple of lines so it makes it easy for
  • 00:16:36
    you
  • 00:16:37
    okay so let's do these
  • 00:16:40
    lines i really do not want to run this
  • 00:16:42
    entire thing because it takes a lot of
  • 00:16:44
    time
  • 00:16:44
    so let's do the relevant parts okay
  • 00:16:46
    there you go
  • 00:16:48
    and now let's go ahead and read this and
  • 00:16:50
    print
  • 00:16:51
    our counts we saw this last time right
  • 00:16:54
    i mean to print this out you don't need
  • 00:16:55
    to load images
  • 00:16:57
    we just need to load metadata to see the
  • 00:16:59
    distribution of data
  • 00:17:01
    and then now go ahead and look at these
  • 00:17:05
    unique values and save it to a list all
  • 00:17:07
    i'm trying to do is extract the label
  • 00:17:08
    names
  • 00:17:09
    nv mel and bkl into a separate list so if
  • 00:17:12
    you look at label images
  • 00:17:14
    label label there you go
  • 00:17:17
    so this is the list i just created now
  • 00:17:20
    in that uh label right the label that we
  • 00:17:23
    just uh
  • 00:17:24
    created uh the list of labels go through
  • 00:17:27
    each one of those
  • 00:17:28
    okay and make a directory with that name
  • 00:17:31
    make a subdirectory with that name
  • 00:17:32
    that's exactly what it did make a
  • 00:17:34
    subdirectory with the name of that label
  • 00:17:37
    next while you're there go ahead and
  • 00:17:40
    look at the image id okay and if that
  • 00:17:44
    if that label is equal to for example
  • 00:17:47
    bkl
  • 00:17:48
    or nv if that label is equal to nv okay
  • 00:17:51
    go ahead and find that image and move it
  • 00:17:54
    this is my
  • 00:17:54
    input directory and this is the
  • 00:17:56
    destination directory and that's that's
  • 00:17:58
    pretty much it
  • 00:17:58
    so i did that and it sorted all my
  • 00:18:01
    images into
  • 00:18:02
    for example akiec i have 327 items
  • 00:18:06
    right so let's move this
  • 00:18:07
    so akiec 327 okay
  • 00:18:10
    and if i go to a different folder where
  • 00:18:13
    is it bcc
  • 00:18:14
    i have 514 bcc 514 so
  • 00:18:18
    this is an easy way of sorting images
  • 00:18:20
    and placing them into right bins
  • 00:18:22
    now it's very easy to use
  • 00:18:26
    for example keras data gen let's run
  • 00:18:28
    this let's actually run this
  • 00:18:30
    uh let's delete everything and start
  • 00:18:32
    with a clean slate sorry for making this
  • 00:18:34
    video long but i want to make sure you
  • 00:18:36
    understand this i'll share the code
  • 00:18:38
    anyway
  • 00:18:38
    okay let's run these libraries i mean
  • 00:18:41
    these are pretty
  • 00:18:42
    uh straightforward right so we are going
  • 00:18:44
    to use image data generator
  • 00:18:46
    to to uh typically we use this to
  • 00:18:49
    augment data
  • 00:18:50
    okay which is which we are still doing
  • 00:18:53
    it
  • 00:18:54
    but image data generator we can actually
  • 00:18:56
    supply
  • 00:18:57
    any parameters within the datagen right
  • 00:18:59
    here again watch my video on data
  • 00:19:01
    augmentation using keras
  • 00:19:02
    but here you can say okay rotate the
  • 00:19:04
    image
  • 00:19:06
    by a random rotation between 0 to 30
  • 00:19:09
    degrees
  • 00:19:10
    uh random zoom random you know something
  • 00:19:13
    so you can perform different operations
  • 00:19:15
    to your images
  • 00:19:16
    right now i'm not giving any anything
  • 00:19:18
    right there okay right now i'm just
  • 00:19:20
    i'm just uh leaving it blank okay
  • 00:19:24
    so let's go ahead and define our datagen
  • 00:19:26
    object
  • 00:19:27
    right there so it created this object
  • 00:19:30
    now what do we do with that object right
  • 00:19:31
    so first of all where is my training
  • 00:19:33
    directory
  • 00:19:34
    i added this line because it tells it it
  • 00:19:36
    looks at my current working directory
  • 00:19:38
    and then adds this
  • 00:19:39
    to that current working directory okay
  • 00:19:41
    so let's go ahead and
  • 00:19:42
    do that my training directory okay so if
  • 00:19:45
    you just do
  • 00:19:47
    oh sorry very right hype i'll remove
  • 00:19:50
    that in a second
  • 00:19:51
    train underscore id oh sorry
  • 00:19:54
    dir so this is my training directory
  • 00:19:57
    okay
  • 00:19:58
    reorganized and what do we do with our
  • 00:20:01
    training directory
  • 00:20:02
    okay the next step is train uh i'm
  • 00:20:05
    creating
  • 00:20:06
    uh flow_from_directory again uh once you
  • 00:20:09
    create your datagen this is how you
  • 00:20:10
    apply
  • 00:20:11
    uh data augmentation right you create an
  • 00:20:14
    object
  • 00:20:15
    for datagen and you apply that object to
  • 00:20:18
    your
  • 00:20:18
    data and the data can come from .flow
  • 00:20:21
    or .flow_from_directory in this example
  • 00:20:23
    since it's coming from a directory we
  • 00:20:25
    have to tell it where the directory is
  • 00:20:27
    so datagen.flow_from_directory, the
  • 00:20:29
    directory is my training directory which
  • 00:20:31
    we just created
  • 00:20:32
    again i apologize if this is too basic
  • 00:20:34
    but
  • 00:20:35
    uh if you have never done this you this
  • 00:20:37
    can be very helpful
  • 00:20:38
    okay the class mode is categorical
  • 00:20:40
    because we have multi-class
  • 00:20:42
    okay if it's only binary the class mode
  • 00:20:44
    would be binary so we have
  • 00:20:46
    categorical and batch size is 16 meaning
  • 00:20:48
    each time while the training is
  • 00:20:49
    happening each batch it's actually
  • 00:20:51
    loading 16 images
  • 00:20:52
    okay and what exactly does it do it
  • 00:20:54
    looks at your uh training directory
  • 00:20:57
    loads uh some of these uh like in this
  • 00:20:59
    case 16 images
  • 00:21:01
    and it performs any augmentation that
  • 00:21:03
    you would do here
  • 00:21:04
    right image data generator it tells okay
  • 00:21:07
    rotate this image randomly
  • 00:21:09
    flip this image or in a vertical or a
  • 00:21:11
    horizontal way and all that stuff
  • 00:21:13
    right now we're not doing any of that
  • 00:21:14
    but that's exactly what it does
  • 00:21:17
    so it loads 16 of this and then target
  • 00:21:19
    size is 32 by 32. if the images
  • 00:21:21
    are all of different sizes this is very
  • 00:21:23
    useful also if you cannot handle large
  • 00:21:25
    images
  • 00:21:26
    based on your ram just give what the
  • 00:21:28
    target size is okay
  • 00:21:29
    remember previously i uh changed the
  • 00:21:32
    size while reading the file
  • 00:21:34
    okay again let's go back i'm resizing it
  • 00:21:38
    while i was reading this and then
  • 00:21:40
    capturing it as numpy array this time
  • 00:21:42
    i'm basically
  • 00:21:43
    saying hey image data generator
  • 00:21:46
    okay go ahead and resize them to 32 by
  • 00:21:48
    32 that's exactly what's going on here
  • 00:21:51
    okay so let us go ahead and run this
  • 00:21:54
    line
  • 00:21:57
    okay so it says it found ten thousand
  • 00:22:00
    fifteen images belonging to seven
  • 00:22:01
    classes
  • 00:22:02
    that's excellent because that's we know
  • 00:22:04
    that's exactly what
  • 00:22:05
    it's supposed to be okay uh you're done
  • 00:22:08
    now you can go ahead and work in keras
  • 00:22:11
    for example
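A minimal sketch of the Keras loading step described above, assuming TensorFlow is installed and that `ImageDataGenerator` is importable from `tensorflow.keras.preprocessing.image` (true for the TensorFlow 2.x releases I am aware of, though the class is marked as legacy in newer versions). The function and constant names are hypothetical; the three settings are the ones quoted in the video.

```python
# Settings quoted in the video: multi-class labels, 16 images per batch,
# every image resized to 32x32 as it is loaded.
FLOW_KWARGS = {
    "class_mode": "categorical",
    "batch_size": 16,
    "target_size": (32, 32),
}


def make_train_generator(train_dir):
    """Return a Keras iterator over images stored in class subfolders."""
    # Imported lazily so this file can be loaded without TensorFlow.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # No augmentation arguments: the generator only loads and resizes.
    datagen = ImageDataGenerator()
    return datagen.flow_from_directory(directory=train_dir, **FLOW_KWARGS)


def preview_first_batch(train_dir):
    """Load one batch and return the shapes of the images and labels."""
    train_data = make_train_generator(train_dir)
    x, y = next(train_data)
    return x.shape, y.shape
```

Augmentation options such as a rotation range or flips would go into the `ImageDataGenerator()` call; leaving it empty matches the "leaving it blank" choice in the video.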
  • 00:22:12
    now if you want to just look at okay
  • 00:22:14
    let's load
  • 00:22:15
    a first random batch of 16 so let's go
  • 00:22:18
    ahead and do
  • 00:22:20
    x, y = next(train_data_keras) now if you
  • 00:22:22
    look at
  • 00:22:23
    up here my x is 16 images each image 32
  • 00:22:26
    by 32 by 3
  • 00:22:28
    and my y is 16 labels uh with seven
  • 00:22:32
    different classes
  • 00:22:33
    this is this is excellent right and you
  • 00:22:34
    see we are we aren't even converting
  • 00:22:36
    this using
  • 00:22:37
    to_categorical and everything because
  • 00:22:39
    we already said hey this is class mode
  • 00:22:41
    is categorical so it should if i'm right
  • 00:22:44
    convert this into one hot encoded
  • 00:22:47
    uh array okay so as you can see the
  • 00:22:50
    first one belongs to class number five
  • 00:22:52
    the second one belongs to class number
  • 00:22:54
    five and then class number two and so on
  • 00:22:55
    so
  • 00:22:56
    it saves you a lot of time in not doing
  • 00:22:58
    these steps that we typically do
  • 00:23:00
    when we load the data into you know a
  • 00:23:02
    pandas data frame and all that stuff
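To make the one-hot labels concrete, here is a small plain-Python sketch of what a categorical label row looks like and how to map it back to a class name. It assumes the class indices follow the alphabetical order of the seven HAM10000 folder names, which is how directory-based loaders typically assign them; the helper names are hypothetical.

```python
# The seven HAM10000 labels in alphabetical order, so that the list
# position matches the class index assigned from the folder names.
CLASS_NAMES = ["akiec", "bcc", "bkl", "df", "mel", "nv", "vasc"]


def one_hot(class_index, num_classes):
    """Return a list of zeros with a single 1.0 at class_index."""
    row = [0.0] * num_classes
    row[class_index] = 1.0
    return row


def decode(row):
    """Return the index of the largest entry (the argmax) of a label row."""
    return max(range(len(row)), key=row.__getitem__)


# The first three labels mentioned in the video: class 5, class 5, class 2.
batch_labels = [one_hot(i, len(CLASS_NAMES)) for i in (5, 5, 2)]
decoded = [CLASS_NAMES[decode(row)] for row in batch_labels]
print(decoded)  # ['nv', 'nv', 'bkl']
```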
  • 00:23:05
    okay so let's go ahead and
  • 00:23:08
    plot them since we reloaded them so
  • 00:23:12
    what am i doing here oh yeah i'm
  • 00:23:14
    plotting each image
  • 00:23:15
    there you go one two three four they all
  • 00:23:18
    look
  • 00:23:19
    different as you can see so it's just
  • 00:23:20
    loading these 16 in this batch
  • 00:23:23
    okay so this is with
  • 00:23:26
    keras i'll just quickly show you pytorch
  • 00:23:28
    if you're pytorch people
  • 00:23:30
    exactly the same thing we are getting
  • 00:23:31
    the directory in
  • 00:23:33
    pytorch again instead of image data
  • 00:23:35
    generator with a bunch of
  • 00:23:36
    operations you have equivalent
  • 00:23:38
    transforms.Compose okay
  • 00:23:40
    in this example i'm resizing them to 256
  • 00:23:43
    well you can change this to 32 right so
  • 00:23:45
    that's exactly what we did before
  • 00:23:47
    random horizontal flips you can do i
  • 00:23:50
    mean
  • 00:23:50
    convert to a tensor so you can actually
  • 00:23:52
    bring it into your
  • 00:23:53
    pytorch model training later on
  • 00:23:56
    and uh if you want you can normalize
  • 00:23:58
    this if these images are not normalized
  • 00:24:00
    that's another thing you can also
  • 00:24:01
    normalize these
  • 00:24:02
    here right in image data generator you
  • 00:24:04
    can say 1 over 255
  • 00:24:07
    so you can scale them to between values
  • 00:24:10
    between 0 to 1.
  • 00:24:11
    here you can normalize them with a mean
  • 00:24:13
    around here and standard deviation
  • 00:24:14
    around here for these three
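The two kinds of scaling mentioned here are simple arithmetic, sketched below in plain Python. The 1/255 factor comes from the video; the mean and standard deviation of 0.5 are placeholder values chosen only for illustration, since the transcript does not give the actual numbers.

```python
def rescale(pixel, factor=1.0 / 255):
    """Map an 8-bit pixel value (0..255) into the range 0..1."""
    return pixel * factor


def normalize(value, mean, std):
    """Shift and scale a value: (value - mean) / std."""
    return (value - mean) / std


pixel = 128
scaled = rescale(pixel)                           # about 0.502
centered = normalize(scaled, mean=0.5, std=0.5)   # about 0.004

# With mean 0.5 and std 0.5, the 0..1 range is mapped onto -1..1.
low = normalize(0.0, mean=0.5, std=0.5)    # -1.0
high = normalize(1.0, mean=0.5, std=0.5)   # 1.0
```

For an RGB image the same formula is applied per channel, with one mean and one standard deviation for each of the three channels.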
  • 00:24:17
    and then you can go ahead and train your
  • 00:24:19
    uh
  • 00:24:20
    you know use your ImageFolder very
  • 00:24:22
    similar to what we have done here
  • 00:24:24
    like datagen.flow_from_directory you're
  • 00:24:26
    doing exactly the same thing with ImageFolder
  • 00:24:28
    and you provide your folder path
  • 00:24:30
    okay in fact if you are providing your
  • 00:24:32
    transform image you
  • 00:24:34
    obviously do comma your transform i
  • 00:24:36
    think i included that here yeah the
  • 00:24:38
    transform operation
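A rough PyTorch counterpart of the steps just described, assuming `torch` and `torchvision` are installed. The function names are hypothetical, the normalization values are placeholders (the transcript does not state the real ones), and the resize uses a fixed square so it can be set to 32 or 256 as discussed.

```python
# Placeholder per-channel statistics for the three RGB channels;
# replace them with values computed from your own images.
NORM_MEAN = (0.5, 0.5, 0.5)
NORM_STD = (0.5, 0.5, 0.5)


def make_image_folder(train_dir, size=32):
    """Return a torchvision dataset that reads images from class subfolders."""
    # Imported lazily so this file can be loaded without torchvision.
    from torchvision import datasets, transforms

    tfm = transforms.Compose([
        transforms.Resize((size, size)),      # fixed square output
        transforms.RandomHorizontalFlip(),    # simple augmentation
        transforms.ToTensor(),                # PIL image -> tensor in 0..1
        transforms.Normalize(mean=NORM_MEAN, std=NORM_STD),
    ])
    return datasets.ImageFolder(train_dir, transform=tfm)


def make_loader(train_dir, batch_size=16):
    """Wrap the dataset in a DataLoader that yields shuffled batches."""
    from torch.utils.data import DataLoader

    dataset = make_image_folder(train_dir)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)
```

On the returned dataset, `len(dataset)` gives the number of samples and `dataset.class_to_idx` gives the folder-name-to-index mapping, which is a quick way to run the same sanity check as in Keras.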
  • 00:24:39
    uh i think that's pretty much it so if
  • 00:24:42
    you really would like to
  • 00:24:43
    check this how this thing works let's go
  • 00:24:45
    ahead and run it
  • 00:24:47
    again a sanity check this is already a
  • 00:24:49
    long video but i hope you find this to
  • 00:24:51
    be
  • 00:24:51
    very useful okay so it it loaded
  • 00:24:55
    and let's go ahead and print the number
  • 00:24:56
    of train samples there number of train
  • 00:24:58
    samples 10,015 we know that
  • 00:25:00
    right and uh the detected classes are
  • 00:25:03
    these are the classes
  • 00:25:05
    bcc is given a value of 1 bkl is given a
  • 00:25:08
    value of 2 and so on
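The class numbering can be reproduced without any deep learning library: subfolder names are sorted and numbered from zero. The sketch below assumes that convention (it is the one `ImageFolder` uses) and also counts the files per class as a sanity check; the function names are hypothetical.

```python
import os
from collections import Counter


def class_index_map(root):
    """Map each subfolder name to an index, in sorted (alphabetical) order."""
    classes = sorted(e.name for e in os.scandir(root) if e.is_dir())
    return {name: i for i, name in enumerate(classes)}


def count_per_class(root):
    """Return a Counter of {class_index: number_of_files_in_that_folder}."""
    counts = Counter()
    for name, idx in class_index_map(root).items():
        folder = os.path.join(root, name)
        counts[idx] = sum(1 for e in os.scandir(folder) if e.is_file())
    return counts
```

With the seven HAM10000 folders this gives akiec 0, bcc 1, bkl 2, df 3, mel 4, nv 5, vasc 6, so a dominant class 5 in the counts corresponds to nv.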
  • 00:25:09
    that's pretty much it and and anyway
  • 00:25:12
    let's go ahead and do some of these
  • 00:25:13
    print operations so you get a
  • 00:25:15
    uh oh we are not
  • 00:25:18
    let's not worry about this sorry i keep
  • 00:25:21
    doing these
  • 00:25:22
    i said let's not worry about it but it
  • 00:25:24
    bugs me if i leave it there
  • 00:25:26
    okay so as you can see we have about six
  • 00:25:28
    thousand seven hundred and five uh
  • 00:25:30
    images that belong to class five and we
  • 00:25:32
    know what that is right class five
  • 00:25:34
    nv, 6705 of these so we know that the
  • 00:25:36
    data
  • 00:25:37
    has been loaded successfully okay that's
  • 00:25:40
    a
  • 00:25:40
    uh long video i apologize but i hope you
  • 00:25:43
    find this to be
  • 00:25:45
    useful like i already mentioned so
  • 00:25:49
    let me summarize this by not extending
  • 00:25:51
    it too much two primary ways one
  • 00:25:53
    i'll go ahead and load them directly to
  • 00:25:55
    ram into your pandas data frame
  • 00:25:58
    and two organize them into
  • 00:26:00
    subdirectories first
  • 00:26:02
    and then use uh keras or pytorch to
  • 00:26:06
    to use your image data generator, use
  • 00:26:09
    datagen
  • 00:26:10
    in keras and in pytorch use your
  • 00:26:13
    uh you know datasets.ImageFolder to
  • 00:26:16
    kind of read the folder structure and
  • 00:26:18
    automatically sort the data for you
  • 00:26:21
    so stay tuned and again
  • 00:26:24
    wait for the next video where we'll
  • 00:26:26
    continue this discussion
  • 00:26:27
    and just do the classification part
  • 00:26:30
    that's the easy part
  • 00:26:31
    handling the data is the tough part by
  • 00:26:32
    the way okay
  • 00:26:34
    i know you enjoyed this because after 25
  • 00:26:36
    minutes you're still here so please go
  • 00:26:38
    ahead and subscribe to this channel and
  • 00:26:39
    you'll find more such content in future
  • 00:26:41
    thank you very much
Tags
  • HAM10000
  • Image classification
  • Dataset loading
  • Keras
  • PyTorch
  • Metadata
  • Data organization
  • Deep learning
  • Machine learning
  • Skin images