AWS re:Invent 2023 - Train and tune state-of-the-art ML models on Amazon SageMaker (AIM335)
Summary
TL;DR: The presentation highlights the evolution and challenges of training state-of-the-art machine learning models using Amazon SageMaker. Gal Oshri introduces the session and discusses the growing interest in deep learning applications, along with the benefits of using SageMaker to handle large-scale model training challenges, including infrastructure orchestration, data management, and cost efficiency. Emily Webber elaborates on fine-tuning and pre-training large language models with practical demonstrations, emphasizing the ease of use and scalability of SageMaker. Tom Kollar shares insights from the Toyota Research Institute, detailing their application of SageMaker for various machine learning tasks, particularly in robotics, and highlights the importance of robust training infrastructure. The session underscores advancements like smart data sifting and distributed training, showcasing SageMaker as a versatile tool for model development.
Key takeaways
- 👤 Gal Oshri introduces SageMaker and its capabilities.
- 📊 Emily discusses fine-tuning large language models with a live demo.
- 🚗 Tom shares insights from Toyota Research on using SageMaker in robotics.
- ⚙️ The challenges of training large models include hardware utilization and cost efficiency.
- 📦 Amazon SageMaker streamlines the training process with its estimator API.
- 📈 Smart sifting can reduce training time by filtering uninformative samples.
- 💾 SageMaker allows seamless integration of various data sources for training.
- 🔧 The importance of cluster repair features for uninterrupted training.
- 🌐 Advances in transformer architecture improve model performance significantly.
- 🔍 Customers can securely customize third-party models without exposing their data.
Timeline
- 00:00:00 - 00:05:00
Gal Oshri introduces himself and his colleagues, presenting on training and tuning state-of-the-art ML models on Amazon SageMaker. He polls the audience on how many are currently training ML models, especially at larger scales, then outlines the agenda: challenges in large-scale model training, demonstrations, and research use cases.
- 00:05:00 - 00:10:00
Machine learning is showcased as a versatile tool across various applications like recommendations and autonomous driving, highlighting recent advances in image generation from ML models. Gal discusses how algorithmic improvements, particularly the transformer architecture, and increased availability of data and compute power have contributed to the enhancement of model outputs over recent years.
- 00:10:00 - 00:15:00
Challenges in training large-scale ML models are detailed, including the need for efficient hardware, proper orchestration, dataset management, scaling infrastructure, and cost management. Gal emphasizes the importance of optimizing both financial costs and team resources, advocating for effective use of SageMaker in addressing these challenges.
- 00:15:00 - 00:20:00
Amazon SageMaker is presented as a solution to the aforementioned challenges. A high-level overview of SageMaker's operation is provided, detailing how it streamlines the training process through a structured API that manages compute resources, health checks, and data integration, ensuring efficient model training and cost-effectiveness.
- 00:20:00 - 00:25:00
SageMaker's capabilities for loading data, using built-in algorithms, and providing distributed training options are discussed. Features like logging, checkpoint synchronization, and instance management further highlight how SageMaker facilitates efficient training while ensuring resiliency against failures during training jobs.
- 00:25:00 - 00:30:00
Model training efficiency and performance tracking are discussed using SageMaker tools like the profiler. The profiler helps optimize GPU usage and diagnose performance issues, which is critical for minimizing overall training cost and time through better hardware utilization.
- 00:30:00 - 00:35:00
Emily introduces the concept of smart sifting, a new feature in SageMaker aimed at improving training efficiency by refining data during training, reducing overall training time and costs by up to 35% without negatively impacting model accuracy. She presents how to implement this feature simply using SageMaker's framework.
- 00:35:00 - 00:40:00
Amazon SageMaker HyperPod is announced for managing large-scale training clusters, giving easier access to and control over the training environment while keeping SageMaker's managed benefits, with features such as resilience and reduced setup time.
- 00:40:00 - 00:45:00
The discussion shifts to fine-tuning models, emphasizing how it can be executed on proprietary models without exposing sensitive data. New security enhancements in SageMaker facilitate this process while maintaining the model’s confidentiality, showcasing an efficient end-to-end workflow.
- 00:45:00 - 00:54:18
Emily transitions to her segment focused on fine-tuning and pre-training large language models (LLMs) on SageMaker. She explains various customization techniques for LLMs, from simpler adjustments like prompt engineering to more complex methods like retrieval-augmented generation and pre-training new foundation models.
Q&A
What is Amazon SageMaker?
Amazon SageMaker is a fully managed service that provides tools to build, train, and deploy machine learning models quickly and efficiently.
What are the benefits of using SageMaker for training models?
SageMaker helps automate the orchestration of training jobs, monitors hardware health, offers distributed training libraries, and provides tools for monitoring and managing costs.
How does SageMaker assist with large model training?
It offers tools for easy scaling of infrastructure, enables distributed training, and ensures fault tolerance through cluster repair features.
What is smart sifting in the context of SageMaker?
Smart sifting is a technique that filters out less informative training data, potentially reducing training time and costs without impacting model accuracy.
How does Toyota Research Institute utilize SageMaker?
TRI uses SageMaker for various applications including training large language models, robotics projects, and serving models in production.
What advancements have impacted deep learning in recent years?
Significant algorithmic improvements, particularly the introduction of the transformer architecture, along with larger datasets, model sizes, and computational resources.
How can one get started with SageMaker?
Users can easily initiate a training job by using the estimator API, which requires minimal setup and configuration.
What types of customization can be done on large language models?
Models can be customized through techniques like prompt engineering, retrieval augmented generation, fine-tuning, and pre-training.
What role does model parallel training play in SageMaker?
Model parallel training allows the distribution of a neural network over multiple GPUs or accelerators to optimize performance.
Can existing models be fine-tuned on SageMaker?
Yes, SageMaker allows users to fine-tune third-party models on their data securely without exposing their data to the model provider.
- 00:00:00- Good afternoon everyone.
- 00:00:01My name is Gal Oshri, I'm a product manager
- 00:00:03at AWS working on SageMaker.
- 00:00:06I'm here with Emily Webber and Thomas Kollar,
- 00:00:08to talk to you about training and tuning state-of-the-art
- 00:00:11machine learning models on Amazon SageMaker.
- 00:00:14Before we start, how many of you're already training
- 00:00:16machine learning models today?
- 00:00:20Awesome.
- 00:00:20How many of you're training models
- 00:00:21with more than 10 GPUs or accelerators?
- 00:00:25All right, anyone with more than a hundred?
- 00:00:29Alright, a thousand?
- 00:00:31No.
- 00:00:32Alright, cool, well, today we'll learn a bit about that.
- 00:00:37So we'll talk about the challenges,
- 00:00:39for training large scale machine learning models.
- 00:00:42And then we'll talk about how SageMaker
- 00:00:43can help you train those models.
- 00:00:45Emily will then talk to you about fine tuning
- 00:00:47and pre-training large language models,
- 00:00:50and show you a demo, training Llama 7B on SageMaker.
- 00:00:54And then we'll hear from Tom,
- 00:00:55about Toyota Research Institute
- 00:00:57and their machine learning use cases.
- 00:01:03So machine learning has already proven itself useful
- 00:01:05across a wide range of applications.
- 00:01:07From recommendations, to credit risk prediction,
- 00:01:10and autonomous driving, to document analysis.
- 00:01:14But recently there's been an explosion in interest
- 00:01:16in deep learning models for computer vision
- 00:01:19and natural language processing.
- 00:01:22Just a few years ago, this is the type of image you would
- 00:01:25get if you tried to generate a very clean living room
- 00:01:28with a machine learning model.
- 00:01:30You can tell that it's fake, it's not really coherent,
- 00:01:32you can't tell that it's a living room.
- 00:01:36And just a few years later, we can now generate images
- 00:01:39like this, where you'd have to look really closely
- 00:01:42to tell that it's not a real image.
- 00:01:44And I showed something similar a year ago at re:Invent,
- 00:01:47and at that time this was kind of shocking, right,
- 00:01:49seeing this type of image get generated from a model.
- 00:01:52But a year later, and I think many people in the audience
- 00:01:54have already seen these types of images get generated,
- 00:01:57and the quality that you can get
- 00:01:58with machine learning models.
- 00:02:01So how did this happen?
- 00:02:03Well first, there were notable algorithmic improvements
- 00:02:05over the last few years.
- 00:02:07Specifically the transformer architecture,
- 00:02:09which is used in many of the large scale models
- 00:02:12that you hear about today.
- 00:02:14However, there's also an increase in the data sets,
- 00:02:17the model sizes, and the amount of compute that is used to
- 00:02:20train these models.
- 00:02:22And a lot of the research shows that we can continue
- 00:02:24increasing these dimensions,
- 00:02:26to get better and better results.
- 00:02:29So to be competitive, you really have to think about
- 00:02:31how do you leverage these advancements
- 00:02:34to provide the best experiences
- 00:02:35for your customers with machine learning.
- 00:02:39Okay, so training large scale models is awesome.
- 00:02:41Let's just train the biggest one immediately
- 00:02:43and be done with it, right?
- 00:02:45But, it's a bit more complicated than that.
- 00:02:47There's some challenges.
- 00:02:49The first, is that you want to use the latest hardware.
- 00:02:52Every few years there are innovations in hardware
- 00:02:54that lead to two to nine x improvements in training
- 00:02:57efficiency.
- 00:02:59But it's not enough to get access to the latest hardware,
- 00:03:02you have to think about how well it works.
- 00:03:04Is it, like, fault resistant enough to let you
- 00:03:08continue your training with minimal interruptions
- 00:03:10to the machine learning team?
- 00:03:13You have to think about orchestration,
- 00:03:15and how to most effectively use the
- 00:03:16resources you have available.
- 00:03:18Especially if you have a large team of data scientists
- 00:03:21who want to train many models in parallel.
- 00:03:24We talked about how you want to have larger data sets,
- 00:03:27and being able to store, load, and process
- 00:03:29these large data sets, can require a lot of work.
- 00:03:33And there are a lot of pitfalls in doing that.
- 00:03:36You wanna think about scaling up.
- 00:03:38Both the infrastructure, to get more compute for training
- 00:03:41the model, as well as the algorithms that you use.
- 00:03:44The models that we train today, for these use cases,
- 00:03:47do not fit on a single accelerator.
- 00:03:49So you have to think about the algorithms
- 00:03:50that you need to use to scale up.
- 00:03:53And finally, we have to think about cost.
- 00:03:56Training these models can cost hundreds of thousands,
- 00:03:58or millions of dollars.
- 00:04:00So you need to think about efficiency when you're training
- 00:04:02those models.
- 00:04:03Especially at the beginning, when you're doing
- 00:04:04sporadic experimentation, you're trying out different ideas,
- 00:04:08and you don't use the hardware all the time.
- 00:04:11Right, so you want to think about how you use that
- 00:04:12efficiently.
- 00:04:14And it's not just the financial cost,
- 00:04:15but the team's time, right?
- 00:04:17A lot of customers tell us, that making sure that their
- 00:04:20ML engineers are not spending time dealing
- 00:04:22with infrastructure, is one of their top priorities.
- 00:04:27But not all hope is lost.
- 00:04:29Amazon SageMaker can help with many of these challenges.
- 00:04:34I'll give a high level overview of how SageMaker works,
- 00:04:37but you'll see it in a lot more detail during Emily's demo.
- 00:04:41We start, by calling the create training job API.
- 00:04:45This SageMaker API captures information about your dataset,
- 00:04:48your compute resource configuration,
- 00:04:50as well as the training algorithm that you want to use.
- 00:04:54SageMaker will then set up the cluster
- 00:04:56for training the model,
- 00:04:58with the right VPC and networking configurations,
- 00:05:01by default, to save you a lot of time,
- 00:05:03but you can configure all of it yourself as well
- 00:05:06and add the flexibility that you need.
- 00:05:09As part of spinning up the cluster,
- 00:05:11SageMaker will also run health checks on the hardware,
- 00:05:14to make sure that everything is working effectively,
- 00:05:16before the job even begins, and before the billing starts.
- 00:05:20So this saves you time and money,
- 00:05:22to make sure that the training can continue efficiently.
- 00:05:28SageMaker will then load data from S3, EFS,
- 00:05:31or FSx for Lustre, and you have options to either copy
- 00:05:35or stream the data.
- 00:05:36Depending on your dataset size, one of those might be
- 00:05:38more applicable.
- 00:05:40But again, and you'll hear this theme again and again,
- 00:05:42while SageMaker provides great options for
- 00:05:45getting started and moving quickly,
- 00:05:47you also have the flexibility to do what you want,
- 00:05:50and load data from other sources.
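As a rough illustration of those input options, here is what the two modes might look like with the SageMaker Python SDK (bucket names and channel names below are placeholders, not values from the session):

```python
from sagemaker.inputs import TrainingInput

# "File" mode copies the whole dataset onto the training instances before the
# job starts -- simple, and fine for smaller datasets.
copied_input = TrainingInput(
    s3_data="s3://my-bucket/train/",   # illustrative bucket/prefix
    input_mode="File",
)

# "FastFile" mode streams objects from S3 on demand instead of copying them
# up front -- often a better fit for very large datasets.
streamed_input = TrainingInput(
    s3_data="s3://my-bucket/train/",
    input_mode="FastFile",
)

# Either one is passed to the estimator, e.g.:
# estimator.fit({"training": streamed_input})
```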
- 00:05:54You can then download the training image from ECR,
- 00:05:57and you have options of built-in algorithms in SageMaker.
- 00:06:01You can use one of the SageMaker deep learning containers
- 00:06:03to quickly use PyTorch, TensorFlow, or Hugging Face,
- 00:06:06or you can bring your own training image,
- 00:06:09with your own algorithms completely.
- 00:06:12SageMaker also offers distributed training libraries
- 00:06:14that can help accelerate both data, and model parallel
- 00:06:17training, and you'll hear more about that later.
- 00:06:22SageMaker then starts the training, and streams the logs
- 00:06:24to CloudWatch throughout training.
- 00:06:27It stores the metadata and hyper parameters,
- 00:06:29so you can view it later.
- 00:06:30And you have, again, options for using TensorBoard,
- 00:06:32and other tools, to visualize your experiments.
- 00:06:37It will synchronize your checkpoints throughout training,
- 00:06:39to your storage, which is critical, if you are, you know,
- 00:06:43you want to be fault resistant in case something fails
- 00:06:46during training, you don't want to lose your progress
- 00:06:48until that point.
- 00:06:51At the end of the training, SageMaker will save the model
- 00:06:54and other output data so you can revisit it later.
- 00:06:58At the end of the training, SageMaker spins down all the
- 00:07:01compute, so that if the job fails at 3:00 AM,
- 00:07:04no one has to wake up to turn anything off
- 00:07:06and make sure that you're not paying for all that hardware,
- 00:07:09all those instances running,
- 00:07:10without being used to train a model.
- 00:07:13And with the same paradigm,
- 00:07:15we can actually scale up our training
- 00:07:17to many more instances really easily,
- 00:07:19to get those large scale models.
- 00:07:23One really awesome feature that we launched this year,
- 00:07:26is a cluster repair feature.
- 00:07:28So if during the training, one of the instances fails,
- 00:07:31we look at what happened to that instance
- 00:07:34and decide whether we need to reboot it
- 00:07:36or replace it with a different instance,
- 00:07:38and then restart the training within a few minutes.
- 00:07:41So there are all these resiliency capabilities
- 00:07:43to ensure the training continues as quickly as possible
- 00:07:46and without manual intervention.
- 00:07:52In case any of that sounds intimidating,
- 00:07:54the good news, it's actually really easy to get started.
- 00:07:57The most important code for converting model training
- 00:07:59to a SageMaker training job, is the estimator API.
- 00:08:04You'll see more in the demo later,
- 00:08:05but at a high level, the API takes a Python file
- 00:08:09or an entry point, in this case cifar10.py,
- 00:08:13which is very similar to how I would do the model training
- 00:08:16on my laptop.
- 00:08:18I also provide the instance type I want to use,
- 00:08:21how many of them, and hyper parameters,
- 00:08:24which I can easily change later
- 00:08:25to try additional training jobs.
- 00:08:28I also add metric definitions, so that I can view
- 00:08:31those metrics in CloudWatch during the training.
- 00:08:35Finally, I provide a path to my data and call estimator.fit.
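A minimal sketch of the estimator call being described, using the SageMaker Python SDK; the role, instance type, hyperparameters, and metric regex here are illustrative placeholders rather than the exact values shown on the slide:

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="cifar10.py",          # the same script you would run locally
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    framework_version="2.0",
    py_version="py310",
    instance_type="ml.g5.2xlarge",
    instance_count=1,
    hyperparameters={"epochs": 10, "batch-size": 256, "lr": 0.001},
    metric_definitions=[
        # Regex applied to the job's log lines so the metric shows up in CloudWatch.
        {"Name": "train:loss", "Regex": "loss: ([0-9\\.]+)"},
    ],
)

# SageMaker provisions the cluster, runs health checks, streams logs to
# CloudWatch, and tears everything down when the job finishes.
estimator.fit({"training": "s3://my-bucket/cifar10/"})
```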
- 00:08:42Now the even better news, is that we recently made it
- 00:08:44even easier to get started.
- 00:08:47Now, you can take your existing Python code
- 00:08:50and add the remote python decorator to it,
- 00:08:53to immediately serialize the runtime,
- 00:08:56the packages, functions, and everything else,
- 00:08:58so that it runs as a SageMaker training job
- 00:09:01without even having to learn about the estimator API.
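A small sketch of that decorator-based path, assuming the SageMaker Python SDK's remote function feature and a default execution role and region already configured; the instance type and requirements file are illustrative:

```python
from sagemaker.remote_function import remote

@remote(instance_type="ml.m5.xlarge", dependencies="./requirements.txt")
def train(lr: float, epochs: int) -> float:
    # ...existing local training code, unchanged...
    # The function, its arguments, and the local environment are serialized
    # and executed as a SageMaker training job.
    return 0.0  # e.g. return the final validation loss

# Runs remotely on the requested instance; the return value comes back locally.
final_loss = train(lr=0.001, epochs=10)
```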
- 00:09:07Once the training job begins,
- 00:09:08you can easily view the metadata and reproduce it later,
- 00:09:12or clone that training job.
- 00:09:14And you know, like tracking experiments
- 00:09:15is extremely important.
- 00:09:17You want to learn from what you've done before.
- 00:09:19And we often see people who keep the training results
- 00:09:22in a spreadsheet, or even in a document,
- 00:09:24and pass it around the team,
- 00:09:25but that makes it more difficult to collaborate
- 00:09:28and learn from past experiments.
- 00:09:30So by automatically keeping all of this in one place,
- 00:09:33it becomes much easier to learn from your mistakes
- 00:09:36and build better models.
- 00:09:42Now let's move on from tracking the training,
- 00:09:44to improving the performance,
- 00:09:46specifically the training speed, which impacts how much time
- 00:09:49you end up requiring to like use the instances
- 00:09:52to train the model, and the overall project completion time,
- 00:09:56as well as the cost.
- 00:09:58Now, the SageMaker profiler, is an ML observability tool
- 00:10:01that enables you to understand hardware utilization
- 00:10:04and root cause performance issues
- 00:10:06to maximize the efficiency of your model training.
- 00:10:09On the dashboard shown here, we can see some overall metrics
- 00:10:13around the GPU usage, and you want that to be as high
- 00:10:16as possible, as well as the GPU usage throughout
- 00:10:19the training job, across each individual node
- 00:10:22within your cluster.
- 00:10:24So you can see that even if your utilization overall might
- 00:10:27be high, within some intervals, there might be
- 00:10:29low utilization that you want to check out a bit more.
- 00:10:34Lower down on the dashboard,
- 00:10:35there are other metrics.
- 00:10:36For example, the total time spent on each GPU kernel.
- 00:10:40So that gives you additional hints about what you want
- 00:10:42to optimize next, to further improve your training.
- 00:10:48There's another page in the profiler,
- 00:10:49showing a more detailed timeline view,
- 00:10:52that allows you to get data from your host and devices
- 00:10:56at all the different levels,
- 00:10:57so you can dig deeper to understand
- 00:10:59what is happening at each point.
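For context, profiling has long been attached to a training job through a profiler_config on the estimator. The sketch below uses the Debugger-based configuration that predates the profiler UI described here; the newer SageMaker Profiler may have its own enablement flow, so treat these parameters as an assumption and check the current documentation:

```python
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
from sagemaker.pytorch import PyTorch

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,           # hardware metrics sampling interval
    framework_profile_params=FrameworkProfile(),  # step/kernel-level framework profiling
)

estimator = PyTorch(
    entry_point="train.py",                       # illustrative entry point
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    framework_version="2.0",
    py_version="py310",
    instance_type="ml.p4d.24xlarge",
    instance_count=2,
    profiler_config=profiler_config,
)
```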
- 00:11:04Now I'm excited to announce a preview of a new capability
- 00:11:07in SageMaker, smart sifting of data.
- 00:11:10Smart sifting is an online data refinement technique
- 00:11:12that can reduce your deep learning training time and cost
- 00:11:15by up to 35%.
- 00:11:18Now when you train your models,
- 00:11:20and we talked about wanting to use larger data sets,
- 00:11:23it also matters about the quality of the data sets.
- 00:11:26And, some samples in your data might be less informative
- 00:11:30to your model training,
- 00:11:31or you might have seen those samples already.
- 00:11:33There might be duplicate data or similar data.
- 00:11:36And it's often difficult to pre-process that data
- 00:11:39and remove the data that you don't want
- 00:11:41in the training anymore.
- 00:11:42So smart sifting helps, because it analyzes your data
- 00:11:45during the training job, and filters out
- 00:11:47the low loss samples, which are less informative
- 00:11:50to the model.
- 00:11:52By training on a subset of your data,
- 00:11:53you can reduce the time and cost of the training
- 00:11:56by up to 35%.
- 00:11:58And because it only filters out the low loss samples,
- 00:12:02it has minimal or no impact on the final training accuracy.
- 00:12:08And it's easy to get started,
- 00:12:09because it does not require you to make changes
- 00:12:11to your data or training pipeline.
- 00:12:16Here's a simple example that uses smart sifting.
- 00:12:20We use the SageMaker deep learning container,
- 00:12:22we load the sifting data loader,
- 00:12:25and then we wrap whatever existing data loader we use
- 00:12:28with the sifting data loader,
- 00:12:29and provide a bit more configuration to start using it.
- 00:12:33I don't need to change the rest of my model
- 00:12:35or data pipeline.
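The wrapping pattern being described looks roughly like the sketch below. It assumes you already have a PyTorch DataLoader, a model, and a thin loss wrapper; the smart_sifting import paths, class names, and parameter values are written from memory of the feature's documentation and should be treated as assumptions:

```python
# Assumed import paths -- verify against the smart sifting documentation.
from smart_sifting.dataloader.sift_dataloader import SiftingDataloader
from smart_sifting.sift_config.sift_configs import (
    RelativeProbabilisticSiftConfig, LossConfig, SiftingBaseConfig,
)

sift_config = RelativeProbabilisticSiftConfig(
    beta_value=3,                 # how aggressively low-loss samples are filtered
    loss_history_length=500,      # window of recent losses used for the decision
    loss_based_sift_config=LossConfig(
        sift_config=SiftingBaseConfig(sift_delay=10)  # warm-up steps before sifting
    ),
)

# Wrap the existing loader; the model and the rest of the training loop stay unchanged.
train_dataloader = SiftingDataloader(
    sift_config=sift_config,
    orig_dataloader=train_dataloader,   # your existing torch DataLoader
    loss_impl=sifting_loss,             # thin wrapper around your loss (assumed interface)
    model=model,
)
```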
- 00:12:38And we already have customers who are seeing great results
- 00:12:40with this capability.
- 00:11:42For example, LG AI Research
- 00:11:44used it to get a meaningful increase in training performance
- 00:12:47without any changes to the model accuracy.
- 00:12:54At re:Invent, we also announced Amazon SageMaker HyperPod.
- 00:12:58This enables customers who are training large scale models
- 00:13:00to get all the managed benefits of SageMaker
- 00:13:03that we've been discussing today,
- 00:13:05with a UX that they might be more familiar with,
- 00:13:07of being able to access the instances directly,
- 00:13:10using Slurm and so on.
- 00:13:13It has similar resilient capabilities to what we discussed
- 00:13:15for training, in terms of replacing faulty instances
- 00:13:20and enabling the training to begin a bit more quickly,
- 00:13:24saving up to 20% in time.
- 00:13:27It also benefits from the optimized distributed training
- 00:13:30libraries, that also improve performance,
- 00:13:32for both model and data parallel training.
- 00:13:37And I mentioned it provides more granular control
- 00:13:40over the cluster in what you're doing,
- 00:13:42being able to access the instances directly,
- 00:13:45install additional software, and make any changes
- 00:13:47that you want to the cluster, to be able to fine tune
- 00:13:50your training a bit more.
- 00:13:54Now, we've talked about training really large scale models,
- 00:13:57but sometimes you don't need to do that.
- 00:13:58Sometimes you just want to fine tune an existing model.
- 00:14:01And that's beneficial if you have an existing model,
- 00:14:04a foundation model, and you want to bring in your own data
- 00:14:08to fine tune that model to a particular use case.
- 00:14:12So, by bringing in your own data,
- 00:14:14you're making the model better than if you were just using
- 00:14:16an off the shelf foundation model.
- 00:14:18But, it saves you a lot of time and money,
- 00:14:20because you don't have to train
- 00:14:21that whole model from scratch.
- 00:14:24Now the challenge is, that some models are not open sourced.
- 00:14:27Right, you can't download the model weights
- 00:14:29and fine tune them yourselves in an existing
- 00:14:31SageMaker training job.
- 00:14:33But, this has changed with enhancements we've made
- 00:14:35to SageMaker algorithms and model packages.
- 00:14:39You can now easily and securely customize third party models
- 00:14:42by fine tuning them on your private data.
- 00:14:45This provides end-to-end security.
- 00:14:47The model provider can provide their model without revealing
- 00:14:50their model weights, and you as a customer,
- 00:14:53can fine tune on that model by bringing in your own data,
- 00:14:56without exposing that data to the model provider.
- 00:14:59And the final model weights, after your fine tuning,
- 00:15:02are also only available to you.
- 00:15:06Now this can be done with a variety of models
- 00:15:07and algorithms, for example, Cohere models.
- 00:15:12And all this is easy to use, and done through
- 00:15:14the Python SageMaker SDK, that we were discussing earlier,
- 00:15:18and integrates with other SageMaker capabilities,
- 00:15:20like SageMaker experiments and pipelines.
- 00:15:25And of course, with SageMaker inference,
- 00:15:27you can deploy the models at the end,
- 00:15:29to use them for inference in production scenarios,
- 00:15:33in a secure way.
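A hedged sketch of what that flow can look like with the SageMaker Python SDK's AlgorithmEstimator; the algorithm ARN, hyperparameters, and channel name are placeholders, and the real values come from the model provider's listing:

```python
from sagemaker.algorithm import AlgorithmEstimator

estimator = AlgorithmEstimator(
    # Placeholder ARN for a provider's SageMaker algorithm listing.
    algorithm_arn="arn:aws:sagemaker:us-east-1:123456789012:algorithm/example-proprietary-llm",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    instance_count=1,
    instance_type="ml.p4d.24xlarge",
    hyperparameters={"epochs": "1"},   # defined by the provider's listing
)

# Your training data never leaves your account, the provider's weights are never
# exposed to you, and the fine-tuned artifact lands in your own S3 bucket.
estimator.fit({"training": "s3://my-bucket/fine-tuning-data/"})
```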
- 00:15:36I'll now hand it over to Emily, to talk about fine tuning
- 00:15:38and pre-training LLMs on SageMaker.
- 00:15:41- Alright, thanks Gal.
- 00:15:43Great, so I hope you're as excited as I am about a lot
- 00:15:46of these new launches, new features.
- 00:15:48I should introduce myself.
- 00:15:50My name's Emily Webber, I lead our generative AI
- 00:15:53foundation's technical field community here at AWS.
- 00:15:57And in particular, some of those launches,
- 00:16:00many of them came directly from conversations with you.
- 00:16:03Actually, we were listening with customers
- 00:16:05and chatting with you to understand key capabilities
- 00:16:08that you wanted to see in our training stack,
- 00:16:11and this led to a number of the features
- 00:16:12that you've just learned about.
- 00:16:16In any case, there are many ways to customize
- 00:16:20a large language model.
- 00:16:22Here I'm presenting them on two axis, right?
- 00:16:25So at the bottom you have roughly complexity and then cost.
- 00:16:29Obviously you wanna be closer to the left.
- 00:16:32You want your LLM customization techniques
- 00:16:35to be roughly easy, because then of course that's faster
- 00:16:39for you to get started, and then that's less expensive.
- 00:16:42However, there is a progression of these techniques.
- 00:16:46Most customers will start with prompt engineering,
- 00:16:51which is a nice way to easily improve
- 00:16:54and then customize your large language model.
- 00:16:56However, it's not as accurate as some of these
- 00:16:59extra techniques that you can use.
- 00:17:02Most customers will move from prompt engineering into
- 00:17:06what we call a retrieval augmented generation stack,
- 00:17:10where you have some set of data, you're converting
- 00:17:13that data into embeddings, or that dense representation,
- 00:17:17and then retrieving those documents
- 00:17:19to interact with your consumers.
- 00:17:22This then can transform, if you will,
- 00:17:25into a fine tuning stack.
- 00:17:27Actually, there's a bit of an overlap there,
- 00:17:30but in any case, you can take, as Gal mentioned,
- 00:17:33custom data, fine tune your model to add
- 00:17:37that extra knowledge.
- 00:17:39All of these techniques, however, pale in comparison
- 00:17:43to the holy grail, which is pre-training,
- 00:17:45which is creating a net new foundation model.
- 00:17:49And so all of these techniques are available on SageMaker
- 00:17:53and well supported by the stack.
- 00:17:55So we're gonna learn how to move from
- 00:17:59fine tuning into pre-training,
- 00:18:01during our session here today.
- 00:18:04Now, fine tuning small models is really impactful.
- 00:18:09Here are a couple reasons why you would consider
- 00:18:11fine tuning a small model.
- 00:18:13The first is, of course, it's less expensive.
- 00:18:16You're going to use a smaller dataset,
- 00:18:18possibly a smaller model, and then still improve accuracy
- 00:18:24because you're fine tuning this model,
- 00:18:25but you're keeping your costs down.
- 00:18:28When you're working with a smaller model,
- 00:18:30such as something in the 7 billion parameter range,
- 00:18:33this is inherently faster, because the model itself
- 00:18:36is just physically smaller than some of those larger ones,
- 00:18:40and so the training time is faster,
- 00:18:43the inferencing time is faster,
- 00:18:45which means you can train more models
- 00:18:48and you can do more inferencing,
- 00:18:50again, with that smaller object.
- 00:18:53Because the object is smaller,
- 00:18:55it's easier for you to manage.
- 00:18:57And so, again, the storage requirements are smaller,
- 00:19:01so it's easier for you to copy the model.
- 00:19:03It's easier for you to put the model into your applications,
- 00:19:07and your packages, and your CICD pipelines,
- 00:19:10and your repositories.
- 00:19:12Many customers inherently prefer the ownership that comes
- 00:19:16with creating new models, particular through fine tuning,
- 00:19:20and then again, pre-training.
- 00:19:21This allows you to increase the IP of your firm.
- 00:19:25And then of course you have more deployment options when
- 00:19:28you're fine tuning, again, that small model.
- 00:19:32The more deployment options include serverless,
- 00:19:34actually I have customers who create and then fine tune
- 00:19:39these small 7 billion parameter models, compile them,
- 00:19:43and then host them on Lambda, (chuckling)
- 00:19:45and run them on serverless inferencing.
- 00:19:47And so, absolutely, when you're working with these
- 00:19:50tiny models, that are knowledgeable in small domains,
- 00:19:55you have a lot of flexibility.
- 00:19:58Pre-training is really best for extremely large data sets.
- 00:20:03So when you have hundreds of GBs, or multiple terabytes
- 00:20:07of custom language data that just really is not online,
- 00:20:11if it, the language data that you have,
- 00:20:15if it's not in Wikipedia, if it's not on Reddit,
- 00:20:20if it's the core language that you're using,
- 00:20:23if when you, you know, take a sentence and try and put
- 00:20:26that sentence into Wikipedia, for example,
- 00:20:29if Wikipedia doesn't understand what you're trying to say,
- 00:20:31you may wanna consider seriously customizing a language
- 00:20:35model, and then possibly creating a new one from scratch.
- 00:20:39Now, why is this the case?
- 00:20:41Why is pre-training so powerful?
- 00:20:44Part of this is because the pre-training loss function
- 00:20:48is more generalizable.
- 00:20:49So when you're creating that new foundation model
- 00:20:53from scratch, the learning is slightly different.
- 00:20:56It's more general, and it's deeper
- 00:20:58in the neural network, actually.
- 00:21:01Also, when you're creating a new foundation model,
- 00:21:04you can do this without supervised data.
- 00:21:07So you don't need to go label, you know, millions of records
- 00:21:11in pre-training, you can just capture and tokenize
- 00:21:14a terabyte of your own language data
- 00:21:18and then throw that into the network.
- 00:21:20There's no need to add additional supervision on top of
- 00:21:22that, which makes it very attractive.
- 00:21:26Also, I love to see the efficiency gains
- 00:21:29of pre-training, actually.
- 00:21:30We all have small teams, we all have have few resources
- 00:21:34for data science and modeling, and so,
- 00:21:36when we take our small teams and focus them on one project,
- 00:21:40and create this one massive, you know,
- 00:21:42powerful foundation model,
- 00:21:44and then use the foundation model
- 00:21:46in many, many applications,
- 00:21:48it actually, I find, is more efficient than optimizing
- 00:21:52and then maintaining our tiny ML ops workloads,
- 00:21:56which is what many of us were doing,
- 00:21:58prior to transformers.
- 00:22:01So, what does it take, to pre-train a new foundation model?
- 00:22:07It sounds scary, it sounds like only, you know,
- 00:22:09the best can do this, but in fact,
- 00:22:13in large part, due to, you know, a very sophisticated
- 00:22:17and very mature training infrastructure that you're here
- 00:22:19to learn about, it's actually pretty accessible.
- 00:22:22So how are we gonna do this?
- 00:22:24So here are three example models that were pre-trained
- 00:22:28and created from scratch on Amazon SageMaker,
- 00:22:31specifically on our training infrastructure.
- 00:22:33Stable Diffusion, clocking in at 5 billion images
- 00:22:38and 240 terabytes of image data.
- 00:22:41And so of course, that's a lot.
- 00:22:43And so, image models tend to take a lot of data,
- 00:22:47but the models themselves are a bit smaller.
- 00:22:50And so you can use smaller cluster sizes.
- 00:22:54The Falcon model of course,
- 00:22:56from Technology Innovation Institute,
- 00:22:58is a very large language model, the largest open source
- 00:23:01language model.
- 00:23:031 trillion tokens, just under three terabytes
- 00:23:06of language data, 40 billion parameters,
- 00:23:09and then 48 p4d instances.
- 00:23:13So sizable cluster, and that is two months
- 00:23:16to train this model.
- 00:23:19And then we have another financial large language model
- 00:23:22trained on SageMaker with just under two terabytes
- 00:23:25of language data.
- 00:23:27And so all of these requirements,
- 00:23:30are surprisingly accessible.
- 00:23:32Actually, I think there are quite a few companies
- 00:23:34with that volume of language data.
- 00:23:38And then the capabilities that we provide on SageMaker,
- 00:23:41make the training experience, again, very accessible
- 00:23:45to a wide variety of companies.
- 00:23:49So how do we do this?
- 00:23:51If we know we have, we meet the requirements,
- 00:23:54how are we gonna go about creating
- 00:23:56and pre-training these foundation models on AWS?
- 00:23:59So the first step is just gathering and accessing that data.
- 00:24:03And again, we want at least, I'd say one terabyte
- 00:24:07of your own language data.
- 00:24:09So this is documents, digitized PDFs, conversations,
- 00:24:15you know, language streams like rich,
- 00:24:18rich, robust language data.
- 00:24:20So you wanna gather about one terabyte
- 00:24:22of this language data.
- 00:24:23Many firms will then pair that with open source data
- 00:24:27actually, so that your model understands both
- 00:24:30the nuances of your company's acronyms, and history,
- 00:24:35and phrasing, and domain expertise,
- 00:24:37but also knows what time the sun rises in Honolulu.
- 00:24:41Because of course we want that mix of the general,
- 00:24:44sort of open source knowledge,
- 00:24:46but also what's specific to your company.
- 00:24:49And so, that's gathering and storing the information.
- 00:24:52After that, you'll pre-process your data.
- 00:24:56SageMaker also has a really nice capability
- 00:24:59for pre-processing datasets.
- 00:25:01Actually one of our builders, Jenny over here,
- 00:25:04helped me run many pre-processing
- 00:25:06and data transformation jobs on SageMaker.
- 00:25:10And so you can use our training job API,
- 00:25:14including that remote function that we just learned about,
- 00:25:17to run jobs in parallel, which are then tokenizing
- 00:25:21and pre-processing.
- 00:25:22So this core, sort of training job construct,
- 00:25:26is applicable both for creating new models from scratch,
- 00:25:29and also for general data transformation
- 00:25:32and general processing.
- 00:25:33So you'll pre-process your data sets,
- 00:25:36and then you'll optimize those data sets using
- 00:25:38your preferred data storage.
- 00:25:40We see a lot of customers using FSx for Lustre.
- 00:25:44This is because you can store your data in one place
- 00:25:46and then easily attach this volume to training job runs.
- 00:25:51So as you're iterating through different model sizes,
- 00:25:54and different infrastructure, and experimental choices,
- 00:25:58you can use and store your data in the same place.
- 00:26:03After this, customers will then need to develop and iterate
- 00:26:06over their training scripts.
- 00:26:08And the elasticity that you get with the infrastructure
- 00:26:11on SageMaker is beautiful.
- 00:26:13You can use and run tiny instances.
- 00:26:16So the T3 medium and the T2, that Werner shared with us
- 00:26:20this morning.
- 00:26:21So the T3 medium is a great choice for notebook instances,
- 00:26:25very cost effective, very small machine.
- 00:26:28And then you can scale that up with a click
- 00:26:30of a couple buttons, to a small GPU, for example,
- 00:26:34the G4 or the G5 series,
- 00:26:38which your teams can then develop on,
- 00:26:40and get the nuances working in their training loop.
- 00:26:44And then ultimately scale out in the same platform,
- 00:26:47in the same service, to hundreds and thousands of GPUs.
- 00:26:53And so that's that step from, that move from step four
- 00:26:56to step five, where you're developing and testing
- 00:26:59on increasingly larger instances,
- 00:27:02and then ultimately scaling up and using the massive
- 00:27:06training infrastructure that SageMaker provides.
- 00:27:10And then of course you'll evaluate the model artifact
- 00:27:13step by step, and the way that SageMaker holds onto
- 00:27:17the metadata, holds onto your scripts,
- 00:27:20holds onto your hyper parameters, stores all of your
- 00:27:23artifacts in S3, makes it so easy to just look up
- 00:27:27your previous work.
- 00:27:29So, I know if you're trying to capture an experiment
- 00:27:32that you ran six months ago,
- 00:27:34or even three years ago, as long as it was in AWS,
- 00:27:38then you can easily go look up the results of that job,
- 00:27:42capture some of the artifacts,
- 00:27:44and then run a new experiment.
- 00:27:46And so, at a high level, that's how you can pre-train
- 00:27:49foundation models on AWS.
- 00:27:53And again, all of this is possible because of the
- 00:27:56distributed training libraries that we provide
- 00:27:59on Amazon SageMaker.
- 00:28:00So these are capabilities that we've been building
- 00:28:03for many years.
- 00:28:04Including data parallel and model parallel
- 00:28:07distributed training libraries that give you efficiency
- 00:28:10and enhancements.
- 00:28:11So model parallel, is a way to distribute a neural network
- 00:28:16over multiple accelerators and GPUs,
- 00:28:19providing optimized performance.
- 00:28:21And then our data parallel package,
- 00:28:23will let you actually make copies of your model
- 00:28:26across a large cluster.
- 00:28:28And then we're delivering custom communication collectives
- 00:28:31actually, that are optimized for the AWS network topology,
- 00:28:35to save you up to 40% in the overall training time.
- 00:28:39And so this is after many years
- 00:28:41of innovation at this layer in the stack.
- 00:28:44And again, all of this is available through SageMaker.
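Enabling these libraries is typically a matter of passing a distribution argument to the estimator. The sketch below shows the data parallel library; the model parallel library is enabled through the same argument with its own parameters, which vary by version, so the values here are illustrative:

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    framework_version="2.0",
    py_version="py310",
    instance_type="ml.p4d.24xlarge",
    instance_count=8,
    # SageMaker distributed data parallel: replicate the model across the cluster
    # and use the AWS-optimized collectives for gradient exchange.
    # (Model parallelism is configured via the same argument, e.g.
    #  {"smdistributed": {"modelparallel": {...}}}, with version-specific parameters.)
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
```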
- 00:28:48And customers agree with us.
- 00:28:50So as you heard from Swami's keynote yesterday,
- 00:28:54Aravind Srinivas, CEO of Perplexity AI,
- 00:28:58is happily using SageMaker, and in particular
- 00:29:02the data and model parallel training libraries,
- 00:29:06again, to get that optimized performance
- 00:29:08in particular in the HyperPod mode.
- 00:29:13Another feature of SageMaker that I find really handy,
- 00:29:17is Warm Pools.
- 00:29:18And so, the training job API, again,
- 00:29:22is creating infrastructure when you train a model.
- 00:29:26So when you call model.fit, or when you run
- 00:29:29that Python training script,
- 00:29:31we actually turn on our instances at the same time.
- 00:29:34And so that call to create the cluster,
- 00:29:37and to execute the scripts, are coupled,
- 00:29:40they happen together.
- 00:29:41And now again, this is really useful for cost efficiency,
- 00:29:46so that when the job fails, because I forgot to point
- 00:29:50to the right Luster volume,
- 00:29:52that instance isn't sitting up there charging me money,
- 00:29:54right, it turns off.
- 00:29:56So it's extremely compute efficient.
- 00:29:58However, as a dev, that can be challenging,
- 00:30:01because I don't wanna wait eight minutes
- 00:30:03just to ship a new line of code.
- 00:30:06And so we launched, last year, our Warm Pools feature,
- 00:30:10that lets you run new jobs, using the same image,
- 00:30:14in seconds.
- 00:30:15And so as a developer it's extremely handy,
- 00:30:19because you can make just one, two, three line edits in your
- 00:30:22training script, and then just run the job in seconds.
- 00:30:25And so the Warm Pool feature is incredibly useful
- 00:30:28for developing with the SageMaker training API.
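A short sketch of how a warm pool is requested, assuming the account has the corresponding warm pool quota; the keep-alive window and instance settings are illustrative:

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    framework_version="2.0",
    py_version="py310",
    instance_type="ml.g5.12xlarge",
    instance_count=1,
    keep_alive_period_in_seconds=3600,   # keep the instances warm for an hour
)

estimator.fit({"training": "s3://my-bucket/data/"})
# Edit the training script and call fit() again within the keep-alive window:
# the next job reuses the warm instances and starts in seconds.
```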
- 00:30:33Another core feature of SageMaker,
- 00:30:36is the ability to use many different types of instances
- 00:30:39and have a lot of flexibility with the underlying
- 00:30:42infrastructure, where you're trying to run your scripts.
- 00:30:45One of these, of course, is custom accelerators from AWS.
- 00:30:49And so the Trainium and Inferentia capabilities
- 00:30:52are both available on SageMaker.
- 00:30:55And you're seeing a lot of cost performance,
- 00:30:59relative to comparable Amazon EC2 instances,
- 00:31:02up to 46% with Trainium one, relative to Llama two,
- 00:31:07and so you'll see even better performance with Trainium two,
- 00:31:10which was just recently announced.
- 00:31:12And so, in the demo today, actually, we're gonna take a look
- 00:31:15at Trainium on SageMaker.
- 00:31:19So what is this demo?
- 00:31:20So we're gonna be pre-training Llama two,
- 00:31:23of course I got a little visual for you.
- 00:31:25So this is a cartoon Llama with sunglasses
- 00:31:28on the Las Vegas strip from Bedrock.
- 00:31:30And so we're gonna pre-train a 7 billion parameter
- 00:31:33Llama on SageMaker.
- 00:31:36Now, why are we going to do this?
- 00:31:38Why is this a useful exercise?
- 00:31:40Again, this is assuming I have at least a few hundred
- 00:31:45gigabytes of custom data.
- 00:31:47So really a sizable data set
- 00:31:49of my own language data.
- 00:31:52And then again, this is knowledge
- 00:31:54that's not generally available online.
- 00:31:57And so it's my own proprietary data set.
- 00:32:00So it's knowledge that wouldn't generally be found
- 00:32:02in say, a Wikipedia archive.
- 00:32:05This then drives in-domain accuracy.
- 00:32:09And so this small model will be very surprisingly accurate
- 00:32:13within that domain.
- 00:32:15Again, it won't know everything under the sun,
- 00:32:17but it will have a surprising amount of accuracy,
- 00:32:21again, in that dataset, and in that domain
- 00:32:24where you're training it.
- 00:32:25This of course then drives, as I mentioned earlier,
- 00:32:28ownership, it drives flexibility, and then that lets you,
- 00:32:32again, use serverless hosting,
- 00:32:34and then ultimately cost reduction opportunities.
- 00:32:37And again, how are we gonna do this?
- 00:32:39So I have some example notebooks, I'm gonna walk
- 00:32:41through different instances, again, that T3 medium,
- 00:32:45the Trn1, optimized large scale data
- 00:32:48stored on FSx for Lustre, and then some Warm Pools,
- 00:32:51and then, again, that distributed training infrastructure.
- 00:32:54So let's check this out.
- 00:33:03All right, so here we are.
- 00:33:06I'm starting with the example notebook.
- 00:33:09So this is a publicly available GitHub repository
- 00:33:13that is using Neuron X,
- 00:33:15which is from the Amazon Annapurna ML team,
- 00:33:18and then NeMo Megatron actually, which is from Nvidia.
- 00:33:21So this is the Nvidia distributed training framework,
- 00:33:25NeMo Megatron.
- 00:33:26And then we're gonna be running that
- 00:33:27on Trainium accelerators.
- 00:33:31This also uses PyTorch and then the core
- 00:33:34torchrun framework.
- 00:33:39All right, so first I am running this on a notebook instance
- 00:33:44actually, so this is a SageMaker notebook instance.
- 00:33:47This is my sturdy M four instance.
- 00:33:51And it's handy because again, you can create a docker image.
- 00:33:56So what I'm gonna do here on my notebook instance,
- 00:33:58is I'm pointing to a deep learning container.
- 00:34:03So that's a fully managed deep learning container
- 00:34:06that we build at AWS, and track, and manage, and update,
- 00:34:10as the software frameworks change.
- 00:34:12We manage these deep learning containers,
- 00:34:14and then you can inherit our container in your docker files
- 00:34:19and build on top of it.
- 00:34:20And so it's an easy way for you to make sure
- 00:34:22that your scripts will work in the AWS framework.
- 00:34:25So, first I'm gonna build that docker image.
- 00:34:29Then I'm setting up FSx for Lustre for my optimized runs.
- 00:34:33I'm gonna prepare my data sets using tokenization
- 00:34:37on the SageMaker training API,
- 00:34:40converting the Hugging Face weights to the NeMo format.
- 00:34:43Then I'm gonna train Llama two.
- 00:34:46So again, just a really simple docker image,
- 00:34:49I'm just importing SageMaker,
- 00:34:54pointing to the deep learning container accounts,
- 00:34:57grabbing that image,
- 00:34:59and then this is a nice sh script that just builds
- 00:35:02a docker image.
- 00:35:03So I've got my docker image locally,
- 00:35:05and then this is pushed to,
- 00:35:07what's called the Elastic Container Registry,
- 00:35:09so another AWS service that just hosts your docker images.
- 00:35:13And so I'm pushing my docker image,
- 00:35:16which is on my notebook instance locally,
- 00:35:18up to ECR in AWS.
- 00:35:21And so that's this process.
- 00:35:23Great.
- 00:35:25So I have my docker image hosted.
- 00:35:28My second step is setting up FSx for Lustre.
- 00:35:32So this is using a CloudFormation template
- 00:35:35to deploy Lustre in my account.
- 00:35:38You can also do this through the console.
- 00:35:41Luster sets up a two-way data repository
- 00:35:44with your S3 bucket.
- 00:35:46So as long as you have data sitting in your S3 bucket,
- 00:35:49you can enable a two-way data repository.
- 00:35:52So data will be copied through the service.
- 00:35:56The metadata will be copied through the service,
- 00:35:58from the bucket, to Luster,
- 00:36:00and then as you write files to Luster,
- 00:36:04those are then copied back to S3.
- 00:36:06So you can download it from S3.
- 00:36:08You can also mount the Luster volume
- 00:36:10to your notebook instance, as long as the networking
- 00:36:14is aligned.
- 00:36:14And then you can view the contents of that volume directly
- 00:36:19from your notebook instance.
- 00:36:20So I'll set this up here.
- 00:36:22Again, just downloading the CloudFormation template,
- 00:36:26and then creating the stack in the relevant AZ.
- 00:36:31Once the stack has been created,
- 00:36:33then I'm gonna prep my dataset.
- 00:36:35And so my dataset, again, is for LLM training,
- 00:36:38using that docker image, using the SageMaker SDK,
- 00:36:42and creating a lot of hyper parameters.
- 00:36:45Some hyper parameters for the dataset pointer,
- 00:36:49the model pointer, some of my keys,
- 00:36:52I'm pointing to FSx for Lustre right here,
- 00:36:56and then setting up the file system input.
- 00:36:59And then as Gal mentioned, here is the PyTorch API.
- 00:37:03Now this is, again, a very complex example.
- 00:37:07If you are new to SageMaker, I would give you
- 00:37:09a really easy one. (chuckling)
- 00:37:10And so there are a lot of very accessible examples,
- 00:37:15where you can train a model in like two lines of code,
- 00:37:18basically, using Hugging Face, actually.
- 00:37:21But certainly the Python training scripts,
- 00:37:24you can bring your own packages and requirements.txt,
- 00:37:27and be on your way very, very quickly.
- 00:37:30This is a complex example,
- 00:37:31but we have very simple ones that you can use
- 00:37:33to get started.
- 00:37:34So in any case, I'm importing a pointer to that container.
- 00:37:39Actually, this is, this Python object is pointing
- 00:37:43to the deep learning containers for PyTorch.
- 00:37:46And then this API is letting me define my own
- 00:37:50pre-processing function right here, in the script.
- 00:37:54I'm pointing to a source directory,
- 00:37:57which can be a local file, can also be a Git repository.
- 00:38:01And then I'm defining my infrastructure right here.
- 00:38:05So this is one, Trainium actually, dot 32 xl.
- 00:38:10And so I'm running this job here.
- 00:38:13And again, that's my pre-processing dataset.
- 00:38:15And I'm gonna jump to Llama here,
- 00:38:18'cause we're a little short on time.
- 00:38:19And so, once I've pre-processed the data
- 00:38:22and stored it on Luster,
- 00:38:25which again, then will replicate back in S3,
- 00:38:27after that, I'm going to train my model
- 00:38:31on that same Luster volume.
- 00:38:32And Luster is useful because it mounts in seconds.
- 00:38:36So when you're working with large scale data sets,
- 00:38:38of course it can be very computationally and time intensive
- 00:38:41to copy them.
- 00:38:43And streaming can be a little bit challenging to set up.
- 00:38:46So Lustre is a nice way to store your data
- 00:38:49in a high performance file system,
- 00:38:52which you can then mount in seconds using as many jobs
- 00:38:56as you want to, because the bandwidth scales,
- 00:38:59actually, is a function of mounts.
- 00:39:01So Lustre is a great high performance data store.
- 00:39:03So in any case, we're gonna set this up.
- 00:39:07And then same as last time, pointing to all
- 00:39:10of my hyper parameters, specifying that I'm using,
- 00:39:14again, four instances.
- 00:39:16Each of these instances has 32 accelerators.
- 00:39:20Again, the Trainium accelerators.
- 00:39:22All of the hyper parameters for the 7 billion parameters
- 00:39:27that we're gonna train here, my FSX for Luster pointer,
- 00:39:32and then I'm gonna launch my training job.
- 00:39:34So, setting in the rest of my hyper parameters,
- 00:39:37again pointing to that PyTorch API, loading in my scripts,
- 00:39:42and then I call model.fits.
- 00:39:44And as promised, all of the content for this job,
- 00:39:49is loaded directly in the SageMaker control plane.
- 00:39:52So I see exactly when I started this, what the status was,
- 00:39:56where the artifacts are, when I ran this.
- 00:39:59I can view the outputs, I can step through the logs
- 00:40:03and see every piece of information that I need for my model,
- 00:40:07which then of course, I can download it and build
- 00:40:09an entire app on top of this.
- 00:40:12And so with that, I'm gonna hand over the stage
- 00:40:15to Tom, and he's gonna share some information
- 00:40:18with you about Toyota.
- 00:40:26- Great, thank you Emily.
- 00:40:29So today I am going to really tell you about
- 00:40:31how we're using SageMaker,
- 00:40:33to accelerate machine learning at TRI.
- 00:40:36And first, maybe I should tell you a little bit
- 00:40:38about what TRI actually is.
- 00:40:41I'll give you a couple of examples of projects
- 00:40:42that we have, ongoing right now.
- 00:40:46The first, is a project around autonomous drift driving,
- 00:40:56with a Toyota Supra.
- 00:40:57And I'll just let the video play here.
- 00:40:59(electronic music)
- 00:41:03(tires screeching)
- 00:41:07(indistinct chatter)
- 00:41:09(electronic music continues)
- 00:41:22(engine revving)
- 00:41:25(tires screeching)
- 00:41:29So that's one example.
- 00:41:31And of course AI, here, helps lay the foundation,
- 00:41:34for all of this work.
- 00:41:35The second is, we work on a lot of challenge problems.
- 00:41:38And so, there's a big robotics group
- 00:41:40that focuses on challenge problems,
- 00:41:44that we start in the lab, but also go out
- 00:41:46to the real world environment,
- 00:41:48and evaluate our systems in.
- 00:41:50And so, this is an example of where we have a robotic system
- 00:41:55that we built in-house, from the ground up,
- 00:41:57and we're able to retrieve and stock grocery store shelves.
- 00:42:03This has evolved more into the factory setting as well,
- 00:42:05more recently.
- 00:42:09And, we're 250 people across a few different locations.
- 00:42:12So there's a team in Los Altos
- 00:42:15and a team in Cambridge, Mass.
- 00:42:17And there's teams also in human-centered AI,
- 00:42:19and also material science as well.
- 00:42:23And most recently, one of the things about generative AI
- 00:42:25that we found anyway, in the context of robotics,
- 00:42:28is that it can now be applied to robotics,
- 00:42:31to do a wide variety of tasks
- 00:42:36that we never thought were possible.
- 00:42:38And this is a technique called diffusion policy,
- 00:42:41that is now able to learn from a few examples of a human,
- 00:42:45from a human, how to perform very complicated tasks.
- 00:42:50And so, building on this, the machine learning team at TRI
- 00:42:54tries to build a foundation across language, vision,
- 00:42:57and action.
- 00:42:58Language in the sense, that both common sense knowledge,
- 00:43:06and also a wider variety of applications.
- 00:43:09So like, language has applications across Toyota
- 00:43:12more generally, in the context of enterprise applications,
- 00:43:15but also in terms of code generation as well.
- 00:43:19Vision, feeds into language to give robots eyes,
- 00:43:22for example, and then action, to perform a wide variety
- 00:43:26of tasks across a number of different platforms.
- 00:43:30But, this talk is more about SageMaker. (chuckling)
- 00:43:33And so I wanna tell you about how we're using SageMaker,
- 00:43:36at TRI, to really accelerate our progress.
- 00:43:40And the first is sort of general experimentation.
- 00:43:43Where we use sort of, one to eight instances,
- 00:43:46to scale up our training jobs.
- 00:43:49And, the second, is how we can take some of these ideas
- 00:43:53and really scale this up very, very quickly.
- 00:43:56To not just a few GPUs, but to hundreds of GPUs at a time.
- 00:44:01And finally, we're also looking, we're also able
- 00:44:04to use SageMaker for even more broad applications,
- 00:44:06such as, like, just serving models.
- 00:44:09As these are hard to serve sort of locally,
- 00:44:11on a device.
- 00:44:15So lemme tell you a little bit about the experimentation
- 00:44:17that we do on SageMaker at TRI.
- 00:44:22First, you know, here's I guess,
- 00:44:24the high level, is that we have a wide variety
- 00:44:26of applications that we're, models that we're training
- 00:44:29on SageMaker.
- 00:44:30The first, is large language models.
- 00:44:34Second, is a mono depth model.
- 00:44:35So taking RGB images and inferring depth, for example.
- 00:44:39A third, is sort of stable diffusion.
- 00:44:42Language to image generation,
- 00:44:44for better feature representations.
- 00:44:46And a fourth, is 3D representations.
- 00:44:50Such as, language, to 3D sort of structures, as well.
- 00:44:53That's useful for robotics and a number
- 00:44:54of other applications.
- 00:44:57But, and across all of these we found,
- 00:45:00SageMaker to be very useful for a number of reasons.
- 00:45:03Some of the challenges that come up for us,
- 00:45:07include a few things.
- 00:45:09First, we wanna be able to reuse
- 00:45:11existing training infrastructure and clusters
- 00:45:13that we create.
- 00:45:15And so, you know, Warm Pools, that you heard about earlier,
- 00:45:19are one way in which to do that.
- 00:45:21And we take advantage of that on a daily basis,
- 00:45:24to pull back those resources
- 00:45:25and continue iterating on our training jobs.
- 00:45:29The second is scaling.
- 00:45:31We need to be able to go from one to many instances
- 00:45:33very quickly, and also to change instance types
- 00:45:36quickly.
- 00:45:40We also need high-performance systems,
- 00:45:41and SageMaker is very well optimized
- 00:45:46on the backend.
- 00:45:48And finally, we need flexibility:
- 00:45:50we need to run a number of different jobs
- 00:45:54across the whole science group.
- 00:45:59This is the code you saw earlier,
- 00:46:02and I'll echo
- 00:46:06how easy it is
- 00:46:08to scale these things up.
- 00:46:10You can start with one instance
- 00:46:11and iterate on your training job
- 00:46:14with that single instance.
- 00:46:16And if you need to scale, it's a very simple change
- 00:46:21to enable that:
- 00:46:23in this case you can change the instance count
- 00:46:26from one to eight, for example,
- 00:46:28and start scaling your runs very quickly.
- 00:46:32The second is that, as new hardware comes out,
- 00:46:34as Gal mentioned, we're able to
- 00:46:37quickly change the hardware types as well.
- 00:46:40In this case, we can change,
- 00:46:43say, from a p4 instance to a p5,
- 00:46:46which gives twice the throughput for our training jobs
- 00:46:49and reduces our training times.
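
To make the kind of change being described concrete, here is a minimal sketch, assuming the SageMaker Python SDK; the script name, role ARN, and data path are placeholders, and the values simply illustrate raising `instance_count` and swapping `instance_type`.

```python
# Minimal sketch: the same estimator, scaled out and moved to newer hardware.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                                 # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",    # placeholder execution role
    framework_version="2.0",
    py_version="py310",
    instance_count=8,                    # was 1 while iterating; raise it to scale out
    instance_type="ml.p5.48xlarge",      # was ml.p4d.24xlarge; swap as H100s become available
)

estimator.fit({"training": "s3://my-bucket/dataset/"})      # placeholder data location
```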
- 00:46:52And just to give some evidence of how
- 00:46:55performant these systems are: if you look at
- 00:46:59scaling across a number of instances,
- 00:47:01the scaling is almost linear.
- 00:47:04So SageMaker has been very performant for us
- 00:47:07in terms of scaling up our training jobs.
- 00:47:13As Emily mentioned earlier,
- 00:47:16these data sets are huge too.
- 00:47:18When we started, we were using
- 00:47:21data sets of a few terabytes,
- 00:47:25and it's nice to be able to quickly start up
- 00:47:28with FSx for Lustre.
- 00:47:30However, as we scaled our training jobs,
- 00:47:32the amount of data that we need
- 00:47:34grew from a few terabytes
- 00:47:36to half a petabyte or more,
- 00:47:39and the flexibility in SageMaker to pull in other resources,
- 00:47:44like WebDataset, has been really great
- 00:47:47and has really accelerated
- 00:47:49our training runs.
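
As one illustration of mixing data sources, here is a minimal sketch of attaching an FSx for Lustre file system as a training input channel via the SDK's `FileSystemInput`; the file system ID and directory path are placeholders, and the estimator is assumed to be configured as in the earlier sketches, with VPC settings that can reach the file system.

```python
# Minimal sketch: feed training data from FSx for Lustre instead of S3.
from sagemaker.inputs import FileSystemInput

fsx_input = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",    # placeholder FSx for Lustre file system ID
    file_system_type="FSxLustre",
    directory_path="/fsx/datasets/robotics",  # hypothetical mount-name/path on the file system
    file_system_access_mode="ro",             # read-only access is enough for training data
)

# The estimator must also be given subnets and security groups that can reach
# the file system; it is assumed to be defined as in the earlier sketches.
estimator.fit(inputs={"training": fsx_input})
```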
- 00:47:56So, just to reiterate,
- 00:47:59the group is running jobs on anywhere from
- 00:48:02one instance to eight instances,
- 00:48:04and these are a few of the applications.
- 00:48:07But going beyond this, we're also able to take
- 00:48:11our training runs
- 00:48:13to a much larger scale.
- 00:48:16Just to highlight one
- 00:48:18of the ways in which we're doing that at TRI,
- 00:48:20we're building state-of-the-art models,
- 00:48:23and the question is:
- 00:48:24can you build state-of-the-art
- 00:48:26LLMs with SageMaker?
- 00:48:28And
- 00:48:29we at TRI have been doing this.
- 00:48:31We've been reproducing some of the Llama 2
- 00:48:35models initially,
- 00:48:36to validate all of our systems.
- 00:48:41For this, we need
- 00:48:45scalability and performance across all of these instances,
- 00:48:48and what SageMaker has really provided for us
- 00:48:51is that scalability.
- 00:48:52This is on the newest hardware, the H100s,
- 00:48:55and we see roughly linear scaling as the
- 00:48:59number of nodes increases.
- 00:49:01This ends up being around 256
- 00:49:06H100s,
- 00:49:09and if you run out a training job like this,
- 00:49:12pre-training a Llama 2 model
- 00:49:15can take about a week
- 00:49:17when you scale out to 30 instances.
- 00:49:20This is just to say that with
- 00:49:22more than a trillion tokens,
- 00:49:25we can reproduce
- 00:49:28the state-of-the-art models here.
- 00:49:29And we're scaling not only the 7-billion-parameter
- 00:49:32models, but also 13, 34, and 70 billion, on SageMaker right now.
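
As a rough illustration, and not TRI's actual setup, here is a minimal sketch of how a multi-node pre-training job on H100 instances might be launched with the SageMaker Python SDK's `torch_distributed` launcher; the script, hyperparameters, and data location are placeholders.

```python
# Minimal sketch: a multi-node pre-training job across 256 H100 GPUs.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="pretrain_llm.py",                          # hypothetical pre-training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",    # placeholder execution role
    framework_version="2.0",
    py_version="py310",
    instance_count=32,                   # 32 x ml.p5.48xlarge = 256 H100 GPUs
    instance_type="ml.p5.48xlarge",
    # Launch one worker per GPU on every node via torchrun.
    distribution={"torch_distributed": {"enabled": True}},
    hyperparameters={"model_size": "7b"},                   # illustrative only
)

estimator.fit({"training": "s3://my-bucket/tokenized-corpus/"})  # placeholder corpus location
```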
- 00:49:40One of the nice features
- 00:49:43of SageMaker, so that you don't lose any time,
- 00:49:46has also been the cluster repair work.
- 00:49:47I think Gal mentioned this earlier.
- 00:49:50As you scale these jobs,
- 00:49:53it's often the case that hardware will fail.
- 00:49:56And when hardware fails, you have downtime.
- 00:49:59And if you have downtime,
- 00:50:01it costs you money:
- 00:50:03you're not training your models.
- 00:50:06One of the great parts of SageMaker is that
- 00:50:09it has this option for cluster repair.
- 00:50:11For us, this took about 10 minutes:
- 00:50:14one of the machines failed,
- 00:50:15the cluster came right back up, and we were able to
- 00:50:18continue our training run very quickly.
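
Cluster repair is most useful when the job checkpoints regularly so it can resume where it left off. Below is a minimal, hypothetical sketch of the script-side logic that pairs with the estimator's `checkpoint_s3_uri` and `checkpoint_local_path` options, which sync the local checkpoint directory to S3; the file names and save format are assumptions.

```python
# Minimal sketch of script-side checkpointing; SageMaker syncs CKPT_DIR to the
# S3 URI given as checkpoint_s3_uri and restores it when the cluster is repaired.
import glob
import os

import torch

CKPT_DIR = "/opt/ml/checkpoints"  # default checkpoint_local_path on SageMaker


def load_latest_checkpoint(model, optimizer):
    """Resume from the newest synced checkpoint, if any; return the step to resume at."""
    candidates = sorted(glob.glob(os.path.join(CKPT_DIR, "step_*.pt")))
    if not candidates:
        return 0  # fresh start
    state = torch.load(candidates[-1], map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]


def save_checkpoint(model, optimizer, step):
    """Write a checkpoint that SageMaker will upload to checkpoint_s3_uri."""
    os.makedirs(CKPT_DIR, exist_ok=True)
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
        os.path.join(CKPT_DIR, f"step_{step:08d}.pt"),
    )
```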
- 00:50:23So that's pre-training.
- 00:50:25The other thing is more
- 00:50:29on the side of up-training,
- 00:50:31where you have a large data set,
- 00:50:33but not quite the size you'd need
- 00:50:35for pre-training, and you want to
- 00:50:38focus on a particular domain.
- 00:50:39At TRI,
- 00:50:41because we're a Toyota-centric entity,
- 00:50:45Japanese was one of the areas that we were
- 00:50:48very interested in.
- 00:50:50You can take
- 00:50:52some of the state-of-the-art models,
- 00:50:53which aren't really trained for Japanese:
- 00:50:56they do have a little bit of
- 00:50:57Japanese training data,
- 00:50:58but not that much.
- 00:51:00If you go out and acquire all of,
- 00:51:02say, the open-source data available,
- 00:51:04you get to
- 00:51:0610 to a hundred billion tokens,
- 00:51:08which is enough to up-train a model
- 00:51:11in a language such as Japanese.
- 00:51:14And what we found is that, taking Llama 2
- 00:51:17with 13 billion parameters and up-training it,
- 00:51:22you gain some performance.
- 00:51:23This is a win-rate metric against some of the best
- 00:51:27closed-source models.
- 00:51:29But the next step is to instruction
- 00:51:32fine-tune the model. This is how you get
- 00:51:34large language models to follow instructions,
- 00:51:38to be chatty: you fine-tune them using
- 00:51:40instruction fine-tuning,
- 00:51:42with data of this type, where the instruction
- 00:51:44is in the first part
- 00:51:45and the second part is
- 00:51:48the response you would expect.
- 00:51:51And if you do that, with the additional pre-training
- 00:51:54and the additional instruction fine-tuning in Japanese
- 00:51:57on some of the more performant models out there,
- 00:51:59you can get state-of-the-art performance
- 00:52:01in Japanese.
- 00:52:03And this is a much smaller model compared to, say,
- 00:52:05a Llama 2 70B, yet still more performant, for example.
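
To make the data format concrete, here is a small, illustrative sketch of joining an instruction and its expected response into a single training sequence; the template and field names are assumptions, since the exact format used at TRI is not described in the talk.

```python
# Minimal sketch: turn an instruction/response pair into one training sequence.
INSTRUCTION_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"


def format_example(example: dict) -> str:
    """Join the instruction (first part) and the expected response (second part)."""
    return INSTRUCTION_TEMPLATE.format(
        instruction=example["instruction"],
        response=example["response"],
    )


# Hypothetical Japanese instruction-tuning example.
sample = {
    "instruction": "次の文章を一文で要約してください。",  # "Summarize the following text in one sentence."
    "response": "ここに期待される要約が入ります。",        # "The expected summary goes here."
}
print(format_example(sample))
```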
- 00:52:11And so SageMaker has really enabled us to do a lot
- 00:52:13of this experimentation very rapidly at TRI.
- 00:52:19The final thing I just wanna mention, which isn't covered
- 00:52:21as much in this talk, is that
- 00:52:24there's also the ability to run other sorts of workloads,
- 00:52:27such as serving models.
- 00:52:28We've been leveraging SageMaker endpoints to
- 00:52:33serve both open-source models
- 00:52:36and the models that we have in-house,
- 00:52:40internally across TRI and maybe eventually
- 00:52:42externally as well.
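
As a rough illustration of the serving side, here is a minimal sketch of deploying an open-source model to a real-time SageMaker endpoint with the SDK's `HuggingFaceModel`; the model ID, container versions, and instance type are placeholders, not TRI's actual deployment.

```python
# Minimal sketch: host an open-source model on a real-time SageMaker endpoint.
from sagemaker.huggingface import HuggingFaceModel

model = HuggingFaceModel(
    role="arn:aws:iam::123456789012:role/SageMakerRole",    # placeholder execution role
    env={"HF_MODEL_ID": "my-org/japanese-instruct-model"},  # hypothetical Hugging Face Hub ID
    transformers_version="4.28",                            # illustrative container versions
    pytorch_version="2.0",
    py_version="py310",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",      # placeholder GPU serving instance
)

print(predictor.predict({"inputs": "こんにちは、自己紹介してください。"}))
```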
- 00:52:46So with that, I just wanted to say
- 00:52:47there are three primary areas
- 00:52:51we're focused on in using SageMaker for:
- 00:52:55small-scale experiments, on
- 00:52:57one to eight nodes, for example;
- 00:52:59large-scale training,
- 00:53:00up to 32, 64, or more instances;
- 00:53:07as well as serving.
- 00:53:10SageMaker has been very critical
- 00:53:13and important for our training
- 00:53:18of this variety of models, and for experimentation generally.
- 00:53:24And I just wanted to close by saying
- 00:53:28it's been great working with SageMaker
- 00:53:31for training all of these models.
- 00:53:32Next time, hopefully when we come back
- 00:53:35to AWS re:Invent, maybe we will have a foundation model
- 00:53:41that can be trained once and
- 00:53:43do many different robotics tasks
- 00:53:46in response to language and other inputs as well.
- 00:53:50So with that, I'll end, and hand it back to Gal.
- 00:53:56- Thank you Tom.
- 00:53:58Yeah, oh.
- 00:53:59We just wanted to end by showing you a couple of links,
- 00:54:01QR codes to learn more about SageMaker and how to use it.
- 00:54:06And thank you all for your time.
- 00:54:08We'll all stand around here for a little bit longer
- 00:54:11if you have any questions.
- 00:54:12I actually think some members of the Smart Sifting team
- 00:54:14are also here, if you have questions about that
- 00:54:16and want to learn more.
- Machine Learning
- Amazon SageMaker
- Deep Learning
- Model Training
- Artificial Intelligence
- Fine-tuning
- Pre-training
- Robotics
- Data Management
- Distributed Training