Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Hadoop | Simplilearn
Summary
TLDR: This video tutorial introduces Hive, a data warehouse system that simplifies querying large datasets in the Hadoop ecosystem using HiveQL, a SQL-like language. It begins with a historical overview of Hive's development, from its origins at Facebook as part of its Hadoop solution to its widespread adoption. The architecture is explained, detailing key components such as the Hive clients, the JDBC/ODBC drivers, the Hive server and driver, and the Metastore. Data modeling concepts such as tables, partitions, and buckets are discussed, alongside Hive's data types. The tutorial then contrasts Hive with traditional RDBMSs, highlighting operational differences such as schema enforcement and scaling. The video concludes with a live demo of Hive commands and functionality in the Cloudera Hadoop setup.
Takeaways
- 📚 Hive simplifies querying large data sets with HQL!
- 🛠️ It originated from Facebook's needs to manage big data.
- 🏗️ Hive's architecture includes clients, servers, and a metastore.
- 📊 Data modeling in Hive involves tables, partitions, and buckets.
- ⚙️ Hive operates in local and MapReduce modes based on data size.
- 💾 Supports both primitive and complex data types for flexibility.
- 🏆 Hive is not a database; it's a data warehouse for analysis.
- 📈 Easily scalable at a lower cost compared to RDBMS.
- 💡 The Hive metastore manages metadata for efficient data retrieval.
- 📝 Hands-on demo shows how to execute Hive commands effectively.
Timeline
- 00:00:00 - 00:05:00
The tutorial introduces Hive, presented by Richard Kirschner from Simplilearn. It covers Hive's history, architecture, and features, plus a hands-on demo of Hive on the Cloudera Hadoop file system. It starts with the need for Hive: writing Java MapReduce code for data processing was complex, which led to the development of HiveQL, a SQL-like query language for querying large datasets.
- 00:05:00 - 00:10:00
Hive was developed at Facebook to manage substantial data using Hadoop. The tutorial explains that Hive uses a SQL-like language for ease of use, facilitating querying and analysis of large datasets stored in HDFS. It outlines Hive's role as a data warehouse which translates user queries into MapReduce tasks for execution.
- 00:10:00 - 00:15:00
The architecture of Hive is detailed, including components like the Hive client, Thrift applications, and the JDBC and ODBC drivers. It describes how the Hive server processes queries, the role of the Hive driver in compiling and executing tasks, and the Metastore that stores table information.
- 00:15:00 - 00:20:00
The data flow within Hive is elaborated, highlighting the interaction between the user interface, compiler, execution engine, and HDFS. The explanation covers how queries are executed and how metadata is managed, providing insights into efficient data retrieval processes.
- 00:20:00 - 00:25:00
Hive data modeling is explained, focusing on the structure of tables, partitions for grouping data, and buckets for efficient querying (see the sketch after this timeline). The importance of designing efficient schemas to improve query performance is emphasized.
- 00:25:00 - 00:30:00
Hive data types are categorized into primitive and complex types, mirroring SQL data types. Primitive types include numerical and string data, while complex types allow for the storage of arrays and maps, crucial for advanced data analytics.
- 00:30:00 - 00:35:00
Hive operates in two modes: local mode for small datasets on a single data node, and MapReduce mode for larger datasets across multiple data nodes, emphasizing the scalability of Hive for big data applications.
- 00:35:00 - 00:40:00
Differences between Hive and traditional RDBMS are outlined, including Hive's schema-on-read approach versus RDBMS's schema-on-write, and the capacity for handling petabytes of data in Hive as opposed to terabytes in RDBMS, showcasing Hive's efficiency in data warehousing.
- 00:40:00 - 00:45:20
The last segment discusses Hive's features, including the use of HiveQL, its SQL-like interface, simultaneous querying by multiple users, support for various data types, and the importance of scalability and cost-effectiveness in a big data context. It concludes with a live demo of HiveQL commands in the Cloudera environment.
Video Q&A
What is Hive?
Hive is a data warehouse system for querying and analyzing large datasets stored in the Hadoop file system (HDFS), using a SQL-like query language known as HiveQL or HQL.
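As a rough illustration of what an HQL query looks like (reusing the employee table created in the demo later in the video; this exact query is not from the video), Hive compiles statements like this into MapReduce jobs behind the scenes:

    SELECT department, COUNT(*) AS headcount, AVG(salary) AS avg_salary
    FROM employee
    GROUP BY department;   -- executed as one or more MapReduce jobs on the cluster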
How does Hive differ from traditional RDBMS?
Hive enforces schema on read, while RDBMS enforces schema on write. Hive is designed for large data in petabytes, whereas RDBMS typically manages data in terabytes.
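A small sketch of what schema on read means in practice (the file path and table name are hypothetical): Hive accepts the file as-is when it is loaded and only applies the declared column types when the data is read back, so a malformed field surfaces as NULL at query time rather than being rejected at write time as an RDBMS would do.

    CREATE TABLE visits (user_id INT, visit_ts STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- No validation happens here; the file is simply moved into the warehouse directory.
    LOAD DATA LOCAL INPATH '/tmp/visits.csv' INTO TABLE visits;

    -- Types are enforced now, at read time; a non-numeric user_id comes back as NULL.
    SELECT * FROM visits WHERE user_id IS NULL;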
What are the main modes of Hive?
Hive operates in local mode for small datasets on a single data node, and MapReduce mode for processing larger data sets across multiple data nodes.
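For reference, this behavior can be steered from the Hive shell with configuration properties; a hedged sketch (property names as documented for recent Hive releases, with threshold values that are only illustrative):

    -- Let Hive run small jobs locally and send larger ones to the cluster as MapReduce.
    SET hive.exec.mode.local.auto=true;

    -- Upper bounds for what counts as "small"; check the defaults for your Hive version.
    SET hive.exec.mode.local.auto.inputbytes.max=134217728;
    SET hive.exec.mode.local.auto.input.files.max=4;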
What is the Hive Metastore?
The Metastore is a repository for Hive metadata, which contains information about the structure of tables and their schemas.
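You normally do not query the Metastore directly; instead, HiveQL statements like the following read from it (a sketch reusing the office database and employee table created in the demo):

    SHOW DATABASES;                        -- database list comes from the Metastore
    SHOW TABLES IN office;                 -- so does the table list for a database
    DESCRIBE FORMATTED office.employee;    -- column types, HDFS location, table properties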
Can Hive handle complex data types?
Yes, Hive supports both primitive (e.g., integers, strings) and complex data types (e.g., arrays, maps, structs).
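A minimal sketch of a table declaring complex types (the column names are hypothetical); the delimiter clauses shown are the standard HiveQL options for delimited text files:

    CREATE TABLE employee_profile (
      name    STRING,
      skills  ARRAY<STRING>,                    -- e.g. sql|python
      ratings MAP<STRING, INT>,                 -- e.g. teamwork:5
      address STRUCT<city:STRING, zip:STRING>
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      COLLECTION ITEMS TERMINATED BY '|'
      MAP KEYS TERMINATED BY ':';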
- 00:00:03hello and welcome to hive tutorial my
- 00:00:06name is richard kirschner with the
- 00:00:07simply learn team that is
- 00:00:10www.simplylearn.com get certified get
- 00:00:13ahead what's in it for you today in our
- 00:00:15hive tutorial first we're going to start
- 00:00:17with the history of hive what is hive
- 00:00:20architecture of hive data flow and hive
- 00:00:23hive data modeling hive data types
- 00:00:26different modes of hive and difference
- 00:00:28between hive and rdbms finally we're
- 00:00:32going to look into the features of hive
- 00:00:34and do a quick hands-on demo on hive in
- 00:00:37the cloudera hadoop file system let's
- 00:00:39dive in with a brief history of hive so
- 00:00:42the history of hive begins with facebook
- 00:00:44facebook began using hadoop as a
- 00:00:46solution to handle the growing big data
- 00:00:49and we're not talking about data that
- 00:00:50fits on one or two or even five
- 00:00:52computers we're talking
- 00:00:55if you've looked at any of our other
- 00:00:56hadoop tutorials you'll know we're
- 00:00:58talking about very big data and data
- 00:01:00pools and facebook certainly has a lot
- 00:01:02of data it tracks as we know the hadoop
- 00:01:05uses mapreduce for processing data
- 00:01:08mapreduce required users to write long
- 00:01:10codes and so you'd have these really
- 00:01:12extensive java codes very complicated
- 00:01:14for the average person to use not all
- 00:01:16users were versed in java and other
- 00:01:18coding languages this proved to be a
- 00:01:20disadvantage for them users were
- 00:01:22comfortable with writing queries in sql
- 00:01:24sql has been around for a long time the
- 00:01:26standard
- 00:01:27sql query language hive was developed
- 00:01:30with the vision to incorporate the
- 00:01:32concepts of tables columns just like sql
- 00:01:35so why hive well the problem was for
- 00:01:38processing and analyzing data users
- 00:01:40found it difficult to code as not all of
- 00:01:42them were well versed with the coding
- 00:01:44languages you have your processing and
- 00:01:46analyzing and so the solution
- 00:01:48was a language similar to sql which
- 00:01:51was well known to all the users and thus
- 00:01:53the hive or hql language evolved what is
- 00:01:57hive hive is a data warehouse system
- 00:02:00which is used for querying and analyzing
- 00:02:02large data sets stored in the hdfs or
- 00:02:04the hadoop file system hive uses a query
- 00:02:07language that we call hive ql or hql
- 00:02:10which is similar to sql so if we take
- 00:02:13our user the user sends out their hive
- 00:02:16queries and then that is converted into
- 00:02:18a mapreduce tasks and then accesses the
- 00:02:21hadoop mapreduce system let's take a
- 00:02:23look at the architecture of hive
- 00:02:25architecture of hive we have the hive
- 00:02:27client
- 00:02:28so that could be the programmer or maybe
- 00:02:30it's a manager who knows enough sql to
- 00:02:32do a basic query to look up the data
- 00:02:34they need the hive client supports
- 00:02:36different types of client applications
- 00:02:38in different languages preferred for
- 00:02:40performing queries and so we have our
- 00:02:42thrift application in the hive thrift
- 00:02:44client thrift is a software framework
- 00:02:47hive server is based on thrift so it can
- 00:02:50serve the request from all programming
- 00:02:51language that support thrift and then we
- 00:02:54have our jdbc application and the hive
- 00:02:57jdbc driver jdbc java database
- 00:03:01connectivity jdbc application is
- 00:03:04connected through the jdbc driver and
- 00:03:06then you have the odbc application or
- 00:03:08the hive odbc driver the odbc or open
- 00:03:13database connectivity the odbc
- 00:03:15application is connected through the
- 00:03:17odbc driver with the growing development
- 00:03:20of all of our different scripting
- 00:03:21languages python c plus plus spar
- 00:03:24java you can find just about any
- 00:03:26connection in any of the main scripting
- 00:03:28languages and so we have our hive
- 00:03:30services as we look at deeper into the
- 00:03:33architecture hive supports various
- 00:03:35services
- 00:03:36so you have your hive server basically
- 00:03:38your thrift application or your hive
- 00:03:40thrift client or your jdbc or your hive
- 00:03:42jdbc driver your odbc application or
- 00:03:45your hive odbc driver they all connect
- 00:03:47into the hive server and you have your
- 00:03:49hive web interface you also have your
- 00:03:52cli now the hive web interface is a gui
- 00:03:55provided to execute hive queries and
- 00:03:58we'll actually be using that later on
- 00:04:00today so you can see kind of what that
- 00:04:02looks like and get a feel for what that
- 00:04:04means commands are executed directly in
- 00:04:06cli and then the cli is a direct
- 00:04:09terminal window and i'll also show you
- 00:04:12that too so you can see how those two
- 00:04:14different interfaces work these then
- 00:04:16push the code into the hive driver hive
- 00:04:18driver is responsible for all the
- 00:04:20queries submitted so everything goes
- 00:04:22through that driver let's take a closer
- 00:04:23look at the hive driver the hive driver
- 00:04:25now performs three steps internally one
- 00:04:28is a compiler hive driver passes query
- 00:04:31to compiler where it is checked and
- 00:04:32analyzed then the optimizer kicks in and
- 00:04:35the optimized logical plan in the form of
- 00:04:37a graph of mapreduce and hdfs tasks is
- 00:04:40obtained and then finally in the
- 00:04:42executor in the final step the tasks are
- 00:04:45executed we look at the architecture we
- 00:04:47also have to note the meta store
- 00:04:49metastore is a repository for hive
- 00:04:51metadata stores metadata for hive tables
- 00:04:54and you can think of this as your schema
- 00:04:56and where is it located and it's stored
- 00:04:58on the apache derby db processing and
- 00:05:00resource management is all handled by
- 00:05:02the mapreduce v1 you'll see mapreduce v2
- 00:05:06the yarn and the tez these are all
- 00:05:08different ways of managing these
- 00:05:10resources depending on what version of
- 00:05:11hadoop you're in hive uses mapreduce
- 00:05:13framework to process queries and then we
- 00:05:16have our distributed storage which is
- 00:05:18the hdfs and if you looked at our hadoop
- 00:05:21tutorials you'll know that these are on
- 00:05:23commodity machines and are linearly
- 00:05:25scalable that means they're very
- 00:05:27affordable a lot of time when you're
- 00:05:28talking about big data you're talking
- 00:05:30about a tenth of the price of storing it
- 00:05:32on enterprise computers and then we look
- 00:05:34at the data flow and hive
- 00:05:36so in our data flow and hive we have our
- 00:05:38hive in the hadoop system and underneath
- 00:05:40the user interface or the ui we have our
- 00:05:43driver our compiler our execution engine
- 00:05:45and our meta store that all goes into
- 00:05:47the mapreduce and the hadoop file system
- 00:05:50so when we execute a query you see it
- 00:05:52coming in here it goes into the driver
- 00:05:54step one step two we get a plan what are
- 00:05:56we going to do refers to the query
- 00:05:58execution uh then we go to the metadata
- 00:06:00it's like well what kind of metadata are
- 00:06:02we actually looking at where is this
- 00:06:03data located what is the schema on it
- 00:06:06then this comes back with the metadata
- 00:06:08into the compiler then the compiler
- 00:06:10takes all that information and sends the
- 00:06:13plan back to the driver the driver
- 00:06:15then sends the execute plan to the
- 00:06:17execution engine once it's in the
- 00:06:19execution engine the execution engine
- 00:06:22acts as a bridge between hive and hadoop
- 00:06:25to process the query and that's going
- 00:06:27into your mapreduce in your hadoop file
- 00:06:29system or your hdfs and then we come
- 00:06:32back with the metadata operations it
- 00:06:34goes back into the metastore to update
- 00:06:37or let it know what's going on which
- 00:06:38also goes between them it's a
- 00:06:41communication between the execution
- 00:06:42engine and the metastore the execution
- 00:06:44engine communicates
- 00:06:46bi-directionally with the metastore to
- 00:06:48perform operations like create drop
- 00:06:51tables metastore stores information
- 00:06:54about tables and columns so again we're
- 00:06:56talking about the schema of your
- 00:06:57database and once we have that we have a
- 00:06:59bi-directional
- 00:07:01send results communication back into the
- 00:07:03driver and then we have the fetch
- 00:07:05results which goes back to the client so
- 00:07:07let's take a little bit look at the hive
- 00:07:09data modeling hive data modeling so you
- 00:07:12have your high data modeling you have
- 00:07:13your tables you have your partitions and
- 00:07:15you have buckets the tables in hive are
- 00:07:18created the same way it is done in rdbms
- 00:07:21so when you're looking at your
- 00:07:22traditional sql server or mysql server
- 00:07:25where you might have enterprise
- 00:07:27equipment and a lot of
- 00:07:29people pulling and moving stuff off of
- 00:07:31there the tables are gonna look very
- 00:07:32similar and this makes it very easy to
- 00:07:34take that information and let's say you
- 00:07:36need to keep current information but you
- 00:07:39need to store all of your years of
- 00:07:41transactions back into the hadoop hive
- 00:07:44so you match those those all kind of
- 00:07:46look the same the tables are the same
- 00:07:48your databases look very similar and you
- 00:07:50can easily import them back you can
- 00:07:51easily store them into the hive system
- 00:07:54partitions here tables are organized
- 00:07:56into partitions for grouping same type
- 00:07:58of data based on partition key this can
- 00:08:01become very important for speeding up
- 00:08:04the process of doing queries so if
- 00:08:06you're looking at dates as far as like
- 00:08:08your employment dates of employees if
- 00:08:10that's what you're tracking you might
- 00:08:12add a partition there because that might
- 00:08:13be one of the key things that you're
- 00:08:14always looking up as far as employees
- 00:08:16are concerned and finally we have
- 00:08:18buckets uh data present in partitions
- 00:08:20can be further divided into buckets for
- 00:08:22efficient querying again there's that
- 00:08:24efficiency at this level a lot of times
- 00:08:27you're taught you're working with the
- 00:08:29programmer and the admin of your hadoop
- 00:08:32file system to maximize the efficiency
- 00:08:35of that file system so it's usually a
- 00:08:37two-person job and we're talking about
- 00:08:39hive data modeling you want to make sure
- 00:08:41that they work together and you're
- 00:08:43maximizing your resources hive data
- 00:08:46types so we're talking about hive data
- 00:08:48types we have our primitive data types
- 00:08:50and our complex data types a lot of this
- 00:08:53will look familiar because it mirrors a
- 00:08:55lot of stuff in sql in our primitive
- 00:08:57data types we have the numerical data
- 00:08:59types string data type date time data
- 00:09:02type and
- 00:09:03miscellaneous data type and these should
- 00:09:05be very they're kind of self-explanatory
- 00:09:07but just in case numerical data is your
- 00:09:10floats your integers your short integers
- 00:09:12all of that numerical data comes in as a
- 00:09:14number a string of course is characters
- 00:09:17and numbers and then you have your date
- 00:09:19time stamp and then we have kind of a
- 00:09:21general way of pulling your own created
- 00:09:23data types in there that's your
- 00:09:25miscellaneous data type and we have
- 00:09:27complex data types so you can store
- 00:09:29arrays you can store maps you can store
- 00:09:32structures and even unions in there as
- 00:09:34we dig into hive data types
- 00:09:37and we have the primitive data types and
- 00:09:39the complex data types so we look at
- 00:09:41primitive data types and we're looking
- 00:09:42at numeric data types data types like an
- 00:09:45integer a float a decimal those are all
- 00:09:47stored as numbers in the hive data
- 00:09:50system a string data type data types
- 00:09:52like characters and strings you store
- 00:09:54the name of the person you're working
- 00:09:55with uh you know john doe the city
- 00:09:58memphis the state tennessee maybe it's
- 00:10:02boulder colorado usa or maybe it's
- 00:10:05hyderabad
- 00:10:06india that's all going to be string and
- 00:10:08stored as a string character and of
- 00:10:10course we have our date time data type
- 00:10:12data types like timestamp date interval
- 00:10:15those are very common as far as tracking
- 00:10:18sales anything like that you just think
- 00:10:19if you can type a stamp of time on it or
- 00:10:22maybe you're dealing with a race and you
- 00:10:23want to know the interval how long did
- 00:10:24the person take to complete whatever
- 00:10:26task it was all that is date time data
- 00:10:29type and then we talk miscellaneous data
- 00:10:31type these are like boolean and binary
- 00:10:34and when you get into boolean and binary
- 00:10:35you can actually almost create anything
- 00:10:37in there but your yes/nos zero/one now
- 00:10:39let's take a look at complex data types
- 00:10:41a little closer uh we have arrays so
- 00:10:44your syntax is of data type and it's an
- 00:10:46array and you can just think of an array
- 00:10:48as a collection of same
- 00:10:51entities one two three four if they're
- 00:10:53all numbers and you have maps this is a
- 00:10:55collection of key value pairs
- 00:10:58so understanding maps is so central to
- 00:11:01hadoop uh so when we store maps you have
- 00:11:03a key which is a set you can only have
- 00:11:05one key per mapped value and so you in
- 00:11:08hadoop of course you collect uh the same
- 00:11:10keys and you can add them all up or do
- 00:11:12something with all the contents of the
- 00:11:13same key but this is our map as a
- 00:11:16complex data type in our
- 00:11:18collection of key value pairs and then
- 00:11:20collection of complex data with comment
- 00:11:23so we can have a structure we have a
- 00:11:24column name data type comment call a
- 00:11:27column comment so you can get very
- 00:11:29complicated structures in here with your
- 00:11:31collection of data and your commented
- 00:11:33setup and then we have unions and this is
- 00:11:35a collection of heterogeneous data types
- 00:11:38so the syntax for this is union type
- 00:11:40data type data type and so on this is a
- 00:11:43little bit
- 00:11:44different than the arrays in that you can
- 00:11:46actually mix and match different modes
- 00:11:48of hive hive operates in two modes
- 00:11:51depending on the number and size of data
- 00:11:53nodes we have our local mode and our map
- 00:11:56reduce mode when we talk about the local
- 00:11:58mode it is used when hadoop is having
- 00:12:00one data node and the data is small
- 00:12:03processing will be very fast on a
- 00:12:04smaller data sets which are present in
- 00:12:06local machine and this might be that you
- 00:12:08have a local file stuff you're uploading
- 00:12:11into the hive and you need to do some
- 00:12:13processes in there you can go ahead and
- 00:12:14run those high processes and queries on
- 00:12:17it usually you don't see much in the way
- 00:12:19of a single node hadoop system if you're
- 00:12:21going to do that you might as well just
- 00:12:22use like an sql database or even a java
- 00:12:26sqlite or something python sqlite so you
- 00:12:29don't really see a lot of single node
- 00:12:31hadoop databases but you do see the
- 00:12:33local mode in hive where you're working
- 00:12:35with a small amount of data that's going
- 00:12:37to be integrated into the larger
- 00:12:39database and then we have the map reduce
- 00:12:41mode this is used when hadoop is having
- 00:12:44multiple data nodes and the data is
- 00:12:45spread across various data nodes
- 00:12:47processing large datasets can be more
- 00:12:49efficient using this mode and this you
- 00:12:51can think of instead of it being one two
- 00:12:53three or even five computers we're
- 00:12:56usually talking with the hadoop file
- 00:12:58system we're looking at 10 computers 15
- 00:13:01100 where this data is spread across all
- 00:13:03those different hadoop nodes difference
- 00:13:05between hive and
- 00:13:07rdbms remember rdbms stands for the
- 00:13:10relational database management system
- 00:13:13let's take a look at the difference
- 00:13:14between hive and the rdbms with hive
- 00:13:18hive enforces schema on read and it's
- 00:13:21very important that whatever is coming
- 00:13:23in that's when hive's looking at it and
- 00:13:24making sure that it fits the model
- 00:13:27the rdbms enforces a schema when it
- 00:13:30actually writes the data into the
- 00:13:31database so it's read the data and then
- 00:13:33once it starts to write it that's where
- 00:13:35it's going to give you the error or tell
- 00:13:36you something's incorrect about your
- 00:13:38schema hive data size is in petabytes
- 00:13:41that is hard to imagine you know we're
- 00:13:43looking at your personal computer on
- 00:13:45your desk maybe you have 10 terabytes if
- 00:13:47it's a high-end computer but we're
- 00:13:49talking petabytes so that's hundreds of
- 00:13:51computers grouped together whereas rdbms
- 00:13:54data size is in terabytes very rarely do
- 00:13:57you see an rdbms system that's spread
- 00:14:00over more than five computers and
- 00:14:02there's a lot of reasons for that with
- 00:14:04the rdbms it actually has a high end
- 00:14:07amount of writes to the hard drive
- 00:14:09there's a lot more going on there you're
- 00:14:10writing and polling stuff so you really
- 00:14:12don't want to get too big with an rdbms
- 00:14:14or you're gonna run into a lot of
- 00:14:15problems with hive you can take it as
- 00:14:17big as you want hive is based on the
- 00:14:19notion of write once and read many times
- 00:14:23this is so important and they call it
- 00:14:25worm which is write once
- 00:14:28read many they refer to it
- 00:14:31as worm and that's true of any of a lot
- 00:14:32of your hadoop setup it's it's altered a
- 00:14:35little bit but in general we're looking
- 00:14:36at archiving data that you want to do
- 00:14:38data analysis on we're looking at
- 00:14:40pulling all that stuff off your rdbms
- 00:14:42from years and years and years of
- 00:14:44business or whatever your company does
- 00:14:46or scientific research and putting that
- 00:14:48into a huge data pool so that you can
- 00:14:50now do queries on it and get that
- 00:14:52information out of it with the rdbms
- 00:14:55it's based on the notion of read and
- 00:14:56write many times
- 00:14:58so you're continually updating this
- 00:14:59database you're continually bringing up
- 00:15:01new stuff new sales
- 00:15:03the account changes because they have a
- 00:15:06different licensing now whatever
- 00:15:07software you're selling all that kind of
- 00:15:09stuff where the data is continually
- 00:15:10fluctuating and then hive resembles a
- 00:15:12traditional database by supporting sql
- 00:15:15but it is not a database it is a data
- 00:15:18warehouse this is very important it goes
- 00:15:20with all the other stuff we've talked
- 00:15:21about that we're not looking at a
- 00:15:23database but a data warehouse to store
- 00:15:25the data and still have fast and easy
- 00:15:28access to it for doing queries you can
- 00:15:31think of
- 00:15:32twitter and facebook they have so many
- 00:15:35posts that are archived back
- 00:15:36historically those posts aren't going to
- 00:15:38change they made the post they're posted
- 00:15:40they're there and they're in their
- 00:15:41database but they have to store it in a
- 00:15:42warehouse in case they want to pull it
- 00:15:44back up with the rdbms it's a type of
- 00:15:47database management system which is
- 00:15:49based on the relational model of data
- 00:15:51and then with hive easily scalable at a
- 00:15:54low cost again we're talking maybe a
- 00:15:56thousand dollars per terabyte um the
- 00:15:58rdbms is not scalable at a low cost when
- 00:16:02you first start on the lower end you're
- 00:16:03talking about 10 000 per terabyte of
- 00:16:06data including all the backup on the
- 00:16:08models and all the added necessities to
- 00:16:10support it as you scale it up you have
- 00:16:13to scale those computers and hardware up
- 00:16:15so you might start off with a basic
- 00:16:17server and then you upgrade to a sun
- 00:16:20computer to run it and you spend you
- 00:16:22know tens of thousands of dollars for
- 00:16:23that hardware upgrade with hive you just
- 00:16:25put another computer into your hadoop
- 00:16:28file system so let's look at some of the
- 00:16:29features of hive
- 00:16:31when we're looking at the features of
- 00:16:32hive we're talking about the use of sql
- 00:16:35like language called hive ql a lot of
- 00:16:37times you'll see that as hql which is
- 00:16:39easier than long codes this is nice if
- 00:16:42you're working with your shareholders
- 00:16:44you come to them and you say hey you can
- 00:16:46do a basic sql query on here and pull up
- 00:16:48the information you need this way you
- 00:16:50don't have to take off have your
- 00:16:51programmers jump in every time they want
- 00:16:53to look up something in the database
- 00:16:55they actually now can easily do that if
- 00:16:56they're not
- 00:16:57skilled in programming and script
- 00:17:00writing tables are used which are
- 00:17:01similar to the rdbms hence easier to
- 00:17:04understand and one of the things i like
- 00:17:06about this is when i'm bringing tables
- 00:17:08in from a mysql server or sql server
- 00:17:10there's almost a direct reflection
- 00:17:12between the two so when you're looking
- 00:17:13at one which is the data which is
- 00:17:15continually changing and then you're
- 00:17:16going into the archive database it's not
- 00:17:18this huge jump where you have to learn a
- 00:17:20whole new language
- 00:17:22you mirror that same schema into the
- 00:17:24hdfs into the hive making it very easy
- 00:17:27to go between the two and then using
- 00:17:29hive ql multiple users can
- 00:17:31simultaneously query data so again you
- 00:17:34have multiple clients in there and they
- 00:17:36send in their query that's also true
- 00:17:37with the rdbms which kind of cues them
- 00:17:40up because it's running so fast you
- 00:17:41don't notice the lag time well you get
- 00:17:43that also with the hql as you add more
- 00:17:46computers and query can go very quickly
- 00:17:48depending on how many computers and how
- 00:17:50much resources each machine has to pull
- 00:17:52the information and hive supports a
- 00:17:55variety of data types
- 00:17:57so with hive it's designed to be on the
- 00:18:00hadoop system which you can put almost
- 00:18:02anything into the hadoop file system so
- 00:18:04with all that let's take a look at a
- 00:18:07demo on hive ql or hql before i dive
- 00:18:11into the hands-on demo let's take a look
- 00:18:14at the website hive.apache.org
- 00:18:17that's the main website since apache
- 00:18:20it's an apache open source
- 00:18:22software this is the main software for
- 00:18:24the main site for the build and if you
- 00:18:26go in here you'll see that they're
- 00:18:27slowly migrating hive into beehive and
- 00:18:30so if you see beehive versus hive note
- 00:18:32the beehive as the new release is coming
- 00:18:34out that's all it is it reflects a lot
- 00:18:36of the same functionality of hive it's
- 00:18:38the same thing and then we like to pull
- 00:18:40up some kind of documentation on
- 00:18:43commands and for this i'm actually going
- 00:18:45to go to hortonworks hive cheat sheet
- 00:18:48and that's because hortonworks and
- 00:18:50cloudera are two of the most commonly used
- 00:18:53builds for hadoop both of which include
- 00:18:57hive and all the different tools in
- 00:18:58there and so hortonworks has a pretty
- 00:19:00good pdf you can download cheat sheet on
- 00:19:03there i believe cloudera does too but
- 00:19:04we'll go ahead and just look at the
- 00:19:06horton one because it's the one that
- 00:19:07comes up really good and you can see
- 00:19:08when we look at the query language it
- 00:19:10compares mysql server to hive ql or hql
- 00:19:14and you can see the basic select we
- 00:19:16select from columns from table where
- 00:19:19conditions exist the most basic command
- 00:19:22on there and they have different things
- 00:19:23you can do with it just like you do with
- 00:19:25your sql and if you scroll down you'll
- 00:19:28see
- 00:19:28data types so here's your integer your
- 00:19:30float your binary double string timestamp
- 00:19:33and all the different data types you can
- 00:19:34use some different semantics different
- 00:19:37keys features functions
- 00:19:40for running a hive query command line
- 00:19:42setup and of course a hive shell uh set
- 00:19:45up in here so you can see right here if
- 00:19:47we loop through it has a lot of your
- 00:19:48basic stuff and it is we're basically
- 00:19:50looking at sql across a horton database
- 00:19:53we're going to go ahead and run our
- 00:19:55hadoop cluster hive demo and i'm going
- 00:19:58to go ahead and use the cloudera quick
- 00:20:00start this is in the virtual box so
- 00:20:03again we have an oracle virtual box
- 00:20:05which is open source and then we have
- 00:20:08our cloudera quickstart which is the
- 00:20:10hadoop setup on a single node now
- 00:20:12obviously hadoop and hive are designed
- 00:20:15to run across a cluster of computers so
- 00:20:17we talk about a single node is for
- 00:20:19education testing that kind of thing and
- 00:20:21if you have a chance you can always go
- 00:20:23back and look at our demo we had on
- 00:20:27setting up a hadoop system in a single
- 00:20:29cluster just set a note down below in
- 00:20:31the youtube video and our team will get
- 00:20:33in contact with you and send you that
- 00:20:35link if you don't already have it or you
- 00:20:36can contact us at the
- 00:20:39www.simplylearn.com now in here it's
- 00:20:40always important to note that you do
- 00:20:42need
- 00:20:43on your computer if you're running on
- 00:20:45windows because i'm on a windows machine
- 00:20:47you're going to need probably about 12
- 00:20:49gigabytes to actually run this it used
- 00:20:51to be you could get by with a lot less but as
- 00:20:52things have evolved they take up more
- 00:20:54and more resources and you need the
- 00:20:56professional version if you have the
- 00:20:58home version i was able to get that to
- 00:21:00run but boy did it take a lot of extra
- 00:21:02work to get the home version to let me
- 00:21:05use the virtual setup on there and we'll
- 00:21:07simply click on the cloudera quick start
- 00:21:09and i'm going to go and just start that
- 00:21:10up and this is starting up our linux so
- 00:21:13we have our windows 10 which is a
- 00:21:14computer i'm on and then i have the
- 00:21:17virtual box which is going to have a
- 00:21:18linux operating system in it and we'll
- 00:21:20skip ahead so you don't have to watch
- 00:21:22the whole install something interesting
- 00:21:24to know about the cloudera is that it's
- 00:21:27running on linux centos and for whatever
- 00:21:29reason i've always had to click on it
- 00:21:32and hit the escape button for it to spin
- 00:21:35up and then you'll see the dos come in
- 00:21:37here now that our cloudera spun up on
- 00:21:39our virtual machine with the linux on we
- 00:21:42can see here we have our it uses the
- 00:21:45thunderbird browser on here by default
- 00:21:47and automatically opens up a number of
- 00:21:49different tabs for us and a quick note
- 00:21:51cause i mentioned like the restrictions
- 00:21:53on getting set up on your own computer
- 00:21:55if you have a home edition computer and
- 00:21:57you're worried about setting it up on
- 00:21:59there you can also go in there and spin
- 00:22:01up a one month free service on amazon
- 00:22:04web service to play with this so there's
- 00:22:06other options you're not stuck with just
- 00:22:08doing it on the quick start menu you can
- 00:22:10spin this up in many other ways now the
- 00:22:12first thing we want to note is that
- 00:22:13we've come in here into cloudera and i'm
- 00:22:15going to access this in two ways uh the
- 00:22:18first one is we're going to use hue and
- 00:22:20i'm going to open up hue and i'll take
- 00:22:22it a moment to load from the setup on
- 00:22:24here and hue is nice if i go in and use
- 00:22:28hue as an editor into hive or into the
- 00:22:31hadoop setup usually i'm doing it as a
- 00:22:34from an admin side because it has a lot
- 00:22:37more information a lot of visuals less
- 00:22:39to do with you know actually diving in
- 00:22:41there and just executing code and you
- 00:22:43can also write this code into files and
- 00:22:45scripts and there's other things you can
- 00:22:47otherwise you can upload it into hive
- 00:22:49but today we're going to look at the
- 00:22:50command lines and we'll upload it into
- 00:22:52hue and then we'll go into and actually
- 00:22:54do our work in a terminal window under
- 00:22:56the hive shell now in the hue browser
- 00:22:59window if you go under query and click
- 00:23:00on the pull down menu and then you go
- 00:23:02under editor and you'll see hive there
- 00:23:04we go there's our hive setup i go and
- 00:23:06click on hive and this will open up our
- 00:23:08query down here and now it has a nice
- 00:23:10little b that shows our hive going and
- 00:23:12we can go something very simple down
- 00:23:14here like show
- 00:23:16databases and we follow it with the
- 00:23:18semicolon and that's the standard in
- 00:23:21hive is you always add our
- 00:23:23punctuation at the end there and i'll go
- 00:23:25ahead and run this and the query will
- 00:23:26show up underneath and you'll see down
- 00:23:28here since this is a new quick start i
- 00:23:30just put on here you'll see it has the
- 00:23:32default down here for the databases
- 00:23:35that's the database name i haven't
- 00:23:37actually created any databases on here
- 00:23:38and then there's a lot of other like
- 00:23:40assistant function tables
- 00:23:42your databases up here there's all kinds
- 00:23:45of things you can research you can look
- 00:23:46at through hue as far as a bigger
- 00:23:49picture the downside of this is it
- 00:23:51always seems to lag for me whenever i'm
- 00:23:53doing this i always seem to run slow so
- 00:23:55if you're in cloudera you can open up a
- 00:23:57terminal window they actually have an
- 00:23:59icon at the top you can also go under
- 00:24:01applications and under applications
- 00:24:03system tools and terminal either one
- 00:24:05will work it's just a regular terminal
- 00:24:07window and this terminal window is now
- 00:24:09running underneath our linux so this is
- 00:24:11a linux terminal window or on our
- 00:24:13virtual machine which is resting on our
- 00:24:16regular windows 10 machine and we'll go
- 00:24:18ahead and zoom this in so you can see
- 00:24:19the text better on your own video and i
- 00:24:22simply just clicked on view and zoom in
- 00:24:24and then all we have to do is type in
- 00:24:26hive and this will open up the shell on
- 00:24:28here and it takes it just a moment to
- 00:24:30load when starting up hive i also want
- 00:24:33to note that depending on your rights on
- 00:24:36the computer you're on in your action
- 00:24:37you might have to do sudo hive and put
- 00:24:39in your password and username most
- 00:24:41computers are usually set up with the
- 00:24:43hive login again it just depends on how
- 00:24:45you're accessing the linux system and
- 00:24:47the hive shell once we're in here we can
- 00:24:49go ahead and do a simple uh hql command
- 00:24:52show databases and if we do that we'll
- 00:24:55see here that we don't have any
- 00:24:56databases so we can go ahead and create
- 00:24:58a database and we'll just call it office
- 00:25:01for today for this moment now if i do
- 00:25:03show we'll just do the up arrow up arrow
- 00:25:06is a hotkey that works in both linux and
- 00:25:08in hive so i can go back and page
- 00:25:10through all the commands i've typed in
- 00:25:11and we can see now that i have my
- 00:25:14there's of course a default database and
- 00:25:16then there's the office database so now
- 00:25:18we've created a database it's pretty
- 00:25:19quick and easy and we can go ahead and
- 00:25:21drop the database we can do drop
- 00:25:23database
- 00:25:24office now this will work on this
- 00:25:26database because it's empty if your
- 00:25:28database was not empty you would have to
- 00:25:30do cascade and that drops all the tables
- 00:25:33in the database and the database itself
- 00:25:36now if we do show database and we'll go
- 00:25:39ahead and recreate our database because
- 00:25:40we're going to use the office database
- 00:25:42for the rest of this hands-on demo a
- 00:25:45really handy command to now
- 00:25:47set with the sql or hql is to use office
- 00:25:51and what that does is that sets office
- 00:25:54as a default database so instead of
- 00:25:56having to reference the database every
- 00:25:59time we work with a table it now
- 00:26:01automatically assumes that's the
- 00:26:02database being used whatever tables
- 00:26:04we're working on the difference is you
- 00:26:06put the database name period table and
- 00:26:08i'll show you in just a minute what that
- 00:26:09looks like and how that's different if
- 00:26:11we're going to have a table and a
- 00:26:13database we should probably load some
- 00:26:14data into it so let me go ahead and
- 00:26:16switch gears here and open up a terminal
- 00:26:19window you can just open another
- 00:26:20terminal window and it'll open up right
- 00:26:21on top of the one that you have hive
- 00:26:23shell running in and when we're in this
- 00:26:25terminal window first we're going to go
- 00:26:27ahead and just do a list which is of
- 00:26:28course a linux command you can see all
- 00:26:30the files i have in here this is the
- 00:26:32default load we can change directory to
- 00:26:34documents we can list in documents and
- 00:26:38we're actually going to be looking at
- 00:26:40employee.csv a linux command is the cat
- 00:26:44you can use this actually to combine
- 00:26:45documents there's all kinds of things
- 00:26:46that cat does but if we want to just
- 00:26:48display the contents of our employee.csv
- 00:26:52file we can simply do cat employee csv
- 00:26:55and when we're looking at this we want
- 00:26:57to know a couple things one there's a
- 00:27:00line at the top okay so the very first
- 00:27:02thing we notice is that we have a header
- 00:27:04line the next thing we notice is that
- 00:27:06the data is comma separated and in this
- 00:27:09particular case you'll see a space here
- 00:27:12generally with these you've got to be
- 00:27:13real careful with spaces there's all
- 00:27:15kinds of things you've got to watch out
- 00:27:16for because they can cause issues these
- 00:27:18bases won't because these are all
- 00:27:20strings that the space is connected to
- 00:27:22if this was a space next to the integer
- 00:27:24you would get a null value that comes
- 00:27:26into the database without doing
- 00:27:27something extra in there now with most
- 00:27:29of hadoop that's important to know that
- 00:27:31you're writing the data once reading it
- 00:27:33many times and that's true of almost all
- 00:27:36your hadoop things coming in so you
- 00:27:38really want to process the data before
- 00:27:40it gets into the database and for those
- 00:27:43who of you have studied data
- 00:27:45transformation that's the etl where you
- 00:27:47extract transform and then load the
- 00:27:51data so you really want to extract and
- 00:27:53transform before putting it into the
- 00:27:54hive then you load it into the hive with
- 00:27:56the transformed data and of course we
- 00:27:58also want to note the schema we have an
- 00:28:00integer string string integer integer so
- 00:28:03we kept it pretty simple in here as far
- 00:28:04as the way the data is set up the last
- 00:28:06thing that you're going to want to look
- 00:28:07up
- 00:28:08is the source since we're doing local
- 00:28:11uploads we want to know what the path is
- 00:28:13we have the whole path in this case it's
- 00:28:15home slash cloudera slash documents and
- 00:28:18these are just text documents we're
- 00:28:20working with right now we're not doing
- 00:28:21anything fancy so we can do a simple
- 00:28:24gedit employee.csv
- 00:28:26and you'll see it comes up here it's
- 00:28:28just a text document so i can easily
- 00:28:30remove these added spaces there we go
- 00:28:33and then we go and just save it and so
- 00:28:35now it has a new setup in there we've
- 00:28:36edited it the g edit is usually one of
- 00:28:39the default that loads into linux so any
- 00:28:42text editor will do back to the hive
- 00:28:44shell so let's go ahead and create a
- 00:28:46table employee and what i want you to
- 00:28:48note here is i did not put the semicolon
- 00:28:51on the end here semicolon tells it to
- 00:28:53execute that line so this is kind of
- 00:28:55nice if you're you can actually just
- 00:28:57paste it in if you have it written on
- 00:28:58another sheet and you can see right here
- 00:29:00where i have create table employee and
- 00:29:02it goes into the next line on there so i
- 00:29:04can do all of my commands at once now
- 00:29:07just so i don't have any typo errors i
- 00:29:08went ahead and just pasted the next
- 00:29:10three lines in and the next one is our
- 00:29:13schema if you remember correctly from
- 00:29:15the other side we had the different
- 00:29:17values in here which was id
- 00:29:19name department year of joining and
- 00:29:22salary and the id is an integer name is
- 00:29:25a string department string year of joining
- 00:29:26an integer salary an integer and they're in
- 00:29:29brackets we put close brackets around
- 00:29:31them and you could do this all as one
- 00:29:32line and then we have row format
- 00:29:34delimited fields terminated by comma and
- 00:29:37this is important because the default is
- 00:29:40tabs so if i do it now it won't find any
- 00:29:42terminated fields so you'll get a bunch
- 00:29:44of null values loaded into your table
- 00:29:47and then finally our table properties we
- 00:29:49want to skip the header line count
- 00:29:51equals 1. now this is a lot of work for
- 00:29:54uploading a single file it's kind of
- 00:29:56goofy when you're uploading a single
- 00:29:57file that you have to put all this in
- 00:29:59here but keep in mind hive and hadoop is
- 00:30:02designed for writing many files into the
- 00:30:05database you write them all in there and
- 00:30:06then you can they're saved it's an
- 00:30:08archive it's a data warehouse and then
- 00:30:10you're able to do all your queries on
- 00:30:11them so a lot of times we're not looking
- 00:30:13at just the one file coming up we're
- 00:30:15loading hundreds of files you have your
- 00:30:18reports coming off of your main database
- 00:30:20all those reports are being loaded you
- 00:30:22have your log files you have i mean all
- 00:30:24this different data is being dumped into
- 00:30:26hadoop and in this case hive on top of
- 00:30:28hadoop and so we need to let it know hey
- 00:30:30how do i handle these files coming in
- 00:30:32and then we have the semicolon at the
- 00:30:34end which lets us know to go ahead and
- 00:30:35run this line and so we'll go ahead and
- 00:30:37run that and now if we do a show tables
- 00:30:40you can see there's our employee on
- 00:30:41there we can also describe if we do
- 00:30:44describe employee
- 00:30:46you can see that we have our id integer
- 00:30:48name string department string year of
- 00:30:51joining integer and salary integer and
- 00:30:54then finally let's just do a select star
- 00:30:56from employee very basic sql and hql
- 00:31:00command selecting data it's going to
- 00:31:02come up and we haven't put anything in
- 00:31:04it so as we expect there's no data in it
- 00:31:07so if we flip back to our
- 00:31:10linux terminal window you can see where
- 00:31:12we did the cat
- 00:31:14employee.csv and you can see all the
- 00:31:16data we expect to come into it and we
- 00:31:18also did our pwd and right here you see
- 00:31:21the path you need that full path when
- 00:31:23you are loading data you know you can do
- 00:31:26a browse and if i did it right now with
- 00:31:28just the employee.csv as a name it will
- 00:31:31work but that is a really bad habit in
- 00:31:33general when you're loading data because
- 00:31:35it's you don't know what else is going
- 00:31:36on in the computer you want to do the
- 00:31:38full path almost in all your data loads
- 00:31:40so let's go ahead and flip back over
- 00:31:42here to our hive shell we're working in
- 00:31:45and the command for this is load data so
- 00:31:48that says hey we're loading data that's
- 00:31:49a hive command hql and we want local
- 00:31:52data so you got to put down local in
- 00:31:54path so now it needs to know where the
- 00:31:56path is now to make this more legible
- 00:31:59i'm just going to go ahead and hit enter
- 00:32:00then we'll just paste the full path in
- 00:32:02there which i have stored over on the
- 00:32:04side like a good prepared demo and
- 00:32:06you'll see here we have home cloudera
- 00:32:08documents employee.csv so it's a whole
- 00:32:11path for this text document in here and
- 00:32:13we go ahead and hit enter in there and
- 00:32:15then we have to let it know where the
- 00:32:17data is going so now we have a source
- 00:32:19and we need a destination and it's going
- 00:32:20to go into the table and we'll just call
- 00:32:22it employee we'll just match the table
- 00:32:25in there and because i want it to
- 00:32:26execute we put the semicolon on the end
- 00:32:29it goes ahead and executes all three
- 00:32:31lines now if we go back if you remember
- 00:32:34we did the select star from employee
- 00:32:36just using the up arrow to page through
- 00:32:38my different commands i've already typed
- 00:32:41in you can see right here we have as we
- 00:32:43expect we have rows sam mike and nick
- 00:32:45and we have all their information
- 00:32:46showing in our four rows and then let's
- 00:32:49go ahead and do uh select
- 00:32:51and count let's look at a couple of
- 00:32:53these different select options you can
- 00:32:55do we're going to count everything from
- 00:32:57employee now this is kind of interesting
- 00:32:59because the first one just pops up with
- 00:33:01the basic select because it doesn't need
- 00:33:04to go through the full map reduce phase
- 00:33:07but when you start doing a count it does
- 00:33:09go through the full mapreduce setup in
- 00:33:12the hive in hadoop and because i'm doing
- 00:33:14this demo on a single node cloudera
- 00:33:18virtual box on top of a windows 10 all
- 00:33:21the benefits of running it on a cluster
- 00:33:23are gone and instead is now going
- 00:33:25through all those added layers so it
- 00:33:27takes longer to run you know like i said
- 00:33:30when you do a single node as i said
- 00:33:32earlier it doesn't do any good as an
- 00:33:34actual distribution because you're only
- 00:33:35running it on one computer and then
- 00:33:37you've added all these different layers
- 00:33:38to run it and we see it comes up with
- 00:33:40four and that's what we expect we have
- 00:33:41four rows we expect four at the end and
- 00:33:44if you remember from
- 00:33:45our cheat sheet which we brought up here
- 00:33:48from hortons it's a pretty good one
- 00:33:49there's all these different commands we
- 00:33:51can do we'll look at one more command
- 00:33:52where we do the
- 00:33:54what they call sub queries right down
- 00:33:56here because that's really common to do
- 00:33:58a lot of sub queries and so we'll do
- 00:34:00select
- 00:34:01star or all different columns from
- 00:34:05now if we weren't using the office
- 00:34:07database it would look like this from
- 00:34:10office dot employee and either one will
- 00:34:12work on this particular one because we
- 00:34:15have office set as a default on there so
- 00:34:18from office employee and then the
- 00:34:20command where creates a subset and in
- 00:34:23this case we want to know where the
- 00:34:24salary is greater than 25
- 00:34:28000. there we go and of course we end
- 00:34:30with our semicolon and if we run this
- 00:34:32query you can see it pops up and there's
- 00:34:34our salaries of people top earners we
- 00:34:36have rose in it and mike in hr kudos
- 00:34:39to them of course they're fictional i
- 00:34:41don't actually we don't actually have a
- 00:34:42rose and a mic in those positions or
- 00:34:44maybe we do so finally we want to go
- 00:34:46ahead and do is we're done with this
- 00:34:48table now remember you're dealing with
- 00:34:49the data warehouse so you usually don't
- 00:34:51do a lot of dropping of tables and
- 00:34:54databases but we're going to go ahead
- 00:34:56and drop this table here before we drop
- 00:34:58it one more quick note is we can change
- 00:35:01it so what we're going to do is we're
- 00:35:03going to alter table office employee and
- 00:35:06we want to go ahead and rename it
- 00:35:08there's some other commands you can do
- 00:35:09in here but rename is pretty common and
- 00:35:11we're going to rename it to
- 00:35:13and it's going to stay in office and
- 00:35:16it turns out one of our
- 00:35:17shareholders really doesn't like the
- 00:35:19word employee he wants employees plural
- 00:35:22it's a big deal to him so let's go ahead
- 00:35:24and change that name for the table it's
- 00:35:26that easy because it's just changing the
- 00:35:28metadata on there and now if we do show
- 00:35:30tables you'll see we now have employees
- 00:35:33not employee and then at this point
- 00:35:36maybe we're doing some house cleaning
- 00:35:37because this is all practice so we're
- 00:35:39going to go ahead and drop table and
- 00:35:40we'll drop table employees because we
- 00:35:43changed the name in there so if we did
- 00:35:45employee just give us an error and now
- 00:35:47if we do show tables you'll see all the
- 00:35:49tables are gone now the next thing we
- 00:35:50want to take a look at and we're going
- 00:35:52to walk back through the loading of data
- 00:35:54just real quick because we're going to
- 00:35:56load two tables in here and let me just
- 00:35:58float back to our terminal window so we
- 00:36:01can see what those tables are that we're
- 00:36:03loading and so up here we have a
- 00:36:05customer we have a customer
- 00:36:07file and we have an order file we want
- 00:36:09to go ahead and put the customers and
- 00:36:10the orders into here so those are the
- 00:36:12two we're doing and of course it's
- 00:36:13always nice to see what you're working
- 00:36:15with so let's do our cat
- 00:36:17customer.csv we could always do g edit
- 00:36:20but we don't really need to edit these
- 00:36:21we just want to take a look at the data
- 00:36:23in customer and important in here is
- 00:36:25again we have a header so we have to
- 00:36:27skip a line comma separated uh nothing
- 00:36:30odd with the data we have our schema
- 00:36:32which is uh integer string integer
- 00:36:36string integer so you know you'd want to
- 00:36:38take that note that down or flip back
- 00:36:40and forth when you're doing it and then
- 00:36:41let's go ahead and do cat order dot csv
- 00:36:44and we can see we have oid which i'm
- 00:36:47guessing is the order id we have a date
- 00:36:49up something new we've done integers and
- 00:36:51strings but we haven't done date when
- 00:36:53you're importing new and you never
- 00:36:56worked with the date date's always one
- 00:36:57of the trickier fields to import
- 00:36:59that's true of just about any
- 00:37:01scripting language i've worked with all
- 00:37:03of them have their own idea of how
- 00:37:04date's supposed to be formatted what the
- 00:37:06default is this particular format or
- 00:37:09it's year and it has all four uh digits
- 00:37:13dash month two digits dash day is the
- 00:37:16standard import for the hive so you'll
- 00:37:19have to look up and see what the
- 00:37:20different formats are if you're going to
- 00:37:22do a different format in there coming in
- 00:37:24or you're not able to pre-process the
- 00:37:25data but this would be a pre-processing
- 00:37:27of the data thing coming in if you
- 00:37:29remember correctly from our etl which
- 00:37:31is just in case you weren't able to
- 00:37:33hear me last time etl which stands for
- 00:37:37extract transform then load so you want
- 00:37:40to make sure you're transforming this
- 00:37:42data before it gets into here and so
- 00:37:44we're going to go ahead and bring
- 00:37:45both this data in here and really we're
- 00:37:47doing this so we can show you the basic
- 00:37:49join there is if you remember from our
- 00:37:51setup merge join all kinds of different
- 00:37:53things you can do but joining different
- 00:37:55data sets is so common so it's really
- 00:37:58important to know how to do this we need
- 00:37:59to go ahead and bring in these two data
- 00:38:00sets and you can see where i just
- 00:38:02created a table customer here's our
- 00:38:04schema the integer name age address
- 00:38:07salary here's our delimited by commas
- 00:38:10and our table properties where we skip a
- 00:38:12line well let's go ahead and load the
- 00:38:13data first and then we'll do that with
- 00:38:15our order and let's go ahead and put
- 00:38:17that in here and i've got it split into
- 00:38:19three lines so you can see it easily we
- 00:38:21 We've got load data local inpath, so we know we're loading data, we know it's local, and we have the path; here's the complete path for... uh oh, this is supposed to be order.csv, I grabbed the wrong one, and of course it's going to give me errors because you can't recreate the same table.
- 00:38:35 And here we go: create table, with our integer, date and customer id, the same basic setup that we had coming in here for our schema, the row format with commas, and the table properties to skip the header line. Then finally let's load the data into our order table: load data local inpath /home/cloudera/Documents/order.csv into table order.
- 00:38:56 Now, if we did everything right, we should be able to do select star from customer, and you can see we have all seven customers; then we can do select star from order, and we have four orders. So this is just a quick example.
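Pieced together, the sequence just described might look roughly like the HiveQL below. It's a reconstruction from the narration rather than a copy of the on-screen commands: the column names and types are inferred from the schemas read out earlier, the date column is renamed order_date to sidestep the DATE keyword, the order table is backtick-quoted because ORDER is reserved in some Hive versions, and the file paths follow the /home/cloudera/Documents layout mentioned in the demo.

```sql
-- Customer table: id, name, age, address, salary (integer, string, integer, string, integer).
CREATE TABLE customer (
  id      INT,
  name    STRING,
  age     INT,
  address STRING,
  salary  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");  -- skip the header row in the CSV

LOAD DATA LOCAL INPATH '/home/cloudera/Documents/customer.csv' INTO TABLE customer;

-- Order table: order id, date (yyyy-MM-dd), customer id, amount.
CREATE TABLE `order` (
  oid         INT,
  order_date  DATE,
  customer_id INT,
  amount      INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");

LOAD DATA LOCAL INPATH '/home/cloudera/Documents/order.csv' INTO TABLE `order`;

-- Quick sanity checks: seven customers, four orders.
SELECT * FROM customer;
SELECT * FROM `order`;
```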
- 00:39:11 A lot of times, when you have your customer databases in business, you have thousands of customers from years and years, and some of them move, they close their business, they change names; all kinds of things happen. So what we want to do is go ahead and find just the information connected to these orders and who's connected to them, and so let's go ahead and do a select, because we're going to display information.
- 00:39:35 Select, and this is kind of interesting: we're going to do c.id, and I'm going to define c as the customer table in just a minute; then we're going to do c.name, and again we're going to define that c, and c.age. So this means from the customer we want to know their id, their name and their age, and then I'd also like to know the order amount, so let's do o.amount.
- 00:40:01 And then this is where we need to go ahead and define what we're doing, and I'm going to capitalize FROM customer. So we're going to take the customer table in here and we're going to name it c; that's where the c comes from, that's the customer table c. And we want to join order as o; that's where our o comes from, so the o.amount is what we're joining in there. Then we want to do this ON; we've got to tell it how to connect the two tables: c.id equals o.customer_id, so now we know how they're joined.
- 00:40:31 And remember, we have seven customers in here and we have four orders, so as it processes we should get a return of four different names joined together, and they're joined, of course, based on the orders. Once we're done, we now have the order number, the person who made the order, their age, and the amount of the order, which came from the order table. So you have your different information and you can see how the join works here; it's a very common use of tables in HQL and SQL.
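Written out as a single statement, the join being described might look like this, following the table and column names reconstructed above (so treat them as assumptions):

```sql
-- Inner join: only customers with a matching order come back, so with
-- seven customers and four orders we expect four rows in the result.
SELECT c.id, c.name, c.age, o.amount
FROM customer c
JOIN `order` o
  ON c.id = o.customer_id;
```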
- 00:40:59 Let's do one more thing with our database, and then I'll show you a couple of other Hive commands. Let's go ahead and do a drop: we're going to drop database office.
- 00:41:12 If you're looking at this and you remember from earlier, this will give me an error, and just to see what that looks like, it says failed to execute, exception: one or more tables exist. So if you remember from before, you can't just drop a database unless you tell it to cascade; that lets it know, I don't care how many tables are in it, let's get rid of it.
- 00:41:33 And in Hadoop, since it's a warehouse, a data warehouse, you usually don't do a lot of dropping. Maybe at the beginning, when you're developing the schemas and you realize you messed up, you might drop some stuff, but down the road you're really just adding commodity machines so you can store more on it, so you usually don't do a lot of database dropping.
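For reference, a minimal sketch of the two statements being contrasted here, assuming the database from earlier in the demo is named office:

```sql
-- Fails with "one or more tables exist" while the database still holds tables.
DROP DATABASE office;

-- Drops the database and everything in it, however many tables it contains.
DROP DATABASE office CASCADE;
```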
- 00:41:51 Some other fun commands to know: you can do select round 2.3 as round_value, so you can do a round-off in Hive. We can also do it as floor_value, which is going to give us a two, so it turns it into an integer rather than a float; it basically truncates it, but it always goes down. And we can also do ceiling, which is going to round it up, so we're looking for the next integer above.
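As a quick sketch, those three calls might look like this in the Hive shell; the column aliases are just for illustration:

```sql
SELECT ROUND(2.3) AS round_value;    -- 2, rounds to the nearest integer
SELECT FLOOR(2.3) AS floor_value;    -- 2, always rounds down to the integer below
SELECT CEIL(2.3)  AS ceiling_value;  -- 3, always rounds up to the integer above
```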
- 00:42:14 There are a few commands we didn't show in here because we're on a single node: as an admin, to help expedite the process, you usually add in partitions and buckets for the data, and you can't really do that on a single node, because when you add a partition it partitions the data across separate nodes. But beyond that, you can see it's very straightforward: we have SQL coming in, and all your basic queries that are in SQL are very similar in HQL.
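Although it isn't demonstrated on this single-node setup, a partitioned and bucketed version of the order table might be defined roughly as below; the partition column, bucket count, and table name are made up purely for illustration:

```sql
-- Partitioning groups rows into separate directories (one per order_year), and
-- bucketing hashes customer_id into a fixed number of files, which can speed up
-- joins and sampling once the data is spread over a multi-node cluster.
CREATE TABLE order_partitioned (
  oid         INT,
  order_date  DATE,
  customer_id INT,
  amount      INT
)
PARTITIONED BY (order_year INT)
CLUSTERED BY (customer_id) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```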
- 00:42:42 Key takeaways: we took a look at the history of Hive and how it evolved on top of the Hadoop file system into HQL, a SQL-like layer over Hadoop, with the full metastore and all that information connected to the Hadoop file system. That way you can easily scale it up and still have the Hadoop setup underneath, while still having the SQL-style query language available. We looked at what Hive is, and we looked at Hive queries going in through MapReduce, into the Hadoop MapReduce system.
- 00:43:13 We dug a little deeper into the architecture of Hive and all the different pieces and how they fit together, including the fact that it has the Hive client, and you have your Thrift applications, your JDBC applications and your ODBC applications. We have the Hive web interface, which we looked at in a demo, along with the CLI, the command line interface, which we used for most of the demo; again, this is how you get these commands into Hive.
- 00:43:41 The Hive web interface is great for, say, the stakeholders you're working with: some of them are not technically literate, and even if they are, it's a quick way to look up data; I'm not beyond jumping into the Hue web interface and looking something up instead of using a terminal window. Or, if you're running the back end, running a program, whether it's Python or Java, you have the Thrift applications which connect in, so you can extend it; all the major scripting languages usually have their own plugin for sending that information over to our Hadoop file system.
- 00:44:10 We also dug deeper into the data flow in Hive: the user interface and the different steps a query takes to go through and get in and out of the MapReduce system.
- 00:44:21 We took a glance at Hive data types as well, and certainly these are always evolving, so it's good to look at the Apache website, especially with the new stuff coming up underneath Beehive, which is also Hive; if you see that, don't let it scare you, it's just the beta version coming up with the new updates.
- 00:44:35 And then we looked at the features of Hive and how it works as an ecosystem.
- 00:44:42 With that, I'd like to thank you for joining us today for Hadoop Hive. Again, my name is Richard Kirschner with the Simplilearn team; for more information, visit us at www.simplilearn.com.
- 00:44:53 You can also post comments here on the YouTube channel; we do have moderators that monitor it, and we will certainly respond to your postings. Again, thank you for joining us today. Get certified, get ahead.
- 00:45:09 Hi there. If you like this video, subscribe to the Simplilearn YouTube channel and click here to watch similar videos. Turn it up and get certified: click here.
- Hive
- HQL
- HiveQL
- Data Warehouse
- Hadoop
- Metastore
- HDFS
- Big Data
- SQL
- Data Types