Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Hadoop | Simplilearn
Summary
TLDR: This video tutorial introduces Hive, a data warehouse system that simplifies querying large datasets in the Hadoop ecosystem using HiveQL, a SQL-like language. It begins with a historical overview of Hive's development, from its origins at Facebook as part of its Hadoop solution to its widespread adoption. The architecture is explained, detailing key components such as the Hive clients, the JDBC/ODBC drivers, the Hive server and driver, and the Metastore. Data modeling concepts such as tables, partitions, and buckets are discussed, alongside Hive's data types. The tutorial then contrasts Hive with traditional RDBMSs, highlighting operational differences such as schema enforcement and scaling. The video concludes with a live demo of Hive commands and functionality in the Cloudera Hadoop setup.
Takeaways
- 📚 Hive simplifies querying large data sets with HQL!
- 🛠️ It originated from Facebook's needs to manage big data.
- 🏗️ Hive's architecture includes clients, servers, and a metastore.
- 📊 Data modeling in Hive involves tables, partitions, and buckets.
- ⚙️ Hive operates in local and MapReduce modes based on data size.
- 💾 Supports both primitive and complex data types for flexibility.
- 🏆 Hive is not a database; it's a data warehouse for analysis.
- 📈 Easily scalable at a lower cost compared to RDBMS.
- 💡 The Hive metastore manages metadata for efficient data retrieval.
- 📝 Hands-on demo shows how to execute Hive commands effectively.
Timeline
- 00:00:00 - 00:05:00
The tutorial introduces Hive, presented by Richard Kirschner from Simplilearn. It covers Hive's history, architecture, and features, plus a hands-on demo of Hive on the Cloudera Hadoop file system. It starts with the need for Hive: writing Java MapReduce code for data processing was complex, which led to the development of HiveQL, a SQL-like query language for querying large datasets.
- 00:05:00 - 00:10:00
Hive was developed at Facebook to manage substantial data using Hadoop. The tutorial explains that Hive uses a SQL-like language for ease of use, facilitating querying and analysis of large datasets stored in HDFS. It outlines Hive's role as a data warehouse which translates user queries into MapReduce tasks for execution.
- 00:10:00 - 00:15:00
The architecture of Hive is detailed, including components like the Hive client, Thrift applications, and the JDBC and ODBC drivers. It describes how the Hive server processes queries, the role of the Hive driver in compiling and executing tasks, and the Metastore that stores table information.
- 00:15:00 - 00:20:00
The data flow within Hive is elaborated, highlighting the interaction between the user interface, compiler, execution engine, and HDFS. The explanation covers how queries are executed and how metadata is managed, providing insights into efficient data retrieval processes.
- 00:20:00 - 00:25:00
Hive data modeling is explained, focusing on the structure of tables, partitions for grouping data, and buckets for efficient querying (see the sketch after this timeline). The importance of designing efficient schemas to improve query performance is emphasized.
- 00:25:00 - 00:30:00
Hive data types are categorized into primitive and complex types, mirroring SQL data types. Primitive types include numerical and string data, while complex types allow for the storage of arrays and maps, crucial for advanced data analytics.
- 00:30:00 - 00:35:00
Hive operates in two modes: local mode for small datasets on a single data node, and MapReduce mode for larger datasets across multiple data nodes, emphasizing the scalability of Hive for big data applications.
- 00:35:00 - 00:40:00
Differences between Hive and traditional RDBMS are outlined, including Hive's schema-on-read approach versus RDBMS's schema-on-write, and the capacity for handling petabytes of data in Hive as opposed to terabytes in RDBMS, showcasing Hive's efficiency in data warehousing.
- 00:40:00 - 00:45:20
The last segment discusses Hive's features, including the use of HiveQL, its SQL-like interface, simultaneous querying by multiple users, support for various data types, and the importance of scalability and cost-effectiveness in a big data context. It concludes with a live demo of HiveQL commands in the Cloudera environment.
Video Q&A
What is Hive?
Hive is a data warehouse system for querying and analyzing large datasets stored in the Hadoop file system (HDFS), using a SQL-like query language known as HiveQL or HQL.
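As a rough illustration of what an HQL query looks like (reusing the employee table created in the demo later in the video; this exact query is not from the video), Hive compiles statements like this into MapReduce jobs behind the scenes:

    SELECT department, COUNT(*) AS headcount, AVG(salary) AS avg_salary
    FROM employee
    GROUP BY department;   -- executed as one or more MapReduce jobs on the cluster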
How does Hive differ from traditional RDBMS?
Hive enforces schema on read, while RDBMS enforces schema on write. Hive is designed for large data in petabytes, whereas RDBMS typically manages data in terabytes.
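A small sketch of what schema on read means in practice (the file path and table name are hypothetical): Hive accepts the file as-is when it is loaded and only applies the declared column types when the data is read back, so a malformed field surfaces as NULL at query time rather than being rejected at write time as an RDBMS would do.

    CREATE TABLE visits (user_id INT, visit_ts STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- No validation happens here; the file is simply moved into the warehouse directory.
    LOAD DATA LOCAL INPATH '/tmp/visits.csv' INTO TABLE visits;

    -- Types are enforced now, at read time; a non-numeric user_id comes back as NULL.
    SELECT * FROM visits WHERE user_id IS NULL;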
What are the main modes of Hive?
Hive operates in local mode for small datasets on a single data node, and MapReduce mode for processing larger data sets across multiple data nodes.
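For reference, this behavior can be steered from the Hive shell with configuration properties; a hedged sketch (property names as documented for recent Hive releases, with threshold values that are only illustrative):

    -- Let Hive run small jobs locally and send larger ones to the cluster as MapReduce.
    SET hive.exec.mode.local.auto=true;

    -- Upper bounds for what counts as "small"; check the defaults for your Hive version.
    SET hive.exec.mode.local.auto.inputbytes.max=134217728;
    SET hive.exec.mode.local.auto.input.files.max=4;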
What is the Hive Metastore?
The Metastore is a repository for Hive metadata, which contains information about the structure of tables and their schemas.
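You normally do not query the Metastore directly; instead, HiveQL statements like the following read from it (a sketch reusing the office database and employee table created in the demo):

    SHOW DATABASES;                        -- database list comes from the Metastore
    SHOW TABLES IN office;                 -- so does the table list for a database
    DESCRIBE FORMATTED office.employee;    -- column types, HDFS location, table properties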
Can Hive handle complex data types?
Yes, Hive supports both primitive (e.g., integers, strings) and complex data types (e.g., arrays, maps, structs).
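A minimal sketch of a table declaring complex types (the column names are hypothetical); the delimiter clauses shown are the standard HiveQL options for delimited text files:

    CREATE TABLE employee_profile (
      name    STRING,
      skills  ARRAY<STRING>,                    -- e.g. sql|python
      ratings MAP<STRING, INT>,                 -- e.g. teamwork:5
      address STRUCT<city:STRING, zip:STRING>
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      COLLECTION ITEMS TERMINATED BY '|'
      MAP KEYS TERMINATED BY ':';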
- 00:00:03hello and welcome to hive tutorial my
- 00:00:06name is richard kirschner with the
- 00:00:07simply learn team that is
- 00:00:10www.simplylearn.com get certified get
- 00:00:13ahead what's in it for you today in our
- 00:00:15hive tutorial first we're going to start
- 00:00:17with the history of hive what is hive
- 00:00:20architecture of hive data flow and hive
- 00:00:23hive data modeling hive data types
- 00:00:26different modes of hive and difference
- 00:00:28between hive and rdbms finally we're
- 00:00:32going to look into the features of hive
- 00:00:34and do a quick hands-on demo on hive in
- 00:00:37the cloudera hadoop file system let's
- 00:00:39dive in with a brief history of hive so
- 00:00:42the history of hive begins with facebook
- 00:00:44facebook began using hadoop as a
- 00:00:46solution to handle the growing big data
- 00:00:49and we're not talking about data that
- 00:00:50fits on one or two or even five
- 00:00:52computers we're talking
- 00:00:55if you've looked at any of our other
- 00:00:56hadoop tutorials you'll know we're
- 00:00:58talking about very big data and data
- 00:01:00pools and facebook certainly has a lot
- 00:01:02of data it tracks as we know the hadoop
- 00:01:05uses mapreduce for processing data
- 00:01:08mapreduce required users to write long
- 00:01:10codes and so you'd have these really
- 00:01:12extensive java codes very complicated
- 00:01:14for the average person to use not all
- 00:01:16users were versed in java and other
- 00:01:18coding languages this proved to be a
- 00:01:20disadvantage for them users were
- 00:01:22comfortable with writing queries in sql
- 00:01:24sql has been around for a long time the
- 00:01:26standard
- 00:01:27sql query language hive was developed
- 00:01:30with the vision to incorporate the
- 00:01:32concepts of tables columns just like sql
- 00:01:35so why hive well the problem was for
- 00:01:38processing and analyzing data users
- 00:01:40found it difficult to code as not all of
- 00:01:42them were well versed with the coding
- 00:01:44languages you have your processing and
- 00:01:46analyzing and so the solution
- 00:01:48was a language similar to sql which
- 00:01:51was well known to all the users and thus
- 00:01:53the hive or hql language evolved what is
- 00:01:57hive hive is a data warehouse system
- 00:02:00which is used for querying and analyzing
- 00:02:02large data sets stored in the hdfs or
- 00:02:04the hadoop file system hive uses a query
- 00:02:07language that we call hive ql or hql
- 00:02:10which is similar to sql so if we take
- 00:02:13our user the user sends out their hive
- 00:02:16queries and then that is converted into
- 00:02:18a mapreduce tasks and then accesses the
- 00:02:21hadoop mapreduce system let's take a
- 00:02:23look at the architecture of hive
- 00:02:25architecture of hive we have the hive
- 00:02:27client
- 00:02:28so that could be the programmer or maybe
- 00:02:30it's a manager who knows enough sql to
- 00:02:32do a basic query to look up the data
- 00:02:34they need the hive client supports
- 00:02:36different types of client applications
- 00:02:38in different languages preferred for
- 00:02:40performing queries and so we have our
- 00:02:42thrift application in the hive thrift
- 00:02:44client thrift is a software framework
- 00:02:47hive server is based on thrift so it can
- 00:02:50serve the request from all programming
- 00:02:51language that support thrift and then we
- 00:02:54have our jdbc application and the hive
- 00:02:57jdbc driver jdbc java database
- 00:03:01connectivity jdbc application is
- 00:03:04connected through the jdbc driver and
- 00:03:06then you have the odbc application or
- 00:03:08the hive odbc driver the odbc or open
- 00:03:13database connectivity the odbc
- 00:03:15application is connected through the
- 00:03:17odbc driver with the growing development
- 00:03:20of all of our different scripting
- 00:03:21languages python c plus plus spar
- 00:03:24java you can find just about any
- 00:03:26connection in any of the main scripting
- 00:03:28languages and so we have our hive
- 00:03:30services as we look at deeper into the
- 00:03:33architecture hive supports various
- 00:03:35services
- 00:03:36so you have your hive server basically
- 00:03:38your thrift application or your hive
- 00:03:40thrift client or your jdbc or your hive
- 00:03:42jdbc driver your odbc application or
- 00:03:45your hive odbc driver they all connect
- 00:03:47into the hive server and you have your
- 00:03:49hive web interface you also have your
- 00:03:52cli now the hive web interface is a gui
- 00:03:55provided to execute hive queries and
- 00:03:58we'll actually be using that later on
- 00:04:00today so you can see kind of what that
- 00:04:02looks like and get a feel for what that
- 00:04:04means commands are executed directly in
- 00:04:06cli and then the cli is a direct
- 00:04:09terminal window and i'll also show you
- 00:04:12that too so you can see how those two
- 00:04:14different interfaces work these then
- 00:04:16push the code into the hive driver hive
- 00:04:18driver is responsible for all the
- 00:04:20queries submitted so everything goes
- 00:04:22through that driver let's take a closer
- 00:04:23look at the hive driver the hive driver
- 00:04:25now performs three steps internally one
- 00:04:28is a compiler hive driver passes query
- 00:04:31to compiler where it is checked and
- 00:04:32analyzed then the optimizer kicks in and
- 00:04:35the optimized logical plan in the form of
- 00:04:37a graph of mapreduce and hdfs tasks is
- 00:04:40obtained and then finally in the
- 00:04:42executor in the final step the tasks are
- 00:04:45executed we look at the architecture we
- 00:04:47also have to note the meta store
- 00:04:49metastore is a repository for hive
- 00:04:51metadata stores metadata for hive tables
- 00:04:54and you can think of this as your schema
- 00:04:56and where is it located and it's stored
- 00:04:58on the apache derby db processing and
- 00:05:00resource management is all handled by
- 00:05:02the mapreduce v1 you'll see mapreduce v2
- 00:05:06the yarn and the tez these are all
- 00:05:08different ways of managing these
- 00:05:10resources depending on what version of
- 00:05:11hadoop you're in hive uses mapreduce
- 00:05:13framework to process queries and then we
- 00:05:16have our distributed storage which is
- 00:05:18the hdfs and if you looked at our hadoop
- 00:05:21tutorials you'll know that these are on
- 00:05:23commodity machines and are linearly
- 00:05:25scalable that means they're very
- 00:05:27affordable a lot of time when you're
- 00:05:28talking about big data you're talking
- 00:05:30about a tenth of the price of storing it
- 00:05:32on enterprise computers and then we look
- 00:05:34at the data flow and hive
- 00:05:36so in our data flow and hive we have our
- 00:05:38hive in the hadoop system and underneath
- 00:05:40the user interface or the ui we have our
- 00:05:43driver our compiler our execution engine
- 00:05:45and our meta store that all goes into
- 00:05:47the mapreduce and the hadoop file system
- 00:05:50so when we execute a query you see it
- 00:05:52coming in here it goes into the driver
- 00:05:54step one step two we get a plan what are
- 00:05:56we going to do refers to the query
- 00:05:58execution uh then we go to the metadata
- 00:06:00it's like well what kind of metadata are
- 00:06:02we actually looking at where is this
- 00:06:03data located what is the schema on it
- 00:06:06then this comes back with the metadata
- 00:06:08into the compiler then the compiler
- 00:06:10takes all that information and sends the
- 00:06:13plan back to the driver the driver
- 00:06:15then sends the execute plan to the
- 00:06:17execution engine once it's in the
- 00:06:19execution engine the execution engine
- 00:06:22acts as a bridge between hive and hadoop
- 00:06:25to process the query and that's going
- 00:06:27into your mapreduce in your hadoop file
- 00:06:29system or your hdfs and then we come
- 00:06:32back with the metadata operations it
- 00:06:34goes back into the metastore to update
- 00:06:37or let it know what's going on which
- 00:06:38also goes between them it's a
- 00:06:41communication between the execution
- 00:06:42engine and the metastore the execution
- 00:06:44engine communicates
- 00:06:46bi-directionally with the metastore to
- 00:06:48perform operations like create drop
- 00:06:51tables metastore stores information
- 00:06:54about tables and columns so again we're
- 00:06:56talking about the schema of your
- 00:06:57database and once we have that we have a
- 00:06:59bi-directional
- 00:07:01send results communication back into the
- 00:07:03driver and then we have the fetch
- 00:07:05results which goes back to the client so
- 00:07:07let's take a little bit look at the hive
- 00:07:09data modeling hive data modeling so you
- 00:07:12have your high data modeling you have
- 00:07:13your tables you have your partitions and
- 00:07:15you have buckets the tables in hive are
- 00:07:18created the same way it is done in rdbms
- 00:07:21so when you're looking at your
- 00:07:22traditional sql server or mysql server
- 00:07:25where you might have enterprise
- 00:07:27equipment and a lot of
- 00:07:29people pulling and moving stuff off of
- 00:07:31there the tables are gonna look very
- 00:07:32similar and this makes it very easy to
- 00:07:34take that information and let's say you
- 00:07:36need to keep current information but you
- 00:07:39need to store all of your years of
- 00:07:41transactions back into the hadoop hive
- 00:07:44so you match those those all kind of
- 00:07:46look the same the tables are the same
- 00:07:48your databases look very similar and you
- 00:07:50can easily import them back you can
- 00:07:51easily store them into the hive system
- 00:07:54partitions here tables are organized
- 00:07:56into partitions for grouping same type
- 00:07:58of data based on partition key this can
- 00:08:01become very important for speeding up
- 00:08:04the process of doing queries so if
- 00:08:06you're looking at dates as far as like
- 00:08:08your employment dates of employees if
- 00:08:10that's what you're tracking you might
- 00:08:12add a partition there because that might
- 00:08:13be one of the key things that you're
- 00:08:14always looking up as far as employees
- 00:08:16are concerned and finally we have
- 00:08:18buckets uh data present in partitions
- 00:08:20can be further divided into buckets for
- 00:08:22efficient querying again there's that
- 00:08:24efficiency at this level a lot of times
- 00:08:27you're taught you're working with the
- 00:08:29programmer and the admin of your hadoop
- 00:08:32file system to maximize the efficiency
- 00:08:35of that file system so it's usually a
- 00:08:37two-person job and we're talking about
- 00:08:39hive data modeling you want to make sure
- 00:08:41that they work together and you're
- 00:08:43maximizing your resources hive data
- 00:08:46types so we're talking about hive data
- 00:08:48types we have our primitive data types
- 00:08:50and our complex data types a lot of this
- 00:08:53will look familiar because it mirrors a
- 00:08:55lot of stuff in sql in our primitive
- 00:08:57data types we have the numerical data
- 00:08:59types string data type date time data
- 00:09:02type and
- 00:09:03miscellaneous data type and these should
- 00:09:05be very they're kind of self-explanatory
- 00:09:07but just in case numerical data is your
- 00:09:10floats your integers your short integers
- 00:09:12all of that numerical data comes in as a
- 00:09:14number a string of course is characters
- 00:09:17and numbers and then you have your date
- 00:09:19time stamp and then we have kind of a
- 00:09:21general way of pulling your own created
- 00:09:23data types in there that's your
- 00:09:25miscellaneous data type and we have
- 00:09:27complex data types so you can store
- 00:09:29arrays you can store maps you can store
- 00:09:32structures and even unions in there as
- 00:09:34we dig into hive data types
- 00:09:37and we have the primitive data types and
- 00:09:39the complex data types so we look at
- 00:09:41primitive data types and we're looking
- 00:09:42at numeric data types data types like an
- 00:09:45integer a float a decimal those are all
- 00:09:47stored as numbers in the hive data
- 00:09:50system a string data type data types
- 00:09:52like characters and strings you store
- 00:09:54the name of the person you're working
- 00:09:55with uh you know john doe the city
- 00:09:58memphis the state tennessee maybe it's
- 00:10:02boulder colorado usa or maybe it's
- 00:10:05hyderabad
- 00:10:06india that's all going to be string and
- 00:10:08stored as a string character and of
- 00:10:10course we have our date time data type
- 00:10:12data types like timestamp date interval
- 00:10:15those are very common as far as tracking
- 00:10:18sales anything like that you just think
- 00:10:19if you can type a stamp of time on it or
- 00:10:22maybe you're dealing with a race and you
- 00:10:23want to know the interval how long did
- 00:10:24the person take to complete whatever
- 00:10:26task it was all that is date time data
- 00:10:29type and then we talk miscellaneous data
- 00:10:31type these are like boolean and binary
- 00:10:34and when you get into boolean and binary
- 00:10:35you can actually almost create anything
- 00:10:37in there but your yes/nos zero/one now
- 00:10:39let's take a look at complex data types
- 00:10:41a little closer uh we have arrays so
- 00:10:44your syntax is of data type and it's an
- 00:10:46array and you can just think of an array
- 00:10:48as a collection of same
- 00:10:51entities one two three four if they're
- 00:10:53all numbers and you have maps this is a
- 00:10:55collection of key value pairs
- 00:10:58so understanding maps is so central to
- 00:11:01hadoop uh so when we store maps you have
- 00:11:03a key which is a set you can only have
- 00:11:05one key per mapped value and so you in
- 00:11:08hadoop of course you collect uh the same
- 00:11:10keys and you can add them all up or do
- 00:11:12something with all the contents of the
- 00:11:13same key but this is our map as a
- 00:11:16complex data type in our
- 00:11:18collection of key value pairs and then
- 00:11:20collection of complex data with comment
- 00:11:23so we can have a structure we have a
- 00:11:24column name data type comment call a
- 00:11:27column comment so you can get very
- 00:11:29complicated structures in here with your
- 00:11:31collection of data and your commented
- 00:11:33setup and then we have unions and this is
- 00:11:35a collection of heterogeneous data types
- 00:11:38so the syntax for this is union type
- 00:11:40data type data type and so on this is a
- 00:11:43little bit
- 00:11:44different than the arrays in that you can
- 00:11:46actually mix and match different modes
- 00:11:48of hive hive operates in two modes
- 00:11:51depending on the number and size of data
- 00:11:53nodes we have our local mode and our map
- 00:11:56reduce mode when we talk about the local
- 00:11:58mode it is used when hadoop is having
- 00:12:00one data node and the data is small
- 00:12:03processing will be very fast on a
- 00:12:04smaller data sets which are present in
- 00:12:06local machine and this might be that you
- 00:12:08have a local file stuff you're uploading
- 00:12:11into the hive and you need to do some
- 00:12:13processes in there you can go ahead and
- 00:12:14run those high processes and queries on
- 00:12:17it usually you don't see much in the way
- 00:12:19of a single node hadoop system if you're
- 00:12:21going to do that you might as well just
- 00:12:22use like an sql database or even a java
- 00:12:26sqlite or something python sqlite so you
- 00:12:29don't really see a lot of single node
- 00:12:31hadoop databases but you do see the
- 00:12:33local mode in hive where you're working
- 00:12:35with a small amount of data that's going
- 00:12:37to be integrated into the larger
- 00:12:39database and then we have the map reduce
- 00:12:41mode this is used when hadoop is having
- 00:12:44multiple data nodes and the data is
- 00:12:45spread across various data nodes
- 00:12:47processing large datasets can be more
- 00:12:49efficient using this mode and this you
- 00:12:51can think of instead of it being one two
- 00:12:53three or even five computers we're
- 00:12:56usually talking with the hadoop file
- 00:12:58system we're looking at 10 computers 15
- 00:13:01100 where this data is spread across all
- 00:13:03those different hadoop nodes difference
- 00:13:05between hive and
- 00:13:07rdbms remember rdbms stands for the
- 00:13:10relational database management system
- 00:13:13let's take a look at the difference
- 00:13:14between hive and the rdbms with hive
- 00:13:18hive enforces schema on read and it's
- 00:13:21very important that whatever is coming
- 00:13:23in that's when hive's looking at it and
- 00:13:24making sure that it fits the model
- 00:13:27the rdbms enforces a schema when it
- 00:13:30actually writes the data into the
- 00:13:31database so it's read the data and then
- 00:13:33once it starts to write it that's where
- 00:13:35it's going to give you the error or tell
- 00:13:36you something's incorrect about your
- 00:13:38schema hive data size is in petabytes
- 00:13:41that is hard to imagine you know we're
- 00:13:43looking at your personal computer on
- 00:13:45your desk maybe you have 10 terabytes if
- 00:13:47it's a high-end computer but we're
- 00:13:49talking petabytes so that's hundreds of
- 00:13:51computers grouped together whereas rdbms
- 00:13:54data size is in terabytes very rarely do
- 00:13:57you see an rdbms system that's spread
- 00:14:00over more than five computers and
- 00:14:02there's a lot of reasons for that with
- 00:14:04the rdbms it actually has a high end
- 00:14:07amount of writes to the hard drive
- 00:14:09there's a lot more going on there you're
- 00:14:10writing and polling stuff so you really
- 00:14:12don't want to get too big with an rdbms
- 00:14:14or you're gonna run into a lot of
- 00:14:15problems with hive you can take it as
- 00:14:17big as you want hive is based on the
- 00:14:19notion of write once and read many times
- 00:14:23this is so important and they call it
- 00:14:25worm which is write once
- 00:14:28read many they refer to it
- 00:14:31as worm and that's true of any of a lot
- 00:14:32of your hadoop setup it's it's altered a
- 00:14:35little bit but in general we're looking
- 00:14:36at archiving data that you want to do
- 00:14:38data analysis on we're looking at
- 00:14:40pulling all that stuff off your rdbms
- 00:14:42from years and years and years of
- 00:14:44business or whatever your company does
- 00:14:46or scientific research and putting that
- 00:14:48into a huge data pool so that you can
- 00:14:50now do queries on it and get that
- 00:14:52information out of it with the rdbms
- 00:14:55it's based on the notion of read and
- 00:14:56write many times
- 00:14:58so you're continually updating this
- 00:14:59database you're continually bringing up
- 00:15:01new stuff new sales
- 00:15:03the account changes because they have a
- 00:15:06different licensing now whatever
- 00:15:07software you're selling all that kind of
- 00:15:09stuff where the data is continually
- 00:15:10fluctuating and then hive resembles a
- 00:15:12traditional database by supporting sql
- 00:15:15but it is not a database it is a data
- 00:15:18warehouse this is very important it goes
- 00:15:20with all the other stuff we've talked
- 00:15:21about that we're not looking at a
- 00:15:23database but a data warehouse to store
- 00:15:25the data and still have fast and easy
- 00:15:28access to it for doing queries you can
- 00:15:31think of
- 00:15:32twitter and facebook they have so many
- 00:15:35posts that are archived back
- 00:15:36historically those posts aren't going to
- 00:15:38change they made the post they're posted
- 00:15:40they're there and they're in their
- 00:15:41database but they have to store it in a
- 00:15:42warehouse in case they want to pull it
- 00:15:44back up with the rdbms it's a type of
- 00:15:47database management system which is
- 00:15:49based on the relational model of data
- 00:15:51and then with hive easily scalable at a
- 00:15:54low cost again we're talking maybe a
- 00:15:56thousand dollars per terabyte um the
- 00:15:58rdbms is not scalable at a low cost when
- 00:16:02you first start on the lower end you're
- 00:16:03talking about 10 000 per terabyte of
- 00:16:06data including all the backup on the
- 00:16:08models and all the added necessities to
- 00:16:10support it as you scale it up you have
- 00:16:13to scale those computers and hardware up
- 00:16:15so you might start off with a basic
- 00:16:17server and then you upgrade to a sun
- 00:16:20computer to run it and you spend you
- 00:16:22know tens of thousands of dollars for
- 00:16:23that hardware upgrade with hive you just
- 00:16:25put another computer into your hadoop
- 00:16:28file system so let's look at some of the
- 00:16:29features of hive
- 00:16:31when we're looking at the features of
- 00:16:32hive we're talking about the use of sql
- 00:16:35like language called hive ql a lot of
- 00:16:37times you'll see that as hql which is
- 00:16:39easier than long codes this is nice if
- 00:16:42you're working with your shareholders
- 00:16:44you come to them and you say hey you can
- 00:16:46do a basic sql query on here and pull up
- 00:16:48the information you need this way you
- 00:16:50don't have to take off have your
- 00:16:51programmers jump in every time they want
- 00:16:53to look up something in the database
- 00:16:55they actually now can easily do that if
- 00:16:56they're not
- 00:16:57skilled in programming and script
- 00:17:00writing tables are used which are
- 00:17:01similar to the rdbms hence easier to
- 00:17:04understand and one of the things i like
- 00:17:06about this is when i'm bringing tables
- 00:17:08in from a mysql server or sql server
- 00:17:10there's almost a direct reflection
- 00:17:12between the two so when you're looking
- 00:17:13at one which is the data which is
- 00:17:15continually changing and then you're
- 00:17:16going into the archive database it's not
- 00:17:18this huge jump where you have to learn a
- 00:17:20whole new language
- 00:17:22you mirror that same schema into the
- 00:17:24hdfs into the hive making it very easy
- 00:17:27to go between the two and then using
- 00:17:29hive ql multiple users can
- 00:17:31simultaneously query data so again you
- 00:17:34have multiple clients in there and they
- 00:17:36send in their query that's also true
- 00:17:37with the rdbms which kind of cues them
- 00:17:40up because it's running so fast you
- 00:17:41don't notice the lag time well you get
- 00:17:43that also with the hql as you add more
- 00:17:46computers and query can go very quickly
- 00:17:48depending on how many computers and how
- 00:17:50much resources each machine has to pull
- 00:17:52the information and hive supports a
- 00:17:55variety of data types
- 00:17:57so with hive it's designed to be on the
- 00:18:00hadoop system which you can put almost
- 00:18:02anything into the hadoop file system so
- 00:18:04with all that let's take a look at a
- 00:18:07demo on hive ql or hql before i dive
- 00:18:11into the hands-on demo let's take a look
- 00:18:14at the website hive.apache.org
- 00:18:17that's the main website since apache
- 00:18:20it's an apache open source
- 00:18:22software this is the main software for
- 00:18:24the main site for the build and if you
- 00:18:26go in here you'll see that they're
- 00:18:27slowly migrating hive into beehive and
- 00:18:30so if you see beehive versus hive note
- 00:18:32the beehive as the new release is coming
- 00:18:34out that's all it is it reflects a lot
- 00:18:36of the same functionality of hive it's
- 00:18:38the same thing and then we like to pull
- 00:18:40up some kind of documentation on
- 00:18:43commands and for this i'm actually going
- 00:18:45to go to hortonworks hive cheat sheet
- 00:18:48and that's because hortonworks and
- 00:18:50cloudera are two of the most commonly used
- 00:18:53builds for hadoop both of which include
- 00:18:57hive and all the different tools in
- 00:18:58there and so hortonworks has a pretty
- 00:19:00good pdf you can download cheat sheet on
- 00:19:03there i believe cloudera does too but
- 00:19:04we'll go ahead and just look at the
- 00:19:06horton one because it's the one that
- 00:19:07comes up really good and you can see
- 00:19:08when we look at the query language it
- 00:19:10compares mysql server to hive ql or hql
- 00:19:14and you can see the basic select we
- 00:19:16select from columns from table where
- 00:19:19conditions exist the most basic command
- 00:19:22on there and they have different things
- 00:19:23you can do with it just like you do with
- 00:19:25your sql and if you scroll down you'll
- 00:19:28see
- 00:19:28data types so here's your integer your
- 00:19:30float your binary double string timestamp
- 00:19:33and all the different data types you can
- 00:19:34use some different semantics different
- 00:19:37keys features functions
- 00:19:40for running a hive query command line
- 00:19:42setup and of course a hive shell uh set
- 00:19:45up in here so you can see right here if
- 00:19:47we loop through it has a lot of your
- 00:19:48basic stuff and it is we're basically
- 00:19:50looking at sql across a horton database
- 00:19:53we're going to go ahead and run our
- 00:19:55hadoop cluster hive demo and i'm going
- 00:19:58to go ahead and use the cloudera quick
- 00:20:00start this is in the virtual box so
- 00:20:03again we have an oracle virtual box
- 00:20:05which is open source and then we have
- 00:20:08our cloudera quickstart which is the
- 00:20:10hadoop setup on a single node now
- 00:20:12obviously hadoop and hive are designed
- 00:20:15to run across a cluster of computers so
- 00:20:17we talk about a single node is for
- 00:20:19education testing that kind of thing and
- 00:20:21if you have a chance you can always go
- 00:20:23back and look at our demo we had on
- 00:20:27setting up a hadoop system in a single
- 00:20:29cluster just set a note down below in
- 00:20:31the youtube video and our team will get
- 00:20:33in contact with you and send you that
- 00:20:35link if you don't already have it or you
- 00:20:36can contact us at the
- 00:20:39www.simplylearn.com now in here it's
- 00:20:40always important to note that you do
- 00:20:42need
- 00:20:43on your computer if you're running on
- 00:20:45windows because i'm on a windows machine
- 00:20:47you're going to need probably about 12
- 00:20:49gigabytes to actually run this it used
- 00:20:51to be you could get by with a lot less but as
- 00:20:52things have evolved they take up more
- 00:20:54and more resources and you need the
- 00:20:56professional version if you have the
- 00:20:58home version i was able to get that to
- 00:21:00run but boy did it take a lot of extra
- 00:21:02work to get the home version to let me
- 00:21:05use the virtual setup on there and we'll
- 00:21:07simply click on the cloudera quick start
- 00:21:09and i'm going to go and just start that
- 00:21:10up and this is starting up our linux so
- 00:21:13we have our windows 10 which is a
- 00:21:14computer i'm on and then i have the
- 00:21:17virtual box which is going to have a
- 00:21:18linux operating system in it and we'll
- 00:21:20skip ahead so you don't have to watch
- 00:21:22the whole install something interesting
- 00:21:24to know about the cloudera is that it's
- 00:21:27running on linux centos and for whatever
- 00:21:29reason i've always had to click on it
- 00:21:32and hit the escape button for it to spin
- 00:21:35up and then you'll see the dos come in
- 00:21:37here now that our cloudera spun up on
- 00:21:39our virtual machine with the linux on we
- 00:21:42can see here we have our it uses the
- 00:21:45thunderbird browser on here by default
- 00:21:47and automatically opens up a number of
- 00:21:49different tabs for us and a quick note
- 00:21:51cause i mentioned like the restrictions
- 00:21:53on getting set up on your own computer
- 00:21:55if you have a home edition computer and
- 00:21:57you're worried about setting it up on
- 00:21:59there you can also go in there and spin
- 00:22:01up a one month free service on amazon
- 00:22:04web service to play with this so there's
- 00:22:06other options you're not stuck with just
- 00:22:08doing it on the quick start menu you can
- 00:22:10spin this up in many other ways now the
- 00:22:12first thing we want to note is that
- 00:22:13we've come in here into cloudera and i'm
- 00:22:15going to access this in two ways uh the
- 00:22:18first one is we're going to use hue and
- 00:22:20i'm going to open up hue and i'll take
- 00:22:22it a moment to load from the setup on
- 00:22:24here and hue is nice if i go in and use
- 00:22:28hue as an editor into hive or into the
- 00:22:31hadoop setup usually i'm doing it as a
- 00:22:34from an admin side because it has a lot
- 00:22:37more information a lot of visuals less
- 00:22:39to do with you know actually diving in
- 00:22:41there and just executing code and you
- 00:22:43can also write this code into files and
- 00:22:45scripts and there's other things you can
- 00:22:47otherwise you can upload it into hive
- 00:22:49but today we're going to look at the
- 00:22:50command lines and we'll upload it into
- 00:22:52hue and then we'll go into and actually
- 00:22:54do our work in a terminal window under
- 00:22:56the hive shell now in the hue browser
- 00:22:59window if you go under query and click
- 00:23:00on the pull down menu and then you go
- 00:23:02under editor and you'll see hive there
- 00:23:04we go there's our hive setup i go and
- 00:23:06click on hive and this will open up our
- 00:23:08query down here and now it has a nice
- 00:23:10little b that shows our hive going and
- 00:23:12we can go something very simple down
- 00:23:14here like show
- 00:23:16databases and we follow it with the
- 00:23:18semicolon and that's the standard in
- 00:23:21hive is you always add our
- 00:23:23punctuation at the end there and i'll go
- 00:23:25ahead and run this and the query will
- 00:23:26show up underneath and you'll see down
- 00:23:28here since this is a new quick start i
- 00:23:30just put on here you'll see it has the
- 00:23:32default down here for the databases
- 00:23:35that's the database name i haven't
- 00:23:37actually created any databases on here
- 00:23:38and then there's a lot of other like
- 00:23:40assistant function tables
- 00:23:42your databases up here there's all kinds
- 00:23:45of things you can research you can look
- 00:23:46at through hue as far as a bigger
- 00:23:49picture the downside of this is it
- 00:23:51always seems to lag for me whenever i'm
- 00:23:53doing this i always seem to run slow so
- 00:23:55if you're in cloudera you can open up a
- 00:23:57terminal window they actually have an
- 00:23:59icon at the top you can also go under
- 00:24:01applications and under applications
- 00:24:03system tools and terminal either one
- 00:24:05will work it's just a regular terminal
- 00:24:07window and this terminal window is now
- 00:24:09running underneath our linux so this is
- 00:24:11a linux terminal window or on our
- 00:24:13virtual machine which is resting on our
- 00:24:16regular windows 10 machine and we'll go
- 00:24:18ahead and zoom this in so you can see
- 00:24:19the text better on your own video and i
- 00:24:22simply just clicked on view and zoom in
- 00:24:24and then all we have to do is type in
- 00:24:26hive and this will open up the shell on
- 00:24:28here and it takes it just a moment to
- 00:24:30load when starting up hive i also want
- 00:24:33to note that depending on your rights on
- 00:24:36the computer you're on in your action
- 00:24:37you might have to do sudo hive and put
- 00:24:39in your password and username most
- 00:24:41computers are usually set up with the
- 00:24:43hive login again it just depends on how
- 00:24:45you're accessing the linux system and
- 00:24:47the hive shell once we're in here we can
- 00:24:49go ahead and do a simple uh hql command
- 00:24:52show databases and if we do that we'll
- 00:24:55see here that we don't have any
- 00:24:56databases so we can go ahead and create
- 00:24:58a database and we'll just call it office
- 00:25:01for today for this moment now if i do
- 00:25:03show we'll just do the up arrow up arrow
- 00:25:06is a hotkey that works in both linux and
- 00:25:08in hive so i can go back and page
- 00:25:10through all the commands i've typed in
- 00:25:11and we can see now that i have my
- 00:25:14there's of course a default database and
- 00:25:16then there's the office database so now
- 00:25:18we've created a database it's pretty
- 00:25:19quick and easy and we can go ahead and
- 00:25:21drop the database we can do drop
- 00:25:23database
- 00:25:24office now this will work on this
- 00:25:26database because it's empty if your
- 00:25:28database was not empty you would have to
- 00:25:30do cascade and that drops all the tables
- 00:25:33in the database and the database itself
- 00:25:36now if we do show database and we'll go
- 00:25:39ahead and recreate our database because
- 00:25:40we're going to use the office database
- 00:25:42for the rest of this hands-on demo a
- 00:25:45really handy command to now
- 00:25:47set with the sql or hql is to use office
- 00:25:51and what that does is that sets office
- 00:25:54as a default database so instead of
- 00:25:56having to reference the database every
- 00:25:59time we work with a table it now
- 00:26:01automatically assumes that's the
- 00:26:02database being used whatever tables
- 00:26:04we're working on the difference is you
- 00:26:06put the database name period table and
- 00:26:08i'll show you in just a minute what that
- 00:26:09looks like and how that's different if
- 00:26:11we're going to have a table and a
- 00:26:13database we should probably load some
- 00:26:14data into it so let me go ahead and
- 00:26:16switch gears here and open up a terminal
- 00:26:19window you can just open another
- 00:26:20terminal window and it'll open up right
- 00:26:21on top of the one that you have hive
- 00:26:23shell running in and when we're in this
- 00:26:25terminal window first we're going to go
- 00:26:27ahead and just do a list which is of
- 00:26:28course a linux command you can see all
- 00:26:30the files i have in here this is the
- 00:26:32default load we can change directory to
- 00:26:34documents we can list in documents and
- 00:26:38we're actually going to be looking at
- 00:26:40employee.csv a linux command is the cat
- 00:26:44you can use this actually to combine
- 00:26:45documents there's all kinds of things
- 00:26:46that cat does but if we want to just
- 00:26:48display the contents of our employee.csv
- 00:26:52file we can simply do cat employee csv
- 00:26:55and when we're looking at this we want
- 00:26:57to know a couple things one there's a
- 00:27:00line at the top okay so the very first
- 00:27:02thing we notice is that we have a header
- 00:27:04line the next thing we notice is that
- 00:27:06the data is comma separated and in this
- 00:27:09particular case you'll see a space here
- 00:27:12generally with these you've got to be
- 00:27:13real careful with spaces there's all
- 00:27:15kinds of things you've got to watch out
- 00:27:16for because they can cause issues these
- 00:27:18bases won't because these are all
- 00:27:20strings that the space is connected to
- 00:27:22if this was a space next to the integer
- 00:27:24you would get a null value that comes
- 00:27:26into the database without doing
- 00:27:27something extra in there now with most
- 00:27:29of hadoop that's important to know that
- 00:27:31you're writing the data once reading it
- 00:27:33many times and that's true of almost all
- 00:27:36your hadoop things coming in so you
- 00:27:38really want to process the data before
- 00:27:40it gets into the database and for those
- 00:27:43who of you have studied data
- 00:27:45transformation that's the etl where you
- 00:27:47extract transform and then load the
- 00:27:51data so you really want to extract and
- 00:27:53transform before putting it into the
- 00:27:54hive then you load it into the hive with
- 00:27:56the transformed data and of course we
- 00:27:58also want to note the schema we have an
- 00:28:00integer string string integer integer so
- 00:28:03we kept it pretty simple in here as far
- 00:28:04as the way the data is set up the last
- 00:28:06thing that you're going to want to look
- 00:28:07up
- 00:28:08is the source since we're doing local
- 00:28:11uploads we want to know what the path is
- 00:28:13we have the whole path in this case it's
- 00:28:15home slash cloudera slash documents and
- 00:28:18these are just text documents we're
- 00:28:20working with right now we're not doing
- 00:28:21anything fancy so we can do a simple
- 00:28:24gedit employee.csv
- 00:28:26and you'll see it comes up here it's
- 00:28:28just a text document so i can easily
- 00:28:30remove these added spaces there we go
- 00:28:33and then we go and just save it and so
- 00:28:35now it has a new setup in there we've
- 00:28:36edited it the g edit is usually one of
- 00:28:39the default that loads into linux so any
- 00:28:42text editor will do back to the hive
- 00:28:44shell so let's go ahead and create a
- 00:28:46table employee and what i want you to
- 00:28:48note here is i did not put the semicolon
- 00:28:51on the end here semicolon tells it to
- 00:28:53execute that line so this is kind of
- 00:28:55nice if you're you can actually just
- 00:28:57paste it in if you have it written on
- 00:28:58another sheet and you can see right here
- 00:29:00where i have create table employee and
- 00:29:02it goes into the next line on there so i
- 00:29:04can do all of my commands at once now
- 00:29:07just so i don't have any typo errors i
- 00:29:08went ahead and just pasted the next
- 00:29:10three lines in and the next one is our
- 00:29:13schema if you remember correctly from
- 00:29:15the other side we had the different
- 00:29:17values in here which was id
- 00:29:19name department year of joining and
- 00:29:22salary and the id is an integer name is
- 00:29:25a string department string year of joining
- 00:29:26an integer salary an integer and they're in
- 00:29:29brackets we put close brackets around
- 00:29:31them and you could do this all as one
- 00:29:32line and then we have row format
- 00:29:34delimited fields terminated by comma and
- 00:29:37this is important because the default is
- 00:29:40tabs so if i do it now it won't find any
- 00:29:42terminated fields so you'll get a bunch
- 00:29:44of null values loaded into your table
- 00:29:47and then finally our table properties we
- 00:29:49want to skip the header line count
- 00:29:51equals 1. now this is a lot of work for
- 00:29:54uploading a single file it's kind of
- 00:29:56goofy when you're uploading a single
- 00:29:57file that you have to put all this in
- 00:29:59here but keep in mind hive and hadoop is
- 00:30:02designed for writing many files into the
- 00:30:05database you write them all in there and
- 00:30:06then you can they're saved it's an
- 00:30:08archive it's a data warehouse and then
- 00:30:10you're able to do all your queries on
- 00:30:11them so a lot of times we're not looking
- 00:30:13at just the one file coming up we're
- 00:30:15loading hundreds of files you have your
- 00:30:18reports coming off of your main database
- 00:30:20all those reports are being loaded you
- 00:30:22have your log files you have i mean all
- 00:30:24this different data is being dumped into
- 00:30:26hadoop and in this case hive on top of
- 00:30:28hadoop and so we need to let it know hey
- 00:30:30how do i handle these files coming in
- 00:30:32and then we have the semicolon at the
- 00:30:34end which lets us know to go ahead and
- 00:30:35run this line and so we'll go ahead and
- 00:30:37run that and now if we do a show tables
- 00:30:40you can see there's our employee on
- 00:30:41there we can also describe if we do
- 00:30:44describe employee
- 00:30:46you can see that we have our id integer
- 00:30:48name string department string year of
- 00:30:51joining integer and salary integer and
- 00:30:54then finally let's just do a select star
- 00:30:56from employee very basic sql and hql
- 00:31:00command selecting data it's going to
- 00:31:02come up and we haven't put anything in
- 00:31:04it so as we expect there's no data in it
- 00:31:07so if we flip back to our
- 00:31:10linux terminal window you can see where
- 00:31:12we did the cat
- 00:31:14employee.csv and you can see all the
- 00:31:16data we expect to come into it and we
- 00:31:18also did our pwd and right here you see
- 00:31:21the path you need that full path when
- 00:31:23you are loading data you know you can do
- 00:31:26a browse and if i did it right now with
- 00:31:28just the employee.csv as a name it will
- 00:31:31work but that is a really bad habit in
- 00:31:33general when you're loading data because
- 00:31:35it's you don't know what else is going
- 00:31:36on in the computer you want to do the
- 00:31:38full path almost in all your data loads
- 00:31:40so let's go ahead and flip back over
- 00:31:42here to our hive shell we're working in
- 00:31:45and the command for this is load data so
- 00:31:48that says hey we're loading data that's
- 00:31:49a hive command hql and we want local
- 00:31:52data so you got to put down local in
- 00:31:54path so now it needs to know where the
- 00:31:56path is now to make this more legible
- 00:31:59i'm just going to go ahead and hit enter
- 00:32:00then we'll just paste the full path in
- 00:32:02there which i have stored over on the
- 00:32:04side like a good prepared demo and
- 00:32:06you'll see here we have home cloudera
- 00:32:08documents employee.csv so it's a whole
- 00:32:11path for this text document in here and
- 00:32:13we go ahead and hit enter in there and
- 00:32:15then we have to let it know where the
- 00:32:17data is going so now we have a source
- 00:32:19and we need a destination and it's going
- 00:32:20to go into the table and we'll just call
- 00:32:22it employee we'll just match the table
- 00:32:25in there and because i want it to
- 00:32:26execute we put the semicolon on the end
- 00:32:29it goes ahead and executes all three
- 00:32:31lines now if we go back if you remember
- 00:32:34we did the select star from employee
- 00:32:36just using the up arrow to page through
- 00:32:38my different commands i've already typed
- 00:32:41in you can see right here we have as we
- 00:32:43expect we have rows sam mike and nick
- 00:32:45and we have all their information
- 00:32:46showing in our four rows and then let's
- 00:32:49go ahead and do uh select
- 00:32:51and count let's look at a couple of
- 00:32:53these different select options you can
- 00:32:55do we're going to count everything from
- 00:32:57employee now this is kind of interesting
- 00:32:59because the first one just pops up with
- 00:33:01the basic select because it doesn't need
- 00:33:04to go through the full map reduce phase
- 00:33:07but when you start doing a count it does
- 00:33:09go through the full mapreduce setup in
- 00:33:12the hive in hadoop and because i'm doing
- 00:33:14this demo on a single node cloudera
- 00:33:18virtual box on top of a windows 10 all
- 00:33:21the benefits of running it on a cluster
- 00:33:23are gone and instead is now going
- 00:33:25through all those added layers so it
- 00:33:27takes longer to run you know like i said
- 00:33:30when you do a single node as i said
- 00:33:32earlier it doesn't do any good as an
- 00:33:34actual distribution because you're only
- 00:33:35running it on one computer and then
- 00:33:37you've added all these different layers
- 00:33:38to run it and we see it comes up with
- 00:33:40four and that's what we expect we have
- 00:33:41four rows we expect four at the end and
- 00:33:44if you remember from
- 00:33:45our cheat sheet which we brought up here
- 00:33:48from hortons it's a pretty good one
- 00:33:49there's all these different commands we
- 00:33:51can do we'll look at one more command
- 00:33:52where we do the
- 00:33:54what they call sub queries right down
- 00:33:56here because that's really common to do
- 00:33:58a lot of sub queries and so we'll do
- 00:34:00select
- 00:34:01star or all different columns from
- 00:34:05now if we weren't using the office
- 00:34:07database it would look like this from
- 00:34:10office dot employee and either one will
- 00:34:12work on this particular one because we
- 00:34:15have office set as a default on there so
- 00:34:18from office employee and then the
- 00:34:20command where creates a subset and in
- 00:34:23this case we want to know where the
- 00:34:24salary is greater than 25
- 00:34:28000. there we go and of course we end
- 00:34:30with our semicolon and if we run this
- 00:34:32query you can see it pops up and there's
- 00:34:34our salaries of people top earners we
- 00:34:36have rose in it and mike in hr kudos
- 00:34:39to them of course they're fictional i
- 00:34:41don't actually we don't actually have a
- 00:34:42rose and a mic in those positions or
- 00:34:44maybe we do so finally we want to go
- 00:34:46ahead and do is we're done with this
- 00:34:48table now remember you're dealing with
- 00:34:49the data warehouse so you usually don't
- 00:34:51do a lot of dropping of tables and
- 00:34:54databases but we're going to go ahead
- 00:34:56and drop this table here before we drop
- 00:34:58it one more quick note is we can change
- 00:35:01it so what we're going to do is we're
- 00:35:03going to alter table office employee and
- 00:35:06we want to go ahead and rename it
- 00:35:08there's some other commands you can do
- 00:35:09in here but rename is pretty common and
- 00:35:11we're going to rename it to
- 00:35:13and it's going to stay in office and
- 00:35:16it turns out one of our
- 00:35:17shareholders really doesn't like the
- 00:35:19word employee he wants employees plural
- 00:35:22it's a big deal to him so let's go ahead
- 00:35:24and change that name for the table it's
- 00:35:26that easy because it's just changing the
- 00:35:28metadata on there and now if we do show
- 00:35:30tables you'll see we now have employees
- 00:35:33not employee and then at this point
- 00:35:36maybe we're doing some house cleaning
- 00:35:37because this is all practice so we're
- 00:35:39going to go ahead and drop table and
- 00:35:40we'll drop table employees because we
- 00:35:43changed the name in there so if we did
- 00:35:45employee just give us an error and now
- 00:35:47if we do show tables you'll see all the
- 00:35:49tables are gone now the next thing we
- 00:35:50want to take a look at and we're going
- 00:35:52to walk back through the loading of data
- 00:35:54just real quick because we're going to
- 00:35:56load two tables in here and let me just
- 00:35:58float back to our terminal window so we
- 00:36:01can see what those tables are that we're
- 00:36:03loading and so up here we have a
- 00:36:05customer we have a customer
- 00:36:07file and we have an order file we want
- 00:36:09to go ahead and put the customers and
- 00:36:10the orders into here so those are the
- 00:36:12two we're doing and of course it's
- 00:36:13always nice to see what you're working
- 00:36:15with so let's do our cat
- 00:36:17customer.csv we could always do g edit
- 00:36:20but we don't really need to edit these
- 00:36:21we just want to take a look at the data
- 00:36:23in customer and important in here is
- 00:36:25again we have a header so we have to
- 00:36:27skip a line comma separated uh nothing
- 00:36:30odd with the data we have our schema
- 00:36:32which is uh integer string integer
- 00:36:36string integer so you know you'd want to
- 00:36:38take that note that down or flip back
- 00:36:40and forth when you're doing it and then
- 00:36:41let's go ahead and do cat order dot csv
- 00:36:44and we can see we have oid which i'm
- 00:36:47guessing is the order id we have a date
- 00:36:49up something new we've done integers and
- 00:36:51strings but we haven't done date when
- 00:36:53you're importing new and you never
- 00:36:56worked with the date date's always one
- 00:36:57of the trickier fields to import
- 00:36:59that's true of just about any
- 00:37:01scripting language i've worked with all
- 00:37:03of them have their own idea of how
- 00:37:04date's supposed to be formatted what the
- 00:37:06default is this particular format or
- 00:37:09it's year and it has all four uh digits
- 00:37:13dash month two digits dash day is the
- 00:37:16standard import for the hive so you'll
- 00:37:19have to look up and see what the
- 00:37:20different formats are if you're going to
- 00:37:22do a different format in there coming in
- 00:37:24or you're not able to pre-process the
- 00:37:25data but this would be a pre-processing
- 00:37:27of the data thing coming in if you
- 00:37:29remember correctly from our etl which
- 00:37:31is just in case you weren't able to
- 00:37:33hear me last time etl which stands for
- 00:37:37extract transform then load so you want
- 00:37:40to make sure you're transforming this
- 00:37:42data before it gets into here and so
- 00:37:44we're going to go ahead and bring
- 00:37:45both this data in here and really we're
- 00:37:47doing this so we can show you the basic
- 00:37:49join there is if you remember from our
- 00:37:51setup merge join all kinds of different
- 00:37:53things you can do but joining different
- 00:37:55data sets is so common so it's really
- 00:37:58important to know how to do this we need
- 00:37:59to go ahead and bring in these two data
- 00:38:00sets and you can see where i just
- 00:38:02created a table customer here's our
- 00:38:04schema the integer name age address
- 00:38:07salary here's our delimited by commas
- 00:38:10and our table properties where we skip a
- 00:38:12line well let's go ahead and load the
- 00:38:13data first and then we'll do that with
- 00:38:15our order and let's go ahead and put
- 00:38:17that in here and i've got it split into
- 00:38:19three lines so you can see it easily we
- 00:38:21 We've got load data local inpath, so we know we're loading data, we know it's local, and we have the path; here's the complete path for... uh oh, this is supposed to be order.csv, I grabbed the wrong one, and of course it's going to give me errors because you can't recreate the same table.
- 00:38:35 And here we go: create table, with our integer, date and customer id, the same basic setup that we had coming in here for our schema, the row format with commas, and the table properties to skip the header line. Then finally let's load the data into our order table: load data local inpath /home/cloudera/Documents/order.csv into table order.
- 00:38:56 Now, if we did everything right, we should be able to do select star from customer, and you can see we have all seven customers; then we can do select star from order, and we have four orders. So this is just a quick example.
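Pieced together, the sequence just described might look roughly like the HiveQL below. It's a reconstruction from the narration rather than a copy of the on-screen commands: the column names and types are inferred from the schemas read out earlier, the date column is renamed order_date to sidestep the DATE keyword, the order table is backtick-quoted because ORDER is reserved in some Hive versions, and the file paths follow the /home/cloudera/Documents layout mentioned in the demo.

```sql
-- Customer table: id, name, age, address, salary (integer, string, integer, string, integer).
CREATE TABLE customer (
  id      INT,
  name    STRING,
  age     INT,
  address STRING,
  salary  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");  -- skip the header row in the CSV

LOAD DATA LOCAL INPATH '/home/cloudera/Documents/customer.csv' INTO TABLE customer;

-- Order table: order id, date (yyyy-MM-dd), customer id, amount.
CREATE TABLE `order` (
  oid         INT,
  order_date  DATE,
  customer_id INT,
  amount      INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");

LOAD DATA LOCAL INPATH '/home/cloudera/Documents/order.csv' INTO TABLE `order`;

-- Quick sanity checks: seven customers, four orders.
SELECT * FROM customer;
SELECT * FROM `order`;
```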
- 00:39:11 A lot of times, when you have your customer databases in business, you have thousands of customers from years and years, and some of them move, they close their business, they change names; all kinds of things happen. So what we want to do is go ahead and find just the information connected to these orders and who's connected to them, and so let's go ahead and do a select, because we're going to display information.
- 00:39:35 Select, and this is kind of interesting: we're going to do c.id, and I'm going to define c as the customer table in just a minute; then we're going to do c.name, and again we're going to define that c, and c.age. So this means from the customer we want to know their id, their name and their age, and then I'd also like to know the order amount, so let's do o.amount.
- 00:40:01 And then this is where we need to go ahead and define what we're doing, and I'm going to capitalize FROM customer. So we're going to take the customer table in here and we're going to name it c; that's where the c comes from, that's the customer table c. And we want to join order as o; that's where our o comes from, so the o.amount is what we're joining in there. Then we want to do this ON; we've got to tell it how to connect the two tables: c.id equals o.customer_id, so now we know how they're joined.
- 00:40:31 And remember, we have seven customers in here and we have four orders, so as it processes we should get a return of four different names joined together, and they're joined, of course, based on the orders. Once we're done, we now have the order number, the person who made the order, their age, and the amount of the order, which came from the order table. So you have your different information and you can see how the join works here; it's a very common use of tables in HQL and SQL.
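Written out as a single statement, the join being described might look like this, following the table and column names reconstructed above (so treat them as assumptions):

```sql
-- Inner join: only customers with a matching order come back, so with
-- seven customers and four orders we expect four rows in the result.
SELECT c.id, c.name, c.age, o.amount
FROM customer c
JOIN `order` o
  ON c.id = o.customer_id;
```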
- 00:40:59 Let's do one more thing with our database, and then I'll show you a couple of other Hive commands. Let's go ahead and do a drop: we're going to drop database office.
- 00:41:12 If you're looking at this and you remember from earlier, this will give me an error, and just to see what that looks like, it says failed to execute, exception: one or more tables exist. So if you remember from before, you can't just drop a database unless you tell it to cascade; that lets it know, I don't care how many tables are in it, let's get rid of it.
- 00:41:33 And in Hadoop, since it's a warehouse, a data warehouse, you usually don't do a lot of dropping. Maybe at the beginning, when you're developing the schemas and you realize you messed up, you might drop some stuff, but down the road you're really just adding commodity machines so you can store more on it, so you usually don't do a lot of database dropping.
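For reference, a minimal sketch of the two statements being contrasted here, assuming the database from earlier in the demo is named office:

```sql
-- Fails with "one or more tables exist" while the database still holds tables.
DROP DATABASE office;

-- Drops the database and everything in it, however many tables it contains.
DROP DATABASE office CASCADE;
```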
- 00:41:51 Some other fun commands to know: you can do select round 2.3 as round_value, so you can do a round-off in Hive. We can also do it as floor_value, which is going to give us a two, so it turns it into an integer rather than a float; it basically truncates it, but it always goes down. And we can also do ceiling, which is going to round it up, so we're looking for the next integer above.
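As a quick sketch, those three calls might look like this in the Hive shell; the column aliases are just for illustration:

```sql
SELECT ROUND(2.3) AS round_value;    -- 2, rounds to the nearest integer
SELECT FLOOR(2.3) AS floor_value;    -- 2, always rounds down to the integer below
SELECT CEIL(2.3)  AS ceiling_value;  -- 3, always rounds up to the integer above
```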
- 00:42:14 There are a few commands we didn't show in here because we're on a single node: as an admin, to help expedite the process, you usually add in partitions and buckets for the data, and you can't really do that on a single node, because when you add a partition it partitions the data across separate nodes. But beyond that, you can see it's very straightforward: we have SQL coming in, and all your basic queries that are in SQL are very similar in HQL.
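Although it isn't demonstrated on this single-node setup, a partitioned and bucketed version of the order table might be defined roughly as below; the partition column, bucket count, and table name are made up purely for illustration:

```sql
-- Partitioning groups rows into separate directories (one per order_year), and
-- bucketing hashes customer_id into a fixed number of files, which can speed up
-- joins and sampling once the data is spread over a multi-node cluster.
CREATE TABLE order_partitioned (
  oid         INT,
  order_date  DATE,
  customer_id INT,
  amount      INT
)
PARTITIONED BY (order_year INT)
CLUSTERED BY (customer_id) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```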
- 00:42:42 Key takeaways: we took a look at the history of Hive and how it evolved on top of the Hadoop file system into HQL, a SQL-like layer over Hadoop, with the full metastore and all that information connected to the Hadoop file system. That way you can easily scale it up and still have the Hadoop setup underneath, while still having the SQL-style query language available. We looked at what Hive is, and we looked at Hive queries going in through MapReduce, into the Hadoop MapReduce system.
- 00:43:13 We dug a little deeper into the architecture of Hive and all the different pieces and how they fit together, including the fact that it has the Hive client, and you have your Thrift applications, your JDBC applications and your ODBC applications. We have the Hive web interface, which we looked at in a demo, along with the CLI, the command line interface, which we used for most of the demo; again, this is how you get these commands into Hive.
- 00:43:41 The Hive web interface is great for, say, the stakeholders you're working with: some of them are not technically literate, and even if they are, it's a quick way to look up data; I'm not beyond jumping into the Hue web interface and looking something up instead of using a terminal window. Or, if you're running the back end, running a program, whether it's Python or Java, you have the Thrift applications which connect in, so you can extend it; all the major scripting languages usually have their own plugin for sending that information over to our Hadoop file system.
- 00:44:10 We also dug deeper into the data flow in Hive: the user interface and the different steps a query takes to go through and get in and out of the MapReduce system.
- 00:44:21 We took a glance at Hive data types as well, and certainly these are always evolving, so it's good to look at the Apache website, especially with the new stuff coming up underneath Beehive, which is also Hive; if you see that, don't let it scare you, it's just the beta version coming up with the new updates.
- 00:44:35 And then we looked at the features of Hive and how it works as an ecosystem.
- 00:44:42 With that, I'd like to thank you for joining us today for Hadoop Hive. Again, my name is Richard Kirschner with the Simplilearn team; for more information, visit us at www.simplilearn.com.
- 00:44:53 You can also post comments here on the YouTube channel; we do have moderators that monitor it, and we will certainly respond to your postings. Again, thank you for joining us today. Get certified, get ahead.
- 00:45:09 Hi there. If you like this video, subscribe to the Simplilearn YouTube channel and click here to watch similar videos. Turn it up and get certified: click here.
- Hive
- HQL
- HiveQL
- Data Warehouse
- Hadoop
- Metastore
- HDFS
- Big Data
- SQL
- Data Types