Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Hadoop | Simplilearn

00:45:20
https://www.youtube.com/watch?v=rr17cbPGWGA

Summary

TLDR: This video tutorial introduces Hive, a data warehouse system that simplifies querying large datasets in the Hadoop ecosystem using HiveQL, a SQL-like language. It begins with a historical overview of Hive's development, from its origins in Facebook's Hadoop-based data platform to its widespread adoption. The architecture is explained, detailing key components like the Hive clients, drivers (JDBC, ODBC), and the Metastore. Data modeling concepts such as tables, partitions, and buckets are discussed, alongside Hive's data types. It then contrasts Hive with traditional RDBMS, highlighting operational differences such as schema enforcement and scaling capabilities. The video concludes with a live demo showcasing Hive commands and functionality within the Cloudera Hadoop setup.

Key Takeaways

  • 📚 Hive simplifies querying large data sets with HQL!
  • 🛠️ It originated from Facebook's needs to manage big data.
  • 🏗️ Hive's architecture includes clients, servers, and a metastore.
  • 📊 Data modeling in Hive involves tables, partitions, and buckets.
  • ⚙️ Hive operates in local and MapReduce modes based on data size.
  • 💾 Supports both primitive and complex data types for flexibility.
  • 🏆 Hive is not a database, it's a data warehouse for analysis.
  • 📈 Easily scalable at a lower cost compared to RDBMS.
  • 💡 The Hive metastore manages metadata for efficient data retrieval.
  • 📝 Hands-on demo shows how to execute Hive commands effectively.

Timeline

  • 00:00:00 - 00:05:00

    The tutorial, presented by Richard Kirschner from Simplilearn, introduces Hive. It covers Hive's history, architecture, and features, plus a hands-on demo of Hive on the Cloudera Hadoop file system. It opens with the motivation for Hive: writing Java MapReduce code for data processing was too complex for many users, which led to the development of HiveQL, a SQL-like query language for querying large datasets.

  • 00:05:00 - 00:10:00

    Hive was developed at Facebook to manage substantial data using Hadoop. The tutorial explains that Hive uses a SQL-like language for ease of use, facilitating querying and analysis of large datasets stored in HDFS. It outlines Hive's role as a data warehouse which translates user queries into MapReduce tasks for execution.

  • 00:10:00 - 00:15:00

    The architecture of Hive is detailed, including components like the Hive client, Thrift applications, JDBC, and ODBC drivers. It describes how the Hive server processes queries and the role of the Hive driver in compiling and executing tasks, as well as the Metastore for storing table metadata.

  • 00:15:00 - 00:20:00

    The data flow within Hive is elaborated, highlighting the interaction between the user interface, compiler, execution engine, and HDFS. The explanation covers how queries are executed and how metadata is managed, providing insights into efficient data retrieval processes.

  • 00:20:00 - 00:25:00

    Hive data modeling is explained, focusing on the structure of tables, partitions for grouping similar data, and buckets for efficient querying (see the HiveQL sketch following this timeline). The importance of designing efficient schemas to improve query performance is emphasized.

  • 00:25:00 - 00:30:00

    Hive data types are categorized into primitive and complex types, mirroring SQL data types. Primitive types include numerical and string data, while complex types allow for the storage of arrays and maps, crucial for advanced data analytics.

  • 00:30:00 - 00:35:00

    Hive operates in two modes: local mode for small datasets on a single data node, and MapReduce mode for larger datasets across multiple data nodes, emphasizing the scalability of Hive for big data applications.

  • 00:35:00 - 00:40:00

    Differences between Hive and traditional RDBMS are outlined, including Hive's schema-on-read approach versus RDBMS's schema-on-write, and the capacity for handling petabytes of data in Hive as opposed to terabytes in RDBMS, showcasing Hive's efficiency in data warehousing.

  • 00:40:00 - 00:45:20

    The last segment discusses Hive's features, including the use of HiveQL, its SQL-like interface, simultaneous querying by multiple users, support for various data types, and the importance of scalability and cost-effectiveness in a big data context. It concludes with a live demo of HiveQL commands in the Cloudera environment.
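
As flagged in the 00:20:00 data modeling entry above, here is a minimal HiveQL sketch of tables, partitions, and buckets; the table and column names are illustrative and not taken from the video:

    -- Hypothetical table: partitioned by joining year, bucketed by id.
    CREATE TABLE employee_history (
      id     INT,
      name   STRING,
      salary INT
    )
    PARTITIONED BY (year_of_joining INT)
    CLUSTERED BY (id) INTO 4 BUCKETS
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- A filter on the partition key reads only the matching partition
    -- directories instead of scanning the whole table.
    SELECT name, salary
    FROM employee_history
    WHERE year_of_joining = 2012;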

Video Q&A

  • What is Hive?

    Hive is a data warehouse system for querying and analyzing large datasets stored in the Hadoop file system (HDFS), using a SQL-like query language known as HiveQL or HQL.

  • How does Hive differ from traditional RDBMS?

    Hive enforces schema on read, while RDBMS enforces schema on write. Hive is designed for large data in petabytes, whereas RDBMS typically manages data in terabytes.

  • What are the main modes of Hive?

    Hive operates in local mode for small datasets on a single data node, and MapReduce mode for processing larger data sets across multiple data nodes.

  • What is the Hive Metastore?

    The Metastore is a repository for Hive metadata, which contains information about the structure of tables and their schemas.

  • Can Hive handle complex data types?

    Yes, Hive supports both primitive (e.g., integers, strings) and complex data types (e.g., arrays, maps, structs).
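
For reference, a short HiveQL sketch of how the primitive and complex types discussed above appear in a table definition; the table and field names are illustrative only:

    CREATE TABLE employee_profile (
      id        INT,                            -- primitive: numeric
      name      STRING,                         -- primitive: string
      hired_on  TIMESTAMP,                      -- primitive: date/time
      active    BOOLEAN,                        -- primitive: miscellaneous
      skills    ARRAY<STRING>,                  -- complex: list of like values
      phones    MAP<STRING, STRING>,            -- complex: key-value pairs
      address   STRUCT<city:STRING, zip:INT>,   -- complex: named fields
      extra     UNIONTYPE<INT, STRING>          -- complex: one of several types
    );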

Subtitles (en)

  • 00:00:03
    hello and welcome to hive tutorial my
  • 00:00:06
    name is richard kirschner with the
  • 00:00:07
    simplilearn team that is
  • 00:00:10
    www.simplilearn.com get certified get
  • 00:00:13
    ahead what's in it for you today in our
  • 00:00:15
    hive tutorial first we're going to start
  • 00:00:17
    with the history of hive what is hive
  • 00:00:20
    architecture of hive data flow and hive
  • 00:00:23
    hive data modeling hive data types
  • 00:00:26
    different modes of hive and difference
  • 00:00:28
    between hive and rdbms finally we're
  • 00:00:32
    going to look into the features of hive
  • 00:00:34
    and do a quick hands-on demo on hive in
  • 00:00:37
    the cloudera hadoop file system let's
  • 00:00:39
    dive in with a brief history of hive so
  • 00:00:42
    the history of hive begins with facebook
  • 00:00:44
    facebook began using hadoop as a
  • 00:00:46
    solution to handle the growing big data
  • 00:00:49
    and we're not talking about a data that
  • 00:00:50
    fits on one or two or even five
  • 00:00:52
    computers we're talking bigger than that
  • 00:00:55
    if you've looked at any of our other
  • 00:00:56
    hadoop tutorials you'll know we're
  • 00:00:58
    talking about very big data and data
  • 00:01:00
    pools and facebook certainly has a lot
  • 00:01:02
    of data it tracks as we know the hadoop
  • 00:01:05
    uses mapreduce for processing data
  • 00:01:08
    mapreduce required users to write long
  • 00:01:10
    codes and so you'd have these really
  • 00:01:12
    extensive java codes very complicated
  • 00:01:14
    for the average person to use not all
  • 00:01:16
    users were versed in java and other
  • 00:01:18
    coding languages this proved to be a
  • 00:01:20
    disadvantage for them users were
  • 00:01:22
    comfortable with writing queries in sql
  • 00:01:24
    sql has been around for a long time the
  • 00:01:26
    standard
  • 00:01:27
    sql query language hive was developed
  • 00:01:30
    with the vision to incorporate the
  • 00:01:32
    concepts of tables columns just like sql
  • 00:01:35
    so why hive well the problem was for
  • 00:01:38
    processing and analyzing data users
  • 00:01:40
    found it difficult to code as not all of
  • 00:01:42
    them were well versed with the coding
  • 00:01:44
    languages you have your processing ever
  • 00:01:46
    analyzing and so the solution
  • 00:01:48
    required a language similar to sql which
  • 00:01:51
    was well known to all the users and thus
  • 00:01:53
    the hive or hql language evolved what is
  • 00:01:57
    hive hive is a data warehouse system
  • 00:02:00
    which is used for querying and analyzing
  • 00:02:02
    large data sets stored in the hdfs or
  • 00:02:04
    the hadoop file system hive uses a query
  • 00:02:07
    language that we call hive ql or hql
  • 00:02:10
    which is similar to sql so if we take
  • 00:02:13
    our user the user sends out their hive
  • 00:02:16
    queries and then that is converted into
  • 00:02:18
    a mapreduce tasks and then accesses the
  • 00:02:21
    hadoop mapreduce system let's take a
  • 00:02:23
    look at the architecture of hive
  • 00:02:25
    architecture of hive we have the hive
  • 00:02:27
    client
  • 00:02:28
    so that could be the programmer or maybe
  • 00:02:30
    it's a manager who knows enough sql to
  • 00:02:32
    do a basic query to look up the data
  • 00:02:34
    they need the hive client supports
  • 00:02:36
    different types of client applications
  • 00:02:38
    in different languages for
  • 00:02:40
    performing queries and so we have our
  • 00:02:42
    thrift application in the hive thrift
  • 00:02:44
    client thrift is a software framework
  • 00:02:47
    hive server is based on thrift so it can
  • 00:02:50
    serve the request from all programming
  • 00:02:51
    language that support thrift and then we
  • 00:02:54
    have our jdbc application and the hive
  • 00:02:57
    jdbc driver jdbc java database
  • 00:03:01
    connectivity jdbc application is
  • 00:03:04
    connected through the jdbc driver and
  • 00:03:06
    then you have the odbc application or
  • 00:03:08
    the hive odbc driver the odbc or open
  • 00:03:13
    database connectivity the odbc
  • 00:03:15
    application is connected through the
  • 00:03:17
    odbc driver with the growing development
  • 00:03:20
    of all of our different scripting
  • 00:03:21
    languages python c plus plus spark
  • 00:03:24
    java you can find just about any
  • 00:03:26
    connection in any of the main scripting
  • 00:03:28
    languages and so we have our hive
  • 00:03:30
    services as we look at deeper into the
  • 00:03:33
    architecture hive supports various
  • 00:03:35
    services
  • 00:03:36
    so you have your hive server basically
  • 00:03:38
    your thrift application or your hive
  • 00:03:40
    thrift client or your jdbc or your hive
  • 00:03:42
    jdbc driver your odbc application or
  • 00:03:45
    your hive odbc driver they all connect
  • 00:03:47
    into the hive server and you have your
  • 00:03:49
    hive web interface you also have your
  • 00:03:52
    cli now the hive web interface is a gui
  • 00:03:55
    is provided to execute hive queries and
  • 00:03:58
    we'll actually be using that later on
  • 00:04:00
    today so you can see kind of what that
  • 00:04:02
    looks like and get a feel for what that
  • 00:04:04
    means commands are executed directly in
  • 00:04:06
    cli and then the cli is a direct
  • 00:04:09
    terminal window and i'll also show you
  • 00:04:12
    that too so you can see how those two
  • 00:04:14
    different interfaces work these then
  • 00:04:16
    push the code into the hive driver hive
  • 00:04:18
    driver is responsible for all the
  • 00:04:20
    queries submitted so everything goes
  • 00:04:22
    through that driver let's take a closer
  • 00:04:23
    look at the hive driver the hive driver
  • 00:04:25
    now performs three steps internally one
  • 00:04:28
    is a compiler hive driver passes query
  • 00:04:31
    to compiler where it is checked and
  • 00:04:32
    analyzed then the optimizer kicks in and
  • 00:04:35
    the optimize logical plan in the form of
  • 00:04:37
    a graph of mapreduce and hdfs tasks is
  • 00:04:40
    obtained and then finally in the
  • 00:04:42
    executor in the final step the tasks are
  • 00:04:45
    executed we look at the architecture we
  • 00:04:47
    also have to note the meta store
  • 00:04:49
    metastore is a repository for hive
  • 00:04:51
    metadata stores metadata for hive tables
  • 00:04:54
    and you can think of this as your schema
  • 00:04:56
    and where is it located and it's stored
  • 00:04:58
    on the apache derby db processing and
  • 00:05:00
    resource management is all handled by
  • 00:05:02
    the mapreduce v1 you'll see mapreduce v2
  • 00:05:06
    the yarn and the tez these are all
  • 00:05:08
    different ways of managing these
  • 00:05:10
    resources depending on what version of
  • 00:05:11
    hadoop you're in hive uses mapreduce
  • 00:05:13
    framework to process queries and then we
  • 00:05:16
    have our distributed storage which is
  • 00:05:18
    the hdfs and if you looked at our hadoop
  • 00:05:21
    tutorials you'll know that these are on
  • 00:05:23
    commodity machines and are linearly
  • 00:05:25
    scalable that means they're very
  • 00:05:27
    affordable a lot of time when you're
  • 00:05:28
    talking about big data you're talking
  • 00:05:30
    about a tenth of the price of storing it
  • 00:05:32
    on enterprise computers and then we look
  • 00:05:34
    at the data flow and hive
  • 00:05:36
    so in our data flow and hive we have our
  • 00:05:38
    hive in the hadoop system and underneath
  • 00:05:40
    the user interface or the ui we have our
  • 00:05:43
    driver our compiler our execution engine
  • 00:05:45
    and our meta store that all goes into
  • 00:05:47
    the mapreduce and the hadoop file system
  • 00:05:50
    so when we execute a query you see it
  • 00:05:52
    coming in here it goes into the driver
  • 00:05:54
    step one step two we get a plan what are
  • 00:05:56
    we going to do refers to the query
  • 00:05:58
    execution uh then we go to the metadata
  • 00:06:00
    it's like well what kind of metadata are
  • 00:06:02
    we actually looking at where is this
  • 00:06:03
    data located what is the schema on it
  • 00:06:06
    then this comes back with the metadata
  • 00:06:08
    into the compiler then the compiler
  • 00:06:10
    takes all that information and sends the
  • 00:06:13
    plan back to the driver the driver
  • 00:06:15
    then sends the execute plan to the
  • 00:06:17
    execution engine once it's in the
  • 00:06:19
    execution engine the execution engine
  • 00:06:22
    acts as a bridge between hive and hadoop
  • 00:06:25
    to process the query and that's going
  • 00:06:27
    into your mapreduce in your hadoop file
  • 00:06:29
    system or your hdfs and then we come
  • 00:06:32
    back with the metadata operations it
  • 00:06:34
    goes back into the metastore to update
  • 00:06:37
    or let it know what's going on which
  • 00:06:38
    also goes to the between it's a
  • 00:06:41
    communication between the execution
  • 00:06:42
    engine and the metastore execution
  • 00:06:44
    engine communications is
  • 00:06:46
    bi-directionally with the metastore to
  • 00:06:48
    perform operations like create drop
  • 00:06:51
    tables metastore stores information
  • 00:06:54
    about tables and columns so again we're
  • 00:06:56
    talking about the schema of your
  • 00:06:57
    database and once we have that we have a
  • 00:06:59
    bi-directional
  • 00:07:01
    send results communication back into the
  • 00:07:03
    driver and then we have the fetch
  • 00:07:05
    results which goes back to the client so
  • 00:07:07
    let's take a little bit look at the hive
  • 00:07:09
    data modeling hive data modeling so you
  • 00:07:12
    have your high data modeling you have
  • 00:07:13
    your tables you have your partitions and
  • 00:07:15
    you have buckets the tables in hive are
  • 00:07:18
    created the same way it is done in rdbms
  • 00:07:21
    so when you're looking at your
  • 00:07:22
    traditional sql server or mysql server
  • 00:07:25
    where you might have enterprise
  • 00:07:27
    equipment and a lot of
  • 00:07:29
    people pulling and moving stuff off of
  • 00:07:31
    there the tables are gonna look very
  • 00:07:32
    similar and this makes it very easy to
  • 00:07:34
    take that information and let's say you
  • 00:07:36
    need to keep current information but you
  • 00:07:39
    need to store all of your years of
  • 00:07:41
    transactions back into the hadoop hive
  • 00:07:44
    so you match those those all kind of
  • 00:07:46
    look the same the tables are the same
  • 00:07:48
    your databases look very similar and you
  • 00:07:50
    can easily import them back you can
  • 00:07:51
    easily store them into the hive system
  • 00:07:54
    partitions here tables are organized
  • 00:07:56
    into partitions for grouping same type
  • 00:07:58
    of data based on partition key this can
  • 00:08:01
    become very important for speeding up
  • 00:08:04
    the process of doing queries so if
  • 00:08:06
    you're looking at dates as far as like
  • 00:08:08
    your employment dates of employees if
  • 00:08:10
    that's what you're tracking you might
  • 00:08:12
    add a partition there because that might
  • 00:08:13
    be one of the key things that you're
  • 00:08:14
    always looking up as far as employees
  • 00:08:16
    are concerned and finally we have
  • 00:08:18
    buckets uh data present in partitions
  • 00:08:20
    can be further divided into buckets for
  • 00:08:22
    efficient querying again there's that
  • 00:08:24
    efficiency at this level a lot of times
  • 00:08:27
    you're taught you're working with the
  • 00:08:29
    programmer and the admin of your hadoop
  • 00:08:32
    file system to maximize the efficiency
  • 00:08:35
    of that file system so it's usually a
  • 00:08:37
    two-person job and we're talking about
  • 00:08:39
    hive data modeling you want to make sure
  • 00:08:41
    that they work together and you're
  • 00:08:43
    maximizing your resources hive data
  • 00:08:46
    types so we're talking about hive data
  • 00:08:48
    types we have our primitive data types
  • 00:08:50
    and our complex data types a lot of this
  • 00:08:53
    will look familiar because it mirrors a
  • 00:08:55
    lot of stuff in sql in our primitive
  • 00:08:57
    data types we have the numerical data
  • 00:08:59
    types string data type date time data
  • 00:09:02
    type and
  • 00:09:03
    miscellaneous data type and these should
  • 00:09:05
    be very they're kind of self-explanatory
  • 00:09:07
    but just in case numerical data is your
  • 00:09:10
    floats your integers your short integers
  • 00:09:12
    all of that numerical data comes in as a
  • 00:09:14
    number a string of course is characters
  • 00:09:17
    and numbers and then you have your date
  • 00:09:19
    time stamp and then we have kind of a
  • 00:09:21
    general way of pulling your own created
  • 00:09:23
    data types in there that's your
  • 00:09:25
    miscellaneous data type and we have
  • 00:09:27
    complex data types so you can store
  • 00:09:29
    arrays you can store maps you can store
  • 00:09:32
    structures uh and even unions in there as
  • 00:09:34
    we dig into hive data types
  • 00:09:37
    and we have the primitive data types and
  • 00:09:39
    the complex data types so we look at
  • 00:09:41
    primitive data types and we're looking
  • 00:09:42
    at numeric data types data types like an
  • 00:09:45
    integer a float a decimal those are all
  • 00:09:47
    stored as numbers in the hive data
  • 00:09:50
    system a string data type data types
  • 00:09:52
    like characters and strings you store
  • 00:09:54
    the name of the person you're working
  • 00:09:55
    with uh you know john doe the city
  • 00:09:58
    memphis the state tennessee maybe it's
  • 00:10:02
    boulder colorado usa or maybe it's
  • 00:10:05
    hyderabad
  • 00:10:06
    india that's all going to be string and
  • 00:10:08
    stored as a string character and of
  • 00:10:10
    course we have our date time data type
  • 00:10:12
    data types like timestamp date interval
  • 00:10:15
    those are very common as far as tracking
  • 00:10:18
    sales anything like that you just think
  • 00:10:19
    if you can type a stamp of time on it or
  • 00:10:22
    maybe you're dealing with a race and you
  • 00:10:23
    want to know the interval how long did
  • 00:10:24
    the person take to complete whatever
  • 00:10:26
    task it was all that is date time data
  • 00:10:29
    type and then we talk miscellaneous data
  • 00:10:31
    type these are like boolean and binary
  • 00:10:34
    and when you get into boolean and binary
  • 00:10:35
    you can actually almost create anything
  • 00:10:37
    in there but your yes/no zero one now
  • 00:10:39
    let's take a look at complex data types
  • 00:10:41
    a little closer uh we have arrays so
  • 00:10:44
    your syntax is of data type and it's an
  • 00:10:46
    array and you can just think of an array
  • 00:10:48
    as a collection of same
  • 00:10:51
    entities one two three four if they're
  • 00:10:53
    all numbers and you have maps this is a
  • 00:10:55
    collection of key value pairs
  • 00:10:58
    so understanding maps is so central to
  • 00:11:01
    hadoop uh so when we store maps you have
  • 00:11:03
    a key which is a set you can only have
  • 00:11:05
    one key per mapped value and so you in
  • 00:11:08
    hadoop of course you collect uh the same
  • 00:11:10
    keys and you can add them all up or do
  • 00:11:12
    something with all the contents of the
  • 00:11:13
    same key but this is our map as a
  • 00:11:16
    primitive type data type in our
  • 00:11:18
    collection of key value pairs and then
  • 00:11:20
    collection of complex data with comment
  • 00:11:23
    so we can have a structure we have a
  • 00:11:24
    column name data type comment call a
  • 00:11:27
    column comment so you can get very
  • 00:11:29
    complicated structures in here with your
  • 00:11:31
    collection of data and your commented
  • 00:11:33
    setup and then we have unions and this is
  • 00:11:35
    a collection of heterogeneous data types
  • 00:11:38
    so the syntax for this is union type
  • 00:11:40
    data type data type and so on so it's
  • 00:11:43
    all going to be the same a little bit
  • 00:11:44
    different than the arrays where you can
  • 00:11:46
    actually mix and match different modes
  • 00:11:48
    of hive hive operates in two modes
  • 00:11:51
    depending on the number and size of data
  • 00:11:53
    nodes we have our local mode and our map
  • 00:11:56
    reduce mode when we talk about the local
  • 00:11:58
    mode it is used when hadoop is having
  • 00:12:00
    one data node and the data is small
  • 00:12:03
    processing will be very fast on a
  • 00:12:04
    smaller data sets which are present in
  • 00:12:06
    local machine and this might be that you
  • 00:12:08
    have a local file stuff you're uploading
  • 00:12:11
    into the hive and you need to do some
  • 00:12:13
    processes in there you can go ahead and
  • 00:12:14
    run those hive processes and queries on
  • 00:12:17
    it usually you don't see much in the way
  • 00:12:19
    of a single node hadoop system if you're
  • 00:12:21
    going to do that you might as well just
  • 00:12:22
    use like an sql database or even a java
  • 00:12:26
    sqlite or something python sqlite so you
  • 00:12:29
    don't really see a lot of single node
  • 00:12:31
    hadoop databases but you do see the
  • 00:12:33
    local mode in hive where you're working
  • 00:12:35
    with a small amount of data that's going
  • 00:12:37
    to be integrated into the larger
  • 00:12:39
    database and then we have the map reduce
  • 00:12:41
    mode this is used when hadoop is having
  • 00:12:44
    multiple data nodes and the data is
  • 00:12:45
    spread across various data nodes
  • 00:12:47
    processing large datasets can be more
  • 00:12:49
    efficient using this mode and this you
  • 00:12:51
    can think of instead of it being one two
  • 00:12:53
    three or even five computers we're
  • 00:12:56
    usually talking with the hadoop file
  • 00:12:58
    system we're looking at 10 computers 15
  • 00:13:01
    100 where this data is spread across all
  • 00:13:03
    those different hadoop nodes difference
  • 00:13:05
    between hive and
  • 00:13:07
    rdbms remember rdbms stands for the
  • 00:13:10
    relational database management system
  • 00:13:13
    let's take a look at the difference
  • 00:13:14
    between hive and the rdbms with hive
  • 00:13:18
    hive enforces schema on read and it's
  • 00:13:21
    very important that whatever is coming
  • 00:13:23
    in that's when hive's looking at it and
  • 00:13:24
    making sure that it fits the model
  • 00:13:27
    the rdbms enforces a schema when it
  • 00:13:30
    actually writes the data into the
  • 00:13:31
    database so it's read the data and then
  • 00:13:33
    once it starts to write it that's where
  • 00:13:35
    it's going to give you the error or tell
  • 00:13:36
    you something's incorrect about your
  • 00:13:38
    schema hive data size is in petabytes
  • 00:13:41
    that is hard to imagine you know we're
  • 00:13:43
    looking at your personal computer on
  • 00:13:45
    your desk maybe you have 10 terabytes if
  • 00:13:47
    it's a high-end computer but we're
  • 00:13:49
    talking petabytes so that's hundreds of
  • 00:13:51
    computers grouped together when a rdbms
  • 00:13:54
    data size is in terabytes very rarely do
  • 00:13:57
    you see an rdbms system that's spread
  • 00:14:00
    over more than five computers and
  • 00:14:02
    there's a lot of reasons for that with
  • 00:14:04
    the rdbms it actually has a high end
  • 00:14:07
    amount of writes to the hard drive
  • 00:14:09
    there's a lot more going on there you're
  • 00:14:10
    writing and polling stuff so you really
  • 00:14:12
    don't want to get too big with an rdbms
  • 00:14:14
    or you're gonna run into a lot of
  • 00:14:15
    problems with hive you can take it as
  • 00:14:17
    big as you want hive is based on the
  • 00:14:19
    notion of write once and read many times
  • 00:14:23
    this is so important and they call it
  • 00:14:25
    worm which is write once
  • 00:14:28
    read many times they refer to it
  • 00:14:31
    as worm and that's true of any of a lot
  • 00:14:32
    of your hadoop setup it's it's altered a
  • 00:14:35
    little bit but in general we're looking
  • 00:14:36
    at archiving data that you want to do
  • 00:14:38
    data analysis on we're looking at
  • 00:14:40
    pulling all that stuff off your rdbms
  • 00:14:42
    from years and years and years of
  • 00:14:44
    business or whatever your company does
  • 00:14:46
    or scientific research and putting that
  • 00:14:48
    into a huge data pool so that you can
  • 00:14:50
    now do queries on it and get that
  • 00:14:52
    information out of it with the rdbms
  • 00:14:55
    it's based on the notion of read and
  • 00:14:56
    write many times
  • 00:14:58
    so you're continually updating this
  • 00:14:59
    database you're continually bringing up
  • 00:15:01
    new stuff new sales
  • 00:15:03
    the account changes because they have a
  • 00:15:06
    different licensing now whatever
  • 00:15:07
    software you're selling all that kind of
  • 00:15:09
    stuff where the data is continually
  • 00:15:10
    fluctuating and then hive resembles a
  • 00:15:12
    traditional database by supporting sql
  • 00:15:15
    but it is not a database it is a data
  • 00:15:18
    warehouse this is very important it goes
  • 00:15:20
    with all the other stuff we've talked
  • 00:15:21
    about that we're not looking at a
  • 00:15:23
    database but a data warehouse to store
  • 00:15:25
    the data and still have fast and easy
  • 00:15:28
    access to it for doing queries you can
  • 00:15:31
    think of
  • 00:15:32
    twitter and facebook they have so many
  • 00:15:35
    posts that are archived back
  • 00:15:36
    historically those posts aren't going to
  • 00:15:38
    change they made the post they're posted
  • 00:15:40
    they're there and they're in their
  • 00:15:41
    database but they have to store it in a
  • 00:15:42
    warehouse in case they want to pull it
  • 00:15:44
    back up with the rdbms it's a type of
  • 00:15:47
    database management system which is
  • 00:15:49
    based on the relational model of data
  • 00:15:51
    and then with hive easily scalable at a
  • 00:15:54
    low cost again we're talking maybe a
  • 00:15:56
    thousand dollars per terabyte um the
  • 00:15:58
    rdbms is not scalable at a low cost when
  • 00:16:02
    you first start on the lower end you're
  • 00:16:03
    talking about 10 000 per terabyte of
  • 00:16:06
    data including all the backup on the
  • 00:16:08
    models and all the added necessities to
  • 00:16:10
    support it as you scale it up you have
  • 00:16:13
    to scale those computers and hardware up
  • 00:16:15
    so you might start off with a basic
  • 00:16:17
    server and then you upgrade to a sun
  • 00:16:20
    computer to run it and you spend you
  • 00:16:22
    know tens of thousands of dollars for
  • 00:16:23
    that hardware upgrade with hive you just
  • 00:16:25
    put another computer into your hadoop
  • 00:16:28
    file system so let's look at some of the
  • 00:16:29
    features of hive
  • 00:16:31
    when we're looking at the features of
  • 00:16:32
    hive we're talking about the use of sql
  • 00:16:35
    like language called hive ql a lot of
  • 00:16:37
    times you'll see that as hql which is
  • 00:16:39
    easier than long codes this is nice if
  • 00:16:42
    you're working with your shareholders
  • 00:16:44
    you come to them and you say hey you can
  • 00:16:46
    do a basic sql query on here and pull up
  • 00:16:48
    the information you need this way you
  • 00:16:50
    don't have to take off have your
  • 00:16:51
    programmers jump in every time they want
  • 00:16:53
    to look up something in the database
  • 00:16:55
    they actually now can easily do that if
  • 00:16:56
    they're not
  • 00:16:57
    skilled in programming and script
  • 00:17:00
    writing tables are used which are
  • 00:17:01
    similar to the rdbms hence easier to
  • 00:17:04
    understand and one of the things i like
  • 00:17:06
    about this is when i'm bringing tables
  • 00:17:08
    in from a mysql server or sql server
  • 00:17:10
    there's almost a direct reflection
  • 00:17:12
    between the two so when you're looking
  • 00:17:13
    at one which is the data which is
  • 00:17:15
    continually changing and then you're
  • 00:17:16
    going into the archive database it's not
  • 00:17:18
    this huge jump where you have to learn a
  • 00:17:20
    whole new language
  • 00:17:22
    you mirror that same schema into the
  • 00:17:24
    hdfs into the hive making it very easy
  • 00:17:27
    to go between the two and then using
  • 00:17:29
    hive ql multiple users can
  • 00:17:31
    simultaneously query data so again you
  • 00:17:34
    have multiple clients in there and they
  • 00:17:36
    send in their query that's also true
  • 00:17:37
    with the rdbms which kind of cues them
  • 00:17:40
    up because it's running so fast you
  • 00:17:41
    don't notice the lag time well you get
  • 00:17:43
    that also with the hql as you add more
  • 00:17:46
    computers and query can go very quickly
  • 00:17:48
    depending on how many computers and how
  • 00:17:50
    much resources each machine has to pull
  • 00:17:52
    the information and hive supports a
  • 00:17:55
    variety of data types
  • 00:17:57
    so with hive it's designed to be on the
  • 00:18:00
    hadoop system which you can put almost
  • 00:18:02
    anything into the hadoop file system so
  • 00:18:04
    with all that let's take a look at a
  • 00:18:07
    demo on hive ql or hql before i dive
  • 00:18:11
    into the hands-on demo let's take a look
  • 00:18:14
    at the website hive.apache.org
  • 00:18:17
    that's the main website since apache
  • 00:18:20
    it's an apache open source
  • 00:18:22
    software this is the main software for
  • 00:18:24
    the main site for the build and if you
  • 00:18:26
    go in here you'll see that they're
  • 00:18:27
    slowly migrating hive into beeline and
  • 00:18:30
    so if you see beeline versus hive note
  • 00:18:32
    the beeline as the new release is coming
  • 00:18:34
    out that's all it is it reflects a lot
  • 00:18:36
    of the same functionality of hive it's
  • 00:18:38
    the same thing and then we like to pull
  • 00:18:40
    up some kind of documentation on
  • 00:18:43
    commands and for this i'm actually going
  • 00:18:45
    to go to hortonworks hive cheat sheet
  • 00:18:48
    and that's because hortonworks and
  • 00:18:50
    cloudera are two of the most commonly used
  • 00:18:53
    builds for hadoop which include
  • 00:18:57
    hive and all the different tools in
  • 00:18:58
    there and so hortonworks has a pretty
  • 00:19:00
    good pdf you can download cheat sheet on
  • 00:19:03
    there i believe cloudera does too but
  • 00:19:04
    we'll go ahead and just look at the
  • 00:19:06
    horton one because it's the one that
  • 00:19:07
    comes up really good and you can see
  • 00:19:08
    when we look at the query language it
  • 00:19:10
    compares mysql server to hive ql or hql
  • 00:19:14
    and you can see the basic select we
  • 00:19:16
    select columns from table where
  • 00:19:19
    conditions exist the most basic command
  • 00:19:22
    on there and they have different things
  • 00:19:23
    you can do with it just like you do with
  • 00:19:25
    your sql and if you scroll down you'll
  • 00:19:28
    see
  • 00:19:28
    data types so here's your integer your
  • 00:19:30
    float your binary double string timestamp
  • 00:19:33
    and all the different data types you can
  • 00:19:34
    use some different semantics different
  • 00:19:37
    keys features functions
  • 00:19:40
    for running a hive query command line
  • 00:19:42
    setup and of course a hive shell uh set
  • 00:19:45
    up in here so you can see right here if
  • 00:19:47
    we loop through it has a lot of your
  • 00:19:48
    basic stuff and it is we're basically
  • 00:19:50
    looking at sql across a horton database
  • 00:19:53
    we're going to go ahead and run our
  • 00:19:55
    hadoop cluster hive demo and i'm going
  • 00:19:58
    to go ahead and use the cloudera quick
  • 00:20:00
    start this is in the virtual box so
  • 00:20:03
    again we have an oracle virtual box
  • 00:20:05
    which is open source and then we have
  • 00:20:08
    our cloudera quickstart which is the
  • 00:20:10
    hadoop setup on a single node now
  • 00:20:12
    obviously hadoop and hive are designed
  • 00:20:15
    to run across a cluster of computers so
  • 00:20:17
    we talk about a single node is for
  • 00:20:19
    education testing that kind of thing and
  • 00:20:21
    if you have a chance you can always go
  • 00:20:23
    back and look at our demo we had on
  • 00:20:27
    setting up a hadoop system in a single
  • 00:20:29
    cluster just set a note down below in
  • 00:20:31
    the youtube video and our team will get
  • 00:20:33
    in contact with you and send you that
  • 00:20:35
    link if you don't already have it or you
  • 00:20:36
    can contact us at the
  • 00:20:39
    www.simplilearn.com now in here it's
  • 00:20:40
    always important to note that you do
  • 00:20:42
    need
  • 00:20:43
    on your computer if you're running on
  • 00:20:45
    windows because i'm on a windows machine
  • 00:20:47
    you're going to need probably about 12
  • 00:20:49
    gigabytes to actually run this it used
  • 00:20:51
    to be you could get by with a lot less but as
  • 00:20:52
    things have evolved they take up more
  • 00:20:54
    and more resources and you need the
  • 00:20:56
    professional version if you have the
  • 00:20:58
    home version i was able to get that to
  • 00:21:00
    run but boy did it take a lot of extra
  • 00:21:02
    work to get the home version to let me
  • 00:21:05
    use the virtual setup on there and we'll
  • 00:21:07
    simply click on the cloudera quick start
  • 00:21:09
    and i'm going to go and just start that
  • 00:21:10
    up and this is starting up our linux so
  • 00:21:13
    we have our windows 10 which is a
  • 00:21:14
    computer i'm on and then i have the
  • 00:21:17
    virtual box which is going to have a
  • 00:21:18
    linux operating system in it and we'll
  • 00:21:20
    skip ahead so you don't have to watch
  • 00:21:22
    the whole install something interesting
  • 00:21:24
    to know about the cloudera is that it's
  • 00:21:27
    running on linux centos and for whatever
  • 00:21:29
    reason i've always had to click on it
  • 00:21:32
    and hit the escape button for it to spin
  • 00:21:35
    up and then you'll see the dos come in
  • 00:21:37
    here now that our cloudera spun up on
  • 00:21:39
    our virtual machine with the linux on we
  • 00:21:42
    can see here we have our it uses the
  • 00:21:45
    thunderbird browser on here by default
  • 00:21:47
    and automatically opens up a number of
  • 00:21:49
    different tabs for us and a quick note
  • 00:21:51
    cause i mentioned like the restrictions
  • 00:21:53
    on getting set up on your own computer
  • 00:21:55
    if you have a home edition computer and
  • 00:21:57
    you're worried about setting it up on
  • 00:21:59
    there you can also go in there and spin
  • 00:22:01
    up a one month free service on amazon
  • 00:22:04
    web service to play with this so there's
  • 00:22:06
    other options you're not stuck with just
  • 00:22:08
    doing it on the quick start menu you can
  • 00:22:10
    spin this up in many other ways now the
  • 00:22:12
    first thing we want to note is that
  • 00:22:13
    we've come in here into cloudera and i'm
  • 00:22:15
    going to access this in two ways uh the
  • 00:22:18
    first one is we're going to use hue and
  • 00:22:20
    i'm going to open up hue and i'll take
  • 00:22:22
    it a moment to load from the setup on
  • 00:22:24
    here and hue is nice if i go in and use
  • 00:22:28
    hue as an editor into hive or into the
  • 00:22:31
    hadoop setup usually i'm doing it as a
  • 00:22:34
    from an admin side because it has a lot
  • 00:22:37
    more information a lot of visuals less
  • 00:22:39
    to do with you know actually diving in
  • 00:22:41
    there and just executing code and you
  • 00:22:43
    can also write this code into files and
  • 00:22:45
    scripts and there's other things you can
  • 00:22:47
    otherwise you can upload it into hive
  • 00:22:49
    but today we're going to look at the
  • 00:22:50
    command lines and we'll upload it into
  • 00:22:52
    hue and then we'll go into and actually
  • 00:22:54
    do our work in a terminal window under
  • 00:22:56
    the hive shell now in the hue browser
  • 00:22:59
    window if you go under query and click
  • 00:23:00
    on the pull down menu and then you go
  • 00:23:02
    under editor and you'll see hive there
  • 00:23:04
    we go there's our hive setup i go and
  • 00:23:06
    click on hive and this will open up our
  • 00:23:08
    query down here and now it has a nice
  • 00:23:10
    little b that shows our hive going and
  • 00:23:12
    we can go something very simple down
  • 00:23:14
    here like show
  • 00:23:16
    databases and we follow it with the
  • 00:23:18
    semicolon and that's the standard in
  • 00:23:21
    hive is you always add our
  • 00:23:23
    punctuation at the end there and i'll go
  • 00:23:25
    ahead and run this and the query will
  • 00:23:26
    show up underneath and you'll see down
  • 00:23:28
    here since this is a new quick start i
  • 00:23:30
    just put on here you'll see it has the
  • 00:23:32
    default down here for the databases
  • 00:23:35
    that's the database name i haven't
  • 00:23:37
    actually created any databases on here
  • 00:23:38
    and then there's a lot of other like
  • 00:23:40
    assistant function tables
  • 00:23:42
    your databases up here there's all kinds
  • 00:23:45
    of things you can research you can look
  • 00:23:46
    at through hue as far as a bigger
  • 00:23:49
    picture the downside of this is it
  • 00:23:51
    always seems to lag for me whenever i'm
  • 00:23:53
    doing this i always seem to run slow so
  • 00:23:55
    if you're in cloudera you can open up a
  • 00:23:57
    terminal window they actually have an
  • 00:23:59
    icon at the top you can also go under
  • 00:24:01
    applications and under applications
  • 00:24:03
    system tools and terminal either one
  • 00:24:05
    will work it's just a regular terminal
  • 00:24:07
    window and this terminal window is now
  • 00:24:09
    running underneath our linux so this is
  • 00:24:11
    a linux terminal window or on our
  • 00:24:13
    virtual machine which is resting on our
  • 00:24:16
    regular windows 10 machine and we'll go
  • 00:24:18
    ahead and zoom this in so you can see
  • 00:24:19
    the text better on your own video and i
  • 00:24:22
    simply just clicked on view and zoom in
  • 00:24:24
    and then all we have to do is type in
  • 00:24:26
    hive and this will open up the shell on
  • 00:24:28
    here and it takes it just a moment to
  • 00:24:30
    load when starting up hive i also want
  • 00:24:33
    to note that depending on your rights on
  • 00:24:36
    the computer you're on in your action
  • 00:24:37
    you might have to do sudo hive and put
  • 00:24:39
    in your password and username most
  • 00:24:41
    computers are usually set up with the
  • 00:24:43
    hive login again it just depends on how
  • 00:24:45
    you're accessing the linux system and
  • 00:24:47
    the hive shell once we're in here we can
  • 00:24:49
    go ahead and do a simple uh hql command
  • 00:24:52
    show databases and if we do that we'll
  • 00:24:55
    see here that we don't have any
  • 00:24:56
    databases so we can go ahead and create
  • 00:24:58
    a database and we'll just call it office
  • 00:25:01
    for today for this moment now if i do
  • 00:25:03
    show we'll just do the up arrow up arrow
  • 00:25:06
    is a hotkey that works in both linux and
  • 00:25:08
    in hive so i can go back and paste
  • 00:25:10
    through all the commands i've typed in
  • 00:25:11
    and we can see now that i have my
  • 00:25:14
    there's of course a default database and
  • 00:25:16
    then there's the office database so now
  • 00:25:18
    we've created a database it's pretty
  • 00:25:19
    quick and easy and we can go ahead and
  • 00:25:21
    drop the database we can do drop
  • 00:25:23
    database
  • 00:25:24
    office now this will work on this
  • 00:25:26
    database because it's empty if your
  • 00:25:28
    database was not empty you would have to
  • 00:25:30
    do cascade and that drops all the tables
  • 00:25:33
    in the database and the database itself
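
The database commands described here, written out as they would be typed at the hive> prompt (using the office database name from the demo):

    SHOW DATABASES;
    CREATE DATABASE office;
    DROP DATABASE office;             -- works only while office is still empty
    -- DROP DATABASE office CASCADE;  -- needed if the database already holds tables
    CREATE DATABASE office;           -- recreated for the rest of the demo
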
  • 00:25:36
    now if we do show database and we'll go
  • 00:25:39
    ahead and recreate our database because
  • 00:25:40
    we're going to use the office database
  • 00:25:42
    for the rest of this hands-on demo a
  • 00:25:45
    really handy command to now
  • 00:25:47
    set with the sql or hql is to use office
  • 00:25:51
    and what that does is that sets office
  • 00:25:54
    as a default database so instead of
  • 00:25:56
    having to reference the database every
  • 00:25:59
    time we work with a table it now
  • 00:26:01
    automatically assumes that's the
  • 00:26:02
    database being used whatever tables
  • 00:26:04
    we're working on the difference is you
  • 00:26:06
    put the database name period table and
  • 00:26:08
    i'll show you in just a minute what that
  • 00:26:09
    looks like and how that's different if
  • 00:26:11
    we're going to have a table and a
  • 00:26:13
    database we should probably load some
  • 00:26:14
    data into it so let me go ahead and
  • 00:26:16
    switch gears here and open up a terminal
  • 00:26:19
    window you can just open another
  • 00:26:20
    terminal window and it'll open up right
  • 00:26:21
    on top of the one that you have hive
  • 00:26:23
    shell running in and when we're in this
  • 00:26:25
    terminal window first we're going to go
  • 00:26:27
    ahead and just do a list which is of
  • 00:26:28
    course a linux command you can see all
  • 00:26:30
    the files i have in here this is the
  • 00:26:32
    default load we can change directory to
  • 00:26:34
    documents we can list in documents and
  • 00:26:38
    we're actually going to be looking at
  • 00:26:40
    employee.csv a linux command is the cat
  • 00:26:44
    you can use this actually to combine
  • 00:26:45
    documents there's all kinds of things
  • 00:26:46
    that cat does but if we want to just
  • 00:26:48
    display the contents of our employee.csv
  • 00:26:52
    file we can simply do cat employee csv
  • 00:26:55
    and when we're looking at this we want
  • 00:26:57
    to know a couple things one there's a
  • 00:27:00
    line at the top okay so the very first
  • 00:27:02
    thing we notice is that we have a header
  • 00:27:04
    line the next thing we notice is that
  • 00:27:06
    the data is comma separated and in this
  • 00:27:09
    particular case you'll see a space here
  • 00:27:12
    generally with these you've got to be
  • 00:27:13
    real careful with spaces there's all
  • 00:27:15
    kinds of things you've got to watch out
  • 00:27:16
    for because they can cause issues these
  • 00:27:18
    spaces won't because these are all
  • 00:27:20
    strings that the space is connected to
  • 00:27:22
    if this was a space next to the integer
  • 00:27:24
    you would get a null value that comes
  • 00:27:26
    into the database without doing
  • 00:27:27
    something extra in there now with most
  • 00:27:29
    of hadoop that's important to know that
  • 00:27:31
    you're writing the data once reading it
  • 00:27:33
    many times and that's true of almost all
  • 00:27:36
    your hadoop things coming in so you
  • 00:27:38
    really want to process the data before
  • 00:27:40
    it gets into the database and for those
  • 00:27:43
    who of you have studied data
  • 00:27:45
    transformation that's the etl where you
  • 00:27:47
    extract transform and then load the
  • 00:27:51
    data so you really want to extract and
  • 00:27:53
    transform before putting it into the
  • 00:27:54
    hive then you load it into the hive with
  • 00:27:56
    the transformed data and of course we
  • 00:27:58
    also want to note the schema we have an
  • 00:28:00
    integer string string integer integer so
  • 00:28:03
    we kept it pretty simple in here as far
  • 00:28:04
    as the way the data is set up the last
  • 00:28:06
    thing that you're going to want to look
  • 00:28:07
    up
  • 00:28:08
    is the source since we're doing local
  • 00:28:11
    uploads we want to know what the path is
  • 00:28:13
    we have the whole path in this case it's
  • 00:28:15
    home slash cloudera slash documents and
  • 00:28:18
    these are just text documents we're
  • 00:28:20
    working with right now we're not doing
  • 00:28:21
    anything fancy so we can do a simple g
  • 00:28:24
    edit employee.csv
  • 00:28:26
    and you'll see it comes up here it's
  • 00:28:28
    just a text document so i can easily
  • 00:28:30
    remove these added spaces there we go
  • 00:28:33
    and then we go and just save it and so
  • 00:28:35
    now it has a new setup in there we've
  • 00:28:36
    edited it the g edit is usually one of
  • 00:28:39
    the default that loads into linux so any
  • 00:28:42
    text editor will do back to the hive
  • 00:28:44
    shell so let's go ahead and create a
  • 00:28:46
    table employee and what i want you to
  • 00:28:48
    note here is i did not put the semicolon
  • 00:28:51
    on the end here semicolon tells it to
  • 00:28:53
    execute that line so this is kind of
  • 00:28:55
    nice if you're you can actually just
  • 00:28:57
    paste it in if you have it written on
  • 00:28:58
    another sheet and you can see right here
  • 00:29:00
    where i have create table employee and
  • 00:29:02
    it goes into the next line on there so i
  • 00:29:04
    can do all of my commands at once now
  • 00:29:07
    just so i don't have any typo errors i
  • 00:29:08
    went ahead and just pasted the next
  • 00:29:10
    three lines in and the next one is our
  • 00:29:13
    schema if you remember correctly from
  • 00:29:15
    the other side we had the different
  • 00:29:17
    values in here which was id
  • 00:29:19
    name department year of joining and
  • 00:29:22
    salary and the id is an integer name is
  • 00:29:25
    a string department string year of joining an
  • 00:29:26
    integer salary an integer and they're in
  • 00:29:29
    brackets we put close brackets around
  • 00:29:31
    them and you could do this all as one
  • 00:29:32
    line and then we have row format
  • 00:29:34
    delimited fields terminated by comma and
  • 00:29:37
    this is important because the default is
  • 00:29:40
    tabs so if i do it now it won't find any
  • 00:29:42
    terminated fields so you'll get a bunch
  • 00:29:44
    of null values loaded into your table
  • 00:29:47
    and then finally our table properties we
  • 00:29:49
    want to skip the header line count
  • 00:29:51
    equals 1. now this is a lot of work for
  • 00:29:54
    uploading a single file it's kind of
  • 00:29:56
    goofy when you're uploading a single
  • 00:29:57
    file that you have to put all this in
  • 00:29:59
    here but keep in mind hive and hadoop is
  • 00:30:02
    designed for writing many files into the
  • 00:30:05
    database you write them all in there and
  • 00:30:06
    then you can they're saved it's an
  • 00:30:08
    archive it's a data warehouse and then
  • 00:30:10
    you're able to do all your queries on
  • 00:30:11
    them so a lot of times we're not looking
  • 00:30:13
    at just the one file coming up we're
  • 00:30:15
    loading hundreds of files you have your
  • 00:30:18
    reports coming off of your main database
  • 00:30:20
    all those reports are being loaded you
  • 00:30:22
    have your log files you have i mean all
  • 00:30:24
    this different data is being dumped into
  • 00:30:26
    hadoop and in this case hive on top of
  • 00:30:28
    hadoop and so we need to let it know hey
  • 00:30:30
    how do i handle these files coming in
  • 00:30:32
    and then we have the semicolon at the
  • 00:30:34
    end which lets us know to go ahead and
  • 00:30:35
    run this line and so we'll go ahead and
  • 00:30:37
    run that and now if we do a show tables
  • 00:30:40
    you can see there's our employee on
  • 00:30:41
    there we can also describe if we do
  • 00:30:44
    describe employee
  • 00:30:46
    you can see that we have our id integer
  • 00:30:48
    name string department string year of
  • 00:30:51
    joining integer and salary integer and
  • 00:30:54
    then finally let's just do a select star
  • 00:30:56
    from employee very basic sql and hql
  • 00:31:00
    command selecting data it's going to
  • 00:31:02
    come up and we haven't put anything in
  • 00:31:04
    it so as we expect there's no data in it
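
A cleaned-up version of the table definition and checks just walked through; the column names follow the schema stated in the demo, and the TBLPROPERTIES key is the standard Hive property for skipping a CSV header:

    CREATE TABLE employee (
      id              INT,
      name            STRING,
      department      STRING,
      year_of_joining INT,
      salary          INT
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','                        -- the default delimiter is tab, so declare the comma
    TBLPROPERTIES ("skip.header.line.count"="1");   -- ignore the header row in the CSV

    SHOW TABLES;
    DESCRIBE employee;
    SELECT * FROM employee;   -- returns nothing until data is loaded
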
  • 00:31:07
    so if we flip back to our
  • 00:31:10
    linux terminal window you can see where
  • 00:31:12
    we did the cat
  • 00:31:14
    employee.csv and you can see all the
  • 00:31:16
    data we expect to come into it and we
  • 00:31:18
    also did our pwd and right here you see
  • 00:31:21
    the path you need that full path when
  • 00:31:23
    you are loading data you know you can do
  • 00:31:26
    a browse and if i did it right now with
  • 00:31:28
    just the employee.csv as a name it will
  • 00:31:31
    work but that is a really bad habit in
  • 00:31:33
    general when you're loading data because
  • 00:31:35
    it's you don't know what else is going
  • 00:31:36
    on in the computer you want to do the
  • 00:31:38
    full path almost in all your data loads
  • 00:31:40
    so let's go ahead and flip back over
  • 00:31:42
    here to our hive shell we're working in
  • 00:31:45
    and the command for this is load data so
  • 00:31:48
    that says hey we're loading data that's
  • 00:31:49
    a hive command hql and we want local
  • 00:31:52
    data so you got to put down local in
  • 00:31:54
    path so now it needs to know where the
  • 00:31:56
    path is now to make this more legible
  • 00:31:59
    i'm just going to go ahead and hit enter
  • 00:32:00
    then we'll just paste the full path in
  • 00:32:02
    there which i have stored over on the
  • 00:32:04
    side like a good prepared demo and
  • 00:32:06
    you'll see here we have home cloudera
  • 00:32:08
    documents employee.csv so it's a whole
  • 00:32:11
    path for this text document in here and
  • 00:32:13
    we go ahead and hit enter in there and
  • 00:32:15
    then we have to let it know where the
  • 00:32:17
    data is going so now we have a source
  • 00:32:19
    and we need a destination and it's going
  • 00:32:20
    to go into the table and we'll just call
  • 00:32:22
    it employee we'll just match the table
  • 00:32:25
    in there and because i want it to
  • 00:32:26
    execute we put the semicolon on the end
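
The load statement being assembled here, shown in one piece (the path follows the home/cloudera/documents location given in the demo; the exact casing of the directory is assumed):

    LOAD DATA LOCAL INPATH '/home/cloudera/Documents/employee.csv'
    INTO TABLE employee;
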
  • 00:32:29
    it goes ahead and executes all three
  • 00:32:31
    lines now if we go back if you remember
  • 00:32:34
    we did the select star from employee
  • 00:32:36
    just using the up arrow to page through
  • 00:32:38
    my different commands i've already typed
  • 00:32:41
    in you can see right here we have as we
  • 00:32:43
    expect we have rows sam mike and nick
  • 00:32:45
    and we have all their information
  • 00:32:46
    showing in our four rows and then let's
  • 00:32:49
    go ahead and do uh select
  • 00:32:51
    and count let's look at a couple of
  • 00:32:53
    these different select options you can
  • 00:32:55
    do we're going to count everything from
  • 00:32:57
    employee now this is kind of interesting
  • 00:32:59
    because the first one just pops up with
  • 00:33:01
    the basic select because it doesn't need
  • 00:33:04
    to go through the full map reduce phase
  • 00:33:07
    but when you start doing a count it does
  • 00:33:09
    go through the full map reduce setup in
  • 00:33:12
    the hive in hadoop and because i'm doing
  • 00:33:14
    this demo on a single node cloudera
  • 00:33:18
    virtual box on top of a windows 10 all
  • 00:33:21
    the benefits of running it on a cluster
  • 00:33:23
    are gone and instead is now going
  • 00:33:25
    through all those added layers so it
  • 00:33:27
    takes longer to run you know like i said
  • 00:33:30
    when you do a single node as i said
  • 00:33:32
    earlier it doesn't do any good as an
  • 00:33:34
    actual distribution because you're only
  • 00:33:35
    running it on one computer and then
  • 00:33:37
    you've added all these different layers
  • 00:33:38
    to run it and we see it comes up with
  • 00:33:40
    four and that's what we expect we have
  • 00:33:41
    four rows we expect four at the end and
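
The count just run, together with the filtered query the demo turns to next, as HiveQL:

    SELECT COUNT(*) FROM employee;   -- runs as a MapReduce job; returns 4 here

    -- office.employee is the database.table form; plain employee also works
    -- because office is set as the current database.
    SELECT * FROM office.employee
    WHERE salary > 25000;
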
  • 00:33:44
    if you remember from
  • 00:33:45
    our cheat sheet which we brought up here
  • 00:33:48
    from hortons it's a pretty good one
  • 00:33:49
    there's all these different commands we
  • 00:33:51
    can do we'll look at one more command
  • 00:33:52
    where we do the
  • 00:33:54
    what they call sub queries right down
  • 00:33:56
    here because that's really common to do
  • 00:33:58
    a lot of sub queries and so we'll do
  • 00:34:00
    select
  • 00:34:01
    star or all different columns from
  • 00:34:05
    now if we weren't using the office
  • 00:34:07
    database it would look like this from
  • 00:34:10
    office dot employee and either one will
  • 00:34:12
    work on this particular one because we
  • 00:34:15
    have office set as a default on there so
  • 00:34:18
    from office employee and then the
  • 00:34:20
    command where creates a subset and in
  • 00:34:23
    this case we want to know where the
  • 00:34:24
    salary is greater than
  • 00:34:28
    25,000. there we go and of course we end
  • 00:34:30
    with our semicolon and if we run this
  • 00:34:32
    query you can see it pops up and there's
  • 00:34:34
    our salaries of people top earners we
  • 00:34:36
    have rose in IT and mike in HR, kudos
  • 00:34:39
    to them of course they're fictional i
  • 00:34:41
    don't actually have a
  • 00:34:42
    rose or a mike in those positions, or maybe we do.
  • 00:34:44
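The filter query looks roughly like this; the unqualified table name works just as well here, since office is the current database:

```sql
-- restrict the result set with a WHERE clause
SELECT * FROM office.employee WHERE salary > 25000;
```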
    so finally, since we're done with this
  • 00:34:46
    table, let's wrap it up.
  • 00:34:48
    now remember, you're dealing with
  • 00:34:49
    the data warehouse so you usually don't
  • 00:34:51
    do a lot of dropping of tables and
  • 00:34:54
    databases but we're going to go ahead
  • 00:34:56
    and drop this table here before we drop
  • 00:34:58
    it one more quick note is we can change
  • 00:35:01
    it so what we're going to do is we're
  • 00:35:03
    going to alter table office employee and
  • 00:35:06
    we want to go ahead and rename it
  • 00:35:08
    there's some other commands you can do
  • 00:35:09
    in here but rename is pretty common and
  • 00:35:11
    we're going to rename it to
  • 00:35:13
    and it's going to stay in office and
  • 00:35:16
    it turns out one of our
  • 00:35:17
    shareholders really doesn't like the
  • 00:35:19
    word employee he wants employees plural
  • 00:35:22
    it's a big deal to him so let's go ahead
  • 00:35:24
    and change that name for the table it's
  • 00:35:26
    that easy because it's just changing the
  • 00:35:28
    metadata on there and now if we do show
  • 00:35:30
    tables you'll see we now have employees
  • 00:35:33
    not employee.
  • 00:35:36
    then at this point, maybe we're doing some house cleaning
  • 00:35:37
    because this is all practice so we're
  • 00:35:39
    going to go ahead and drop table and
  • 00:35:40
    we'll drop table employees because we
  • 00:35:43
    changed the name in there so if we did
  • 00:35:45
    employee just give us an error and now
  • 00:35:47
    if we do show tables you'll see all the
  • 00:35:49
    tables are gone.
  • 00:35:50
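And the cleanup step, using the table's new name:

```sql
DROP TABLE employees;  -- the old name, employee, would now give an error
SHOW TABLES;
```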
    now for the next thing we want to take a look at, we're going
  • 00:35:52
    to walk back through the loading of data
  • 00:35:54
    just real quick because we're going to
  • 00:35:56
    load two tables in here and let me just
  • 00:35:58
    float back to our terminal window so we
  • 00:36:01
    can see what those tables are that we're
  • 00:36:03
    loading and so up here we have a
  • 00:36:05
    customer
  • 00:36:07
    file and we have an order file we want
  • 00:36:09
    to go ahead and put the customers and
  • 00:36:10
    the orders into here so those are the
  • 00:36:12
    two we're doing and of course it's
  • 00:36:13
    always nice to see what you're working
  • 00:36:15
    with so let's do our cat
  • 00:36:17
    customer.csv. we could always do gedit
  • 00:36:20
    but we don't really need to edit these
  • 00:36:21
    we just want to take a look at the data
  • 00:36:23
    in customer and important in here is
  • 00:36:25
    again we have a header so we have to
  • 00:36:27
    skip a line; it's comma separated, nothing
  • 00:36:30
    odd with the data we have our schema
  • 00:36:32
    which is integer, string, integer,
  • 00:36:36
    string, integer, so you know you'd want to
  • 00:36:38
    take note of that or flip back
  • 00:36:40
    and forth when you're doing it and then
  • 00:36:41
    let's go ahead and do cat order dot csv
  • 00:36:44
    and we can see we have oid which i'm
  • 00:36:47
    guessing is the order id we have a date
  • 00:36:49
    which is something new; we've done integers and
  • 00:36:51
    strings but we haven't done date when
  • 00:36:53
    you're importing something new and you've never
  • 00:36:56
    worked with dates before, the date is always one
  • 00:36:57
    of the trickier fields to import, and
  • 00:36:59
    that's true of just about any
  • 00:37:01
    scripting language i've worked with all
  • 00:37:03
    of them have their own idea of how
  • 00:37:04
    date's supposed to be formatted what the
  • 00:37:06
    default is. this particular format,
  • 00:37:09
    a four-digit year,
  • 00:37:13
    dash, a two-digit month, dash, a two-digit day, is the
  • 00:37:16
    standard import format for hive, so you'll
  • 00:37:19
    have to look up and see what the
  • 00:37:20
    different formats are if you're going to
  • 00:37:22
    do a different format in there coming in
  • 00:37:24
    or if you're not able to pre-process the
  • 00:37:25
    data. this would normally be a pre-processing
  • 00:37:27
    step on the data coming in. if you
  • 00:37:29
    remember correctly from our ETL (just
  • 00:37:31
    in case you weren't able to hear me
  • 00:37:33
    last time, that's ETL), which stands for
  • 00:37:37
    extract, transform, then load. so you want
  • 00:37:40
    to make sure you're transforming this
  • 00:37:42
    data before it gets into hive.
  • 00:37:44
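As a small illustration (the literal below is made up), Hive's DATE type expects the yyyy-MM-dd layout out of the box; note that on very old Hive versions a SELECT without a FROM clause may not be supported:

```sql
-- the ISO-style yyyy-MM-dd layout is what Hive's DATE type expects
SELECT CAST('2014-03-15' AS DATE);
```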
    so we're going to go ahead and bring
  • 00:37:45
    both of these data sets in here, and really we're
  • 00:37:47
    doing this so we can show you the basic
  • 00:37:49
    join. there are, if you remember from our
  • 00:37:51
    setup, merge joins and all kinds of different
  • 00:37:53
    things you can do but joining different
  • 00:37:55
    data sets is so common so it's really
  • 00:37:58
    important to know how to do this we need
  • 00:37:59
    to go ahead and bring in these two data
  • 00:38:00
    sets and you can see where i just
  • 00:38:02
    created a table customer here's our
  • 00:38:04
    schema the integer name age address
  • 00:38:07
    salary; here it's delimited by commas
  • 00:38:10
    and our table properties where we skip a
  • 00:38:12
    line well let's go ahead and load the
  • 00:38:13
    data first and then we'll do that with
  • 00:38:15
    our order and let's go ahead and put
  • 00:38:17
    that in here and i've got it split into
  • 00:38:19
    three lines so you can see it easily we
  • 00:38:21
    got load data local in path so we know
  • 00:38:23
    we're loading data we know it's local
  • 00:38:25
    and we have the path here's the complete
  • 00:38:27
    path for... oh, this is supposed to be
  • 00:38:29
    order.csv, grabbed the wrong one. of course
  • 00:38:32
    it's going to give me errors because you
  • 00:38:33
    can't recreate the same table on there
  • 00:38:35
    and here we go create table here's our
  • 00:38:38
    integer date customer the basic setup
  • 00:38:41
    that we had coming in here for our
  • 00:38:42
    schema row format commas table
  • 00:38:45
    properties skip header line and then
  • 00:38:47
    finally let's load the data into
  • 00:38:50
    our order table load data local in path
  • 00:38:53
    home cloudera documents order.csv into
  • 00:38:56
    table order.
  • 00:38:58
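Putting those pieces together, the two tables and their loads look roughly like this. The column names are assumptions pieced together from what is read out in the demo, and depending on the Hive version the table name order may need to be backticked because ORDER is a keyword:

```sql
CREATE TABLE customer (
  id INT, name STRING, age INT, address STRING, salary INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count" = "1");

CREATE TABLE `order` (
  oid INT, order_date DATE, customer_id INT, amount INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count" = "1");

LOAD DATA LOCAL INPATH '/home/cloudera/Documents/customer.csv' INTO TABLE customer;
LOAD DATA LOCAL INPATH '/home/cloudera/Documents/order.csv' INTO TABLE `order`;
```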
    now if we did everything right, we should be able to do select
  • 00:39:00
    star from customer and you can see we
  • 00:39:03
    have all seven customers and then we can
  • 00:39:05
    do select star from order and we have
  • 00:39:09
    four orders. so this is just a
  • 00:39:11
    quick framing: a lot of times, when
  • 00:39:13
    you have your customer databases in
  • 00:39:15
    business you have thousands of customers
  • 00:39:17
    from years and years and some of them
  • 00:39:20
    you know they move they close their
  • 00:39:21
    business they change names all kinds of
  • 00:39:23
    things happen uh so we want to do is we
  • 00:39:25
    want to go ahead and find just the
  • 00:39:27
    information connected to these orders
  • 00:39:30
    and who's connected to them and so let's
  • 00:39:32
    go ahead and do it's a select because
  • 00:39:33
    we're going to display information so
  • 00:39:35
    select and this is kind of interesting
  • 00:39:37
    we're going to do c dot id
  • 00:39:40
    and i'm going to define c as customer as
  • 00:39:42
    a customer table in just a minute then
  • 00:39:44
    we're going to do c dot name and again
  • 00:39:47
    we're going to define the c c dot age so
  • 00:39:50
    this means from the customer we want to
  • 00:39:52
    know their id their name their age and
  • 00:39:54
    then you know i'd also like to know the
  • 00:39:56
    order amount, so let's do o dot
  • 00:39:59
    amount and then this is where we need to
  • 00:40:01
    go ahead and define uh what we're doing
  • 00:40:04
    and i'm going to capitalize from
  • 00:40:05
    customer so we're going to take the
  • 00:40:07
    customer table in here and we're going
  • 00:40:09
    to name it c that's where the c comes
  • 00:40:11
    from so that's the customer table c and
  • 00:40:13
    we want to join order as o that's where
  • 00:40:16
    our o comes from so the o dot amount is
  • 00:40:18
    what we're joining in there and then we
  • 00:40:20
    want to do this on we got to tell it how
  • 00:40:22
    to connect the two tables c dot id
  • 00:40:25
    equals o dot customer underscore id so
  • 00:40:29
    now we know how they're joined and
  • 00:40:31
    remember we have seven customers in here
  • 00:40:34
    we have four orders and as it processes
  • 00:40:36
    we should get a return
  • 00:40:38
    of four different names joined together
  • 00:40:41
    and they're joined based on of course
  • 00:40:43
    the orders on there and once we're done
  • 00:40:45
    we now have the order number the person
  • 00:40:48
    who made the order their age and the
  • 00:40:51
    amount of the order which came from the
  • 00:40:52
    order table so you have your different
  • 00:40:54
    information and you can see how the join
  • 00:40:56
    works here; joins are very common in hql and sql.
  • 00:40:59
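Written out, the join query described above looks roughly like this:

```sql
-- alias customer as c and order as o, then join on the customer id
SELECT c.id, c.name, c.age, o.amount
FROM customer c
JOIN `order` o
  ON c.id = o.customer_id;
```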
    now let's do one more thing
  • 00:41:02
    with our database and then i'll show you
  • 00:41:05
    a couple other hive commands and let's
  • 00:41:07
    go ahead and do a drop and we're going
  • 00:41:09
    to drop database office
  • 00:41:12
    and if you're looking at this and you uh
  • 00:41:14
    remember from earlier this will give me
  • 00:41:16
    an error, and just to see what that looks
  • 00:41:18
    like it says failed to execute exception
  • 00:41:21
    one or more tables exist so if you
  • 00:41:24
    remember from before you can't just drop
  • 00:41:26
    a database unless you tell it to cascade
  • 00:41:29
    that lets it know i don't care how many
  • 00:41:31
    tables are in it, just get rid of it.
  • 00:41:33
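In HQL terms:

```sql
DROP DATABASE office;          -- fails while tables still exist in it
DROP DATABASE office CASCADE;  -- drops the database along with its tables
```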
    in hadoop, since it's a data
  • 00:41:35
    warehouse, you usually
  • 00:41:36
    don't do a lot of dropping uh maybe at
  • 00:41:38
    the beginning when you're developing the
  • 00:41:39
    schemas and you realize you messed up
  • 00:41:41
    you might drop some stuff but down the
  • 00:41:43
    road you're really just adding commodity
  • 00:41:46
    machines so you can store
  • 00:41:47
    more stuff on it so you usually don't do
  • 00:41:49
    a lot of database dropping and some
  • 00:41:51
    other
  • 00:41:52
    fun commands to know is you can do
  • 00:41:54
    select round(2.3) as round_value; you can
  • 00:41:57
    do a round off in
  • 00:41:58
    hive. we can do floor(2.3) as floor_value, which is
  • 00:42:02
    going to give us a two so it turns it
  • 00:42:03
    into an integer versus a float it goes
  • 00:42:06
    down you know basically truncates it but
  • 00:42:08
    it goes down and we can also do ceiling
  • 00:42:10
    which is going to round it up so we're
  • 00:42:12
    looking for the next integer above.
  • 00:42:14
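A quick sketch of those three functions (on older Hive versions you may need to tack on a FROM clause):

```sql
SELECT ROUND(2.3)   AS round_value;    -- 2
SELECT FLOOR(2.3)   AS floor_value;    -- 2, always rounds down
SELECT CEILING(2.3) AS ceiling_value;  -- 3, always rounds up
```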
    there's a few commands we didn't show in
  • 00:42:16
    here because we're on a single node.
  • 00:42:19
    as an admin, to help expedite the process,
  • 00:42:22
    you usually add in partitions for the
  • 00:42:24
    data and buckets you can't do that on a
  • 00:42:27
    single node because when you add a
  • 00:42:29
    partition it partitions it across
  • 00:42:31
    separate nodes.
  • 00:42:33
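For reference, on a real cluster the DDL for partitions and buckets looks something like the sketch below; the table and column names here are purely illustrative:

```sql
-- partition by year, and hash the rows into 4 buckets by customer id
CREATE TABLE orders_by_year (
  oid INT, customer_id INT, amount INT
)
PARTITIONED BY (order_year INT)
CLUSTERED BY (customer_id) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```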
    but beyond that, you can see it's very straightforward: we have
  • 00:42:35
    sql coming in and all your basic queries
  • 00:42:38
    that are in sql are very similar to hql
  • 00:42:42
    key takeaways so we took a look at the
  • 00:42:44
    history of hive and how it evolved from
  • 00:42:47
    the hadoop file system to an hql (similar
  • 00:42:50
    to sql) layer on top of hadoop with the
  • 00:42:53
    full metastore and all that
  • 00:42:56
    information connected to the hadoop file
  • 00:42:58
    system that way you can easily scale it
  • 00:43:00
    up and still have, underneath,
  • 00:43:02
    the hadoop setup while still having the
  • 00:43:04
    sql query language available we looked
  • 00:43:07
    at what is hive and we looked at the
  • 00:43:09
    hive queries going in through the map
  • 00:43:11
    reduce into the hadoop map reduce system
  • 00:43:13
    we dug a little deeper to look at the
  • 00:43:15
    architecture of hive and all the
  • 00:43:17
    different pieces and how they fit
  • 00:43:18
    together including the fact that it has
  • 00:43:20
    the hive client and you have your thrift
  • 00:43:22
    applications and your
  • 00:43:24
    jdbc applications and your odbc
  • 00:43:26
    applications uh we have the hive web
  • 00:43:28
    interface which we looked at in a demo
  • 00:43:30
    along with the cli, the command
  • 00:43:33
    line interface, which we used for most
  • 00:43:34
    of the demo and again this is how do you
  • 00:43:36
    get these commands into hive. and the
  • 00:43:38
    hive web interface
  • 00:43:41
    is great for maybe
  • 00:43:43
    your shareholders you're working with
  • 00:43:45
    and some of them are not technically
  • 00:43:47
    literate or even if they are you know
  • 00:43:48
    it's a quick lookup for data. i'm not
  • 00:43:51
    beyond jumping into the hue web
  • 00:43:54
    interface and looking something up
  • 00:43:55
    versus a terminal window or if you're
  • 00:43:57
    running the back end, running the
  • 00:43:58
    program whether it's python or java you
  • 00:44:00
    have the thrift applications, which
  • 00:44:02
    connect in so you can extend it; all
  • 00:44:04
    the major scripting languages usually
  • 00:44:06
    have their own plugin for sending that
  • 00:44:08
    information over to our hadoop file
  • 00:44:10
    system we dug in deeper into the data
  • 00:44:12
    flow in hive so your user interface and
  • 00:44:15
    the different steps it takes to go
  • 00:44:16
    through and get in and out of the mapreduce
  • 00:44:18
    system we also took a glance at hive
  • 00:44:21
    data types and certainly these are
  • 00:44:23
    always evolving so it's good to look at
  • 00:44:25
    the apache website and especially with
  • 00:44:27
    the new stuff coming up underneath
  • 00:44:29
    beehive, which is also hive; if you see
  • 00:44:31
    that don't let that scare you it's just
  • 00:44:33
    the beta version coming up with the new
  • 00:44:35
    updates and then we looked at the
  • 00:44:37
    features of hive and how it works as an
  • 00:44:39
    ecosystem on there with that i'd like to
  • 00:44:42
    thank you for joining us today for
  • 00:44:44
    hadoop hive again my name is richard
  • 00:44:46
    kirschner with the simplilearn team for
  • 00:44:48
    more information visit us at
  • 00:44:50
    www.simplilearn.com
  • 00:44:53
    you can also post comments here on the
  • 00:44:55
    youtube channel we do have moderators
  • 00:44:57
    that monitor that and we certainly will
  • 00:44:59
    respond to your postings again thank you
  • 00:45:01
    for joining us today get certified get
  • 00:45:04
    ahead
  • 00:45:09
    hi there if you like this video
  • 00:45:11
    subscribe to the simplilearn youtube
  • 00:45:12
    channel and click here to watch similar
  • 00:45:15
    videos turn it up and get certified
  • 00:45:17
    click here
タグ
  • Hive
  • HQL
  • HiveQL
  • Data Warehouse
  • Hadoop
  • Metastore
  • HDFS
  • Big Data
  • SQL
  • Data Types