Introduction to Hive

 What is Hive



  • Hive was open-sourced in 2008 and became a top-level Apache project in 2010.
  • Apache Hive is an open-source data warehouse system built on top of Hadoop for querying and analyzing large datasets stored in HDFS.
  • Hive uses a query language called HiveQL (HQL), which is similar to SQL.
  • Hive automatically translates HiveQL queries into MapReduce jobs.
  • Hive is not a database, but it uses a database (called the metastore) to store the metadata of the tables that you define.
  • The Hive client generally runs on your workstation and converts your query into a series of jobs for execution on a Hadoop cluster.
  • Hive abstracts away the complexity of Hadoop. In particular, you do not need to learn Java to use Hive.
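
To make the translation from a query into MapReduce jobs more concrete, the sketch below mimics in plain Python how a grouped count could be split into map, shuffle, and reduce steps. It is only a conceptual illustration and not the plan Hive actually generates. The page_views table, its columns, the query in the comment, and all function names are hypothetical.

```python
from collections import defaultdict

# Assumed example query (hypothetical table and columns):
#   SELECT page, COUNT(*) FROM page_views GROUP BY page;

# Hypothetical rows of the page_views table: (user_id, page).
ROWS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/cart"),
    ("carol", "/home"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _user_id, page in rows:
        yield page, 1


def shuffle(pairs):
    """Collect all values that share a key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which corresponds to COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_page(rows):
    """Run the three steps in order for the assumed GROUP BY query."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_page(ROWS))
```

With Hive, the user writes only the one-line query. Steps like these are what would otherwise have to be written by hand as MapReduce code.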

History of Hive


  • The Data Infrastructure Team at Facebook developed Hive.
  • Hive is one of the technologies Facebook uses to meet its data-processing requirements.
  • It is very popular with internal users at Facebook.
  • It runs thousands of jobs on the cluster for hundreds of users, across a wide variety of applications.
  • The Hive-Hadoop cluster at Facebook stores more than 2 PB of raw data and regularly loads 15 TB of data per day.
  • Hive is now used and developed by a number of other companies, such as Amazon, IBM, Yahoo, Netflix, and the Financial Industry Regulatory Authority (FINRA).






Hive Architecture






Why Hive


Writing MapReduce code is difficult and cannot be expected of every engineer. Facebook struggled with the same issue and created HiveQL, a SQL-like language that runs MapReduce jobs in the background.



Hive-SQL:



Hive SQL is the informal name for HiveQL; most of its syntax is similar to SQL. You can run it from either of two command-line clients:

  • Hive CLI:
    • The Hive command-line interface, which connects to a HiveServer instance
    • $ hive
    • hive>

  • Beeline:
    • A newer command-line client that connects to a HiveServer2 instance
    • $ beeline
    • Hive version 0.11.0-SNAPSHOT by Apache
    • beeline>



How to Run Hive queries



  • The Hive CLI shell is started using the hive executable:
    • $ hive -h hostname
    • hive>
    • Use the -f flag to specify a file that contains a Hive script:
      • $ hive -f myquery.hive

  • Beeline is started using the beeline executable: 
    • $ beeline 
    • Hive version 0.11.0-SNAPSHOT by Apache 
    • beeline> 
    • Once Beeline is started, you issue a connect command to connect to a specific HiveServer2 instance: 
      • beeline> !connect jdbc:hive2://hostname:10000 username password org.apache.hive.jdbc.HiveDriver
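
Beeline can also be given the connection details on the command line instead of through an interactive !connect. The sketch below assembles such a command in Python using Beeline's -u (JDBC URL), -n (username), -p (password), and -f (script file) options. The helper name beeline_command and its parameters are hypothetical, and the host, user, and file names are the placeholders used above. Nothing is executed here; the function only builds the argument list.

```python
import shlex


def beeline_command(host, port=10000, username=None, password=None, script=None):
    """Build an argument list for a non-interactive Beeline invocation.

    Hypothetical helper: it only assembles arguments and does not run Beeline.
    """
    url = f"jdbc:hive2://{host}:{port}"
    argv = ["beeline", "-u", url]
    if username is not None:
        argv += ["-n", username]
    if password is not None:
        argv += ["-p", password]
    if script is not None:
        argv += ["-f", script]
    return argv


if __name__ == "__main__":
    # Placeholder host, user, and script names taken from the examples above.
    argv = beeline_command("hostname", username="username", script="myquery.hive")
    print(shlex.join(argv))
```

The resulting list could be passed to subprocess.run on a machine where Beeline is installed. Note that a password passed with -p may be visible to other users in the process list.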
