Bucketing & Hive - Demo

 Implement Bucketing in Hive


What is Bucketing


  • Bucketing in Hive is a data-organizing technique.
  • It is also an I/O performance-tuning technique.
  • It is similar to partitioning, but instead of creating one directory per column value it divides a large dataset into a fixed number of more manageable parts known as buckets.
  • We can use bucketing when partitioning becomes impractical, for example when a column has so many distinct values that partitioning on it would create too many small partitions.
  • We can also divide partitions further into buckets.


Bucketing Logic:


  • Bucketing is based on hashing.
    • The hash of the bucketing column's value is taken modulo the number of required buckets (let's say, F(x) % 3).
      • The resulting value decides which bucket the row is stored in.
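The modulo rule above can be sketched in the shell. This is only an illustration, not Hive's actual code: it assumes a non-negative integer column whose hash F(x) is the integer itself, and `bucket_for` is a hypothetical helper name. Hive versions can differ in the hash function they use, so the real bucket numbers may not match.

```shell
# Hypothetical helper: print the bucket index for an id and a bucket count.
# Assumption for illustration: F(x) = x for a non-negative integer id.
bucket_for() {
  echo $(( $1 % $2 ))
}

# Show where ids 1..6 would land when 3 buckets are requested.
for id in 1 2 3 4 5 6; do
  echo "id=$id -> bucket $(bucket_for "$id" 3)"
done
```

With 3 buckets, ids 3 and 6 land in bucket 0, ids 1 and 4 in bucket 1, and ids 2 and 5 in bucket 2, so rows with the same id always end up in the same bucket.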






Implement Bucketing


In the Hadoop sandbox terminal:

  • Create one file with at least one integer column
    • e.g. stud, with an id column and other columns






  • Copy the file to HDFS
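A minimal sketch of these two steps, assuming the file is named databucket.csv (the name used in the load statement later in this demo) and that the `hdfs` command is on the PATH in the sandbox. The rows are made-up sample data, not the original file's contents.

```shell
# Sample input file; the rows below are made-up illustrative data.
# Columns: id,name,age,sal (matching the emp table in this demo).
cat > databucket.csv <<'EOF'
1,amit,25,30000
2,bina,31,42000
3,chen,28,38000
4,dev,35,51000
5,esha,29,40000
6,farid,41,60000
7,gita,33,47000
8,hari,27,36000
EOF

# Copy the file to the current user's HDFS home directory.
# Guarded so the snippet does nothing on a machine without the hdfs CLI.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -put -f databucket.csv databucket.csv
  hdfs dfs -ls databucket.csv
fi
```

The relative HDFS path resolves to the user's HDFS home directory, which is also where a relative path in `load data inpath` is looked up.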




Open the Hive terminal using the hive command:

  • Create a database and switch to it
    • create database bucket;
    • use bucket; 




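  • Enable hive.enforce.bucketing (needed on Hive 1.x and earlier; from Hive 2.0 onward bucketing is always enforced and this property is no longer required)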
    • set hive.enforce.bucketing = true; 



  • Create the Hive tables and load the data

    • Create table emp and load the file 'databucket.csv'
      • create table emp(id int, name string, age int, sal int) row format delimited fields terminated by ',';
      • load data inpath 'databucket.csv' into table emp;





    • Create table emp_bucket with bucketing on the id column
      • create table emp_bucket(id int, name string, age int, sal int) clustered by (id) into 4 buckets row format delimited  fields terminated by ',' ;
    • Load emp_bucket table from emp table
      • insert into table emp_bucket select id, name, age, sal  from emp ; 






Conclusion: Hive creates 4 files for the bucketed table, one per bucket.
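One way to check this from the sandbox terminal is to list the table's directory in HDFS. The path below is an assumption based on Hive's default warehouse layout (/user/hive/warehouse/<database>.db/<table>); some sandboxes use a different warehouse directory, and running describe formatted emp_bucket; in Hive shows the actual location.

```shell
# Assumed table location: Hive's default warehouse directory layout.
# Override WAREHOUSE_DIR if your sandbox stores tables elsewhere.
WAREHOUSE_DIR="${WAREHOUSE_DIR:-/user/hive/warehouse}"
TABLE_DIR="$WAREHOUSE_DIR/bucket.db/emp_bucket"

# List the bucket files; guarded so it is a no-op without the hdfs CLI.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -ls "$TABLE_DIR"
fi
```

The listing should show one file per bucket, typically with names such as 000000_0 through 000003_0.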





