Implement Bucketing in Hive
What is Bucketing
- Bucketing in Hive is a data organization technique.
- It is also an I/O performance tuning technique.
- It is similar to partitioning in Hive, except that it divides a large dataset into a fixed number of more manageable parts known as buckets.
- We can use bucketing in Hive when partitioning becomes difficult to implement, for example when a column has too many distinct values to partition on.
- We can also divide partitions further into buckets.
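As a rough sketch of that last point, a table can be partitioned on one column and bucketed on another. The table name emp_part_bucket and the partition column dept are hypothetical and not part of this walkthrough; only the bucketing on id into 4 buckets mirrors the example further down.

```sql
-- Hypothetical table: one directory per dept value (the partition),
-- and inside each partition the rows are split into 4 bucket files by id.
create table emp_part_bucket(id int, name string, age int, sal int)
partitioned by (dept string)
clustered by (id) into 4 buckets
row format delimited fields terminated by ',';
```

With a layout like this, a query that filters on dept only reads the matching partition directory, while the buckets keep the files inside each partition at a predictable count.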
Bucketing Logic:
- The concept of bucketing is based on the hashing technique.
- Here, the hash of the bucketing column value is taken modulo the number of required buckets (say, F(x) % 3).
- Based on the resulting value, each row is stored in the corresponding bucket.
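To get a feel for this calculation, Hive's built-in hash() and pmod() functions can be combined to compute a bucket number by hand. This is only an approximation of what Hive does internally: the exact hash function depends on the Hive version and table settings, so the result may not match the real bucket file on every setup. The literal 10 and the bucket count 4 are arbitrary sample values.

```sql
-- hash(x) plays the role of F(x); pmod keeps the result non-negative.
-- (Very old Hive versions may require a FROM clause here.)
select pmod(hash(10), 4) as bucket_no;

-- The same idea applied to every row, once the emp table from the steps below exists.
select id, pmod(hash(id), 4) as bucket_no from emp;
```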
Implement Bucketing
In the Hadoop sandbox terminal:
- Copy the files to HDFS
- Create a database and switch to it
- create database bucket;
- use bucket;
- Enable hive.enforce.bucketing (needed on Hive versions before 2.0; later versions always enforce bucketing)
- set hive.enforce.bucketing = true;
- Create Hive table emp and load the 'databucket.csv' file
- create table emp(id int, name string, age int, sal int) row format delimited fields terminated by ',' ;
- load data inpath 'databucket.csv' into table emp;
- Create table emp_bucket with bucketing on the id column
- create table emp_bucket(id int, name string, age int, sal int) clustered by (id) into 4 buckets row format delimited fields terminated by ',' ;
- Load emp_bucket table from emp table
- insert into table emp_bucket select id, name, age, sal from emp ;
Conclusion: We can see 4 files created for the bucketed table, one per bucket.
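One way to check this result, sketched here on the assumption that the statements are run from the Hive CLI in the same database: describe formatted prints the table's HDFS location along with its bucket count and bucket columns, and tablesample reads a single bucket. The path in the dfs command is a placeholder, to be replaced with the Location value that describe formatted reports.

```sql
-- Shows Location, Num Buckets and Bucket Columns for the table.
describe formatted emp_bucket;

-- List the bucket files; <table-location> is a placeholder for the Location printed above.
dfs -ls <table-location>;

-- Read only the rows that fall into the first of the 4 buckets.
select * from emp_bucket tablesample(bucket 1 out of 4 on id);
```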