Bucketing & Hive - Demo

Implement Bucketing in Hive

What is Bucketing

Bucketing in Hive is a data organization technique.

It is also an I/O performance tuning technique.

It is similar to partitioning in Hive, with the added functionality that it divides large datasets into a fixed number of more manageable parts known as buckets.

We can use bucketing when partitioning becomes difficult, for example when a column has so many distinct values that partitioning on it would create too many small partitions.

We can also divide partitions further into buckets.
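
As a sketch of dividing partitions into buckets, a table can be partitioned and bucketed at the same time. The table name emp_part_bucket and the partition column dept are hypothetical and only illustrate the syntax; the other columns follow the emp table used in this demo.

```sql
-- Hypothetical table: each dept partition is further divided into 4 buckets by id.
create table emp_part_bucket(id int, name string, age int, sal int)
partitioned by (dept string)
clustered by (id) into 4 buckets
row format delimited fields terminated by ',';
```

With this layout, each partition directory would hold its own set of bucket files.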

Bucketing Logic:

The concept of bucketing is based on the hashing technique.

Here, the hash of the column value modulo the required number of buckets is calculated (say, F(x) % 3).

Based on the resulting value, each row is stored in the corresponding bucket.
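
The calculation can be tried directly in Hive. This only approximates what Hive does internally: hash() and pmod() are built-in functions, but the exact hash used for bucketing depends on the Hive version and the column type. The sample values 1 to 6 are made up for illustration.

```sql
-- For each sample value x, compute hash(x) modulo 3 buckets.
select x, pmod(hash(x), 3) as bucket_no
from (select explode(array(1, 2, 3, 4, 5, 6)) as x) t;
```

Values that get the same bucket_no would be stored in the same bucket.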

Implement Bucketing

In the Hadoop sandbox terminal:

Create one file with at least one integer column

For example, stud with id and other columns. This demo uses databucket.csv with the columns id, name, age, sal.

Copy the file to HDFS

Open the Hive terminal with the hive command

Create database and change scope to the created database

create database bucket;
use bucket;

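Enable hive.enforce.bucketing (on Hive 2.0 and later this property was removed because bucketing is always enforced, so this step can be skipped)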

set hive.enforce.bucketing = true;

Create table emp and load the file 'databucket.csv'

create table emp(id int, name string, age int, sal int) row format delimited fields terminated by ',' ;
load data inpath 'databucket.csv' into table emp;

Create table emp_bucket with bucketing on the id column

create table emp_bucket(id int, name string, age int, sal int) clustered by (id) into 4 buckets row format delimited fields terminated by ',' ;
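
To check that the table was created as a bucketed table, its metadata can be inspected. This is a minimal check, assuming the emp_bucket table above exists in the current database.

```sql
-- Look for "Num Buckets" and "Bucket Columns" in the output.
describe formatted emp_bucket;
```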

Load emp_bucket table from emp table

insert into table emp_bucket select id, name, age, sal from emp ;

Conclusion: We can see 4 files created for the bucketed table emp_bucket, one per bucket.
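
The files can be listed from the Hive terminal with the dfs command. The path below assumes the default warehouse location /user/hive/warehouse; the actual location on your sandbox may differ. The second statement is a sketch of reading a single bucket with tablesample.

```sql
-- List the files of the bucketed table (the path is an assumption, adjust as needed).
dfs -ls /user/hive/warehouse/bucket.db/emp_bucket;

-- Read only the rows that fall into the first of the 4 buckets.
select * from emp_bucket tablesample(bucket 1 out of 4 on id);
```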
