Hive Partitioning in Big Data

 Partition is horizontally dividing the data into number of slice in equal and manageable manner.

Every partition is stored as directory within data warehouse table.

     In data warehouse, this partition concept is common, but there are two types of Partitions are available in data warehouse concepts.

            1. SQL Partition

            2. Hive Partition


To more information visit OnlineITGuru big data hadoop course tutorial blog

1. SQL Partition:-

      The main work of SQL Partition is dividing the large amount of data into number of slices by based on table single column only.

       SQL Partition is very hard comparing to Hive Partition because SQL server only support one column for Partition.


2. Hive Partition:-

       The main work of Hive Partition is also same as SQL Partition, but the main difference between SQL Partition and Hive Partition is SQL Partition only supports single column in table. While in Hive Partition, it supports multiple columns in a table.


Using Partitions can make it faster to do queries on slices of the data.

In Hive Partition concept there is two different type of partitions:-

     i. Static Partition

    ii. Dynamic Partition


i. Static Partition

   In Static Partition, we statically add a partition in table and move the file into the partition of the table. We need to specify the partition column value in each and every LOAD statement.

Creation of Partitioned Table:-

      Syntax:-

             CREATE TABLE TableName(ColName1 DataType1, ColName2 DataType2)

             PARTITIONED BY (PartitionColumn1 DataType2, PartitionColumn2 DataType2)

             ROW FORMAT DELIMITED

             FIELDS TERMINATED BY 'delimiter character'

             STORED AS <fileformat>;

hive (hivedb)> create table statictable (id int,name string,salary int)
             > partitioned by (city string)
             > row format delimited
             > fields terminated by '\t'
             > stored as textfile;
OK
Time taken: 23.871 seconds


In description, we can see the Partitioned column "city"

hive (hivedb)> describe statictable;
OK
id                      int                                         
name                    string                                      
salary                  int                                         
city                    string                                      
          
# Partition Information          
# col_name                data_type               comment             
          
city                    string                                      
Time taken: 1.498 seconds, Fetched: 9 row(s)


Loading the data in the table:- 

hive (hivedb)> load data local inpath '/home/hduser/HIVE/data1.log' into table statictable PARTITION(city= 'Pune');
Loading data to table hivedb.statictable partition (city=Pune)
OK
Time taken: 15.667 seconds
hive (hivedb)> load data local inpath '/home/hduser/HIVE/data2.log' into table statictable PARTITION(city= 'Mumbai');
Loading data to table hivedb.statictable partition (city=Mumbai)
OK
Time taken: 2.727 seconds
hive (hivedb)> load data local inpath '/home/hduser/HIVE/data3.log' into table statictable PARTITION(city= 'Nagpur');
Loading data to table hivedb.statictable partition (city=Nagpur)
OK
Time taken: 1.76 seconds


To view the created Partitions:-

hive (hivedb)> show partitions statictable;
OK
city=Mumbai
city=Nagpur
city=Pune
Time taken: 0.939 seconds, Fetched: 3 row(s)



To view the content of the table:-

hive (hivedb)> select * from  statictable;
OK
200    Sonu    5000    Mumbai
201    Nik    3000    Mumbai
202    Sam    4000    Mumbai
203    Chotu    4000    Mumbai
.
.
.
224    Suruchi    42500    Pune
225    Vaibhav    28500    Pune
226    Kaustubh    57500    Pune
227    Jitesh    33500    Pune
228    Mayur    44500    Pune
229    Mohit    41500    Pune
Time taken: 0.677 seconds, Fetched: 80 row(s)


Adding a Partition:-

     Syntax:- ALTER TABLE TableName ADD PARTITION PartitionName;

hive (hivedb)> alter table statictable ADD PARTITION (city='Chennai');
OK
Time taken: 2.483 seconds
hive (hivedb)> show partitions statictable;
OK
city=Chennai
city=Mumbai
city=Nagpur
city=Pune
Time taken: 0.561 seconds, Fetched: 4 row(s)

New Partition is added "city=Chennai"



Renaming a Partition:-

 Syntax:- ALTER TABLE PARTITION ExistingPartition RENAME TO PARTITION NewPartition

hive (hivedb)> alter table statictable PARTITION (city='Chennai') RENAME TO PARTITION (city='Indore');
OK
Time taken: 2.964 seconds
hive (hivedb)> show partitions statictable;
OK
city=Indore
city=Mumbai
city=Nagpur
city=Pune
Time taken: 0.549 seconds, Fetched: 4 row(s)

Partition is renamed from Chennai to Indore



Dropping a Partition:-

     Syntax:- ALTER TABLE TableName DROP PARTITION PartitionName;

hive (hivedb)> alter table statictable drop PARTITION (city='Indore');
Dropped the partition city=Indore
OK
Time taken: 5.042 seconds
hive (hivedb)> show partitions statictable;
OK
city=Mumbai
city=Nagpur
city=Pune
Time taken: 0.463 seconds, Fetched: 3 row(s)


Creation of Partition on multiple columns:-

hive (hivedb)> create table multiplepartitiontable (id int,name string,salary int)
             > partitioned by (city string,dated string)
             > row format delimited
             > fields terminated by '\t'
             > stored as textfile;
OK
Time taken: 0.637 seconds


In description, we can see the Partitioned column "city and dated" columns

hive (hivedb)> describe multiplepartitiontable;
OK
id                      int                                         
name                    string                                      
salary                  int                                         
city                    string                                      
dated                   string                                      
          
# Partition Information          
# col_name                data_type               comment             
          
city                    string                                      
dated                   string                                      
Time taken: 0.137 seconds, Fetched: 11 row(s)


Loading the data in the table:- 

hive (hivedb)> load data local inpath '/home/hduser/HIVE/data1.log' into table multiplepartitiontable PARTITION(city= 'Pune',dated='01/02/2017');
Loading data to table hivedb.multiplepartitiontable partition (city=Pune, dated=01/02/2017)
OK
Time taken: 1.88 seconds
hive (hivedb)> load data local inpath '/home/hduser/HIVE/data2.log' into table multiplepartitiontable PARTITION(city= 'Pune',dated='02/02/2017');
Loading data to table hivedb.multiplepartitiontable partition (city=Pune, dated=02/02/2017)
OK
Time taken: 1.478 seconds
hive (hivedb)> load data local inpath '/home/hduser/HIVE/data3.log' into table multiplepartitiontable PARTITION(city= 'Nagpur',dated='03/02/2017');
Loading data to table hivedb.multiplepartitiontable partition (city=Nagpur, dated=03/02/2017)
OK
Time taken: 1.394 seconds
hive (hivedb)> load data local inpath '/home/hduser/HIVE/data2.log' into table multiplepartitiontable PARTITION(city= 'Mumbai',dated='03/02/2017');
Loading data to table hivedb.multiplepartitiontable partition (city=Mumbai, dated=03/02/2017)
OK
Time taken: 1.466 seconds


To see the created Partitions:-

hive (hivedb)> show partitions multiplepartitiontable;
OK
city=Mumbai/dated=03%2F02%2F2017
city=Nagpur/dated=03%2F02%2F2017
city=Pune/dated=01%2F02%2F2017
city=Pune/dated=02%2F02%2F2017
Time taken: 0.198 seconds, Fetched: 4 row(s)


To view the content of the table:-

hive (hivedb)> select * from  multiplepartitiontable;
OK
200    Sonu    5000    Mumbai    03/02/2017
201    Nik    3000    Mumbai    03/02/2017
202    Sam    4000    Mumbai    03/02/2017
203    Chotu    4000    Mumbai    03/02/2017
204    Aju    5000    Mumbai    03/02/2017
.
.
.
.
304    Chiku    1050    Pune    02/02/2017
305    Monu    3550    Pune    02/02/2017
306    Chutki    5550    Pune    02/02/2017
307    Tunu    3550    Pune    02/02/2017
308    Laddu    4350    Pune    02/02/2017
309    Tinku    4550    Pune    02/02/2017
310    Piku    2450    Pune    02/02/2017
Time taken: 4.57 seconds, Fetched: 101 row(s)


To learn complete course visit ITGuru big data online training Blog.

No comments:

Powered by Blogger.