HIVE STORAGE FORMATS in Hadoop

In this post we will take a look at the different storage file formats and record formats in Hive.


Before we move forward, let's briefly discuss Apache Hive.

Apache Hive is a data warehouse system for Hadoop, first created at Facebook. It facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a means to project structure onto this data and to query it using a SQL-like language called HiveQL.

  

Among the different storage file formats used in Hive, the default and simplest is TEXTFILE.


TEXTFILE 

The data in a TEXTFILE is stored as plain text, one line per record. TEXTFILE is very useful for sharing data with other tools, and also when you want to edit the data in the file manually. However, TEXTFILE is less efficient than the other formats in both storage space and query speed.


SYNTAX :


CREATE TABLE TEXTFILE_TABLE (

COLUMN1 STRING,
COLUMN2 STRING,
COLUMN3 INT,
COLUMN4 INT

) STORED AS TEXTFILE;
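
Since this post also touches on record formats, here is a minimal sketch of a TEXTFILE table with an explicit record format, followed by a load from a local file. The table name textfile_csv_table and the path /tmp/sample.csv are made up for illustration; adjust the delimiter to whatever your data actually uses.

```sql
-- Hypothetical table: same columns as above, but with an explicit
-- record format so Hive knows how to split each line into fields.
CREATE TABLE textfile_csv_table (
  column1 STRING,
  column2 STRING,
  column3 INT,
  column4 INT
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- Hypothetical path: copies a plain comma-separated file into the table.
LOAD DATA LOCAL INPATH '/tmp/sample.csv' INTO TABLE textfile_csv_table;
```

Because the file is stored as is, you can still open it in any text editor or hand it to another tool afterwards.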



SEQUENCE FILE

In sequence files, data is stored in a binary format consisting of binary key-value pairs, with a complete row stored as a single binary value. Sequence files are more compact than text and fit the MapReduce output format well. They support compression at either the record (value) level or the block level, which improves their I/O profile further.

SEQUENCEFILE is a standard format that is supported by Hadoop itself, and it is a good choice for Hive table storage, especially when you want to integrate Hive with other technologies in the Hadoop ecosystem.

The STORED AS SEQUENCEFILE clause lets you create a table backed by sequence files. Here is an example statement:




 CREATE TABLE SEQUENCEFILE_TABLE (

COLUMN1 STRING,

COLUMN2 STRING,

COLUMN3 INT,

COLUMN4 INT

) STORED AS SEQUENCEFILE;

Due to the complexity of reading sequence files, they are often only used for “in flight” data such as intermediate data storage used within a sequence of MapReduce jobs.
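
To try out the block compression mentioned above, something along these lines should work. Treat it as a sketch: it assumes a Hadoop 2 style cluster (older releases use the mapred.* property names instead), that the Gzip codec is acceptable for your data, and that TEXTFILE_TABLE from the earlier example already holds some rows.

```sql
-- Ask Hive to compress the files it writes for this session.
SET hive.exec.compress.output=true;

-- Compress whole blocks of records rather than each record separately.
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;

-- Codec choice is an assumption; pick one that is installed on your cluster.
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Rewrite the text data as block-compressed sequence files.
INSERT OVERWRITE TABLE SEQUENCEFILE_TABLE
SELECT COLUMN1, COLUMN2, COLUMN3, COLUMN4
FROM TEXTFILE_TABLE;
```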


RCFILE OR RECORD COLUMNAR FILE


The RCFILE is another file format that can be used with Hive. It stores the columns of a table in a record columnar layout rather than a row-oriented one, which provides considerable compression and query performance benefits along with highly efficient use of storage space. Hive added the RCFile format in version 0.6.0.


The RCFile format is most useful when tables have a large number of columns but only a few of them are typically retrieved.

  

The RCFile combines multiple functions to provide the following features:

  • Fast data loading
  • Improved query processing
  • Optimized storage space utilization
  • Adaptivity to dynamic data access patterns

SYNTAX:


CREATE TABLE RCFILE_TABLE (

COLUMN1 STRING,

COLUMN2 STRING,

COLUMN3 INT,

COLUMN4 INT ) STORED AS RCFILE;


A compressed RCFile reduces I/O and storage significantly compared with text files, sequence files, and other row-oriented formats. Compressing on a per-column basis is more efficient because it can take advantage of the similarity of the data within a column.
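
As a rough illustration of the "many columns stored, few retrieved" case, the sketch below fills the RCFile table from the text table and then runs a query that touches only two of the four columns. It assumes TEXTFILE_TABLE from the earlier example exists and contains data.

```sql
-- Convert the row-oriented text data into the columnar RCFile layout.
INSERT OVERWRITE TABLE RCFILE_TABLE
SELECT COLUMN1, COLUMN2, COLUMN3, COLUMN4
FROM TEXTFILE_TABLE;

-- Only COLUMN1 and COLUMN3 are referenced, so a columnar reader
-- can skip the stored data for COLUMN2 and COLUMN4.
SELECT COLUMN1, SUM(COLUMN3) AS total_column3
FROM RCFILE_TABLE
GROUP BY COLUMN1;
```

On a table this narrow the difference is small; the benefit grows as the number of unused columns increases.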



ORC FILE OR OPTIMIZED ROW COLUMNAR FILE

ORCFILE stands for Optimized Row Columnar File. It is a newer Hive file format that was created to provide many advantages over the RCFILE format when processing data. The ORC file format was introduced in Hive 0.11 and cannot be used with earlier versions.


Lightweight indexes are included in each ORC file to improve performance.

ORC also uses specific encoders for different column data types to improve compression further, e.g. variable-length encoding for integers.

ORC stores collections of rows (called stripes) in one file, and within each collection the row data is stored in a columnar format. This allows row collections to be processed in parallel across a cluster.


ORC files compress better than RC files, enabling faster queries. To use ORC, just add STORED AS orc to the end of your CREATE TABLE statement, like this:


CREATE TABLE mytable (

COLUMN1 STRING,

COLUMN2 STRING,

COLUMN3 INT,

COLUMN4 INT

) STORED AS orc;
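
If you want to control the compression and the lightweight indexes described above, ORC accepts table properties. The following is only a sketch: the table name orc_table_tuned is made up, and the property values are examples rather than recommendations.

```sql
-- Hypothetical table showing two commonly used ORC table properties.
CREATE TABLE orc_table_tuned (
  COLUMN1 STRING,
  COLUMN2 STRING,
  COLUMN3 INT,
  COLUMN4 INT
)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress" = "SNAPPY",     -- compression codec: NONE, ZLIB or SNAPPY
  "orc.create.index" = "true"    -- write the lightweight row indexes
);

-- Populate it from the ORC table created above.
INSERT OVERWRITE TABLE orc_table_tuned
SELECT COLUMN1, COLUMN2, COLUMN3, COLUMN4
FROM mytable;
```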

