The concept of partitioning in HIVE is to divide a table into parts based on the value of one or more columns.
For example, if we want to search for a particular city (say Bangalore) in our Employee table, we write a SQL query with a WHERE clause, as shown below:
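(assuming our existing, non-partitioned table is simply named employee)

SELECT * FROM employee WHERE city = 'Bangalore';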
Now, the entire table has to be scanned to get this information. If the data were small that would be fine, but HIVE is meant for huge data. So what we can do is partition the table on the city column. All Bangalore-related data would then go into the 'Bangalore' partition.
Similarly, all Delhi-related data would go into a separate 'Delhi' partition.
Now if the same query is triggered, HIVE has it easy: it can go straight to that particular partition and get the data without scanning the entire table.
Let us create a new Employee table using the partitioning concept.
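The column types and the comma-delimited row format below are assumptions for illustration; the essential part is the PARTITIONED BY clause.

CREATE TABLE new_employee (
    emp_id INT,
    name   STRING,
    salary INT
)
PARTITIONED BY (city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';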
So, we have created a new table named new_employee with the columns emp_id, name and salary. The fourth column, 'city', is declared as the partition column.
Now let's say we have a file named employee.csv at the '/data/mypath/employee.csv' location in HDFS. To load the data into the partitioned table, we use the command below:
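Assuming this particular file holds the Bangalore records, a static-partition load looks like this (note that LOAD DATA INPATH moves the file from its original HDFS location into the table's warehouse directory):

LOAD DATA INPATH '/data/mypath/employee.csv'
INTO TABLE new_employee
PARTITION (city = 'Bangalore');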
So, what HIVE does is create a directory 'new_employee' in its warehouse, and inside it a subdirectory named 'city=Bangalore'. All the data related to the city Bangalore lies inside this subdirectory:
apps/hive/warehouse/new_employee/city=Bangalore
And yes, the column name along with its value, 'city=Bangalore', becomes a subdirectory.
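We can confirm this from HIVE itself; SHOW PARTITIONS lists one entry per such subdirectory (the output below assumes only the Bangalore data has been loaded so far):

SHOW PARTITIONS new_employee;
-- city=Bangalore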
In the above partitioned table 'new_employee' we have a partition column called 'city'. Although it is a partition column, it looks like a normal column to us. If we run:
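DESCRIBE new_employee;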
The output would look something like this (with the column types from the CREATE TABLE above):
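emp_id        int
name          string
salary        int
city          string

# Partition Information
# col_name    data_type
city          string

Notice that 'city' appears in the column list just like the other columns, and again under the Partition Information section.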
Also if we trigger a SELECT query, the city column behaves like any normal column.
SELECT emp_id, name, salary, city FROM new_employee WHERE city = 'Bangalore';
So to us it looks like a normal column, but with this technique HIVE can fetch the data much faster, since it only has to scan the relevant partition.