Whenever we define a variable in HIVE, we have to tell HIVE what kind of value is it going to store inside it. i.e. the value inside the variable can be a word or a sentence or a number.
Data Types are the way to tell a variable, what type of value is it going to store.
Below are the classification of data types in HIVE :
The data type which holds numbers.
The complex data types supported in HIVE are Array, Map and Struct. These complex data types can combine primitive data types and provide a collection of data.
Say for example a primitive data type (ex. string) can hold a name but a complex data type is capable of holding a group of strings. i.e. it can hold a group of names.
An Array in HIVE is used to hold a collection of primitive data types.
Say we are going to create a table which contains a group of students belonging to a particular city. Below is the example :
In the above example we have created a table 'student_location' containing two variables 'city' of string type and 'names' of array type. But if you notice the array variable has a string variable attached to it (i.e. array<string> ). Which means the 'names' variable will contain not just contain one name but a group of names.
So, the above table will contain a city and a group of students belonging to that city.
After the data is inserted if we run the select statement, the below output is produced.
select * from student_location;
city | names |
---|---|
Hydrabad | ["Vinay","Ajay","Darshil"] |
Delhi | ["Sahil","Peter"] |
So, as we can see above each row contains a 'city' and a group of 'names' associated to that city.
A Map in HIVE is used to hold a collection of data in the form of key and value pair. Where the 'key' should be of a primitive data type and the 'value' can be either a primitive or a complex data type.
Think of the key and value pair as a locker in a bank. There are so many lockers and each person owning the locker is given a key. Same concept applies for a map in HIVE. While inserting the value in a table, you insert both the key and value (same way while opening a locker you are given a key and the locker). And while retrieving using 'select' statement you just specify the key and the value will be retrieved for you.
Let's look at the below example to get a clear picture.
Now we are going to change the 'student_location' table in the 'array' example and redefine such that the 'names' column along with the name also contains the roll number of the students.
In the above table we have 'names map <int,string> '. If you see closely, the 'map <int,string> ' has two variables associated to it. Where 'int' is the key (Which is the roll number of a student) and the 'string' is the value (Which is the name of the student). We will be explaining in a little while what the last three line means.
Now, we will be taking the data from a file(say csv file) and load the data to the 'student_location' table.
As we can see in the above student.csv file, every field is separated by comma(,). Where 'Hydrabad' is the first field and '1:Vinay*3:Ajay*5:Darshil' is the second field which is actually a map. The map actually contains three fields
1:Vinay
3:Ajay
5:Darshil
each separated by an asterix(*). Where '1' is the key and 'Vinay' is the value. And the key and value is separated by colon(:).
Now match with the last three lines of the 'CREATE TABLE' statement. It's self explanatory.
The 'load data' command is used to load data from the csv file.
LOAD DATA LOCAL INPATH 'student.csv' INTO TABLE student_location;
And all the data is loaded into the 'student_location' table.
SELECT * FROM student_location;
city | names |
---|---|
Hydrabad | {1:"Vinay",3:"Ajay",5:"Darshil"} |
Delhi | {2:"Sahil",4:"Peter"} |
As we can see the 'names' column returns the map containing key and value pair.
We can also specify the 'key' in the 'select statement' to retrieve it's value.
SELECT city, names['3'] FROM student_location;
city | c1 |
---|---|
Hydrabad | Ajay |
In the 'select statement' we have specified 'names['3']' where '3' is the key which has the value 'Ajay' associated to it.
So we have seen, how can we retrieve the value along with the key.
A struct data type in HIVE is used to hold one or more data types inside it. It is like a capsule holding various elements inside it.
Now we are going to change the 'student_location' completely where we will be having two fields 'name' i.e. name of the student and 'address' which will have three fields i.e. city, state, pin_code.
As we can see above the 'address' field is of 'struct <city:string,state:string,pin_code string>
Now, we will be taking the data from a file(say csv file) and load the data to the 'student_location' table.
As we can see in the above student.csv file, every field is separated by comma(,). Where 'Vinay' is the first field and 'Hydrabad*Telengana*500013' is the second field which is actually a struct data type. The struct actually contains three variables:
Hydrabad
Telengana
500013
each separated by an asterix(*). Where each one is a city, state and pin_code clubbed inside the struct data type just like the medicine put inside a capsule.
Now match it with the last two lines of the 'CREATE' statement. It says each element in the row will be separated by a comma(,) and each element of the struct field will be separated by an asterisk(*).
The 'load data' command is used to load data from the csv file.
LOAD DATA LOCAL INPATH 'student.csv' INTO TABLE student_location;
And all the data is loaded into the 'student_location' table.
SELECT * FROM student_location;
name | address |
---|---|
Vinay | {"city":"Hydrabad","state":"Telengana","pin_code":"500013"} |
Sahil | {"city":"Delhi","state":"Delhi","pin_code":"110005"} |
As we can see above the address column returns the contents of struct data type in the form of variable name along with it's values.
We can also specify the variable name of the struct data type (ex. city, state, pin_code) in the 'select statement' to retrieve it's value.
SELECT name, address.city FROM student_location;
name | city |
---|---|
Vinay | Hydrabad |
Sahil | Delhi |
In the 'select statement' we have specified 'address.city' where city is the variable of the struct datatype. It can be accessed by a dot(.).