Hive is a data warehouse infrastructure tool used to process structured data in Hadoop. It resides on top of Hadoop, is used to summarize Big Data, and makes querying and analyzing easy.
Hive supports the following file formats: TextFile, SequenceFile, RCFile, AVRO, ORC, and Parquet.
It is the central repository of Apache Hive metadata. It stores metadata for Hive tables (such as their schema and location) and partitions in a relational database, and provides client access to this information through the metastore service API.
By default, Hive uses an embedded Derby database instance backed by the local disk for the metastore. This is called the embedded metastore configuration.
Apache Hive provides various built-in operators for data operations on the tables present inside the Apache Hive warehouse.
Hive operators are used for mathematical operations on operands. Each returns a specific value as per the logic applied.
Types of Hive operators are:
Relational Operators
Arithmetic Operators
Logical Operators
Operators on Complex Types
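As a sketch, a single query can combine several operator types; the table and column names here (employees, name, salary, dept) are hypothetical:

```sql
SELECT name,
       salary + 5000 AS projected_salary   -- arithmetic operator (+)
FROM   employees
WHERE  salary > 50000                      -- relational operator (>)
  AND  dept = 'engineering';               -- logical (AND) with relational (=)
```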
For decomposing table data sets into more manageable parts, Apache Hive offers another technique, called bucketing.
In Hive, tables or partitions are subdivided into buckets based on the hash of a column in the table, to give extra structure to the data that may be used for more efficient queries.
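A minimal sketch of a bucketed table, with hypothetical table and column names; rows are assigned to one of 4 buckets by the hash of user_id:

```sql
CREATE TABLE bucketed_users (user_id INT, name STRING)
CLUSTERED BY (user_id) INTO 4 BUCKETS
STORED AS ORC;

-- In older Hive versions, bucketing must be enforced on insert:
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE bucketed_users
SELECT user_id, name FROM users;
```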
Internal table (managed table): a managed table is also known as an internal table. This is the default table type in Hive: when a user creates a table without specifying it as external, a managed table is created.
If we create a table as a managed table, the table will be created in a specific location in HDFS.
By default, the table data will be stored in the /user/hive/warehouse directory of HDFS.
External tables: an external table is mostly created when the data is also used outside Hive. Whenever we want to delete the table’s metadata but keep the table’s data as it is, we use an external table: dropping an external table only deletes the schema of the table.
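The difference can be sketched with two DDL statements (table names and the /data/logs path are hypothetical):

```sql
-- Managed table: dropping it deletes both metadata and data.
CREATE TABLE managed_logs (id INT, msg STRING);

-- External table: dropping it deletes only the metadata;
-- the files under the LOCATION remain in HDFS.
CREATE EXTERNAL TABLE external_logs (id INT, msg STRING)
LOCATION '/data/logs';

DROP TABLE managed_logs;   -- data under /user/hive/warehouse is removed
DROP TABLE external_logs;  -- files under /data/logs are untouched
```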
There are three ways of Interacting with Hive:
I. Hive Thrift Client:
Basically, with any programming language that supports thrift, we can interact with HIVE.
II. JDBC Driver:
To connect to the Hive server, the Beeline CLI uses the JDBC driver.
III. ODBC Driver:
Also, any application that supports ODBC can use the ODBC driver to connect to the Hive server.
A SerDe is a short name for Serializer/Deserializer. Hive uses a SerDe to read and write data from tables. An important concept behind Hive is that it does NOT own the format of the data stored in the Hadoop file system. Users are able to write files to HDFS with whatever tools/mechanisms they like ("CREATE EXTERNAL TABLE" or "LOAD DATA INPATH"), and use Hive to correctly "parse" that file format in a way that can be used by Hive. A SerDe is a powerful (and customizable) mechanism that Hive uses to "parse" data stored in HDFS.
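As a sketch, a table can be declared with an explicit SerDe; here the built-in OpenCSVSerde shipped with Hive is used, while the table name, columns, and location are hypothetical:

```sql
CREATE EXTERNAL TABLE csv_events (event_id STRING, payload STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
LOCATION '/data/csv_events';
```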
Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns. ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in the memory, including:
Instance of a Java class (Thrift or native Java)
A standard Java object (java.util.List is used to represent Struct and Array, and java.util.Map to represent Map)
A lazily-initialized object
Yes, with the help of the LOCATION keyword, we can change the default location of a managed table while creating it in Hive. To do so, the user specifies the desired storage path of the managed table as the value of the LOCATION keyword.
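A minimal sketch, with a hypothetical table name and path:

```sql
-- Store the managed table's data at /custom/warehouse/sales
-- instead of the default /user/hive/warehouse location.
CREATE TABLE sales (id INT, amount DOUBLE)
LOCATION '/custom/warehouse/sales';
```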
The following classes are used by Hive to read and write HDFS files:
TextInputFormat / HiveIgnoreKeyTextOutputFormat: read/write data in plain text file format.
SequenceFileInputFormat / SequenceFileOutputFormat: read/write data in Hadoop SequenceFile format.
There are several types of Hive DDL commands we commonly use, such as: CREATE, SHOW, DESCRIBE, USE, DROP, ALTER, and TRUNCATE.
HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools — Pig, MapReduce — to more easily read and write data on the grid.
HCatalog can be used to share data structures with external systems. It provides access to the Hive metastore to users of other tools on Hadoop, so that they can read and write data to the Hive data warehouse.
The main purpose of the Hive Thrift server is to allow access to Hive over a single port. Thrift is a software framework for scalable cross-language services development; it allows clients written in languages including Java, C++, Ruby, and many others to programmatically access Hive remotely.
HiveServer2 is a server interface and part of Hive Services that enables remote clients to execute queries against Hive and retrieve the results. The current implementation (HS2), based on Thrift RPC, is an improved version of HiveServer1 and supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC drivers.
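As a sketch, a remote client can connect to HiveServer2 with the Beeline CLI over JDBC; the host name and user here are assumptions, while 10000 is the usual default HiveServer2 port:

```shell
beeline -u jdbc:hive2://hs2-host:10000/default -n hiveuser
```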
If data is already present in HDFS, the user need not use LOAD DATA, which moves the files into /user/hive/warehouse/. The user simply defines the table using the keyword EXTERNAL, which creates the table definition in the Hive metastore.
CREATE EXTERNAL TABLE table_name (id INT, myfields STRING) LOCATION '/my/location/in/hdfs';
Hive stores metadata information in the metastore using RDBMS instead of HDFS. The main reason for choosing RDBMS is to achieve low latency because HDFS read/write operations are time consuming processes.
Hive organizes tables into partitions for grouping similar types of data together, based on a column or partition key. Each table can have one or more partition keys to identify a particular partition. Physically, a partition is nothing but a sub-directory of the table directory.
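A minimal sketch with hypothetical table and column names; each (dt, country) value pair becomes a sub-directory such as .../logs/dt=2016-03-09/country=US/ under the table directory:

```sql
CREATE TABLE logs (id INT, msg STRING)
PARTITIONED BY (dt STRING, country STRING);
```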
Hive creates a schema and applies it on top of an existing data file, so one can have multiple schemas for one data file. Each schema is saved in the Hive metastore, and the data is not parsed or serialized to disk in a given schema; the schema is applied only when data is retrieved. For example, if the data file has 5 columns (name, job, dob, id, salary), we can define multiple schemas by choosing any number of columns from that list (a table with 3 columns, 5 columns, or 6 columns).
While querying, however, any column specified beyond those actually present in the file will return NULL values.
Whenever we run Hive in embedded mode, it automatically creates the local metastore; before creating it, Hive checks whether a metastore already exists or not. This behavior is defined in the configuration file hive-site.xml by the property
"javax.jdo.option.ConnectionURL", whose default value is "jdbc:derby:;databaseName=metastore_db;create=true".
To change this behavior, set the location to an absolute path, so the metastore will be used from that location.
RLIKE (Right-Like) is a special function in Hive: if any substring of A matches B, it evaluates to true. It follows Java regular expression patterns, and users do not need to put the % symbol for a simple match as they would with LIKE.
Examples:
'Express' RLIKE 'Exp' -> True
'Express' RLIKE '^E.*' -> True (regular expression)
Moreover, RLIKE comes in handy when the string has trailing spaces; it satisfies the required scenario without using the TRIM function. Suppose A has the value ‘Express ’ (with extra spaces) and B has the value ‘Express’. In this situation, RLIKE will match without TRIM.
'Express ' RLIKE 'Express' -> True
Yes, by using hive -e option, we can run any kind of Hive query directly from the terminal without logging into the Hive shell.
Here is an example:
hive -e 'select * from table'
You can also save the output into a file by using the Linux redirection operator ‘>’ as shown below:
hive -e 'select * from table' > /home/acdgild/hiveql_output.tsv
In a local metastore configuration, the metastore service runs in the same JVM in which the Hive service is running and connects to a database running in a separate JVM, either on the same machine or on a remote machine.
In the remote metastore configuration, the metastore service runs on its own separate JVM and not in the Hive service JVM. Other processes communicate with the metastore server using Thrift Network APIs. You can have one or more metastore servers in this case to provide more availability.
There are two ways to know the current database: one temporary (in the CLI) and one persistent.
1) in CLI just enter this command: set hive.cli.print.current.db=true;
2) In hive-site.xml paste this code:
<property>
  <name>hive.cli.print.current.db</name>
  <value>true</value>
</property>
The default metastore configuration allows only one Hive session to be opened at a time for accessing the metastore. Therefore, if multiple clients try to access the metastore at the same time, they will get an error. One has to use a standalone metastore, i.e. a local or remote metastore configuration in Apache Hive, to allow access by multiple clients concurrently.
Following are the steps to configure MySQL database as the local metastore in Apache Hive:
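The key hive-site.xml properties for pointing the metastore at MySQL might look like the following sketch; the host, database name, user, and password are assumptions, and the MySQL JDBC connector JAR must also be placed in Hive's lib directory:

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>
```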
A common way to load data into Hive is to create an external table. You can create an external table that points to an HDFS directory. You can copy an external file into the HDFS location using either of the HDFS commands put or copy. Here, once I create the table named PAGE_VIEW_STG, I use the HDFS put command to load the data into the table.
CREATE EXTERNAL TABLE page_view_stg(
  viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User',
  country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';

hadoop dfs -put /tmp/pv_2016-03-09.txt /user/data/staging/page_view
Note that you can transform the initial data and load it into another Hive table, as shown in this example. The file /tmp/pv_2016-03-09.txt contains the page views served on 9 March 2016. These page views are loaded into the PAGE_VIEW table from the initial staging table named PAGE_VIEW_STG, using the following statement.
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION (dt='2016-03-09', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';
One can use the SequenceFile format which will group these small files together to form a single sequence file. The steps that will be followed in doing so are as follows:
CREATE TABLE temp_table (id INT, name STRING, email STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA INPATH '/input' INTO TABLE temp_table;
CREATE TABLE sample_seqfile (id INT, name STRING, email STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE;
INSERT OVERWRITE TABLE sample_seqfile SELECT * FROM temp_table;
Hence, a single SequenceFile is generated which contains the data present in all of the input files and therefore, the problem of having lots of small files is finally eliminated.
Yes, we can run UNIX shell commands from Hive by using the ! mark before the command. For example, !pwd; at the hive prompt will list the current directory.
We can execute Hive queries from the script files by using the source command.
hive> source /path/to/file/file_with_query.hql