This set of Apache Hive MCQs covers advanced concepts of Apache Hive, including query optimization, ORC files, partition pruning, bucketing, small file handling, Hive indexes, and performance tuning techniques in Hive. Useful for GATE, UGC NET, University Semester Exams, and Big Data Technology preparation.
Topic: Big Data (Apache Hive) | Set: 4
Difficulty: Hard to Medium | Total Questions: 15
Apache Hive MCQs with Answers
Q1. Which operation in Hive is traditionally the most expensive?
A. SELECT
B. JOIN (on large datasets)
C. DESCRIBE
D. SHOW TABLES
View Answer & Explanation
Answer: B
Explanation: Joins require shuffling data across the network, which is the most I/O and network-intensive operation in distributed systems.
Q2. Why are ORC files superior for large queries?
A. They support better compression and use column-level stats to skip stripes
B. They are plain text
C. They are easier to read
D. They use less CPU
View Answer & Explanation
Answer: A
Explanation: Columnar storage allows Hive to read only the columns needed, and striping allows skipping blocks (stripes) that don’t match the query filter.
Q3. How do you handle a “Cartesian Product” in Hive?
A. Hive throws an error by default unless hive.mapred.mode is set to nonstrict
B. It is always allowed
C. It is never allowed
D. It is faster than a join
View Answer & Explanation
Answer: A
Explanation: Cartesian products are usually accidental and highly expensive, so Hive blocks them unless explicitly allowed by the user.
Q4. What is the “Hive Warehouse Connector”?
A. A tool to connect Hive to Spark
B. A tool to connect Hive to a web browser
C. A tool to connect Hive to local files
D. A security tool
View Answer & Explanation
Answer: A
Explanation: The Hive Warehouse Connector (HWC) allows Apache Spark to read and write to Hive Managed/ACID tables efficiently.
Q5. What is the role of hive.stats.autogather?
A. To automatically gather statistics during data loading
B. To delete logs
C. To update the cluster firmware
D. To list users
View Answer & Explanation
Answer: A
Explanation: Enabling this property ensures statistics are updated during INSERT operations, keeping the CBO metadata current.
Q6. When is GROUP BY better than DISTINCT?
A. For counting unique values
B. When data is already sorted
C. For large datasets because GROUP BY can use combiner optimization
D. Never
View Answer & Explanation
Answer: C
Explanation: GROUP BY can utilize map-side combiners to partially aggregate data, whereas DISTINCT often forces more data to reducers.
Q7. What is the “Small File Problem” in HDFS?
A. Large files take too much space
B. Too many small files exhaust the NameNode’s memory due to metadata overhead
C. Small files cannot be read
D. Small files are faster
View Answer & Explanation
Answer: B
Explanation: Every file, block, and directory takes up memory in the NameNode; an excessive number of small files can crash the NameNode.
Q8. How do you merge small files in Hive?
A. Using ALTER TABLE … CONCATENATE (for ORC) or a re-insert
B. Using HDFS delete
C. You cannot merge them
D. Using a shell script
View Answer & Explanation
Answer: A
Explanation: The CONCATENATE command or performing an INSERT OVERWRITE into the same table are common strategies to compact small files into fewer, larger blocks.
Q9. What does set hive.enforce.bucketing=true; do?
A. It forces Hive to respect the number of buckets when writing data
B. It deletes buckets
C. It renames buckets
D. It creates partitions
Answer: A
View Answer & Explanation
Explanation: Without this setting, Hive might ignore the bucket count; setting it to true ensures the output directory structure matches the bucket definition.
Q10. What is a “Hive Index”?
A. A database-style index to speed up column lookups
B. A way to store tables in a specific order
C. A compression format
D. A security feature
View Answer & Explanation
Answer: A
Explanation: Hive indexes can be used to speed up queries, though they are often less popular than partitioning/bucketing due to maintenance overhead.
Q11. Which property controls the “Join” memory limit?
A. hive.auto.convert.join.noconditionaltask.size
B. hive.join.memory.limit
C. mapred.child.java.opts
D. hive.memory.size
View Answer & Explanation
Answer: A
Explanation: This property defines the threshold for the size of a table to be held in memory for a Map-side (Broadcast) join.
Q12. What happens to “unused” partitions in a query?
A. They are read but filtered later
B. Hive uses “Partition Pruning” to skip them entirely, saving I/O
C. They are deleted
D. They cause errors
View Answer & Explanation
Answer: B
Explanation: Partition pruning is a key optimization that ensures only the necessary directory paths are opened during a scan.
Q13. What is the primary benefit of ORC over Parquet for Hive?
A. They are the same
B. ORC is native to Hive and often has better compression/indexing support for Hive-specific features
C. Parquet cannot be used in Hive
D. ORC is plain text
View Answer & Explanation
Answer: B
Explanation: While both are excellent, ORC was specifically designed for Hive, offering tighter integration and metadata support.
Q14. Can Hive queries be scheduled?
A. Only using Cron
B. Yes, using external tools like Apache Oozie or Airflow
C. Hive has a built-in scheduler
D. No, they must be manual
View Answer & Explanation
Answer: B
Explanation: Hive does not have an internal job scheduler, so tools like Oozie, Airflow, or Cron are used for automation.
Q15. What is the “Hive Metastore Service”?
A. A separate Java service that provides the schema interface to the database
B. The database itself
C. A shell script
D. A MapReduce job
View Answer & Explanation
Answer: A
Explanation: The Metastore Service acts as a Thrift-based middleware that allows Hive to communicate with the backing database (MySQL/PostgreSQL) safely.
Conclusion
These Hive MCQs questions covered advanced concepts such as ORC optimization, partition pruning, bucketing enforcement, Hive indexes, small file handling, query optimization, and advanced Hive performance tuning.
Practicing these questions is useful for Big Data Technology, Hadoop ecosystem learning, GATE CS, UGC NET, and university semester examinations.
For better understanding of concepts and architecture, refer Apache Hive.