Hive Metastore (HMS) and hive_metastore in Databricks: The right level of detail
September 19, 2024
Gentrit Mehmeti
Whether you are a junior data engineer or someone with years of experience in the field, the chances of escaping an encounter with Apache Hive are low. Even though it is often overlooked and frequently labelled as outdated, it has earned a lasting place in the world of big data technologies, and sooner or later it will cross your path.
Even though Databricks now offers Unity Catalog as a more advanced solution that replaces the traditional Hive Metastore, it is still crucial to understand the Hive Metastore and its significance in big data technologies.
To fully understand what HMS and hive_metastore in Databricks are, we first need an understanding of the following concepts: Apache Hive, metadata, and catalogs in Databricks.
We are going to explain these concepts one by one, in a simplified manner.
Apache Hive
Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale (fault-tolerant, a term you definitely need to know, means the system keeps working even when individual nodes fail). It is built on top of Apache Hadoop and supports storage on AWS S3, Azure Data Lake Storage, Google Cloud Storage, etc. through the Hadoop file system interface. Hive allows users to read, write, and manage petabytes of data using SQL.
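To make this concrete, here is a minimal sketch of what Hive-style SQL looks like in practice. It uses PySpark with Hive support enabled as the client (a pairing we will return to later); the `sales` table and its columns are hypothetical, invented purely for illustration.

```python
from pyspark.sql import SparkSession

# Enable Hive support so that table metadata is managed through a Hive
# metastore (an embedded local one if nothing else is configured).
spark = (
    SparkSession.builder
    .appName("hive-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Hypothetical table, purely for illustration.
spark.sql("CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)")
spark.sql("INSERT INTO sales VALUES (1, 9.99), (2, 19.99)")

# Plain SQL over distributed storage: the engine resolves `sales` through
# the metastore to find its schema and file locations, then reads the data.
spark.sql("SELECT id, amount FROM sales WHERE amount > 10").show()
```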
Hadoop Distributed File System (HDFS) is a file system that manages large data sets and runs on commodity hardware. But what is Hadoop?
Hadoop (official definition)
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
Now that we’ve explored how Hadoop provides the distributed and fault-tolerant infrastructure for storing and processing large datasets, it’s important to understand how Hive leverages this architecture. While Hadoop’s HDFS handles data storage across clusters, Hive adds another layer of abstraction to simplify how we interact with this data through SQL-like queries.
One key component that makes Hive’s operations efficient and user-friendly is the Metastore. The Metastore plays a vital role in managing the metadata associated with the data stored in HDFS. Let’s delve deeper into what the Metastore is and how it organizes and maintains crucial metadata within the Hive ecosystem.
Metastore and Metadata
Metadata is data about data. A metastore captures and stores the data about the data!
A metastore stores all the structural information in the warehouse: schemas, table definitions, and partitions, including column names and column type information.
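As a sketch of what that means in practice, you can ask the metastore what it knows about a table. This reuses the `spark` session and the hypothetical `sales` table from the earlier example:

```python
# Everything shown here comes from the metastore, not from the data files:
# column names and types, the storage format, and the table's location.
spark.sql("DESCRIBE EXTENDED sales").show(truncate=False)

# The programmatic catalog API reads from the same metastore.
for table in spark.catalog.listTables("default"):
    print(table.name, table.tableType)

for column in spark.catalog.listColumns("sales"):
    print(column.name, column.dataType)
```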
Hive Metastore (HMS), in particular, is a service that stores metadata for Apache Hive and other services in a backend RDBMS, such as MySQL or PostgreSQL, and it works very well with Spark, Presto, and other distributed query engines. Remember that Spark works well with Hive Metastore; it will be very important for Databricks.
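As a hedged sketch of how those pieces connect, here is one way to point a Spark application at a standalone HMS service; the host and URI are placeholders, and in real deployments this setting often lives in hive-site.xml instead:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("remote-hms-sketch")
    # hive.metastore.uris is the standard Hive setting naming the HMS
    # endpoint (9083 is the conventional thrift port). Note that the HMS
    # service, not the client, talks to the backing RDBMS (MySQL, PostgreSQL).
    .config("spark.hadoop.hive.metastore.uris", "thrift://metastore-host:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Any engine pointed at the same metastore (Spark, Presto, Hive itself)
# now sees the same databases and tables.
spark.sql("SHOW DATABASES").show()
```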
Why capture metadata and keep it in a metastore?
When running things in a distributed manner, one big challenge in big data technologies is having a central repository that keeps track of data about the DATA, so that whenever the DATA is ready to be consumed, the consumer can learn what it needs about the DATA through data (metadata).
The DATA in this case refers to the physical data, and "data" refers to the metadata of that physical data.
A metastore makes it easier to organize, access and govern large amounts of data in distributed data sources. It helps to manage complex metadata, ensure consistency, optimize query performance and abstract storage details.
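To give one concrete example of the performance point: the metastore records a storage location for every partition of a table, which lets an engine skip data it does not need. A small sketch, again with hypothetical names and the `spark` session from above:

```python
# A Hive-format partitioned table; the partition column is declared
# separately from the regular columns.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (user_id INT, action STRING)
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
""")
spark.sql("INSERT INTO events PARTITION (event_date='2024-09-01') VALUES (1, 'click')")
spark.sql("INSERT INTO events PARTITION (event_date='2024-09-02') VALUES (2, 'view')")

# The metastore tracks each partition and where its files live ...
spark.sql("SHOW PARTITIONS events").show()

# ... so a filter on the partition column lets the engine read only the
# matching directories instead of scanning the whole table.
spark.sql("SELECT * FROM events WHERE event_date = '2024-09-02'").show()
```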
hive_metastore in Databricks
Databricks organizes data into catalogs.
A catalog in Databricks is the highest-level organizational unit; it contains schemas, and those schemas contain tables. It serves to manage, organize, and access data in a structured way.
hive_metastore is the default name of the Hive Metastore catalog in Databricks, which stores and manages the metadata about schemas, tables, and views using the Hive Metastore service. All tables are stored in the hive_metastore catalog by default unless specified otherwise.
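Here is a quick sketch of what that looks like from a Databricks notebook, where a `spark` session is predefined; the `sales` table is hypothetical, and SHOW CATALOGS assumes a workspace where more than one catalog is visible:

```python
# List the catalogs visible to this workspace; hive_metastore is the
# default home for legacy (pre-Unity Catalog) tables.
spark.sql("SHOW CATALOGS").show()

# Names fully qualify as catalog.schema.table, so these two queries refer
# to the same table while hive_metastore is the current catalog.
spark.sql("SELECT * FROM hive_metastore.default.sales").show()
spark.sql("SELECT * FROM default.sales").show()
```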
When a table is created in Databricks, whether through a SQL warehouse or Spark in a notebook, the metadata for that table is stored in the Hive Metastore, while the actual data is stored in cloud storage like AWS S3, Azure Data Lake Storage, etc.
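A sketch of that separation, runnable in a Databricks notebook (the table name is hypothetical):

```python
# Creating a managed table: the definition below goes to the Hive
# Metastore, while the table's files land in cloud object storage.
spark.sql("""
    CREATE TABLE IF NOT EXISTS hive_metastore.default.orders (
        order_id INT,
        total DOUBLE
    )
""")

# DESCRIBE EXTENDED surfaces the metastore's view of the table: the schema,
# the provider, and a Location row pointing at cloud storage (for example
# an s3:// or abfss:// path), which is where the actual data lives.
spark.sql("DESCRIBE EXTENDED hive_metastore.default.orders").show(truncate=False)
```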
Remember the importance of Hive Metastore working well with Apache Spark?
When using Spark to query data in Databricks, the Hive Metastore is queried first to retrieve the metadata. Since Spark works well with Hive Metastore and is a core component of Databricks, it resolves schemas and storage locations through the hive_metastore catalog and then queries the data very efficiently.
While hive_metastore in Databricks serves as a centralized place for metadata management, helps separate metadata from data, and provides compatibility with Spark and SQL, it still falls short on governance, auditing, access control, data discovery, and other features compared to Unity Catalog offered by Databricks.
Unity Catalog, through its flexibility and modern features, offers significant advantages in data management and governance compared to the traditional hive_metastore. Check out my next blog, where we will discuss the major advances in Unity Catalog.