Azure | 07. Azure Blob Storage vs. Azure Data Lake Storage Gen2 (ADLS Gen2)

Binayadas Aruldas
6 min read · Apr 30, 2024

Let’s start with the data — what kind of data can you store in Azure Blob or ADLS Gen2?

Azure Blob storage and ADLS Gen2 are both well-suited for storing unstructured data. Think videos, photos, audio files, text files, Excel files, and more! Because the data is stored as files rather than tables, neither service works like a database: to query the data you typically pair the storage with a compute engine such as Spark or Azure Synapse Analytics.

How do you provision these services in Azure?

Both Azure Blob storage and ADLS Gen2 are provisioned through an Azure Storage Account. To reduce administrative overhead, Azure Storage Accounts contain four different Azure storage services — Blobs, Queues, Tables, and Files. You can use all or just one of these services within a single storage account, up to the resource limits.

  • Azure Blob Storage: Object storage solution for the cloud. Blob storage is optimized for storing massive amounts of unstructured data — a.k.a. data that does not adhere to a particular schema or definition, such as text data, photos, videos, etc. Blobs are organized by containers. By the way, Blob = Binary Large Object.
  • Azure Queues: Service for storing large numbers of messages that can be accessed from anywhere in the world via authenticated calls using HTTP or HTTPS. A queue can contain millions of messages, up to the total capacity limit of a storage account. Queues are often used to create a backlog of work to process asynchronously.
  • Azure Tables: Service for storing NoSQL (key/value) data with a schema-less design. Table storage is often used to store flexible datasets such as user data for web apps, device information, or other types of metadata. A storage account can contain any number of tables, up to its capacity limit.
  • Azure Files: A fully managed file share service in the cloud, accessible via Server Message Block (SMB) protocol or Network File System (NFS) protocol. Azure Files can be used to completely replace or supplement traditional on-premises file servers.

Azure Blob storage is optimized for storing massive amounts of unstructured data. After creating an Azure Storage account, the next step is to create containers, which organize a set of Blobs much like a directory in a file system. Similar, but not the same! A Blob storage account with a flat namespace can only mimic a hierarchical folder structure by using slashes in blob names; it does not support true directories. Once you have created containers, you can store your Blobs (the actual files).
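To make the "mimicking" point concrete, here is a minimal sketch in plain Python of how a folder view can be derived from a flat list of blob names by splitting on a delimiter. It does not call any Azure API; the function name `list_virtual_directory` and the sample blob names are made up for illustration.

```python
def list_virtual_directory(blob_names, prefix="", delimiter="/"):
    """Return (subfolders, files) that sit directly under `prefix`.

    Nothing here is a real directory: the "folders" are inferred
    purely from the delimiter that appears inside each blob name.
    """
    folders, files = set(), []
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # There is at least one more delimiter, so this blob lives
            # inside a "virtual" subfolder of the prefix.
            folders.add(prefix + head + delimiter)
        else:
            files.append(name)
    return sorted(folders), sorted(files)


# Hypothetical blob names stored in a single container.
blobs = [
    "raw/2024/01/sales.csv",
    "raw/2024/02/sales.csv",
    "raw/readme.txt",
    "curated/sales.parquet",
]

folders, files = list_virtual_directory(blobs, prefix="raw/")
print(folders)
print(files)
```

The folder view only exists because every blob name is scanned. That is why directory-level operations on a flat namespace become more expensive as the number of blobs grows.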

Azure Data Lake Storage Gen1

  • ADLS Gen1 is an Apache Hadoop file system that is compatible with Hadoop Distributed File System (HDFS) and works with the Hadoop ecosystem. If you’re not familiar with Hadoop, it is an open-source platform that focuses on simplifying distributed data processing. This is kind of a big deal for Hadoop users! To work with the Hadoop ecosystem, your data needs to be stored in HDFS or an HDFS-compatible store. So, that means that users could store all of their data in ADLS Gen1 and use it in their Hadoop workloads.
  • ADLS Gen1 supports virtually unlimited storage. Individual files can range from kilobytes to petabytes in size.
  • Access Control Lists (ACLs) can be implemented to manage access to your data in ADLS Gen1.
  • ADLS Gen1 can be accessed from Hadoop through its own file system, using URIs prefixed by adl://. The ability to access data this way allows for potential optimizations, particularly in big data analytics scenarios.

! IMPORTANT ! You should not use ADLS Gen1 for any new projects. It is a legacy service that Microsoft retired on February 29, 2024; use ADLS Gen2 instead.

Azure Data Lake Storage Gen2

Azure Blob Storage + Azure Data Lake Storage Gen1 = Azure Data Lake Storage Gen2

ADLS Gen2 truly is the result of converging the capabilities of two storage services, Azure Blob Storage and Azure Data Lake Storage Gen1. The result? You get the best of both worlds. File system semantics, directory and file-level security capabilities from ADLS Gen1 are combined with the low-cost, tiered storage, high availability/disaster recovery capabilities from Azure Blob Storage.

ADLS Gen2 was designed with big data analytics in mind and is a key component in modern data analytics, data science, and data warehousing architectures. A fundamental component of ADLS Gen2 is the addition of a hierarchical namespace to Blob storage. To see what this term really means, think about the file explorer on your computer. You have likely created (or at least attempted to create) an organized folder structure. Unlike flat Blob storage, an ADLS Gen2 account lets you create a real folder hierarchy.

Besides providing a familiar interface for developers, the hierarchical namespace is preferred when working with big data analytics frameworks like Hive and Spark. Without real directories, applications must process potentially millions of individual blobs to accomplish directory-level tasks such as a rename or delete, whereas with the hierarchical namespace the same task is a single operation on the directory entry. Spark jobs, for example, often write output to a temporary location and rename it at the end of the job. That rename takes significantly less time with a hierarchical namespace.
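The rename argument can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not how Azure implements either namespace internally: the flat namespace is modeled as a dictionary keyed by full blob name, the hierarchical one as nested dictionaries, and both function names are invented for this example.

```python
def rename_flat(blobs, old_prefix, new_prefix):
    """Flat namespace: every blob under the prefix gets a new name."""
    renamed, touched = {}, 0
    for name, data in blobs.items():
        if name.startswith(old_prefix):
            name = new_prefix + name[len(old_prefix):]
            touched += 1
        renamed[name] = data
    return renamed, touched


def rename_hierarchical(tree, parent_path, old_name, new_name):
    """Hierarchical namespace: relink one entry in the parent directory."""
    parent = tree
    for part in parent_path:
        parent = parent[part]
    parent[new_name] = parent.pop(old_name)
    return 1  # a single directory entry was changed


# A hypothetical Spark-style job output with 1,000 part files.
parts = [f"part-{i:05d}" for i in range(1000)]

flat = {f"out/_temporary/{p}": b"" for p in parts}
flat, flat_ops = rename_flat(flat, "out/_temporary/", "out/final/")

tree = {"out": {"_temporary": {p: b"" for p in parts}}}
tree_ops = rename_hierarchical(tree, ["out"], "_temporary", "final")

print(flat_ops, tree_ops)
```

In the toy model the flat rename touches every part file, while the hierarchical rename touches one entry no matter how many files the directory holds.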

ADLS Gen2 accounts are provisioned by selecting the “enable hierarchical namespace” option when creating an Azure Storage Account. Once the hierarchical namespace is enabled, it cannot be disabled. An existing flat-namespace account can be upgraded later, but the upgrade is one-way, so it is best to decide up front.

If you provision your storage account from the Azure portal, look under the Advanced tab for an option called “Data Lake Storage Gen2 hierarchical namespace”, which is disabled by default. To use the ADLS Gen2 capabilities, switch this to enabled and continue through the resource provisioning process.

Once the storage account is provisioned, you can verify that the hierarchical namespace is enabled by navigating to the resource in the Azure portal and searching for the “Configuration” option on the left-hand blade. Notice that once the hierarchical namespace is enabled, the option is greyed out, since it cannot be turned off afterward.
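Outside the portal, the same choice is a single property on the storage account resource. The sketch below builds the resource definition you might place in an ARM template, where `isHnsEnabled` is the hierarchical namespace switch. The account name, region, SKU, and API version are placeholder assumptions, and the helper `storage_account_resource` is hypothetical.

```python
import json


def storage_account_resource(name, location, enable_hns=True, sku="Standard_LRS"):
    """Build an ARM-style resource definition for a storage account.

    Setting properties.isHnsEnabled to true is what turns a general-purpose
    v2 storage account into an ADLS Gen2 account.
    """
    return {
        "type": "Microsoft.Storage/storageAccounts",
        "apiVersion": "2023-01-01",  # assumed; use a version your tooling supports
        "name": name,
        "location": location,
        "kind": "StorageV2",
        "sku": {"name": sku},
        "properties": {"isHnsEnabled": enable_hns},
    }


# Placeholder account name and region, for illustration only.
resource = storage_account_resource("examplelake001", "westeurope")
print(json.dumps(resource, indent=2))
```

Keeping the flag in a template or script records the decision in version control, which is useful because it cannot be switched back off later.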

Besides the hierarchical namespace, ADLS Gen2 has several other notable capabilities:

  • Like ADLS Gen1, ADLS Gen2 is Hadoop compatible, meaning you can manage and access data just as you would with HDFS.
  • The ABFS driver (ABFS = Azure Blob File System) is available within all Apache Hadoop environments and allows other Azure services to access data stored in ADLS Gen2, using URIs that start with abfs:// or abfss://. These services include Azure HDInsight, Azure Databricks, and Azure Synapse Analytics. The ABFS driver is optimized specifically for big data analytics.
  • Both access control lists (ACLs) and POSIX-style permissions, plus additional granularity specific to ADLS Gen2, are supported.
  • Data stored in ADLS Gen2 is not required to be moved or transformed prior to performing analysis, reducing the required transaction cost.
  • ADLS Gen2 provides the same data redundancy and access tier offerings as Azure Blob storage.
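As a small illustration of the ABFS driver mentioned above, this sketch assembles the URI an analytics engine would use to address a file in ADLS Gen2. The layout `abfss://<container>@<account>.dfs.core.windows.net/<path>` is the documented URI scheme; the helper name `abfs_uri` and the account, container, and path values are made up.

```python
def abfs_uri(account, container, path="", secure=True):
    """Build an ABFS URI for a path in an ADLS Gen2 container.

    `abfss` is the TLS-secured variant of the scheme; `abfs` is the plain one.
    """
    scheme = "abfss" if secure else "abfs"
    return f"{scheme}://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Hypothetical account, container, and file path.
uri = abfs_uri("examplelake001", "raw", "2024/01/sales.csv")
print(uri)
```

In a Spark environment that already has the driver and credentials configured, a string like this is what you would typically pass to a reader such as `spark.read.csv(...)`.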

To summarize, ADLS Gen2 is built on top of Azure Blob storage. It supports the core capabilities of Azure Blob storage while leveraging ADLS Gen1 features and introducing new functionality. To reiterate, ADLS Gen2 is not a separate service in Azure, but is provisioned through an Azure Storage Account by enabling the hierarchical namespace configuration option. ADLS Gen2 is optimized for big data analytics workloads.

Comparison: Azure Blob Storage vs. Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 is a superset of Azure Blob storage capabilities. The list below summarizes some of the key differences between ADLS Gen2 and Blob storage.

  • ADLS Gen2 supports ACL and POSIX permissions allowing for more granular access control compared to Blob storage.
  • ADLS Gen2 introduces a hierarchical namespace. This is a true file system, unlike Blob Storage which has a flat namespace. This capability has a significant impact on performance, especially in big data analytics scenarios.
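  • ADLS Gen2 is an HDFS-compatible store. This means that Apache Hadoop services can use data stored in ADLS Gen2 as if it were a file system. Flat-namespace Blob storage can be reached from Hadoop through the older WASB driver, but it does not offer true file system semantics.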
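To give a feel for the ACL point in the first bullet, the sketch below parses a POSIX-style ACL string of the kind ADLS Gen2 uses, where each comma-separated entry has the form `[default:]type:id:permissions`. The parser `parse_acl` is a hypothetical helper written for this article, and `analyst-object-id` is a placeholder rather than a real identity.

```python
def parse_acl(acl):
    """Parse a short-form POSIX-style ACL string into a list of dicts."""
    entries = []
    for item in acl.split(","):
        parts = item.strip().split(":")
        scope = "access"
        if parts[0] == "default":
            # Default entries on a directory are inherited by new children.
            scope = "default"
            parts = parts[1:]
        kind, principal, perms = parts
        entries.append({
            "scope": scope,
            "type": kind,               # user, group, mask, or other
            "id": principal or None,    # empty id means the owning user or group
            "read": perms[0] == "r",
            "write": perms[1] == "w",
            "execute": perms[2] == "x",
        })
    return entries


# Owner gets rwx, owning group r-x, everyone else nothing,
# plus read and execute for one named (placeholder) user.
acl = "user::rwx,group::r-x,other::---,user:analyst-object-id:r-x"
for entry in parse_acl(acl):
    print(entry)
```

Entries like these can be attached to individual directories and files, which is the finer granularity that the flat Blob namespace lacks.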

There are also price differences between Azure Blob storage and ADLS Gen2. Generally, transaction costs for ADLS Gen2 are slightly higher than those of Blob storage, but this is often offset by the reduced compute costs that result.

