Briefly define transparency, fault tolerance, scalability, and naming. Discuss these concepts in the context of HDFS architecture and how some of them you think change in the cluster-oriented architecture.

Transparency:

The term Transparency in distributed systems implies that the end user should not face any difficulty when accessing the remote files, the same way as local files. The user should be able to access the files from any system along as the system is part of distributed system. The client/user is not bothered of how the file(s) are stored and how the data is moved between the files.

Hadoop Distributed File system:

In case of HDFS architecture, the end user does not worry on how the Name Node maintains the metadata of HDFS and how the data is replicated across different nodes which are part of distributed system.

Fault Tolerance: Another important aspect of distributed system is Fault Tolerance. Distributed system should be able to tolerate from the possible failures of network communication, machine failures, storage device crashes etc. a fault tolerant system should continue functioning (it is fine if the system is very slow during these scenarios) but should not make the entire system down

HDFS is a highly fault tolerant system. How we can achieve this in HDFS.

With the help of replication process, whenever there is a file that needs to be stored by the user, the file will be divided into blocks of data and these blocks of data are distributed across different machines present in HDFS cluster. After this, replica of each block is created on other machines present in the cluster. By default HDFS creates 3 copies of a file on other machines present in the cluster. So due to some reason if any machine on the HDFS goes down or fails, then user can easily access that data from other machines in the cluster in which replica of file is present. This is entirely transparent to the user, user doesn’t know that there is an big system behind the scenes

Scalability:

The capability of a system to adopt to increased service load is called as Scalability. A scalable design in distributed system should withstand high-service load, should be able to accommodate users growth and seamless integration of new systems into it.

HDFS is designed as a scalable distributed file system to support thousands of nodes within a single cluster. I would like to take the example of Uber’s Engineering group how they have adopted Hadoop as the storage (HDFS) and YARN as computing engine when the user’s growth is increased drastically. As a result, the system performance reduced drastically. After through research they identified reasons for slowness of NameNode performance and changed their design by introducing the concept of View File Systems to split their HDFS into multiple name spaces and used View File Systems mount points to present a single virtual namespace to users. More on this can be found at : https://eng.uber.com/scaling-hdfs/

According to Apache Cook book, The default architecture of Hadoop utilizes a single NameNode as a master over the remaining data nodes. With a single NameNode, all data is forced into a bottleneck. This limits the Hadoop cluster to 50-200 million files.

The implementation of a single NameNode also requires the use of commercial-grade NAS, not budget-friendly commodity hardware.

A better alternative to the single NameNode architecture is one that uses a distributed metadata structure. A visualized comparison of the two architectures is provided below:

As we can see, the distributed metadata architecture uses 100% commodity hardware. In addition to the savings in cost, it boasts an equally pleasing 10-20 times increase in performance and avoids the file bottleneck with a file limit of up to 1T, greater than 5000 times the capacity of the single NameNode architecture.

Source : https://www.smartdatacollective.com/how-maximize-performance-and-scalability-within-your-hadoop-architecture/

Conclusion: Every industry whether it is Banking, Health Care, Financials, Insurance, Online Marketing companies etc are trying to shift from the traditional RDBMS systems to Distributed systems (most of the cases HDFS) to support their user drastic growth and they are seeing the results. As Hadoop Architecture is very flexible to allow changes to it, most of the companies are applying their own methodologies to hand huge amount of data without compromising the performance. A classic example of how the Uber modified their current HDFS design to handle ever growing users size.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: