Scala vs Python

List some of the advantages of using Scala, as opposed to other languages.

In order to explain this, I am comparing Scala vs Python.

If you want to write some serious Apache Spark programming it is better to choose Scala because of following reasons.

Note: Following comparisons are based on the fact that I am going to write Apache Spark for some of the big data analytics operations:

  1. Scala is 10 times faster than python because of running on JVM.
  2. Scala is Static Typed language where as Python is Dynamic typed language
  3. Asynchronous versus parallel versus concurrent programming:

Asynchronous programming: It involves some calculations time-intensive tasks, which on the one hand are engaging a thread in the background but do not affect the normal flow of the program.

Parallel programming:  It incorporates several threads to perform a task faster and so does concurrent programming. But there’s a subtle difference between these two.

The program flow in parallel programming is deterministic whereas in concurrent programming it’s not.

For example, a scenario where you send multiple requests to perform and return responses regardless of response order is said to be concurrent programming. But where you break down your task into multiple sub-tasks to achieve parallelism can be defined as the core idea of parallel programming.

Scala works better when compared to Python when we are dealing with either of these 3

Waterfall and Agile Comparison

Project manager’s roleProject manager serves as an active leader by prioritizing and assigning tasks to team members.Agile project manager (or Scrum Master) acts primarily as a facilitator, removing any barriers the team faces. Team shares more responsibility in managing their own work.  
ScopeProject deliverables and plans are well-established and documented in the early stages of initiating and planning. Changes go through a formal change request process.Planning happens in shorter iterations and focuses on delivering value quickly. Subsequent iterations are adjusted in response to feedback or unforeseen issues.
ScheduleFollows a mostly linear path through the initiating, planning, executing, and closing phases of the project.  Time is organized into phases called Sprints. Each Sprint has a defined duration, with a set list of deliverables planned at the start of the Sprint.
CostCosts are kept under control by careful estimation up front and close monitoring throughout the life cycle of the project.Costs and schedule could change with each iteration.
QualityProject manager makes plans and clearly defines criteria to measure quality at the beginning of the project.Team solicits ongoing stakeholder input and user feedback by testing products in the field and regularly implementing improvements.
CommunicationProject manager continually communicates progress toward milestones and other key indicators to stakeholders, ensuring that the project is on track to meet the customer’s expectations.Team is customer-focused, with consistent communication between users and the project team.
StakeholdersProject manager continually manages and monitors stakeholder engagement to ensure the project is on track.Team frequently provides deliverables to stakeholders throughout the project. Progress toward milestones is dependent upon stakeholder feedback.

Euclidean vs Manhattan Distance

Euclidean Distance: In order to calculate the distance b/w tow points in a given Space – can be a N-dimensional, we can use Euclidean distance formula:

D(x1,x2) = SQRT((x1-x2)**2 + (y1-y2)**2) where SQRT is Square Root D stands for distance b/w point x1,x2. we can extend this for N- dimensional plan as follows

D(x1,x2,x3,…xn) = SQRT((x1-x2)**2 + (y1-y2)**2 + (z1-z2)**2 +…(xn – xn-1)**2)

Manhattan distance: it is defined as follows:

D(x1,x2) = |x1-x2| + |y1-y2| i.e. Sum of Absolute distance b/w x1,x2 and y1 and y2

Apache Spark

Describe what are Accumulators and Broadcast Variables in Spark and when to use these shared variables?

Accumulators & Broadcast Variables:

Both Accumulators and Broadcast variables are considered as Shared variables in spark. Meaning Spark will allow both worker nodes and Driver program to mutually access the values of these variable while processing the data in spark cluster environment.

Accumulators: Accumulators are the shared variables which will be defined inside the Driver Program. we can define them as: acctVar = sc.accumulator().

Though This variable – acctVar is defined inside driver program, but spark allows this variable accessed by the all worker nodes that are participated in the cluster.

To understand how accumulator variable works in spark, let us take the following real time scenario:

Problem Statement: Count how many customers are from country “USA” in Fake_Customers.csv

Solution: In order to find out count of customers, let us define spark accumulator variable as countUSACustomers = sc.accumulator(0); where sc→ SparkContext object; the variable as initial value of 0

As spark is a distributed computation framework, the driver program will have the logic to increment the value of countUSACustomers by 1 every time whenever the worker node encounters country as “USA” while the worker node(s) is processing it’s part of the file.

Conclusion: Once the file processing is completed, the final of countUSACustomers would gives the final solution to the problem statement

Meaning though the variable – countUSACustomers is defined inside the driver program, still it is accessed, and updated/written by each worker node.

Broadcast Variables: Accumulators allows the worker nodes to write/update the shared variable value, Broadcast variable on other hand will copy/broadcast the dataset/variable values to each of the worker nodes.

Let us take the following example in order to explain functionality of Broadcast variable: Problem Statement: Customer and CountryCodes are 2 input files which has following information Customer.csv: customer information plus country names Metadata: {First_Name, Last_Name, CountryName,City,Zip, Province} CountryCode.csv: CountyName and CountryCode – represents the International dialling code

Metadata: {CountryName, CountryCode} Broadcast variable declaration sc.boradcast(,)

In order to get the desired results, spark needs to join both these datasets across the worker nodes. This is definitely expensive operation as spark needs to shuffle b/w the worker nodes while joining both the datasets.

Alternatively spark allows to broadcast the 2nd data set (in this case CountryCode.csv since this is smaller dataset) onto each worker node; and each worker node can perform the local lookup on the 2nd data set while retrieving country code details by passing country name. Meaning CountryName will be passed as input in return CountryCode will be returned by the worker node. This is drastically increases the performance as no need to go over the network while joining the data between the datasets.

Code Snippet: Problem #1 – Accumulators demo In the following example – the variable num is defined as accumulator variable. It has initial value of 1. This accumulator variable is used by multiple worker nodes and returns accumulated value of 15 at the end a an output from pyspark import SparkContext, SparkConf if __name__ == “__main__”: conf = SparkConf().setAppName(‘AccumulatorsDemo’).setMaster(“local[*]”) sc = SparkContext(conf = conf) num = sc.accumulator(1) def f(x): global num num+=x rdd = sc.parallelize([2,3,4,5]) rdd.foreach(f) final = num.value print(“Accumulated value is -> %i” % (final))

Problem #2 – Broadcast demo In the following example – Showing a Broadcast variable called words_new, it has an attribute called value, this attribute stores the data and then it is used to return a broadcasted value as shown below: from pyspark import SparkContext, SparkConf if __name__ == “__main__”: conf = SparkConf().setAppName(‘BroadcastDemo’).setMaster(“local[*]”) sc = SparkContext(conf = conf) words_new = sc.broadcast([“USA”, “India”, “China”, “Canada”, “Brazil”]) data = words_new.value print(“Stored data ->”,data) element = words_new.value[2] print(“Printing a particular element in RDD -> %s”,element)

References: Following resources are used as a reference to build above 2 examples on Accumulators and Broadcast demos https://www.tutorialspoint.com/apache_spark/advanced_spark_programming.htm Other Study material – Canvas Advanced Spark Programming module

CAP Theorem

Eric Brewer, systems professor at the University of California, Berkeley, brought the different trade-offs together in a keynote address to the Principles of Distributed Computing (PODC) conference in 2000.He presented the CAP theorem.

The CAP Theorem is a fundamental theorem in distributed systems that states any distributed system can have at most two of the following three properties.

Partition tolerance
Consistency in distributed system:

Let’s consider a very simple distributed system. Our system is composed of two servers, G1 and G2. Both servers are keeping track of the same variable, V, whose value is initially V1. G1 and G2 (nodes of the distributed systems) can communicate with each other and can also communicate with external clients.

Here’s how Gilbert and Lynch describe consistency:

“any read operation that begins after a write operation completes must return that value, or the result of a later write operation”

For example, when a client writes V1 to node G1, before acknowledges to client, node G1 replicates the copy of V1 to node G2 and once G2 acknowledges to G1, then only G1 sends success confirmation to the client. So, when client reads the value from the node G2 it will always returns the same value i.e. V1. If any distributed systems act like this then it is in Consistent state.


Here’s how Gilbert and Lynch describe availability.

“Every request received by a non-failing node in the system must result in a response”

In an available system, if our client sends a request to a server and the server has not crashed, then the server must eventually respond to the client. The server is not allowed to ignore the client’s requests.

Network Partition tolerant:

Partition tolerance in CAP means tolerance to a network partition. An example of a network partition is when two nodes can’t talk to each other, but there are clients able to talk to either one or both of those nodes.

A CA system guarantees strong consistency, at the cost of not being able to process requests unless all nodes are able to talk to each other.

An AP system is able to function during the network split, while being able to provide various forms of eventual consistency.

R + W > N à where R: reads, W:writes and N total number of replicas.

If The number of nodes the client writes (W) to + The number of nodes the client reads (R) from > Number of replicas, then system is Consistent state

Ex: if I write into one node and read from one node and number of replicas is 3 then that distributed system is known as Eventually consistent

If I write 2 and read 2 and number of replicas are 2 then the distributed system is Strongly consistent.

ACID VS Consistency in CAP theory:

When it comes to Traditional relational database systems, we all know that ACID is a benchmark that it’s a must property. Without ACID properties we can’t call a database as RDBMS.

Luckily for the world of distributed systems – like Google’s BigTable; Amazon’s Dynamo and Facebook’s Cassandra deal with a loss of consistency and still maintains system reliability with the help of BASE – stands for Basically Available Soft state, Eventual consistency.

Basically Available: This constraint states that the system does guarantee the availability of the data as regards CAP Theorem; there will be a response to any request. But, that response could still be ‘failure’ to obtain the requested data or the data may be in an inconsistent or changing state, much like waiting for a check to clear in your bank account.

Soft state: The state of the system could change over time, so even during times without input there may be changes going on due to ‘eventual consistency,’ thus the state of the system is always ‘soft.’

Eventual consistency: The system will eventually become consistent once it stops receiving input. The data will propagate to everywhere it should sooner or later, but the system will continue to receive input and is not checking the consistency of every transaction before it moves onto the next one.

Conclusion: Here is my two cents on Applying Consistency on every single transaction (when transactions are in trillions) is costly affair and it needs lot of complex code needs to be developed by the developers without compromising the performance. Fortunately companies like Google, Yahoo, Twitter, Amazon etc are able to interact with the customers across the globe continuously, with the necessary availability and partition tolerance, while keeping their costs down, their systems up, and their customers happy.

Having said that it’s ok for the distributed systems not to maintain ACID.

Which applications/companies are using distributed systems which has Eventual Consistency?

Amazon’s S3 is one popular database where it supports the Eventual Consistency. More can be found at https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html (Links to an external site.)

The companies which uses Amazon S3 database includes but not limited to are:

Smart Vision Labs, Inc.
Cane River Pecan Company
Deep Blue Communications, Inc.

Briefly define transparency, fault tolerance, scalability, and naming. Discuss these concepts in the context of HDFS architecture and how some of them you think change in the cluster-oriented architecture.


The term Transparency in distributed systems implies that the end user should not face any difficulty when accessing the remote files, the same way as local files. The user should be able to access the files from any system along as the system is part of distributed system. The client/user is not bothered of how the file(s) are stored and how the data is moved between the files.

Hadoop Distributed File system:

In case of HDFS architecture, the end user does not worry on how the Name Node maintains the metadata of HDFS and how the data is replicated across different nodes which are part of distributed system.

Fault Tolerance: Another important aspect of distributed system is Fault Tolerance. Distributed system should be able to tolerate from the possible failures of network communication, machine failures, storage device crashes etc. a fault tolerant system should continue functioning (it is fine if the system is very slow during these scenarios) but should not make the entire system down

HDFS is a highly fault tolerant system. How we can achieve this in HDFS.

With the help of replication process, whenever there is a file that needs to be stored by the user, the file will be divided into blocks of data and these blocks of data are distributed across different machines present in HDFS cluster. After this, replica of each block is created on other machines present in the cluster. By default HDFS creates 3 copies of a file on other machines present in the cluster. So due to some reason if any machine on the HDFS goes down or fails, then user can easily access that data from other machines in the cluster in which replica of file is present. This is entirely transparent to the user, user doesn’t know that there is an big system behind the scenes


The capability of a system to adopt to increased service load is called as Scalability. A scalable design in distributed system should withstand high-service load, should be able to accommodate users growth and seamless integration of new systems into it.

HDFS is designed as a scalable distributed file system to support thousands of nodes within a single cluster. I would like to take the example of Uber’s Engineering group how they have adopted Hadoop as the storage (HDFS) and YARN as computing engine when the user’s growth is increased drastically. As a result, the system performance reduced drastically. After through research they identified reasons for slowness of NameNode performance and changed their design by introducing the concept of View File Systems to split their HDFS into multiple name spaces and used View File Systems mount points to present a single virtual namespace to users. More on this can be found at : https://eng.uber.com/scaling-hdfs/

According to Apache Cook book, The default architecture of Hadoop utilizes a single NameNode as a master over the remaining data nodes. With a single NameNode, all data is forced into a bottleneck. This limits the Hadoop cluster to 50-200 million files.

The implementation of a single NameNode also requires the use of commercial-grade NAS, not budget-friendly commodity hardware.

A better alternative to the single NameNode architecture is one that uses a distributed metadata structure. A visualized comparison of the two architectures is provided below:

As we can see, the distributed metadata architecture uses 100% commodity hardware. In addition to the savings in cost, it boasts an equally pleasing 10-20 times increase in performance and avoids the file bottleneck with a file limit of up to 1T, greater than 5000 times the capacity of the single NameNode architecture.

Source : https://www.smartdatacollective.com/how-maximize-performance-and-scalability-within-your-hadoop-architecture/

Conclusion: Every industry whether it is Banking, Health Care, Financials, Insurance, Online Marketing companies etc are trying to shift from the traditional RDBMS systems to Distributed systems (most of the cases HDFS) to support their user drastic growth and they are seeing the results. As Hadoop Architecture is very flexible to allow changes to it, most of the companies are applying their own methodologies to hand huge amount of data without compromising the performance. A classic example of how the Uber modified their current HDFS design to handle ever growing users size.

NO SQL Databases

MongoDB: This is one of the NoSQL DB. It’s a NoSQL solution. This is also called as Document Database. Basic architecture is as follows:

MongoDB will have multiple databases; Each Database will have multiple Collections; Each collections will have set of documents. This is where data will be stored in case of MongoDB

MongoDB stores the data in documents and these documents are in JSON format – Java Script Object Notation. There is no schema concept for mongodb

  1. In Mongo DB world you will have one or more databases on your MongoDB server
  2. Each Database will hold one or more collections
  3. A collection is equivalent to a table in RDBMS database world
  4. Each collection can have n-number of documents. It can have multiple documents
  5. Documents will have actual data
  6. Following diagram shows how mongodb database works