Building Blocks for Super Massive Data StructuresEric BotcherBlockedUnblockFollowFollowingApr 29Photo of a supermassive black hole which was imaged using a supermassive amount of data.
htmlThe Term ‘Big data’ focus on crunching data that is much more than you can fit in your computer’s RAM.
Working with big data takes ever more sophisticated techniques than running smaller scale analytics.
Spreadsheets (such as Microsoft Excel) are good at doing a quick analysis of a small amount of data.
However, as the size of the dataset increases the system resources required by excel also increases because the entire dataset must be fully open in RA, limiting the amount of data that can be analyzed without advanced data structures.
Because of this excel only works for datasets up to 2–4 gigs(depending on the version being used), which is about 3.
5 million rows(100 columns deep).
Needless to say that big data operations require some sophisticated data structures.
Spreadsheets are like building houses where big data analytics is like building a skyscraper.
While the subsystems in both houses and skyscrapers play a vital role in the functionality of the building, subsystems in the skyscrapers are more engineered and require more efficiency to support the functionality of the building on a whole.
Big-Woah!Big O notation is an engineering tool used to describe how the time of an algorithm increases with the number of inputs.
This is very important to consider when working in big data environments because by nature big data algorithms are usually on the right side of the Big-O notation graphs involving many inputs.
In big data, inefficient systems can cause processes to run slow or in the worst case scenario, crash mid-process losing days or even weeks of work.
In Big O, O is the operation being run where N is the number of inputs, and while N increases O grows in size based on the operation being run.
O(1) would be an operation that runs at the same speed regardless of the number of inputs, where O(n²) describes a function where the operation time increases at an exponential rate.
In building large scale data analytics structures, you need to avoid such exponential growth like the plague, because it will kill you… or at least your project.
The two most import factors in avoiding these big-Ohh nightmares are 1) data structures being used and 2) the functions used to traverse those data structures.
com/The “Fun” in FunctionalityBefore we dive into the data structures, we need to take a look at the functionality supported by such data.
In order for our data to add to the functionality of our tower, we will need to think about how we are going to access and manipulate data.
These functions can be broken down into three categories; 1) insert algorithms, 2) search and/or sort algorithms, and 3) delete algorithms, all of which can be examined with CRUD(Create Read Update and Destroy).
Insert algorithms create and update data within the data structure.
Depending on the rules of the data structures, these insert data and the beginning and/or middle and/or end of the data structure.
If inserted into the beginning or ends fo the data structure, the big-O notation is usually pretty constant( around O(1) for most and O(n) for insert into a point ).
In ordered data structures, this operation takes the longest because the machine needs to traverse the list for the logical point of insertion.
Search algorithms traverse through the data structure and return the desired data.
These operations tend to be around O(n) because they are traversing through the data structures for an object and returning it when it does, in the worst case it would have to parse through every data instance in the structure before returning it.
One good solution to this is by using a binary search algorithm as a way to greatly increase the speed of these operations.
A binary search skips through half the data and performs the next check on the upper half or lower half depending on the order of the data greatly increasing the speed of a search.
However, binary searches can only be performed on ordered data structures which take sorting and more precise insertion algorithms when creating the data structures.
Delete algorithms search through the data structure and destroy a data point based on predefined parameters.
In some data structure such as stacks and ques, this is quick because the destroyed data follows the rules of the data structure.
However, in ordered data structures, delete operations involve traversing the dataset for the specific data point and deleting it.
Because of this delete functions tend to also fall between O(1) (for structures that delete either the first or last element to O(n) when the program needs to go traverse the data structure to find a specified data point and destroy it.
Data Structures:All these functions are meant to be run on data structures that can be set up to optimize runtime or memory usage based on the needs of the program being used.
Some of the basic data structures that we will be covering today will be Arrays, stacks, queues, linked-lists, and Hash Tables.
Photo Credit: http://bigocheatsheet.
com/Arrays are essentially strings parsed into datapoints by either a specified character length or a separator value.
For example, a .
csv is a file where the array where the values are separated by commas.
As mentioned earlier, excel spreadsheets are large arrays of data that can be easily manipulated.
In python, lists, dictionaries, and tuples are common forms of arrays.
In order to access elements in the array indexes specify certain points in the string where the machine can jump to if specified by its programming.
This makes looping through an array simple and straight forward(making the search algorithms o(n) or O(n/2) for binary search functions).
The downside of an array is that the entire array must be stored in RAM in order to access any of the elements, which limits its size.
Stacks are like arrays except that they follow specific rules in order to speed up the functionality of the data structure.
Stacks always perform CRUD operation on the first element in the data structure.
Think of a Stack like a stack of plates at a buffet line, to load the stack, plates are placed on top of the stack and drawn down in the order in which they are stacked.
Since the operations are only performed on the first element in the array, stack operation tends to be very quick O(1) which operations only affect the first element to O(n) when a stack needs to reindex after an operation.
Queues are similar to stacks but instead of last in first out, it’s first in first out.
Think about a line at a nightclub, the first person in line is the first person into the club and when someone enters the club the next person takes that persons place in line and so forth.
Like stacks, operations are performed on a few points in the array (first and last).
As objects move through the queue, python will need to reindex the entire queue every time an object gets out of the queue, making operations falling between O(1) and O(n).
Linked Lists are data structures that use linked nodes to CRUD data.
A node in a linked list (singly linked list) contains two pieces of data, and object and the location of the next node.
Depending on how the linked list is set up, not all nodes in a linked list need to be open in RAM which makes it less memory intensive than an array.
However, because a node in a linked list only knows where the next node is, binary search operations on a linked list are imposable, which increases the time it takes to traverse a linked list to O(n).
Hashtables are probably the most unique data structure in this group because the data is not in any particular order, but instead in hash buckets.
Hash tables are key lookup arrays where a key is run through a hash function to point to a location in memory called a hash bucket.
Assuming there are more hash buckets than objects to store in those buckets, hash tables can provide instant search, insert and delete operations on the table giving it an O(1).
However, the number of buckets can be limited and a hash collision occurs when an object pushes another object out of the hashtable buckets.
One way to mitigate such collisions is to use hash tables in tandem with linked lists to make the buckets traversable.
All together now:These data structures are just the building blocks of big data.
In order for us to build our skyscraper to process big data, these data structures will need to be used together to distill large amounts of data into logical conclusions.
Arrays that can have operations run on them individually, the order of those operations can be determined by the order of Queue or Stack.
Hash tables can be used with linked lists to have a large amount of data that is easily accessible, and are used with linked list to avoid hash collisions.
All of these structures are tied into the production environment which can conduct real-time data analytics which backs up real-world decisions and machine learning algorithms to power our data-driven society.
References:Taking that picture of a black hole required massive amounts of dataIn context: Eight radio telescopes trained at the center of the Messier 87 galaxy each recorded 350TB of data per day…www.
comBig Data 101: Intro To Probabilistic Data StructuresOftentimes while analyzing big data we have a need to make checks on pieces of data like number of items in thedataconomy.
comA Data Scientist’s Guide to Data Structures & Algorithms, Part 1Note: This is part 1 of a 2-part series.
For the follow-up post, see here.
comA Data Scientist’s Guide to Data Structures & Algorithms, Part 2In my last post, I described Big O notation, why it matters, and common search and sort algorithms and their time…towardsdatascience.
comExcel Memory Limits – Decision ModelsMicrosoft Excel Calculation Secrets and optimization tips, calculation methods, calculation sequence, dependencies and…www.
comTaking Hash Tables Off The ShelfTruth time: learning about theoretical things every week can get a bit monotonous.
As much as it’s important to learn…medium.
com.. More details