Big Data - Distributed File-Based Databases

Posted on March 30, 2022 by Muhammad Rawish Siddiqui

Distributed file-based technologies, such as the open source Hadoop, are an inexpensive way to store large amounts of data in different formats. Hadoop stores files of any type: structured, semi-structured, and unstructured. Using a configuration similar to MPP Shared-nothing (an MPP foundation for file storage), it distributes files across processing servers. It is well suited to storing data securely, since multiple copies of each file are kept, but it presents challenges when data must be accessed through structured or analytical mechanisms (such as SQL).

Due to its relatively low cost, Hadoop has become the landing zone of choice for many organizations. From Hadoop, data can be moved to MPP Shared-nothing databases to have algorithms run against it.

The programming model used in file-based solutions is called MapReduce. It has three main steps:

Map: Identify and obtain the data to be analyzed
Shuffle: Combine the data according to the analytical patterns desired
Reduce: Remove duplication or perform aggregation in order to reduce the size of the resulting data set to only what is required
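
The three steps above can be sketched in a few lines of plain Python. This is only a single-machine illustration of the pattern, not Hadoop's actual API: the word-count task and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are assumptions chosen for the example. In a real cluster, each phase would run in parallel across many servers.

```python
from collections import defaultdict


def map_phase(records):
    """Map: pick out the data to analyze by emitting a (word, 1) pair per word."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: bring together all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate each group so only one count per word remains."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    """Run the three phases in order on an in-memory list of text lines."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    sample = ["big data big files", "data in files"]  # made-up input
    print(word_count(sample))
```

Here the map step emits one pair per word, the shuffle step groups the pairs by word, and the reduce step collapses each group to a single total, which is why the result is smaller than the input.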
