Big Data – Distributed File-Based Databases

Distributed file-based technologies, such as the open source Hadoop, are an inexpensive way to store large amounts of data in different formats. Hadoop stores files of any type – structured, semi-structured, and unstructured. Using a configuration similar to MPP Shared-nothing (an MPP foundation for file storage), it distributes files across processing servers. It is well suited to storing data reliably (as multiple copies of each file are kept), but it presents challenges when the data must be accessed through structured or analytical mechanisms (like SQL).

Due to its relatively low cost, Hadoop has become the landing zone of choice for many organizations. From Hadoop, data can be moved to MPP Shared-nothing databases to have algorithms run against it.

The programming model used in file-based solutions is called MapReduce. It has three main steps:

  • Map: Identify and obtain the data to be analyzed, emitting it as key/value pairs
  • Shuffle: Combine the data according to the analytical patterns desired, grouping together all values that share a key
  • Reduce: Remove duplication or perform aggregation on each group in order to reduce the size of the resulting data set to only what is required
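
The three steps can be illustrated with a small word count, the example most often used to introduce MapReduce. This is only a minimal single-machine sketch in plain Python, not Hadoop's actual API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample lines are assumptions made up for illustration. In a real cluster each phase would run in parallel across many servers.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values under their shared key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: aggregate each group down to a single value per key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(records):
    """Chain the three phases (hypothetical helper for this sketch)."""
    return reduce_phase(shuffle_phase(map_phase(records)))


# Illustrative input; in Hadoop these would be lines read from stored files.
lines = ["big data big files", "data in files"]
counts = word_count(lines)
print(counts)
```

The result holds one entry per distinct word, which is much smaller than the raw input. That shrinking is the "reduce" in the name.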
