Big Data

Currently this resource draws some inspiration from Big Data. (The former goal was not only to create a comprehensive collection of individual pieces, but also to bring them together, indicate the dependencies between them, and group them.)

The current goal is to organize the pieces into divisions in which each piece has a discernible learning objective, with the overall learning objective being that a learner understands how Big Data works.

Big Picture

First, this page describes the big picture: how the technologies used in the context of Big Data interlink and how they can be grouped. After that, Technologies are discussed in more detail.

What is Big Data?

Data is the raw, unfiltered output that different programs produce, in contrast to information, which has been processed. Big Data refers to the volume, velocity and variety of that raw data. Volume is usually measured in gigabytes or terabytes, and velocity in gigabytes or terabytes per unit of time, such as per hour or per day. Variety refers to the forms the data takes: it can be structured, as in a database like Oracle or MySQL, or unstructured, like a log file. A fourth V has also been proposed: veracity, or the truthfulness of the data. This refers to ensuring that the data is clean and accurate.
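
To make the velocity figures concrete, the short sketch below converts a rate such as "1 terabyte per day" into megabytes per second. The function name and the use of decimal units (1 GB = 10^9 bytes) are assumptions made for this illustration only.

```python
# Illustrative only: decimal units are assumed (1 GB = 10**9 bytes).
SECONDS_PER = {"hour": 3600, "day": 86400}
BYTES_PER = {"GB": 10**9, "TB": 10**12}


def velocity_mb_per_s(amount, unit, period):
    """Convert e.g. '1 TB per day' into megabytes per second."""
    bytes_per_second = amount * BYTES_PER[unit] / SECONDS_PER[period]
    return bytes_per_second / 10**6


for amount, unit, period in [(1, "TB", "day"), (500, "GB", "hour")]:
    rate = velocity_mb_per_s(amount, unit, period)
    print(f"{amount} {unit} per {period} is about {rate:.2f} MB/s")
```

Under these assumptions, one terabyte per day works out to roughly 11.57 MB/s arriving around the clock.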

Why Big Data?

Traditional data processing practices focused on centralized batch processes that assumed finite data growth. With the explosive growth of the Internet, and the decreasing costs of both storage and data collection, these centralized resources are quickly overwhelmed.

The response to this was to harness the power of parallel computing. Before Big Data, parallel processing was expensive and specialized, requiring custom programming and special hardware. Big Data aimed to use ordinary commodity hardware to form a large cluster that could do the processing.

How does Big Data Work?

In contrast to conventional processing, where all of the data is brought to the CPU for processing, Big Data sends the program that needs to be executed to where the various bits of data reside. In simple terms, it takes a single file, splits it into pieces, and stores those pieces on various physical computers around the cluster. When it needs to process that data, it figures out which computers hold parts of the file and sends the program to those computers. This allows each computer to process its piece of the file at the same time, giving full parallelism. After this processing, depending on what information is needed, there might have to be some post-processing, and then a result is produced.
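
The steps above can be sketched on a single machine. This is a minimal simulation, not how Hadoop is implemented: the node names, the round-robin placement and all function names are hypothetical, and threads merely stand in for separate computers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_pieces(lines, piece_size):
    """Split a 'file' (a list of lines) into fixed-size pieces."""
    return [lines[i:i + piece_size] for i in range(0, len(lines), piece_size)]


def store_on_cluster(pieces, node_names):
    """Assign pieces to nodes round-robin (a hypothetical placement policy)."""
    cluster = {name: [] for name in node_names}
    for index, piece in enumerate(pieces):
        cluster[node_names[index % len(node_names)]].append(piece)
    return cluster


def count_words(local_pieces):
    """The 'program' sent to each node: it only sees that node's pieces."""
    counts = Counter()
    for piece in local_pieces:
        for line in piece:
            counts.update(line.split())
    return counts


def run_on_cluster(cluster, program):
    """Run the same program against every node's local pieces at once."""
    with ThreadPoolExecutor(max_workers=len(cluster)) as pool:
        partial_results = list(pool.map(program, cluster.values()))
    # Post-processing: merge the per-node partial results into one answer.
    total = Counter()
    for partial in partial_results:
        total.update(partial)
    return total


log_file = [
    "error disk full",
    "info job started",
    "error network down",
    "info job finished",
    "warn disk slow",
]
cluster = store_on_cluster(split_into_pieces(log_file, 2), ["node-a", "node-b"])
word_counts = run_on_cluster(cluster, count_words)
print(word_counts["error"], word_counts["disk"], word_counts["info"])
```

The data never moves after it is stored: only the small `count_words` function and the partial counts travel between the nodes and the caller, which is the point of sending the program to the data.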

It's important to recognize that there is some overhead to the process, and that conventionally sized data would be slower to process because of it. However, with larger volumes of data, the processing time quickly dwarfs the overhead, and the parallelism provides significant savings.
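
The trade-off can be illustrated with a deliberately idealized cost model: a fixed start-up overhead plus perfectly even division of the work across nodes. The model, the function names and every number below are assumptions chosen for illustration, not measurements of any real cluster.

```python
def serial_seconds(data_gb, rate_gb_per_s):
    """Time for one machine to process all of the data."""
    return data_gb / rate_gb_per_s


def cluster_seconds(data_gb, rate_gb_per_s, nodes, overhead_s):
    """Fixed overhead plus the work divided evenly across the nodes."""
    return overhead_s + data_gb / (rate_gb_per_s * nodes)


def break_even_gb(rate_gb_per_s, nodes, overhead_s):
    """Data size at which both approaches take the same time."""
    # Solve d / r = o + d / (r * n) for d, giving d = o * r * n / (n - 1).
    return overhead_s * rate_gb_per_s * nodes / (nodes - 1)


# Hypothetical figures: 0.125 GB/s per machine, 8 nodes, 30 s of overhead.
RATE, NODES, OVERHEAD = 0.125, 8, 30.0

for size_gb in (1, 1000):
    single = serial_seconds(size_gb, RATE)
    parallel = cluster_seconds(size_gb, RATE, NODES, OVERHEAD)
    print(f"{size_gb} GB: single machine {single:.0f} s, cluster {parallel:.0f} s")

print(f"break-even at about {break_even_gb(RATE, NODES, OVERHEAD):.2f} GB")
```

With these made-up figures the 1 GB job is slower on the cluster (31 s versus 8 s), while the 1,000 GB job is far faster (1,030 s versus 8,000 s). Real clusters divide work less evenly than this model assumes, so the actual break-even point would differ.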

What these Learning Resources Examine

These learning resources focus on the various mechanisms that are put together to build a Big Data processing system. Our focus will be on the open-source Hadoop system, and we will examine file storage, batch and interactive processing, data organization and other topics.

Learning Resources

File Storage

In Hadoop, file storage is the responsibility of the Hadoop Distributed File System (HDFS). To learn more about how it works, follow the HDFS link.

Reading and Contributing

At the moment, the structure is fairly simple: everything that belongs to Big Data is currently in Technologies. Over time, articles should be clustered into better groups, e.g. theoretical algorithms and data structures, implementations, etc.

Articles should have the following structure:

Brief description of the technology the article describes, to give everybody an overview of what it is about.
List of related^[1] technologies^[2]
More detailed description tackling the following questions:
1. What is the idea of the technology? (What is the problem? How does it differ from other technologies?)
2. How does the technology work?
3. What is the approach good at?
4. What are the drawbacks^[3]?
References, structured as follows:
1. Original articles, publications, homepage
2. Articles, papers, blogs, etc. with further descriptions
3. Implementations and extensions to the approach itself
4. Use cases and reports on how to install/maintain/use it (do not add this kind of information directly here)

The discussion page of each article should be used for further comments. Please feel free to do this!

If you want to add or extend an article, feel free to do so. Please try to follow the proposed structure.

[1] Technologies that the discussed approach is based on, uses, or is similar to.
[2] This section is purely for fast navigation between related technologies. Links within this Wikiversity topic are preferred; otherwise, links to Wikipedia pages or just a line of text are accepted. For everything else, please use the references section.
[3] Almost no approach is perfectly suited to every situation; we do not intend to diminish the work that has been done.

Acknowledgement

The initial idea for this collection came from the Reading Club on big data by René Pickhardt.
