Data science

From Wikiversity

Data science is a family of techniques and technologies used together to extract knowledge and value from data, and to deliver that insight to a scientific project, to stakeholders, or to consumers of the analysis.

Popular techniques for data science range from machine learning algorithms such as support vector machines and random forests to statistical techniques such as regression analysis. Other techniques include deep learning (using neural networks) and unsupervised learning, including principal component analysis (PCA) and clustering techniques such as hierarchical clustering and k-means.
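As a minimal illustration of one of the statistical techniques named above, the following sketch fits a simple linear regression (ordinary least squares with one predictor) in plain Python; the sample data are invented for the example.

```python
# Simple linear regression (ordinary least squares) for one predictor:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).

def linear_regression(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical sample data: e.g. hours of operation vs. energy consumption.
x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]
slope, intercept = linear_regression(x, y)
print(f"y = {slope:.2f}*x + {intercept:.2f}")  # prints: y = 1.99*x + 0.09
```

The same model is available ready-made in tools such as R or Python's scientific libraries; the closed-form version above only makes the underlying computation explicit.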

Popular technologies for data science include open-source tools such as R[1], Python[2], and Spark[3], as well as proprietary tools such as IBM SPSS[4], Azure Machine Learning Studio[5], and SAS[6], among many others.

Learning Tasks[edit | edit source]

  • (Basic Workflow) Explain the basic workflow for creating a result for decision makers from available raw data, for a use case of your choice and scientific background, and answer the following questions:
    • What kind of insights do you want to gain from the raw data you have collected or plan to collect?
    • What is the target group of scientists or decision makers that would benefit from your results?
    • Which scientific methods of data analysis do you know, and how do you select the appropriate methods to analyse the collected data?
    • How are other scientists able to reproduce the results in an independent study? How does this contribute to the confidence of decision makers in deriving decisions from the applied scientific data analysis techniques?
  • (Big Data) Search for estimates of the daily, monthly, and yearly amount of data created in your area of expertise. What kind of IT infrastructure do you need to process large amounts of data? Algorithms applied in data science might gain performance (e.g. processing with algorithm A might take only 30% of the time required by algorithm or method B) but at the same time might lose accuracy in the analysis. Derive basic consequences for handling big data and for selecting appropriate scientific methods for the analysis of big data. Compare these methods with brute-force methods on small data sets that can be handled on your PC.
  • (Dynamic Document Generation) The concept of dynamic document generation reads data from one or more files, performs data analysis on the data, and produces
    • a presentation with charts and the results of the analysis, e.g. with KnitR, as a LibreOffice Impress document, or as Wiki2Reveal for use in a Wikiversity learning resource,
    • an office document with those charts and results (e.g. LibreOffice Writer), or
    • a PDF document as a report for decision makers.
Explain the benefit for decision makers if those reports are created as decision support products on a daily basis. Create a small KnitR workflow for a basic statistical analysis of a sample data set in your area of scientific expertise.
  • (Database Management) Explain the role of database management in performing data science. What are the requirements and constraints, and what are the challenges?
  • (Use Case) The use case in this learning resource applies the basic concepts of normalization, data density, and moving averages as examples for the analysis of some sample data. At the same time it provides basic insights into encoding data science in an open-source script that analyses data and produces a decision support document for further analysis.
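Two of the concepts named in the use case can be sketched in a few lines. The following Python example (a minimal sketch; the data values are invented for illustration) applies min-max normalization and a simple moving average to a small sample series:

```python
def normalize(values):
    """Min-max normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def moving_average(values, window=3):
    """Simple moving average over a sliding window of the given size."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Hypothetical sample data, e.g. daily measurements.
data = [12.0, 15.0, 11.0, 18.0, 20.0, 17.0]
print(normalize(data))       # series rescaled to [0, 1]
print(moving_average(data))  # smoothed series, 3-value window
```

In a dynamic-document workflow, a script like this would run inside a KnitR (or similar) report so that the normalized and smoothed figures are regenerated whenever new data arrive.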

See also[edit | edit source]

References[edit | edit source]