Open Machine Learning

From Wikiversity

Introduction

This page about Open Machine Learning can be displayed as Wiki2Reveal slides. Single sections are regarded as slides, and modifications to the page immediately affect the content of the slides. The following aspects of Open Machine Learning are considered in detail:

  • (1) Open Source for learning algorithms
  • (2) Open Training Data for a reproducible state of pre-trained machines, e.g. in science and education
  • (3) Licensing of derivative work in the context of Open Machine Learning

Objective

The objective of this Wikiversity learning resource about Open Machine Learning is to address the role of Open Training Data (OTD) for generated output.

Target Group

The target groups of this learning resource on Open Machine Learning are:

  • Bachelor/Master students with the subject
  • people who are interested in Open Innovation Ecosystems and want to use ML infrastructure on Open Data repositories (e.g. for humanitarian support of communities).

Reproducibility

Reproducibility of machine learning concerns the sources that are necessary to reproduce an existing system capable of learning (Digital Public Good[1]). These include:

  • algorithms that define the behavior of the adaptive system,
  • training data that also defines and modifies the behavior of the system, and
  • machine state data, which describes the state of the adaptive system at time $t$.

Reproducibility for algorithms

Learning algorithms define how the machine state changes as a function of the input data or training data. For open machine learning, these must be available as open source code so that the code can not only be used, but also reviewed and improved by the scientific community. An introductory example is the gradient descent method, which the learning algorithms of backpropagation networks use for error minimization.
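
The idea of gradient descent can be illustrated with a minimal sketch. It assumes a one-dimensional quadratic error function chosen only for illustration; the function names, the learning rate and the number of steps are hypothetical and not part of this learning resource.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly move the parameter w against the gradient of the error."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


# Illustrative error function E(w) = (w - 3)^2 with gradient 2 * (w - 3).
def error_gradient(w):
    return 2.0 * (w - 3.0)


# Starting from w = 0, the parameter approaches the minimum at w = 3.
w_final = gradient_descent(error_gradient, w0=0.0)
```

Because the code is short and open, anyone can check how the parameter is updated, which is the kind of review that open source learning algorithms are meant to enable.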

Training data - Definition of In-Output-Behaviour

Two learning systems $M^{(1)}$ and $M^{(2)}$ can, for example, use the same artificial neural network for machine learning with the same initial state $M_{t_0}$, but be "fed" with completely different training data sets and thus exhibit completely different behavior at a later time $t$. In an open reproducible system, the training data used is openly accessible according to the FAIR data principles.
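
The following sketch illustrates this with a deliberately tiny stand-in for a neural network: a single weight $w$ in the model $y = w \cdot x$. The data sets, the learning rate and the function name `train` are assumptions made for illustration only.

```python
def train(w, data, learning_rate=0.05, epochs=200):
    """Fit y = w * x to (x, y) pairs with per-sample gradient steps."""
    for _ in range(epochs):
        for x, y in data:
            # Gradient of the squared error (w * x - y)^2 with respect to w.
            w -= learning_rate * 2.0 * (w * x - y) * x
    return w


initial_state = 0.0  # both systems start from the same state

data_1 = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]     # follows y = 2x
data_2 = [(1.0, -1.0), (2.0, -2.0), (3.0, -3.0)]  # follows y = -x

w_1 = train(initial_state, data_1)
w_2 = train(initial_state, data_2)
```

Both runs use identical code and an identical initial state, yet `w_1` approaches 2 and `w_2` approaches -1, so the two systems answer the same input differently. Reproducing either system therefore requires access to its training data.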

Learning Tasks - Machine Learning

Learning activities focus on the role of training data for machine learning algorithms:

  • (Novice to ML) If you are new to Machine Learning (ML), it is recommended to start by exploring the concepts and foundations of machine learning.
  • (Supervised, unsupervised ML) Explain the concepts of supervised and unsupervised machine learning and apply them to text generation, using text documents under an open content license as training data.

Learning Tasks - Derivative Work

To address licensing and derivative work, we consider open licensing models that can be used to assure community access. The community can access, modify and preserve the resources to which people contributed in an evolutionary process.

  • (Derivative Work) Explain what "derivative work" means, e.g. for Open Source software or Creative Commons documents.
  • (Open Data) What are the challenges and constraints for Open Data that is used for Machine Learning (ML)? Can users add data to a repository?
  • (Transparent Chain of Licenses) Assume that a machine with an initial state $M_{t_0}$ is trained with training data $D$ that is issued under a specific open license $L$. The license allows derivative work. If the machine learning is generative, then the chain of licenses will assign the same license $L$ to the generated text.
  • (Versions of Training Data Sets) Training data may change over time, so the training data $D_t$ may have a time index $t$ as well.
  • (Multiple Licenses for Training Data Sets) If training data sets are aggregated from different sources, then the licenses involved may also change over time (e.g. $L_t = \{L_1, L_2\}$ means that at time $t$ the data set $D_t$ is aggregated from training data with the licenses $L_1$ and $L_2$).
  • (Learning Algorithm) The learning algorithm $A$ defines how the machine state evolves depending on the training data. Consider a discrete iterative step from $t$ to $t+1$ and the training data $D_t$ used at time $t$. The learning algorithm $A$ maps the current machine state $M_t$ together with the training data $D_t$ to the new machine state $M_{t+1} = A(M_t, D_t)$. The next learning step creates, in an inductive way, the next machine state $M_{t+2} = A(M_{t+1}, D_{t+1})$. We can define $A(M_t, \emptyset) = M_t$, meaning that the machine state remains unchanged if no training data is provided.
  • (Origin, Experimental Design, Meta Data) For scientific purposes it is often important to clarify who collected the data and what the experimental design was. Explain the requirements of data collection and the related scientific standards. Discuss the similarities and differences in the context of training data for machine learning. Assume that origin, institution and experimental design are also available as metadata for the training data set. What other metadata is relevant for you to assess the quality of data that is used for training?
  • (Transparency for Trained Models) If training data is not available due to privacy regulations, e.g. for medical data, then detailed license information and metadata, together with the institute that trained the model, could help to assess whether the machine state $M_t$ at time $t$ might be used for a specific purpose.
  • (Machine State) If we aggregate the considerations above, we can create a reference to the machine state $M_{t+1} = A(M_t, D_t)$ that results from training the machine in state $M_t$ with the learning algorithm $A$ and the training data $D_t$. At time $t$ the machine learning used the list of licenses $L_t$ and optionally other metadata (for reproducibility). The in-output pair $(x, y)$ states that $y$ was generated from the input $x$. So the tuple $(M_{t+1}, A, D_t, L_t, x, y)$ documents how $y$ was generated from $x$, together with the involved licenses $L_t$ and the metadata for $D_t$. Assume that $A$ and $D_t$ are references to versions in a version control system. Discuss the limitations of the approach, especially when a huge amount of data is used for training and the machine is constantly trained with an input stream of data.
  • (Generative Artificial Intelligence) When a user works with generative AI, some components of the tuple might not be known. But at least the sequence of prompts with the corresponding outputs can be made compulsory to document, especially for student homework. This makes it possible to identify the value added by the student beyond generative AI. Key questions for the assessment:
    • Was the logical structure of the thesis defined by the generative AI, and what are the students' reasons for changing the generated logical structure? How did the changes improve the logical structure of the thesis in comparison to the AI-generated results?
    • Are citations/references included in the generated document? Do the citations provide evidence for the content discussed in the thesis?
    • Is the state of the art in science properly integrated in the thesis, or does the research question in the thesis require other relevant scientific results to cover the current scientific knowledge about the topic? Did the student add those references and argue in the thesis why these citations were missing?
    • Using generative AI transparently in a thesis requires three components:
      • (Prompt Results) pairs $(p_i, r_i)$ of prompts $p_i$ and results $r_i$,
      • (Manual Changes) manual changes by the students that turn $r_i$ into $r_i'$, and
      • (Meta Analysis) a meta discussion of why the changes to $r_i$ are necessary to fulfill the scientific requirements of the thesis.
  • (Licensing Chains) With licensing chains it is possible to track which licensing models are used along with the training data $D_t$. Due to this openness of licensing, the training data can be reduced to a subset $D_t' \subseteq D_t$ when not the complete set of training data is compliant with a required license $L$. Instead of training the machine with $D_t$, we train with the license-compliant subset $D_t'$ and obtain a new machine state $M_{t+1}' = A(M_t, D_t')$, where $A$ is the learning algorithm and $M_t$ the current machine state. Discuss applications of this scenario and discuss PROs and CONs of a reduced set of training data for the training process at time $t$.
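
The learning step and the license-compliant subset from the tasks above can be sketched as follows. This is a toy illustration under stated assumptions: the "machine state" is just a word-count dictionary, and the names `Record`, `learning_step` and `compliant_subset` as well as the example texts and license identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    text: str
    license: str  # illustrative identifier, e.g. "CC-BY-SA-4.0"


def learning_step(state, data):
    """Toy learning algorithm A: returns the next machine state.

    With no training data the state is returned unchanged,
    mirroring the convention that empty data leaves the state as it is.
    """
    if not data:
        return state
    new_state = dict(state)
    for record in data:
        for word in record.text.lower().split():
            new_state[word] = new_state.get(word, 0) + 1
    return new_state


def compliant_subset(data, allowed_licenses):
    """Keep only the records whose license is in the allowed set."""
    return [r for r in data if r.license in allowed_licenses]


d_t = [
    Record("open data helps science", "CC-BY-SA-4.0"),
    Record("closed data stays closed", "proprietary"),
    Record("open licenses help communities", "CC0-1.0"),
]
m_t = {}  # current machine state

m_full = learning_step(m_t, d_t)
m_reduced = learning_step(m_t, compliant_subset(d_t, {"CC-BY-SA-4.0", "CC0-1.0"}))
m_same = learning_step(m_t, [])
```

Comparing `m_full` and `m_reduced` shows the trade-off to discuss: the reduced state is free of non-compliant material, but it has also never seen the word "closed".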

Examples - Derivative Work for Data

Consider the following examples as an introduction and discuss differences and similarities of Machine Learning that is based on a modified training set $D_{t+1}$, derived from the existing training data set $D_t$ at time $t$:

  • (New Data) a new empirical study with the same experimental design as the existing training data set $D_t$ adds data, which results in a new set $D_{t+1}$,
  • (Missing Data) missing values in the existing training data set $D_t$ are added, and $D_{t+1}$ is the corrected data,
  • (Correct Data) input errors are corrected (e.g. an erroneous input value for the temperature was replaced by the correct value).
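
One possible way to make such derived data sets distinguishable is to give every version a content hash. The sketch below assumes small JSON-serializable records; the station names and temperature values are hypothetical, and `version_id` is an illustrative helper, not an established tool.

```python
import hashlib
import json


def version_id(dataset):
    """Return a content hash that changes whenever the data changes."""
    canonical = json.dumps(dataset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Hypothetical temperature readings; None marks a missing value.
d_t = [{"station": "A", "temp_c": 21.5}, {"station": "B", "temp_c": None}]

# (New Data) append a record from a follow-up study.
d_new = d_t + [{"station": "C", "temp_c": 19.0}]

# (Missing Data) fill in the missing value for station B.
d_filled = [dict(r, temp_c=20.1) if r["temp_c"] is None else r for r in d_t]

# (Correct Data) replace an assumed input error for station A.
d_corrected = [dict(r, temp_c=12.5) if r["station"] == "A" else r for r in d_t]

versions = {
    name: version_id(data)
    for name, data in [
        ("original", d_t),
        ("new", d_new),
        ("filled", d_filled),
        ("corrected", d_corrected),
    ]
}
```

Each modification yields a different identifier, so a trained machine can reference exactly the version of the data it was trained on. The hash alone does not say who made a change or why, which is where the discussion of trust and metadata begins.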

Learning Tasks - Open Data

Transfer the concept of derivative work to open data and discuss how alteration and modification of data can be managed in a trusted way by the community.

Learning Tasks - Training Data

Analyze open licensing models (like the GNU General Public License, Creative Commons, ...): how can donated digital contributions and their derivative works remain open for the community? How does the licensing model contribute to an Open Innovation Ecosystem with digital public goods[1]? Apply this concept to training data for Open Machine Learning and discuss the requirements and constraints. Apply open data to Spatial Risk Management, e.g. in the context of road safety[2]. What are the benefits, challenges, requirements and constraints of using machine learning in that context?

Learning Tasks - Open Machine Learning Licensing Chains

Assume we have training data $D_t$ (e.g. text documents under a Creative Commons License) and train a machine at a time $t$ with an open source machine learning algorithm that is transparently available to the community. The training process leads to a new system state $M_{t+1}$ of the machine and changes its In-Out-Behaviour (IOB). Now it generates the output $y$ from input data $x$ with $y = M_{t+1}(x)$. What licensing model should be assigned to the output $y$ if the training data is provided under the license $L$? Discuss different views about a licensing chain that uses the output $y$ of the machine as training data to generate further output.
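
To make the discussion concrete, one could record for every generated output which algorithm version, data version and licenses were involved, and link the records when generated output is reused as training data. The following is only a sketch: the class `Provenance`, its field names and the reference strings are hypothetical, and the code documents a chain without deciding which license legally applies to the output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    algorithm_ref: str   # e.g. a version reference of the open source learner
    data_ref: str        # e.g. a version reference of the training data
    licenses: tuple      # licenses attached to the training data
    input_text: str
    output_text: str
    parent: Optional["Provenance"] = None  # set if training data was generated


def license_chain(record):
    """Collect the training data licenses along the chain, oldest first."""
    chain = []
    while record is not None:
        chain.append(record.licenses)
        record = record.parent
    return list(reversed(chain))


first = Provenance("algo@v1", "data@v1", ("CC-BY-SA-4.0",), "prompt 1", "output 1")

# The output of the first run is reused as training data for a second run.
second = Provenance("algo@v1", "data@v2", ("CC-BY-SA-4.0",), "prompt 2", "output 2",
                    parent=first)

chain = license_chain(second)
```

With such records the different views on the licensing chain can be compared against the same documented facts, for example whether every step in `chain` permits derivative work.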

Learning Tasks - Same Machine Learning Algorithm, Different Training Data

Assume we have two different Open Training data sets $D^{(1)}$ and $D^{(2)}$. Furthermore, we use a neural network model (e.g. a backpropagation network) with a predefined topology (i.e. number of neurons, connections between neurons, layers of neurons, ...) and predefined activation functions of the neurons. So the starting states of the two machines $M^{(1)}$ and $M^{(2)}$ at time $t_0$ are considered to be the same.

Use different Training Data Sets

Due to the two different data sets $D^{(1)}$ and $D^{(2)}$ used for the open, transparent and reproducible training of the machines $M^{(1)}$ and $M^{(2)}$, the machines evolve on different pathways over the time index $t$.

Trained In-Output-Behavior

In general the In-Output-Behaviour (IOB) of the two machines will be different at the time $t$. Discuss the role of the training data sets $D^{(1)}$ and $D^{(2)}$ as part of "programming" the IOB of the machines.

Bias in Training Data

What is a bias? Discuss an example of training data that has a bias (e.g. in the context of human rights, added fake news data, missing data, unreliable data sources, ...) and explain how the bias in the training data has an impact on the In-Output-Behaviour (IOB) of a machine that is used for decision making, e.g. in the medical domain[3]. How can transparency and openness of the training data help to identify a bias[4]? What are the challenges, requirements and constraints (e.g. data privacy regulations)?
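
A toy example can show how a sampling bias in the training data carries over to the behaviour of the trained machine. Everything in the sketch is hypothetical: the two groups, their labels and the deliberately simplistic "model" that always predicts the most frequent training label.

```python
from collections import Counter


def train_majority(labels):
    """Toy 'model': always predict the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(prediction, labels):
    """Fraction of labels that match the constant prediction."""
    return sum(1 for y in labels if y == prediction) / len(labels)


# Hypothetical population: group A is mostly "low", group B mostly "high".
group_a = ["low"] * 8 + ["high"] * 2
group_b = ["low"] * 2 + ["high"] * 8

# Biased training data: group B is almost absent from the sample.
biased_training = group_a + group_b[:1]
model = train_majority(biased_training)

acc_a = accuracy(model, group_a)
acc_b = accuracy(model, group_b)
```

The model predicts "low" and is right for 80% of group A but only 20% of group B. If the training data is open, the missing group can be spotted by inspecting the data; with closed data only the unequal outcomes are visible.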

Assessment and Machine Learning

Assume a student $A$ has to do an assessment about a specific topic and has to produce a specific document as homework for it. Student $A$ works hard, creates a good, appropriate piece of homework and submits the result to the teacher. Additionally, student $A$ provides the document to another student $B$. Student $B$ takes the same document, just like taking a book from the shelf, and submits it to the teacher too (e.g. without looking into the document). From an outside perspective we assume that student $A$ has understood the topic of the homework. Student $B$ may not have any insights into the topic:

  • An oral assessment about the topic or
  • a supervised transfer task of the lessons learnt to a new example or counterexample may uncover the knowledge about the topic.

Transfer this example to Machine Learning and derive consequences for how an assessment can be designed that assesses gained knowledge and problem-solving capabilities instead of the successful delivery of an appropriate document for the learning task.

References

  1. Nordhaug, L. M., & Harris, L. (2021). Digital public goods: enablers of digital sovereignty. In Development Co-operation Report 2021. DOI: 10.1787/c023cb2e-en
  2. Najjar, A., Kaneko, S. I., & Miyanaga, Y. (2017, February). Combining satellite imagery and open data to map road safety. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 31, No. 1).
  3. Mac Namee, B., Cunningham, P., Byrne, S., & Corrigan, O. I. (2002). The problem of bias in training data in regression problems in medical decision support. Artificial intelligence in medicine, 24(1), 51-70.
  4. Khosla, A., Zhou, T., Malisiewicz, T., Efros, A. A., & Torralba, A. (2012). Undoing the damage of dataset bias. In Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part I 12 (pp. 158-171). Springer Berlin Heidelberg.

Page Information

You can display this page as Wiki2Reveal slides.

Wiki2Reveal

The Wiki2Reveal slides were created for the Open community approach, and the link for the Wiki2Reveal slides was created with the link generator.