Commercial Data Harvesting

From Wikiversity

The concept of Commercial Data Harvesting (CDH) needs six basic constituents:

  • (Benefit/Incentive) An information or communication service or a game that is attractive for users. The user is the provider of the data that the CDH company can sell. The intention is that users perceive the IT infrastructure as a service rather than as a means of harvesting their data.
  • (USER GROUP) A large community of users of the service who generate the data (e.g. users of an information system, a messenger or, in general, a software package).
  • (CDH COMPANY: Service Provider) The company that performs commercial data harvesting.
  • (Method: User Data Analysis) Analysis of the data collected from and about the users, using data mining approaches to distill (digital) products[1] that can be sold to customers of the company.
  • (CDH CUSTOMERS: Buyers of User Data and derived products) Customers of the company that performs commercial data harvesting (CDH). The customers are willing to pay for knowledge about users, e.g. for advertisements tailored to the users' profiles. Users are also embedded as employees in commercial, research and development contexts and, through their interaction with their digital environment, provide data to the CDH company.
  • (DATA4SERVICE) The payment for CDH user data and derived services and IT products makes a free service possible (e.g. free e-mail account, free use of a messenger, ...). CDH does not depend on a free service as a reward for the provision of data.



This leads to the following definition:

Commercial Data Harvesting is a concept that
  • uses a communication and information service or game to collect data from a target user group and
  • sells the data or derived digital products to customers who expect a benefit from having the data or from using a digital service that is based on the harvested user data.

Value of Harvested Data


The value of the data and of the derived information services depends on:

  • Size: the size of the community determines the impact that CUSTOMERS can achieve with the harvested data.
  • Community Network: Who communicates with whom? What type of target group works in the network (educators/students, engineers/developers, researchers, administration)? What type of data can be harvested?
  • Content: What are the topics that are discussed?
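
The "Size" and "Community Network" factors above can be explored with a small experiment. The following is a minimal sketch, assuming that only metadata (sender and recipient) is available and no message content. The sample records and the function name `community_stats` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical metadata records: (sender, recipient). No message content is needed.
messages = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "alice"),
    ("carol", "alice"),
    ("dave", "alice"),
]

def community_stats(records):
    """Return community size, number of contacts per user, and messages per pair."""
    contacts = defaultdict(set)
    pair_counts = Counter()
    for sender, recipient in records:
        contacts[sender].add(recipient)
        contacts[recipient].add(sender)
        # Count each pair regardless of the direction of the message.
        pair_counts[frozenset((sender, recipient))] += 1
    size = len(contacts)
    degree = {user: len(peers) for user, peers in contacts.items()}
    return size, degree, pair_counts

size, degree, pair_counts = community_stats(messages)
print("community size:", size)
print("most connected user:", max(degree, key=degree.get))
```

Even this small sketch shows that the structure of a network can be derived without reading a single message.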



Explain the requirements and constraints for avoiding commercial data harvesting in critical infrastructure!

  • Explore the concept of business analytics for CDH from the perspective of the data provider, i.e. the user as an individual, as a company or as an institution. What kind of CDH data creates disadvantages or vulnerabilities for yourself or for your working environment?
  • How would you address this with internal capacity building, e.g. for staff members?

Digital Learning Environments


Derived Information

  • Create a user profile of knowledge and expertise, e.g. to derive tailored advertisements. The basic driver is that the probability of buying a product is higher if the advertisement matches the interests and background of the user.
  • Political opinions and attitudes: Political statements can be tailored to public opinions that are identified by data mining methods.
  • Leisure activities, used technology: Users can be guided to leisure activities that are of interest to them.
  • Health related information and fitness: Certain activities have a positive or negative impact on health. The knowledge about these activities may be of interest for health care and health insurance.
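
The first item above (a user profile that drives tailored advertisements) can be illustrated as follows. This is a minimal sketch, assuming a simple word count as the profile and a hand-written keyword table per advertisement category. `AD_KEYWORDS`, `build_profile` and `best_ad` are hypothetical names, and real systems would use far richer data and models.

```python
from collections import Counter

# Hypothetical keyword sets per advertisement category (assumption for illustration).
AD_KEYWORDS = {
    "running shoes": {"marathon", "running", "training"},
    "programming books": {"python", "compiler", "debugging"},
}

def build_profile(texts):
    """Count how often each lower-cased word occurs in a user's texts."""
    words = Counter()
    for text in texts:
        words.update(text.lower().split())
    return words

def best_ad(profile):
    """Pick the ad category whose keywords occur most often in the profile."""
    scores = {ad: sum(profile[w] for w in kws) for ad, kws in AD_KEYWORDS.items()}
    return max(scores, key=scores.get), scores

user_texts = [
    "Training for my first marathon",
    "Running again tomorrow, training plan attached",
    "Need help debugging this script",
]
profile = build_profile(user_texts)
ad, scores = best_ad(profile)
print(ad, scores)
```

The same counting approach applies to the other items in the list, with keyword tables for political topics, leisure activities or health.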

IT-Environments for Harvesting

  • Commercial data harvesting needs IT environments in which users leave a "large" Digital Footprint. Analyse your own online behavior! Where do you leave a digital footprint? Determine roughly the percentage of your total online time, or explicitly the time span, for each IT environment. Examples of IT environments that can serve as harvesting environments are:
    • Messengers (WhatsApp, Telegram, Signal, deltaChat, Mail, ...)
    • Social Media,
    • Office Products (e.g. writing project proposals, summaries, results, an analysis, ...)
    • GPS-Tracks and Navigation,
    • Voice Recognition,
    • Videoconferencing that runs on IT infrastructure that is not controlled by the company or the research and development unit,
    • Online petitions
    • ...
  • Analyse the benefits and drawbacks for yourself and perform a Risk Analysis
    • for yourself,
    • for a company or institution you work for or
    • in general for institutions, companies, ... you know (e.g. health care facilities, governmental administration, ...).
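
For the self-analysis of the digital footprint described above, the percentages can be computed from a rough time log. This is a minimal sketch. The environment names and the minutes in `week` are made-up example values, and `footprint_shares` is a hypothetical helper.

```python
def footprint_shares(minutes_per_environment):
    """Return each environment's share of total online time in percent."""
    total = sum(minutes_per_environment.values())
    if total == 0:
        return {}
    return {
        env: round(100 * minutes / total, 1)
        for env, minutes in minutes_per_environment.items()
    }

# Hypothetical log of one week, in minutes per IT environment.
week = {
    "messenger": 300,
    "social media": 200,
    "office products": 400,
    "navigation": 100,
}
shares = footprint_shares(week)
for env, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{env}: {share} %")
```

The environments with the largest shares are a reasonable starting point for the risk analysis.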

Learning Tasks

  • (Customer or Data Source) Users may use certain services under diverse constraints. The software tools might be free, preinstalled or even compulsory to use with a digital service if you buy a hardware product. Services are e.g.
    • e-mail account,
    • fitness analysis,
    • routing and navigation support,
    • ...
Expand the list above and identify the type of data that can be collected. For mail services, for example, the content of the mail can be harvested or, if the mail is encrypted (e.g. with GNU Privacy Guard), who communicates with whom.
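
The point about encrypted mail can be made concrete: encrypting the body of a mail leaves the transport headers readable. The following is a minimal sketch using Python's standard `email` package. The addresses and the mail text are invented for illustration, and the body is only a placeholder, not real ciphertext.

```python
from email import message_from_string

# Hypothetical mail with an inline PGP-encrypted body (placeholder ciphertext).
raw_mail = """\
From: alice@example.org
To: bob@example.org
Date: Mon, 11 Sep 2017 10:00:00 +0000
Content-Type: text/plain

-----BEGIN PGP MESSAGE-----
(encrypted body, unreadable without the private key)
-----END PGP MESSAGE-----
"""

msg = message_from_string(raw_mail)

# The headers remain readable for every server that handles the mail.
metadata = {"from": msg["From"], "to": msg["To"], "date": msg["Date"]}

# The content itself is only visible as an encrypted block.
body_is_encrypted = msg.get_payload().lstrip().startswith("-----BEGIN PGP MESSAGE-----")

print(metadata)
print("body encrypted:", body_is_encrypted)
```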
  • (Drivers to allow Commercial Data Harvesting) The drivers for allowing commercial data harvesting might be different.
    • some users regard themselves as customers of a provider of a free digital service instead of as the information source that they become by using a digital product,
    • users do not know alternatives,
    • users cannot afford a paid service,
    • users compare paid and free services and choose the free service,
    • users do not care that data is harvested with the service,
    • ...
Expand the list above and analyze your own IT-habits of using a specific service. Why do you use a specific service and when do you avoid using a specific service?
  • (Commercial Value of Harvested Data) Harvested data is processed, analyzed and sold to someone who is willing to pay for the information itself or for the services derived from the analyzed user profiles. From the angle of the data harvester it is important that users spend as much time as possible in the digital infrastructure and expose data about their profile there (e.g. for tailored advertisements). Why is it important that users regard themselves as "customers of a free digital service", or do not think much about it, instead of seeing themselves as part of a digital product that is sold? Elaborate on the optimization of data harvesting for a service. What is valuable information for the harvester? What is "noise" that can be ignored? How can pattern recognition and machine learning be used to distinguish between valuable information with commercial value and irrelevant "noise" in the harvested data? Keep in mind that the identification of relevant and irrelevant information depends on the information you want to extract. Select an example and explain the different approaches for analytic methods for the data!
  • (Speech Recognition) Explain the role of speech recognition on mobile devices[2] for Commercial Data Harvesting. How is it possible to derive tailored advertisements by analyzing conversations? What are the potential privacy concerns[3] of individuals, research or development units, health care facilities, ...?
  • (Competition with an Award) A company designs a competition with a first, second and third prize for providing a solution to a given problem.
    • Compare the PROs and CONs of a competition with those of a research and development unit of the company.
    • Many submissions to the competition have weaknesses and will not get an award. Why do even unsuccessful submissions have a value for the company and for the solution of the given problem? Would you communicate the value of the submissions for the company to the participants?
What are the similarities and differences between a Competition with an Award and Commercial Data Harvesting?
Discuss the need to communicate why data is collected from a Neutral Point of View (NPOV), to support users in deciding whether or not they are willing to share their data for a specific purpose.
  • (Task for Authors of the Learning Resource) How should this learning resource evolve so that the Neutral Point of View (NPOV) in Wikiversity is respected (use the talk/discuss page)?
  • (Artificial Intelligence) Commercial Data Harvesting, e.g. from mobile devices, fitness trackers, ..., generates user-specific data. Analyse the concepts of artificial intelligence and explain how AI can be applied for pattern recognition in collected data about users!
  • (Digital Learning Environment) This learning task focuses on learning environments and protective measures for learners/students to avoid commercial data harvesting in the educational system. Explain why it is important that data about students is protected. Furthermore, the adaptation of a digital learning environment to the requirements and constraints of the learner (Learner Analytics) can be designed so that the data about the learner does not leave the learner's device and the learner has control over
    • the selected application,
    • the selected storage device,
    • data protection of the learner data,
    • ...
Explain how you would design an open source mobile device distribution (e.g. based on LineageOS) to create a tailored Linux/Android distribution for colleges, universities or schools that has all the tools preinstalled that are allowed in the IT infrastructure of the educational unit. How can schools and educational units share Open Educational Resources and add adaptive components to the underlying open source operating system? Build bottom up
  • from the operating system on clients with root access,
  • an open source server infrastructure that can be shared between schools, e.g. via a RESTful API,
  • a boot selector for the device to select the initial operating system for the tailored installation of the LineageOS image for the device, course, class, ...
  • Explain how you would select the appropriate apps for the group of learners, ...
  • ...
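
For the task on the commercial value of harvested data, the separation of valuable information from "noise" can be tried out on a toy example. This is a minimal sketch that uses a plain document-frequency threshold, not a trained machine learning model: words that almost every user writes are treated as noise, and words that only a few users write are kept as a possible signal. The user texts, the threshold `max_share` and the function name `distinctive_terms` are assumptions for illustration.

```python
from collections import Counter

def distinctive_terms(user_docs, max_share=0.5):
    """Keep, per user, the words used by at most `max_share` of all users."""
    n_users = len(user_docs)
    # Document frequency: in how many users' texts does each word occur?
    df = Counter()
    for text in user_docs.values():
        df.update(set(text.lower().split()))
    result = {}
    for user, text in user_docs.items():
        words = set(text.lower().split())
        result[user] = sorted(w for w in words if df[w] / n_users <= max_share)
    return result

# Hypothetical harvested texts of three users.
docs = {
    "u1": "hello thanks see you at the marathon",
    "u2": "hello thanks the compiler crashed",
    "u3": "hello thanks the insurance form",
}
signal = distinctive_terms(docs, max_share=0.34)
for user, terms in signal.items():
    print(user, terms)
```

What counts as noise depends on the question asked: a harvester interested in politeness or writing style would keep exactly the words that this sketch discards.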

Do you want to create a paper for the WikiJournal of Science? Extend the topic with state-of-the-art technology, IT strategies and an analysis of the basic concepts of business plans for CDH, or write an encyclopedic paper for the WikiJournal of Science. Feel free to incorporate parts of this learning resource into the paper. Just use the "Cite this page..." feature for the reference (see also Open Paper Development).

See also



References

  1. Silverstein, C., Marais, H., Henzinger, M., & Moricz, M. (1999, September). Analysis of a very large web search engine query log. In ACM SIGIR Forum (Vol. 33, No. 1, pp. 6-12). ACM.
  2. McGraw, I., Prabhavalkar, R., Alvarez, R., Arenas, M. G., Rao, K., Rybach, D., ... & Parada, C. (2016, March). Personalized speech recognition on mobile devices. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5955-5959). IEEE.
  3. Ramos, C., Augusto, J. C., & Shapiro, D. (2008). Ambient intelligence—the next step for artificial intelligence. IEEE Intelligent Systems, 23(2), 15-18.
  4. Humanitarian Open Street Map Team - Web Portal (accessed 2017/09/11) -