The usefulness in anthropology and digital ethnography of the free software corpus analysis tool TXM

From Wikiversity

Working on it!

Acknowledgement

I would like to thank Psychoslave for informing me about the existence of the Wikimedia-l mailing list, the team of the Textométrie project that develops and supports the TXM software[1], the instructors of the course Méthodologie de l'analyse de corpus en linguistique taught at UCLouvain during the academic years 2017-2018 and 2018-2019[2], and finally the company DeepL GmbH[3] for its free online translation service, used to write this text from my native French.

Introduction and initial question

As a result of the rise of ICTs, anthropologists, like other researchers in the social sciences and humanities, are increasingly confronted with the use and observation of means of communication deployed in various digital spaces. Social networks, forums, instant messaging groups, mailing lists and collaborative sites have become places of communication and of life for many human communities. In response, science, including anthropology, has turned them into new fields of research such as digital anthropology or virtual ethnography[4].

This research therefore belongs to the field of anthropology as a science and of ethnography as a methodology. We will not discuss linguistic anthropology here, nor the analysis of anthropological literature[5], nor online ethnography as a complement to the linguistic analysis of log data[6], but rather corpus analysis software as a tool in the anthropological ethnographic field. Corpora are numerous within digital communication spaces, as is corpus analysis software, but this study is limited to the archives of one mailing list and to the TXM software.

Thus, the initial question of this research could be summarized as follows:

" Can the corpus analysis software TXM help a digital anthropologist researcher in his ethnographic field work? "

The Wikimedia mailing list as a corpus

Why the Wikimedia mailing list?

One reason for choosing the Wikimedia mailing list as a corpus is that the Wikimedia movement is the focus of my doctoral thesis. Another reason was the ease of creating the corpus by copying and pasting the content of the mailing list archive, month by month, into separate .txt files directly usable by the software.

The content of this mailing list was of interest to me, but difficult to exploit by simple reading. The idea of using TXM as a sophisticated search engine to explore these thousands of messages came to mind and is at the origin of the present research.

As a last argument, the archives of the mailing list are published on a web page displaying the CC-BY 3.0 license[7], which makes their reuse very easy.

Mailing list description

The Wikimedia community mailing list, named "Wikimedia-l"[8], is a discussion list for the Wikimedia community and the larger network of organizations (Wikimedia Foundation, chapter organizations, affiliates, partners) supporting its work. This mailing list can, for example, be used for:

  • The initial planning phase of potential new Wikimedia projects and initiatives
  • Organizational issues of the Wikimedia Foundation, chapter organizations, others
  • Discussing the setup of local Wikimedia chapters
  • Developing and evaluating grant-making programs
  • Planning elections, polls and votes
  • Discussion of projects that don't already have a mailing list
  • Finding ways to raise funds
  • Other Wikimedia-related issues

Corpus description

The corpus initially consisted of a folder containing 175 files (one file per month from April 2004 to October 2018) for a total of 385.2 MB. The text as a whole totalled 83,854,542 (presumably word occurrences).
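The file count given above can be checked with a short sketch. It simply enumerates one label per month between the two dates; the `YYYY-MM` label format and the function name `monthly_labels` are illustrative assumptions, not the names actually used for the corpus files.

```python
from datetime import date

def monthly_labels(start, end):
    """Return one 'YYYY-MM' label per month from start to end, inclusive."""
    labels = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        labels.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:  # roll over to January of the next year
            year, month = year + 1, 1
    return labels

# April 2004 to October 2018, the period covered by the corpus
labels = monthly_labels(date(2004, 4, 1), date(2018, 10, 1))
print(len(labels), labels[0], labels[-1])  # 175 2004-04 2018-10
```

One label per month over this period gives 175 entries, which matches the number of files in the corpus folder.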

The free software TXM as analysis tool

TXM User Manual Version 0.7 ALPHA

Why TXM

Some people claim to be vegetarians and do not eat meat. For my part, I claim to be a libriste (a free software advocate) and do not "eat" proprietary software as defined by Richard Stallman[9]. TXM 0.7.9[10] meets my expectations in this respect. In addition, TXM is developed by a team of French researchers, and good documentation in French was available from the project's website[11] in the form of a manual[12] and video tutorials[13]. Finally, the project has a mailing list[14] and a wiki[15] that give me the opportunity to receive support from community members in French.

TXM description[16]

TXM is a free, open-source, Unicode, XML & TEI compatible text/corpus analysis environment and graphical client based on CQP and R. It is available for Microsoft Windows, Linux, Mac OS X and as a J2EE web portal. It provides:

Qualitative analysis

  • Concordances of lexical patterns based on the efficient CQP full text search engine and its CQL query language
  • CQL pattern frequency lists for any word property (type, lemma, pos...) thanks to the integration of TreeTagger for lemmatization and pos tagging
  • CQL pattern occurrence graphics
  • lexical patterns are expressed in the CQL query language, based on word & structure level properties: (for example)
    • "aiming" to simply search for the word ’aiming’
    • ".*ing" to search for words ending in "ing" (including mainly verb forms)
    • [pos="VERB" & word=".*ing"] to search for verb forms ending in "ing" (where Part of Speech annotation is present)
    • [lemma="group"] []{0,3} [pos="VERB" & word=".*ing"] to search for the lemma "group" followed by a verb form ending in "ing" with at most 3 words in between
  • rich HTML-based text edition navigation with links from all other tools
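To make the logic of the last lexical pattern above more concrete, here is a minimal Python sketch that mimics it on a handful of tokens. It is not how CQP works internally, and the annotated tokens are invented by hand for the illustration (in TXM, lemma and pos would come from TreeTagger). The function names are hypothetical. Unlike CQP, which applies its own matching strategy, this sketch returns every match it finds.

```python
import re

# Hypothetical, hand-annotated tokens (invented for this illustration)
tokens = [
    {"word": "The", "lemma": "the", "pos": "DET"},
    {"word": "groups", "lemma": "group", "pos": "NOUN"},
    {"word": "are", "lemma": "be", "pos": "VERB"},
    {"word": "now", "lemma": "now", "pos": "ADV"},
    {"word": "aiming", "lemma": "aim", "pos": "VERB"},
    {"word": "higher", "lemma": "high", "pos": "ADJ"},
]

def is_ing_verb(tok):
    # Counterpart of [pos="VERB" & word=".*ing"]; a CQL regular expression
    # must match the whole value, hence fullmatch rather than search.
    return tok["pos"] == "VERB" and re.fullmatch(r".*ing", tok["word"]) is not None

def find_group_ing(tokens, max_gap=3):
    """Return (start, end) index pairs: lemma 'group', then at most
    max_gap arbitrary tokens, then a verb form ending in 'ing'."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok["lemma"] != "group":
            continue
        for gap in range(max_gap + 1):  # counterpart of []{0,3}
            j = i + 1 + gap
            if j < len(tokens) and is_ing_verb(tokens[j]):
                hits.append((i, j))
    return hits

hits = find_group_ing(tokens)
for i, j in hits:
    print(" ".join(t["word"] for t in tokens[i:j + 1]))  # groups are now aiming
```

The point of the sketch is that a query constrains several properties of each token at once (lemma, part of speech, surface form) and not only the character string.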

Quantitative analysis

  • factorial correspondence analysis
  • contrastive word specificities
  • hierarchical classification
  • analysis of cooccurring words or lexical patterns

Corpus Data Model

  • Indexes words and their properties as well as hierarchical structure of texts
  • Indexes external or internal metadata of texts or speakers
  • Allows construction of various subcorpora and partitions (for contrastive analysis between text structures or groups of words)

Personal feedback about installation, import and use of features

Before TXM, I had used very little textometric software, and only occasionally. Getting to grips with this software did not seem excessively difficult to me, but it certainly would have been if I had not previously acquired some knowledge of corpus analysis in linguistics. Without this prior training, I would have had to assimilate, while discovering the software, a whole set of concepts such as occurrence, lemma, token, etc.

Honestly, it seems to me that someone with enough time can successfully install the software and use it with the manual alone. I used the French version myself, but it also exists in English in a beta version.

In the end, the only problems I encountered in this experiment concerned the installation and use of the TreeTagger tagging software, which, unlike the R statistical software, is not pre-installed within TXM. These problems were due to configuration errors on my part and, probably, to a corrupted downloaded file.

Finally, it should be noted that importing my corpus, which creates an XML file containing the part-of-speech and lemmatization information, took more than three hours on a desktop computer. At the end of the process, my 8 GB of RAM was saturated, forcing the computer to use swap space on the hard disk. The folder holding the corpus in binary format, produced after more than one hour of computation, was 6.5 GB in size and could not be loaded on my laptop for lack of disk space, even though more than 15 GB was available.

It therefore seems important to point out that, before embarking on the analysis of a corpus with TXM, one should make sure that the computer hardware is powerful enough for the size of the text. As another example, after I created two partitions (12 months and 14 years), the software's start-up time increased from a few seconds to nearly five minutes.

The software seemed relatively stable to me as long as a new calculation is not launched before the previous one has finished. Given the size of the corpus and the power of my desktop computer, some processes can reach long or even excessive execution times. When the software freezes and has to be shut down through the operating system, some of the work done before the shutdown may be lost. It is therefore advisable to restart the application after completing an important job.

Useful information for the ethnographer provided by TXM functionalities

One by one, we will discuss here the functionalities offered by the TXM software, and their ability to provide useful information for the ethnographer. For each useful feature, we will give an example applied to the analysis of the archives of the Wikimedia-l mailing list.

Edition

The Edition function lets you browse the entire corpus in an HTML display, with a tooltip on each word indicating its lexical category.

Navigation is done file by file, with the file name as the tab header, and a right-click contextual menu lets you send a word to the concordancer.

Application

Without leaving TXM, this function let me briefly browse the corpus in full text to identify its structure and launch concordance searches from the name or pseudonym of any known person. It is an easy way to browse all the contributions of an actor you would like to follow within the mailing list. We will come back to the concordancer later.
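For readers unfamiliar with concordances, the following sketch shows the principle of a keyword-in-context (KWIC) display on a toy sentence. It is only an illustration of the idea, not TXM's concordancer: the function `kwic`, the naive whitespace tokenization and the name "Alice" are all assumptions made up for the example.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) triples
    for each occurrence of keyword, ignoring case."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample text; "Alice" stands in for any name or pseudonym
text = "Alice wrote that the chapter agreed . Later Alice replied to the list ."
lines = kwic(text.split(), "alice", width=2)
for left, kw, right in lines:
    print(f"{left:>20} | {kw} | {right}")
```

Each line aligns one occurrence of the searched name with a few words on either side, which is what makes it quick to follow the contributions of one person across thousands of messages.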

Lexicon

A lexicon analysis (a list of words ordered by frequency) already gives the ethnographer good information. By seeing which words are most often used by the community on the mailing list, a researcher can obtain:

  • information on the main topics of discussion in the community, useful for guiding individual semi-structured interviews;
  • information on the most active members of the mailing list, useful for selecting interviewees;
  • information on the most used email providers, useful for knowing which communication channel will reach the largest number of contacts.

Example from the corpus:

  • In this corpus, made up of a mailing list archive, the lexicon shows 863,676 occurrences of "@", which suggests that about the same number of messages were posted on the mailing list.
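The principle of a lexicon, including the count of "@" used in the example above, can be sketched in a few lines of Python. The tokenization rule and the sample messages below are assumptions invented for the illustration; TXM's own tokenizer is more elaborate and will not give identical counts.

```python
import re
from collections import Counter

def lexicon(text):
    """Count each token: a run of word characters,
    or a single non-space symbol such as '@'."""
    return Counter(re.findall(r"\w+|[^\w\s]", text.lower()))

# Invented sample, loosely shaped like mailing list headers
sample = (
    "From: alice@example.org\n"
    "Subject: chapter funding\n"
    "From: bob@example.org\n"
    "Subject: Re: chapter funding\n"
)
lex = lexicon(sample)
print(lex.most_common(5))  # the most frequent tokens first
print(lex["@"])            # 2
```

Since a single message may contain several addresses (in a signature or in quoted headers, for instance), a count of "@" is best read as an order of magnitude for the number of messages, not as an exact figure.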


Conclusion

Other types of corpus are possible.

Theoretical resources

  • Strategic Interaction and Knowledge Sharing in the KDE Developer Mailing List[17].
  • What Can OSS Mailing Lists Tell Us? A Preliminary Psychometric Text Analysis of the Apache Developer Mailing List[18].
  • Analyse de complexité des textes coutumostratiques[19]
  • Outline of natural language processing[20]
  • TXM French user manual[21]

Papers to explore

  • Explore, play, analyse your corpus with TXM[22]

External resources

Notes and sources

  1. "L'équipe TXM - Projet Textométrie". textometrie.ens-lyon.fr. Retrieved 2018-12-16.
  2. UCL/SGSI (2018). "Méthodologie de l'analyse de corpus en linguistique". UCL Catalogue des formations 2018-2019 (in French). Retrieved 2018-12-16.
  3. "DeepL Translator". www.deepl.com. Retrieved 2018-12-16.
  4. As an indication, in the English-speaking sphere, the article cyber-anthropology, renamed digital anthropology, appeared on 11 September 2005 (https://en.wikipedia.org/w/index.php?title=Digital_anthropology&dir=prev&action=history), and the article on virtual ethnography, renamed cyber-ethnography, on 1 February 2006 (https://en.wikipedia.org/w/index.php?title=Cyber-ethnography&dir=prev&action=history).
  5. This kind of research could be very interesting, for example as an extension of the digital archiving work carried out by the ODSAS platform (https://www.odsas.net).
  6. Androutsopoulos, Jannis (2008-09-04). "Potentials and Limitations of Discourse-Centred Online Ethnography". Language@Internet 5 (8). ISSN 1860-2029. http://www.languageatinternet.org/articles/2008/1610. 
  7. Source: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l
  8. Source: https://lists.wikimedia.org/pipermail/wikimedia-l/
  9. Williams, Sam (2011-11-30). Free as in Freedom [Paperback]: Richard Stallman's Crusade for Free Software. "O'Reilly Media, Inc.". ISBN 9781449324643.
  10. Heiden, S. (2010). The TXM Platform : Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. In K. I. Ryo Otoguro (Ed.), 24th Pacific Asia Conference on Language, Information and Computation (p. 389-398). Institute for Digital Enhancement of Cognitive Development, Waseda University.
  11. "Projet Textométrie". textometrie.ens-lyon.fr. Retrieved 2018-12-13.
  12. Heiden, Serge (2018-02-26). "Manuel de TXM 0.7 FR". txm.sourceforge.net (in French). Retrieved 2018-12-13.
  13. "Atelier d'initiation à TXM de Bénédicte Pincemin du 27 septembre 2012". txm.sourceforge.net. Retrieved 2018-12-13.
  14. "txm-users - TXM users mailing list - subrequest". groupes.renater.fr. Retrieved 2018-12-13.
  15. "index [Le wiki de la liste txm-users]". groupes.renater.fr. Retrieved 2018-12-13.
  16. "Présentation - Projet Textométrie". textometrie.ens-lyon.fr. Retrieved 2018-12-11.
  17. Kuk, George (2006-07). "Strategic Interaction and Knowledge Sharing in the KDE Developer Mailing List". Management Science 52 (7): 1031–1042. doi:10.1287/mnsc.1060.0551. ISSN 0025-1909. https://pubsonline.informs.org/doi/pdf/10.1287/mnsc.1060.0551. 
  18. "What Can OSS Mailing Lists Tell Us? A Preliminary Psychometric Text Analysis of the Apache Developer Mailing List". dl.acm.org. Retrieved 2018-11-20.
  19. "Recherche:Analyse de complexité des textes coutumostratiques — Wikiversité". fr.wikiversity.org (in French). Retrieved 2018-11-20.
  20. "Outline of natural language processing" (in en). Wikipedia. 2018-10-08. https://en.wikipedia.org/w/index.php?title=Outline_of_natural_language_processing&oldid=863062167. 
  21. Heiden, Serge (2018-02-26). "Manuel de TXM 0.7 FR". txm.sourceforge.net (in French). Retrieved 2018-11-20.
  22. "Explore, play, analyse your corpus with TXM | DHd-Blog". dhd-blog.org. Retrieved 2018-12-13.