# WikiJournal Preprints/Algorithms for Categorical-Generative Analysis: Implementing an Inductive, Comparative Method for Social Processes based on Formal Language Theory

Open access • Publication charge free • Public peer review


## Article information

Author: Bruno da Rocha Braga[a][i]

Bruno da Rocha Braga, "Algorithms for Categorical-Generative Analysis: Implementing an Inductive, Comparative Method for Social Processes based on Formal Language Theory", WikiJournal Preprints, Wikidata Q107656570

## Abstract

This paper proposes a methodology for the analysis of a category of social process based on a representative set of sequences of outcomes of decision-making events. The goal is to describe the evolutionary path of a category in more than one empirical setting so that structural comparisons become viable. First, the present work argues for performing the classification task on types of events using the Quine-McCluskey algorithm, which is grounded in Set Theory. Next, it endorses performing the classification task on sequences of event outcomes using the Nevill-Manning algorithm, which relies on Formal Language Theory. In research practice, both differ from the classification algorithms based upon Information Theory and Probability Theory because of their assumptions about the nature of the phenomenon under study, in addition to the underlying mathematics used.

## Introduction

The concept of structure is at the foundations of science, mainly of the social sciences. Because of the difficulty of conducting controlled experiments to generate empirical data, social research design usually counts upon theoretically guided models and deductive methods of analysis, based on the nature of the data being quantitative or qualitative. On the one hand, quantitative researchers rely upon the objectivist assumption of directly observable phenomena, but they still need to consider a formal model compatible with the distributional structure underlying the data under analysis. On the other hand, qualitative researchers reject objectivism because of the difficulty of measurement and the dynamics of change in the objects of study, even though they often need to assume an unobservable structure underlying the phenomenon so that plausible conclusions become possible. In addition, there is even more diversity among the methods of social analysis than this dichotomy suggests.

In the social sciences, researchers rely upon four major scientific methods[1]: the statistical, the experimental, the case study, and the comparative. They make structural assumptions to test theoretical relations between either concepts or variables regarded as important to explain the phenomenon. The examination of the assumptions about structure is often absent from the research protocols of empiricist studies. Nonetheless, this question becomes fundamental to reliability when studying complex, dynamic and contingent phenomena: what is under inquiry is not the empirical data but the structure that generated it. Furthermore, among these four research approaches, the comparative method is the one in which this issue is most critical to reliability because there is one social structure for each empirical setting.

The comparative method was originally applied to the study of the historical evolution of languages. It consists of the analytical comparison of both common and discordant features between distinct languages from the same linguistic group. In the other social sciences, the comparative method is a historical and empirical approach to the comparison of distinct instances of the same category of social phenomenon: they share the same general structure but diverge in their evolutionary paths because of the specificities (i.e., the context) found in each empirical setting. For instance, organizations, institutions, political regimes and countries are objects of inquiry commonly tackled using the comparative approach.

The comparative method shares some features with the case study method when both focus on stressing the context of an instance of the phenomenon. Its implementation often resembles a multiple case study due to the systematic selection of pairs of cases. The comparative method can still rely on analytical techniques based on formal models, like the statistical and experimental methods, even though the choice of the comparative method often takes place when neither statistical nor experimental methods are applicable due to spatial and/or temporal heterogeneity in the data.

Qualitative Comparative Analysis (QCA) is a technique to recognize configurations of values for a set of categorical variables using empirical data[2]. In the social sciences, particularly in Sociology and Political Science, research based on QCA is also known as configurational analysis, in contrast with statistical analysis. The latter approach is still “comparative” in the way it contrasts many entities, but there are key differences. On the one hand, configurational analysis relies on Set Theory to assess the powers and liabilities of both common and discordant features of a context-specific phenomenon: it implements a membership-based comparative method over systematically selected representative cases. On the other hand, statistical analysis relies on Probability Theory to assess the net effects of common features of a population-wide phenomenon: it implements a variance-based comparative method over randomly sampled ordinary instances. Today, configurational analysis is nearly a synonym of the comparative method, and QCA is a tool implementing it, although the term “comparative method” originally refers to a historical approach that predates QCA.

The goal of QCA’s configurational analysis is to test deterministic hypotheses about the membership relation between categorical variables under the assumption of error-free measures of a set of attributes. Critics claim that the situations in which QCA is applicable are rare[3] because of a triad of assumptions that are naïve simplifications regarding the complexity of the objects of study. First, the absence of measurement errors, which is an implausible assumption in practice. Second, the inductive approach to theory formulation from raw data: except for experimental studies, theory is indispensable to build up a model guiding data collection and analysis. Third, the resulting deterministic membership relation between categorical variables (finite sets of discrete elements rather than dimensions of a numerical function) is applicable to the mechanisms of the same concrete system only. Indeed, the use of a configurational analysis technique in substitution for a statistical analysis technique is a serious mistake that many researchers using QCA have made under the “small-N sample” rhetoric: a minimum sample size is always necessary to provide valid generalizations to a population.

Fortunately, the situations in which a set-theoretic method is applicable are not rare in the social sciences: whenever an empirical data set results from the same specific social structure, which is similar to a time series generated from a parameterized function, QCA is quite applicable. For example, learning patterns of behavior that are contingent to a specific empirical setting is the common goal of research works using the comparative approach in many fields of the social sciences. At first, the worth of applying QCA is a matter of understanding both the inference algorithm and the nature of the structure underlying the phenomenon under inquiry. After all, the right tool is the one applied to the right problem whatever the type of the method in use.

This paper argues that the polemic around QCA is actually a matter of the epistemology under consideration rather than an ordinary mistake. It is necessary to replace the assumption of epistemic objectivism and the falsification logic of Social Positivism, which is often taken for granted by the critics, with the assumption of epistemic relativism and the retroduction logic of Critical Realism[4]. This post-positivist paradigm shift helps scholars to reconcile a widely acknowledged theoretical proposition about the phenomenon under inquiry with a surprising or anomalous fact instead of rejecting the former.

In addition, the present paper introduces a retroductive analytical procedure that is equivalent to a classification algorithm much like those in use in the practice of normal science, but relying on configurational rather than statistical reasoning. First, this work presents the classification problem through the two main approaches to solving it: one relying upon Information Theory, described in the form of the decision-tree learning model, and the other relying on Probability Theory, described in the form of the logistic regression model. It also suggests that configurational analysis is a kind of classification analysis based on algorithms working with combinational logic and/or sequential logic. The first algorithm considered is the one for the simplification of logic functions, which relies upon Set Theory. The other algorithm may be regarded as an extension of the first, but relying upon Formal Language Theory instead, in its turn an extension of Set Theory. The use of this second algorithm to learn a path-dependent, sequential logic from empirical data, and its definition as a distinct approach to the classification problem, are the two contributions of this paper.

## The Classification Problem

Consider a collection of instances of a category described by a fixed set of attributes, in which the values of one of these attributes, named the class, depend on the values of all the other attributes together. The class attribute delimits mutually exclusive partitions of the instance space. Consequently, different configurations of values of the other attributes can determine the same class; that is, each value of the class is associated with one or more such configurations. The procedure for recognizing the pattern of values in the attributes discloses the partitioning of the instance space into the given set of classes: it solves a supervised learning problem in data science known as the classification problem. Any algorithm solving this problem, known as a classifier, may employ different data analysis techniques to produce a classification model, guessing a type of concrete structure that generated the data under analysis. The resulting classification model can predict the class of an instance relying only on the values of its descriptive attributes.

This work discusses the two main groups of classifiers in order to introduce a new group of classifiers that relies on a completely different philosophy of how to produce scientific knowledge - under the assumptions of the epistemological paradigm known as Critical Realism in substitution for those of the mainstream epistemological paradigm, which is Social Positivism.

The next sections present the two main classification algorithms to enable readers to compare them with the two algorithms suggested to implement the proposed third approach to the classification problem. In the first section, classifiers based on Information Theory implement the deterministic partitioning of the instance space. Next, classifiers based on Probability Theory calculate the chance of a given instance fitting each class instead of predicting a single one. Finally, two classifiers based on Set Theory and Formal Language Theory make the deterministic partitioning of an instance space relying upon the relations of necessity and sufficiency[5] between the attributes of representative instances, rather than relying on the relation of covariance between the attributes of sampled instances like the other classifiers do.

### Classification based on Information Theory

The simplest classifier of this group, the ID3 algorithm[6], can build up a classification structure such as a decision tree from the attribute data using a recursive, “greedy” approach. After selecting the most important descriptive attribute, ID3 partitions the instance space for each possible outcome. Then, it recursively repeats this operation for each partition until no attributes are left unselected. The criterion for attribute selection is based upon the concept of entropy from Claude Shannon's Information Theory[7]: a measure of the uncertainty or disorder involved in the value of a random variable or the outcome of a stochastic process. Consider the minimization of the entropy H(S) in each partition S, given by H(S) = ${\displaystyle \sum _{c\in C}-p\left(c\right)\cdot {\log }_{2}p\left(c\right)}$: it gradually reduces the amount of heterogeneity until all instances in a partition belong to the same class (c ${\displaystyle \in }$ C). Alternatively, consider the maximization of the information gain IG(S, A), a measure of the difference in entropy from before to after the partition S is split on an attribute A, given by IG(S, A) = H(S) – H(S|A), where the conditional entropy is given by H(S|A) = ${\displaystyle \sum _{c\in C,\,a\in A}-p\left(c,a\right)\cdot {\log }_{2}{\frac {p\left(c,a\right)}{p\left(a\right)}}}$.
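As a concrete illustration of these two selection criteria, the sketch below computes H(S) and IG(S, A) for a toy data set; the attribute and class values are invented for illustration only:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(S) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """IG(S, A) = H(S) - H(S|A) for the attribute at attr_index."""
    n = len(labels)
    # Group the class labels by the value of the chosen attribute
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    h_cond = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - h_cond

# Toy data: one binary attribute that perfectly predicts the class
rows = [("sunny",), ("sunny",), ("rainy",), ("rainy",)]
labels = ["play", "play", "stay", "stay"]
print(entropy(labels))                    # 1.0 (maximum disorder for two classes)
print(information_gain(rows, labels, 0))  # 1.0 (the attribute removes all uncertainty)
```

Splitting on the attribute that maximizes this gain, and recursing into each partition, is exactly the greedy step ID3 repeats.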

An extension of the ID3 algorithm, the C4.5 algorithm[8] can build up a classification structure, but in the form of a ruleset; each path from the root node to a leaf is turned into a rule whose conditions are the possible values for each attribute along the nodes in the path. Although there are other technical improvements in relation to ID3, this is the main difference between these algorithms.

All classification algorithms based upon the concept of information entropy produce a deterministic partitioning of the instance space into n-dimensional hyper-rectangles, where n is the number of attributes in addition to the class attribute. There can be as many partitions as the level of disorder of the instance space imposes, which can turn into a problem of over-fitting. Some pruning techniques are applicable to avoid this problem, but it may also be a case for using another type of classification algorithm.

### Classification based on Probability Theory

Another type of classification algorithm relies on binary logistic regression[9]. Consider the logistic function f(x) = ${\displaystyle {\frac {L}{1+{e}^{-k\left(x-{x}_{0}\right)}}}}$, a sigmoid function, where L is the maximum value of the curve, k is the logistic growth rate, and x0 is the midpoint, such that the curve has an “S” shape around the midpoint. The standard logistic function is f(x) = ${\displaystyle {\frac {1}{1+{e}^{-x}}}={\frac {{e}^{x}}{{e}^{x}+1}}}$, that is, a logistic function with L=1, k=1, and x0=0.

In statistics, the logistic or logit model is the regression model with a logistic function (or any other sigmoid function) modeling a dichotomous or binary dependent variable, even though the independent variables can assume any real value. In this sense, logistic regression analysis is the procedure to estimate the β-coefficients of a logistic model given by the equation P(Y=1 | X) = ${\displaystyle {\frac {{e}^{{\beta }_{0}+{\beta }_{1}X}}{{e}^{{\beta }_{0}+{\beta }_{1}X}+1}}}$ ${\displaystyle \therefore }$ ${\displaystyle {\frac {P\left(Y=1\mid X\right)}{1-P\left(Y=1\mid X\right)}}={e}^{{\beta }_{0}+{\beta }_{1}X}}$ ${\displaystyle \therefore }$ ${\displaystyle \ln \left({\frac {P\left(Y=1\mid X\right)}{1-P\left(Y=1\mid X\right)}}\right)={\beta }_{0}+{\beta }_{1}X}$ using the maximum likelihood estimator. It applies the likelihood-ratio test to assess the statistical significance of the variables that take part in the logistic regression equation. An extension of this model with a categorical dependent variable is multinomial logistic regression.
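The logit link above can be illustrated numerically. The coefficient values below are hypothetical (not estimated from any data); the point is only that the logistic function and the log-odds transformation are exact inverses:

```python
from math import exp, log

def logistic(z):
    """Standard logistic (sigmoid) function: L=1, k=1, x0=0."""
    return 1.0 / (1.0 + exp(-z))

def log_odds(p):
    """Inverse of the logistic function: ln(p / (1 - p))."""
    return log(p / (1.0 - p))

# Hypothetical coefficients, for illustration only
b0, b1 = -1.5, 0.8
x = 2.0
p = logistic(b0 + b1 * x)           # P(Y=1 | X=x)
print(round(p, 4))                  # 0.525
# The logit link recovers the linear predictor exactly:
print(round(log_odds(p), 4))        # 0.1  (= -1.5 + 0.8*2.0)
```

In an actual analysis the β-coefficients would be found by maximizing the likelihood of the observed 0/1 outcomes, not fixed by hand as here.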

Why use logistic regression instead of linear regression to perform classification analysis? First, linear regression relies on a numerical dependent variable while logistic regression uses a categorical dependent variable. There exists no ordering relation between the values of the mutually exclusive classes, but there is order among the elements of a numerical codomain. Reordering the list of classes under a linear regression model would result in a different relation each time. As classification analysis needs a categorical dependent variable, it would be inconceivable to use the linear regression model, except in the dichotomous case, in which the order no longer matters because there are only two possible values in the codomain.

There is an important consequence of adopting each of the classification approaches. On the one hand, using an entropy-based classification method, each partition of the instance space is a hyper-rectangle, implying one linear boundary per attribute, perpendicular to its axis. On the other hand, a logistic classification makes one smooth linear boundary between the instances of distinct classes, which can have any direction. Thus, the choice of which classification model to use relies on the disposition of the instances of the classes in the space under analysis. Logistic classification works better if a single functional decision boundary can separate the groups of instances of distinct classes in a high-dimensional space. Classifiers using entropy work better if there is a dispositional pattern between the groups of instances necessitating multiple decision rules with a small number of attributes (less than fifteen); nevertheless, having far too many degrees of freedom also makes them more prone to over-fitting.

The logistic regression model outputs deterministic relations too. If the estimation of the coefficients of a regression equation reaches high confidence, then prediction is highly accurate, which is only possible in truly deterministic systems. Nevertheless, regression analysis can assess unsystematic errors of measurement, which improves the confidence of the estimation. This is a critical flaw of the entropic algorithms, which can only assess errors of misclassification.

## Classification based on Set Theory

A configuration is any statement whose attributes depend on the relative disposition of every constituent part; that is, the arrangement of things put together that explains the observed result as a whole. The relationship between the whole and the structure of its component parts is comprehensible from a systemic perspective: the configuration of interrelationships between constituent entities yields an emergent form such that the properties of the whole are not a direct result of the properties of the parts. This is a key idea not only in fields of hard science such as algebra and chemistry, but in fields of the social sciences too. In addition, scientists are also concerned with configurations evolving over time: patterns of individual or collective behavior observed when studying a single social system or multiple social systems with the same basic structure. These patterns emerge over a historical period of observation, and even across distinct empirical settings in the case of multiple systems.

QCA proponents claim this technique to be a configurational analysis tool, which unveils functionally equivalent causal paths to the outcome of interest in a supposedly complex phenomenon[10]. However, QCA lacks the features to acknowledge complex, dynamic and contingent patterns of sequences of events that take place over space and time. The rationale behind QCA depends on combinational logic only. Acknowledging patterns in chains of events requires a sequential logic: the one acknowledged in regular patterns using a finite automaton, which is the first level of algorithmic complexity just above combinational logic. Even though both logics are compatible (sequential logic is a kind of extension of combinational logic), there are still higher levels of pattern complexity above the regular one that are associated with complex, dynamic and contingent social phenomena. In addition, there is also the need to adopt an analytical procedure beyond the a-historical configurational analysis of QCA without leaving behind the accumulated expertise of QCA researchers.

This paper recommends a kind of empiricist methodology of classification analysis to solve this challenge. The simplest inductive approach to the classification problem is to use csQCA, which implements a combinational logic learning algorithm for a set of dichotomous variables. Contrary to the hypothetical-deductive approach, csQCA manipulates a set of representative instances of a category of social form rather than ordinary samples from a population of social forms. In the next section, the algorithm used in csQCA is presented as a classifier based on Set Theory. In addition, there are also sequential pattern recognition algorithms of higher levels of complexity that use a rule-based model together with a procedure for ruleset induction. In this sense, a second algorithm is presented as a classifier based on a specific computational complexity level from Formal Language Theory (an extension of Set Theory) that acknowledges complex and dynamic patterns of event outcomes relying upon the feature of recursion. Finally, the present paper suggests an extension to this method to model any contingent phenomenon using these two algorithms together, which raises the classification problem once more to an even higher level of complexity.

### Rule Induction using the Quine-McCluskey Algorithm

Consider the logic function f(t1, …, tn), also called a binary or Boolean function, as any function of the form ${\displaystyle f:B^{n}\to B}$ defined on the n-dimensional domain ${\displaystyle B^{n}}$ of n logic terms, each representable as either {0, 1} or {true, false}. Consider also the problem of finding its corresponding formula as a sum-of-products of terms in the Canonical Disjunctive Normal Form (DNF), known as the Minterm Canonical Form[11]. Given this logic function f as a truth table, there are ${\displaystyle 2^{n}}$ possible combinations of each term or its negation, known as minterms. An implicant of this function f is any minterm mi such that if its value is “true,” then the value of the function f is also “true;” in a way, mi logically implies f. Any representation of the logic function f by a formula uses a sum of minterms that are implicants. If at least one of the implicants is logically equivalent to a combination of other implicants, then it is not the representation with the minimum number of minterms. In this circumstance, there is a need to minimize the number of input terms in the sum of minterms.

For the simplification of the formula of a logic function, it is possible to use algebraic manipulation based on the reduction principle, given by A.B + ~A.B = (A + ~A).B = B, combining pairs of implicants into a new implicant, and then reducing the number of minterms too. Whenever no other combination of implicants is possible, the resulting formula consists of the sum of all prime implicants of the logic function. If there is at least one prime implicant resulting from a combination of minterms that also occur in another prime implicant, then it is not an essential prime implicant; eliminating it from the formula of the function is necessary. The simplest formula must be a combination of essential prime implicants only.

The Quine-McCluskey (QMC) algorithm[12][13] was the first systematic procedure for the application of the Method of Prime Implicants to the minimization of logic functions. Given a logic function f in the form of a truth table, it finds all implicants of f. Next, it recursively uses the reduction principle to transform pairs of implicants with only one divergent term (i.e., B versus ~B) between them into a single one, until only prime implicants are left. Finally, the additional elimination of redundant prime implicants reduces their number to the essential prime implicants, resulting in the simplest formula for the logic function f.
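A minimal sketch of this procedure may help. The implementation below covers the implicant-reduction loop and the detection of essential prime implicants by unique coverage, but omits the final cover-selection step (e.g., Petrick's method) that a complete QMC implementation needs when non-essential primes remain:

```python
from itertools import combinations

def combine(a, b):
    """Apply the reduction principle A.B + ~A.B = B to two implicants,
    written as bit strings with '-' marking an eliminated term."""
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    if len(diff) != 1:
        return None
    i = diff[0]
    return a[:i] + "-" + a[i + 1:]

def prime_implicants(minterms, n_vars):
    """Recursively merge implicant pairs until only prime implicants remain."""
    current = {format(m, f"0{n_vars}b") for m in minterms}
    primes = set()
    while current:
        merged, used = set(), set()
        for a, b in combinations(sorted(current), 2):
            c = combine(a, b)
            if c is not None:
                merged.add(c)
                used.update((a, b))
        primes |= current - used   # whatever could not merge is prime
        current = merged
    return primes

def covers(implicant, minterm, n_vars):
    """True if the implicant covers the given minterm."""
    bits = format(minterm, f"0{n_vars}b")
    return all(p == "-" or p == q for p, q in zip(implicant, bits))

# f(A, B, C, D) with minterms 4, 8, 10, 11, 12, 15 (a textbook example)
minterms = [4, 8, 10, 11, 12, 15]
primes = prime_implicants(minterms, 4)
print(sorted(primes))      # ['-100', '1-00', '1-11', '10-0', '101-']

# Essential prime implicants: each covers a minterm no other prime covers
essential = set()
for m in minterms:
    cov = [p for p in primes if covers(p, m, 4)]
    if len(cov) == 1:
        essential.add(cov[0])
print(sorted(essential))   # ['-100', '1-11']
```

Here '-100' reads as ~A.B.~C.~D with A eliminated, i.e. B.~C.~D; the two essential primes must appear in any minimal formula, and the remaining minterms are then covered by choosing among the other primes.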

Usually, engineers apply QMC to the design of digital circuits, but the proponents of configurational analysis apply it to empirical social research. The csQCA calculates all combinations of attributes leading to each outcome of a type of event as essential prime implicants of the logic function underlying the empirical data. There are versions for nominal and ordinal variables, which constitute multi-value QCA (mvQCA). Tosmana[14] is a tool that implements this categorical variable-based approach, generalizing the dichotomous variable-based one.

The QMC algorithm builds a logic function model on a domain set of dichotomous variables. In csQCA, the assessment of the resulting model relies on two fitting criteria: the measure of coverage provides the percentage of all observations having a specific configuration of conditions; and the measure of consistency provides the percentage of observations in which a specific configuration of conditions has the same event outcome.
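Following the two fitting criteria as defined above (QCA software also reports further coverage variants), a minimal sketch over a hypothetical truth table could look like this:

```python
def coverage(cases, config):
    """Share of all observed cases exhibiting a given configuration of conditions."""
    return sum(1 for conds, _ in cases if conds == config) / len(cases)

def consistency(cases, config, outcome):
    """Share of the cases with this configuration that show the given outcome."""
    matching = [out for conds, out in cases if conds == config]
    return matching.count(outcome) / len(matching) if matching else 0.0

# Hypothetical truth-table rows: (configuration of dichotomous conditions, outcome)
cases = [
    ((1, 0, 1), 1), ((1, 0, 1), 1), ((1, 0, 1), 0),   # a contradictory line
    ((0, 1, 1), 1),
    ((0, 0, 0), 0),
]
print(coverage(cases, (1, 0, 1)))        # 0.6
print(consistency(cases, (1, 0, 1), 1))  # 0.666..., a contradictory configuration
print(consistency(cases, (0, 1, 1), 1))  # 1.0, a perfectly consistent configuration
```

A consistency strictly between zero and one flags exactly the contradictory lines discussed below.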

Once a truth table is built with empirical data, crisp-set QCA calculates a measure of coverage for each configuration of conditions; it determines which of them is dominant (or exceptional) in comparison with the other alternative implicants of the same event outcome. The csQCA tool calculates a measure of consistency for each empirically observed configuration of attribute conditions as well. Perfect consistency refers to the situation in which all instances of a given configuration of conditions result in the same event outcome. Otherwise, there are contradictory lines in the truth table; that is, the same configuration of conditions is associated with different event outcomes, which yields a measure of consistency between zero and one. This is evidence of the current inability of the model to explain the phenomenon perfectly.

QCA proponents handle contradictory lines by adding or removing one condition of a configuration, in a way that resembles an abductive inference. The goal is to discern the anomalous fact. A third solution is to accept the contradictory rules based on either the theoretical or the substantive knowledge of the scientist; however, this is no different from acknowledging a specific (but unknown) context. After eliminating or accepting all contradictions, the next step in this procedure of configurational analysis is to simplify the conjunctions of attribute conditions using the Method of Prime Implicants.

For this application, QMC can test the hypothesis of a deterministic set-theoretic relationship between the input terms and the output of interest. It is only valid for the set of instances under inquiry because there is no assessment of the appropriateness of the sample size for statistical inference. QMC builds up a classification model based on a specific pattern for the values of the attribute conditions from a set of representative instances. Selecting instances in a systematic way instead of randomly sampling them is a way to avoid bias, reflecting the variety of configurations of conditions leading to the same event outcome; however, it does not reflect the frequency of these configurations in the population. In other words, the number of times each configuration of conditions occurs in the selection of instances does not matter in QCA: statistical generalization of the results to the whole population is not possible; only analytical generalization to the theory is valid. In addition, the number of instances is irrelevant to the reliability of the conclusions: using the QMC algorithm like a set-theoretic classifier, it is possible to directly manipulate the relations between theoretical concepts rather than indirectly manipulating them by processing sample data.

## Classification based on Formal Language Theory

Historical analysis of configurations relies on the selection of instances of a category of behavior, that is, of the structural relations between types of events. First, all instances of the same type of event share a common definition: description is feasible by the observation of values for some fixed, finite set of attributes. After that, clustering the instance space produces groups of instances presenting distinct patterns of values for their attributes, related to mutually exclusive outcomes of this type of event. Finally, sequences of event outcomes alternate in a systematic way such that one instance relates to some of the others according to a complex but deterministic pattern of co-occurrence[15] of these event outcomes.

The computer-based implementation of this historical approach to configurational analysis is possible relying on Formal Language Theory, the study of sets of sentences from a category of language. Any sentence is a finite sequence of discrete symbols, also known as a string, resulting from the concatenation function (•) defined over some fixed, finite set of symbols, called the alphabet (Σ). A formal language L is a set of strings from the free monoid (Σ*, •) over the alphabet Σ, whose concatenation function • : Σ* × Σ* → Σ* generates any string of L from a pair of substrings, including the empty string. In this sense, the grammar G = (N, Σ, P, S) is a formal model where N is the finite set of system states through which the abstract machine can transit during the derivation of a string of the language L(G), Σ is the alphabet set, P is the ruleset, and S is the initial system state.
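As an illustration of the formalism G = (N, Σ, P, S), the sketch below encodes a toy context-free grammar for the language {aⁿbⁿ : n ≥ 1}, whose generation requires exactly the recursion discussed later; the alphabet {a, b} stands in for a set of event outcomes and is purely illustrative:

```python
import random

# Toy grammar G = (N, SIGMA, P, START) for the context-free language {a^n b^n : n >= 1}
N = {"S"}                                  # system states (non-terminals)
SIGMA = {"a", "b"}                         # alphabet of event outcomes
P = {"S": [["a", "S", "b"], ["a", "b"]]}   # ruleset: S -> aSb | ab
START = "S"

def derive(symbol=START, depth=3):
    """Left-most derivation of one string of L(G); depth bounds the recursion."""
    if symbol in SIGMA:
        return symbol
    options = P[symbol] if depth > 0 else P[symbol][-1:]   # force termination
    return "".join(derive(s, depth - 1) for s in random.choice(options))

def in_language(s):
    """Recognizer for a^n b^n, n >= 1 -- exactly the strings G generates."""
    n = len(s) // 2
    return len(s) >= 2 and len(s) % 2 == 0 and s == "a" * n + "b" * n

print(in_language(derive()))   # True: every derived string belongs to L(G)
print(in_language("abab"))     # False: not of the form a^n b^n
```

No finite automaton can recognize this language, since matching each "a" with a later "b" requires memory; that is the sense in which recursion pushes a pattern above the regular level.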

It is possible to understand Formal Language Theory as an extension of Set Theory too, using a hierarchy of levels of complexity for the concatenation function called the Chomsky Hierarchy. This hierarchy of languages is equivalent to both the hierarchy of abstract machines from Automata Theory[16][17] and the hierarchy of grammars from Generative Grammar Theory[18][19]: any sequence of symbols is the result of a rule-based computation using a processing system that relies upon one of the possible levels of algorithmic complexity. In this sense, there is the need to adopt another mathematical framework to support the analysis of the classes of models: Category Theory[20][21]. The conciliation of Category Theory and Generative Grammar Theory in the form of a formal model makes a category of social process equivalent to a category of formal language.

The instances of the same category of social process may differ among themselves in terms of a surprising or anomalous fact that becomes a new outcome for a type of event. This divergence introduces ambiguity into the abstract structure of this category of social process because there are two possible explanations for the same sequence of outcomes. Consider solving this empirical divergence in the grammar using the logic of retroduction rather than the logic of falsification. This procedure consists of modifying the ambiguous rules to become mildly context-sensitive rules; this modification extends the initial category, turning it into a new subcategory. The resulting grammar increases in complexity to the level of an indexed grammar[22] or another mildly context-sensitive grammar class; usually, this level of computational complexity is enough to describe configurations of ordered chains of past event outcomes.

Categorical-Generative Analysis[23][24] adopts an analytical method that searches for the grammar of the highest algorithmic complexity fitting the sequence of event outcomes under inquiry. If each event outcome relied on the prior event outcome only, then the solution to this classification problem could be a regular grammar, which is “memoryless”: it cannot plan its future occurrences ahead. However, this is a strong constraint on the nature of social processes. Categorical-Generative Analysis assumes a mildly context-sensitive grammar as the plausible algorithmic complexity level for social processes because recursion is adequate to grasp path dependence and an ordered chain of past event outcomes is adequate to grasp contingency. There is an ontological reason why context-sensitive languages are more complex than the processes in the social realm, which is explored further in this paper. Here, it is enough to understand the relationship between the context of past event outcomes and the so-called surprising or anomalous fact.

### Context-free Grammar Induction using the Nevill-Manning Algorithm

The language-learning problem is that of acquiring the syntax, or syntactic structure, through the set of grammar rules for generating and recognizing valid sentences of a language. The approach of generative linguistics to the study of similarities and differences between languages characterizes properties that are common to all of them. Here, there are some important questions: the possibility of learning a language from a set of valid sentences; the role of previous knowledge; the types of inputs available to the learner (either positive or negative); and the effect of semantic information on learning the syntax of a language. The key point here is the distinction between syntax and semantics in the study of language from the perspective of Generative Linguistics. In this sense, there are algorithms to learn a set of syntactic (or grammatical) rules for generating sequences of symbols belonging to the same class of language (or category of grammar), which creates a syntactic pattern classification system[25]. This kind of technology applies to speech recognition and the sequencing of chromosomes.

Grammar inference (or induction) is a kind of algorithmic learning procedure for an unknown, target grammar from a set of sentences labeled as belonging or not to a given language. This systematic approach to the problem of syntax acquisition requires the selection of the appropriate class of grammar (or of language). It is the basic model for selection-based classification, much as logistic regression (the logit model) and C4.5 (the decision-tree model) are basic models for sample-based classification – in fact, combinational logic is below sequential logic in the computational complexity hierarchy. A grammar induction algorithm yields an approximation of the unknown grammar from a set of candidate hypotheses defined over the same alphabet. It may rely upon different types of information, such as positive and/or negative examples and the availability of domain-specific previous knowledge. For instance, the impossibility of learning regular grammars from positive examples only has already been demonstrated[26]; therefore, this limitation must be considered in the research design.

There exist grammar induction algorithms to build up the ruleset P for a set of strings under the assumption of a specific computational complexity level. The comparison of the resulting rulesets for a pair of strings determines one of three possibilities: (1) they fit into the same category of language; (2) they fit into subcategories sharing a common generative structure; (3) they do not relate to each other. In this sense, the problem solved by Categorical-Generative Analysis resembles the so-called Configuration Analysis performed with Qualitative Comparative Analysis; nevertheless, its formal model relies on a level of computational complexity greater than that of combinational logic.

The Nevill-Manning algorithm[27][28], also known as Sequitur, builds up a hierarchical structure by recursively replacing frequent component substrings with a novel symbol that is added to the set of system states (N) and becomes the head of a rule producing that substring. Sequitur is a greedy algorithm that minimizes the objective function of the size of the representation of the string with no information loss. The resulting tree-like structure encompassing its optimal solution is also a rule set able to regenerate the initial form of the string as necessary. Sequitur is a kind of data compression algorithm because it reduces the entropy of the sentence under analysis by transforming the observed regularities into production rules; nonetheless, it can also serve as a classification algorithm for sequences of symbols belonging to the same category of language.

The algorithm reads the symbols of the sentence serially while storing them in the form of an initial grammar rule like S → γ, where S ${\displaystyle \in }$ N and γ ${\displaystyle \in }$ (N ${\displaystyle \cup }$ Σ)*. Every time a bigram (ai, aj) ${\displaystyle \in }$ (N ${\displaystyle \cup }$ Σ)2 occurs twice in the substring γ, a new production rule in the Chomsky Normal Form, like ak → ai aj, is added to the rule set. Both occurrences of this bigram are also replaced by the new symbol ak ${\displaystyle \in }$ N in the initial production rule, such that it becomes S → αakβakω. Finally, this optimization algorithm checks whether the rule set under construction satisfies a pair of constraints: (a) digram uniqueness, by which every conceivable bigram occurs only once in the modified S → γ; and (b) rule utility, by which every symbol a ${\displaystyle \in }$ N occurs at least twice in the body of other rules or of a single rule. The algorithm performs this routine until no symbol is left unread.
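The following Python sketch illustrates this idea in a simplified, offline form. It is not the incremental Sequitur, which enforces digram uniqueness and rule utility online as each symbol arrives; this variant (closer to byte-pair encoding) repeatedly extracts the most frequent repeated bigram into a new rule. The function names are illustrative.

```python
# Simplified, offline sketch of hierarchical grammar induction.
# Real Sequitur works online; this variant repeatedly extracts the most
# frequent repeated bigram until no bigram occurs twice (so the final
# start-rule body satisfies digram uniqueness).
from collections import Counter

def induce_grammar(sentence):
    """Build a ruleset {head: body} that regenerates `sentence` losslessly."""
    rules = {}
    s = list(sentence)          # body of the start rule S
    next_id = 0
    while True:
        counts = Counter(zip(s, s[1:]))
        if not counts:
            break
        bigram, freq = counts.most_common(1)[0]
        if freq < 2:
            break               # no bigram repeats any more
        head = f"R{next_id}"    # new nonterminal added to N
        next_id += 1
        rules[head] = bigram    # rule head -> a_i a_j (Chomsky Normal Form)
        out, i = [], 0          # replace non-overlapping occurrences
        while i < len(s):
            if i + 1 < len(s) and (s[i], s[i + 1]) == bigram:
                out.append(head)
                i += 2
            else:
                out.append(s[i])
                i += 1
        s = out
    rules["S"] = tuple(s)
    return rules

def expand(rules, symbol="S"):
    """Regenerate the original string from the ruleset (lossless)."""
    return "".join(expand(rules, x) if x in rules else x for x in rules[symbol])
```

For the sequence of outcomes "abcabcabc", the sketch yields nested rules (e.g. a rule for "ab"-then-"c", reused three times), and `expand` reconstructs the original string exactly, showing the lossless-compression character of the method.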

### Mildly Context-Sensitive Rule Induction using the Quine-McCluskey algorithm

In a decision-making event, choosing one of the possible alternative event outcomes is a logical function of a set of empirically verifiable contextual conditions. The analytical representation of a logical function is the disjunction of all conjunctions of conditions observed in the empirical evidence about past event outcomes resulting in the outcome of interest. In a within-case study, Configuration Analysis identifies these conjunctions of conditions with an iterative procedure relying on several rounds of minimization of terms. The Quine-McCluskey (QMC) algorithm must calculate all the essential prime implicants given a fixed, finite set of hypothesized contextual conditions. The measures of coverage and consistency make it possible to grasp whether the inferred configurations of conditions can explain all event outcomes; otherwise, there is a need to include or remove some condition and run a new round of QMC. The researchers need specialized knowledge to interpret the measures of the configurational patterns, explaining the observed heterogeneity using this inductive approach to infer the validity of the hypothesized set-theoretic relation between the event outcomes and the configurations of contextual conditions.
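The core minimization step of QMC can be sketched as follows: encode each observed configuration of conditions as a bit string (one bit per condition), then repeatedly merge pairs of terms that differ in exactly one condition, replacing that condition with a "don't care" dash, until only prime implicants remain. This is a sketch of the first phase only; selecting the essential prime implicants (the covering step QMC performs afterwards) is omitted.

```python
# Sketch of the Quine-McCluskey merging phase: find the prime implicants
# of a set of minterms given as bit strings over the hypothesized conditions.
from itertools import combinations

def merge(a, b):
    """Merge two terms differing in exactly one position, else return None."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) != 1:
        return None
    i = diff[0]
    return a[:i] + "-" + a[i + 1:]   # '-' marks an eliminated condition

def prime_implicants(minterms):
    """Return the prime implicants of a set of bit-string minterms."""
    terms = set(minterms)
    primes = set()
    while terms:
        merged_from = set()
        next_terms = set()
        for a, b in combinations(sorted(terms), 2):
            m = merge(a, b)
            if m is not None:
                next_terms.add(m)
                merged_from.update((a, b))
        primes |= terms - merged_from   # terms merging with nothing are prime
        terms = next_terms
    return primes
```

With two conditions A and B, the observed configurations {"00", "01", "10"} (i.e., ¬A¬B, ¬AB, A¬B) minimize to the prime implicants "0-" and "-0" (¬A or ¬B), the classic textbook reduction.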

Each observed conjunction of conditions is sufficient to imply an event outcome, but each condition alone is not necessary unless it occurs in all conjunctions. If a condition is necessary, then the set of instances with the outcome of interest is a subset of the set of instances with this condition. Conversely, if a condition is sufficient, then the instances with this condition consist of a subset of those with the outcome. If there are different configurations of conditions implying the same event outcome, then there are multiple causal "recipes," a property also named equifinality[29][30]. Nonetheless, these are not "causal paths" unless the order of occurrence of these conditions is under consideration. Precisely, Categorical-Generative Analysis addresses this weakness of the existing configurational analysis with the nested stack of an indexed grammar to learn the exact order in which the past event outcomes have occurred. Each configuration of the order of occurrence of the conditions leading to a specific event outcome goes through QCA to determine whether the order matters after all (or not). Given a positive result, the generative grammar of the process instance under analysis, once regarded as a non-deterministic context-free grammar, turns out to be mildly context-sensitive: this is accomplished by eliminating the contradictory lines evidencing grammar ambiguity through cs-QCA's configuration analysis of the contextual conditions in a within-case study. Nonetheless, the decision about the analytical generalization of the context-sensitivity of a subcategory of the social process to the theory still requires a cross-case study.
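The subset relations between conditions and outcomes can be stated directly in crisp-set terms. The sketch below assumes crisp (binary) membership; fuzzy-set QCA would replace the subset tests with consistency scores. The function names are illustrative.

```python
# Crisp-set reading of the sufficiency and necessity relations:
# each argument is the set of case identifiers exhibiting the condition
# or the outcome of interest.

def is_sufficient(cond_cases, outcome_cases):
    """A condition is sufficient if its cases are a subset of the outcome's."""
    return cond_cases <= outcome_cases

def is_necessary(cond_cases, outcome_cases):
    """A condition is necessary if the outcome's cases are a subset of its own."""
    return outcome_cases <= cond_cases
```

For example, if the outcome occurs in cases {1, 2, 3} and a condition holds in cases {1, 2}, the condition is sufficient but not necessary; a condition holding in {1, 2, 3, 4} is necessary but not sufficient.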

### Cross-case Comparison of Generative Grammars

The candidate grammar models to describe a social process may belong to one of the computational complexity classes. The candidate model best representing a category of phenomenon is the one explaining all the empirical patterns synthetically: it minimizes the redundancy[7] of the social process as if it were a specific formal language. At first, the methodology seeks to minimize the entropy within the empirical data collected from the structured text narrative (using a database of events).

In the within-case analysis of sequences, the first fitting criterion is the minimization of entropy: identifying the repetitive sequence patterns of event outcomes that turn out to be grammar rules. The instances of all types of events, sorted together in chronological order, become the input to the sequential analysis of event outcomes; this relies on the assumption that they result from the same structure underlying the social system under inquiry. A parser for the formal grammar with the greatest computational complexity leads to the model that minimizes the entropy in the data set; it is inferred from the bulk of empirical data in the search for sequence patterns. Thus, the sequence analysis of the structured text narrative (in a database of events) identifies the relations of recursion between the grammar rules describing the chain of event outcomes. The resulting grammar is in the context-free class of computational complexity.

The second fitting criterion in the within-case analysis of sequences of critical events is the minimization of stochasticity in the structure of the social process: substituting mildly context-sensitive rules for the alternative grammar rules departing from the same system state. The elimination of ambiguity due to an anomalous or surprising fact is always necessary, but the minimization of the uncertainty about the evolution of the social process due to uncertainty about the contextual conditions in a non-deterministic system state is also desirable, despite being non-mandatory – although at the cost of the predictability of the state transitions, of course. For each alternative state transition departing from the same system state, the procedure finds configurations of contextual conditions preceding it in the chain of event outcomes. Each configuration of contextual conditions implies an alternative context-sensitive rule for this system state; that is, the resulting formal grammar will be at least in the mildly context-sensitive class of complexity. It also highlights the depth of path dependence as well as the specific contingency in the behavior of this instance of the social process.
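A minimal sketch of this step, under the simplifying assumption that the context is the chain of the last few event outcomes: scan a sequence, find the states followed by more than one distinct outcome (the non-deterministic states), and tabulate the contexts preceding each alternative. The function name and the fixed context depth are illustrative.

```python
# For each ambiguous state (a symbol followed by more than one distinct
# successor), collect the chains of preceding outcomes as candidate
# contextual conditions for a mildly context-sensitive rule.
from collections import defaultdict

def ambiguous_contexts(sequence, depth=2):
    """Map each state with multiple successors to {successor: contexts seen}."""
    succ = defaultdict(lambda: defaultdict(set))
    for i in range(len(sequence) - 1):
        state, nxt = sequence[i], sequence[i + 1]
        context = tuple(sequence[max(0, i - depth):i])   # preceding outcomes
        succ[state][nxt].add(context)
    return {s: dict(t) for s, t in succ.items() if len(t) > 1}
```

In the sequence "abac", the state "a" is ambiguous (it is followed once by "b" and once by "c"), and the table records the distinct contexts under which each successor occurred; these context sets are the raw material for the configuration analysis described above.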

Finally, there is one fitting criterion to stop the cycles of cross-case analysis, which is the guarantee of representativeness: the number of case studies can be determined using the ordinary criteria of multiple-case study design[31]. Because of the specificity arising from the mildly context-sensitive rules, different case studies should not be compared with each other using sample-based, statistical methods. In fact, this research practice violates an epistemological assumption of Critical Realism about the contingent nature of causal explanations. However, the resulting grammars for each case study relying on the same theoretical structure consist of implementations of the same concrete category of process, such that they may be submitted to cross-case analysis (i.e., the comparison of their rulesets). Each pair of subsequent case studies must be assessed for natural equivalence after resolving the divergences between theory and empirical evidence in the within-case study of each of them. This provides the analytical generalization of the research conclusions to the theory itself.

Conversely, there is a cost-benefit relation in choosing the complexity level: the higher the computational complexity of the resulting formal grammar, the higher the risk of misclassifying a structured narrative or database of events as not belonging to the category of social process. Analogously to comparative studies, the greater the complexity, the greater the number of units of analysis needed to reach theoretical saturation.

## Conclusions

In order to adopt a grammar as a model for scientific inquiry, the properties of the chosen formalism must respect the ontological and epistemological assumptions of the adopted scientific paradigm: this work aligns the assumptions of Critical Realism with the mathematical properties of Formal Language Theory, but the researcher must still choose the grammar formalism adequate to the empirical setting under inquiry.

The procedure for the refinement of a grammar model grounded on empirical data replaces sets of alternative state transitions by their equivalent mildly context-sensitive rules. Alternative rules describe either equifinal paths or competing explanations for a category of process. On the one hand, equifinality results from the inherent complexity of emergent phenomena defined by incomplete theoretical statements, for example, due to cumulative effects; a set of equifinal rules generates distinct, mutually exclusive partitions of a domain of instances of a type of process. On the other hand, competing explanations result from the acknowledgement of contradictory facts; a set of ambiguous rules generates distinct, competing derivation paths for at least one instance of a type of process. In both cases, identifying configurations of contextual conditions associated with each alternative event outcome improves the predictability of the state transition, increasing the explanatory power of the model at the cost of increased computational complexity.

The epistemology to guide a methodology based on either Set Theory or Formal Language Theory is Critical Realism instead of Social Positivism. Consequently, there is no sense in talking about the rejection of a theoretical proposition, because what is in use is not a logic of refutation but a logic of retroduction. In addition, different learning algorithms make different assumptions about the abstract structure underlying the data set and have different rates of convergence. The one working best minimizes the cost function of interest (cross-validation error, for example); that is, it is the one making assumptions that are consistent with the data and that has converged sufficiently.


## References

1. Lijphart, Arend (1971-09). "Comparative Politics and the Comparative Method". American Political Science Review 65 (3): 682–693. doi:10.2307/1955513. ISSN 0003-0554.
2. Ragin, Charles C. (1987). The comparative method : moving beyond qualitative and quantitative strategies. Berkeley: University of California Press. ISBN 0-520-05834-8. OCLC 15017459.
3. Bhaskar, Roy (1975). A realist theory of science. London: Routledge. ISBN 978-0-415-45494-0. OCLC 166379172.
4. Mill, John Stuart (1868). A system of logic ratiocinative and inductive: being a connected view of the principles of evidence and the methods of scientific investigation. London: Longmans, Green, Reader, and Dyer.
5. Quinlan, J. R. (1986-03). "Induction of decision trees". Machine Learning 1 (1): 81–106. doi:10.1007/bf00116251. ISSN 0885-6125.
6. Shannon, Claude Elwood (1998). The mathematical theory of communication. University of Illinois Press. ISBN 0-252-72546-8. OCLC 40716662.
7. Quinlan, John Ross (1993). C4.5: programs for machine learning. Morgan Kaufmann. ISBN 1-55860-238-0. OCLC 797497210.
8. Cox, D. R. (1958-07). "The Regression Analysis of Binary Sequences". Journal of the Royal Statistical Society: Series B (Methodological) 20 (2): 215–232. doi:10.1111/j.2517-6161.1958.tb00292.x. ISSN 0035-9246.
9. Fiss, Peer C.; Marx, Axel; Cambré, Bart (2013-01-01). C. Fiss, Peer. ed. Chapter 1 Configurational Theory and Methods in Organizational Research: Introduction. Research in the Sociology of Organizations. 38. Emerald Group Publishing Limited. pp. 1–22. doi:10.1108/s0733-558x(2013)0000038005. ISBN 978-1-78190-778-8.
10. Blake, Archie (1938). Canonical expressions in Boolean algebra. OCLC 609232572.
11. Quine, W. V. (1952-10). "The Problem of Simplifying Truth Functions". The American Mathematical Monthly 59 (8): 521–531. doi:10.1080/00029890.1952.11988183. ISSN 0002-9890.
12. McCluskey, E. J. (1956-11). "Minimization of Boolean Functions*". Bell System Technical Journal 35 (6): 1417–1444. doi:10.1002/j.1538-7305.1956.tb03835.x.
13. Cronqvist, Lasse. Qualitative Comparative Analysis (QCA): eine Einführung mit TOSMANA und dem QCA Add-In. ISBN 978-3-95710-250-8. OCLC 1101279505.
14. MacLane, S. (1948-06-01). "Groups, Categories and Duality". Proceedings of the National Academy of Sciences 34 (6): 263–267. doi:10.1073/pnas.34.6.263. ISSN 0027-8424. PMID 16588808. PMC PMC1079106.
15. Turing, A. M. (1937). "On Computable Numbers, with an Application to the Entscheidungsproblem". Proceedings of the London Mathematical Society s2-42 (1): 230–265. doi:10.1112/plms/s2-42.1.230. ISSN 1460-244X.
16. Chomsky, N. (1956-09). "Three models for the description of language". IRE Transactions on Information Theory 2 (3): 113–124. doi:10.1109/TIT.1956.1056813. ISSN 2168-2712.
17. Chomsky, N. (1959-06-01). "On certain formal properties of grammars". Information and Control 2 (2): 137–167. doi:10.1016/S0019-9958(59)90362-6. ISSN 0019-9958.
18. Eilenberg, Samuel (1945). General theory of natural equivalences. American Mathematical Society. OCLC 819946414.
19. MacLane, S. (1948-06-01). "Groups, Categories and Duality". Proceedings of the National Academy of Sciences 34 (6): 263–267. doi:10.1073/pnas.34.6.263. ISSN 0027-8424. PMID 16588808. PMC PMC1079106.
20. Aho, Alfred V. (1968-10-01). "Indexed Grammars—An Extension of Context-Free Grammars". Journal of the ACM 15 (4): 647–671. doi:10.1145/321479.321488. ISSN 0004-5411.
21. Braga, Bruno da Rocha (2017). "A Categorical-Generative Theory of Social Processes: About Describing a Complex, Dynamic and Contingent Pattern of Decision-Making Events". SSRN Electronic Journal. doi:10.2139/ssrn.3033880. ISSN 1556-5068.
22. da Rocha Braga, Bruno (2018). Costa, António Pedro; Reis, Luís Paulo; Souza, Francislê Neri de et al.. eds. "A Categorical-Generative Theory of Social Processes: Towards the Ontological and Mathematical Foundations of a Grammar-Based Model for Qualitative Research". Computer Supported Qualitative Research. Advances in Intelligent Systems and Computing (Cham: Springer International Publishing): 400–411. doi:10.1007/978-3-319-61121-1_34. ISBN 978-3-319-61121-1.
23. Głowacka, Dorota; Shawe-Taylor, John; Clark, Alexander; de la Higuera, Colin; Johnson, Mark. Introduction to the special topic on grammar induction, representation of language and language learning. OCLC 922865728.
24. Gold, E. Mark (1967-05-01). "Language identification in the limit". Information and Control 10 (5): 447–474. doi:10.1016/S0019-9958(67)91165-5. ISSN 0019-9958.
25. Nevill-Manning, C.G.; Witten, I.H.; Maulsby, D.L.. "Compression by induction of hierarchical grammars". Proceedings of IEEE Data Compression Conference (DCC'94) (IEEE Comput. Soc. Press). doi:10.1109/dcc.1994.305932. ISBN 0-8186-5637-9.
26. Nevill-Manning, Craig G. Inferring sequential structure. OCLC 154289924.
27. Abbott, Andrew (1990-11). "A Primer on Sequence Methods". Organization Science 1 (4): 375–392. doi:10.1287/orsc.1.4.375. ISSN 1047-7039.
28. Abbott, Andrew (1995-08). "Sequence Analysis: New Methods for Old Ideas". Annual Review of Sociology 21 (1): 93–113. doi:10.1146/annurev.so.21.080195.000521. ISSN 0360-0572.