Linguistics and Machine Learning

 by Val Koteles

Subsets of linguistics are applicable to machine learning and to the way software engineers conceptualize the knowledge that is to be "fed" to the machine. Developers cannot help but think of that knowledge in terms defined by their own minds' metadata about language. Such metadata is constructed, for example, during concept discovery when an individual learns a new language. The new language carries expressions that are not necessarily shared across cultures, as well as unfamiliar grammatical formats: the order of words in sentences, the compounding of words with suffixes and prefixes, phonetic variants (in English, for example, "a" becomes "an" before a vowel sound), etc.

The [sometimes recursive] discovery process in language learning can serve as a partial model for how knowledge is acquired in machine learning when matched to visual, auditory, and [machine] role play emulations. The machine must typically be assumed to be an extremely low-proficiency learner.


Natural Learning Style Research & Applicability to Machine Learning


International Journal of Applied Linguistics


buzz words: Learning styles vs. learning strategies, cognitive styles (psychology), analytic vs. holistic

Principles of Kolb (1976) are useful in machine learning:

"According to Kolb, the natural sequence for learning is a cycle containing four stages: concrete experience, reflection–observation, abstract conceptualization, and active experimentation."

As in language studies, computer scientists must develop assessment instruments that accurately measure the results and speed of conceptualization within any given learning style modality used in machine learning. Complex models that reflect the understanding of nontrivial concepts must be included in batteries of contextualized tests. Quantification of results requires attention: how do we know, and test, that a concept was realized in the relevant context?

Initially it seems machine language may have some dependency on natural languages. Using abstract data models, one must translate any natural language into pre-conceptualized parameters such as who, what, where, action, how, and adjective parameters, along with human style parameters such as subjectivity, sarcasm, and emotional and cultural parameters as applicable. The data structures are used in a translation framework that facilitates testing as follows:


The method gives a human tester the option to examine how meaning in Language 1 makes the round trip through the TCAM Data Structure into a second natural language and back again through a second translation phase. The tester thus has to understand only one language (Language 1) and can quantify the result numerically by calculating a percent factor equal to the loss in conceptualization when meaning is filtered through two or more translation round trips and via as many natural languages as possible.
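The percent factor can be illustrated with a small sketch. This is not the TCAM design itself: ConceptRecord, its slot names, and round_trip_loss are hypothetical names chosen for illustration, and loss is simply assumed to be the share of parameter slots whose value changed after the round trip. The translation steps are simulated rather than performed.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ConceptRecord:
    """Hypothetical language-neutral record; slots mirror the parameters named in the text."""
    who: Optional[str] = None
    what: Optional[str] = None
    where: Optional[str] = None
    action: Optional[str] = None
    how: Optional[str] = None
    adjectives: Tuple[str, ...] = ()
    sarcasm: bool = False
    emotion: Optional[str] = None


def round_trip_loss(original: ConceptRecord, returned: ConceptRecord) -> float:
    """Percent of slots whose value changed between the original and the returned record."""
    names = [f.name for f in fields(ConceptRecord)]
    changed = sum(getattr(original, n) != getattr(returned, n) for n in names)
    return 100.0 * changed / len(names)


# Record extracted from a Language 1 sentence (values are made up for illustration).
original = ConceptRecord(who="Ana", what="the report", where="the office",
                         action="finished", how="quickly",
                         adjectives=("long",), sarcasm=True, emotion="relief")

# Simulated result after Language 1 -> Language 2 -> Language 1:
# the sarcasm flag and the manner of the action did not survive.
returned = replace(original, sarcasm=False, how=None)

loss = round_trip_loss(original, returned)
print(f"conceptual loss: {loss:.1f}%")
```

Here two of the eight slots change, so the sketch reports a loss of 25 percent. A real measure would likely weight slots differently, since losing sarcasm can invert the meaning of a sentence while losing an adjective may not.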

The need for testing is parallel to the need for maintaining a Machine Language conceptual model: TCAM above. This can be perceived as a visual relational model by developers and testers. It consists of a data structure and algorithms that describe the concept on its own and how it fits within the overall existing structure of understanding: the Universal Relational Model for the Conceptualization of Knowledge (URMCK). Such a system must be endowed with self-assessment algorithms that permit verification of new terms entering the system by looking for contradictions, for loss of content and meaning during translation through multiple natural languages, and by developing additional test parameters to indicate validity of understanding. Continuous access to human operators, who participate in resolving conflicts and preventing circular references, is necessary for the system to grow. Statistical analysis can provide a percent certainty of understanding that can automatically trigger human operators to intervene.
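A minimal sketch of such a self-assessment gate follows. Everything in it is an assumption made for illustration: the function names, the 80 percent threshold, the definition of certainty as 100 minus the mean per-language loss, and the representation of the model as a dictionary mapping each term to the terms it is defined by.

```python
from statistics import mean


def certainty(losses_percent):
    """Assumed measure: 100 minus the mean loss observed across languages."""
    return 100.0 - mean(losses_percent)


def would_create_cycle(definitions, new_term, depends_on):
    """True if defining new_term via depends_on makes new_term reachable from itself."""
    stack = list(depends_on)
    seen = set()
    while stack:
        term = stack.pop()
        if term == new_term:
            return True
        if term in seen:
            continue
        seen.add(term)
        stack.extend(definitions.get(term, ()))
    return False


def admit_term(definitions, new_term, depends_on, losses_percent, threshold=80.0):
    """Store the term and return 'accepted', or return 'human_review' without storing it."""
    if would_create_cycle(definitions, new_term, depends_on):
        return "human_review"
    if certainty(losses_percent) < threshold:
        return "human_review"
    definitions[new_term] = set(depends_on)
    return "accepted"


# Toy model: "car" is defined via "vehicle", which is defined via "machine".
definitions = {"vehicle": {"machine"}, "car": {"vehicle"}}

print(admit_term(definitions, "truck", {"vehicle"}, [5.0, 10.0]))    # no cycle, low loss
print(admit_term(definitions, "machine", {"car"}, [5.0, 10.0]))      # circular definition
print(admit_term(definitions, "irony", {"machine"}, [30.0, 40.0]))   # high loss
```

Only the first term is stored. The other two are routed to a human operator, one because it would define "machine" in terms of itself and one because the assumed certainty of 65 percent falls below the threshold.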

The need to maintain data on the structure itself (TCAM), on unit testing, and on a multitude of natural languages raises obvious, immediate issues with both computing power and storage. It also raises a question: while the data bank required to maintain a URMCK would be enormous, would it still be deterministic in nature? This assumes a distinct separation between conceptualization and ideation modules.

Reading: examine the mismatch between Learning Styles and Teaching Styles

Learning Styles And Teaching Styles: A Case Study In Foreign Language Classroom

buzz words:  The Personal Learning styles inventory, Reid's learning style model.

Reid's learning style model: Reid (1987) identifies six learning styles, referred to as Perceptual Learning Styles: visual, auditory, kinesthetic, tactile, group, and individual.

All of Reid's styles may become applicable to machine learning as technologies develop. The styles will need to be enhanced and reclassified to include spectroscopy, thermography, and other remote sensing methodologies. The main difference is that in machine learning all styles would engage concomitantly, rather than a subset being chosen.
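The idea of engaging every style at once can be sketched as a simple fusion step. The Channel names, the fuse function, and the averaging rule are all assumptions made for illustration; the scores are invented, and a real system would need a far more careful way to combine modalities.

```python
from enum import Enum


class Channel(Enum):
    """Hypothetical perceptual channels, extended with remote sensing modalities."""
    VISUAL = "visual"
    AUDITORY = "auditory"
    SPECTROSCOPY = "spectroscopy"
    THERMOGRAPHY = "thermography"


def fuse(observations):
    """Average each label's score over every channel; a silent channel counts as 0."""
    totals = {}
    for channel in Channel:
        for label, score in observations.get(channel, {}).items():
            totals[label] = totals.get(label, 0.0) + score
    return {label: total / len(Channel) for label, total in totals.items()}


# Invented per-channel confidence scores for what an object might be.
observations = {
    Channel.VISUAL: {"kettle": 0.5, "vase": 0.5},
    Channel.AUDITORY: {"kettle": 0.75},
    Channel.SPECTROSCOPY: {"kettle": 0.5, "vase": 0.25},
    Channel.THERMOGRAPHY: {"kettle": 1.0},
}

fused = fuse(observations)
best = max(fused, key=fused.get)
print(best, fused[best])
```

The visual channel alone cannot separate the two labels, while the combined score favors "kettle" because the auditory and thermal channels contribute at the same time.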

Find out more about the things human brains do that still far exceed the performance of computers. There is a limit to the range of concepts that are translatable: not all linguistic concepts are easily described in data structures and algorithms. Some examples are attributing the value of creativity to one body of work vs. another, the meaning of emotions and their comparison and quantification, and navigating uncertainty.

Interesting Topics

Historical twists: computations that digitalize morality are already here, and machines are already making decisions for us:

Zeynep Tufekci: Machine intelligence makes human morals more important

buzz words: algorithmic accountability, outsourced responsibility.

In the video above, the speaker, Zeynep Tufekci, states "We cannot outsource our moral responsibilities to machines." Philosophically, most people would probably agree that there is nothing wrong with this statement and that it is definitely a goal to aspire to. The reality, however, is difficult to disregard: a URMCK already exists, even though it is not being developed purposefully. It is already here in a fragmented form, slowly amalgamating seemingly unrelated components such as health, security, law enforcement, HR, finance, insurance, and social media. The growth is attributed to an increased need for automation, advances in interconnectivity and broadband, and, last but not least, the sharing of cross-platform, universally decipherable open formats.

From a Computer Science point of view, it makes sense to develop a para-framework that anticipates the overlap of various calculation engines in an increasingly interconnected, automated near future. Such a framework can enforce processing via automaton nodes that provide, ironically, an added layer of AI error correction and auditing, forcing the resulting system to be more consistent with human morality rather than with calculations alone.
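One way to picture such a node is as a wrapper that sits between a calculation engine and its consumer, runs audit checks on every result, and escalates to a human when a check fails. The sketch below is only an illustration under stated assumptions: AuditNode, risk_engine, ignores_postcode, and the scoring rule are all hypothetical, and the single check shown (does the result change when a field that should be irrelevant is removed?) stands in for whatever moral constraints a real framework would encode.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AuditNode:
    """Hypothetical node that audits an engine's output before passing it on."""
    engine: Callable[[dict], float]
    checks: List[Callable[[Callable[[dict], float], dict, float], bool]]
    log: list = field(default_factory=list)

    def decide(self, case: dict):
        result = self.engine(case)
        failed = [c.__name__ for c in self.checks if not c(self.engine, case, result)]
        self.log.append({"case": case, "result": result, "failed": failed})
        return ("escalate" if failed else "accept", result)


def risk_engine(case: dict) -> float:
    """Made-up scoring engine that (wrongly) penalizes one postcode."""
    score = 0.2 + 0.1 * case["claims"]
    if case.get("postcode") == "X1":
        score += 0.3
    return score


def ignores_postcode(engine, case, result) -> bool:
    """Audit check: the score must not change when the postcode is removed."""
    stripped = {k: v for k, v in case.items() if k != "postcode"}
    return abs(engine(stripped) - result) < 1e-9


node = AuditNode(engine=risk_engine, checks=[ignores_postcode])
print(node.decide({"claims": 1, "postcode": "Y2"})[0])
print(node.decide({"claims": 1, "postcode": "X1"})[0])
```

The first case is accepted and the second is escalated, and both decisions are recorded in the node's log so they can be reviewed later.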

Linguistics and Technology, New Concepts


Fred Benenson, Data Scientist, on "mathwashing", a term he coined "in an attempt to describe the tendency by technologists (and reporters!) to use the objective connotations of math terms to describe products and features that are probably more subjective than their users might think. This habit goes way back to the early days of computers when they were first entering businesses in the ’60s and ’70s: everyone hoped the answers they supplied were more true than what humans could come up with, but they eventually realized computers were only as good as their programmers."