Impossible Evidence: Words, Identity and AI

“I don’t have a good feeling. I feel scared,” the voice of a terrified 17-year-old girl pleaded over the phone.1 “You know the borders are closed right now, so how am I going to get out?”2

The voice belonged to Kadiza Sultana, and she was desperately trying to plan her escape.3 A year before, in February 2015, the former student at Bethnal Green Academy and two of her classmates left their homes in the U.K. to join the Islamic State group.4 Now, trapped in the organization’s clutches, she feared for her life.

On the other end of the line was Halima Khanom, her sister. Racking her brain, Halima asked Kadiza what her chances of escape were.5

“Zero,” said Kadiza.

Not long after the call, Kadiza was killed in an airstrike.6

The Devil is in the Details

Kadiza did not leave without warning. The trio’s departure in February 2015 left a trail of clues that natural language processing (NLP) technology can now detect, but that technology was not in place (or even possible) at the time.

Scotland Yard interviewed the young women three months before their disappearance.7 A classmate and friend had vanished in the same manner, and law enforcement wrote up reports on the three as well as other at-risk students.

The girls’ social media presence leading up to their departure also bore troubling signs: Their posts had rapidly transformed from normal, life-of-a-student commentary to ideological rhetoric.8

Taken together, the data forms a perilous narrative of radicalization. If the full story had been available to airport security staff, the students would likely still be safe at home, complaining about teachers and prepping for exams.

However, these story fragments were housed in disparate data sources—state and school records, police reports and social media posts—and the pieces literally could not be put together in time.

The story of Kadiza Sultana is a tragic case of a problem that appears in many spaces that are less tragic but still have wide-reaching impacts. Money laundering is one such area. An estimated 2-5 percent of global GDP, or $800 billion to $2 trillion in current U.S. dollars, is laundered annually worldwide.9 Financial institutions (FIs) are tasked with identifying potential and current criminals in the masses of people they onboard daily, a task which becomes more difficult by the day.

The trouble is not that the needed information in these scenarios does not exist. In fact, quite the opposite is true: For the most part, all the intelligence FIs need to determine criminality exists in spades.

Instead, the difficulty lies in organizing and locating this information. It is usually in unstructured text, buried in other unstructured text and scattered throughout the internet and other sources.

Needle in a Stack of Needles: Structured and Unstructured Data

For machines, not all textual data is created equal. Overall, computers like structured data.

Structured data has a known, unambiguous order that computers can easily interpret. Tables, spreadsheets, taxonomies and protocols are all examples of structured data. This data announces (“labels”) what is inside each row and column and how to interpret it (e.g., “date of birth” or “annual revenue”).

However, much of the identity-relevant information is found in unstructured data.

Unstructured data does not have a formal, clear order that can be easily understood. Unstructured does not mean “no rules”: Prose generally follows the rules of grammar while being unstructured. Instead, unstructured data describes information whose format is difficult for a computer to interpret.
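To make the contrast concrete, here is a minimal Python sketch. The person, the date and the sentence are all made up for illustration, and the regular expression is a deliberately brittle stand-in for what a trained model would do. It pulls the same fact first from a labeled table and then from a line of prose.

```python
import csv
import io
import re

# Structured: the column label says exactly where the fact lives.
table = io.StringIO("name,date_of_birth\nJane Doe,1990-04-12\n")
row = next(csv.DictReader(table))
dob_from_table = row["date_of_birth"]

# Unstructured: the same fact is buried in prose, so the program needs a
# pattern (or a trained model) to find it. This hypothetical regex only
# handles one phrasing and one date format.
sentence = "Jane Doe, who was born on 12 April 1990, opened the account."
match = re.search(r"born on (\d{1,2} \w+ \d{4})", sentence)
dob_from_prose = match.group(1) if match else None

print(dob_from_table)
print(dob_from_prose)
```

The table lookup keeps working no matter how many rows are added. The pattern breaks as soon as someone writes “birthday: April 12” instead.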

For example, readers can understand the sentence, “The dog jumped,” because they know what type of word “the” is, and they know that verbs like “jump” get certain endings attached and come at the middle or end of a sentence. This simple example becomes more complex when it is changed to, “The quick brown fox jumped over the lazy dog,” which is still not a very complex sentence. However, the rules of grammar that guide both of these sentences make complete sense to people but are not easily understood by computers.

Because computer systems are really the only systems capable of processing data at internet scale, a very real problem presents itself: Much of the information that matters is found in a form that computers have traditionally been poor at handling.

Needle in a Stack of Needles: Here, There and Everywhere

It does not stop there. The quality of the information also presents a huge challenge to discovery, analysis and unification of identities:

  • Names often vary
  • Attributes can conflict
  • Different languages and scripts will confuse

As information quality poses a challenge, so does information location. The identities of people, organizations and places are stories, and these stories are broken into difficult-to-reconcile fragments and spread across the digital landscape. These fragments can be found buried in databases, data lakes, log files, historical archives, content systems, social media posts, website pages and document libraries. The list goes on.

Taken together, these challenges present the parameters of a solution. If they are going to assemble comprehensive identities derived from textual data, FIs need technology capable of scanning a variety of different structured and unstructured data sources, finding relevant fragments and resolving them into unified identity stories.

To understand this information created by human minds and meant for consumption by other human minds, technology that mimics human understanding is needed—which is where artificial intelligence (AI) enters the picture.

Words Are a Pain

The rules that govern language are complex, almost inexpressibly so. Quality machine translation, still an enormous challenge, was impossible for decades because of this fact.

Long before the current generation of AI, expert systems ruled the world. But the mighty expert systems, built on a bedrock of explicit input/output rules written by highly paid subject-matter experts, were fragile and could not handle the complexities and frequent changes in how humans use their languages.

The idea that people can use a language without really understanding it is not exactly surprising. After all, many people effortlessly construct sentences according to rules of grammar they would be hard-pressed to explain.

However, the lesson that expert systems taught was a little shocking: Even Ph.D. linguists cannot fully articulate the nuances of language systems. There are simply too many intricate rules for even the best experts to write down.

So, computer systems were bad at language. That began to change, however, with the dawn of machine learning.

Teaching Robots to Read

For computer systems to understand language like people understand language, programmers had to build computer systems that learned language like people learned language (more or less).

People do not learn via exhaustive lists of all possible, relevant input/output relationships.

People learn via example, create conceptual models and then make inferences.

For instance, adults do not teach children to identify cats by showing them every possible image of a cat paired with the word “cat,” spoken or written. Instead, children are shown a variety of cats, and if they call a dog a “cat,” they will be gently corrected: “No, dear, cats don’t bark.” From those examples, children create a mental model of the features that identify a cat.

This model is flexible, capable of identifying previously unseen versions of cats as cats. It also learns, improving in sophistication and accuracy over time.

This is precisely the same kind of process that is now used to teach AI systems to understand human language. It is called machine learning, and it has revolutionized computer science. Machine-learned models can “learn” thousands, or even millions, of micro-rules to identify facts in sentences, far more than the Ph.D.s behind expert systems could ever write.

Because of machine learning, and the later innovations that fall under its umbrella (like deep learning), computers no longer fail at human language. In fact, they are getting pretty darn good at it. Good enough to finally search through the mountains of unstructured text produced every day to find and connect the nuggets, few and far between, that actually matter.

That ability is good news for everyone, not least anti-money laundering (AML) professionals.

The Robot Army

Since the early days of stumbling over sentence structure, AI has come a long way.

The field of study specifically dedicated to the application of AI for human language, NLP, has developed a library of capabilities across a wide variety of human languages. These include base linguistic processing, key information identification and extraction, categorization, summarization, sentiment analysis, semantic search, and information mapping and resolution.

These capabilities can be used both on structured text (labeled columns of information like spreadsheets) and unstructured text (unlabeled fields of information like social media posts) across a wide range of languages.

Now, using a well-designed combination of the capabilities, a single AI system can do the following:

  1. Continually ingest a huge variety of multilingual textual information, from watchlists to Facebook posts
  2. Draw out the most relevant data, such as people, places, organizations, flag concepts and key phrases
  3. Map their relationships in a knowledge graph and store this data in a knowledge base, creating unified identity stories

If AML departments are going to tackle the unstructured text and identity problem, this is precisely the kind of system they need.

Great...but how?

Hand waving about what AI can do is great and all, but it is not particularly helpful. Explanations are.

So, taking each piece in turn:

How can AI continually ingest a huge variety of multilingual textual information, from watchlists to Facebook posts?

Good question. The quality of the results of more complex analytical processes, like the extraction and mapping of key information, is directly related to the quality of the underlying data.

To get the textual data into an analysis-ready state, the input data feeds, which may vary from social media content to international watchlists, need to go through pre-processing that normalizes and cleanses the data. The steps are numerous, from language identification to part-of-speech tagging, and many are accomplished through algorithms trained to identify, delineate and/or transform text.
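As a rough illustration of the normalizing and cleansing step, the sketch below uses only the Python standard library. The function name `normalize_text` and the choice of steps are assumptions made for this example, not a description of any particular product's pipeline, and language identification is left out entirely.

```python
import re
import unicodedata

def normalize_text(raw: str) -> str:
    """Apply a few common cleansing steps to one piece of raw text."""
    text = unicodedata.normalize("NFKC", raw)   # unify look-alike Unicode forms
    text = re.sub(r"<[^>]+>", " ", text)        # drop leftover HTML tags
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    return text.casefold()                      # fold case for comparison

# \uff25 is a full-width "E" and \u00a0 is a non-breaking space.
print(normalize_text("  <b>\uff25agle</b>   Drugs\u00a0Ltd.\n"))
```

After this pass, two feeds that spell the same company with different spacing, markup or character widths produce the same string, which makes every later step easier.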

The algorithm creation process is conceptually straightforward. For example, to produce an algorithm that would identify the parts of speech in text, a machine-learning system would be fed a large set of textual information (ideally of the same type of data the system will be processing) with the relevant properties (articles, nouns, verbs, etc.) annotated.

This process allows the system to create an understanding of what articles, nouns, verbs, etc., look like and, consequently, identify and tag them when they show up in the text.
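The idea of learning from annotated examples can be shown with a toy tagger. The sketch below is a “most frequent tag” baseline, far simpler than the statistical models used in practice, and its three training sentences are invented for illustration. It does follow the same recipe, though: count what the annotations say, then apply what was learned to new text.

```python
from collections import Counter, defaultdict

# Tiny hand-annotated training set (made up for illustration).
ANNOTATED = [
    [("the", "ARTICLE"), ("dog", "NOUN"), ("jumped", "VERB")],
    [("the", "ARTICLE"), ("fox", "NOUN"), ("jumped", "VERB")],
    [("a", "ARTICLE"), ("dog", "NOUN"), ("barked", "VERB")],
]

def train(sentences):
    """Count how often each word carries each label; keep the most common."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, label in sentence:
            counts[word][label] += 1
    return {word: labels.most_common(1)[0][0] for word, labels in counts.items()}

def tag_words(model, words, default="NOUN"):
    """Tag each word; words never seen in training get a default guess."""
    return [(word, model.get(word, default)) for word in words]

model = train(ANNOTATED)
print(tag_words(model, ["the", "fox", "barked"]))
```

The tagger has never seen the sentence “the fox barked,” yet it tags each word correctly because it saw each word in other annotated sentences. Real systems also learn from context, which is how they cope with words that can be a noun in one sentence and a verb in the next.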

Some of the other steps are still tackled by good, old-fashioned, rules-based technology.

How can AI draw out the most relevant data, like people, places, organizations and concepts?

In structured data, people, places, organizations, etc., are identified by the existing structure of the information, explicitly called out in columns or rows.

In unstructured data, identifying these key data points is much more difficult. Just like the part-of-speech example above, people, places, organizations and concepts need to be identified by machine-learning models trained to do so on human-prepared example data.

Once trained, the algorithm can analyze real-world text.

How can AI map their relationships in a knowledge graph and store this data in a knowledge base, creating unified identity stories?

First, the information that may concern the same entity needs to be clustered together into a unique identity record. This enables the system to leverage all relevant signals, and the record continues to grow as new signals and/or information are discovered.

There is a lot to cover when it comes to how this is completed, so this explanation will focus on people and organizations.

For individual names, a mixture of rules-based and statistical (machine learning) algorithms identify the textual data (documents, articles, etc.) that might be about the same person.
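A minimal sketch of that mixture, using only the Python standard library, might look like the following. The title list, the token-sorting rule and the use of `difflib.SequenceMatcher` as the “statistical” half are all assumptions chosen to keep the example short. Production name matching relies on trained models and handles far more variation (nicknames, transliteration, other scripts).

```python
import unicodedata
from difflib import SequenceMatcher

TITLES = {"mr", "mrs", "ms", "dr"}  # hypothetical rule list

def normalize_name(name: str) -> str:
    """Rules-based step: strip accents, punctuation and titles; sort tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    tokens = [t.strip(".,").lower() for t in no_accents.split()]
    tokens = [t for t in tokens if t and t not in TITLES]
    return " ".join(sorted(tokens))  # sorting makes "Last, First" order irrelevant

def name_similarity(a: str, b: str) -> float:
    """Stand-in for a learned score: character similarity from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()

# \u00e9 and \u00ed are accented "e" and "i"; the names are made up.
print(name_similarity("Dr. Jos\u00e9 Mart\u00ednez", "Martinez, Jose"))
print(name_similarity("John Smith", "Jose Martinez"))
```

Documents whose names score above a chosen threshold become candidates for the same identity record. The threshold itself is a tuning decision that trades missed matches against false ones.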

For organizational names, another rather recent NLP technology is useful: text embeddings. This explanation gets rather technical rather quickly, but, in broad strokes, text embeddings transform textual data into mathematical values that numerically represent its meaning, so that words or phrases with similar meanings have similar values. Think of it as a way of exposing semantic similarity among organization names.

As company names frequently contain common words that can be, for instance, mistranslated (Eagle Drugs vs. Eagle Pharmaceutical), text embeddings can identify textual data that might be about the same company by zeroing in on meaning versus exact phrasing.
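The sketch below shows the mechanics with toy numbers. The three-dimensional vectors are invented for illustration; real embeddings are learned from large text collections and have hundreds of dimensions. Averaging word vectors to represent a whole name is also a simplifying assumption. What carries over to real systems is the comparison itself: names are scored by the cosine of the angle between their vectors.

```python
import math

# Hypothetical word vectors, invented for illustration only.
TOY_VECTORS = {
    "eagle":          [0.9, 0.1, 0.0],
    "drugs":          [0.1, 0.9, 0.1],
    "pharmaceutical": [0.1, 0.8, 0.2],
    "airlines":       [0.1, 0.0, 0.9],
}

def embed(name):
    """Average the vectors of the known words (assumes at least one is known)."""
    vectors = [TOY_VECTORS[w] for w in name.lower().split() if w in TOY_VECTORS]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity: near 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same = cosine(embed("Eagle Drugs"), embed("Eagle Pharmaceutical"))
different = cosine(embed("Eagle Drugs"), embed("Eagle Airlines"))
print(same > different)
```

A character-by-character comparison would see “Drugs” and “Pharmaceutical” as unrelated strings. Because their vectors sit close together, the two Eagle names score as near-duplicates while the airline does not.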

Then, the relationships between the key properties in the data (like people and organizations) are extracted. It might sound like a broken record, but models are trained to identify different kinds of relationships (such as [x] owns [y] or [x] married [y]) and find and map the connections between people, places, organizations and concepts.
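To show what the output of this step looks like, the sketch below extracts “owns” and “married” relationships and stores them as a small graph. Simple text patterns stand in for the trained models here, which is an assumption made to keep the example self-contained, and the people and company named are made up.

```python
import re
from collections import defaultdict

ENTITY = r"[A-Z]\w*(?: [A-Z]\w*)*"   # crude rule: a run of capitalized words
RELATIONS = ["owns", "married"]       # hypothetical relationship types

def extract_triples(text):
    """Return (subject, relation, object) triples found in the text."""
    triples = []
    for relation in RELATIONS:
        pattern = re.compile(rf"({ENTITY}) {relation} ({ENTITY})")
        for subj, obj in pattern.findall(text):
            triples.append((subj, relation, obj))
    return triples

def build_graph(triples):
    """Index triples by subject so each entity lists its outgoing edges."""
    graph = defaultdict(list)
    for subj, relation, obj in triples:
        graph[subj].append((relation, obj))
    return dict(graph)

text = "Alice Smith owns Acme Holdings. Alice Smith married Bob Jones."
graph = build_graph(extract_triples(text))
print(graph)
```

Each triple is one edge in the knowledge graph. Once thousands of documents have been processed this way, following the edges out from a single name starts to reveal the ownership and family ties around it.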

Finally, for maximum value, this matrix of data, this knowledge graph, must be resolved to an existing knowledge base.10,11 For instance, the system has to ensure that the new data it wants to add about Tim Cook, Apple CEO, is not about Tim Cook, Canadian historian.

Trained algorithms are able to make these distinctions through the use of contextual clues. From shared relationships to syntactic patterns, machine-learning systems can leverage all available signals to disambiguate and resolve entities, marrying this newly minted intel to an existing knowledge base.
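A toy version of that disambiguation step is sketched below. The knowledge-base entries, their context terms and the overlap-counting score are all assumptions made for illustration. A trained system would weigh many more signals, but the principle of letting the surrounding words choose the right record is the same.

```python
# Hypothetical knowledge-base entries: an ID mapped to known context terms.
KNOWLEDGE_BASE = {
    "tim_cook_apple": {"apple", "ceo", "iphone", "cupertino"},
    "tim_cook_historian": {"historian", "canadian", "history", "book"},
}

def resolve(mention_context, knowledge_base, min_overlap=1):
    """Return the entry ID sharing the most context terms, or None."""
    words = {w.strip(".,").lower() for w in mention_context.split()}
    best_id, best_score = None, 0
    for entry_id, terms in knowledge_base.items():
        score = len(words & terms)   # count of shared context words
        if score > best_score:
            best_id, best_score = entry_id, score
    return best_id if best_score >= min_overlap else None

print(resolve("Tim Cook, the Apple CEO, spoke today.", KNOWLEDGE_BASE))
print(resolve("Canadian historian Tim Cook published a book.", KNOWLEDGE_BASE))
print(resolve("Tim Cook arrived.", KNOWLEDGE_BASE))
```

Returning nothing for the third sentence is deliberate. When the context offers no clue, it is safer to hold the new information back than to attach it to the wrong person.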

Why This Technology Matters

It is an often-mentioned statistic, but it is worth bringing up here: 99 percent of illicit funds go uncaught.12 Simply put, existing AML technology is not really stopping money from being laundered.

If AML operations are going to put a crack in the financial foundation of criminal enterprise, they are going to need new technology. This technology should be built to sort through the incredible amount of textual data that is now available everywhere, find what matters, and make a useful and actionable story from the data.

This is the system described in this article. And it is possible using AI technology that actually exists.

Steve Cohen, COO, Basis Technology, Cambridge, MA, U.S.A.

  1. Katie Forster, “London schoolgirl who ran away to join Isis ‘killed in air strike in Syria,’” Independent, August 11, 2016.
  2. Ibid.
  3. Rohit Kachroo, “Bethnal Green schoolgirl Kadiza Sultana who joined Islamic State ‘killed in airstrike in Syria’, ITV News reveals,” ITV, August 11, 2016.
  4. Katie Forster, “London schoolgirl who ran away to join Isis ‘killed in air strike in Syria,’” Independent, August 11, 2016.
  5. Rohit Kachroo, “Bethnal Green schoolgirl Kadiza Sultana who joined Islamic State ‘killed in airstrike in Syria’, ITV News reveals,” ITV, August 11, 2016.
  6. Ibid.
  7. David Barrett and Martin Evans, “Three ‘Jihadi brides’ from London who travelled to Syria will not face terrorism charges if they return,” The Telegraph, March 10, 2015.
  8. Erin Marie Saltman and Melanie Smith, “‘Till Martyrdom Do Us Part’ Gender and the ISIS Phenomenon,” Institute for Strategic Dialogue, February 2016.
  9. “Money-Laundering and Globalization,” United Nations Office on Drugs and Crime.
  10. This knowledge base would contain the core information around people, places and organizations.
  11. The knowledge in the system grows organically as it encounters newly discovered information about identities.
  12. Samuel Rubenfeld, “Is Anti-Money Laundering Working? No, It’s Failing,” The Wall Street Journal, November 21, 2017.
