How has Intrinio automated the data supply chain to provide higher-quality financial data, faster, at a more competitive price? The secret is machine learning. In this blog series, we will be offering a high-level overview of what machine learning is, how it can be used to analyze and interpret financial information – and a bit about how we use it at Intrinio to improve the financial data experience.
Machine learning is a set of techniques used to train a computer application to detect patterns. Many of these techniques are not new to the field of data science and analysis: grouping or classifying things based on certain characteristics is a well-established practice, and the mathematical foundations of the famous AlphaGo project, for example, trace back to dynamic programming, published by Bellman in 1952. What’s changed? Put simply, we now have the compute power to expand on these old-school theories and apply them to solve real-world problems.
A practical entry point into machine learning is to look at the types of questions it aims to answer. There are only a few fundamentally different ways to approach information with machine learning, and almost all machine learning projects fall into the following categories:
In unsupervised learning, the computer application looks at a dataset and attempts to group items together using some type of distance or proximity measure. Whether you’re feeding the application pictures of butterflies or financial statements, it will group the items that are most similar based on the provided features.
It’s important to note that in unsupervised learning, we can’t control the assignment to groups – we can only provide the information to group by. It’s an exploratory approach: we try various methods and see whether the results make sense in a way we can use.
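As a concrete illustration, here is a minimal sketch of unsupervised grouping: a tiny k-means clustering routine applied to made-up feature vectors. The company features and numbers below are hypothetical, purely for illustration.

```python
# Minimal k-means sketch: assign each point to its nearest centroid, move
# each centroid to the mean of its assigned points, and repeat.
def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign the point to the nearest centroid (squared Euclidean distance).
            i = min(range(len(centroids)),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Hypothetical features: (revenue growth, debt-to-equity) for six companies.
points = [(0.30, 0.20), (0.28, 0.25), (0.32, 0.15),   # high growth, low debt
          (0.02, 1.50), (0.01, 1.60), (0.03, 1.40)]   # low growth, high debt
centroids, clusters = kmeans(points, [points[0], points[3]])
```

With initial centroids drawn from opposite groups, the routine separates the high-growth/low-debt companies from the low-growth/high-debt ones; in practice the initial centroids are usually chosen randomly and the algorithm is run several times.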
Supervised learning happens when the application is given a set of labeled examples and tries to classify new data into the classifications it was trained to recognize. For example, if you want the application to identify dog breeds, you might feed it pictures of German Shepherds, pugs, and so on, each labeled with the correct breed, so it learns to classify dogs correctly. In the financial domain, we might want to classify financial statements into industry classes, or map as-reported line items to standardized line items. The key point about this process is that classification needs examples: if you don’t provide labeled examples, there is no supervised training.
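A minimal sketch of the idea, using a handful of hypothetical labeled captions (not Intrinio's actual taxonomy): a nearest-neighbor classifier that assigns a new as-reported caption the label of its most similar training example, with similarity measured by simple token overlap.

```python
# Toy supervised classifier: labeled training examples map as-reported
# captions to standardized concepts (hypothetical labels for illustration).
TRAINING = [
    ("total net revenues", "Revenue"),
    ("revenues from contracts with customers", "Revenue"),
    ("cost of goods sold", "CostOfRevenue"),
    ("cost of products sold", "CostOfRevenue"),
    ("selling general and administrative expenses", "SGAExpense"),
]

def jaccard(a, b):
    # Token-overlap similarity: |intersection| / |union| of the word sets.
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def classify(caption):
    # 1-nearest-neighbor: return the label of the most similar example.
    best = max(TRAINING, key=lambda ex: jaccard(caption.lower(), ex[0]))
    return best[1]
```

Real systems use far richer features and models, but the shape of the problem is the same: labeled examples in, a classifier for new data out.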
Reinforcement learning is a training approach in which the application teaches itself procedures or strategies to solve a problem and then learns to apply them to new situations. Essentially, the computer learns how to do something as a series of steps.
Reinforcement learning is a dynamic approach – finding the best way to do something. Take the game of tic-tac-toe: an application would answer the question “where is the best place to put my next X?” by playing thousands of games, learning to recognize certain situations and which move provides the biggest payoff in each.
Reinforcement learning is, for instance, one of the main techniques used to train self-driving cars. In the business information processing field, it is not yet as widely applied as supervised learning or representation learning.
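Tic-tac-toe's state space is a bit large for a short example, but the same tabular Q-learning idea fits in a few lines on a simpler stand-in problem (a toy construction, not production code): an agent in a five-cell corridor learns, by trial and error, that moving right toward the rewarded end is always the best action.

```python
# Toy Q-learning sketch: the agent starts in the middle of a 5-cell corridor.
# Reaching the rightmost cell pays +1, the leftmost pays 0.
import random

N_STATES, ACTIONS = 5, (-1, +1)          # positions 0..4; move left or right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2    # learning rate, discount, exploration
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
rng = random.Random(0)

for _ in range(500):                      # play many episodes
    s = 2                                 # start in the middle
    while 0 < s < N_STATES - 1:
        # Epsilon-greedy: usually take the best-known action, sometimes explore.
        a = rng.choice(ACTIONS) if rng.random() < EPSILON else max(ACTIONS, key=lambda x: Q[(s, x)])
        s2 = s + a
        reward = 1.0 if s2 == N_STATES - 1 else 0.0
        future = 0.0 if s2 in (0, N_STATES - 1) else max(Q[(s2, x)] for x in ACTIONS)
        Q[(s, a)] += ALPHA * (reward + GAMMA * future - Q[(s, a)])  # Bellman update
        s = s2

# The learned policy: the preferred action in every interior state.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(1, N_STATES - 1)}
```

The update rule in the loop is the same Bellman-style value update that, scaled up enormously, underlies systems like AlphaGo.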
In representation learning (also called feature learning), the application is given a set of information and automatically discovers the representations needed for feature detection or classification directly from the raw data. Many recent advances in machine learning – especially in computer vision and natural language processing (NLP) – are based on this technique. It resembles unsupervised learning in that you provide a whole dataset and the application tries to capture its essence; the difference is that in representation learning the application extracts the features from the dataset itself, rather than being handed a set of features determined a priori.
Facial recognition is an example of representation learning, where the application extracts distinguishing features from pictures. In the financial information analysis domain, we use representation learning to extract the essence from financial statements to answer questions about companies or support line-item classification. The underlying technology that makes representation learning possible is the artificial neural network, and in the NLP domain, representations called “embeddings” play a crucial role.
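To make the embedding idea concrete: an embedding maps a piece of text to a dense vector of numbers such that semantically similar texts land close together, typically measured by cosine similarity. A toy sketch with made-up four-dimensional vectors (real embedding models learn vectors with hundreds of dimensions from data, rather than having them written by hand):

```python
# Cosine similarity: 1.0 means the vectors point the same way, 0 means
# they are unrelated (orthogonal).
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Hypothetical embeddings: similar phrases get similar vectors.
embeddings = {
    "total revenues":   (0.9, 0.1, 0.0, 0.2),
    "net sales":        (0.8, 0.2, 0.1, 0.3),
    "interest expense": (0.1, 0.9, 0.7, 0.0),
}

rev_vs_sales = cosine(embeddings["total revenues"], embeddings["net sales"])
rev_vs_interest = cosine(embeddings["total revenues"], embeddings["interest expense"])
```

Here "total revenues" and "net sales" score much closer to each other than either does to "interest expense" – which is exactly the property that makes embeddings useful for grouping and classifying line items.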
However, just like humans, different applications may interpret the same data in different ways. How the application arrives at its representations is a black box, and the computer can’t necessarily explain why it made a specific decision based on such abstractions. Explainable ML is an active area of research at Intrinio and elsewhere.
Intrinio uses supervised learning for our core products. For example, we map as-reported line items from financial statements into standardized templates. This is a classification problem: companies report their line items using many different concepts, which makes the data impossible to compare accurately. We take the as-reported line items and classify them into a set of concepts – Intrinio’s internal taxonomy – to simplify fundamental analysis.
We know that certain items are characteristic of a particular statement; for instance, assets, liabilities, and equity always belong on the balance sheet. We also know that certain as-reported line items are the same as a “fundamental concept” in our internal taxonomy, so we map many different source concepts onto the same fundamental concept. One of the areas of ML research at Intrinio is how best to use representation learning of concept metadata to improve our standardization process.
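Once learned, the end result behaves like a many-to-one lookup from as-reported captions to fundamental concepts, with statement membership acting as a constraint. A sketch with hypothetical concept names (not Intrinio's actual taxonomy):

```python
# Hypothetical many-to-one mapping: several as-reported captions all
# standardize to the same fundamental concept.
TAXONOMY_MAP = {
    "total net revenues": "totalrevenue",
    "revenues from contracts with customers": "totalrevenue",
    "net sales": "totalrevenue",
    "total assets": "totalassets",
    "total liabilities and stockholders equity": "totalliabilitiesandequity",
}

# Statement membership constraint: asset and liability concepts always
# belong on the balance sheet, revenue on the income statement.
STATEMENT_OF = {
    "totalrevenue": "income_statement",
    "totalassets": "balance_sheet",
    "totalliabilitiesandequity": "balance_sheet",
}

def standardize(caption):
    concept = TAXONOMY_MAP.get(caption.lower().strip())
    return (concept, STATEMENT_OF[concept]) if concept else None
```

The hard part, of course, is building the mapping for the long tail of captions no one has seen before – which is where the supervised and representation-learning techniques above come in.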
We’ve also built our new AI question-answering platform, Thea, using representation learning. The natural language processing field uses representation learning to extract features from the source texts, which produces the text embeddings we briefly discussed above.
With Thea, we feed our systems large amounts of textual information, and – using the embedding-based techniques our engineers built into the application – it becomes possible to ask freeform questions. Based on the embeddings, Thea returns datasets that extract the essence of the underlying statements.
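At its core, embedding-based retrieval of this kind reduces to ranking stored passages by their similarity to the question's embedding. A toy sketch (the vectors and passages below are made up; in a real system they come from a learned embedding model):

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Hypothetical passage embeddings (real ones are produced by an NLP model).
passages = {
    "Revenue grew 12% year over year.":         (0.9, 0.1, 0.1),
    "The board declared a quarterly dividend.": (0.1, 0.9, 0.2),
    "Operating margin expanded to 18%.":        (0.7, 0.2, 0.3),
}

def answer(question_embedding, k=1):
    # Rank stored passages by cosine similarity to the question embedding
    # and return the top k.
    ranked = sorted(passages,
                    key=lambda p: cosine(passages[p], question_embedding),
                    reverse=True)
    return ranked[:k]

# A question about revenue growth would embed near the first passage.
question_vec = (0.95, 0.05, 0.1)
```

Production question-answering adds many layers on top of this – chunking, re-ranking, answer synthesis – but nearest-neighbor search over embeddings is the retrieval backbone.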