Co-authored by Yoelvis Orozco Gonzalez and Kenneth Miller
A major challenge in finance has been finding efficient ways to translate qualitative information, obtained from unstructured company financial documents such as SEC filings, earnings call transcripts, and analyst reports, into quantitative information that can be used to develop predictive models and/or statistical analyses.
To this end, Intrinio has developed Thea, a Natural Language Processing (NLP) AI search and question answering engine capable of answering any question about companies using all SEC filings. In this article, we will walk through the process of using our new Thea-powered Company Answers API endpoint to answer questions about S&P 500 companies.
Although all this information may seem qualitative, we should remember that these answers are generated by mathematical models; therefore, by nature they encapsulate a lot of statistical information that can be extracted for further use in quantitative analysis. This statement will make more sense in the next sections, when we describe the vectorization of the answers this API endpoint returns. We will also show one of the many possible applications of this powerful tool by using principal component analysis (PCA) and cluster analysis to group the S&P 500 companies based on specific topics (or questions).
We will first briefly describe the new Company Answers API endpoint and how to use our SDK to ask a single question about Apple Inc. Then, we will provide an overview of some basic sentence embedding (or vectorization) concepts and vectorize the first answer returned by the API. Next, we will ask seven interesting questions for each company in the S&P 500. Finally, we will vectorize all the answers and perform a principal component and cluster analysis to show one of the possible ways of quantifying all this information.
The full Python code is available in this file.
Once complete, we will have all the basic tools necessary to utilize this new API, needing only our own creativity to further explore this very unique source of data.
“Company Answers” is a new Intrinio API endpoint. Using the Company Answers API, we can get answers to any question for a company - for instance: “What are the company's diversity and inclusion challenges?” or, “What is the company's business strategy?” In these cases, the API is going to provide a set of answers based on all SEC filings historically reported by the company.
Before getting started, we need to install the Intrinio SDK, which is available for several programming languages. For this article, we will install the latest SDK version for Python (SDK 5.12.0) using pip. We must also install the torch, transformers and kneed modules for further use:
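The installation steps above can be run with pip, for example (pinning the SDK version mentioned in this article):

```shell
# Install the Intrinio Python SDK (version referenced in this article)
pip install intrinio-sdk==5.12.0

# Install the additional modules used later for embeddings and elbow detection
pip install torch transformers kneed
```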
Once the intrinio-sdk module has been properly installed, we can start by asking the following question about Apple Inc.: “What is the company's business strategy?”
Note: Be sure to provide your Intrinio API key (“YOUR_API_KEY” in the code). You can get a free 7-day trial by contacting Intrinio’s sales team.
Please note in this example that we are providing the company ticker 'AAPL' for Apple Inc. You can also use the company's CIK, LEI, common name, or Intrinio ID.
Running the previous example, a set of answers is generated in JSON format. For simplicity, only the first two out of the 15 answers generated are shown below:
The Company Answers API response provides the following data:
- 'accession_number': The unique SEC code of the document from which the answer was extracted. Note, for instance, that the first answer is generated from information found in several documents.
- 'ticker': The company's trading symbol.
- 'report_type': The type of SEC filing - 10-K, 10-Q, etc.
- 'filing_url': The URL of the source document.
Let’s clean up our output to display only the answer portion of the API response...
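A minimal sketch of this cleanup step is shown below. The helper name `extract_answers` and the sample response values are our own illustrative assumptions, based on the response fields described above; check the field names against the actual API response you receive.

```python
def extract_answers(response):
    """Keep only the answer text from a Company Answers response dict."""
    return [item["answer"] for item in response.get("answers", [])]


# Illustrative response shape, based on the fields described above
sample_response = {
    "answers": [
        {
            "answer": (
                "building and expanding its own retail and online stores and "
                "its third-party distribution network..."
            ),
            "ticker": "AAPL",
            "report_type": "10-K",
        },
    ]
}

for text in extract_answers(sample_response):
    print(text)
```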
Now this is a truly interesting dataset! Let’s see what we can do with it...
In the last example, we obtained 15 answers to the question “What is the company's business strategy?” for Apple Inc. If we ask the same question for another company, like JPMorgan Chase, we are going to get very different answers compared to those generated for Apple.
In order to use this information quantitatively, we can generate embedding vectors for our answers. Each answer will be represented as a vector in a multidimensional space (a sentence embedding), in such a way that each component of the vector indicates a weight along a dimension of meaning, taking the semantics of the sentence into account.
NLP literature discusses several sentence embedding models, such as Doc2Vec, SentenceBERT, Universal Sentence Encoder, etc., but in this article we are going to use a model that we’ve found performs particularly well. Specifically, we will use a “Deep Contrastive Learning for Unsupervised Textual Representations” (DeCLUTR) pre-trained model on HuggingFace. This is a transformer-based language model that does not use any labeled data. Instead, it designs a self-supervised objective for learning universal sentence embeddings, inspired by the recent advances in deep metric learning. We recommend this article to better understand DeCLUTR.
Using DeCLUTR, let’s generate an embedding vector from our first example answer:
"building and expanding its own retail and online stores and its third-party distribution network to effectively reach more customers and provide them with a high-quality sales and post-sales support experience"
The following code is intended to run on GPU:0, as indicated by the “.to('cuda:0')” attribute in the model and inputs definitions. If there is no cuda GPU on your system, it can be run on the CPUs by removing this attribute - though, it will take quite some time and you’d be better off using a smaller, more performant model for embedding vector generation.
Running this example generates a 768-dimensional vector from the answer text input as shown below…
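The embedding step can be sketched as below. This is a minimal sketch, not the article's exact code: we assume the publicly available johngiorgi/declutr-base checkpoint on HuggingFace and mean pooling over token embeddings, and we name the class `TextEmbeddingVector` to match the reference later in this article. Pass `device='cuda:0'` to run on a GPU, as described above.

```python
import numpy as np


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over the sequence, ignoring padding tokens.

    token_embeddings: (seq_len, dim) array; attention_mask: (seq_len,) of 0/1.
    """
    mask = np.asarray(attention_mask, dtype=float)[:, None]
    emb = np.asarray(token_embeddings, dtype=float)
    return (emb * mask).sum(axis=0) / mask.sum()


class TextEmbeddingVector:
    """Sentence embedder sketch built on the DeCLUTR-base checkpoint."""

    def __init__(self, model_name="johngiorgi/declutr-base", device=None):
        # torch/transformers are imported lazily so the pure pooling helper
        # above can be used on its own
        from transformers import AutoModel, AutoTokenizer

        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.device = device
        if device is not None:
            self.model = self.model.to(device)

    def embed(self, text):
        """Return a 768-dimensional embedding vector for a text answer."""
        import torch

        inputs = self.tokenizer(text, return_tensors="pt", truncation=True)
        if self.device is not None:
            inputs = {k: v.to(self.device) for k, v in inputs.items()}
        with torch.no_grad():
            output = self.model(**inputs)
        tokens = output.last_hidden_state[0].cpu().numpy()
        mask = inputs["attention_mask"][0].cpu().numpy()
        return masked_mean_pool(tokens, mask)
```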
We now have the means to turn our text answers into numeric representations we can work with quantitatively. Let’s get some more answers to work with!
In the next example, we will ask the following seven questions for all companies in the S&P 500:
1. "What is the company's business strategy?"
2. "What does the company believe in?"
3. "What are the company's competitive advantages?"
4. "What are the values underpinning the company's culture?"
5. "What are the company's diversity and inclusion challenges?"
6. "What are the company's environmental sustainability challenges?"
7. "What are the company's human rights challenges?"
We will provide the 500 CIKs for the S&P 500 and generate answers to each of our seven questions for each company. Running this code will generate a JSON file for each question (Q_1.json to Q_7.json), containing the corresponding answers for each company. It can take about 20 minutes for this code to run to completion.
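The generation loop can be sketched as follows. The question list comes from above; `ask_company` stands in for a hypothetical wrapper around the Company Answers call (it should return a list of answer strings for one identifier and one question), and the JSON-writing helper mirrors the per-question files described in the text.

```python
import json

QUESTIONS = [
    "What is the company's business strategy?",
    "What does the company believe in?",
    "What are the company's competitive advantages?",
    "What are the values underpinning the company's culture?",
    "What are the company's diversity and inclusion challenges?",
    "What are the company's environmental sustainability challenges?",
    "What are the company's human rights challenges?",
]


def save_answers(answers_by_company, path):
    """Write one question's answers ({company: [answer, ...]}) to a JSON file."""
    with open(path, "w") as f:
        json.dump(answers_by_company, f, indent=2)


def build_question_files(identifiers, ask_company):
    """Generate Q_1.json ... Q_7.json, one file per question.

    ask_company(identifier, question) -> list of answer strings (API wrapper).
    """
    for i, question in enumerate(QUESTIONS, start=1):
        answers = {ident: ask_company(ident, question) for ident in identifiers}
        save_answers(answers, f"Q_{i}.json")
```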
Below is a truncated sample of the generated Q_1.json file. Each file is structured in JSON format with a company name followed by an array containing all the corresponding question answers.
Now that we have an abundance of answers, let’s dig into some interesting analysis...
In this final section we are going to use our previous work to help us cluster the S&P 500 companies based on their answers to specific questions (or topics). We are going to take the following steps for each of our seven questions:
1. Read and vectorize all the answers.
2. Compute a representative average vector for each company.
3. Perform a principal component analysis (PCA).
4. Perform a K-means cluster analysis.
For simplicity, here we are going to run through the details just for the first question:
“What is the company's business strategy?”
A common difficulty in PCA and cluster analysis is that the number of principal components and the number of clusters must be chosen a priori. Depending on the nature of the data, these parameters can sometimes be estimated in advance; in our case, it is difficult to know them beforehand. Therefore, in order to find the optimal number of principal components and clusters to fit our data, the last two steps previously described will be computed together in a cross-validation loop, where both parameters vary from 1 to 20.
In Example 4 above, we generated a set of JSON files containing answers to our seven questions for each of the S&P 500 companies. Now we are going to read from the Q_1.json file and vectorize all of our answers. To do this, we are going to use the TextEmbeddingVector class previously defined in Example 3:
Note that we are using the conditional “if len(data[company]) != 0” - in some cases, a company may have had no answers returned by the API for a given question. The result of running this code is a companies_vec dictionary containing the 768-dimensional vectors representing each answer for each company:
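This step can be sketched as below, using the same empty-answer guard described above. Here `embedder` stands in for a `TextEmbeddingVector`-style object exposing an `embed(text)` method; the function name is our own.

```python
import json


def vectorize_answers(path, embedder):
    """Embed every answer in a Q_*.json file, skipping companies with no answers."""
    with open(path) as f:
        data = json.load(f)
    companies_vec = {}
    for company in data:
        if len(data[company]) != 0:  # some companies return no answers
            companies_vec[company] = [
                embedder.embed(answer) for answer in data[company]
            ]
    return companies_vec
```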
Representative Answer Vector
The number of answers provided by the API per company can vary from zero to about 15, depending on the amount of information the API finds for a given question. Since each answer is represented by a 768-dimensional vector, we can use a simple “average vector” of its answers to represent each company; this average vector is what we will feed into the following principal component analysis (PCA) and clustering. It is worth noting that other aggregations, such as a distance-weighted average, could also be valid.
The following example is a continuation of Example 5 to compute the average vector representing each company. Note that this snippet uses the companies_vec dictionary generated in Example 5. We are using random_state = 42 when defining PCA to be able to reproduce the results.
As a result of running this example, the 2D array “average_vectors” will be generated. Its first dimension corresponds to the number of companies (500), while its second dimension holds the 768-dimensional average vector generated for each company. This “average_vectors” array will be used next in the PCA and cluster analysis.
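The averaging step can be sketched with NumPy as below; `companies_vec` is the dictionary of per-answer vectors built in the previous step, and the function name is our own.

```python
import numpy as np


def compute_average_vectors(companies_vec):
    """Return (company names, 2D array of per-company average answer vectors)."""
    names = list(companies_vec)
    average_vectors = np.vstack(
        [np.mean(companies_vec[name], axis=0) for name in names]
    )
    return names, average_vectors
```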
A PCA and K-means cluster analysis will be performed in this section to group the companies based on their average representative vectors. A cross-validation will be performed, varying the number of principal components and the number of clusters from 1 to 20. The optimal combination will be the one giving the most pronounced “elbow” point.
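The cross-validation loop can be sketched with scikit-learn as below. This is a sketch under the assumptions stated in the text: SSE is taken as the K-means inertia, each curve is normalized by its maximum for comparison, and the function name is our own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def sse_curves(X, max_comp=20, max_clusters=20, random_state=42):
    """For each number of principal components, compute the normalized
    K-means SSE (inertia) as a function of the number of clusters."""
    curves = {}
    for n_comp in range(1, max_comp + 1):
        # Project the average vectors onto the first n_comp components
        Xp = PCA(n_components=n_comp, random_state=random_state).fit_transform(X)
        sse = np.array(
            [
                KMeans(n_clusters=k, random_state=random_state, n_init=10)
                .fit(Xp)
                .inertia_
                for k in range(1, max_clusters + 1)
            ]
        )
        curves[n_comp] = sse / sse.max()  # normalize for proper comparison
    return curves
```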
To illustrate, the following graphic shows the cross-validation for Q1. Each curve is the sum of squared errors (SSE) of each cluster relative to its corresponding center, as a function of the number of clusters, while the colors indicate the number of principal components considered. The “elbow point,” the point of maximum curvature, normally indicates the optimal number of clusters. For instance, it can be observed that the optimal elbow point among all curves is point 3 on the blue curve (which corresponds to 1 principal component). Therefore, the best clustering representation for Q1 is described by 1 principal component and 3 clusters. It is worth highlighting that the SSE curves are normalized for proper comparison.
To automate this procedure end-to-end, we propose an analytical solution based on computing the elbow point of each curve.
The curvature at the elbow point for each curve can be determined by a simple numerical calculation of the second derivative:
y''[elbow] = y[elbow + 1] + y[elbow - 1] - 2*y[elbow]
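This second-derivative calculation can be sketched as follows; we assume `sse[i]` holds the normalized SSE for `i + 1` clusters and take the elbow as the interior point of maximum curvature. (The `kneed` package installed earlier offers a `KneeLocator` for the same purpose; the pure NumPy version below mirrors the formula above.)

```python
import numpy as np


def elbow_curvature(sse):
    """Discrete second derivative y[i+1] + y[i-1] - 2*y[i] at interior points."""
    y = np.asarray(sse, dtype=float)
    return y[2:] + y[:-2] - 2.0 * y[1:-1]


def elbow_point(sse):
    """Cluster count at the point of maximum curvature (sse[i] is for i+1 clusters)."""
    # interior index j maps back to original index j + 1, i.e. j + 2 clusters
    return int(np.argmax(elbow_curvature(sse))) + 2
```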
The visualization of the principal components and the cross-validation for Q1 can be computed using the following example. It will generate a bar plot showing the percentage of the original variance explained by each principal component, as well as the optimal combination of principal components and clusters (n_comp, n_clusters) to fit our data.
As shown below, the first three to four components in this case explain most of the variance of the data for our first question.
There are some questions in which just the first or the first two components represent most of the variance of the data.
The optimal combination of principal components and clusters obtained from the cross-validation will now be used to perform our final cluster analysis. The example below is a continuation of the previous example; it generates clusters of companies based on their average representative vector for each question.
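The final fit can then be sketched by plugging the optimal pair (for Q1, n_comp = 1 and n_clusters = 3) into PCA and K-means; the function name is our own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def cluster_companies(average_vectors, n_comp, n_clusters, random_state=42):
    """Project onto n_comp principal components, then return the projected
    coordinates and the K-means cluster label of each company."""
    projected = PCA(n_components=n_comp, random_state=random_state).fit_transform(
        average_vectors
    )
    labels = KMeans(
        n_clusters=n_clusters, random_state=random_state, n_init=10
    ).fit_predict(projected)
    return projected, labels
```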
If we have a closer look at the blue cluster, we can find something remarkable. The rightmost set of four companies are Goldman Sachs Group, Inc., Fifth Third Bancorp, Citigroup Inc. and Western Union Co. which clearly belong to the financial sector and have a lot in common.
Although the analytical results indicate 1 PC and 3 clusters as the optimal combination, other possibilities can be explored. The next section will show some cases in which it is worth exploring other combinations, using the analytical solutions described in this section as a reference.
The full Python code is available in this file.
The same methodology described for our first question is applied to the rest of our questions:
As can be observed, the analytical solution suggests using 1 PC and 3 clusters, but in this case we can clearly see a small blue cluster separated from the rest. Therefore, if we try using 3 PCs and 4 clusters, this is what we obtain:
Perhaps this is a better representation of the data for this specific topic.
In this case we find a similar situation, in which the analytical solution suggests 1 PC and 2 clusters to characterize our data, but perhaps using 3 PCs and 3 clusters is a better representation.
From the previous PCA and cluster analysis we can draw a few conclusions. For instance, several questions are primarily represented by the first principal component of the answers. In these cases, we can say that there is some specific characteristic, tied to the given question, that is distinctly representative of each company.
What is this specific characteristic? Further exploration may entail an examination of the interpretability of these principal components. Perhaps correlating these clusters with other types of well-defined data, or correlating clusters generated by different questions will lead to remarkable results. We are looking forward to seeing a variety of cool applications powered by Thea and the new Intrinio Company Answers API!