How to Use the Intrinio Answers API

Yoelvis Orozco Gonzalez
July 16, 2021

Co-authored by Yoelvis Orozco Gonzalez and Kenneth Miller

A major challenge in finance has been finding efficient ways to translate qualitative information from unstructured company documents, such as SEC filings, earnings call transcripts, and analyst reports, into quantitative information that can feed predictive models and statistical analysis.

To this end, Intrinio has developed Thea, a Natural Language Processing (NLP) AI search and question answering engine capable of answering any question about companies using all SEC filings. In this article, we will walk through the process of using our new Thea-powered Company Answers API endpoint to answer questions about S&P 500 companies. 

Although all this information may seem qualitative, we should remember that these answers are generated by mathematical models; by nature, they encapsulate a lot of statistical information that can be extracted for quantitative analysis. This statement will make more sense in the next sections, when we describe the vectorization of the answers this API endpoint returns. We will also show one of the many possible applications of this tool by using principal component analysis (PCA) and cluster analysis to group the S&P 500 companies based on specific topics (or questions).

We will first briefly describe the new Company Answers API endpoint and how to use our SDK to ask a single question about Apple Inc. Then, we will provide an overview of some basic sentence embedding (or vectorization) concepts and we will vectorize the first answer returned by the API. Next, we will ask seven interesting questions for each company in the S&P 500. Finally, we are going to vectorize all the answers and propose a principal components and cluster analysis to show one of the possible ways of quantifying all this information.

The full Python code is available in this file.

Once complete, we will have all the basic tools necessary to use this new API, needing only our own creativity to further explore this unique source of data.

Accessing the Intrinio Company Answers API with Our Python SDK

“Company Answers” is a new Intrinio API endpoint. Using the Company Answers API, we can get answers to any question about a company, for instance: “What are the company's diversity and inclusion challenges?” or “What is the company's business strategy?” In these cases, the API provides a set of answers based on all SEC filings historically reported by the company.

Before getting started, we need to install the Intrinio SDK, which is available for several programming languages. For this article, we will install the latest SDK version for Python (SDK 5.12.0) using pip. We must also install the torch, transformers and kneed modules for further use:

pip install intrinio_sdk torch transformers kneed

Once the intrinio-sdk module has been properly installed, we can start by asking the following question about Apple Inc.: “What is the company's business strategy?” 

Example 1
from __future__ import print_function
import time
import intrinio_sdk as intrinio
from intrinio_sdk.rest import ApiException
import json

import torch
from transformers import AutoTokenizer, AutoModel

# Enter your API key here
api_key = "YOUR_API_KEY"

intrinio.ApiClient().set_api_key(api_key)
intrinio.ApiClient().allow_retries(True)

company_identifier = 'AAPL'
query = "What is the company's business strategy?"
company_api = intrinio.CompanyApi()
response = company_api.get_company_answers(company_identifier, query)
print(response)

Note: Be sure to provide your Intrinio API key (“YOUR_API_KEY” in the code). You can get a free 7-day trial by contacting Intrinio’s sales team.

Please note that in this example we are providing the company ticker 'AAPL' for Apple Inc. You can also use the company CIK, LEI, common name, or Intrinio ID.

Running the previous example, a set of answers is generated in JSON format. For simplicity, only the first two out of the 15 answers generated are shown below:


{'answers': [{'answer': 'building and expanding its own retail and online '
                        'stores and its third-party distribution network to '
                        'effectively reach more customers and provide them '
                        'with a high-quality sales and post-sales support '
                        'experience',
              'content': 'Therefore, the Company’s strategy also includes '
                         'building and expanding its own retail and online '
                         'stores and its third-party distribution network to '
                         'effectively reach more customers and provide them '
                         'with a high-quality sales and post-sales support '
                         'experience. The Company believes ongoing investment '
                         'in research and development (“R&D”), marketing and '
                         'advertising is critical to the development and sale '
                         'of innovative products, services and technologies. '
                         'Business Seasonality and Product Introductions The '
                         'Company has historically experienced higher net '
                         'sales in its first quarter compared to other '
                         'quarters in its fiscal year due in part to seasonal '
                         'holiday demand. Additionally, new product '
                         'introductions can significantly impact net sales, '
                         'product costs and operating expenses. Product '
                         'introductions can also impact the Company’s net '
                         'sales to its indirect distribution channels as these '
                         'channels are filled with new product inventory '
                         'following a product introduction, and channel '
                         'inventory of a particular product often declines as '
                         'the next related major product launch approaches. '
                         'Net sales can also be affected when consumers and '
                         'distributors anticipate a product introduction. '
                         'However, neither historical seasonal patterns nor '
                         'historical patterns of product introductions should '
                         'be considered reliable indicators of the Company’s '
                         'future pattern of product introductions, future net '
                         'sales or financial performance. The Company’s fiscal '
                         'year is the 52- or 53-week period that ends on the '
                         'last Saturday of September.',
              'source_documents': [{'id': 'a7107b50-d902-41f5-9085-ddd3a2d3d3c2',
                                    'tags': [{'key': 'accession_number',
                                              'value': '0000320193-18-000007'},
                                             {'key': 'ticker', 'value': 'AAPL'},
                                             {'key': 'filing_url',
                                              'value': 'https://www.sec.gov/Archives/edgar/data/320193/000032019318000007/0000320193-18-000007-index.htm'},
                                             {'key': 'report_type',
                                              'value': '10-Q'}]},
                                   {'id': '603f958b-77f7-49c7-9e9a-15d61a2e57e0',
                                    'tags': [{'key': 'accession_number',
                                              'value': '0000320193-18-000070'},
                                             {'key': 'ticker', 'value': 'AAPL'},
                                             {'key': 'filing_url',
                                              'value': 'https://www.sec.gov/Archives/edgar/data/320193/000032019318000070/0000320193-18-000070-index.htm'},
                                             {'key': 'report_type',
                                              'value': '10-Q'}]},
                                   {'id': 'f23c29db-568c-4f43-862a-0119a9e06c62',
                                    'tags': [{'key': 'accession_number',
                                              'value': '0000320193-18-000100'},
                                             {'key': 'ticker', 'value': 'AAPL'},
                                             {'key': 'filing_url',
                                              'value': 'https://www.sec.gov/Archives/edgar/data/320193/000032019318000100/0000320193-18-000100-index.htm'},
                                             {'key': 'report_type',
                                              'value': '10-Q'}]}]},
             {'answer': 'focusing on key measures of profitability and the '
                        'creation of shareholder value',
              'content': 'It reflects the unparalleled size, scope, and '
                         'success of Apple’s business and the importance of '
                         'our executive officers operating as a '
                         'high-performing team, while focusing on key measures '
                         'of profitability and the creation of shareholder '
                         'value. Net sales and operating income for 2017 were '
                         '$229.2 billion and $61.3 billion, respectively, and '
                         'year-over-year our stock price increased 36.7%. We '
                         'believe the compensation paid to our named executive '
                         'officers for 2017 appropriately reflects and rewards '
                         'their contributions to our performance. This '
                         'Compensation Discussion and Analysis explains the '
                         '2017 compensation program for our named executive '
                         'officers and the guiding principles and practices '
                         'upon which it is based. Tim Cook, Luca Maestri, '
                         'Angela Ahrendts, Johny Srouji, Dan Riccio, and Bruce '
                         'Sewell were our named executive officers for 2017. '
                         'In October 2017, Apple announced that Mr. Sewell '
                         'would retire from Apple, effective at the end of the '
                         'calendar year. Our executive compensation program '
                         'attracts, motivates, and retains a talented, '
                         'entrepreneurial, and creative team of executives to '
                         'provide leadership for Apple’s success in dynamic '
                         'and competitive markets. We have a '
                         'pay-for-performance philosophy for executive '
                         'compensation based on the following principles: '
                         'Team-Based Approach.',
              'source_documents': [{'id': 'a1cb4669-0f92-42c3-8400-36835fed099f',
                                    'tags': [{'key': 'report_type',
                                              'value': 'DEF 14A'},
                                             {'key': 'ticker', 'value': 'AAPL'},
                                             {'key': 'filing_url',
                                              'value': 'https://www.sec.gov/Archives/edgar/data/320193/000119312517380130/0001193125-17-380130-index.htm'},
                                             {'key': 'accession_number',
                                              'value': '0001193125-17-380130'}]}]},
…}

The Company Answers API response provides the following data:

  • 'answer': These are the Thea answers. For instance, the first one states: “building and expanding its own retail and online stores and its third-party distribution network to effectively reach more customers and provide them with a high-quality sales and post-sales support experience.”
  • 'content': A larger portion of the SEC filing from which the answer was extracted.
  • 'source_documents': The filings from which the answer was extracted. Each source document is described by several tags:

             - 'accession_number': Unique SEC code for the document from which the answer was extracted. Note, for instance, that the first answer above is drawn from several documents.
             - 'ticker': Trading symbol of the company.
             - 'report_type': The type of SEC filing (10-K, 10-Q, DEF 14A, etc.).
             - 'filing_url': The URL of the source document.
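
Because each source document carries its metadata as a list of key/value tags, it can be convenient to flatten those tags into a plain dictionary. The following is a minimal sketch that operates on the dictionary form of the response printed above; the helper name tags_to_dict and the trimmed sample data are ours, for illustration only, and are not part of the SDK.

```python
# Trimmed sample mirroring the 'source_documents' structure printed above.
source_documents = [
    {
        "id": "a1cb4669-0f92-42c3-8400-36835fed099f",
        "tags": [
            {"key": "report_type", "value": "DEF 14A"},
            {"key": "ticker", "value": "AAPL"},
            {"key": "accession_number", "value": "0001193125-17-380130"},
        ],
    }
]


def tags_to_dict(document):
    """Collapse a document's list of {'key', 'value'} tags into one dict (hypothetical helper)."""
    return {tag["key"]: tag["value"] for tag in document["tags"]}


for document in source_documents:
    tags = tags_to_dict(document)
    print(tags["report_type"], tags["accession_number"])
```

With the tags flattened this way, filtering answers by filing type (for example, keeping only those backed by a 10-K) becomes a simple dictionary lookup.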

Let’s clean up our output to display only the answer portion of the API response...

Example 2
intrinio.ApiClient().set_api_key(api_key)
intrinio.ApiClient().allow_retries(True)

company_identifier = 'AAPL'
query = "What is the company's business strategy?"
company_api = intrinio.CompanyApi()
response = company_api.get_company_answers(company_identifier, query)

for answer in response.answers:
    print(answer.answer)
    print('-------------------------------------------------------------------')

building and expanding its own retail and online stores and its third-party distribution network to effectively reach more customers and provide them with a high-quality sales and post-sales support experience
--------------------------------------------------------
focusing on key measures of profitability and the creation of shareholder value
--------------------------------------------------------
expanding the apple ecosystem through the development of integrated and interoperable products
--------------------------------------------------------
to provide its customers products and solutions that are integrated and interoperable
--------------------------------------------------------
leverages its unique ability to design and develop its own operating systems, hardware, application software and services to provide its customers products and solutions with innovative design, superior ease-of-use and seamless integration
--------------------------------------------------------
dynamic and competitive markets
--------------------------------------------------------
leverages its unique ability to design and develop its own operating systems, hardware, application software, and services to provide its customers new products and solutions with superior ease-of-use, seamless integration, and innovative design
--------------------------------------------------------
to control the design and development of the hardware and software for all of its products
--------------------------------------------------------
leverages its unique ability to design and develop its own operating systems, hardware, application software, and services to provide its customers new products and solutions with superior ease-of-use, seamless integration, and innovative industrial design
--------------------------------------------------------
primarily on a geographic basis
--------------------------------------------------------
acquisitions
--------------------------------------------------------
leverages its unique ability to design and develop its own operating system, hardware, application software, and services to provide its customers new products and solutions with superior ease-of-use, seamless integration, and innovative industrial design
--------------------------------------------------------
to design and develop innovative products and services that are interoperable
--------------------------------------------------------
to bring to its customers around the world compelling new products and solutions with superior ease-of-use, seamless integration, and innovative industrial design
--------------------------------------------------------
business success through maintenance of the highest standards of responsibility and ethics
--------------------------------------------------------

Now this is a truly interesting dataset! Let’s see what we can do with it...

Generating Embedding Vectors for Our Answers

In the last example, we obtained 15 answers to the question “What is the company's business strategy?” for Apple Inc. If we ask the same question about another company, such as Chase Bank, we will get very different answers.

In order to use this information quantitatively, we can generate embedding vectors for our answers. Each answer will be represented in a multidimensional space (or sentence embedding) as a vector, in such a way that each component of this vector indicates a weight along a dimension of meaning, taking into account the semantics of the sentence. 
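
To make the idea of "meaning as geometry" concrete, here is a small sketch, under our own assumptions, of how two embedding vectors can be compared with cosine similarity. The three-dimensional vectors and their names are made up for illustration; real sentence embeddings have hundreds of dimensions and come from a model, not from hand-written numbers.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1 means similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy "embeddings" standing in for three answers
retail_strategy = [0.9, 0.1, 0.2]
store_expansion = [0.8, 0.2, 0.1]
lending_strategy = [0.1, 0.9, 0.3]

# The first pair points in a similar direction, the second pair does not
print(cosine_similarity(retail_strategy, store_expansion))
print(cosine_similarity(retail_strategy, lending_strategy))
```

Answers with similar meaning should end up with vectors pointing in similar directions, which is what makes distance-based techniques such as clustering applicable to text.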

NLP literature discusses several sentence embedding models, such as Doc2Vec, SentenceBERT, and Universal Sentence Encoder, but in this article we are going to use a model that we’ve found performs particularly well. Specifically, we will use a pre-trained “Deep Contrastive Learning for Unsupervised Textual Representations” (DeCLUTR) model available on Hugging Face. This is a transformer-based language model trained without any labeled data. Instead, it uses a self-supervised objective for learning universal sentence embeddings, inspired by recent advances in deep metric learning. We recommend this article to better understand DeCLUTR.

Using DeCLUTR, let’s generate an embedding vector from our first example answer: 

"building and expanding its own retail and online stores and its third-party distribution network to effectively reach more customers and provide them with a high-quality sales and post-sales support experience"

The following code is intended to run on GPU 0, as indicated by the “.to('cuda:0')” calls on the model and on the inputs. If there is no CUDA GPU on your system, you can run it on the CPU by removing these calls (you may also need to drop “.half()”, since half precision is not fully supported on the CPU in every PyTorch version). Expect it to take considerably longer, though; on a CPU you would be better off using a smaller, faster model to generate the embedding vectors.
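
As an alternative to editing the code by hand, the device can be chosen at run time. The snippet below is a sketch of that idea, not part of the original example: the variable names are ours, and tying half precision to the presence of a GPU is an assumption about what is sensible rather than a requirement.

```python
# Pick a device at run time instead of hard-coding 'cuda:0'.
try:
    import torch

    cuda_available = torch.cuda.is_available()
except ImportError:
    # torch is not installed; fall back to CPU so the snippet still runs
    cuda_available = False

device = "cuda:0" if cuda_available else "cpu"
# Assumption: half precision is only enabled when a GPU is present
use_half_precision = cuda_available

print(device, use_half_precision)
```

The model and the tokenized inputs could then be moved with “.to(device)”, calling “.half()” on the model only when use_half_precision is true.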

Example 3
class TextEmbeddingVector:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-base", use_fast=False)
        self.model = AutoModel.from_pretrained("johngiorgi/declutr-base")
        self.model.to('cuda:0').half()

    def get(self, text):
        inputs = self.tokenizer(text, padding=True, add_special_tokens=True, truncation=True, return_tensors="pt").to('cuda:0')

        # Embed the text
        with torch.no_grad():
            sequence_output, _ = self.model(**inputs, output_hidden_states=False, return_dict=False)

        # Mean pool the token-level embeddings to get sentence-level embeddings
        embeddings = torch.sum(
            sequence_output * inputs["attention_mask"].unsqueeze(-1), dim=1
        ) / torch.clamp(torch.sum(inputs["attention_mask"], dim=1, keepdims=True), min=1e-9)

        return embeddings.cpu().numpy()

answer = "building and expanding its own retail and online stores and its third-party \
distribution network to effectively reach more customers and provide them \
with a high-quality sales and post-sales support experience"

answer_vector = TextEmbeddingVector()
print(answer_vector.get([answer]))

Running this example generates a 768-dimensional vector from the answer text input as shown below…

[ 1.1969e-01  1.9617e-01  9.5215e-02 -4.4495e-02  4.0436e-02 -2.5659e-01
  2.1252e-01 -2.1472e-01  8.3313e-02  2.4780e-01 -6.3965e-02  1.1780e-01
  5.7556e-02  1.9458e-01 -4.3945e-02 -7.2388e-02  1.0999e-01  1.4807e-01
  …
  1.4648e-01  1.6150e-01  2.5171e-01 -5.1611e-01 -4.3854e-02  3.2178e-01]
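
The mean-pooling step in Example 3 is the part that turns one vector per token into one vector per sentence. The following plain-Python sketch reproduces that arithmetic on made-up two-dimensional token vectors so it can be checked by hand; the function name and the data are ours, and the real code does the same thing with tensors.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding tokens."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_vectors, attention_mask):
        for i in range(dim):
            totals[i] += vector[i] * keep
    # Guard against division by zero, as the clamp in Example 3 does
    count = max(sum(attention_mask), 1e-9)
    return [total / count for total in totals]


# Two real tokens followed by one padding token (mask value 0)
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

print(masked_mean_pool(token_vectors, attention_mask))  # [2.0, 3.0]
```

The padding token's large values do not affect the result, which is exactly why the attention mask is multiplied in before summing.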

We now have the means to turn our text answers into numeric representations we can work with quantitatively. Let’s get some more answers to work with! 

Asking Questions for All Companies in the S&P 500 

In the next example, we will ask the following seven questions for all companies in the S&P 500:

1. "What is the company's business strategy?"

2. "What does the company believe in?"

3. "What are the company's competitive advantages?"

4. "What are the values underpinning the company's culture?"

5. "What are the company's diversity and inclusion challenges?"

6. "What are the company's environmental sustainability challenges?"

7. "What are the company's human rights challenges?"

We will provide the 500 CIKs for the S&P 500 and generate answers to each of our seven questions for each company. Running this code will generate a JSON file for each question (Q_1.json to Q_7.json), containing the corresponding answers for each company. It can take about 20 minutes for this code to run to completion.

Example 4
intrinio.ApiClient().set_api_key(api_key)
intrinio.ApiClient().allow_retries(True)

# Central Index Key or CIK number of Each Company in the S&P 500
ciks = [


'0000008670', '0000789570', '0001374310', '0000831001', '0000054480', '0000108772', 
'0000037785', '0001373715', '0000005513', '0001551182', '0000011544', '0000818479', 
'0000701985', '0001091667', '0000016732', '0000039911', '0000106640', '0000875320', 
'0000075362', '0000091576', '0000100885', '0000018926', '0001065280', '0001137774', 
'0000021665', '0000352541', '0000101778', '0001637459', '0000895421', '0000073124', 
'0001156375', '0001652044', '0000024741', '0001300514', '0001037868', '0001513761', 
'0000821189', '0001063761', '0000882835', '0000074208', '0000718877', '0001020569', 
'0000051253', '0000920522', '0000815097', '0000813828', '0001598014', '0000715957', 
'0000072207', '0000109380', '0001059556', '0000038777', '0001688568', '0001624899', 
'0000915912', '0001037646', '0000023217', '0001038357', '0001283699', '0000062996', 
'0000091440', '0000783280', '0000045012', '0000021344', '0000027904', '0000075677', 
'0000764478', '0000865752', '0000092380', '0001034054', '0000093751', '0001396009', 
'0000827052', '0001137789', '0000034088', '0000056873', '0001013871', '0000879169', 
'0000719739', '0000713676', '0000107263', '0001126328', '0000702165', '0000915913', 
'0000899051', '0000078003', '0001164727', '0001095073', '0000080661', '0001679273', 
'0000029534', '0001585689', '0001113169', '0000883241', '0000910606', '0001035443', 
'0000822416', '0000021076', '0000016918', '0001297996', '0000915389', '0000035527', 
'0001032208', '0001041061', '0000020286', '0000077360', '0001732845', '0000310764', 
'0001071739', '0001324404', '0000872589', '0000100493', '0000899689', '0000063754', 
'0001108524', '0001701605', '0000070858', '0000093410', '0000874761', '0000896878', 
'0001015780', '0000004962', '0001121788', '0000858470', '0001141391', '0000202058', 
'0000014272', '0001492633', '0001043604', '0000779152', '0000100517', '0000051644', 
'0001065088', '0001711269', '0001136869', '0000877890', '0000072333', '0000072741', 
'0000936468', '0001060391', '0000006769', '0001623613', '0000731802', '0000052988', 
'0000711404', '0000829224', '0001403161', '0000065984', '0001045810', '0000010456', 
'0001065696', '0000092230', '0001174922', '0000788784', '0000006201', '0000927066', 
'0000815556', '0000833444', '0000882095', '0000899866', '0001048911', '0000002488', 
'0001437107', '0001040971', '0000313616', '0001285785', '0001136893', '0001555280', 
'0000946581', '0000012927', '0000707549', '0000200406', '0001326801', '0001037540', 
'0000793952', '0000055067', '0001051470', '0000798354', '0000008818', '0001130310', 
'0001067983', '0000783325', '0001001250', '0000882184', '0000004977', '0001286681', 
'0001336920', '0000050863', '0001140859', '0000064040', '0000024545', '0000875045', 
'0000859737', '0001267238', '0001002047', '0001601046', '0001596532', '0000104169', 
'0000773840', '0000740260', '0000096021', '0001755672', '0000059478', '0000796343', 
'0000726728', '0000029915', '0001031296', '0000058492', '0000004904', '0000026172', 
'0001604778', '0001099219', '0000014693', '0001364742', '0001326160', '0000849399', 
'0000031462', '0001666700', '0000766421', '0000005272', '0001633917', '0001086222', 
'0000723254', '0000877212', '0000084839', '0000072971', '0000277135', '0000749251', 
'0001393612', '0000059558', '0001551152', '0000310158', '0001418091', '0001175454', 
'0000906163', '0000320187', '0000896159', '0001058290', '0001101239', '0000049071', 
'0001707925', '0000068505', '0001306830', '0001478242', '0001123360', '0001378946', 
'0000858877', '0001039684', '0000943819', '0000063908', '0000106535', '0000885725', 
'0000046765', '0001166691', '0000746515', '0000029989', '0000033185', '0000820313', 
'0001645590', '0001744489', '0001748790', '0001613103', '0000109198', '0001383312', 
'0001075531', '0000927653', '0000731766', '0001047862', '0000097476', '0001289490', 
'0001601712', '0001730168', '0001324424', '0000943452', '0000832101', '0000034903', 
'0000097745', '0001012100', '0001158449', '0000080424', '0001585364', '0000012659', 
'0000029905', '0000315189', '0001138118', '0001390777', '0000920148', '0000908255', 
'0000047217', '0000051143', '0000027419', '0001000697', '0000813672', '0001393311', 
'0001156039', '0000010795', '0001501585', '0000004447', '0000898173', '0000318154', 
'0000051434', '0001097149', '0000031791', '0001571949', '0001140536', '0001410636', 
'0000743988', '0001090727', '0000073309', '0000916365', '0000720005', '0000879101', 
'0001336917', '0000093556', '0000030625', '0000086312', '0001163165', '0001111928', 
'0001103982', '0000060667', '0001260221', '0000743316', '0000046080', '0000764180', 
'0000914208', '0000320335', '0001022079', '0000049196', '0000789019', '0001099800', 
'0001335258', '0001058090', '0000811156', '0001564708', '0000927628', '0000091419', 
'0000753308', '0000048465', '0000217346', '0001521332', '0000096943', '0001133421', 
'0001035267', '0000066740', '0000723531', '0000001800', '0000745732', '0000036104', 
'0000851968', '0001021860', '0000018230', '0000884887', '0001781335', '0001739940', 
'0000874766', '0000860730', '0001035002', '0000920760', '0001519751', '0001070750', 
'0000831259', '0000936340', '0001341439', '0001024478', '0001002910', '0001510295', 
'0000087347', '0000827054', '0000313927', '0001000228', '0000728535', '0001053507', 
'0000320193', '0000101829', '0001278021', '0000002969', '0000019617', '0000079879', 
'0000040704', '0000076334', '0001783180', '0001101215', '0001506307', '0001534701', 
'0000040987', '0001539838', '0001262039', '0001413329', '0001442145', '0000866787', 
'0000315213', '0001408198', '0000766704', '0000092122', '0000836102', '0000912595', 
'0000820027', '0001067701', '0001024305', '0001132979', '0000006951', '0000721371', 
'0001170010', '0000006281', '0001120193', '0000047111', '0001001082', '0000103379', 
'0000765880', '0000814453', '0000072903', '0000098246', '0000060086', '0000106040', 
'0001110803', '0000900075', '0000049826', '0000764622', '0000723125', '0000906107', 
'0000007084', '0000916076', '0000769397', '0001050915', '0000732717', '0001090872', 
'0000315293', '0000077476', '0001467373', '0000354190', '0001403568', '0000354908', 
'0000732712', '0001037038', '0001045609', '0001048286', '0001018724', '0001043277', 
'0000203527', '0001014473', '0000316709', '0000009389', '0000040545', '0001365135', 
'0000886982', '0000823768', '0000712515', '0001467858', '0000797468', '0001402057', 
'0000319201', '0000874716', '0001618921', '0000277948', '0000091142', '0001659166', 
'0000352915', '0001090012', '0000885639', '0001466258', '0000922224', '0000922864', 
'0001681459', '0000815094', '0000354950', '0001385157', '0000055785', '0001111711', 
'0000804753', '0000089800', '0000004127', '0000935703', '0001013462', '0000759944', 
'0000064803', '0000878927', '0001048695', '0000940944', '0001359841', '0001590955', 
'0001754301', '0001358071', '0001524472', '0000028412', '0000032604', '0000078239', 
'0000909832', '0001093557', '0000037996', '0000062709', '0000040533', '0000804328', 
'0000036270', '0001281761', '0001109357', '0001579241', '0001489393', '0001116132', 
'0000048039', '0001318605'
]

# Only the first question is active; uncomment the others to also generate Q_2.json to Q_7.json
queries = [
    "What is the company's business strategy?", 
#    "What does the company believe in?",
#    "What are the company's competitive advantages?",
#    "What are the values underpinning the company's culture?",
#    "What are the company's diversity and inclusion challenges?",
#    "What are the company's environmental sustainability challenges?",
#    "What are the company's human rights challenges?"
]

company_api = intrinio.CompanyApi()

# Write one file per question: Q_1.json, Q_2.json, ...
for count, query in enumerate(queries, start=1):
    companies = {}
    for cik in ciks:
        try:
            response = company_api.get_company_answers(cik, query)
            company_name = company_api.get_company(cik).name
            # Keep only the short answer text of each result
            companies[company_name] = [answer.answer for answer in response.answers]
            print('%s: %s: %s' % ("----SUCCESS----", company_name, query))
        except Exception:
            # Report the CIK: company_name may be unset or stale when a request fails
            print('%s: %s: %s' % ("----ERROR------", cik, query))
    with open('Q_%s.json' % count, 'w') as f:
        json.dump(companies, f)

Below is a truncated sample of the generated Q_1.json file. Each file is a JSON object that maps each company name to an array containing all the corresponding answers.

{"Automatic Data Processing Inc": [
"to power organizations with insightful solutions that meet the changing needs of our clients and their employees",
…
"match a client\u2019s needs with the products and services that will best meet expectations"],
…
"Cboe Global Markets Inc": [
"to lead the industry in defining the markets of today and tomorrow"
…
"develop and market a multi-asset front-end order entry system"
]}
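
Before moving on, it can be useful to check how many answers each company received. The sketch below does this on a made-up miniature of a Q_n.json file (the company names and answers are placeholders, not API output); with a real file, json.load on the opened file would replace json.loads on the inline string.

```python
import json

# Made-up miniature of a Q_n.json file: company name -> list of answers
sample = '''{
  "Company A": ["to grow through acquisitions", "to expand internationally"],
  "Company B": [],
  "Company C": ["to reduce operating costs"]
}'''
data = json.loads(sample)

# Number of answers per company, and the companies that received none
answer_counts = {company: len(answers) for company, answers in data.items()}
without_answers = [company for company, n in answer_counts.items() if n == 0]

print(answer_counts)
print(without_answers)
```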

Now that we have an abundance of answers, let’s dig into some interesting analysis... 

Principal Components and Cluster Analysis

In this final section we are going to use our previous work to help us cluster the S&P 500 companies based on their answers to specific questions (or topics). We are going to take the following steps for each of our seven questions:

  • Vectorization: All the answers this API endpoint returns for each company will be vectorized.
  • Representative Answer Vector: Generate a representative answer vector for each company in the 768-dimensional embedding space.
  • Principal Components Analysis: Once each company is characterized by its representative answer, a PCA will be performed to denoise the data and reduce the dimensionality for proper visualization. 
  • Cluster Analysis: K-means cluster analysis will be performed to group companies based on the proximity of their representative answers.
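
To illustrate what "grouping by proximity" means in the last step, here is a toy sketch of K-means (Lloyd's algorithm) in plain Python on made-up two-dimensional points. It is not the implementation used for the analysis in this article, where a library implementation would normally be preferred; the function names, the fixed starting centroids, and the data are all assumptions chosen to keep the example deterministic.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm: alternate an assignment step and a centroid update."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [squared_distance(point, c) for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Update step: each centroid moves to the mean of its cluster
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append([sum(x) / len(cluster) for x in zip(*cluster)])
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid unchanged
        centroids = new_centroids
    return centroids, clusters


# Two well-separated groups of made-up points
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, clusters = kmeans(points, centroids=[[0.0, 0.0], [10.0, 10.0]])

print(clusters)
```

In the article's setting, each point would be a company's representative answer vector after PCA rather than a hand-picked pair of numbers.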

For simplicity, here we are going to run through the details just for the first question: 

“What is the company's business strategy?”

A common difficulty with PCA and cluster analysis is that the number of principal components and the number of clusters must be chosen a priori. For some kinds of data these quantities can be estimated in advance, but in our case they are hard to know. Therefore, to find the number of principal components and the number of clusters that best fit our data, the last two steps described above will be computed together in a cross-validation loop in which both quantities vary from 1 to 20.
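
The shape of such a loop can be sketched as follows. This is only an outline under stated assumptions: how each combination is scored is not specified at this point in the text, so the score function here is a made-up placeholder, and picking the highest score is just one possible selection rule.

```python
from itertools import product


def search_grid(score, max_components=20, max_clusters=20):
    """Try every (n_components, n_clusters) pair and keep the best-scoring one."""
    best_pair, best_score = None, float("-inf")
    pairs = product(range(1, max_components + 1), range(1, max_clusters + 1))
    for n_components, n_clusters in pairs:
        value = score(n_components, n_clusters)
        if value > best_score:
            best_pair, best_score = (n_components, n_clusters), value
    return best_pair, best_score


# Placeholder score, invented so the example runs; it peaks at (5, 8)
def toy_score(n_components, n_clusters):
    return -((n_components - 5) ** 2 + (n_clusters - 8) ** 2)


print(search_grid(toy_score))  # ((5, 8), 0)
```

In a real run, the placeholder would be replaced by a function that fits PCA and K-means for the given pair and returns a quality measure for the resulting clustering.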

Vectorization

In Example 4 above, we generated a set of JSON files containing answers to our seven questions for each of the S&P 500 companies. Now we are going to read from the Q_1.json file and vectorize all of our answers. To do this, we are going to use the TextEmbeddingVector class previously defined in Example 3:

Example 5
class TextEmbeddingVector:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-base", use_fast=False)
        self.model = AutoModel.from_pretrained("johngiorgi/declutr-base")
        self.model.to('cuda:0').half()

    def get(self, text):
        inputs = self.tokenizer(text, padding=True, add_special_tokens=True, truncation=True, return_tensors="pt").to('cuda:0')

        # Embed the text
        with torch.no_grad():
            sequence_output, _ = self.model(**inputs, output_hidden_states=False, return_dict=False)

        # Mean pool the token-level embeddings to get sentence-level embeddings
        embeddings = torch.sum(
            sequence_output * inputs["attention_mask"].unsqueeze(-1), dim=1
        ) / torch.clamp(torch.sum(inputs["attention_mask"], dim=1, keepdims=True), min=1e-9)

        return embeddings.cpu().numpy()

file = 'Q_1'
# The context manager closes the file even if loading fails
with open('%s.json' % file) as f:
    data = json.load(f)

companies_vectors = {}
text_embedding = TextEmbeddingVector()

for company in data:
    print(company)
    texts=[]
    if len(data[company]) != 0: 
        for answer in data[company]:
            texts.append(answer)
        answers_vectors = text_embedding.get(texts)
        companies_vectors[company] = answers_vectors

Note that we are using the conditional “if len(data[company]) != 0” because, in some cases, a company may have had no answers returned by the API for a given question. The result of running this code is a companies_vectors dictionary containing the 768-dimensional vectors representing each answer for each company:

{'Automatic Data Processing Inc': [
array([ 1.2537e-01,  1.3123e-01,  2.4506e-02, -5.4016e-02,  5.2002e-01,
        …
        1.2262e-01,  6.1768e-02, -3.8062e-01,  6.6223e-02,  4.3677e-01], dtype=float16),
array([ 1.4685e-01,  8.5815e-02, -7.9041e-02, -5.0690e-02,  4.9341e-01,
        …
        1.8250e-01, -2.0703e-01,  9.5337e-02, -2.7686e-01,  4.6692e-02], dtype=float16),
        …],
'MGM Resorts International': [
array([ 2.3157e-01,  3.5376e-01, -5.7220e-02,  4.4849e-01,  7.9736e-01,
        …
        3.0322e-01,  1.8091e-01, -2.4768e-01,  1.5222e-01, -2.1399e-01], dtype=float16),
array([-1.5405e-01,  9.9060e-02, -2.9392e-03,  9.5520e-02,  5.6152e-01,
        …
        3.0322e-01,  1.8091e-01, -2.4768e-01,  1.5222e-01, -2.1399e-01], dtype=float16),
        …],
        …
        }
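
Because companies with no answers are skipped, companies_vectors can hold fewer entries than the input file. The following is a minimal sketch for listing the skipped companies. It uses a small made-up dictionary in place of the contents of Q_1.json; the company names and answer strings are placeholders, not API output.

```python
# Toy stand-in for the dictionary loaded from Q_1.json:
# company name -> list of answer strings (placeholder values)
data = {
    "Company A": ["answer one", "answer two"],
    "Company B": [],
    "Company C": ["answer three"],
}

# Companies that the vectorization loop would skip (no answers returned)
skipped = [company for company, answers in data.items() if len(answers) == 0]

# Companies that would receive vectors, in their original order
kept = [company for company in data if company not in skipped]

print("Skipped:", skipped)
print("Kept:", kept)
```

The order of the kept companies is the order of the keys in companies_vectors, which is the same order used later to label the rows of the clusters table in Example 8.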

Representative Answer Vector

The number of answers returned by the API per company can vary from zero to about 15, depending on how much information the API finds for a given question. Since each answer is represented by a 768-dimensional vector, we can use, for instance, a simple “average vector” to represent the company. This average vector will represent each company in the following principal component analysis (PCA) and cluster analysis. It is worth noting that other approximations, such as a distance-weighted average, could also be valid.
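
As an illustration of one such alternative, here is a minimal sketch of a distance-weighted average, in which answers far from the plain centroid count less. The helper name distance_weighted_average and the inverse-distance weighting are our assumptions for illustration; they are not used anywhere else in this article, and other weighting schemes are equally plausible.

```python
import numpy as np

def distance_weighted_average(vectors, eps=1e-9):
    """Average answer vectors, down-weighting answers far from the centroid.

    Hypothetical helper for illustration; not part of the article's examples.
    """
    vectors = np.asarray(vectors, dtype=np.float32)  # shape: (n_answers, dim)
    centroid = vectors.mean(axis=0)
    # Euclidean distance of each answer vector to the plain average
    distances = np.linalg.norm(vectors - centroid, axis=1)
    # Inverse-distance weights, normalized to sum to 1
    weights = 1.0 / (distances + eps)
    weights /= weights.sum()
    return weights @ vectors

# Toy 2-dimensional "answers"; the third one is an outlier
answers = np.array([[1.0, 0.0], [1.2, 0.0], [5.0, 0.0]])
print(answers.mean(axis=0))
print(distance_weighted_average(answers))
```

In this toy case the weighted result sits closer to the two similar answers than the plain average does, because the outlier receives the smallest weight.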

The following example is a continuation of Example 5 and computes the average vector representing each company. Note that this snippet uses the companies_vectors dictionary generated in Example 5. The PCA defined later, in Example 7, uses random_state = 42 so that the results can be reproduced.

Example 6
average_vectors = []

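for company in companies_vectors:
    # Number of answers returned for this company
    n_answers = len(companies_vectors[company])
    average_vector = [0 for j in range(len(companies_vectors[company][0].tolist()))]
    for answer in companies_vectors[company]:
        answer = answer.tolist()
        for i in range(len(answer)):
            # Divide by the number of answers, not by the vector length
            average_vector[i] += answer[i] / n_answers

    average_vectors.append(average_vector)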

As a result of running this example, the two-dimensional list “average_vectors” is generated. The first dimension corresponds to the companies (about 500, minus any company for which no answers were returned), while the second contains the 768-dimensional average vector generated for each company. This “average_vectors” list will be used next in the PCA and cluster analysis.
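
A per-company average of this kind can also be computed without explicit loops. The following is a minimal sketch, assuming each dictionary value is a NumPy array of shape (number of answers, embedding dimension), as returned by TextEmbeddingVector.get. The toy dictionary uses made-up 4-dimensional vectors instead of real 768-dimensional ones.

```python
import numpy as np

# Toy stand-in for companies_vectors: each value has shape (n_answers, dim).
# Real vectors would have dim = 768; dim = 4 keeps the example readable.
companies_vectors = {
    "Company A": np.array([[1, 2, 3, 4], [3, 4, 5, 6]], dtype=np.float16),
    "Company B": np.array([[0, 1, 0, 1]], dtype=np.float16),
}

# Cast to float32 before averaging to limit float16 rounding error,
# then average over the answers axis and stack one row per company
average_vectors = np.stack([
    np.asarray(vectors, dtype=np.float32).mean(axis=0)
    for vectors in companies_vectors.values()
])

print(average_vectors.shape)
print(average_vectors)
```

The result is a NumPy array with one row per company, which can be passed to PCA in the same way as the list of lists.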

PCA and Cluster Analysis

In this section, a PCA and a K-means cluster analysis are performed to group the companies based on their average representative vectors. A cross-validation is performed, varying the number of principal components and the number of clusters from 1 to 20. The optimal combination is the one that gives the most sharply curved “elbow” point.

To illustrate, the following graphic shows the cross-validation for Q1. Each curve shows the sum of squared errors (SSE) of the points in each cluster relative to the corresponding cluster center, as a function of the number of clusters, while the different colors represent the number of principal components considered. The “elbow point,” the point of maximum curvature, normally indicates the optimal number of clusters. For instance, the sharpest elbow among all curves is at 3 clusters on the blue curve (which corresponds to 1 principal component). Therefore, the best clustering representation for Q1 is described by 1 principal component and 3 clusters. Note that the SSE curves are normalized so that they can be compared.

To automate this selection end to end, we propose an analytical solution based on computing the elbow point for each curve following this procedure.

The curvature at the elbow point for each curve can be determined by a simple numerical calculation of the second derivative:

Y’’ = y[elbow + 1] + y[elbow - 1] - 2*y[elbow]
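
To make the formula concrete, here is a small worked sketch on made-up SSE values. As a simplification, it evaluates the second difference at every interior point and picks the largest one; in Example 7 the elbow is instead located with the kneed package, and the second difference is evaluated only at that point. The SSE numbers are invented for illustration.

```python
def normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def second_derivative(y, index):
    """Central second difference at an interior index (unit spacing)."""
    return y[index + 1] + y[index - 1] - 2 * y[index]

# Made-up SSE values for k = 1..6 clusters, with a visible bend at k = 3
k_values = [1, 2, 3, 4, 5, 6]
sse = [100.0, 60.0, 20.0, 15.0, 12.0, 10.0]

y = normalize(sse)

# The second difference needs a neighbor on each side, so skip the endpoints
curvature = {
    k: second_derivative(y, i)
    for i, k in enumerate(k_values)
    if 0 < i < len(k_values) - 1
}

best_k = max(curvature, key=curvature.get)
print(curvature)
print(best_k)
```

For these values the largest second difference is at k = 3, where the curve stops dropping steeply and flattens out.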

The principal components and the cross-validation for Q1 can be visualized using the following example. It generates a bar plot showing the percentage of the original variance explained by each principal component, and it reports the optimal number of principal components and number of clusters (n_comp, n_clusters) to fit our data.

Example 7
import matplotlib.pyplot as plt
from kneed import KneeLocator
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import pandas as pd
import plotly.express as px
import numpy as np

def elbow(x, y):
    """
    Elbow finder
    https://raghavan.usc.edu//papers/kneedle-simplex11.pdf
    https://github.com/arvkevi/kneed#interactive
    """
    kneedle = KneeLocator(x, y, S=1.0, curve="convex", direction="decreasing")
    return int(round(kneedle.elbow, 0))

def second_derivative(y, index):
    """
    Calculates the Second derivative
    """
    derivative = y[index + 1] + y[index - 1] - 2 * y[index]
    return derivative

def normalize(array):
    """
    Normalizes the data from 0 to 1
    """
    t_max = 1
    t_min = 0
    normalized_array = []
    difference = t_max - t_min
    difference_array = max(array) - min(array)    
    for i in array:
        temp = (((i - min(array))*difference)/difference_array) + t_min
        normalized_array.append(temp)
    return normalized_array

#
# PCA
#
pca = PCA(n_components=20, random_state = 42)
pca = pca.fit(average_vectors)
principal_components = pca.transform(average_vectors)

plt.rcParams.update({'font.size': 16})
plt.rc('legend', fontsize=11)
plt.rcParams["figure.figsize"] = (8,8)

features = range(1, pca.n_components_ + 1)
# Convert the ratios (0 to 1) to percentages to match the axis label
plt.bar(features, pca.explained_variance_ratio_ * 100)
plt.title('% Variance by PCA Feature')
plt.xlabel('PCA Features')
plt.ylabel('% Variance')
#plt.rcParams.update({'font.size': 22})
plt.xticks(features)
plt.show()

#
# Optimization of the n_clusters and n_components based on the 
# maximum second derivative at the elbow point
#
derivative = {}

for n_comp in range(1, 21):
    ssePCA = {}
    for k in range(1, 21):
        kmeans = KMeans(n_clusters = k, random_state = 42).fit(principal_components[:,:n_comp])
        ssePCA[k] = kmeans.inertia_

    x = list(ssePCA.keys())
    y = list(ssePCA.values())
    y_normal = normalize(y)
    
    plt.plot(list(ssePCA.keys()), y_normal, label='%s PCs' %n_comp)
    plt.title('Sum of Square Error (SSE) Vs. Number of Clusters')
    plt.xlabel('Number of Clusters')
    plt.ylabel('SSE')
    plt.xticks(range(1, 21))
    
    elbow_point = elbow(x, y_normal)
    index = x.index(elbow_point)
    # The second derivative needs a neighbor on both sides of the elbow
    if 0 < index < len(x) - 1:
        curvature = second_derivative(y_normal, index)
        derivative[curvature] = [n_comp, elbow_point]
        print(n_comp, elbow_point, curvature)

# Pick the combination with the largest second derivative at its elbow
n_comp, n_clusters = derivative[max(derivative)]
print('Optimal number of PCs and Clusters: %s, %s' %(n_comp, n_clusters))
plt.legend()
plt.show()

As shown below, the first three to four components in this case explain most of the variance of the data for our first question.

There are some questions in which just the first or the first two components represent most of the variance of the data.
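
One way to make “most of the variance” concrete is to look at the cumulative explained variance. The sketch below uses NumPy's SVD on random synthetic data as a stand-in for average_vectors; with the fitted pca object from Example 7, np.cumsum(pca.explained_variance_ratio_) would give the analogous curve over the 20 fitted components. The 90% threshold is an arbitrary value chosen for illustration, not one taken from our analysis.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for average_vectors: 200 rows, 10 columns,
# with most of the spread placed in the first two columns
scales = np.array([10, 5, 1, 1, 1, 1, 1, 1, 1, 1])
X = rng.normal(size=(200, 10)) * scales

# Explained variance ratio from the singular values of the centered data
X_centered = X - X.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

cumulative = np.cumsum(explained_variance_ratio)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.90  # arbitrary illustrative value
n_needed = int(np.searchsorted(cumulative, threshold) + 1)

print(np.round(cumulative, 3))
print(n_needed)
```

A count like n_needed is only a rough guide here, since the number of components actually used for clustering is chosen by the elbow criterion described above.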

The optimal combination of principal components and numbers of clusters obtained from the cross-validation will be used now to perform our final cluster analysis. The example below is a continuation of the previous example which generates clusters of companies based on their average representative vector for each question.

Example 8
#
# Best Clusterization
#
kmeans = KMeans(n_clusters = n_clusters, random_state = 42).fit(principal_components[:,:n_comp])

clusters = pd.DataFrame(principal_components[:,:3], columns = ['PC 1', 'PC 2', 'PC 3'])

predictions = kmeans.predict(principal_components[:,:n_comp])

clusters["predict"] = predictions
clusters["company_name"] = list(companies_vectors.keys())
clusters.to_csv("clusters_%s_PC_%s_clusters_%s.csv" %(file, n_comp, n_clusters))

fig = px.scatter_3d(clusters, x='PC 1', y='PC 2', z='PC 3', color='predict', hover_name='company_name')
fig.show()
#fig.show("notebook") # use this option for Jupyter Notebook

These are the resulting clusters. Each point represents the average vector of one company, based on that company’s answers to the first question.

If we take a closer look at the blue cluster, we find something remarkable. The rightmost set of four companies consists of Goldman Sachs Group, Inc., Fifth Third Bancorp, Citigroup Inc., and Western Union Co., which clearly belong to the financial sector and have a lot in common.
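
This kind of inspection can also be done on the table itself rather than on the plot. The following is a minimal sketch using a toy DataFrame with the same column names as the clusters DataFrame from Example 8. The coordinates, labels, and company names are made up, and we assume that “rightmost” means the largest PC 1 values.

```python
import pandas as pd

# Toy stand-in for the clusters DataFrame built in Example 8.
# Values are invented; only the column names match.
clusters = pd.DataFrame({
    "PC 1": [0.9, 1.1, -0.4, -0.6, 0.1],
    "predict": [0, 0, 1, 1, 2],
    "company_name": ["Company A", "Company B", "Company C",
                     "Company D", "Company E"],
})

# Company names grouped by cluster label
members = clusters.groupby("predict")["company_name"].apply(list)
print(members)

# The companies with the largest PC 1 values inside one cluster
rightmost = clusters[clusters["predict"] == 0].nlargest(2, "PC 1")
print(rightmost["company_name"].tolist())
```

The same two steps can be applied to the CSV file written in Example 8 after loading it with pd.read_csv.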

Although the analytical results indicate 1 PC and 3 clusters as the optimal combination, other combinations can be explored. The next section shows some cases in which this is worthwhile, using the analytical solutions described in this section as a reference.

The full Python code is available in this file.

Final Results

The same methodology described for our first question was applied to the rest of our questions:

2. "What does the company believe in?"

https://plotly.com/~yoelvis/59/

3. "What are the company's competitive advantages?"

https://plotly.com/~yoelvis/61/

4. "What are the values underpinning the company's culture?"

https://plotly.com/~yoelvis/63/

5. "What are the company's diversity and inclusion challenges?"

https://plotly.com/~yoelvis/65/

6. "What are the company's environmental sustainability challenges?"

1 PC and 2 clusters: https://plotly.com/~yoelvis/67/

As can be observed, the analytical solution suggests using 1 PC and 3 clusters, but in this case we can clearly see a small blue cluster separated from the rest. If we instead try 3 PCs and 4 clusters, this is what we obtain:

3 PC and 4 clusters: https://plotly.com/~yoelvis/69/


Perhaps this is a better representation of the data for this specific topic.

7. "What are the company's human rights challenges?"


1 PC and 2 clusters: https://plotly.com/~yoelvis/71/

3 PC and 3 clusters: https://plotly.com/~yoelvis/73/

In this case we find a similar situation: the analytical solution suggests 1 PC and 2 clusters to characterize our data, but perhaps 3 PCs and 3 clusters give a better representation.

Conclusions

From the previous PCA and cluster analysis we can draw a few conclusions. For instance, there are several questions which are primarily represented by the first principal component of the answers. In these cases, we could clearly say that there is some kind of specific characteristic for the given question which is definitively representative of each company. 

What is this specific characteristic? Further exploration may entail examining the interpretability of these principal components. Perhaps correlating these clusters with other types of well-defined data, or correlating clusters generated by different questions, will lead to remarkable results. We are looking forward to seeing a variety of cool applications powered by Thea and the new Intrinio Company Answers API.
