Listed companies like to issue their annual reports in PDF formats. PDFs are nice to look at on your screen and for printing but they are a pain to extract data from. Fortunately, there are programmatic ways (using Python) to extract and analyse the data.
With the 2018 Annual Report of ASX (the Australian Stock Exchange, which is also listed on itself as a company) I am going to demonstrate the following:
Part1: Table Extraction and Data Analysis
The following Python libraries are required: numpy, pandas, re, nltk, PyPDF2, camelot, and tkinter (as dependency of camelot). As a dependency of camelot Ghostscript (software) is also needed to be installed on your computer.
Full discloser: As at 13/06/2019, I am not a direct shareholder of ASX. However, I am exposed to the company through VanEck Vectors Australian Equal Weight ETF (MVW:ASX). ASX constitutes 1.27% of the ETF's holding as at 11/06/2019. I am most likely also exposed to the company through an industry superannuation fund.
This blog post is an html version of a Jupyter Notebook. You will see "In"s which are the scripts written in Python, and "Out"s which are the outputs.
With the 2018 Annual Report of ASX (the Australian Stock Exchange, which is also listed on itself as a company) I am going to demonstrate the following:
Part1: Table Extraction and Data Analysis
- Extract the income statement, balance sheet, and cashflow statement. (Using camelot)
- Calculate financial ratios.
- Extract text from the report pdf. (Using PyPDF2)
- Summarise the report by extracting key phrases.
- Conduct a sentiment analysis on the report (using TextBlob)
The following Python libraries are required: numpy, pandas, re, nltk, PyPDF2, camelot, and tkinter (as dependency of camelot). As a dependency of camelot Ghostscript (software) is also needed to be installed on your computer.
Full discloser: As at 13/06/2019, I am not a direct shareholder of ASX. However, I am exposed to the company through VanEck Vectors Australian Equal Weight ETF (MVW:ASX). ASX constitutes 1.27% of the ETF's holding as at 11/06/2019. I am most likely also exposed to the company through an industry superannuation fund.
This blog post is an html version of a Jupyter Notebook. You will see "In"s which are the scripts written in Python, and "Out"s which are the outputs.
Part 1: Table Extraction and Data Analysis¶
Extract the income statement, balance sheet, and cashflow statement:¶
The Plan of attack is simple: find the pages these statements are located on the pdf document, pass them into camelot.read_pdf() and load the tables to pandas for easy analysis. Loading these statements programmatically is advantageous more often than not compare to copying and pasting tables from a pdf to Excel - with Excel, you often get misaligned columns and merged cells.
Income Statement:
In [1]:
import camelot
import pandas as pd
filepath = "data/ASXAnnualReport2018.pdf"
tables = camelot.read_pdf(filepath, pages = '57', flavor = 'stream')
incomeStatement = tables[0].df #index 0 for first table and .df to convert to pandas dataframe
incomeStatement
Out[1]:
Balance Sheet:
In [2]:
tables = camelot.read_pdf(filepath, pages = '58', flavor = 'stream')
balanceSheet = tables[0].df
balanceSheet
Out[2]:
Cashflow Statment:
In [3]:
tables = camelot.read_pdf(filepath, pages = '60', flavor = 'stream')
cashflow = tables[0].df
cashflow
Out[3]:
At this point it is probably easiest to save the dataframes to csv to keep the complex labelling and formatting of the tables/dataframes. Unlike Excel, pandas would not let you to have multiple values per column. However, we can still calculate ratios by indexing the tables.
Calculate financial ratios¶
Let's first create a function that converts all the values with ',' and '()' to numbers (floats):
In [4]:
def to_float(value):
value = value.replace(',','')
value = value.replace(')','')
value = value.replace('(','-')
value = value.strip()
return float(value)
One easy way to access the "cells" in a pandas dataframe is to use the .iloc method and pass in the indexes.
Revenue Growth Rate:¶
(Rev 2018 - Rev 2017) / Rev 2017
In [5]:
revGR = (to_float(incomeStatement.iloc[11,2]) -
to_float(incomeStatement.iloc[11,3]))/to_float(incomeStatement.iloc[11,3])
print('The revenue growth rate was: ', round(revGR*100,2), '%')
Earnings per Share Growth Rate:¶
(EPS 2018 - EPS 2017) / EPS 2017
In [6]:
EPSGR = (to_float((incomeStatement.iloc[33,2])) -
to_float((incomeStatement.iloc[33,3])))/ to_float(incomeStatement.iloc[33,3])
print('The EPS growth rate was: ', round(EPSGR*100,2), '%')
Debt to Equity Ratio (and Growth Rate):¶
Debt to equity = Total Liabilities / Total EquityGrowth Rate = (DTE 2018 - DTE 2017) / DTE 2017
In [7]:
# Debt to Equity for each year
DER2018 = to_float(balanceSheet.iloc[23,2]) / to_float(balanceSheet.iloc[38,2])
DER2017 = to_float(balanceSheet.iloc[23,3]) / to_float(balanceSheet.iloc[38,3])
# Growth Rate
DERGR = (DER2018-DER2017)/DER2017*100
print('Debt to Equity Ratio was {} in 2018 and was {} in 2017. The change was {}%.'.format(
round(DER2018,2),
round(DER2017,2),
round(DERGR,2)))
Net cash inflow from operating activities:¶
(Net Cash 2018 - Net Cash 2017) / Net Cash 2017
In [8]:
oc2018 = to_float(cashflow.iloc[10,2])
oc2017 = to_float(cashflow.iloc[10,3])
growth = (oc2018 - oc2017) / oc2017*100
print('The growth rate in Receipt from Customers was {} %.'.format(round(growth,2)))
The ASX had a pretty good year in the financial year of 2018. EPS and operating cashflows were all up. Debt to equity ratio dropped. However, EPS growth was lagging revenue significantly.
In [9]:
import PyPDF2
file = open(filepath, 'rb')
pdfObj = PyPDF2.PdfFileReader(file)
contents = []
for p in range(pdfObj.getNumPages()):
page = pdfObj.getPage(p)
pageContent = page.extractText()
contents.append(pageContent)
Quick inspection of the content:
In [10]:
contents[0]
Out[10]:
In [11]:
contents[7]
Out[11]:
This techinques is also useful when you want to extract parts of a pdf document. You can save the text to a .txt file and when you open the file with Notepad or other word processing software, it will get rid of all the \n's and intepret them as blank lines.
Summarise the report by extracting key phrases¶
First we can combine the list contents to one long string:
In [12]:
text = ' '.join(contents)
Next, we are going to use the LexRank algorithm from the sumy library to extract the 5 most significant sentences from our body of text. LexRank is an unsupervised approach that is based on finind a "centroid" sentence that is the mean word vector of all the sentences in a document - it finds sentences that are "representive" of the document.
In [13]:
import sumy
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
parser = PlaintextParser.from_string(text,Tokenizer('english'))
In [14]:
# Create Summarizer
summarizer = LexRankSummarizer()
# Summarize the text to 5 sentences
summary = summarizer(parser.document, 5)
# Print out the result
print('The following is an extractive summary of the annual report: \n')
for sentence in summary:
print(sentence, '\n')
It seems extractive summary using Lex Rank is not very good. Let's try LSA, an algorithm based on term frequency techniques with singular value decomposition to summarize texts.
In [15]:
from sumy.summarizers.lsa import LsaSummarizer
summarizer_lsa = LsaSummarizer()
summary2 =summarizer_lsa(parser.document,5)
print('The following is an extractive summary of the annual report: \n')
for sentence in summary2:
print(sentence, '\n')
It seems LSA has done a much better job at finding sentences that make more sense. 3 of the 5 sentences relate to regulatory. Reading the sentences, it is easy to get that the Annual Report is about an exchange. The second sentence says: " ...we are leading the global ˜nancial [sic] exchange industry...", which clearly indicates that this is a document about a financial exchange.
Sentiment Analysis¶
Next, I am going to conduct an sentiment analysis on the whole text. The library I will be using is the vader library from nltk. nltk is a broad libraries for NLP (natural language processing).Before I feed the text into a sentiment analyser, the text will first need to be cleaned first to ensure only words are analysed. It involves the following steps:
- Remove all numbers
- Make every word lowercase
- Remove any unneccessary space
In [16]:
import re
def clean_text(text):
# remove all numbers
numbers_removed = re.sub('[^A-Za-z]', ' ', text)
# lowercase all words
lowercase_text = numbers_removed.lower()
# remove any unneccessary space
clean_text = lowercase_text.strip()
return clean_text
In [17]:
cleaned_text = clean_text(text)
Using Textblob:¶
Next, feeding cleaned_text through TextBlob for sentiment scores. The sentiment function of TextBlob returns two scores: polarity and subjectivity. Polarity is a sentiment score that has a range of [-1,1] where -1 is the most negative and 1 is the most positive. Subjectivity is a score for how subjective, emotional or judgemental the text is and has a range of [0,1].
In [18]:
from textblob import TextBlob
sentiment_score = TextBlob(text).sentiment
print('Sentiment score using TextBlob:',sentiment_score)
It seems the Annual Report is fairly dull with a polarity score of 0.076. It is also mildly subjective with a subjectivity score of 0.36 - you should always read annual reports with a grain of salt.
Reference Resource:¶
https://www.youtube.com/watch?v=LoiHI-IB3lY
https://camelot-py.readthedocs.io/en/master/user/quickstart.html#specify-page-numbers
https://automatetheboringstuff.com/chapter13/
https://github.com/Jcharis/Natural-Language-Processing-Tutorials/blob/master/Text%20Summarization%20with%20Sumy%20Python%20.ipynb
http://ai.intelligentonlinetools.com/ml/text-summarization/
https://www.analyticsvidhya.com/blog/2018/02/natural-language-processing-for-beginners-using-textblob/
https://camelot-py.readthedocs.io/en/master/user/quickstart.html#specify-page-numbers
https://automatetheboringstuff.com/chapter13/
https://github.com/Jcharis/Natural-Language-Processing-Tutorials/blob/master/Text%20Summarization%20with%20Sumy%20Python%20.ipynb
http://ai.intelligentonlinetools.com/ml/text-summarization/
https://www.analyticsvidhya.com/blog/2018/02/natural-language-processing-for-beginners-using-textblob/
Hello,
ReplyDeleteThank you for this, very useful. I am just a beginner to all this, but I am considering the prospects of automatizing the extracting of very specific sections of general annual reports of listed firms. Namely, the executive remuneration and board of directors. In some cases ( i think in the majority of cases), this data is based in tables, in other cases they are simple in the text. What could be an approach to extract this data with the tools you have mentioned? For example, first searching for the respective sections perhaps based on keywords, and if a table is found, retrieve it?
Thanks!