Dear community,
I want to create a dataset with the following variables: Company identifier | Year | Annual report text (in English) | ESG score.
I have a dataset of over 20000 companies with ESG scores over time.
For each of these company-year observations I would like to have the matching annual report (if available), so the full data frame would have about 31,229 rows.
I want to either import the annual report text directly or download the PDFs and extract the text myself.
It would be most efficient to download only the annual reports (or their text) for the companies for which I already have ESG scores. Otherwise, I would have to mass-download annual reports and match them to the relevant companies afterwards, which is also OK but less efficient. The ESG dataset contains several company identifiers: ISIN, Ticker, Refinitiv Ticker, Name, SEDOL.
Year | # Companies with ESG score
2019 | 8287
2020 | 9729
2021 | 9919
2022 | 1977
2023 | 1317
I'm open to any solution as long as I end up with a large dataset containing these two variables: Annual report text | Company ESG score. Additional variables such as year, ticker, and industry would be a plus.
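
To make the target structure concrete, here is a minimal sketch of the matching step I have in mind, assuming the report texts have already been obtained somehow and can be keyed by (ISIN, year). The column names, the sample ISINs, and the function name match_reports are all made up for illustration; only the identifier types come from my actual ESG dataset.

```python
import csv
import io

# Hypothetical sample of the ESG data (placeholder ISINs and scores).
ESG_CSV = """isin,ticker,year,esg_score
US0000000001,AAA,2019,61.2
US0000000001,AAA,2020,63.5
GB0000000002,BBB,2020,48.0
"""

# Hypothetical report texts keyed by (ISIN, year), e.g. extracted from PDFs.
REPORT_TEXTS = {
    ("US0000000001", 2019): "Annual report 2019 of AAA ...",
    ("GB0000000002", 2020): "Annual report 2020 of BBB ...",
    ("FR0000000003", 2020): "Report of a company without an ESG score ...",
}


def match_reports(esg_csv, report_texts):
    """Left-join ESG rows with report text on (ISIN, year).

    Rows without a report keep report_text = None, so the number of
    ESG observations is preserved.
    """
    rows = []
    for rec in csv.DictReader(io.StringIO(esg_csv)):
        year = int(rec["year"])
        rows.append({
            "isin": rec["isin"],
            "ticker": rec["ticker"],
            "year": year,
            "esg_score": float(rec["esg_score"]),
            "report_text": report_texts.get((rec["isin"], year)),
        })
    return rows


dataset = match_reports(ESG_CSV, REPORT_TEXTS)
matched = [row for row in dataset if row["report_text"] is not None]
print(len(dataset), "ESG observations,", len(matched), "with a report")
```

Reports for companies without an ESG score are simply ignored here, which is why I would prefer a source that can be queried by ISIN or ticker in the first place.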