Written by Ayko Schwedler as part of a bachelor thesis.
- Technologies: Python
- Python packages: see requirements.txt
- Install the desired PyTorch build (CUDA or CPU-only); the install command for each build is listed at https://pytorch.org/get-started/locally/.
- Install all modules from requirements.txt with `pip install -r requirements.txt`.
- If adjustments are desired, there are many options in config.yaml, e.g. how many threads should be used to search for news items. All available options are described in detail in the file itself.
- The API is started with `uvicorn api:app --host 0.0.0.0 --port 8000 --reload` and the news analysis with `python news_text_analysis.py`.
This backend is used to periodically (and therefore independently) search for news articles and then evaluate them using various AI algorithms. In addition, the results as well as the management of the data are made available via an API.
- The stored company name and its stored synonyms are each entered in parallel as search terms in GNews. This will yield ~100 news articles per search.
- Now the analysis is started. Multiprocessing is used for this (depending on whether a CUDA-capable GPU can be used).
- First it is checked which companies appear in the given news article.
- If at least one company is present and the news article has not yet been analysed, continue.
- Now perform a classification for the sustainability indicators in the given news article.
- Then perform a sentiment analysis for the news article in general.
- Save the results in the database.
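The steps above can be sketched roughly as follows. This is a minimal illustration and not the thesis code: `search_fn`, `classify` and `sentiment` are hypothetical callables standing in for the GNews lookup and the AI models, the article fields `url` and `text` are assumed names, and de-duplicating by URL is an assumption. The multiprocessing and GPU handling are left out.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Optional


def find_companies(text: str, companies: dict[str, list[str]]) -> list[str]:
    """Return every company whose name or synonym occurs in the text
    (case-insensitive substring matching)."""
    lowered = text.lower()
    return [
        name
        for name, synonyms in companies.items()
        if any(term.lower() in lowered for term in [name, *synonyms])
    ]


def search_all(terms: Iterable[str],
               search_fn: Callable[[str], list[dict]],
               max_workers: int = 4) -> list[dict]:
    """Run one search per term in parallel threads and merge the results.
    Articles are de-duplicated by URL (an assumption for this sketch)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        batches = list(pool.map(search_fn, terms))
    seen: set[str] = set()
    merged: list[dict] = []
    for batch in batches:
        for article in batch:
            if article["url"] not in seen:
                seen.add(article["url"])
                merged.append(article)
    return merged


def analyse(article: dict,
            companies: dict[str, list[str]],
            classify: Callable[[str], dict],
            sentiment: Callable[[str], float],
            analysed_urls: set[str]) -> Optional[dict]:
    """Analyse one article, or return None if it names no known company
    or has already been analysed."""
    found = find_companies(article["text"], companies)
    if not found or article["url"] in analysed_urls:
        return None
    result = {
        "url": article["url"],
        "companies": found,
        "indicators": classify(article["text"]),
        "sentiment": sentiment(article["text"]),
    }
    analysed_urls.add(article["url"])  # stands in for the database write
    return result
```

In this sketch the cheap checks (company match, already analysed) run before the two model calls, so articles that are irrelevant or already known never reach the expensive step.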
Several optimisations were carried out. For example, news articles that already exist in the database are re-examined for relevant companies (in case new companies have been added), but they are not analysed again with the resource-intensive AI models.
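One way to picture this optimisation is the sketch below. It assumes an in-memory dict keyed by article URL in place of the real database; `match_companies` and `run_ai` are hypothetical callables, and the returned status strings exist only for illustration.

```python
from typing import Callable


def process_article(article: dict,
                    store: dict,
                    match_companies: Callable[[str], set],
                    run_ai: Callable[[str], dict]) -> str:
    """Link an article to companies and run the AI analysis at most once."""
    url = article["url"]
    companies = match_companies(article["text"])
    record = store.get(url)

    if record is not None:
        # Known article: only refresh the company links (cheap),
        # and keep the stored AI results untouched.
        new_companies = companies - record["companies"]
        record["companies"] |= new_companies
        return "relinked" if new_companies else "unchanged"

    if not companies:
        # No known company is mentioned, so the article is not stored.
        return "skipped"

    # New, relevant article: this is the only path that calls the models.
    store[url] = {"companies": set(companies), "analysis": run_ai(article["text"])}
    return "analysed"
```

Adding a company later therefore costs one more string-matching pass over stored articles, but no additional model inference.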
Analyzed news article: Microsoft extends security log retention following State Department ... - Cybersecurity Dive
Named companies are:
- Name: Microsoft
Results of classification:
Label | Prob |
---|---|
Not Relevant to ESG | 0.9 |
Risk Management and Internal Control | 0.72 |
Data Safety | 0.56 |
Corporate Governance | 0.35 |
Environmental Management | 0.28 |
Climate Risks | 0.27 |
Supply Chain (Economic / Governance) | 0.26 |
Land Acquisition and Resettlement (S) | 0.24 |
Biodiversity | 0.19 |
Values and Ethics | 0.18 |
Wastewater Management | 0.16 |
Responsible Investment & Greenwashing | 0.15 |
Strategy Implementation | 0.15 |
Waste Management | 0.14 |
Product Safety and Quality | 0.14 |
Surface Water Pollution | 0.12 |
Human Rights | 0.12 |
Supply Chain (Social) | 0.11 |
Forced Labour | 0.11 |
Natural Resources | 0.1 |
Employee Health and Safety | 0.1 |
Planning Limitations | 0.1 |
Retrenchment | 0.1 |
Emergencies (Social) | 0.1 |
Soil and Groundwater Impact | 0.09 |
Physical Impacts | 0.09 |
Land Acquisition and Resettlement (E) | 0.09 |
Discrimination | 0.09 |
Hazardous Materials Management | 0.08 |
Land Rehabilitation | 0.08 |
Emergencies (Environmental) | 0.08 |
Energy Efficiency and Renewables | 0.07 |
Animal Welfare | 0.07 |
Disclosure | 0.07 |
Economic Crime | 0.06 |
Indigenous People | 0.06 |
Landscape Transformation | 0.06 |
Legal Proceedings & Law Violations | 0.06 |
Water Consumption | 0.06 |
Labor Relations Management | 0.05 |
Minimum Age and Child Labour | 0.05 |
Air Pollution | 0.05 |
Greenhouse Gas Emissions | 0.04 |
Freedom of Association and Right to Organise | 0.04 |
Communities Health and Safety | 0.04 |
Supply Chain (Environmental) | 0.03 |
Cultural Heritage | 0.03 |
Obtained sentiment: 6.99/10 (Neutral)
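The probabilities in the table do not sum to 1, so each label appears to be scored independently. A consumer of these results might therefore keep only the labels above a cut-off. The sketch below shows that idea; the threshold of 0.5 and the function name `top_labels` are assumptions and not part of the project.

```python
def top_labels(scores: dict[str, float], threshold: float = 0.5) -> list[tuple[str, float]]:
    """Return (label, probability) pairs at or above the threshold,
    sorted from most to least probable."""
    kept = [(label, prob) for label, prob in scores.items() if prob >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# First rows of the example table above
scores = {
    "Not Relevant to ESG": 0.9,
    "Risk Management and Internal Control": 0.72,
    "Data Safety": 0.56,
    "Corporate Governance": 0.35,
}
selected = top_labels(scores)
```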
The REST API is made available with FastAPI. The following functions exist:
Query parameters marked with * are optional.
API Access Points | company_name | date_range | max_sentiment | indicator_name | synonym_name |
---|---|---|---|---|---|
/companies | ✗ | ✗ | ✗ | ✗ | ✗ |
/do_news_exist | ✓ | ✓ | ✓ | ✓ * | ✗ |
/news_minimum | ✓ | ✓ | ✗ | ✓ * | ✗ |
/news | ✓ | ✓ | ✓ | ✓ * | ✗ |
/sustainability_indicators | ✗ | ✗ | ✗ | ✗ | ✗ |
/indicator_stats | ✓ | ✓ | ✓ | ✓ * | ✗ |
/companies (POST) | ✓ | ✗ | ✗ | ✗ | ✗ |
/synonyms (POST) | ✓ | ✗ | ✗ | ✗ | ✓ |
/companies (DELETE) | ✓ | ✗ | ✗ | ✗ | ✗ |
/synonyms (DELETE) | ✓ | ✗ | ✗ | ✗ | ✓ |
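As an illustration of the table, the following sketch builds a request URL for the `/news` endpoint from the listed query parameters. The parameter names come from the table; the base URL assumes a local server on port 8000 as in the start command, and the value formats (in particular for `date_range`) are assumptions, since they are not specified here.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed address of a locally started API (port taken from the uvicorn command)
BASE_URL = "http://localhost:8000"


def build_news_url(company_name: str,
                   date_range: str,
                   max_sentiment: float,
                   indicator_name: Optional[str] = None) -> str:
    """Build the GET URL for /news; indicator_name is optional (marked * in the table)."""
    params = {
        "company_name": company_name,
        "date_range": date_range,        # format is an assumption
        "max_sentiment": max_sentiment,
    }
    if indicator_name is not None:
        params["indicator_name"] = indicator_name
    return f"{BASE_URL}/news?{urlencode(params)}"


url = build_news_url("Microsoft", "2023-07-01,2023-07-31", 10, "Data Safety")
```

The resulting URL can be passed to any HTTP client. `urlencode` takes care of escaping spaces and commas in the parameter values.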
- Each news item is assigned at least one company and each sustainability indicator exactly once.
- A news indicator consists of exactly one sustainability indicator.
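These two constraints can be expressed as a small data model. The sketch below uses dataclasses with hypothetical class and field names; the actual database schema may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NewsIndicator:
    """Links a news item to exactly one sustainability indicator."""
    indicator_name: str
    probability: float


@dataclass
class News:
    title: str
    companies: list[str]
    indicators: list[NewsIndicator]

    def validate(self, all_indicators: set[str]) -> None:
        """Raise ValueError if the constraints described above are violated."""
        if not self.companies:
            raise ValueError("a news item needs at least one company")
        names = [ind.indicator_name for ind in self.indicators]
        if len(names) != len(set(names)):
            raise ValueError("a sustainability indicator is assigned more than once")
        if set(names) != all_indicators:
            raise ValueError("every sustainability indicator must be assigned exactly once")
```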
- Instead of only providing results for either one or all indicators, let the API user choose various indicators per request.
- Also let the user choose multiple companies for the sake of comparison.
- Allow analysis of news in multiple languages, not just English.
- Should problems arise: Upgrade company identification from basic string matching to a more advanced technology, for example NER. This would, however, also increase the processing time per news article.
- Use text/sentence similarity algorithms to detect news articles that cover the same topic, and analyze only one of them.
- Let the user choose between various news sources, instead of always using Google News.
- Further analysis of the data.
- A large part of this code is in German, as the Bachelor thesis itself is written in German.