Use the following link to download the dataset: Dataset Download Link
The dataset contains close to 8 million news stories. Each line of the dataset file is one stock news story in JSON format, and the stories are ordered in reverse chronological order (newest first). An example news story, pretty-printed across multiple lines for readability, is shown below:
{
  "title": "Europe gives Meta, TikTok six days to share information on response to Israel-Hamas conflict",
  "url": "https://www.cnbc.com/2023/10/19/israel-hamas-eu-gives-meta-tiktok-six-days-to-provide-information.html",
  "unix_timestamp": 1697727889,
  "id": "3341850707742811898",
  "tickers_direct": [
    "meta",
    "fb"
  ],
  "tickers_indirect": [
    ".bytedance"
  ],
  "description": "The EU said it would like Meta and TikTok to hand over information on how they're tackling misinformation about the Israel-Hamas war."
}
The fields of each JSON object are explained below. Most of the fields have the same semantics as the corresponding fields in the response of the TickerTick API.
Field name | Meaning | Optional field? (If yes, this field can be missing) |
---|---|---|
title | The title of this news story | No |
url | The original URL for the full news story | No |
unix_timestamp | The UNIX timestamp when the news was reported | No |
id | A unique string ID of this news story | No |
description | A short description of this news story | Yes |
tickers_direct | The tickers that the news story is directly about, e.g., the name of the company for the ticker is mentioned | Yes |
tickers_indirect | The tickers that the news story is indirectly about, e.g., the CEO or a product of the company for this ticker is mentioned | Yes |
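
Because three of the fields are optional, code that reads the dataset should not assume they are present. The following is a minimal sketch of one way to parse a single line into a typed record. The names `NewsStory`, `parse_story`, and `reported_at` are hypothetical and not part of the dataset or the TickerTick API. The sketch also assumes that `unix_timestamp` is in seconds, which is consistent with the example above (its value falls on 2023-10-19 UTC, matching the date in the URL).

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class NewsStory:
    # Required fields: always present according to the table above.
    title: str
    url: str
    unix_timestamp: int
    id: str
    # Optional fields: may be missing from a record.
    description: Optional[str] = None
    tickers_direct: List[str] = field(default_factory=list)
    tickers_indirect: List[str] = field(default_factory=list)

    @property
    def reported_at(self) -> datetime:
        # Assumption: the timestamp is expressed in seconds since the epoch.
        return datetime.fromtimestamp(self.unix_timestamp, tz=timezone.utc)


def parse_story(line: str) -> NewsStory:
    """Parse one line of the dataset file into a NewsStory."""
    record = json.loads(line)
    return NewsStory(
        title=record["title"],
        url=record["url"],
        unix_timestamp=record["unix_timestamp"],
        id=record["id"],
        # Use .get() so that missing optional fields fall back to defaults.
        description=record.get("description"),
        tickers_direct=record.get("tickers_direct", []),
        tickers_indirect=record.get("tickers_indirect", []),
    )
```

Missing ticker lists are mapped to empty lists here so that callers can iterate over them without a `None` check. Whether a missing list and an empty list mean the same thing in the dataset is not stated, so treat that as a convenience choice.
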
Note that many well-known pre-IPO startups (e.g., Bytedance, the parent company of TikTok) have made-up tickers like `.bytedance` and `.databricks`.
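
Since the file holds one JSON object per line, it can be streamed without loading all of the roughly 8 million stories into memory. The sketch below shows one possible way to do that and to select stories by ticker. The helper names (`iter_stories`, `mentions`, `is_made_up`) and the file name in the usage comment are hypothetical. Two assumptions are inferred only from the examples in this document and are not confirmed by it: tickers are stored in lower case, and a leading dot marks a made-up ticker for a pre-IPO company.

```python
import json
from typing import Iterable, Iterator


def iter_stories(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed story per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def mentions(story: dict, ticker: str) -> bool:
    """Return True if the story is directly or indirectly about the ticker."""
    # Assumption: tickers in the dataset are lower case, as in the example.
    ticker = ticker.lower()
    # Both ticker fields are optional, so default to empty lists.
    tickers = story.get("tickers_direct", []) + story.get("tickers_indirect", [])
    return ticker in tickers


def is_made_up(ticker: str) -> bool:
    """Return True for made-up tickers such as ".bytedance"."""
    # Assumption: made-up tickers are the ones that start with a dot.
    return ticker.startswith(".")


# Example usage (the file name is a placeholder for the downloaded file):
#
# with open("stock_news.json", encoding="utf-8") as f:
#     for story in iter_stories(f):
#         if mentions(story, "meta"):
#             print(story["unix_timestamp"], story["title"])
```

Because the file is in reverse chronological order, a reader that only needs recent stories can stop iterating as soon as `unix_timestamp` drops below its cutoff.
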