Commit: update

Piet Brömmel committed May 10, 2024
1 parent 60e596f commit 9b70e95
Showing 3 changed files with 65 additions and 30 deletions.
51 changes: 38 additions & 13 deletions README.md
@@ -1,18 +1,49 @@
# Deutsche Bahn Data

This is a repository with accumulated public data from "Deutsche Bahn", the biggest German train company.
A German version of this readme is [here](README_de.md).

## Data Gathering
This is a repository with accumulated public data from "Deutsche Bahn", the biggest German train company, and a website with interactive plots and tables generated from the data.

### Determining the largest train stations
TODO: example plot image?

The easiest way to get the largest German train stations is via the [price class](https://de.wikipedia.org/wiki/Preisklasse), which indirectly indicates how large a station is. For this I found a [current table](https://www.deutschebahn.com/resource/blob/11895816/ef4ecf6dd8196c7db3ab45609d8a2034/Stationspreisliste-2024-data.pdf) of all stations with their price classes. The problem is that I still need the eva number of each station for the API.
## The Data

https://wiki.openstreetmap.org/w/images/c/c2/20141001_IBNR.pdf (data from 2014-10-01) maps station names to their IBNR number (in the API this number is called eva). The API the data originally came from no longer exists (https://data.deutschebahn.com/dataset/data-haltestellen), so this old version is used here. In it I found the numbers for all relevant stations.
The data is a list of entries describing when trains were planned to arrive at and depart from each station, together with their delay, platform changes, and whether the stop was canceled. The data is saved as a table in `data.csv` with these columns (a short loading example follows the table):

These two data sources are used in `save_eva_name_list.py` to create a list of the ~100 largest German train stations with their names and eva numbers.
| Column Name | Description |
|--------------------------|-------------------------------------------------------------------------------------------------------|
| **station** | Represents the station name associated with the particular data entry. |
| **train_name** | Combines the train type and line number, e.g. RE 1. |
| **final_destination_station** | The final destination station of the train. |
| **arrival_planned_time** | Displays the scheduled arrival time at this station. Formatted as a date-time value. |
| **arrival_time_delta_in_min** | The difference, in minutes, between the planned arrival time and the actual or changed arrival time. Positive values indicate a delay, while negative values mean early arrival. |
| **departure_planned_time** | Shows the originally planned departure time at this station. Formatted as a date-time value. |
| **departure_time_delta_in_min** | The difference, in minutes, between the planned departure time and the actual or changed departure time. Positive values show delays, while negative values reflect early departure. |
| **planned_platform** | Represents the platform where the train was originally scheduled to arrive and/or depart. |
| **changed_platform** | Displays the platform that the train will now arrive and/or depart from, if it differs from the planned platform. |
| **stop_canceled** | A boolean column (`True` or `False`) indicating whether the train's stop at this station was canceled. |
| **train_type** | Specifies the type of train, such as IC, ICE, EC, or other regional and local types. |
| **train_line_ride_id** | The id of the train line that this particular train ride belongs to. |
| **train_line_station_num** | Represents the station's number in the sequence of stations on this particular train line. |
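
As a quick orientation, here is a minimal sketch (not part of the repository) that loads the generated `data.csv` with pandas and computes the mean arrival delay per station; it assumes `data.csv` has already been created in the repository root:

```python
import pandas as pd

# Load the accumulated table (assumes data.csv has been generated).
df = pd.read_csv("data.csv")

# Mean arrival delay in minutes per station, ignoring canceled stops.
mean_delay = (
    df[~df["stop_canceled"]]
    .groupby("station")["arrival_time_delta_in_min"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_delay.head(10))
```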

These are the commands to create the list yourself:

### Data Collection

The [timetables-api](https://developers.deutschebahn.com/db-api-marketplace/apis/product/timetables) is used to collect the raw data. The API can be queried free of charge up to 60 times per second, and the data is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

The [timetable-plan-api](https://developers.deutschebahn.com/db-api-marketplace/apis/product/timetables/api/26494#/Timetables_10213/operation/%2Fplan%2F{evaNo}%2F{date}%2F{hour}/get) is used to get the planned timetable for a station at a specific hour and day. The [timetable-changes-api](https://developers.deutschebahn.com/db-api-marketplace/apis/product/timetables/api/26494#/Timetables_10213/operation/%2Ffchg%2F{evaNo}/get) is used to get all changes. The changes API is queried every 6 hours so that no changes are missed.

The responses of the APIs are saved in the `data` folder. Each day gets its own subfolder, and the suffix of each file is the hour (in UTC) at which the change request was made, or the hour of the planned train schedule.
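
To make the folder layout concrete, the tree below sketches what the collection produces. The day-folder name format comes from `update_data_csv.py`; the file names are hypothetical placeholders, since only the eva-number prefix, the `fchg` marker, and the hour suffix are inferred from the script:

```
data/
  2024-05-10/              # one subfolder per day
    <eva>_plan_14.xml      # planned timetable (name pattern is a guess)
    <eva>_fchg_12.xml      # change response; "fchg" appears in the file name
  2024-05-11/
    ...
```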

You can explore the API at https://editor.swagger.io/ together with the OpenAPI documentation, which you can download from [here](https://developers.deutschebahn.com/db-api-marketplace/apis/product/timetables/api/26494#/Timetables_10213/overview).
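
For reference, a minimal request against the plan endpoint could look like the sketch below; the base URL, header names, and date format are assumptions based on the DB API Marketplace conventions, so verify them against the OpenAPI documentation before use:

```python
import requests

# Base URL and header names are assumptions -- check the OpenAPI docs.
BASE = "https://apis.deutschebahn.com/db-api-marketplace/apis/timetables/v1"
HEADERS = {
    "DB-Client-Id": "<your client id>",   # credentials from the DB API Marketplace
    "DB-Api-Key": "<your api key>",
    "Accept": "application/xml",
}

eva = "8000105"   # example eva number (Frankfurt (Main) Hbf)
date = "240510"   # assumed format YYMMDD
hour = "14"       # two-digit hour

# Planned timetable for one station and hour; the response is XML.
resp = requests.get(f"{BASE}/plan/{eva}/{date}/{hour}", headers=HEADERS, timeout=30)
resp.raise_for_status()
print(resp.text[:500])
```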

### How to get the biggest train stations and their eva numbers

There are [German railway station categories](https://en.wikipedia.org/wiki/German_railway_station_categories) ([German link](https://de.wikipedia.org/wiki/Preisklasse)) that we can use to find the biggest train stations. There is a table of every German train station with its category [here](https://www.deutschebahn.com/resource/blob/11895816/ef4ecf6dd8196c7db3ab45609d8a2034/Stationspreisliste-2024-data.pdf). It is used to get all train stations in category 1 and 2.

Next, the eva numbers of these train stations are needed to use them with the API. There is an older list of train stations and their eva numbers from 2014 [here](https://wiki.openstreetmap.org/w/images/c/c2/20141001_IBNR.pdf), and I could not find a newer one. If you know of a more recent one, please write me or open an issue.

The script `save_eva_name_list.py` automates the extraction and name matching and creates `eva_name_list.txt`. Run it using the following commands; the expected output format is shown after the code block.

```bash
# download the two pdfs with the data
@@ -25,9 +56,3 @@ pip3 install tabula-py PyPDF2
# run the script
python3 save_eva_name_list.py
```
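
The resulting `eva_name_list.txt` is a plain comma-separated file with one `eva,station name` pair per line; this format is inferred from `get_eva_to_name_dict` in `update_data_csv.py`, and the sample entries below are only illustrative:

```
8000105,Frankfurt(Main)Hbf
8000261,München Hbf
8002549,Hamburg Hbf
```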

If anyone finds a current list of train stations and their eva numbers, please open an issue.



https://editor.swagger.io/
1 change: 1 addition & 0 deletions README_de.md
@@ -0,0 +1 @@
# TODO
43 changes: 26 additions & 17 deletions update_data_csv.py
@@ -4,8 +4,8 @@

def get_eva_to_name_dict():
with Path("eva_name_list.txt").open("r") as f:
eva_to_list = {line.split(",")[0]: line.split(",")[1] for line in f.read().split("\n")}
return eva_to_list
eva_to_name = {line.split(",")[0]: line.split(",")[1] for line in f.read().split("\n")}
return eva_to_name

def get_plan_xml_rows(xml_path, eva_to_name):
eva = xml_path.name.split("_")[0]
@@ -37,25 +37,24 @@ def get_plan_xml_rows(xml_path, eva_to_name):

dp_ppth = s.find('dp').get('ppth') if s.find('dp') is not None else None # departure planned path
if dp_ppth is None:
destination_station = station
final_destination_station = station
else:
destination_station = dp_ppth.split("|")[-1]
final_destination_station = dp_ppth.split("|")[-1]

s_id_split = s_id.split('-')

rows.append({
'id': s_id,
'station': station,
'train_name': train_name,
'destination_station': destination_station,
'train_number': int(train_number),
'final_destination_station': final_destination_station,
#'train_number': int(train_number),
'train_type': train_type,
'arrival_planned_time': s.find('ar').get('pt') if s.find('ar') is not None else None,
'departure_planned_time': s.find('dp').get('pt') if s.find('dp') is not None else None,
'planned_platform': planned_platform,
'train_line_id': '-'.join(s_id_split[:-1]),
'train_line_ride_id': '-'.join(s_id_split[:-1]),
'train_line_station_num': int(s_id_split[-1]),

# 'arrival_planned_path': s.find('ar').get('ppth') if s.find('ar') is not None else None,
# 'departure_planned_path': s.find('dp').get('ppth') if s.find('dp') is not None else None,

@@ -114,8 +113,12 @@ def get_fchg_xml_rows(xml_path, id_to_data):
def get_fchg_db():
id_to_data = {}
for date_folder_path in Path("data").iterdir():
# if date_folder_path.name != "2024-05-10":
# continue
for xml_path in sorted(date_folder_path.iterdir()): # get the oldest data first
if "fchg" in xml_path.name:
# if "06" in xml_path.name:
# continue
get_fchg_xml_rows(xml_path, id_to_data)

out_df = pd.DataFrame(id_to_data.values())
@@ -127,23 +130,29 @@ def get_fchg_db():
def main():
plan_df = get_plan_db()
fchg_df = get_fchg_db()
# print(len(plan_df), len(fchg_df)) # TODO: why is fchg_df bigger?
df = pd.merge(plan_df, fchg_df, on='id', how='left')

df.loc[df["arrival_planned_time"] == df["arrival_change_time"], "arrival_change_time"] = None
df.loc[df["departure_planned_time"] == df["departure_change_time"], "departure_change_time"] = None

# Calculate time deltas
df["arrival_time_delta"] = df["arrival_change_time"] - df["arrival_planned_time"]
df["arrival_time_delta"] = df["arrival_time_delta"].fillna(pd.Timedelta(0))
df["arrival_time_delta"] = pd.to_timedelta(df["arrival_time_delta"])
df["departure_time_delta"] = df["departure_change_time"] - df["departure_planned_time"]
df["departure_time_delta"] = df["departure_time_delta"].fillna(pd.Timedelta(0))
df["departure_time_delta"] = pd.to_timedelta(df["departure_time_delta"])

df.loc[df["stop_canceled"].isna(), "stop_canceled"] = False
# Calculate time deltas in minutes (positive values mean the train was late)
for prefix in ["arrival", "departure"]:
time_delta = df[f"{prefix}_change_time"] - df[f"{prefix}_planned_time"]
time_delta = time_delta.fillna(pd.Timedelta(0))
df[f"{prefix}_time_delta_in_min"] = time_delta.dt.total_seconds() / 60

df.loc[df["stop_canceled"].isna(), "stop_canceled"] = False
df = df.drop("id", axis=1)

# Reorder columns as per the new order specified
df = df[[
'station', 'train_name', 'final_destination_station', 'arrival_planned_time',
'arrival_time_delta_in_min', 'departure_planned_time', 'departure_time_delta_in_min',
'planned_platform', 'changed_platform', 'stop_canceled', 'train_type',
'train_line_ride_id', 'train_line_station_num'
]]

df.to_csv("data.csv", index=False)

if __name__ == "__main__":
