Commit: update

Piet Brömmel committed May 10, 2024
1 parent 60e596f commit 9b70e95
Showing 3 changed files with 65 additions and 30 deletions.
51 changes: 38 additions & 13 deletions README.md
@@ -1,18 +1,49 @@
# Deutsche Bahn Data

This is a repository with accumulated public data from "Deutsche Bahn", the biggest German train company.
A German version of this readme is [here](README_de.md).

## Data Gathering
This is a repository with accumulated public data from "Deutsche Bahn", the biggest German train company, and a website with interactive plots and tables generated from the data.

### Determining the largest train stations
TODO: example plot image?

The easiest way to get the largest German train stations is via the [price class](https://de.wikipedia.org/wiki/Preisklasse), which indirectly indicates how large a station is. For this I found a [current table](https://www.deutschebahn.com/resource/blob/11895816/ef4ecf6dd8196c7db3ab45609d8a2034/Stationspreisliste-2024-data.pdf) of all stations with their price classes. The problem is that I still need the eva number of each station for the API.
## The Data

https://wiki.openstreetmap.org/w/images/c/c2/20141001_IBNR.pdf (data from 2014-10-01) maps station names to their IBNR number (in the API this number is called eva). The API the data originally came from no longer exists (https://data.deutschebahn.com/dataset/data-haltestellen), so this old version is used here. In it I found the numbers for all relevant stations.
The data is a list of entries describing when trains were planned to arrive at and depart from each station, together with their delay, platform changes, and whether the stop was canceled. The data is saved as a table in `data.csv` with these columns (a short loading example follows the table):

These two data sources are used in `save_eva_name_list.py` to create a list of the ~100 largest German train stations with their names and eva numbers.
| Column Name | Description |
|--------------------------|-------------------------------------------------------------------------------------------------------|
| **station** | Represents the station name associated with the particular data entry. |
| **train_name** | Combines the train type and line number, e.g. RE 1. |
| **final_destination_station** | The final destination station of the train. |
| **arrival_planned_time** | Displays the scheduled arrival time at this station. Formatted as a date-time value. |
| **arrival_time_delta_in_min** | The difference, in minutes, between the planned arrival time and the actual or changed arrival time. Positive values indicate a delay, while negative values mean early arrival. |
| **departure_planned_time** | Shows the originally planned departure time at this station. Formatted as a date-time value. |
| **departure_time_delta_in_min** | The difference, in minutes, between the planned departure time and the actual or changed departure time. Positive values show delays, while negative values reflect early departure. |
| **planned_platform** | Represents the platform where the train was originally scheduled to arrive and/or depart. |
| **changed_platform** | Displays the platform that the train will now arrive and/or depart from, if it differs from the planned platform. |
| **stop_canceled** | A boolean column (`True` or `False`) indicating whether the train's stop at this station was canceled. |
| **train_type** | Specifies the type of train, such as IC, ICE, EC, or other regional and local types. |
| **train_line_ride_id** | The id of the train line that this particular train ride belongs to. |
| **train_line_station_num** | Represents the station's number in the sequence of stations on this particular train line. |
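
As a quick orientation, here is a minimal sketch (not part of the repository) that loads the generated `data.csv` with pandas and computes the mean arrival delay per station; it assumes `data.csv` has already been created in the repository root:

```python
import pandas as pd

# Load the accumulated table (assumes data.csv has been generated).
df = pd.read_csv("data.csv")

# Mean arrival delay in minutes per station, ignoring canceled stops.
mean_delay = (
    df[~df["stop_canceled"]]
    .groupby("station")["arrival_time_delta_in_min"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_delay.head(10))
```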

These are the commands to create the list yourself:

### Data Collection

The [timetables-api](https://developers.deutschebahn.com/db-api-marketplace/apis/product/timetables) is used to collect the raw data. The API can be queried free of charge up to 60 times per second, and the data is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

The [timetable-plan-api](https://developers.deutschebahn.com/db-api-marketplace/apis/product/timetables/api/26494#/Timetables_10213/operation/%2Fplan%2F{evaNo}%2F{date}%2F{hour}/get) is used to get the planned timetable for a station at a specific hour and day. The [timetable-changes-api](https://developers.deutschebahn.com/db-api-marketplace/apis/product/timetables/api/26494#/Timetables_10213/operation/%2Ffchg%2F{evaNo}/get) is used to get all changes. The changes API is queried every 6 hours so that no changes are missed.

The responses of the APIs are saved in the `data` folder. Each day gets its own subfolder, and the suffix of each file is the hour (in UTC) at which the change request was made, or the hour of the planned train schedule.
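
To make the folder layout concrete, the tree below sketches what the collection produces. The day-folder name format comes from `update_data_csv.py`; the file names are hypothetical placeholders, since only the eva-number prefix, the `fchg` marker, and the hour suffix are inferred from the script:

```
data/
  2024-05-10/              # one subfolder per day
    <eva>_plan_14.xml      # planned timetable (name pattern is a guess)
    <eva>_fchg_12.xml      # change response; "fchg" appears in the file name
  2024-05-11/
    ...
```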

You can explore the API at https://editor.swagger.io/ together with the OpenAPI documentation, which you can download from [here](https://developers.deutschebahn.com/db-api-marketplace/apis/product/timetables/api/26494#/Timetables_10213/overview).
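
For reference, a minimal request against the plan endpoint could look like the sketch below; the base URL, header names, and date format are assumptions based on the DB API Marketplace conventions, so verify them against the OpenAPI documentation before use:

```python
import requests

# Base URL and header names are assumptions -- check the OpenAPI docs.
BASE = "https://apis.deutschebahn.com/db-api-marketplace/apis/timetables/v1"
HEADERS = {
    "DB-Client-Id": "<your client id>",   # credentials from the DB API Marketplace
    "DB-Api-Key": "<your api key>",
    "Accept": "application/xml",
}

eva = "8000105"   # example eva number (Frankfurt (Main) Hbf)
date = "240510"   # assumed format YYMMDD
hour = "14"       # two-digit hour

# Planned timetable for one station and hour; the response is XML.
resp = requests.get(f"{BASE}/plan/{eva}/{date}/{hour}", headers=HEADERS, timeout=30)
resp.raise_for_status()
print(resp.text[:500])
```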

### How to get the biggest train stations and their eva numbers

There are [German railway station categories](https://en.wikipedia.org/wiki/German_railway_station_categories) ([German link](https://de.wikipedia.org/wiki/Preisklasse)) that we can use to find the biggest train stations. There is a table of every German train station with its category [here](https://www.deutschebahn.com/resource/blob/11895816/ef4ecf6dd8196c7db3ab45609d8a2034/Stationspreisliste-2024-data.pdf). It is used to get all train stations in category 1 and 2.

Next, the eva numbers of these train stations are needed to use them with the API. There is an older list of train stations and their eva numbers from 2014 [here](https://wiki.openstreetmap.org/w/images/c/c2/20141001_IBNR.pdf), and I could not find a newer one. If you know of a more recent one, please write me or open an issue.

The script `save_eva_name_list.py` automates the extraction and name matching and creates `eva_name_list.txt`. Run it using the following commands; the expected output format is shown after the code block.

```bash
# download the two pdfs with the data
@@ -25,9 +56,3 @@ pip3 install tabula-py PyPDF2
# run the script
python3 save_eva_name_list.py
```
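
The resulting `eva_name_list.txt` is a plain comma-separated file with one `eva,station name` pair per line; this format is inferred from `get_eva_to_name_dict` in `update_data_csv.py`, and the sample entries below are only illustrative:

```
8000105,Frankfurt(Main)Hbf
8000261,München Hbf
8002549,Hamburg Hbf
```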

If anyone finds a current list of train stations and their eva numbers, please open an issue.



https://editor.swagger.io/
1 change: 1 addition & 0 deletions README_de.md
@@ -0,0 +1 @@
# TODO
43 changes: 26 additions & 17 deletions update_data_csv.py
@@ -4,8 +4,8 @@

def get_eva_to_name_dict():
with Path("eva_name_list.txt").open("r") as f:
eva_to_list = {line.split(",")[0]: line.split(",")[1] for line in f.read().split("\n")}
return eva_to_list
eva_to_name = {line.split(",")[0]: line.split(",")[1] for line in f.read().split("\n")}
return eva_to_name

def get_plan_xml_rows(xml_path, eva_to_name):
eva = xml_path.name.split("_")[0]
@@ -37,25 +37,24 @@ def get_plan_xml_rows(xml_path, eva_to_name):

dp_ppth = s.find('dp').get('ppth') if s.find('dp') is not None else None # departure planned path
if dp_ppth is None:
destination_station = station
final_destination_station = station
else:
destination_station = dp_ppth.split("|")[-1]
final_destination_station = dp_ppth.split("|")[-1]

s_id_split = s_id.split('-')

rows.append({
'id': s_id,
'station': station,
'train_name': train_name,
'destination_station': destination_station,
'train_number': int(train_number),
'final_destination_station': final_destination_station,
#'train_number': int(train_number),
'train_type': train_type,
'arrival_planned_time': s.find('ar').get('pt') if s.find('ar') is not None else None,
'departure_planned_time': s.find('dp').get('pt') if s.find('dp') is not None else None,
'planned_platform': planned_platform,
'train_line_id': '-'.join(s_id_split[:-1]),
'train_line_ride_id': '-'.join(s_id_split[:-1]),
'train_line_station_num': int(s_id_split[-1]),

# 'arrival_planned_path': s.find('ar').get('ppth') if s.find('ar') is not None else None,
# 'departure_planned_path': s.find('dp').get('ppth') if s.find('dp') is not None else None,

@@ -114,8 +113,12 @@ def get_fchg_xml_rows(xml_path, id_to_data):
def get_fchg_db():
id_to_data = {}
for date_folder_path in Path("data").iterdir():
# if date_folder_path.name != "2024-05-10":
# continue
for xml_path in sorted(date_folder_path.iterdir()): # get the oldest data first
if "fchg" in xml_path.name:
# if "06" in xml_path.name:
# continue
get_fchg_xml_rows(xml_path, id_to_data)

out_df = pd.DataFrame(id_to_data.values())
@@ -127,23 +130,29 @@ def get_fchg_db():
def main():
plan_df = get_plan_db()
fchg_df = get_fchg_db()
# print(len(plan_df), len(fchg_df)) # TODO: why is fchg_df bigger?
df = pd.merge(plan_df, fchg_df, on='id', how='left')

df.loc[df["arrival_planned_time"] == df["arrival_change_time"], "arrival_change_time"] = None
df.loc[df["departure_planned_time"] == df["departure_change_time"], "departure_change_time"] = None

# Calculate time deltas
df["arrival_time_delta"] = df["arrival_change_time"] - df["arrival_planned_time"]
df["arrival_time_delta"] = df["arrival_time_delta"].fillna(pd.Timedelta(0))
df["arrival_time_delta"] = pd.to_timedelta(df["arrival_time_delta"])
df["departure_time_delta"] = df["departure_change_time"] - df["departure_planned_time"]
df["departure_time_delta"] = df["departure_time_delta"].fillna(pd.Timedelta(0))
df["departure_time_delta"] = pd.to_timedelta(df["departure_time_delta"])

df.loc[df["stop_canceled"].isna(), "stop_canceled"] = False
# Calculate time deltas in minutes (positive values mean the train was late)
for prefix in ["arrival", "departure"]:
time_delta = df[f"{prefix}_change_time"] - df[f"{prefix}_planned_time"]
time_delta = time_delta.fillna(pd.Timedelta(0))
df[f"{prefix}_time_delta_in_min"] = time_delta.dt.total_seconds() / 60

df.loc[df["stop_canceled"].isna(), "stop_canceled"] = False
df = df.drop("id", axis=1)

# Reorder columns as per the new order specified
df = df[[
'station', 'train_name', 'final_destination_station', 'arrival_planned_time',
'arrival_time_delta_in_min', 'departure_planned_time', 'departure_time_delta_in_min',
'planned_platform', 'changed_platform', 'stop_canceled', 'train_type',
'train_line_ride_id', 'train_line_station_num'
]]

df.to_csv("data.csv", index=False)

if __name__ == "__main__":
