SiriRide calculated attributes - v1 #315

Open
EyalBerger opened this issue Apr 6, 2020 · 10 comments

@EyalBerger
Collaborator

EyalBerger commented Apr 6, 2020

Following our last discussions and the restart of the SiriRide entity task, I'm creating this issue to list the planned SiriRide calculated attributes (v1).

I have separated them into different classes according to their complexity:

  • Class 1: SiriRide Raw data.
  • Class 2: Simple calculations over Siri data.
  • Class 3: Simple calculations over GTFS data.
  • Class 4: Complex calculations - ride level.
  • Class 5: Complex calculations - aggregated.
  • Class 6: Models.

| Class | Attr name | Attr desc | Comments |
|---|---|---|---|
| 1 | agency_id | | |
| 1 | route_id | | |
| 1 | route_short_name | | |
| 1 | bus_id | | |
| 1 | planned_start_date | | |
| 1 | planned_start_time | | |
| 1 | points_time_list | list of points timestamp by time_recorded | |
| 1 | points_latlon_list | list of points latlon | |
| 2 | points_cnt | number of Geo points in SiriRide | |
| 3 | ride_in_gtfs | specific ride is listed in the GTFS | agency_id + route_id + planned_start_date + planned_start_time |
| 3 | ride_date_in_gtfs | ride date is listed in the GTFS | agency_id + route_id + planned_start_date |
| 3 | ride_route_in_gtfs | ride route is listed in the GTFS | agency_id + route_id |
| 3 | ride_agency_in_gtfs | ride agency is listed in the GTFS | agency_id |
| 4 | stops_matching_pct_500 | stops match percentage with buffer of 500m over each stop | will be calculated only for ride_date_in_gtfs = 1 |
| 4 | stops_matching_pct_1000 | stops match percentage with buffer of 1000m over each stop | will be calculated only for ride_date_in_gtfs = 1 |
| 4 | start_time_est | estimated ride start time in first station | will be calculated only for X% of stops_matching_pct_1000 |
| 4 | end_time_est | estimated ride end time in last station | will be calculated only for X% of stops_matching_pct_1000 |
| 4 | driving_time_est | estimated driving time from first station to last station | will be calculated only for X% of stops_matching_pct_1000 |
| 4 | driving_speed_est | estimated average driving speed from first station to last station | will be calculated only for X% of stops_matching_pct_1000 |
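
One possible reading of stops_matching_pct_500 / stops_matching_pct_1000, sketched below, is "the share of planned stops that have at least one ride point inside the buffer". This is only an interpretation, not an agreed definition: the `stops_latlon` input (stop coordinates taken from GTFS), the function names, and the use of a haversine distance instead of a projected buffer are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def stops_matching_pct(stops_latlon, points_latlon_list, buffer_m):
    """Percentage of stops with at least one ride point within buffer_m meters.

    stops_latlon is a hypothetical list of (lat, lon) stop positions from GTFS.
    Returns None when there are no stops to match against.
    """
    if not stops_latlon:
        return None
    matched = sum(
        1
        for s_lat, s_lon in stops_latlon
        if any(
            haversine_m(s_lat, s_lon, p_lat, p_lon) <= buffer_m
            for p_lat, p_lon in points_latlon_list
        )
    )
    return 100.0 * matched / len(stops_latlon)
```

Under this reading, stops_matching_pct_500 would be `stops_matching_pct(stops, points, 500)` and stops_matching_pct_1000 the same call with 1000.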

It's a very initial list. Please edit it with your own insights.

@evyatark
Collaborator

evyatark commented Apr 6, 2020

I think we should add the attribute "makat number", in addition to agency_id, route_id, route_short_name.

@cjer
Member

cjer commented Apr 7, 2020

Some thoughts and fields I think we also need:

  • Where in your classification do you put the estimated time at each stop?
  • Time of first and last record (as opposed to departure and arrival times)
  • Time of first and last record that had actual lat,lon (not 0.0,0.0)
  • Fields about where the missing stops are (start, end or middle of the trip)
  • Did the bus go back and forth or loop (arrive at a point in the shape that it had already visited at an earlier time)
  • General point - I think we should try to have all relevant fields that exist in trip_stats in SiriRide, and make sure we are comparable to the trip_stats fields (and also have the same names?). Most of them you already mentioned. I think only these two are missing:
    • distance
    • is_loop kind of matches the thing I mentioned above about back and forth
  • Once we have these we will be able to easily create in the future something like siri_route_stats with fields matching the ones in the gtfs route_stats
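
As an illustration of the "first and last record" fields above, here is a small sketch that derives them from the two raw lists in the original proposal (points_time_list and points_latlon_list). The output field names are hypothetical, and treating exactly (0.0, 0.0) as "no actual location" is an assumption based on the bullet above.

```python
from datetime import datetime


def record_time_bounds(points_time_list, points_latlon_list):
    """First/last record times, overall and for records with a real location."""
    # Keep each time paired with its location, ordered by time.
    pairs = sorted(zip(points_time_list, points_latlon_list), key=lambda p: p[0])
    times = [t for t, _ in pairs]
    # Records reported at (0.0, 0.0) are treated as having no actual position.
    located = [t for t, latlon in pairs if tuple(latlon) != (0.0, 0.0)]
    return {
        "first_record_time": times[0] if times else None,
        "last_record_time": times[-1] if times else None,
        "first_located_record_time": located[0] if located else None,
        "last_located_record_time": located[-1] if located else None,
    }
```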

@AvivSela
Collaborator

AvivSela commented Apr 8, 2020

  1. We could also have a class for a siri-ride that holds multiple siri-records (each with date-time and lat-lon attributes); it could be easier to have one list than two.
  2. What do you say about merging planned_start_date and planned_start_time?
  3. If we are going to have those 2 classes (siri-ride and siri-record), each of them could have an "analytics" member that holds a dictionary with all the metrics. For example, "points_cnt" will be in the siri-ride object while "speed" will be in the siri-record.
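
A rough sketch of what these two classes could look like, using dataclasses. The attribute selection, the merged `planned_start_datetime` field, and the `update_analytics` helper are assumptions for illustration only, not a settled design.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Tuple


@dataclass
class SiriRecord:
    """One observation of a vehicle at a point in time."""
    time: datetime
    latlon: Tuple[float, float]
    # Record-level metrics, e.g. "speed".
    analytics: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SiriRide:
    """A ride holding a single list of records instead of two parallel lists."""
    agency_id: int
    route_id: int
    planned_start_datetime: datetime  # merged date + time (suggestion 2)
    records: List[SiriRecord] = field(default_factory=list)
    # Ride-level metrics, e.g. "points_cnt".
    analytics: Dict[str, Any] = field(default_factory=dict)

    def update_analytics(self) -> None:
        """Recompute the straightforward ride-level metrics."""
        self.analytics["points_cnt"] = len(self.records)
```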

@adiwaz
Collaborator

adiwaz commented Apr 10, 2020

I like the idea of dividing the variables into complexity classes.
My suggestions/comments:

  1. I think we should classify each variable by 2 criteria:
    1. Data needed (siri ride only, gtfs, etc.)
    2. Data Science work needed (e.g. straightforward aggregation vs statistical model required)
  2. I don't understand the difference between complex calculations (class 4) and models (class 6). I prefer a clearer definition of the data science solution complexity (see above), one that does not require us to decide in advance which type of DS solution (ML/statistical model...) will be best for each "complex" variable.
  3. Let's add dependencies: if driving time requires start_time and end_time, and given them it is a straightforward calculation, let's mention it.
  4. On top of Dan's suggestions I would also add:
    1. total_ride_time_raw : time from first non 0 time point until the last one. This variable will help us to easily detect data anomalies with too long and too short rides.
    2. is_match_route: is the route ID mentioned in SIRI matching the expected route shape (from GTFS)?
  5. I didn't understand the variables: stops_matching_pct_*. Maybe add further description?
  6. In general I think we should focus now on defining and creating the "straightforward" variables, and later focus on variables that require statistics/modeling.
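
To make the dependency idea in (3) concrete, here is a small sketch that records which attribute needs which, and derives a valid calculation order from that. The specific dependency edges are guesses based on the attribute list in the first comment, not an agreed design, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# attribute -> attributes it is calculated from (illustrative guesses only)
DEPENDENCIES = {
    "points_cnt": {"points_latlon_list"},
    "stops_matching_pct_1000": {"points_latlon_list", "ride_date_in_gtfs"},
    "start_time_est": {"stops_matching_pct_1000", "points_time_list"},
    "end_time_est": {"stops_matching_pct_1000", "points_time_list"},
    "driving_time_est": {"start_time_est", "end_time_est"},
}


def calculation_order(dependencies):
    """Return the attributes in an order where every dependency comes first."""
    return list(TopologicalSorter(dependencies).static_order())
```

With such a map, a straightforward variable like driving_time_est is simply one whose dependencies are already available, and a circular definition shows up as a `graphlib.CycleError`.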

@adiwaz
Collaborator

adiwaz commented Apr 10, 2020

@AvivSela - I didn't understand your suggestion in (1), what is the purpose of each class? Why should they be separated?
Regarding (2) - I think that merging the date and time can hurt efficiency of indexing. Maybe we would like to index the date and not the time.

@AvivSela
Collaborator

  1. It's easier to loop over them:
for ind, point_time in enumerate(points_time_list):
    time = point_time 
    lat, lon = points_latlon_list[ind]

Vs.

for record in records:
    time = record.time
    lat, lon = record.latlon
  2. It's less error-prone when a record has to be split in two: with two lists, the new entries must be inserted at the same index in both lists.
  3. It's easier to sort after modifications.
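
The sorting point can be sketched as follows. With two parallel lists, both have to be reordered with the same index permutation; with a single list of records, one sort is enough. The `Record` type and the function names here are hypothetical.

```python
from collections import namedtuple
from datetime import datetime

Record = namedtuple("Record", ["time", "latlon"])


def sort_parallel_lists(points_time_list, points_latlon_list):
    """Sort two index-aligned lists by time; both must be reordered together."""
    order = sorted(range(len(points_time_list)), key=lambda i: points_time_list[i])
    return (
        [points_time_list[i] for i in order],
        [points_latlon_list[i] for i in order],
    )


def sort_records(records):
    """Sort a single list of records by time."""
    return sorted(records, key=lambda record: record.time)
```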

@EyalBerger
Collaborator Author

Thank you all for your comments and insights!

I updated the design accordingly.

The variables list became too long so I ended up opening a design doc for it. Please see here.

In summary:

  1. I added most of the suggested variables (see exceptions in the "open issues" section below) and some more (total ~30 raw data/straightforward calculations and ~10 complex calculations).
  2. I added variables dependencies.
  3. I separated the data categories (what was called "classes" in the previous comment) by the 2 criteria @adiwaz mentioned.

Open issues:

  1. "makat number" - @evyatark I didn't find the column in the Splunk siri data. What is the meaning of this column? Do you know its "Splunk" name?
  2. "is_match_route" - @adiwaz, I think that for this version it will be simpler (from both IT and DS perspectives) not to use GTFS shape files, and to build our "match route" variables based on GTFS route_stats only (stops data).
  3. SiriRecord class - @AvivSela, I assume this is more of an IT-related issue than a data-related issue.
  4. planned_end_datetime_gtfs - I didn't find this data in gtfs route_stats. Don't we get/collect it?

@EyalBerger
Collaborator Author

Following 15/4 Zoom meeting, some required updates in the data design:

  • We will need to mention which variables are based directly on Siri or GTFS and which are based solely on other SiriRide variables (@EyalBerger)
  • Data types: We should add data types.
  • Naming: We will need to make sure that variable names are as they are in the siri sources, e.g. "service_id" is "trip_id_to_date". Who is familiar with the siri sources and can help me with that task?

@EyalBerger
Collaborator Author

I added data types and updated the dependencies (marking when a variable is based directly on Siri or GTFS) in the data design.

@AvivSela
Collaborator

Hi,
I looked at SIRI 2.8. It might take some time, but we will get there. It has some more fields that come "free of charge", without the need to calculate them. Here is an example of the JSON format:
ICD_SM_2_8_ver25.pdf

{
    "-version": "2.8",
    "ResponseTimestamp": "2020-10-16T06:32:30+03:00",
    "Status": "true",
    "MonitoredStopVisit": [
        {
            "RecordedAtTime": "2020-10-16T06:32:19+03:00",
            "ItemIdentifier": "1455075547",
            "MonitoringRef": "47507",
            "MonitoredVehicleJourney": {
                "LineRef": "28209",
                "DirectionRef": "1",
                "FramedVehicleJourneyRef": {
                    "DataFrameRef": "2020-10-16",
                    "DatedVehicleJourneyRef": "50698246"
                },
                "PublishedLineName": "52",
                "OperatorRef": "3",
                "DestinationRef": "47453",
                "OriginAimedDepartureTime": "2020-10-16T06:25:00+03:00",
                "VehicleLocation": {
                    "Longitude": "35.079803",
                    "Latitude": "32.823952"
                },
                "Bearing": "8",
                "Velocity": "29",
                "VehicleRef": "7576269",
                "MonitoredCall": {
                    "StopPointRef": "47507",
                    "Order": "26",
                    "ExpectedArrivalTime": "2020-10-16T06:49:00+03:00",
                    "DistanceFromStop": "4009"
                }
            }
        }
    ]
}

If I take those fields and combine them into one object that represents a ride holding a list of records with the observations over time, I get the following schema:

SiriRide
    LineRef: "Reference to a LINE"
    DirectionRef: "Reference to a DIRECTION the VEHICLE is running along the LINE"
    FramedVehicleJourneyRef_DataFrameRef: "The date part of the trip ID"
    FramedVehicleJourneyRef_DatedVehicleJourneyRef: "The number part of trip ID"
    PublishedLineName: "The bus number, as published on the bus"
    OperatorRef: "The Operator code"
    DestinationRef: "The destination stop code"
    VehicleRef: "Vehicle number. The value should match the license number of the Vehicle"
    OriginAimedDepartureTime: "The start time of the Journey, according to the licensing system. The value should match DepartureTime at TripIdToDate.txt file at the GTFS"
    SiriRecords
        ResponseTimestamp: "The time of the Response"
        RecordedAtTime: "Time at which data was recorded at the Vehicle"
        VehicleLocation
            Longitude: Longitude from Greenwich Meridian
            Latitude: Latitude from equator
        Bearing: "Vehicle bearing with respect to the North"
        Velocity: "Vehicle speed at Km/h."
        StopPointRef: "The stop code of the stop that the Vehicle is stopping at now, or recently visited"
        Order: "The stop order of the stop that the Vehicle is stopping at now, or recently visited"
        DistanceFromStop: "The distance that the Vehicle travelled from the start of the journey. in meters"
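
A minimal sketch of that regrouping, taking a response shaped like the JSON example above and producing ride objects with a nested SiriRecords list. The grouping key (trip date, trip number, vehicle) is an assumption about what identifies a single ride, and values are left as the strings that appear in the response.

```python
RIDE_FIELDS = [
    "LineRef", "DirectionRef", "PublishedLineName", "OperatorRef",
    "DestinationRef", "VehicleRef", "OriginAimedDepartureTime",
]


def group_into_rides(response):
    """Group the flat MonitoredStopVisit entries of one response into rides."""
    rides = {}
    for visit in response.get("MonitoredStopVisit", []):
        journey = visit["MonitoredVehicleJourney"]
        framed = journey["FramedVehicleJourneyRef"]
        # Assumed ride identity: trip date + trip number + vehicle.
        key = (framed["DataFrameRef"], framed["DatedVehicleJourneyRef"], journey["VehicleRef"])
        ride = rides.get(key)
        if ride is None:
            ride = {name: journey[name] for name in RIDE_FIELDS}
            ride["FramedVehicleJourneyRef_DataFrameRef"] = framed["DataFrameRef"]
            ride["FramedVehicleJourneyRef_DatedVehicleJourneyRef"] = framed["DatedVehicleJourneyRef"]
            ride["SiriRecords"] = []
            rides[key] = ride
        call = journey["MonitoredCall"]
        ride["SiriRecords"].append({
            "ResponseTimestamp": response["ResponseTimestamp"],
            "RecordedAtTime": visit["RecordedAtTime"],
            "VehicleLocation": journey["VehicleLocation"],
            "Bearing": journey["Bearing"],
            "Velocity": journey["Velocity"],
            "StopPointRef": call["StopPointRef"],
            "Order": call["Order"],
            "DistanceFromStop": call["DistanceFromStop"],
        })
    return list(rides.values())
```

Records from several responses could be merged the same way by reusing one `rides` dictionary across calls.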
 {
  "title": "SiriRide",
  "type": "object",
  "properties": {
    "LineRef": {
      "title": "Lineref",
      "description": "Reference to a LINE ",
      "type": "integer"
    },
    "DirectionRef": {
      "title": "Directionref",
      "description": "Reference to a DIRECTION the VEHICLE is running along the LINE",
      "type": "integer"
    },
    "FramedVehicleJourneyRef_DataFrameRef": {
      "title": "Framedvehiclejourneyref Dataframeref",
      "description": "The date part of the trip ID",
      "type": "string",
      "format": "date"
    },
    "FramedVehicleJourneyRef_DatedVehicleJourneyRef": {
      "title": "Framedvehiclejourneyref Datedvehiclejourneyref",
      "description": "The number part of trip ID",
      "type": "integer"
    },
    "PublishedLineName": {
      "title": "Publishedlinename",
      "description": "The bus number, as published on the bus",
      "type": "string"
    },
    "OperatorRef": {
      "title": "Operatorref",
      "description": "The Operator code",
      "type": "integer"
    },
    "DestinationRef": {
      "title": "Destinationref",
      "description": "The destination stop code",
      "type": "integer"
    },
    "VehicleRef": {
      "title": "Vehicleref",
      "description": "Vehicle number. The value should match the license number of the Vehicle",
      "type": "integer"
    },
    "OriginAimedDepartureTime": {
      "title": "Originaimeddeparturetime",
      "description": "The start time of the Journey, according to the licensing system. The value should match DepartureTime at TripIdToDate.txt file at the GTFS",
      "type": "string",
      "format": "date-time"
    },
    "SiriRecords": {
      "title": "Sirirecords",
      "description": "represent one observation on a vehicle over time",
      "type": "array",
      "items": {
        "$ref": "#/definitions/SiriRecord"
      }
    }
  },
  "required": [
    "LineRef",
    "DirectionRef",
    "FramedVehicleJourneyRef_DataFrameRef",
    "FramedVehicleJourneyRef_DatedVehicleJourneyRef",
    "PublishedLineName",
    "OperatorRef",
    "DestinationRef",
    "VehicleRef",
    "OriginAimedDepartureTime",
    "SiriRecords"
  ],
  "definitions": {
    "GeoPoint": {
      "title": "GeoPoint",
      "type": "object",
      "properties": {
        "Longitude": {
          "title": "Longitude",
          "description": "Longitude from Greenwich Meridian",
          "type": "number"
        },
        "Latitude": {
          "title": "Latitude",
          "description": "Latitude from equator",
          "type": "number"
        }
      },
      "required": [
        "Longitude",
        "Latitude"
      ]
    },
    "SiriRecord": {
      "title": "SiriRecord",
      "type": "object",
      "properties": {
        "ResponseTimestamp": {
          "title": "Responsetimestamp",
          "description": "The time of the Response",
          "type": "string",
          "format": "date-time"
        },
        "RecordedAtTime": {
          "title": "Recordedattime",
          "description": "Time at which data was recorded at the Vehicle",
          "type": "string",
          "format": "date-time"
        },
        "VehicleLocation": {
          "title": "Vehiclelocation",
          "description": "Vehicle Location",
          "allOf": [
            {
              "$ref": "#/definitions/GeoPoint"
            }
          ]
        },
        "Bearing": {
          "title": "Bearing",
          "description": "Vehicle bearing with respect to the North",
          "minimum": 0,
          "maximum": 360,
          "type": "integer"
        },
        "Velocity": {
          "title": "Velocity",
          "description": "Vehicle speed at Km/h.",
          "minimum": 0,
          "type": "integer"
        },
        "StopPointRef": {
          "title": "Stoppointref",
          "description": "The stop code of the stop that the Vehicle is stopping at now, or recently visited",
          "type": "integer"
        },
        "Order": {
          "title": "Order",
          "description": "The stop order of the stop that the Vehicle is stopping at now, or recently visited",
          "type": "integer"
        },
        "DistanceFromStop": {
          "title": "Distancefromstop",
          "description": "The distance that the Vehicle travelled from the start of the journey. in meters",
          "type": "integer"
        }
      },
      "required": [
        "ResponseTimestamp",
        "RecordedAtTime",
        "VehicleLocation",
        "Bearing",
        "Velocity",
        "StopPointRef",
        "Order",
        "DistanceFromStop"
      ]
    }
  }
}
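
One thing to note is that the JSON example carries every value as a string, while the schema above declares integers, numbers and date-times. Here is a hedged sketch of the conversion step for a single SiriRecord, applying the schema's range constraints on Bearing and Velocity. The function name and the choice to raise `ValueError` on out-of-range values are assumptions.

```python
from datetime import datetime


def parse_siri_record(raw):
    """Convert one string-valued record into the types the schema declares."""
    record = {
        "ResponseTimestamp": datetime.fromisoformat(raw["ResponseTimestamp"]),
        "RecordedAtTime": datetime.fromisoformat(raw["RecordedAtTime"]),
        "VehicleLocation": {
            "Longitude": float(raw["VehicleLocation"]["Longitude"]),
            "Latitude": float(raw["VehicleLocation"]["Latitude"]),
        },
        "Bearing": int(raw["Bearing"]),
        "Velocity": int(raw["Velocity"]),
        "StopPointRef": int(raw["StopPointRef"]),
        "Order": int(raw["Order"]),
        "DistanceFromStop": int(raw["DistanceFromStop"]),
    }
    # Range constraints taken from the schema: 0 <= Bearing <= 360, Velocity >= 0.
    if not 0 <= record["Bearing"] <= 360:
        raise ValueError("Bearing out of range: %d" % record["Bearing"])
    if record["Velocity"] < 0:
        raise ValueError("Velocity must be non-negative: %d" % record["Velocity"])
    return record
```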
