Below is a prioritized list of data queries that we believe to be basic and important as part of an API MVP.
Please edit/comment to include missing queries and re-prioritize.
The guiding principles for determining a query's priority are: how basic and simple it is (e.g., could it be performed with the raw database tables and columns?), how relevant it is for data quality checks, and how relevant it is for comparing schedules to real rides.
A few "user stories" to have in mind as examples:
QA checks on the number of non-zero GPS points for rides of a certain agency in a certain time window
A user who wants to understand where the bus she waited too long for yesterday actually was
A user who wants to figure out how long a specific bus ride usually takes in a certain time window (e.g., line 5 from X to Y on Sunday mornings)
A user who wants to compare bus ride durations across time
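As a rough illustration of the QA and ride-duration stories, here is a minimal pandas sketch. It assumes the location points are already loaded into a DataFrame with the hypothetical columns `ride_id`, `recorded_at`, `lat` and `lon`. These names are placeholders and not the actual schema. It also assumes a point counts as "non-zero" only when both coordinates are non-zero.

```python
import pandas as pd


def nonzero_point_counts(locations: pd.DataFrame) -> pd.Series:
    """Count, per ride, the location points whose lat and lon are both non-zero."""
    # Assumption: a point is "valid" only if neither coordinate equals 0.
    is_valid = (locations["lat"] != 0) & (locations["lon"] != 0)
    # Summing booleans per ride gives the number of valid points,
    # and rides with no valid points still appear with a count of 0.
    return is_valid.groupby(locations["ride_id"]).sum()


def ride_durations(locations: pd.DataFrame) -> pd.Series:
    """Approximate each ride's duration as last timestamp minus first timestamp."""
    times = locations.groupby("ride_id")["recorded_at"]
    return times.max() - times.min()
```

Filtering by agency and time window would happen before these helpers are called, for example on the server side of the API.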
List of data queries:
get all gtfs line numbers (route_short_name in gtfs) according to some parametrization (can be part or all of the following):
agency
time frame
geographic location
scheduled to start/end/pass through a specific stop/polygon/city/region
get all gtfs/siri route ids according to some parametrization (can be part or all of the following):
agency
time frame
scheduled to start/end/pass through a specific stop/polygon/city/region (relevant to gtfs data; for siri data this requires more cleaning and some algorithmic work)
bus line (route_short_name in gtfs)
direction (most lines have 2 directions)
get all siri rides according to some parametrization (can be part or all of the following):
agency
route id / list of route ids
time frame
bus line (route_short_name in gtfs)
direction (most lines have 2 directions)
cities (by stops data for now)
stops
get all bus locations according to some parametrization (can be part or all of the following):
list of siri rides
agency
route id / list of route ids
time frame
bus line (route_short_name in gtfs)
direction (most lines have 2 directions)
cities (by stops data for now)
stops
option to drop locations with lat/lon == 0
option to keep only first/last X location points for each ride (chronologically ordered)
get all stops (stop id, name, location) according to geographic/description parametrization (can be part or all of the following):
within a specific polygon
within some radius from specific lat/lon
cities (by stops data for now)
for all queries that return data (see above), allow an option to perform data aggregation instead of returning the actual data: count, sum, average, etc. with a group by capability
for all queries that return data (see above), allow an option to get specific columns or to get all the available columns
for all queries that return data (see above), allow an option to perform simple operations on the parameters (=, >, <, <>, >=, <=, between, in/not in a list, is/is not null)
for all queries that return data (see above), allow an option to limit the number of results returned, as requested by the user
all queries should output a Pandas DataFrame
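To make the cross-cutting options above more concrete, here is a minimal client-side sketch in pandas. It is an illustration, not a proposed implementation: the function names `filter_locations` and `aggregate`, their parameters, and the column names `ride_id`, `recorded_at`, `lat` and `lon` are all hypothetical. It covers dropping zero coordinates, keeping the first/last X points per ride, column selection, a result limit, and a simple group-by aggregation.

```python
from typing import Optional, Sequence

import pandas as pd


def filter_locations(
    locations: pd.DataFrame,
    drop_zero: bool = False,
    first_n: Optional[int] = None,
    last_n: Optional[int] = None,
    columns: Optional[Sequence[str]] = None,
    limit: Optional[int] = None,
) -> pd.DataFrame:
    """Apply the optional filters to a DataFrame of bus location points."""
    # Order chronologically within each ride and give every row a unique index.
    result = locations.sort_values(["ride_id", "recorded_at"]).reset_index(drop=True)
    if drop_zero:
        # Assumption: drop a point if either coordinate equals 0.
        result = result[(result["lat"] != 0) & (result["lon"] != 0)]
    if first_n is not None or last_n is not None:
        by_ride = result.groupby("ride_id")
        parts = []
        if first_n is not None:
            parts.append(by_ride.head(first_n))
        if last_n is not None:
            parts.append(by_ride.tail(last_n))
        result = pd.concat(parts)
        # A short ride can appear in both head and tail, so keep each row once.
        result = result[~result.index.duplicated()]
        result = result.sort_values(["ride_id", "recorded_at"])
    if columns is not None:
        result = result[list(columns)]
    if limit is not None:
        result = result.head(limit)
    return result.reset_index(drop=True)


def aggregate(
    data: pd.DataFrame, group_by: Sequence[str], column: str, how: str = "count"
) -> pd.DataFrame:
    """Return an aggregate (count, sum, mean, ...) of one column per group."""
    return data.groupby(list(group_by))[column].agg(how).reset_index()
```

For example, `filter_locations(df, drop_zero=True, first_n=5)` would keep the first five non-zero points of each ride, and `aggregate(df, ["ride_id"], "lat", "count")` would return one row per ride with the number of points. Both return a pandas DataFrame, in line with the last requirement.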
I think we should focus on user stories first, and then determine what is needed to answer them.
There is no point in providing full database access: the team can already access the DB directly and run SQL queries. The API should provide aggregated/digested data according to the required user stories.