Data queries to be added to MVP API #352

Open
39 tasks
adiwaz opened this issue Nov 4, 2021 · 1 comment

Comments

@adiwaz
Collaborator

adiwaz commented Nov 4, 2021

Below is a prioritized list of data queries that we believe to be basic and important as part of an API MVP.
Please edit/comment to include missing queries and re-prioritize.

The guiding principles for determining a query's priority are: how basic and simple it is (e.g., could it be performed with the raw database tables and columns?), how relevant it is for data quality checks, and how relevant it is for comparing schedules to real rides.

A few "user stories" to have in mind as examples:

  • QA checks for the number of non-zero GPS points for rides of a certain agency in a certain time window
  • A user who wants to understand where the bus she waited too long for yesterday actually was
  • A user who wants to figure out how long a specific bus ride usually takes in a certain time window (line 5 from X to Y on Sunday mornings)
  • A user who wants to compare bus ride durations over time
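As a rough illustration of the first and third stories, here is a minimal Pandas sketch. The column names (`ride_id`, `agency`, `recorded_at`, `lat`, `lon`), the sample rows, and both helper functions are hypothetical, not the project's actual schema or API. It also assumes a point counts as "zero" only when both lat and lon are 0, and that a ride's duration can be approximated as last minus first recorded point.

```python
import pandas as pd

# Hypothetical sample of bus location points; all column names are assumptions.
locations = pd.DataFrame(
    {
        "ride_id": [1, 1, 1, 2, 2],
        "agency": ["A", "A", "A", "A", "A"],
        "recorded_at": pd.to_datetime(
            [
                "2021-11-03 08:00",
                "2021-11-03 08:10",
                "2021-11-03 08:25",
                "2021-11-03 09:00",
                "2021-11-03 09:40",
            ]
        ),
        "lat": [32.08, 0.0, 32.10, 32.08, 32.11],
        "lon": [34.78, 0.0, 34.80, 34.78, 34.81],
    }
)


def count_nonzero_points(df, agency, start, end):
    """Count points with non-zero coordinates for one agency in [start, end)."""
    in_window = (df["recorded_at"] >= start) & (df["recorded_at"] < end)
    is_zero = (df["lat"] == 0) & (df["lon"] == 0)
    return int((in_window & ~is_zero & (df["agency"] == agency)).sum())


def ride_durations(df):
    """Approximate each ride's duration as last minus first recorded point."""
    grouped = df.groupby("ride_id")["recorded_at"]
    return grouped.max() - grouped.min()


nonzero_count = count_nonzero_points(
    locations, "A", pd.Timestamp("2021-11-03 08:00"), pd.Timestamp("2021-11-03 09:00")
)
durations = ride_durations(locations)
```

Comparing durations over time would then be a matter of grouping `durations` by date or hour of the ride's first point.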

List of data queries:

  • get all gtfs line numbers (route_short_name in gtfs) according to some parametrization (can be part or all of the following):

    • agency
    • time frame
    • geographic location
    • scheduled to start/end/pass through a specific stop/polygon/city/region
  • get all gtfs/siri route ids according to some parametrization (can be part or all of the following):

    • agency
    • time frame
    • scheduled to start/end/pass through a specific stop/polygon/city/region (relevant to gtfs data; for siri data this requires more cleaning and some algorithmic work)
    • bus line (route_short_name in gtfs)
    • direction (most lines have 2 directions)
  • get all siri rides according to some parametrization (can be part or all of the following):

    • agency
    • route id / list of route ids
    • time frame
    • bus line (route_short_name in gtfs)
    • direction (most lines have 2 directions)
    • cities (by stops data for now)
    • stops
  • get all bus locations according to some parametrization (can be part or all of the following):

    • list of siri rides
    • agency
    • route id / list of route ids
    • time frame
    • bus line (route_short_name in gtfs)
    • direction (most lines have 2 directions)
    • cities (by stops data for now)
    • stops
    • option to drop locations with lat/lon == 0
    • option to keep only first/last X location points for each ride (chronologically ordered)
  • get all stops (stop id, name, location) according to geographic/description parametrization (can be part or all of the following):

    • within a specific polygon
    • within some radius from specific lat/lon
    • cities (by stops data for now)
  • for all queries that return data (see above), allow an option to perform data aggregation instead of returning the actual data: count, sum, average, etc. with a group by capability

  • for all queries that return data (see above), allow an option to get specific columns or to get all the available columns

  • for all queries that return data (see above), allow an option to perform simple operations on the parameters (=, >, <, <>, >=, <=, between, in/not in a list, is/is not null)

  • for all queries that return data (see above), allow an option to limit the number of results at the user's request

  • all queries should output a Pandas DataFrame
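To make the cross-cutting options above more concrete (drop zero locations, keep the first/last X points per ride, filter with simple operators, choose columns, aggregate with group by, limit results), here is a minimal sketch of how they might combine in one function that returns a Pandas DataFrame. The function `query_locations`, its parameters, and the column names are all hypothetical. It uses a `DataFrame.query` expression string as a stand-in for the operator support, and it assumes "zero" means both lat and lon equal 0.

```python
import pandas as pd


def query_locations(df, where=None, drop_zero=False, first_n=None, last_n=None,
                    group_by=None, agg=None, columns=None, limit=None):
    """Hypothetical query helper; always returns a DataFrame."""
    out = df
    if drop_zero:
        # Drop points where both coordinates are 0.
        out = out[~((out["lat"] == 0) & (out["lon"] == 0))]
    if where is not None:
        # Simple operators (=, >, <, in, ...) expressed as a pandas query string.
        out = out.query(where)
    if first_n is not None or last_n is not None:
        # Order chronologically within each ride, then keep the edge points.
        ordered = out.sort_values(["ride_id", "recorded_at"]).reset_index(drop=True)
        by_ride = ordered.groupby("ride_id")
        parts = []
        if first_n is not None:
            parts.append(by_ride.head(first_n))
        if last_n is not None:
            parts.append(by_ride.tail(last_n))
        out = pd.concat(parts)
        # A short ride can contribute the same row twice; keep it once.
        out = out[~out.index.duplicated()].sort_index()
    if group_by is not None and agg is not None:
        # Aggregate instead of returning the raw rows.
        out = out.groupby(group_by).agg(agg).reset_index()
    if columns is not None:
        out = out[columns]
    if limit is not None:
        out = out.head(limit)
    return out


# Hypothetical sample data for a quick check.
sample = pd.DataFrame(
    {
        "ride_id": [1, 1, 1, 2, 2],
        "recorded_at": pd.to_datetime(
            ["2021-11-03 08:00", "2021-11-03 08:10", "2021-11-03 08:25",
             "2021-11-03 09:00", "2021-11-03 09:40"]
        ),
        "lat": [32.08, 0.0, 32.10, 32.08, 32.11],
        "lon": [34.78, 0.0, 34.80, 34.78, 34.81],
    }
)

points_per_ride = query_locations(
    sample, drop_zero=True, group_by="ride_id", agg={"recorded_at": "count"}
)
```

Applying the options in a fixed order (filter, trim per ride, aggregate, select columns, limit) is one possible design choice; a real API would need to decide whether, for example, the limit applies before or after aggregation.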

@OriHoch

OriHoch commented Nov 23, 2021

I think we should focus on user stories, and then determine what is needed to answer them.

There is no point in providing full database access; the team can access the DB directly and run SQL queries. The API should provide aggregated/digested data according to the required user stories.


2 participants