Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better SQL example! #3

Open
aadrian opened this issue May 17, 2015 · 5 comments
Open

Better SQL example! #3

aadrian opened this issue May 17, 2015 · 5 comments

Comments

@aadrian
Copy link

aadrian commented May 17, 2015

Please add a better and more realistic example using SQL (e.g. with a H2 Database) (for input and output too).

Thank you.

@MichaelDrogalis
Copy link
Contributor

We'll be taking a final pass over the examples for the next release. I'm happy to do another SQL example. What would be a good realistic use case, in your opinion?

@aadrian
Copy link
Author

aadrian commented May 18, 2015

What would be a good realistic use case, in your opinion?

As input, at least one DB schema with several tables (with relationships between them), than some non-trivial transformations, and as output another DB schema, with several tables (and relationships).

Basically that's what ETLs most of the time are used for.

I mentioned H2 since it's Java based, no setup required and it's also one of the fastest DBs (of course, not with the in-memory mode since that's not really for "production") http://www.h2database.com/

@MichaelDrogalis
Copy link
Contributor

I'll have to prod you once more for a suggestion. I'm juggling a lot of things right now so every bit helps.

What would be a helpful non-trivial transformation? In my ETL experience, most of the data I was moving was copied through from point A to B. H2 sounds good though.

@aadrian
Copy link
Author

aadrian commented May 18, 2015

What would be a helpful non-trivial transformation?

Example of transformations quite often required:

  • anonymizing column values (e.g. user email or username), across many tables
  • finding relationships (foreign keys), e.g. when the source application is not using foreign keys e.g. RubyOnRails apps.
  • splitting or merging multi-tenant data.
  • schema migrations that involve transformations, e.g. type changes, merging, refactoring
  • also the transformations required to achieve the stuff I listed below:

In my ETL experience, most of the data I was moving was copied through from point A to B

Well, except that:

  • A consists of many tables and relationships, and B too; data might be circular, or A and B are of different DB types, so using the DB specific dump/backup tool won't do the trick.
  • The names however must not always the same (e.g. in schema B there are other names for some of the tables and the columns),
  • The types of columns in some tables of A and B must also not be 100% the same, and also the sizes might differ.
  • or B is only a subset of A, e..g some columns are not allowed in B.
  • or B is only a subset of A, e..g some rows with their according relationships propagated are missing, e.g. based on some blacklists or patterns.
  • or B is a superset of A, e.g. it can contain extra columns for historization, or there is A1 A2 and A3 and all ore merged into B
    • Data cleanup e.g. when an application is producing errors after a while in A (e.g. when there are not enough validations, or bugs), than it's cleaned up in B and played back in A if everything seems OK.
  • combinations of the above.

Those are quite usual scenarios that need to be done all the time in most applications I've encountered, do not involve terabyes of data, at most a few hundred gigabyte, so it's also expected to be run on a single machine.

@MichaelDrogalis
Copy link
Contributor

Okay, perhaps we can try some of these. I can't address this in the short term (e.g. next 2-3 weeks), but maybe someone else will be able to pick it up before I can. If you want to give it a shot, send a PR in and we can help fix any problems. Thanks for the suggestion!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants