Skip to content

Add Pipeline Scheduler #1522

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 23 commits into
base: main
Choose a base branch
from
Open

Conversation

goodwillpunning
Copy link
Contributor

This commit adds system files for scheduling EDW usage collection in Mac and Linux operating systems and a new class, PipelineScheduler, which parses a pipeline definition and executes steps based upon the defined frequency.

Copy link

github-actions bot commented Apr 23, 2025

❌ 10/16 passed, 6 failed, 1 skipped, 1m11s total

❌ test_days_since_last_run: TypeError: unsupported operand type(s) for //: 'datetime.datetime' and 'int' (2ms)
TypeError: unsupported operand type(s) for //: 'datetime.datetime' and 'int'
[gw4] linux -- Python 3.10.17 /home/runner/work/remorph/remorph/.venv/bin/python
[gw4] linux -- Python 3.10.17 /home/runner/work/remorph/remorph/.venv/bin/python
❌ test_hours_since_last_run: TypeError: unsupported operand type(s) for //: 'datetime.datetime' and 'int' (13ms)
TypeError: unsupported operand type(s) for //: 'datetime.datetime' and 'int'
[gw3] linux -- Python 3.10.17 /home/runner/work/remorph/remorph/.venv/bin/python
[gw3] linux -- Python 3.10.17 /home/runner/work/remorph/remorph/.venv/bin/python
❌ test_pipeline_scheduler: FileNotFoundError: Credentials file not found at /home/runner/.databricks/labs/remorph/.credentials.yml (22ms)
FileNotFoundError: Credentials file not found at /home/runner/.databricks/labs/remorph/.credentials.yml
[gw5] linux -- Python 3.10.17 /home/runner/work/remorph/remorph/.venv/bin/python
❌ test_run_python_failure_pipeline: Failed: DID NOT RAISE (254ms)
Failed: DID NOT RAISE <class 'RuntimeError'>
[gw2] linux -- Python 3.10.17 /home/runner/work/remorph/remorph/.venv/bin/python
14:40 ERROR [root] Python script failed: {"status": "error", "message": "This script is designed to fail"}

14:40 ERROR [root] Script execution failed: {"status": "error", "message": "This script is designed to fail"}
14:40 ERROR [root] Python script failed: {"status": "error", "message": "This script is designed to fail"}

14:40 ERROR [root] Script execution failed: {"status": "error", "message": "This script is designed to fail"}
[gw2] linux -- Python 3.10.17 /home/runner/work/remorph/remorph/.venv/bin/python
❌ test_run_sql_failure_pipeline: Failed: DID NOT RAISE (1.489s)
Failed: DID NOT RAISE <class 'RuntimeError'>
[gw0] linux -- Python 3.10.17 /home/runner/work/remorph/remorph/.venv/bin/python
14:40 ERROR [root] SQL execution failed: (pyodbc.ProgrammingError) ('42S02', "[42S02] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Invalid object name 'non_existent_table'. (208) (SQLExecDirectW)")
[SQL: SELECT * FROM non_existent_table;
]
(Background on this error at: https://sqlalche.me/e/20/f405)
14:40 ERROR [root] SQL execution failed: (pyodbc.ProgrammingError) ('42S02', "[42S02] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Invalid object name 'non_existent_table'. (208) (SQLExecDirectW)")
[SQL: SELECT * FROM non_existent_table;
]
(Background on this error at: https://sqlalche.me/e/20/f405)
14:40 ERROR [root] SQL execution failed: (pyodbc.ProgrammingError) ('42S02', "[42S02] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Invalid object name 'non_existent_table'. (208) (SQLExecDirectW)")
[SQL: SELECT * FROM non_existent_table;
]
(Background on this error at: https://sqlalche.me/e/20/f405)
14:40 ERROR [root] SQL execution failed: (pyodbc.ProgrammingError) ('42S02', "[42S02] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Invalid object name 'non_existent_table'. (208) (SQLExecDirectW)")
[SQL: SELECT * FROM non_existent_table;
]
(Background on this error at: https://sqlalche.me/e/20/f405)
[gw0] linux -- Python 3.10.17 /home/runner/work/remorph/remorph/.venv/bin/python
❌ test_gets_maven_version: AttributeError: 'NoneType' object has no attribute 'split' (1m0.18s)
AttributeError: 'NoneType' object has no attribute 'split'
[gw2] linux -- Python 3.10.17 /home/runner/work/remorph/remorph/.venv/bin/python
[gw2] linux -- Python 3.10.17 /home/runner/work/remorph/remorph/.venv/bin/python

Running from acceptance #488

Copy link
Collaborator

@sundarshankar89 sundarshankar89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good few nits

3. Next, load the process by executing the following command:

```bash
$ launchctl load ~/Library/LaunchAgents/com.remorph.usagecollection.plist
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this work, in sequio and above launchctl bootstrap (gui/$id) <> is the way to go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to work. I've been running this process on my Mac for over a month now. Happy to update to another approach if you feel it's more appropriate.

@@ -28,16 +35,26 @@ def execute(self):
for step in self.config.steps:
if step.flag == "active":
logging.debug(f"Executing step: {step.name}")
self._execute_step(step)
self.execute_step(step)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you expect to people to run execute and specific step execute_step?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I meant to sync with you. I had to make this a public function so that the scheduler or another process can parse the pipeline config and execute the step based upon the defined frequency. What do you think? I know you made this private so that the pipeline class can execute entirely. But there needs to be a hook for the scheduler class or the pipeline class needs to parse the scheduled frequency info.

<key>ProgramArguments</key>
<array>
<string>/Library/Frameworks/Python.framework/Versions/3.11/bin/python3</string>
<string>~/Downloads/remorph/scheduler/usage_collector.py</string>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be $HOME/.databricks/labs/remorph/....

Copy link
Contributor

@asnare asnare left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good start, and I understand its purpose. That said, there are still some rough edges to smooth out.

On the installation side:

  • We need to be a bit clearer on the pre-requisites for both Linux and macOS.

    • Linux: systemd
    • macOS: Although we can assume that python is available somewhere, referring to it in the launch agent is an issue. (We can't use /usr/bin/python3: that's Python 3.9 which we don't test against or support.) It should be possible to find it by doing the equivalent of /usr/bin/env python3.11 (an old #! trick for Python scripts).
  • The install files all assume that remorph has been installed in a specific location due to fixed paths. The user probably has to edit the files.

  • With all this in mind, I think that rather than having these instructions it might be better to have an installer script. To allow progress I'm happy to accept this as-is (if the instructions are clear for now) and iterate to implement this.

  • Uninstallation is also important: people won't want this forever.

On the Python side, I think it mostly looks fine although I'd really prefer to see timezone-aware timestamps being used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These instructions are clear, although I think that we also need to describe how to remove it and clean up afterwards.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point. I didn't even think about uninstallation.

Comment on lines +24 to +28
4. Grant the usage collection script with execution permissions:

```bash
$ chmod +x ./src/databricks/labs/remorph/assessments/scheduler/usage_collector.py
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be earlier… step 2?

Otherwise I think the first run will already be triggered (and fail).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch. Thanks


1. Open a new Terminal window and navigate to the remorph repository

2. Copy the plist file to `~/Library/LaunchAgents/` for a user-specific task or `/Library/LaunchDaemons/` for a system-wide task:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most users won't know the difference, I don't think?

My advice here is to advise one of these, or help the user choose.

(My preference: user-specific, given the rest of the instructions assume this.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, totally agree. I'll update the PR to install only a user-specific agent.

User=root

[Install]
WantedBy=multi-user.target
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EOL at EOF.


<key>ProgramArguments</key>
<array>
<string>/Library/Frameworks/Python.framework/Versions/3.11/bin/python3</string>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't something you can assume, unfortunately:

% ls -l /Library/Frameworks/Python.framework/Versions/3.11/bin/python3
/Library/Frameworks/Python.framework/Versions/3.11/bin/python3: No such file or directory
% sw_vers
ProductName:            macOS
ProductVersion:         15.4.1
BuildVersion:           24E263

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, great catch. Will update this to be dynamic.

@gueniai
Copy link
Collaborator

gueniai commented May 2, 2025

@goodwillpunning there are some unsigned commits at the beginning, which means you'll have to recreate this one :/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants