
New Components - scrapeless #16712


Open · wants to merge 5 commits into master

Conversation

@luancazarine (Collaborator) commented May 19, 2025

Resolves #16673.

Summary by CodeRabbit

  • New Features

    • Added the ability to submit web scraping jobs using the Scrapeless platform, with options for target URL, proxy country, and advanced configurations.
    • Introduced an action to retrieve results of completed scraping jobs.
    • Provided a comprehensive list of scraper actor options for easier selection.
  • Improvements

    • Enhanced the Scrapeless integration with a fully implemented API client, streamlining job submission and result retrieval.
    • Added a utility to parse JSON strings into objects for flexible input handling.
  • Other

    • Updated internal dependencies and versioning for improved stability.

@luancazarine added the ai-assisted label (Content generated by AI, with human refinement and modification) on May 19, 2025
vercel bot commented May 19, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

3 Skipped Deployments

Name | Status | Updated (UTC)
docs-v2 | ⬜️ Ignored | May 20, 2025 1:22pm
pipedream-docs | ⬜️ Ignored | May 20, 2025 1:22pm
pipedream-docs-redirect-do-not-edit | ⬜️ Ignored | May 20, 2025 1:22pm

coderabbitai bot (Contributor) commented May 19, 2025

Walkthrough

This update introduces a new Scrapeless component with two actions: one to submit a web scraping job via the Scrapeless API and one to retrieve its results. It implements a full API client, utility functions, and a list of selectable actor options, and updates the package configuration.

Changes

File(s) | Change Summary
components/scrapeless/actions/get-scrape-result/get-scrape-result.mjs | Added action to retrieve scraping job results by job ID from Scrapeless.
components/scrapeless/actions/submit-scrape-job/submit-scrape-job.mjs | Added action to submit new scraping jobs with configurable parameters to Scrapeless.
components/scrapeless/common/constants.mjs | Introduced ACTOR_OPTIONS array for selectable scraper actors.
components/scrapeless/common/utils.mjs | Added parseObject utility for robust JSON/object parsing.
components/scrapeless/scrapeless.app.mjs | Implemented Scrapeless API client with methods for submitting jobs and retrieving results; refactored structure.
components/scrapeless/package.json | Bumped version to 0.1.0, added dependency on @pipedream/platform, fixed publishConfig brace.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant SubmitAction as Submit Scrape Job Action
    participant GetResultAction as Get Scrape Result Action
    participant ScrapelessApp as Scrapeless App
    participant ScrapelessAPI as Scrapeless API

    User->>SubmitAction: Provide job details (actor, URL, etc.)
    SubmitAction->>ScrapelessApp: submitScrapeJob(params)
    ScrapelessApp->>ScrapelessAPI: POST /scraper/request
    ScrapelessAPI-->>ScrapelessApp: Return job ID
    ScrapelessApp-->>SubmitAction: Return job ID
    SubmitAction-->>User: Return job ID

    User->>GetResultAction: Provide scrapeJobId
    GetResultAction->>ScrapelessApp: getScrapeResult({ scrapeJobId })
    ScrapelessApp->>ScrapelessAPI: GET /scraper/result/{scrapeJobId}
    ScrapelessAPI-->>ScrapelessApp: Return scrape result
    ScrapelessApp-->>GetResultAction: Return result
    GetResultAction-->>User: Return result
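
For illustration, here is a minimal sketch of the same two-step flow against the HTTP API directly, outside of Pipedream. The endpoint paths, the x-api-token header, the async flag, and the taskId field all appear elsewhere in this PR; the actor's input shape is an assumption.

  // Sketch only: submit a job, then fetch its result by task ID.
  const BASE_URL = "https://api.scrapeless.com/api/v1";
  const headers = {
    "x-api-token": process.env.SCRAPELESS_API_KEY,
    "Content-Type": "application/json",
  };

  // 1. Submit the job asynchronously; the API returns a task ID
  //    instead of blocking until the scrape completes.
  const submitRes = await fetch(`${BASE_URL}/scraper/request`, {
    method: "POST",
    headers,
    body: JSON.stringify({
      actor: "scraper.google.flights",
      input: { url: "https://www.google.com/travel/flights" }, // input shape is an assumption
      async: true,
    }),
  });
  const { taskId } = await submitRes.json();

  // 2. Retrieve the result once the job has completed.
  const resultRes = await fetch(`${BASE_URL}/scraper/result/${taskId}`, { headers });
  console.log(await resultRes.json());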

Assessment against linked issues

Objective | Addressed | Explanation
Implement submit-scrape-job action to submit new web scraping jobs (#16673) | ✅ |
Implement get-scrape-result action to retrieve completed job results (#16673) | ✅ |

Poem

A bunny with code in its paws,
Built scrapers without any flaws.
Submit a job, then wait and see—
Results retrieved, as quick as can be!
With actors and helpers, all neatly arrayed,
The Scrapeless component is now well displayed.
🐇✨

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

The same error was reported for all three files below:

  • components/scrapeless/actions/get-scrape-result/get-scrape-result.mjs
  • components/scrapeless/actions/submit-scrape-job/submit-scrape-job.mjs
  • components/scrapeless/common/constants.mjs

Oops! Something went wrong! :(

ESLint: 8.57.1

Error [ERR_MODULE_NOT_FOUND]: Cannot find package 'jsonc-eslint-parser' imported from /eslint.config.mjs
at Object.getPackageJSONURL (node:internal/modules/package_json_reader:255:9)
at packageResolve (node:internal/modules/esm/resolve:767:81)
at moduleResolve (node:internal/modules/esm/resolve:853:18)
at defaultResolve (node:internal/modules/esm/resolve:983:11)
at ModuleLoader.defaultResolve (node:internal/modules/esm/loader:799:12)
at #cachedDefaultResolve (node:internal/modules/esm/loader:723:25)
at ModuleLoader.resolve (node:internal/modules/esm/loader:706:38)
at ModuleLoader.getModuleJobForImport (node:internal/modules/esm/loader:307:38)
at #link (node:internal/modules/esm/module_job:163:49)
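
If the failure really is a missing dev dependency, as the warning above suggests, the usual remedy is to install jsonc-eslint-parser at the workspace root (for example, pnpm add -D -w jsonc-eslint-parser, since this repo tracks a pnpm-lock.yaml), which adds an entry like the following to the root package.json. The version shown is illustrative:

  {
    "devDependencies": {
      "jsonc-eslint-parser": "^2.4.0"
    }
  }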


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between bf9311e and b70c366.

📒 Files selected for processing (3)
  • components/scrapeless/actions/get-scrape-result/get-scrape-result.mjs (1 hunks)
  • components/scrapeless/actions/submit-scrape-job/submit-scrape-job.mjs (1 hunks)
  • components/scrapeless/common/constants.mjs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • components/scrapeless/actions/get-scrape-result/get-scrape-result.mjs
  • components/scrapeless/common/constants.mjs
  • components/scrapeless/actions/submit-scrape-job/submit-scrape-job.mjs
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: pnpm publish
  • GitHub Check: Publish TypeScript components
  • GitHub Check: Lint Code Base
  • GitHub Check: Verify TypeScript components


Actions
 - Submit Scrape Job
 - Get Scrape Result
@luancazarine luancazarine marked this pull request as ready for review May 19, 2025 20:09
@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 5

🧹 Nitpick comments (11)
components/scrapeless/common/utils.mjs (1)

1-24: Utility function could benefit from JSDoc documentation

This utility function seems well-structured and handles various input types gracefully. It correctly returns the parsed JSON object or falls back to the original value when parsing fails. However, adding JSDoc documentation would improve clarity for other developers.

+/**
+ * Attempts to parse JSON strings into JavaScript objects
+ * @param {any} obj - The input to parse (string, array, or any other type)
+ * @returns {any} - Parsed object or original input if parsing fails
+ */
 export const parseObject = (obj) => {
   if (!obj) return undefined;

   if (Array.isArray(obj)) {
     return obj.map((item) => {
       if (typeof item === "string") {
         try {
           return JSON.parse(item);
         } catch (e) {
           return item;
         }
       }
       return item;
     });
   }
   if (typeof obj === "string") {
     try {
       return JSON.parse(obj);
     } catch (e) {
       return obj;
     }
   }
   return obj;
 };
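
For context, here are a few illustrative calls against the implementation shown above, with the values they would return:

  parseObject('{"country": "US"}');   // => { country: "US" }
  parseObject("not json");            // => "not json" (returned unchanged)
  parseObject(['{"a": 1}', "plain"]); // => [ { a: 1 }, "plain" ]
  parseObject(undefined);             // => undefined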
components/scrapeless/common/constants.mjs (2)

67-73: Fix typos in Google Flights labels

There are capitalization errors in "FLights" which should be "Flights".

   {
-    label: "Google FLights",
+    label: "Google Flights",
     value: "scraper.google.flights",
   },
   {
-    label: "Google FLights Chart",
+    label: "Google Flights Chart",
     value: "scraper.google.flights.chart",
   },

1-138: Consider organizing ACTOR_OPTIONS by category for better maintainability

The current list of actor options appears to be organized somewhat randomly. Consider grouping related scrapers together (e.g., all Google services, all e-commerce platforms, etc.) to improve readability and maintainability.

You could organize the array by grouping similar scrapers together, for example:

  1. E-commerce (Shopee, Amazon, Temu)
  2. Brazilian sites
  3. Airlines
  4. Electronics distributors
  5. Google services (grouped by type)

This would make the list easier to navigate and maintain as new scrapers are added.
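
A minimal sketch of one way to express that grouping in constants.mjs while keeping the flat ACTOR_OPTIONS export the actions already consume. The group names and layout are illustrative; only the two Google Flights entries come from this PR:

  // Group related actors, then flatten into the existing export.
  const GOOGLE_ACTORS = [
    { label: "Google Flights", value: "scraper.google.flights" },
    { label: "Google Flights Chart", value: "scraper.google.flights.chart" },
  ];

  const ECOMMERCE_ACTORS = [
    // Shopee, Amazon, Temu entries would live here
  ];

  export const ACTOR_OPTIONS = [
    ...ECOMMERCE_ACTORS,
    ...GOOGLE_ACTORS,
    // ...remaining categories
  ];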

components/scrapeless/actions/get-scrape-result/get-scrape-result.mjs (2)

7-8: Consider aligning component version with package version

The action version is set to "0.0.1" while the package.json was updated to "0.1.0". For consistency, consider aligning these versions.

   description: "Retrieve the result of a completed scraping job. [See the documentation](https://apidocs.scrapeless.com/api-11949853)",
-  version: "0.0.1",
+  version: "0.1.0",
   type: "action",

14-15: Extra space at beginning of description text

There's an extra space at the beginning of the description text.

     type: "string",
     label: "Scrape Job ID",
-    description: " The ID of the scrape job you want to retrieve results for. This ID is provided when you submit a scrape job.",
+    description: "The ID of the scrape job you want to retrieve results for. This ID is provided when you submit a scrape job.",
   },
components/scrapeless/actions/submit-scrape-job/submit-scrape-job.mjs (2)

41-41: Check for existence of additionalInput before parsing

The parseObject function is called without first checking if this.additionalInput exists. While this might work if parseObject handles null/undefined values, it would be more robust to check for existence first.

-      input: parseObject(this.additionalInput),
+      input: this.additionalInput ? parseObject(this.additionalInput) : {},

42-42: Add a comment explaining the 'async' parameter

The async: true parameter might not be self-explanatory. Consider adding a comment to explain its purpose in the API request.

       actor: this.actor,
       input: parseObject(this.additionalInput),
-      async: true,
+      async: true, // Process the request asynchronously, so we get a task ID instead of waiting for results
components/scrapeless/scrapeless.app.mjs (4)

12-12: Simplify the API token header value

Template literals are unnecessary when only using a single variable. You can directly reference the variable without string interpolation.

-        "x-api-token": `${this.$auth.api_key}`,
+        "x-api-token": this.$auth.api_key,

31-35: Add method parameter documentation

The getScrapeResult method takes a scrapeJobId parameter but lacks documentation. Consider adding JSDoc comments to explain the parameter requirements.

+    /**
+     * Get the result of a scrape job
+     * @param {Object} opts - The request options
+     * @param {string} opts.scrapeJobId - The ID of the scrape job to retrieve
+     * @returns {Promise<Object>} The scrape job result
+     */
     getScrapeResult({ scrapeJobId }) {
       return this._makeRequest({
         path: `/scraper/result/${scrapeJobId}`,
       });
     },

24-30: Add method parameter documentation for submitScrapeJob

Similar to getScrapeResult, the submitScrapeJob method could benefit from JSDoc comments explaining the expected parameters and return value.

+    /**
+     * Submit a new scrape job
+     * @param {Object} opts - The request options
+     * @param {Object} opts.data - The scrape job parameters
+     * @param {string} opts.data.actor - The actor to use for the scrape job
+     * @param {Object} opts.data.input - Input parameters for the scrape job
+     * @param {boolean} opts.data.async - Whether to process the request asynchronously
+     * @returns {Promise<Object>} The created scrape job, including taskId
+     */
     submitScrapeJob(opts = {}) {
       return this._makeRequest({
         method: "POST",
         path: "/scraper/request",
         ...opts,
       });
     },

7-9: Add environment variable support for API URL

Consider making the base URL configurable to support different environments (e.g., development, testing, production).

     _baseUrl() {
-      return "https://api.scrapeless.com/api/v1";
+      return this.$auth.base_url || "https://api.scrapeless.com/api/v1";
     },
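
The comments above reference a _makeRequest helper whose body is not shown in this excerpt. A minimal sketch of how such a helper is conventionally composed in Pipedream app files, assuming the base URL and x-api-token header quoted above; treat the details as an assumption rather than the actual implementation:

  import { axios } from "@pipedream/platform";

  export default {
    type: "app",
    app: "scrapeless",
    methods: {
      _baseUrl() {
        return "https://api.scrapeless.com/api/v1";
      },
      // Shared request helper; "$" is Pipedream's execution context.
      _makeRequest({ $ = this, path, ...opts }) {
        return axios($, {
          url: `${this._baseUrl()}${path}`,
          headers: {
            "x-api-token": this.$auth.api_key,
          },
          ...opts,
        });
      },
    },
  };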
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between cb64910 and bf9311e.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (6)
  • components/scrapeless/actions/get-scrape-result/get-scrape-result.mjs (1 hunks)
  • components/scrapeless/actions/submit-scrape-job/submit-scrape-job.mjs (1 hunks)
  • components/scrapeless/common/constants.mjs (1 hunks)
  • components/scrapeless/common/utils.mjs (1 hunks)
  • components/scrapeless/package.json (2 hunks)
  • components/scrapeless/scrapeless.app.mjs (1 hunks)
🔇 Additional comments (3)
components/scrapeless/package.json (2)

3-4: LGTM! Version update is appropriate for new component

The version update from 0.0.1 to 0.1.0 is appropriate for introducing new functionality.


14-17: LGTM! Dependency addition and formatting corrections

The addition of the @pipedream/platform dependency and correction of the JSON structure are appropriate.

components/scrapeless/actions/get-scrape-result/get-scrape-result.mjs (1)

17-25: LGTM! Well-implemented run method

The run method is well-implemented, correctly passing the execution context and job ID to the Scrapeless API client, then returning the response with a descriptive summary.
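
A minimal sketch of what such a run method typically looks like under standard Pipedream action conventions; the exact summary string and prop wiring in this PR are assumptions:

  // Sketch only: conventional Pipedream action run method shape.
  async run({ $ }) {
    const response = await this.scrapeless.getScrapeResult({
      scrapeJobId: this.scrapeJobId,
    });
    // Export a human-readable summary alongside the returned data.
    $.export("$summary", `Retrieved result for scrape job ${this.scrapeJobId}`);
    return response;
  }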

@michelle0927 michelle0927 (Collaborator) left a comment

Looks good! Just one possible typo in constants.mjs. Moving to QA.

Labels: ai-assisted (Content generated by AI, with human refinement and modification)
Projects: None yet
Development: Successfully merging this pull request may close [Components] scrapeless (#16673).
2 participants