
DataSync 1.5

@aynleslie aynleslie released this 24 Sep 18:17

Overview

We are excited to announce the release of DataSync 1.5 (download link below).

DataSync 1.5 provides a number of updates that increase the reliability of data ingress and allow DataSync to work in more environments. One of the most important improvements in this release is that data publishers can now use the ‘SSync replace’ operation over HTTP. In DataSync 1.0 we introduced ‘SSync replace’ via FTP, which automatically detects which rows have been added, updated, or deleted and publishes only the minimal set of changes to the dataset. This functionality is now available over HTTP. Switching from the FTP variant is as easy as toggling a button in the user interface or including the “-ph” flag on the command line.
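To illustrate what "detecting which rows have been added, updated, or deleted" can look like, here is a minimal Python sketch. It is not DataSync's implementation: it assumes every row carries a unique identifier column (called `id` here), and the function name `diff_rows` and the sample data are made up for the example.

```python
import csv
import io


def diff_rows(old_csv: str, new_csv: str, key: str):
    """Return (added, updated, deleted) rows, matching rows on the `key` column.

    Assumption: `key` uniquely identifies a row in both inputs.
    """
    old = {row[key]: row for row in csv.DictReader(io.StringIO(old_csv))}
    new = {row[key]: row for row in csv.DictReader(io.StringIO(new_csv))}

    # Rows whose key appears only in the new file.
    added = [new[k] for k in new if k not in old]
    # Rows present in both files whose contents differ.
    updated = [new[k] for k in new if k in old and new[k] != old[k]]
    # Rows whose key appears only in the old file.
    deleted = [old[k] for k in old if k not in new]
    return added, updated, deleted


# Hypothetical sample data: the dataset as published, and the new CSV.
old_data = "id,name\n1,Ann\n2,Bob\n3,Cy\n"
new_data = "id,name\n1,Ann\n2,Bobby\n4,Dee\n"

added, updated, deleted = diff_rows(old_data, new_data, key="id")
print("added:", added)
print("updated:", updated)
print("deleted:", deleted)
```

Only the three changed rows would need to be sent in this example; the unchanged row with `id` 1 is never transmitted.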

A number of changes were also made to increase the reliability of ingress. First, transmitting only the changes minimizes the amount of data that must be transferred, which in turn minimizes the opportunity for an unreliable network to fail the operation. Second, should DataSync encounter an unreliable network, it will execute a series of retries to allow the job to automatically recover and continue. These changes should greatly enhance the reliability of all jobs, including scheduled ETLs.
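The retry behavior can be pictured with a short Python sketch. The release notes say only that DataSync executes "a series of retries", so the attempt limit, the exponential backoff, and the names `run_with_retries` and `flaky_upload` are all assumptions made for illustration.

```python
import time


def run_with_retries(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    Assumption: network failures surface as ConnectionError, and the wait
    doubles after each failed attempt (1s, 2s, 4s, ...).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # Out of attempts: let the caller see the failure.
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical upload that fails twice before the network "comes back".
calls = {"count": 0}


def flaky_upload():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("network unreachable")
    return "published"


# Record the delays instead of actually sleeping, to keep the example fast.
delays = []
result = run_with_retries(flaky_upload, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the sketch easy to test; a real job would use the default `time.sleep`.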

If you are already using DataSync, you just need to download the new JAR file below and replace your existing JAR file with it. If you are not using a previous version of DataSync, you can simply download version 1.5 below. DataSync 1.5 requires Java version 1.7 or higher.

DataSync documentation has also been improved and expanded. We have added a quick start guide, which aims to simplify first-time use of DataSync. We have detailed all of the options that are available to each job, in terms of both your personal configuration file and the job control file. We have added a resource that describes the data restrictions by data type. For instance, Percent type data must not include the % symbol and should range between 0 and 100, not 0 and 1.
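As an example of checking data against such a restriction before publishing, the Percent rule above could be tested with a small Python helper. This is a sketch only: the function name `is_valid_percent` is hypothetical, and treating the 0 to 100 range as a hard limit is an assumption, since the rule is stated as "should".

```python
def is_valid_percent(value: str) -> bool:
    """Check a CSV cell against the Percent rule described above.

    Assumption: the 0-100 range is enforced strictly.
    """
    text = value.strip()
    if "%" in text:
        return False  # The % symbol must not be included.
    try:
        number = float(text)
    except ValueError:
        return False  # Not a number at all.
    return 0 <= number <= 100


for cell in ["42.5", "42.5%", "150", "abc"]:
    print(repr(cell), is_valid_percent(cell))
```

Note that a value such as `0.5` passes this check even if it was meant as 50% on a 0 to 1 scale, so a column of values that never exceeds 1 is worth a manual look.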

We also invite you to contribute to the documentation using a GitHub pull request to the gh-pages branch of the DataSync repository.

DataSync 1.5 comes with additional enhancements and new features, many of which are based on customer requests. The full list of changes follows:

What’s new

New SSync Replace via HTTP - Enables simple and efficient replace operations on datasets of essentially any size. Diff-based file transferring is used so that only the changes between the CSV and the dataset are passed across the wire. All replace, upsert and delete operations flow over HTTPS by default.
New SSync Upsert and delete via HTTP - All of the SSync benefits brought to replace jobs are available for upsert and delete jobs also.
New HTTP proxy support - For the new SSync replace, upsert and delete jobs, both authenticated and unauthenticated proxies are supported and configurable from the configuration file or preference menu.
New Compressed Diffs - DataSync compresses diffs before sending.
New Robust DataSync retry logic - DataSync will automatically pause and retry jobs in the case of network failure. You can now start a ‘Replace via HTTP’ job, turn off your internet, watch DataSync try and retry again, turn your internet back on and watch DataSync succeed!
New Early failure notification - For the new SSync suite of jobs, DataSync will attempt to find and report any control file misspecifications or data alignment problems before starting the job.
New Version information - You can retrieve the DataSync version of your JAR via the command line using -v or --version.
New Porting column formatting - Port jobs can now copy column formatting along with data.
Changed More job options - Options are now available in the UI to choose between the legacy SODA2 and FTP paths and the new HTTP path.
Changed Version warnings - The customer is only warned about new versions in the case of a major version update and DataSync no longer breaks in the case of a major version change.
Changed Preferences location - Moved preferences into the file menu.
Changed Control file source - Ability to change the source of the control file in the GUI.
Changed Simpler configuration files - Previously configuration files had to be fully-specified regardless of the job. Now, only the domain and user credentials are required for most cases.
Bug Fix Fixed a bug preventing the use of non-SSL SMTP servers.
Bug Fix Fixed a bug preventing port jobs of datasets with resource names.

Known issues

Known Customers are limited to 2 simultaneous running jobs per domain - In order to keep customers from starving their own resources (and other customer resources), DataSync will only run two jobs at a time; additional jobs are queued and must wait for earlier jobs to complete. Socrata monitors the queue and will allocate additional resources if we find that jobs are not able to clear in an acceptable time.
Known HTTP proxies do not work with existing SODA2 jobs - Customers will need to set up new jobs using the updated upsert, replace and delete jobs. This will require the customer to create a control file as described in our GitHub documentation.
Known Customers may be locked out for 15 minutes if they mistype their password - Because DataSync retries failed network calls on the user’s behalf, a single mistyped password can trigger the 3-strikes-and-you’re-locked-out-for-15-minutes rule.

Want to leave a question, comment, suggestion, or bug report on DataSync? Submit these to the DataSync GitHub repository issue tracker - all you need is a free GitHub account:
https://github.com/socrata/datasync/issues

Watch the GitHub repository to remain up to speed with new features on the roadmap and deploy schedules for future versions.

Related Links: