
HAWQ-1078. Implement hawqsync-falcon DR utility. #940

Open · wants to merge 12 commits into master

Conversation

kdunn926
Contributor

This is the initial commit for a Python utility to orchestrate a DR synchronization for HAWQ, based on Falcon HDFS replication and a cold backup of the active HAWQ master's MASTER_DATA_DIRECTORY.

A code review would be greatly appreciated, when someone has cycles. Active testing is currently underway in a production deployment.

@kdunn-pivotal

@vVineet How can we get this prioritized for the next release? Also, anyone who can put eyes on it for a code review would be appreciated.

@vVineet

vVineet commented Oct 11, 2016

@kdunn-pivotal: I propose a discussion including @ictmalili, as this ties in with the HAWQ Register feature. I'd love to see the contribution make it into HAWQ.

Contributor

@ictmalili ictmalili left a comment


Quite a lot of work here. Could you provide step-by-step instructions that users can follow to perform DR? Thanks a lot!

retVal, stderr = startHawq(masterHost=options.targetHawqMaster,
                           isTesting=options.testMode)
print retVal, stderr if options.verbose else None;


@kdunn926, do I understand correctly that at this point all the files have been copied to the new HAWQ master data directory, but the HAWQ catalog information has not been changed? Could we leverage hawq register to register the files into HAWQ?


@ictmalili - At this point in the process several things have happened on the source cluster:

  1. The source HAWQ Master Data Directory (MDD) has been tarballed.
  2. The tarball has been copied to HDFS in the /hawq_default directory on the source cluster.
  3. The /hawq_default directory has been recursively copied to the DR cluster via Apache Falcon (distcp).
  4. A checksum, generated from a recursive listing of file names and sizes, has been computed for each cluster (source and DR) and successfully validated.

Basically, by the time this point in the code is reached, the source HAWQ system (data & metadata) is completely archived & verified at the DR site.
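
For reference (not from the patch itself), here is a minimal Python 2 sketch of how the step-4 checksum could be derived from a recursive listing of file names and sizes; the function name, the md5-over-sorted-entries scheme, and the /hawq_default default are illustrative assumptions rather than the utility's actual implementation:

import hashlib
import subprocess

def listing_checksum(path="/hawq_default"):
    """Hash a recursive listing of file paths and sizes under `path`."""
    # `hdfs dfs -ls -R` lines look like: perms repl owner group size date time path
    listing = subprocess.check_output(["hdfs", "dfs", "-ls", "-R", path])
    entries = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0].startswith("d"):
            continue  # skip directories and summary lines
        size, name = fields[4], fields[7]
        entries.append("%s,%s" % (name, size))
    # sort so both clusters hash the listing in the same order
    return hashlib.md5("\n".join(sorted(entries))).hexdigest()

# run on the source and the DR cluster and compare the digests; a mismatch
# means the Falcon/distcp copy is not yet consistent
print listing_checksum()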

print """
## Manual runbook during DR event
1. Copy MDD archive from HDFS to target master (CLI)
2. Restore archive in /data/hawq/ (CLI)

Which specific directory does this map to?


Is it HAWQ master's MASTER_DATA_DIRECTORY?


Yes, the MDD archive is the MASTER_DATA_DIRECTORY from the source HAWQ cluster.
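
As an aside, a minimal sketch (not part of the PR) of the two manual restore steps from the runbook above, to be run on the target (DR) master; the hawqExtract-*.tar.bz2 archive naming comes from the partial-sync notes below, the /data/hawq/ location from the runbook, and "newest by name" is an assumption:

import glob
import subprocess

# 1. Copy the MDD archive(s) from HDFS to the target master
subprocess.check_call(["hdfs", "dfs", "-get",
                       "/hawq_default/hawqExtract-*.tar.bz2", "/data/hawq/"])

# 2. Restore the newest archive (by name) under /data/hawq/, which becomes
#    the restored MASTER_DATA_DIRECTORY
archive = sorted(glob.glob("/data/hawq/hawqExtract-*.tar.bz2"))[-1]
subprocess.check_call(["tar", "-xjf", archive, "-C", "/data/hawq/"])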

@kdunn-pivotal

kdunn-pivotal commented Oct 27, 2016

HAWQSYNC partial-sync recovery runbook:

  1. Copy "last known good state" tarball from hdfs://hawq_default/hawqExtract-*.tar.bz2

  2. Re-run hawqsync-extract to establish "current state".

  3. Perform diffs for every table file to determine which tables have inconsistencies.

  4. For each inconsistent table:
     a. Re-register faultyTable using the "last known good" YAML (updates the EOF field only):
        hawq register --force -f faultyTable.yaml faultyTable

     b. Store the valid records in a temporary table:
        CREATE TABLE newTemp AS SELECT * FROM faultyTable

     c. Truncate the faulty table, to allow the catalog and HDFS file sizes to be consistent again:
        TRUNCATE faultyTable

     d. Re-populate the table with the valid records:
        INSERT INTO faultyTable SELECT * FROM newTemp

     e. Purge the temporary table:
        DROP TABLE newTemp

This process, overall, ensures our catalog EOF marker and actual HDFS file size are properly aligned for every table. This is especially important when ETL needs to resume on tables that may have previously had "inconsistent bytes" appended, as would be the case for a partial sync.
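
A compact sketch (not part of this PR) of automating steps 4a through 4e for a single table, reusing the commands from the runbook verbatim; the realign_table helper, the database name, and the plain psql invocation are illustrative assumptions:

import subprocess

def realign_table(table, yaml_path, database):
    """Re-align the catalog EOF marker with the HDFS file size for one table."""
    # a. re-register using the "last known good" YAML (updates the EOF field only);
    #    connection options omitted here, as in the runbook
    subprocess.check_call(["hawq", "register", "--force", "-f", yaml_path, table])
    statements = [
        # b. store the valid records in a temporary table
        "CREATE TABLE newTemp AS SELECT * FROM %s;" % table,
        # c. truncate so the catalog and HDFS file sizes are consistent again
        "TRUNCATE %s;" % table,
        # d. re-populate the table with the valid records
        "INSERT INTO %s SELECT * FROM newTemp;" % table,
        # e. purge the temporary table
        "DROP TABLE newTemp;",
    ]
    for stmt in statements:
        subprocess.check_call(["psql", "-d", database, "-c", stmt])

# e.g., for each table flagged as inconsistent in step 3:
realign_table("faultyTable", "faultyTable.yaml", "gpadmin")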
