
HAWQ-1078. Implement hawqsync-falcon DR utility. #940

Open · wants to merge 12 commits into master

Conversation

kdunn926
Contributor

This is the initial commit for a Python utility to orchestrate a DR synchronization for HAWQ, based on Falcon HDFS replication and a cold backup of the active HAWQ master's MASTER_DATA_DIRECTORY.

A code review would be greatly appreciated, when someone has cycles. Active testing is currently underway in a production deployment.

@kdunn-pivotal

@vVineet How can we get this prioritized for the next release? Also, anyone who can put eyes on it for a code review would be appreciated.

@vVineet

vVineet commented Oct 11, 2016

@kdunn-pivotal: I propose a discussion including @ictmalili, as this ties in with the HAWQ Register feature. I'd love to see the contribution make it into HAWQ.

Contributor

@ictmalili ictmalili left a comment


Quite a lot of work here. Could you provide step-by-step instructions that users can follow to perform DR? Thanks a lot!

retVal, stderr = startHawq(masterHost=options.targetHawqMaster,
                           isTesting=options.testMode)
print retVal, stderr if options.verbose else None;


@kdunn926, do I understand correctly that at this point all the files have been copied to the new HAWQ master data directory, but the HAWQ catalog information has not been changed? Could we leverage hawq register to register the files into HAWQ?


@ictmalili - At this point in the process several things have happened on the source cluster:

  1. The source HAWQ Master Data Directory (MDD) has been tarballed.
  2. The tarball has been copied to HDFS in the /hawq_default directory on the source cluster.
  3. The /hawq_default directory has been recursively copied to the DR cluster via Apache Falcon (distcp).
  4. A checksum, generated from a recursive listing of file names and sizes, has been computed for each cluster (source and DR) and successfully validated.

Basically, by the time this point in the code is reached, the source HAWQ system (data & metadata) is completely archived & verified at the DR site.
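
For reference (not from the patch itself), here is a minimal Python 2 sketch of how the step-4 checksum could be derived from a recursive listing of file names and sizes; the function name, the md5-over-sorted-entries scheme, and the /hawq_default default are illustrative assumptions rather than the utility's actual implementation:

import hashlib
import subprocess

def listing_checksum(path="/hawq_default"):
    """Hash a recursive listing of file paths and sizes under `path`."""
    # `hdfs dfs -ls -R` lines look like: perms repl owner group size date time path
    listing = subprocess.check_output(["hdfs", "dfs", "-ls", "-R", path])
    entries = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0].startswith("d"):
            continue  # skip directories and summary lines
        size, name = fields[4], fields[7]
        entries.append("%s,%s" % (name, size))
    # sort so both clusters hash the listing in the same order
    return hashlib.md5("\n".join(sorted(entries))).hexdigest()

# run on the source and the DR cluster and compare the digests; a mismatch
# means the Falcon/distcp copy is not yet consistent
print listing_checksum()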

print """
## Manual runbook during DR event
1. Copy MDD archive from HDFS to target master (CLI)
2. Restore archive in /data/hawq/ (CLI)

Which specific directory does this map to?


Is it HAWQ master's MASTER_DATA_DIRECTORY?


Yes, the MDD archive is the MASTER_DATA_DIRECTORY from the source HAWQ cluster.
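
As an aside, a minimal sketch (not part of the PR) of the two manual restore steps from the runbook above, to be run on the target (DR) master; the hawqExtract-*.tar.bz2 archive naming comes from the partial-sync notes below, the /data/hawq/ location from the runbook, and "newest by name" is an assumption:

import glob
import subprocess

# 1. Copy the MDD archive(s) from HDFS to the target master
subprocess.check_call(["hdfs", "dfs", "-get",
                       "/hawq_default/hawqExtract-*.tar.bz2", "/data/hawq/"])

# 2. Restore the newest archive (by name) under /data/hawq/, which becomes
#    the restored MASTER_DATA_DIRECTORY
archive = sorted(glob.glob("/data/hawq/hawqExtract-*.tar.bz2"))[-1]
subprocess.check_call(["tar", "-xjf", archive, "-C", "/data/hawq/"])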

@kdunn-pivotal

kdunn-pivotal commented Oct 27, 2016

HAWQSYNC partial-sync recovery runbook:

  1. Copy "last known good state" tarball from hdfs://hawq_default/hawqExtract-*.tar.bz2

  2. Re-run hawqsync-extract to establish "current state".

  3. Perform diffs for every table file to determine which tables have inconsistencies.

  4. For each inconsistent table:
     a. Re-register faultyTable using the "last known good" YAML (updates the EOF field only):
        hawq register --force -f faultyTable.yaml faultyTable

     b. Store the valid records in a temporary table:
        CREATE TABLE newTemp AS SELECT * FROM faultyTable

     c. Truncate the faulty table, to allow the catalog and HDFS file sizes to be consistent again:
        TRUNCATE faultyTable

     d. Re-populate the table with the valid records:
        INSERT INTO faultyTable SELECT * FROM newTemp

     e. Purge the temporary table:
        DROP TABLE newTemp

This process, overall, ensures our catalog EOF marker and actual HDFS file size are properly aligned for every table. This is especially important when ETL needs to resume on tables that may have previously had "inconsistent bytes" appended, as would be the case for a partial sync.
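
A compact sketch (not part of this PR) of automating steps 4a through 4e for a single table, reusing the commands from the runbook verbatim; the realign_table helper, the database name, and the plain psql invocation are illustrative assumptions:

import subprocess

def realign_table(table, yaml_path, database):
    """Re-align the catalog EOF marker with the HDFS file size for one table."""
    # a. re-register using the "last known good" YAML (updates the EOF field only);
    #    connection options omitted here, as in the runbook
    subprocess.check_call(["hawq", "register", "--force", "-f", yaml_path, table])
    statements = [
        # b. store the valid records in a temporary table
        "CREATE TABLE newTemp AS SELECT * FROM %s;" % table,
        # c. truncate so the catalog and HDFS file sizes are consistent again
        "TRUNCATE %s;" % table,
        # d. re-populate the table with the valid records
        "INSERT INTO %s SELECT * FROM newTemp;" % table,
        # e. purge the temporary table
        "DROP TABLE newTemp;",
    ]
    for stmt in statements:
        subprocess.check_call(["psql", "-d", database, "-c", stmt])

# e.g., for each table flagged as inconsistent in step 3:
realign_table("faultyTable", "faultyTable.yaml", "gpadmin")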
