Dealing with GBT Data

This page is to tell you how to back up telescope data, convert it to .fits files, and get it onto CITA's computers.

Signing In

The work is almost all done on 'beef', one of the computers at GBT.
To get onto beef, open a terminal and type in:

ssh -X [email protected]  
# Where 'xxxxx' is your GBT username. Then:  
ssh -X beef

The -X (or -Y) is necessary, since matplotlib expects a display to be available for the plots you will be making (even though the plots are saved, not displayed).

Getting the software

You will need a copy of the code on beef (master branch). Follow the instructions found here.

One-time Setup

For the code to work correctly, you need some things set up when you log in. This is very simple. Copy what is in Kiyo's '.bash_profile', '.bashrc' and 'config/kiyobashrc' into an equivalent location in your home folder. Note: Kiyo's home folder is called '~kmasui'.

If you don't want to copy the whole thing, make sure you get the beef-specific parts and the ssh-agent passwordless login parts at the end.
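If you do copy everything wholesale, a minimal sketch of the commands is below. It assumes you have read access to '~kmasui' and that you have no existing login files of your own you want to keep (back yours up first if you do).

# Copy Kiyo's login setup into the equivalent locations in your home folder.
cp ~kmasui/.bash_profile ~/.bash_profile
cp ~kmasui/.bashrc ~/.bashrc
mkdir -p ~/config
cp ~kmasui/config/kiyobashrc ~/config/kiyobashrc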

You will also need to generate a private/public rsa keypair for passwordless login. If this isn't already set up, you can generate it using the following commands:

cd ~/.ssh
ssh-keygen # follow instructions
cat id_rsa.pub >> authorized_keys
chmod 600 authorized_keys

You will probably have to log out of beef and log back in to continue.

Setup When Logging in Every Time

While you are working on 'beef' you must also be able to log in to 'euclid'. If you have ever tried sshing into euclid, you know that you normally have to enter your password each time. The program that backs up the data must be able to ssh into euclid WITHOUT a password. To do this, every time you log in to beef, type in:

ssh-add ~/.ssh/id_rsa

Note that this only has to be entered once per beef session. If you have 5 beef terminals open, you only have to type it into the first one to be able to ssh into euclid from any of the others (for at least 10 hours).
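If you are not sure whether you have already done this in the current session, you can ask the agent what it is holding:

ssh-add -l  # lists the loaded identities; 'The agent has no identities.' means you still need to run ssh-add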

The data_log.ini File

This file keeps track of everything you want to do. If you ever need an example that has every conceivable case in it, take a look at the data_log.ini file from 10B_036 (~kmasui/GBT10B_036/data_log.ini).

For the AGBT14B_339 project, the file is stored in '~tvoytek/14B_339/data_log.ini'. In one terminal, ssh into beef, then:

# The file that tells the program what to do and which files to work on.
#vim ~kmasui/GBT11B_055/data_log.ini #old version
vim ~tvoytek/14B_339/data_log.ini #latest version

Getting Into Euclid

Make sure that the program signs into your account on euclid at this line (disks may be mounted on either euclid or thales and can become unmounted, so check the disk location before running):

archive_root = "xxxxx@euclid:/mnt/ehd1/scratch/"
# Where 'xxxxx' is your GBT username.

Adding New Sessions To The Conversions

To get new files to be processed, you must add the session as a dictionary to 'sessions'. The different parameters are explained below. The best way to understand them is to look at how the previous sessions are written in; there is also an example sketch after the parameter list.

  • "number" - The session number.

  • "guppi_input_root" - The location that GBT has saved the session data.
    This will always be '/dataY/xxxxx/AGBT######/', where 'xxxxx' is the GBT username of the observer for that session, Y is either 1 or 2 (wherever the data is located), and ###### is the project number (e.g. 11B_055). Note that some sessions do not have this parameter; at the beginning of the data_log file there is a default input root:

default_guppi_input_root = "/data1/tvoytek/AGBT11B_055/" # This is an example, check the particular data_log.ini file for the current default root. 

Feel free to change the default to one that is more relevant if needed.

  • "guppi_dir" - The session date.
    This is one of the sub-folders in 'guppi_input_root'. The session number is not written here so make sure you match up the proper date with the session number.
    If the session all occurred in one day, this parameter will have only one value.
    If the session ran into the next day (according to UTC time), there will be 2 folders with data with one session. This parameter will then be a list of 2 values. (eg ['date1','date2'] instead of 'date1')

  • "guppi_session" - A number that gets added to the name of the scans. This can ONLY be found by going on beef and cd-ing into the above directory.
    Note that this may have more than one value for the exact same reason as 'guppi_dir' and you can have a multi-parameter list by using ['sess1','sess2'] instead of 'sess1'

  • "sources" - The name of the scan along with which scan pertains to them.
    It is useful to cd into 'guppi_dir' if you are not already in there.
    Note that the 05 pulsar scans do not get put into this since they do not get converted.
    The name of the source is found in between 'guppi_session' and the scan number. Make sure the name matches. Sometimes the quasars are listed as '3CXXX' and sometimes they are listed as '3cXXX'. The numbers that come after are the scan numbers. For onoff scans, they are 2 scans long and 2 of them are done at once. This gives you 2 numbers to put in for the 00 scan and another 2 numbers to put in for the 03 scan. Note that the 00 scan gives you the 'track' scan also (the first number) but you do not want to put this number in. The wigglez1hr stepping are 1 scans long per script. To write a lot of numbers at once it useful to use 'range':

# first = The first wiggleZ stepping scan.
# last = The last wiggleZ stepping scan + 1
("wigglez1hr_centre", range(first, last))
# If you do a separate 03_ or 05_ scan in between stepping scans, do:
# before_break = The last wiggleZ stepping scan before the 03_/05_ scan + 1
# after_break The first wiggleZ stepping scan after the 03_/05_ scan.
("wigglez1hr_centre", range(first, before_break) + range(after_break, last))

Backing Up The Files To Disk

First you must see how much space is left on disk.

# From beef:
ssh euclid
# Go to our mounted disk:
cd /mnt/ehd1/scratch
# See how much space we have:
df -h
# Look for the line that says:
# '/dev/sdd1             1.8T  1.5T  272G  85% /mnt/ehd1'
# That says it is 85% full with 272GB left.
# To see how much space other sessions take up do:
du -h
# Use that to compare how big a session is and ONLY back it up if there is enough space.
# Else, ask for a disk change.

To back up a session, in the data_log.ini file at the line that says:

sessions_to_archive = (80, 81, 83, 84, 85, 86, 87, 88, 89)

Add the session number to it. For every new disk, make a new tuple, commenting out the previous ones (so that each tuple corresponds to a disk, with only the currently mounted disk's tuple uncommented). If you are waiting for a disk to be mounted, you can skip the backup (by commenting out all the backup lines) and do it at a later time. The rest of the conversion will still run.
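For example, the bookkeeping in the file might end up looking something like this (the commented-out tuple and its session numbers are purely illustrative):

# Sessions backed up to the previous (now full) disk:
#sessions_to_archive = (70, 71, 72, 73, 74, 75)
# Sessions to back up to the currently mounted disk:
sessions_to_archive = (80, 81, 83, 84, 85, 86, 87, 88, 89)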

Running It

You are now ready to run the backing up / converting program.

# From beef:
cd analysis_IM/ # or wherever you have the repository.
# Run the file.
/opt/local/bin/python scripts/psrfits_to_sdfits.py auto ~tvoytek/14B_339/data_log.ini &
# Note: the sessions are backed up and converted at the same time (multiprocessing).
# To see how the backing up to disk is going, do:
cat conversion.sync
# You can also keep checking on euclid if you want.
# To see how the converting is going (first scans are converted, then quality pdfs are made):
cat conversion.log
# Any errors get written here:
cat conversion.err
# The .err file should stay empty the whole time meaning there are no errors.
# Then just wait until it finishes.

Errors Happen

Errors will inevitably occur. Systems crash, files disappear or fail to be written to disk. The pipeline is designed to pick up where you left off in the event of an error. Without making any changes to the configuration file, simply run again. This should work most of the time. If the same error keeps coming back at the same place, then there will be some debugging to do.

Note that the pdf making process seems to be more prone to causing problems. If you are having trouble finding a solution, you can skip making the pdfs by commenting out each source after the code tries to make pdfs for it and crashes.

Getting the processed files to CITA

All of the converted files are stored in Kiyo's directory. The files and their corresponding quality pdf will be rsynced over:

# On a terminal that has access to raid-project:
# Go to the appropriate folder and sync the data.
cd /mnt/raid-project/gmrt/kiyo/data/guppi_quality/GBT###_###
rsync -avP -essh [email protected]:/home/scratch/tvoytek/quality/GBT###_###/* .
cd /mnt/raid-project/gmrt/kiyo/data/guppi_data/GBT###_###
rsync -avP -essh [email protected]:/home/scratch/tvoytek/converted_fits/GBT###_###/* .
# Where 'xxxxx' is your GBT username and ###_### is the current project number
# Transferring the pdfs is really fast while getting the converted files
# over takes a while because of their size.

Transferring raw files to scinet

Because of the limited space on beef, it is important to copy the data to mid-term storage quickly so that the raw data on beef can be deleted as soon as you finish running the above script on a given folder.

We use euler to stage our data for mid-term storage before transferring it to scinet.

Step 1: Transfer the data to euler or euclid:

ssh -X euclid
cd /lustre/pulsar/scratch/AGBT14B_339/
rsync -avP -essh beef-10:/data2/tvoytek/AGBT14B_339/20140911/* ./20140911
#change the path on beef to the appropriate folder for a given session (as well as the output folder). 

Once the data is on euler/euclid, it can be transferred to scinet. However, because of the gateway restrictions on both the GBT machines and scinet, there is a multi-stage transfer process.

Step 2: Transfer files to raid-project at CITA

#from euclid
screen
rsync -avP -essh /lustre/pulsar/scratch/AGBT14B_339/20140911/ [email protected]:/mnt/raid-project/gmrt/username/20140911/
#you will probably want to do this inside a screen and then detach as this can take several hours to a day for large folders. 

Step 3: Transfer files from raid-project to scinet

ssh -X [email protected]
ssh -X datamover1
cd /scratch/p/pen/tvoytek/raw_gbt_data/AGBT14B_339/
module load extras
screen
ssh -V
rsync -vrltD -essh [email protected]:/mnt/raid-project/gmrt/username/20140911/ 09_20140911/
#will take several hours, so you can detach the screen with CTRL-A, CTRL-D and login later (with screen -r) to check if it's done.
#Note that the folder name changed as we added the session number to the folder name. 

Step 4: Delete intermediate data from euler/euclid and raid-project and mark beef data for deletion.

Step 4a: Add to the beef deletion list

Once you are done running on beef, the raw data folders need to be deleted. You don't have permission to delete these folders directly. Instead, add the folder locations to the file on beef:

/users/tvoytek/beef_folders_ready_for_deletion.txt
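For example, to record a session folder you have finished with (the folder path below is only illustrative, and this assumes you have write access to the file):

echo "/data2/tvoytek/AGBT14B_339/20140911" >> /users/tvoytek/beef_folders_ready_for_deletion.txt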

You should also ping Paul Demorest ([email protected]) and Sam Bates ([email protected]) to let them know that you have folders ready for deletion.

Step 4b: Delete euler/euclid data

Once you have copied the data to CITA, you can delete the intermediate folders you made on euler/euclid. This you can do yourself.

Step 4c: Delete CITA raid-project data

Once you have copied the data to scinet, you can delete the intermediate folder(s) you made on raid-project. This you can do yourself.
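As a sketch of that cleanup, using the illustrative paths from the steps above (double-check that the downstream copies completed before removing anything):

# On euclid/euler: remove the staging copy once it is safely at CITA.
rm -rf /lustre/pulsar/scratch/AGBT14B_339/20140911
# At CITA: remove the raid-project copy once it is safely on scinet.
rm -rf /mnt/raid-project/gmrt/username/20140911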

Congratulations

Everything is done! Rinse and repeat as more sessions come in.