Original PDF Source: [bbs.gov.bd]
PDFs in Google Drive: [Google Drive]
In this project, we extract the data from the BBS PDFs and clean it into easily readable formats (such as CSV and JSON) so that anyone can use it.
- The main aim is to create a digitized version of the 2011 census data by BBS so that anyone can access it and analyze it.
- Once the data is online and freely accessible, anyone can use it for any purpose they wish.
- The data is very diverse, for example: the number of schools in each upazila, tons of rice produced, the average wage of manual workers. So this data captures many different aspects of Bangladesh at that time.
- The data can be used in various studies and in data visualizations.
- In the future, this data can be compared and/or merged with other data sources, for example the next census, which could reveal further insights about Bangladesh.
- Another future task is to merge the individual tables into one mega table, with one row per upazila and many columns side by side, so that data about any upazila or zila can be read at a glance (a hedged sketch of such a merge follows this list).
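A minimal sketch of that mega-table idea, assuming each cleaned CSV has an `upazila` column to join on; the column name and file layout here are assumptions, not the repo's actual schema:

```python
# Hypothetical sketch: outer-join several cleaned tables on the upazila name so
# that each upazila ends up as a single wide row. Column and file names are assumed.
from pathlib import Path

import pandas as pd

def build_mega_table(csv_paths, key="upazila"):
    mega = None
    for path in csv_paths:
        df = pd.read_csv(path)
        # Prefix every non-key column with the file stem to avoid name clashes.
        df = df.rename(columns={c: f"{Path(path).stem}_{c}"
                                for c in df.columns if c != key})
        mega = df if mega is None else mega.merge(df, on=key, how="outer")
    return mega

# Illustrative usage:
# mega = build_mega_table(sorted(Path("verified_CSVs/Dhaka").glob("*.csv")))
# mega.to_csv("mega_table.csv", index=False)
```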
- 1_docx_parser.ipynb: The PDFs from BBS were converted into docx files using Adobe Acrobat Pro; this notebook opens each docx, extracts the paragraph and table blocks, and saves them in a pickle. (Hedged sketches of the notebook steps appear after this list.)
- 2_understanding_content.ipynb: the pickles are explored to show the format of the table and paragraph blocks.
- 3_cleaning_and_formatting.ipynb: the extracted table and paragraph blocks are further cleaned and processed, and finally the blocks are sorted into chapterwise format.
- 4_generate_csv.ipynb: the cleaned tables are converted into CSV files using pandas.
- 5_unwanted_chars.ipynb: cleans tables that contain unwanted characters and units. See error type 5.
- cleaned_CSVs: Contains all extracted CSV files, with 64 folders, one for each district. This data is not yet verified.
- verified_CSVs: Contains the verified CSV files, also with 64 folders, one for each district. This data was verified manually by different contributors.
- docx: Contains all docx files for every district, converted from PDF using Adobe Acrobat Pro.
- pickles: Contains all .pkl files, which hold the cleaned table and paragraph blocks in chapterwise format.
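For readers new to the pipeline, here is a minimal sketch of the docx-to-pickle step in 1_docx_parser.ipynb, assuming the python-docx package; the notebook itself may extract blocks in document order and keep more metadata.

```python
# Hedged sketch of the extraction step: pull paragraph text and table cells out
# of a docx file and save them as a pickle. Paths below are illustrative.
import pickle

from docx import Document

def extract_blocks(docx_path):
    doc = Document(docx_path)
    paragraphs = [p.text for p in doc.paragraphs if p.text.strip()]
    tables = [[[cell.text for cell in row.cells] for row in table.rows]
              for table in doc.tables]
    return {"paragraphs": paragraphs, "tables": tables}

# blocks = extract_blocks("docx/Dhaka.docx")
# with open("pickles/Dhaka.pkl", "wb") as f:
#     pickle.dump(blocks, f)
```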
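Exploring a pickle, as 2_understanding_content.ipynb does, can be as simple as the snippet below; the key names follow the sketch above and are assumptions about the stored structure.

```python
# Peek at a saved pickle to see how the paragraph and table blocks came through.
import pickle

with open("pickles/Dhaka.pkl", "rb") as f:   # illustrative path
    blocks = pickle.load(f)

print(type(blocks))
print(blocks["paragraphs"][:5])   # first few paragraph blocks (assumed key)
print(blocks["tables"][0][:3])    # first rows of the first table (assumed key)
```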
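The chapterwise sorting in 3_cleaning_and_formatting.ipynb could look roughly like this; the heading pattern and block layout are assumptions, and the notebook's own rules may differ.

```python
# Illustrative sketch: walk paragraph blocks in order and start a new chapter
# whenever a heading such as "Chapter 3" appears.
import re

CHAPTER_RE = re.compile(r"^\s*chapter\s+\d+", re.IGNORECASE)

def sort_into_chapters(paragraphs):
    chapters, current = {}, "front_matter"
    for text in paragraphs:
        if CHAPTER_RE.match(text):
            current = text.strip()
        chapters.setdefault(current, []).append(text)
    return chapters
```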
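4_generate_csv.ipynb writes the cleaned tables out with pandas; a sketch, assuming each cleaned table is a list of rows whose first row holds the headers:

```python
# Turn one cleaned table (a list of rows) into a CSV file via pandas.
import pandas as pd

def table_to_csv(table_rows, out_path):
    df = pd.DataFrame(table_rows[1:], columns=table_rows[0])
    df.to_csv(out_path, index=False)

# table_to_csv(blocks["tables"][0], "cleaned_CSVs/Dhaka/table_01.csv")  # illustrative
```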
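Finally, 5_unwanted_chars.ipynb strips unwanted characters and units from table cells; a hedged sketch, where the list of unwanted tokens is invented for illustration (the notebook defines the real ones under error type 5):

```python
# Remove unit strings that got fused into numeric cells, e.g. "12.5 Sq.Km" -> "12.5".
import re

UNWANTED = re.compile(r"\s*(%|sq\.?\s*km|acres)\s*$", re.IGNORECASE)  # assumed tokens

def strip_units(cell):
    return UNWANTED.sub("", cell) if isinstance(cell, str) else cell

# df = df.applymap(strip_units)   # on newer pandas, df.map(strip_units)
```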
You can volunteer and take part in the data cleaning process; mail me if you are interested: yasser.aziz94 [aaat] gmail dot com
- Manually check the CSV files in the "cleaned_CSVs" folder and look for mistakes. Read the How to Clean Guide. Also watch the video guide.
- Programmatically improve the extraction process from the PDFs so that the cleaning process becomes easier.
- Use the data to find insights or build helpful projects with it.
- Check this Google Sheet and follow the instructions at the top. Choose the zila you want to work with and verify the tables inside. Submit your cleaned and verified files with a pull request.
- Mail me to ask anything: yasser.aziz94 [aaat] gmail dot com
- Post on the "Issues" tab if you face any difficulties.
- Please see the commands here: stackoverflow.com
- This video is also helpful: youtube.com
Thanks to @cinmoy98 for the awesome links. Kudos!
Once you fork the repo, verify the data, and send a pull request, your submission will be checked and merged, and your name will show up in the GitHub repo under "Contributors".