- My name is Raluca and I am a passionate Data Scientist with 5 years of experience in the reinsurance and tech industries, building scalable solutions that transform vast data landscapes into compelling business insights.
- With a strong background in machine learning and data modeling (NLP and (un)supervised ML), I am proficient with an array of tools for data processing and analysis: Jupyter, Python, Airflow, AWS, MSSQL/PostgreSQL, Git.
- I describe myself as a creative self-starter with a keen mind for solving tough problems with adaptive, data-driven and automated solutions.
- I have a PhD in Political Science from Trinity College Dublin, where I conducted a survey-experiment in Kenya and used advanced statistical models (panel models, multilevel model) to assess the political impact of Chinese economic engagement in Africa.
- Data gathering (e.g. Python - requests, bs4/BeautifulSoup, boto3, pyodbc; survey-experiment; SQL - MySQL, Redshift, data lake such as Dremio; Excel - power query)
- Data processing (EDA using Python - matplotlib, pandas, numpy, seaborn; Excel - formulas, pivot)
- Modelling (R - lme4, ordinal, panelAR, plm, Python - Scikit-learn, scipy, nltk, Tyche)
- Data visualization (PowerBI, Tableau, R and Python - matplotlib, seaborn)
- Orchestrating complex data pipelines (using Airflow)
- Cloud infrastructure (AWS - S3, EC2, Redshift)
This section contains my projects which span a variety of data science topics and utilize different libraries/tools.
The word cloud below summarizes the most frequent words used to describe my projects. It is regenerated on each change to the README using GitHub Actions. Feel free to check out the wordcloud-readme repo for more details about the generation process.
Feel free to explore the links below to learn more about each project!
A. Complex projects - original ideas developed into end-to-end pipelines and employing a combination of several tools (e.g. python, spark, SQL)
🔥Hot project🔥: Leet Buster
- Leet (1337) Buster is an upcoming NLP library
- It is designed to decode leet speak (1337 speak) and convert it back to standard text.
- It employs a sophisticated three-step process to accurately identify and resolve leet-encoded words.
Features:
- Efficient leet speak decoding
- Rule-based candidate identification
- One solution generation
- Advanced resolution using NLP and compressed FastText
- Easy integration into existing NLP pipelines
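A toy illustration of what rule-based leet decoding looks like (this is an illustrative sketch, not Leet Buster's actual API or mapping):

```python
# Illustrative character-substitution map; Leet Buster's real rules are richer.
LEET_MAP = {"1": "l", "3": "e", "4": "a", "5": "s", "7": "t", "0": "o", "@": "a", "$": "s"}

def decode_leet(text: str) -> str:
    """Replace common leet characters with their letter equivalents."""
    return "".join(LEET_MAP.get(ch, ch) for ch in text)

print(decode_leet("l33t sp34k"))  # -> "leet speak"
```

A naive map like this produces one candidate per word; the library's three-step process then disambiguates between candidates using NLP and compressed FastText embeddings.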
- A new project that leverages the University of Maryland CISSM Cyber Attacks Database
- Aim: to create an end-to-end data engineering pipeline to ingest, process, store, and visualize data on cyber attacks around the world between 2014 and 2023.
- The project will be submitted as a capstone project for the Data Engineering Zoomcamp 2024
- Expectations: employing cloud, IaC, workflow orchestration, data warehouse, and visualization tools to build a dashboard that shows the trends and patterns of cyber attacks across different sectors and regions.
- Hard deadline: 1st of April 2024
Keywords: Unstructured data, NLP, scikit-learn, sentiment analysis, similarity, Streamlit
Cohort: WaiPRACTICE September Cohort 2023 by Women in AI Ireland (WAI) (github page).
Summary: Built a content-based recommendation system using NLP techniques, sentiment analysis and similarity measures.
Key steps:
- Web scraping with API requests.
- TED talks recommender app using Streamlit.
- Libraries: Requests, pandas, NLTK, scikit-learn and scipy
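The content-based recommendation step can be sketched like this, with toy talk descriptions standing in for the real TED transcripts:

```python
# Minimal content-based recommender: TF-IDF vectors + cosine similarity.
# The talks below are toy placeholders, not the project's scraped data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

talks = {
    "The power of vulnerability": "emotions courage shame connection",
    "Do schools kill creativity?": "education creativity children learning",
    "How great leaders inspire action": "leadership inspiration purpose why",
}

titles = list(talks)
tfidf = TfidfVectorizer().fit_transform(talks.values())
sim = cosine_similarity(tfidf)  # pairwise similarity matrix

def recommend(title: str, top_n: int = 1) -> list:
    """Return the talks most similar to the given one."""
    idx = titles.index(title)
    ranked = sim[idx].argsort()[::-1]          # most similar first
    return [titles[i] for i in ranked if i != idx][:top_n]
```

In the real app, the similarity matrix is combined with sentiment scores and served through Streamlit.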
Next Steps:
- Publish the oratix Python library
- Explore transformers (BERT) for further enhancements.
Keywords: API requests, NLP, sentiment analysis, unstructured data
Summary: A project that analyzes user reviews of the game Valheim on Steam to understand why a game with such low-quality graphics has had such a great reception from players.
Key steps:
- Review collection using the Steam public API.
- Sentiment analysis using a pre-trained BERT transformer.
- EDA process uncovered user exploits
- Python libraries: pandas, numpy, matplotlib and seaborn
Next steps:
Root cause analysis for defects in production (root cause analysis, decision tree, neural networks)
Keywords: supervised ML, decision tree, random forest
Cohort: Women in Data Science Accelerator 2020 (Accenture)
Summary: Conducted a root cause analysis to predict defects in production using decision tree model
Key steps:
- Software: R, Python, Tableau
- Libraries: RPART, Boruta, Scikit-learn, Graphviz and dtreeviz
Next steps:
B. ML bits and Pieces - smaller scale projects, usually comprised of one notebook or one script, meant to focus on a specific tool or ML aspect
Find movies' similarity (NLP, KMeans/Clustering, Unsupervised Learning)
Keywords: Movie Similarity, NLP, KMeans, Cosine Similarity, Clustering, Unsupervised Learning
Summary: An NLP project that quantifies the similarities between movies based on their IMDb and Wikipedia plots, aiming to provide insights into movie relationships and group them into meaningful clusters.
Key Steps:
- Data Preprocessing using NLP techniques, such as Tokenization, Stemming and TF-IDF Vectorization
- Performed unsupervised learning with KMeans, first determining the optimal number of clusters using the elbow method and then assigning movies to clusters.
- Used Cosine Similarity to measure similarity distances between movie plots.
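The pipeline above can be sketched on toy plot summaries (stand-ins for the IMDb/Wikipedia text):

```python
# TF-IDF vectorization, KMeans clustering, and cosine similarity on toy plots.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

plots = [
    "A mob boss hands power to his reluctant son",
    "A crime family's son takes over the syndicate",
    "Astronauts travel through a wormhole to save humanity",
    "A crew journeys into deep space seeking a new home",
]

X = TfidfVectorizer(stop_words="english").fit_transform(plots)

# Cluster the plot vectors; n_clusters would normally come from an elbow plot.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Pairwise cosine similarity between plot vectors.
sim = cosine_similarity(X)
```

In the full project the text is tokenized and stemmed before vectorization, which merges word variants and sharpens both the clusters and the similarity scores.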
Next Steps:
- Explore additional features (e.g., genre, director) for improved clustering.
- Visualize clusters and explore movie recommendations within each cluster.
Hotel Bookings (SVM, classification, decision boundaries)
Keywords: support vector machine, classification, feature engineering, hyperparameter tuning
Summary: The project predicts whether a hotel booking will be canceled, using a support vector machine (SVM) classifier on a data set containing information such as the lead time, average daily rate, number of weekend nights, and arrival week number of each booking.
Key Steps:
- Preprocessing the data by scaling the numerical features and creating new binary and interactive features
- Selecting the most informative features based on mutual information scores
- Tuning the SVM hyperparameters using grid search cross-validation
- Evaluating the best model on the test set and plotting the decision boundaries for different kernels
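The steps above can be sketched as a single scikit-learn pipeline on synthetic stand-in data (the real bookings features are replaced by `make_classification` output):

```python
# Scaling, mutual-information feature selection, and grid-searched SVM.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                      # scale numerical features
    ("select", SelectKBest(mutual_info_classif, k=4)),  # keep most informative
    ("svm", SVC()),
])

# Tune C and the kernel with cross-validated grid search.
grid = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]}, cv=3)
grid.fit(X_train, y_train)
print(f"test accuracy: {grid.score(X_test, y_test):.2f}")
```

Putting the selector inside the pipeline keeps the mutual-information scoring inside each cross-validation fold, avoiding leakage from the held-out data.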
Next Steps:
- Compare the performance of the SVM classifier with other machine learning models, such as logistic regression, decision tree, or random forest
- Explore the effect of different feature selection methods, such as chi-square test, ANOVA, or recursive feature elimination
- Analyze the factors that influence the cancelation probability and provide recommendations to reduce it
- Deploy the model as a web application or a dashboard that can interact with real-time data
Predicting crops based on soil metrics (neural network, tensorflow, keras, random forest classifier)
Keywords: crop type prediction, soil metrics, tensorflow, keras, scikit-learn, logistic regression, random forest classifier, neural network.
Summary: This project predicts the best crop type for a soil sample based on four soil metrics: N, P, K, and pH, using Logistic Regression, Random Forest Classifier and Neural Network
Key Steps:
- Explores three machine learning algorithms: logistic regression, random forest, and a neural network from tensorflow
- Evaluates the models' performance using metrics such as F1-score and confusion matrix from scikit-learn
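The scikit-learn side of the comparison might look like the sketch below; synthetic N/P/K/pH values stand in for the real dataset, and the Keras neural network is omitted for brevity:

```python
# Compare logistic regression and random forest on toy soil metrics.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform([0, 0, 0, 4], [140, 140, 200, 9], size=(300, 4))  # N, P, K, pH
y = (X[:, 3] > 6.5).astype(int)  # toy rule: crop choice driven by pH alone

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(type(model).__name__, f1_score(y_test, preds))
    print(confusion_matrix(y_test, preds))
```

The real task is multiclass (one label per crop type), but both models and both metrics extend to it unchanged via averaged F1 and a larger confusion matrix.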
Next Steps: Collect more data from different regions and seasons to validate the model on new data.
Exploring the World’s Oldest Businesses using SQL (Structured data, SQL, sqlite3)
Keywords: Structured data, SQL, pandas, sqlite3, Data Analysis, Data Manipulation
Summary: Built a data exploration project using SQL techniques to analyze data from BusinessFinancing.co.uk on the world’s oldest businesses. The project involved creating a SQLite database, loading data from CSV files, and running SQL queries to gain insights into these historic businesses.
Key Steps:
- Created a SQLite database and defined the schema using SQL commands.
- Loaded data from CSV files into the database using pandas.
- Ran SQL queries to merge and manipulate the data, and used pandas to analyze the results.
- Libraries: sqlite3, pandas
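A minimal sketch of the workflow, with a small stand-in table instead of the BusinessFinancing.co.uk CSVs:

```python
# Build an in-memory SQLite database, load a table with pandas, query with SQL.
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")

# Stand-in rows; the project loads the real data from CSV files instead.
businesses = pd.DataFrame({
    "business": ["Kongo Gumi", "St. Peter Stifts Kulinarium"],
    "year_founded": [578, 803],
    "country": ["Japan", "Austria"],
})
businesses.to_sql("businesses", conn, index=False)

oldest = pd.read_sql(
    "SELECT business, year_founded FROM businesses ORDER BY year_founded LIMIT 1",
    conn,
)
print(oldest)
```

Reading query results straight into a DataFrame makes it easy to alternate between SQL for joins and filtering and pandas for the final analysis.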
Next Steps:
- Enhance the project by incorporating more datasets related to businesses.
- Explore the use of more advanced SQL techniques for further data analysis.
- Consider visualizing the results using a library like matplotlib or seaborn.
- Political Impact of Chinese Economic Engagement in Africa: PhD thesis - project that involved conducting a survey-experiment in Kenya and using advanced statistical models (e.g., multilevel, ordinal logistic, panel data model) to provide an in-depth assessment of the political impact of Chinese economic engagement in Africa.
- Profiling electoral candidates: My first NLP project that involved using quanteda package and doing a content analysis of a 2016 presidential debate of US Democratic Party’s candidates.
Some of the achievements that I have accomplished are:
- Graduated with a PhD in Political Science from Trinity College Dublin.
- Completed Accenture’s “Women in Data Science Accelerator”.
- Won the Irish Research Council Government of Ireland Postgraduate Scholarship, a highly competitive and prestigious research grant with an average success rate of 18% and a total amount of €48,000.
If you want to reach out to me, you can find me on:
Some fun facts about me are:
- I am originally from Transylvania
- I enjoy eating garlic
- I speak several languages: English, French, Spanish and Japanese.
- I love traveling and exploring new places.