- My name is Raluca and I am a passionate Data Scientist with 5 years of experience in the reinsurance and tech industries, building scalable solutions that transform vast data landscapes into compelling business insights.
- With a strong background in machine learning and data modeling (NLP and (un)supervised ML), I am proficient with an array of tools for data processing and analysis: Jupyter, Python, Airflow, AWS, MSSQL/PostgreSQL, Git.
- I describe myself as a creative self-starter with a keen mind for solving tough problems with adaptive, data-driven and automated solutions.
- I have a PhD in Political Science from Trinity College Dublin, where I conducted a survey-experiment in Kenya and used advanced statistical models (panel models, multilevel model) to assess the political impact of Chinese economic engagement in Africa.
- Data gathering (e.g. Python - requests, bs4/BeautifulSoup, boto3, pyodbc; survey-experiment; SQL - MySQL, Redshift, data lake such as Dremio; Excel - power query)
- Data processing (EDA using Python - matplotlib, pandas, numpy, seaborn; Excel - formulas, pivot)
- Modelling (R - lme4, ordinal, panelAR, plm, Python - Scikit-learn, scipy, nltk, Tyche)
- Data visualization (PowerBI, Tableau, R and Python - matplotlib, seaborn)
- Orchestrating complex data pipelines (using Airflow)
- Cloud infrastructure (AWS - S3, EC2, Redshift)
This section contains my projects which span a variety of data science topics and utilize different libraries/tools.
The word cloud below summarizes the most frequent words used to describe my projects. It is regenerated on each change to the README using GitHub Actions. Feel free to check out the wordcloud-readme repo for more details about the generation process.
Feel free to explore the links below to learn more about each project!
A. Complex projects - original ideas developed into end-to-end pipelines and employing a combination of several tools (e.g. python, spark, SQL)
🔥Hot project🔥: Leet Buster
- Leet (1337) Buster is an upcoming NLP library
- It is designed to decode leet speak (1337 speak) and convert it back to standard text.
- It employs a sophisticated three-step process to accurately identify and resolve leet-encoded words.
Features:
- Efficient leet speak decoding
- Rule-based candidate identification
- One solution generation
- Advanced resolution using NLP and compressed FastText
- Easy integration into existing NLP pipelines
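A toy illustration of what rule-based leet decoding looks like (this is an illustrative sketch, not Leet Buster's actual API or mapping):

```python
# Illustrative character-substitution map; Leet Buster's real rules are richer.
LEET_MAP = {"1": "l", "3": "e", "4": "a", "5": "s", "7": "t", "0": "o", "@": "a", "$": "s"}

def decode_leet(text: str) -> str:
    """Replace common leet characters with their letter equivalents."""
    return "".join(LEET_MAP.get(ch, ch) for ch in text)

print(decode_leet("l33t sp34k"))  # -> "leet speak"
```

A naive map like this produces one candidate per word; the library's three-step process then disambiguates between candidates using NLP and compressed FastText embeddings.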
- A new project that leverages the University of Maryland CISSM Cyber Attacks Database
- Aim: to create an end-to-end data engineering pipeline to ingest, process, store, and visualize data on cyber attacks around the world between 2014 and 2023.
- The project will be submitted as a capstone project for the Data Engineering Zoomcamp 2024
- Expectations: employing cloud, IaC, workflow orchestration, data warehouse, and visualization tools to build a dashboard that shows the trends and patterns of cyber attacks across different sectors and regions.
- Hard deadline: 1st of April 2024
Keywords: Unstructured data, NLP, scikit-learn, sentiment analysis, similarity, Streamlit
Cohort: WaiPRACTICE September Cohort 2023 by Women in AI Ireland (WAI) (github page).
Summary: Built a content-based recommendation system using NLP techniques, sentiment analysis and similarity measures.
Key steps:
- Web scraping with API requests.
- TED talks recommender app using Streamlit.
- Libraries: Requests, pandas, NLTK, scikit-learn and scipy
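The content-based recommendation step can be sketched like this, with toy talk descriptions standing in for the real TED transcripts:

```python
# Minimal content-based recommender: TF-IDF vectors + cosine similarity.
# The talks below are toy placeholders, not the project's scraped data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

talks = {
    "The power of vulnerability": "emotions courage shame connection",
    "Do schools kill creativity?": "education creativity children learning",
    "How great leaders inspire action": "leadership inspiration purpose why",
}

titles = list(talks)
tfidf = TfidfVectorizer().fit_transform(talks.values())
sim = cosine_similarity(tfidf)  # pairwise similarity matrix

def recommend(title: str, top_n: int = 1) -> list:
    """Return the talks most similar to the given one."""
    idx = titles.index(title)
    ranked = sim[idx].argsort()[::-1]          # most similar first
    return [titles[i] for i in ranked if i != idx][:top_n]
```

In the real app, the similarity matrix is combined with sentiment scores and served through Streamlit.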
Next Steps:
- Publish the oratix Python library
- Explore transformers (BERT) for further enhancements.
Keywords: API requests, NLP, sentiment analysis, unstructured data
Summary: A project that analyzes user reviews of the game Valheim on Steam to understand why a game with such low-quality graphics has had such a great reception from players.
Key steps:
- Review collection using the Steam public API.
- Sentiment analysis using a pre-trained BERT transformer.
- EDA process uncovered user exploits
- Python libraries: pandas, numpy, matplotlib and seaborn
Next steps:
Root cause analysis for defects in production (root cause analysis, decision tree, neural networks)
Keywords: supervised ML, decision tree, random forest
Cohort: Women in Data Science Accelerator 2020 (Accenture)
Summary: Conducted a root cause analysis to predict defects in production using decision tree model
Key steps:
- Software: R, Python, Tableau
- Libraries: RPART, Boruta, Scikit-learn, Graphviz and dtreeviz
Next steps:
B. ML bits and Pieces - smaller scale projects, usually comprised of one notebook or one script, meant to focus on a specific tool or ML aspect
Find movies' similarity (NLP, KMeans/Clustering, Unsupervised Learning)
Keywords: Movie Similarity, NLP, KMeans, Cosine Similarity, Clustering, Unsupervised Learning
Summary: An NLP project that quantifies the similarities between movies based on their IMDb and Wikipedia plots, aiming to provide insights into movie relationships and group them into meaningful clusters.
Key Steps:
- Data Preprocessing using NLP techniques, such as Tokenization, Stemming and TF-IDF Vectorization
- Performed unsupervised learning with KMeans, first determining the optimal number of clusters using the elbow method and then assigning movies to clusters.
- Used Cosine Similarity to measure similarity distances between movie plots.
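The pipeline above can be sketched on toy plot summaries (stand-ins for the IMDb/Wikipedia text):

```python
# TF-IDF vectorization, KMeans clustering, and cosine similarity on toy plots.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

plots = [
    "A mob boss hands power to his reluctant son",
    "A crime family's son takes over the syndicate",
    "Astronauts travel through a wormhole to save humanity",
    "A crew journeys into deep space seeking a new home",
]

X = TfidfVectorizer(stop_words="english").fit_transform(plots)

# Cluster the plot vectors; n_clusters would normally come from an elbow plot.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Pairwise cosine similarity between plot vectors.
sim = cosine_similarity(X)
```

In the full project the text is tokenized and stemmed before vectorization, which merges word variants and sharpens both the clusters and the similarity scores.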
Next Steps:
- Explore additional features (e.g., genre, director) for improved clustering.
- Visualize clusters and explore movie recommendations within each cluster.
Hotel Bookings (SVM, classification, decision boundaries)
Keywords: support vector machine, classification, feature engineering, hyperparameter tuning
Summary: The project predicts whether a hotel booking will be canceled, using a support vector machine (SVM) classifier on a data set containing information such as the lead time, average daily rate, number of weekend nights, and arrival week number of each booking.
Key Steps:
- Preprocessing the data by scaling the numerical features and creating new binary and interactive features
- Selecting the most informative features based on mutual information scores
- Tuning the SVM hyperparameters using grid search cross-validation
- Evaluating the best model on the test set and plotting the decision boundaries for different kernels
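The steps above can be sketched as a single scikit-learn pipeline on synthetic stand-in data (the real bookings features are replaced by `make_classification` output):

```python
# Scaling, mutual-information feature selection, and grid-searched SVM.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                      # scale numerical features
    ("select", SelectKBest(mutual_info_classif, k=4)),  # keep most informative
    ("svm", SVC()),
])

# Tune C and the kernel with cross-validated grid search.
grid = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]}, cv=3)
grid.fit(X_train, y_train)
print(f"test accuracy: {grid.score(X_test, y_test):.2f}")
```

Putting the selector inside the pipeline keeps the mutual-information scoring inside each cross-validation fold, avoiding leakage from the held-out data.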
Next Steps:
- Compare the performance of the SVM classifier with other machine learning models, such as logistic regression, decision tree, or random forest
- Explore the effect of different feature selection methods, such as chi-square test, ANOVA, or recursive feature elimination
- Analyze the factors that influence the cancelation probability and provide recommendations to reduce it
- Deploy the model as a web application or a dashboard that can interact with real-time data
Predicting crops based on soil metrics (neural network, tensorflow, keras, random forest classifier)
Keywords: crop type prediction, soil metrics, tensorflow, keras, scikit-learn, logistic regression, random forest classifier, neural network.
Summary: This project predicts the best crop type for a soil sample based on four soil metrics: N, P, K, and pH, using Logistic Regression, Random Forest Classifier and Neural Network
Key Steps:
- Explores three machine learning algorithms: logistic regression, random forest, and a neural network from tensorflow
- Evaluates the models' performance using metrics such as F1-score and confusion matrix from scikit-learn
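The scikit-learn side of the comparison might look like the sketch below; synthetic N/P/K/pH values stand in for the real dataset, and the Keras neural network is omitted for brevity:

```python
# Compare logistic regression and random forest on toy soil metrics.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform([0, 0, 0, 4], [140, 140, 200, 9], size=(300, 4))  # N, P, K, pH
y = (X[:, 3] > 6.5).astype(int)  # toy rule: crop choice driven by pH alone

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(type(model).__name__, f1_score(y_test, preds))
    print(confusion_matrix(y_test, preds))
```

The real task is multiclass (one label per crop type), but both models and both metrics extend to it unchanged via averaged F1 and a larger confusion matrix.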
Next Steps: Collect more data from different regions and seasons to validate the model on new data.
Exploring the World’s Oldest Businesses using SQL (Structured data, SQL, sqlite3)
Keywords: Structured data, SQL, pandas, sqlite3, Data Analysis, Data Manipulation
Summary: Built a data exploration project using SQL techniques to analyze data from BusinessFinancing.co.uk on the world’s oldest businesses. The project involved creating a SQLite database, loading data from CSV files, and running SQL queries to gain insights into these historic businesses.
Key Steps:
- Created a SQLite database and defined the schema using SQL commands.
- Loaded data from CSV files into the database using pandas.
- Ran SQL queries to merge and manipulate the data, and used pandas to analyze the results.
- Libraries: sqlite3, pandas
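A minimal sketch of the workflow, with a small stand-in table instead of the BusinessFinancing.co.uk CSVs:

```python
# Build an in-memory SQLite database, load a table with pandas, query with SQL.
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")

# Stand-in rows; the project loads the real data from CSV files instead.
businesses = pd.DataFrame({
    "business": ["Kongo Gumi", "St. Peter Stifts Kulinarium"],
    "year_founded": [578, 803],
    "country": ["Japan", "Austria"],
})
businesses.to_sql("businesses", conn, index=False)

oldest = pd.read_sql(
    "SELECT business, year_founded FROM businesses ORDER BY year_founded LIMIT 1",
    conn,
)
print(oldest)
```

Reading query results straight into a DataFrame makes it easy to alternate between SQL for joins and filtering and pandas for the final analysis.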
Next Steps:
- Enhance the project by incorporating more datasets related to businesses.
- Explore the use of more advanced SQL techniques for further data analysis.
- Consider visualizing the results using a library like matplotlib or seaborn.
- Political Impact of Chinese Economic Engagement in Africa: PhD thesis - project that involved conducting a survey-experiment in Kenya and using advanced statistical models (e.g., multilevel, ordinal logistic, panel data model) to provide an in-depth assessment of the political impact of Chinese economic engagement in Africa.
- Profiling electoral candidates: My first NLP project that involved using quanteda package and doing a content analysis of a 2016 presidential debate of US Democratic Party’s candidates.
Some of the achievements that I have accomplished are:
- Graduated with a PhD in Political Science from Trinity College Dublin.
- Completed Accenture’s “Women in Data Science Accelerator”.
- Won the Irish Research Council Government of Ireland Postgraduate Scholarship, a highly competitive and prestigious research grant with an average success rate of 18% and a total amount of €48,000.
If you want to reach out to me, you can find me on:
Some fun facts about me are:
- I am originally from Transylvania
- I enjoy eating garlic
- I speak several languages: English, French, Spanish and Japanese.
- I love traveling and exploring new places.