RalucaN/README.md

Hi there 👋

About me 💬

  • My name is Raluca and I am a passionate Data Scientist with 5 years of experience in the reinsurance and tech industries, building scalable solutions that transform vast data landscapes into compelling business insights.
  • With a strong background in machine learning and data modeling (NLP and (un)supervised ML), I am proficient with an array of tools for data processing and analysis: Jupyter, Python, Airflow, AWS, MSSQL/PostgreSQL, Git.
  • I describe myself as a creative self-starter with a keen mind for solving tough problems with adaptive, data-driven and automated solutions.
  • I have a PhD in Political Science from Trinity College Dublin, where I conducted a survey-experiment in Kenya and used advanced statistical models (panel models, multilevel models) to assess the political impact of Chinese economic engagement in Africa.

Skills 💻

  • Data gathering (e.g. Python - requests, bs4/BeautifulSoup, boto3, pyodbc; survey-experiment; SQL - MySQL, Redshift, data lakes such as Dremio; Excel - Power Query)
  • Data processing (EDA using Python - matplotlib, pandas, numpy, seaborn; Excel - formulas, pivot)
  • Modelling (R - lme4, ordinal, panelAR, plm, Python - Scikit-learn, scipy, nltk, Tyche)
  • Data visualization (PowerBI, Tableau, R and Python - matplotlib, seaborn)
  • Orchestrating complex data pipelines (using Airflow)
  • Cloud infrastructure (AWS - S3, EC2, Redshift)

Languages and software that I know and/or use:

Python seaborn statsmodels Scikit-learn Pandas NumPy Keras

postgresql mysql sql

airflow AWS metabase mongodb jenkins Git VSCode

Projects 🚀

This section contains my projects which span a variety of data science topics and utilize different libraries/tools.
The wordcloud below summarizes the most frequent words used to describe my projects. It is regenerated on each change to the README using GitHub Actions. Feel free to check out the wordcloud-readme repo for more details about the generation process.

Wordcloud

Feel free to explore the links below to learn more about each project!



A. Complex projects - original ideas developed into end-to-end pipelines that combine several tools (e.g. Python, Spark, SQL)



🔥Hot project🔥: Leet Buster

Leet Buster

  • Leet (1337) Buster is an upcoming NLP library
  • It is designed to decode leet speak (1337 speak) and convert it back to standard text.
  • It employs a sophisticated three-step process to accurately identify and resolve leet-encoded words.
    Features:
    • Efficient leet speak decoding
    • Rule-based candidate identification
    • One solution generation
    • Advanced resolution using NLP and compressed FastText
    • Easy integration into existing NLP pipelines
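
The library's API is not shown in this README, so the following is only a hypothetical sketch of how rule-based candidate identification and solution generation might look. The substitution map `LEET_MAP`, the toy `VOCAB`, and all function names are illustrative assumptions, and the plain vocabulary lookup merely stands in for the NLP and FastText-based resolution step.

```python
from itertools import product

# Hypothetical substitution rules: each leet symbol maps to its possible letters.
LEET_MAP = {"0": "o", "1": "il", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}

# Toy vocabulary (assumption) standing in for a real language resource.
VOCAB = {"leet", "hello", "test", "salt"}


def is_candidate(token: str) -> bool:
    """Step 1: flag tokens that contain at least one known leet symbol."""
    return any(ch in LEET_MAP for ch in token)


def generate_solutions(token: str) -> list[str]:
    """Step 2: expand every symbol into its possible letters."""
    options = [LEET_MAP.get(ch, ch) for ch in token.lower()]
    return ["".join(combo) for combo in product(*options)]


def decode(token: str) -> str:
    """Step 3: keep the first expansion found in the vocabulary, else the token."""
    if not is_candidate(token):
        return token
    matches = [s for s in generate_solutions(token) if s in VOCAB]
    return matches[0] if matches else token


def decode_text(text: str) -> str:
    """Decode a whitespace-separated sentence token by token."""
    return " ".join(decode(tok) for tok in text.split())
```

A token such as `2024` is flagged as a candidate but left unchanged, because none of its expansions appears in the vocabulary.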



Cyber Attacks

  • New project that leverages the University of Maryland CISSM Cyber Attacks Database
  • Aim: to create an end-to-end data engineering pipeline to ingest, process, store, and visualize data on cyber attacks around the world between 2014 and 2023.
  • The project will be submitted as a capstone project for the Data Engineering Zoomcamp 2024
  • Expectations: employ cloud, IaC, workflow orchestration, data warehouse, and visualization tools to build a dashboard that shows trends and patterns in cyber events and breaches across sectors and regions.
  • Hard deadline: 1st of April 2024






TED talks NLP recommendation system

Keywords: Unstructured data, NLP, scikit-learn, sentiment analysis, similarity, Streamlit

Cohort: WaiPRACTICE September Cohort 2023 by Women in AI Ireland (WAI) (github page).

Summary: Built a content-based recommendation system using NLP techniques, sentiment analysis and similarity measures.

Key steps:


Next Steps:


Valheim's Steam user reviews analysis

Keywords: API requests, NLP, sentiment analysis, unstructured data

Summary: A project that aims to analyze user reviews about the game Valheim on Steam to understand why a game with such low quality graphics has a great reception from players.

Key steps:

  • Review collection using the Steam public API.
  • Sentiment analysis using a pre-trained BERT transformer.
  • The EDA process uncovered user exploits.
  • Python libraries: pandas, numpy, matplotlib and seaborn
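
As a rough illustration of the review collection step, the sketch below builds a request for Steam's public `appreviews` endpoint and flattens a response page. It is not the project's code: the app id constant, the chosen query parameters, and the helper names are assumptions, and the response field names (`reviews`, `review`, `voted_up`, `cursor`) should be checked against the Steamworks documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

VALHEIM_APP_ID = 892970  # assumed Steam app id for Valheim; verify on the store page


def build_reviews_url(app_id: int, cursor: str = "*", per_page: int = 100) -> str:
    """Build the URL for one page of reviews ("*" requests the first page)."""
    params = {
        "json": 1,
        "filter": "recent",
        "language": "english",
        "num_per_page": per_page,
        "cursor": cursor,
    }
    return f"https://store.steampowered.com/appreviews/{app_id}?{urlencode(params)}"


def extract_reviews(payload: dict) -> list[dict]:
    """Keep only the review text and the thumbs-up flag from a response page."""
    return [
        {"text": r.get("review", ""), "recommended": bool(r.get("voted_up"))}
        for r in payload.get("reviews", [])
    ]


def fetch_page(app_id: int, cursor: str = "*") -> tuple[list[dict], str]:
    """Download one page; returns the reviews and the cursor for the next page."""
    with urlopen(build_reviews_url(app_id, cursor), timeout=30) as resp:
        payload = json.load(resp)
    return extract_reviews(payload), payload.get("cursor", "")
```

Passing the returned cursor back into `fetch_page` would walk through further pages; the extracted texts could then be fed to the sentiment model.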

Next steps:


Root cause analysis for defects in production (root cause analysis, decision tree, neural networks)


Keywords: supervised ML, decision tree, random forest

Cohort: Women in Data Science Accelerator 2020 (Accenture)

Summary: Conducted a root cause analysis to predict defects in production using a decision tree model.

Key steps:

  • Software: R, Python, Tableau
  • Libraries: RPART, Boruta, Scikit-learn, Graphviz, dtreeviz

Next steps:




B. ML bits and Pieces - smaller-scale projects, usually consisting of one notebook or one script, meant to focus on a specific tool or ML aspect



Find movies' similarity (NLP, KMeans/Clustering, Unsupervised Learning)

ML bits and Pieces

Keywords: Movie Similarity, NLP, KMeans, Cosine Similarity, Clustering, Unsupervised Learning

Summary: An NLP project that quantifies the similarities between movies based on their IMDb and Wikipedia plots. It aims to provide insights into movie relationships and group them into meaningful clusters.

Key Steps:

  • Data Preprocessing using NLP techniques, such as Tokenization, Stemming and TF-IDF Vectorization
  • Performed unsupervised learning with KMeans, first determining the optimal number of clusters with the elbow method and then assigning movies to clusters.
  • Used Cosine Similarity to measure similarity distances between movie plots.
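
The TF-IDF and cosine similarity steps can be illustrated with a small dependency-free sketch. This is not the project's implementation (which presumably relies on library vectorizers); the tokenizer is deliberately naive, stemming is omitted, and the three plot summaries are invented placeholders. The IDF uses the common smoothed form ln((1 + n) / (1 + df)) + 1.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive tokenizer: lowercase, split on whitespace, keep alphabetic words."""
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented toy plots, for illustration only.
plots = [
    "a detective hunts a killer in the city",
    "a detective chases a killer across the city",
    "a young wizard learns magic at school",
]
vectors = tfidf_vectors(plots)
sim_crime = cosine_similarity(vectors[0], vectors[1])
sim_mixed = cosine_similarity(vectors[0], vectors[2])
```

Here the two crime plots score a higher similarity than the crime and fantasy pair, which is the kind of signal the clustering step builds on.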

Next Steps:

  • Explore additional features (e.g., genre, director) for improved clustering.
  • Visualize clusters and explore movie recommendations within each cluster.

Hotel Bookings (SVM, classification, decision boundaries)

ML bits and Pieces

Keywords: support vector machine, classification, feature engineering, hyperparameter tuning

Summary: The project predicts whether a hotel booking will be canceled, using a support vector machine (SVM) classifier trained on a data set that records, for each booking, the lead time, average daily rate, number of weekend nights, arrival week number, and similar attributes.

Key Steps:

  • Preprocessing the data by scaling the numerical features and creating new binary and interaction features
  • Selecting the most informative features based on mutual information scores
  • Tuning the SVM hyperparameters using grid search cross-validation
  • Evaluating the best model on the test set and plotting the decision boundaries for different kernels
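
To make the feature selection step concrete, here is a minimal sketch of scoring binary features by their mutual information with the cancellation label. It is a hand-rolled illustration for discrete values only, not the project's code (scikit-learn's `mutual_info_classif` is the usual tool for this); the column names and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature: list[int], target: list[int]) -> float:
    """Mutual information (in nats) between two discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, target))
    count_x = Counter(feature)
    count_y = Counter(target)
    return sum(
        (c / n) * math.log((c / n) / ((count_x[x] / n) * (count_y[y] / n)))
        for (x, y), c in joint.items()
    )


def rank_features(columns: dict[str, list[int]], target: list[int]) -> list[tuple[str, float]]:
    """Sort feature names by decreasing mutual information with the target."""
    scores = {name: mutual_information(col, target) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: 1 = canceled booking, 0 = kept booking.
canceled = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "has_weekend_nights": [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the label
    "long_lead_time": [1, 1, 1, 0, 0, 0, 0, 1],      # partly aligned with the label
}
ranking = rank_features(features, canceled)
```

In this toy data `long_lead_time` ranks first, while `has_weekend_nights` scores zero because it is statistically independent of the label.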

Next Steps:

  • Compare the performance of the SVM classifier with other machine learning models, such as logistic regression, decision tree, or random forest
  • Explore the effect of different feature selection methods, such as chi-square test, ANOVA, or recursive feature elimination
  • Analyze the factors that influence the cancelation probability and provide recommendations to reduce it
  • Deploy the model as a web application or a dashboard that can interact with real-time data

Predicting crops based on soil metrics (neural network, tensorflow, keras, random forest classifier)

ML bits and Pieces

Keywords: crop type prediction, soil metrics, tensorflow, keras, scikit-learn, logistic regression, random forest classifier, neural network.

Summary: This project predicts the best crop type for a soil sample based on four soil metrics (N, P, K, and pH), using Logistic Regression, a Random Forest Classifier and a Neural Network.

Key Steps:

  • Explores three machine learning algorithms: logistic regression, random forest, and a neural network from tensorflow
  • Evaluates each model's performance using metrics such as F1-score and confusion matrix from scikit-learn
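
For readers unfamiliar with the evaluation metrics, the sketch below computes a multi-class confusion matrix and a macro-averaged F1-score by hand. It only illustrates what the scikit-learn metrics report; the crop labels and predictions are invented, and macro averaging is an assumption since the README does not say which average was used.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fp = sum(row[i] for row in matrix) - tp  # predicted as class i, but wrong
        fn = sum(matrix[i]) - tp                 # class i samples that were missed
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented labels and predictions, for illustration only.
labels = ["maize", "rice", "wheat"]
y_true = ["rice", "rice", "maize", "wheat", "wheat", "maize"]
y_pred = ["rice", "maize", "maize", "wheat", "rice", "maize"]

matrix = confusion_matrix(y_true, y_pred, labels)
score = macro_f1(matrix)
```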

Next Steps: Collect more data from different regions and seasons to validate the model on new data.



ML bits and Pieces

Keywords: Structured data, SQL, pandas, sqlite3, Data Analysis, Data Manipulation

Summary: Built a data exploration project using SQL techniques to analyze data from BusinessFinancing.co.uk on the world’s oldest businesses. The project involved creating a SQLite database, loading data from CSV files, and running SQL queries to gain insights into these historic businesses.

Key Steps:

  • Created a SQLite database and defined the schema using SQL commands.
  • Loaded data from CSV files into the database using pandas.
  • Ran SQL queries to merge and manipulate the data, and used pandas to analyze the results.
  • Libraries: sqlite3, pandas
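
The workflow above can be sketched with Python's built-in `sqlite3` module. This is an assumption-laden illustration rather than the project's code: the table names, column names, and placeholder rows are hypothetical, and the standard `csv` module stands in for the pandas loading step.

```python
import csv
import io
import sqlite3

# Placeholder CSV contents (hypothetical schema and values).
BUSINESSES_CSV = """business,year_founded,country_code
Business A,1000,AAA
Business B,1200,BBB
Business C,1100,CCC
"""
COUNTRIES_CSV = """country_code,country,continent
AAA,Country A,Continent X
BBB,Country B,Continent Y
CCC,Country C,Continent Y
"""


def load_csv(conn: sqlite3.Connection, table: str, text: str) -> None:
    """Insert every data row of a CSV string into an existing table."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", data)


def oldest_by_continent() -> list[tuple[str, int]]:
    """Build an in-memory database and return the earliest founding year per continent."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE businesses (business TEXT, year_founded INTEGER, country_code TEXT)"
    )
    conn.execute(
        "CREATE TABLE countries (country_code TEXT, country TEXT, continent TEXT)"
    )
    load_csv(conn, "businesses", BUSINESSES_CSV)
    load_csv(conn, "countries", COUNTRIES_CSV)
    rows = conn.execute(
        """
        SELECT c.continent, MIN(b.year_founded) AS oldest
        FROM businesses AS b
        JOIN countries AS c ON b.country_code = c.country_code
        GROUP BY c.continent
        ORDER BY c.continent
        """
    ).fetchall()
    conn.close()
    return rows


result = oldest_by_continent()
```

The `INTEGER` column affinity converts the year strings read from the CSV into numbers, so `MIN` compares them numerically.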

Next Steps:

  • Enhance the project by incorporating more datasets related to businesses.
  • Explore the use of more advanced SQL techniques for further data analysis.
  • Consider visualizing the results using a library like matplotlib or seaborn.


PhD thesis and older projects:

  • Political Impact of Chinese Economic Engagement in Africa: PhD thesis - project that involved conducting a survey-experiment in Kenya and using advanced statistical models (e.g., multilevel, ordinal logistic, panel data model) to provide an in-depth assessment of the political impact of Chinese economic engagement in Africa.
  • Profiling electoral candidates: My first NLP project, which used the quanteda package to run a content analysis of a 2016 presidential debate between the US Democratic Party’s candidates.

Achievements 🏆

Some of the achievements that I have accomplished are:

  • Graduated with a PhD in Political Science from Trinity College Dublin.
  • Completed Accenture’s “Women in Data Science Accelerator”.
  • Won the Irish Research Council Government of Ireland Postgraduate Scholarship, a highly competitive and prestigious research grant with an average success rate of 18% and a total amount of €48,000.

My GitHub Streak Top Langs

Contact 📫

If you want to reach out to me, you can find me on:

 

Fun facts 🎉

Some fun facts about me are:

  • I am originally from Transylvania.
  • I enjoy eating garlic.
  • I speak several languages: English, French, Spanish and Japanese.
  • I love traveling and exploring new places.

Pinned

  1. Steam_reviews Steam_reviews Public

    This project aims to analyze user reviews about the game Valheim on Steam

    HTML

  2. PRODCO-DS PRODCO-DS Public

    DS project part of WIDS 2020

    HTML

  3. oratix oratix Public

  4. Data-projects Data-projects Public

    This repository contains examples of data projects I did or I am currently working on, using R and Python

    Jupyter Notebook

  5. women-in-ai-ireland/September-2023-Group-001 women-in-ai-ireland/September-2023-Group-001 Public

    This is the Group 001 repository for the WaiPRACTICE September cohort.

    Jupyter Notebook 2 1

  6. MLBitsAndPieces MLBitsAndPieces Public

    A collection of small-scale projects exploring the fascinating world of Machine Learning and Artificial Intelligence. Each project in this repository represents a step towards understanding and app…

    Jupyter Notebook 1