Data Science Portfolio

Research

EcoMultiRGNN: a Graph ML Model for Predicting Avian Migration from Ecological Suitability Networks

Alongside my collaborators, I am developing a novel graph neural network (GNN) tailored for link prediction in temporal multiplex networks, with the aim of better understanding complex, dynamic ecological systems. The work seeks to understand how anthropogenic impacts, such as climate change or habitat degradation, alter bird migration patterns along the East Atlantic flyway. To do this, we’re building a GNN architecture that captures both temporal and multilayer aspects of network data, validated on mobility network data as a proof of concept. The model integrates node, edge, and temporal embeddings through graph attention mechanisms to predict dynamic interactions over time. We are constructing ecological suitability networks connecting avian migration stopover sites, which will be used to develop practical use-cases of EcoMultiRGNN.

MigNar: An AI-driven Synthetic Ground Truth Benchmark for Narrative Extraction Algorithms

I am the lead author on an ongoing research collaboration between the Oxford Internet Institute and the University of Oxford’s Centre on Migration, Policy and Society (COMPAS), building MigNar - a benchmark dataset which leverages recent advances in generative transformer-based large language models (LLMs) to simulate ground truth in a corpus of synthetically generated articles on Brexit and migration. MigNar offers an innovative approach to benchmarking topic and narrative models by relying on synthetic data, making ground truth significantly more accessible. Thus far, MigNar has proven its capability to discriminate between algorithms in a theoretically consistent manner and to replicate multiple qualities of human-generated text. Work is ongoing on decreasing hallucination risk, as well as humanising the benchmark text.

NarrAI: Leveraging NLP, LLMs and Network Community Detection for Enhanced Narrative Extraction in the UK Migration Debate

At Oxford, my thesis developed NarrAI - a novel computational narrative extraction (CNE) algorithm. NarrAI is the first CNE algorithm to use both LLMs and network community detection as enhancements to traditional Bayesian probabilistic topic models. NarrAI shows superior performance in recovering coherent and contextually rich narratives compared to existing CNE algorithms. The narratives generated by NarrAI not only provide thematic insights, but also integrate agency and evaluative dimensions, offering a more nuanced understanding of discourse. This capability is crucial for drawing policy implications from the extracted narratives.

Parallel Network Change: An Analysis of Migration-Trade-Terrorism Co-Evolution with Temporal Graph Distances and Latent Space Modelling

Presented at the Networks and Time II conference hosted by the Network Science Institute at Northeastern University London.

Predicting Bilateral Refugee Flows: Evaluating the Gravity Model and Ethnocultural Linkages for Migration Policy

Gravity Models for Global Migration Flows: A Predictive Evaluation

Projects

World Emissions Clock

At the World Data Lab, I play a key role in the development of, and insight mining from, the World Emissions Clock (WEC) tool. Launched at COP27, the WEC tracks and forecasts greenhouse gas emissions for 180 countries and 24 subsectors across 3 emissions policy scenarios.

Towards Low Carbon Prosperity

Using the WEC, I led research on the possibility of a world which is prosperous yet sustainable - where developed economies can scale down emissions while maintaining wealth, and where developing economies can safely grow while keeping GHG emissions low. We found that if the rich countries adopted the best practices of their peers within each sector, their emissions would decline to only 3.3 tons per capita, less than half of the world average and only around 20% of the per-capita emissions of the U.S. We also found that rich countries cannot solve the problem alone - it is crucial that both developed and developing nations take corrective action.

GPA Predictors in the Fragile Families Challenge

In this project, I applied machine learning techniques to predict GPA outcomes using data from the Fragile Families and Child Wellbeing Study. I explored various models, including OLS, ElasticNet, Decision Trees, Random Forests, Gradient Boosting, and LGBM. After hyperparameter tuning, the Random Forest model achieved the best performance, with a 65.2% reduction in Mean Squared Error (MSE) from the baseline. Its performance is comparable to that of the top models from the Fragile Families Challenge (FFC): the model ranked first for predicting GPA in the FFC had an MSE of 0.377 on the holdout set (Fragile Families Challenge Team, 2016), while my model’s MSE was 0.206 (imputed holdout set) and 0.365 (unimputed holdout set). Thus, I provide a robust approach to identifying key GPA predictors across cognitive, environmental, and socioeconomic factors.
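
The reduction figure above is a relative change in MSE against a baseline. A minimal sketch of that calculation follows, assuming the baseline predicts the training-set mean GPA for every case (the text does not say which baseline was used). The helper names and toy numbers are illustrative only and are not taken from the study.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mse_reduction(y_true, y_model, y_baseline):
    """Fractional reduction in MSE of a model relative to a baseline."""
    base = mse(y_true, y_baseline)
    return (base - mse(y_true, y_model)) / base


# Toy values, made up for illustration (not Fragile Families data).
train_gpa = [2.5, 3.0, 3.5, 3.0]
holdout_gpa = [2.0, 3.0, 4.0]
model_pred = [2.5, 3.0, 3.5]

# Assumed baseline: predict the training mean for every holdout case.
train_mean = sum(train_gpa) / len(train_gpa)
baseline_pred = [train_mean] * len(holdout_gpa)

reduction = mse_reduction(holdout_gpa, model_pred, baseline_pred)
print(f"MSE reduction vs. baseline: {reduction:.1%}")
```

Reporting the reduction alongside the raw MSE makes results easier to compare across holdout sets whose baseline error differs.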

Bound by Faith? Exploring the Association Between Religious Proximity and Forced Migrant Count

In this project, I explored the association between religious proximity and forced migration flows using a dataset I constructed from multiple sources, covering bilateral flows from 1977 to 2023. By applying a Zero-Inflated Negative Binomial (ZINB) regression model, I found that an increase in the Religious Proximity Index between countries is significantly associated with a rise in forced migrant counts. The model highlights the overlooked role of religious ties in migration forecasting, suggesting that current models should be updated to incorporate these connections for more accurate predictions.
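
For readers unfamiliar with ZINB, the sketch below shows the distribution behind the model: a count is either a structural zero (with probability `pi`) or a draw from a negative binomial. It assumes the common NB2 parameterisation with mean `mu` and dispersion `alpha`, and the parameter values are hypothetical. In a full regression, `mu` and `pi` would be linked to covariates such as the Religious Proximity Index; that step is not shown here.

```python
import math


def nb_pmf(k, mu, alpha):
    """Negative binomial (NB2) pmf with mean mu > 0 and dispersion alpha > 0."""
    r = 1.0 / alpha        # NB "size" parameter
    p = r / (r + mu)       # success probability implied by the mean
    log_pmf = (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(p) + k * math.log1p(-p)
    )
    return math.exp(log_pmf)


def zinb_pmf(k, mu, alpha, pi):
    """Zero-inflated NB pmf: structural zero with probability pi, else NB."""
    count_part = (1.0 - pi) * nb_pmf(k, mu, alpha)
    return pi + count_part if k == 0 else count_part


# Hypothetical parameters, chosen only to illustrate the excess of zeros.
mu, alpha, pi = 3.0, 0.5, 0.3
print("P(Y = 0) under NB:  ", round(nb_pmf(0, mu, alpha), 3))
print("P(Y = 0) under ZINB:", round(zinb_pmf(0, mu, alpha, pi), 3))
```

The zero-inflation term is what lets the model handle the many country pairs with no recorded flow, while the dispersion parameter absorbs the heavy right tail of the non-zero counts.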

Can inflammatory negative sentiment predict in-degree centrality in online social networks? A Reddit data analysis
This project investigated whether inflammatory negative sentiment predicts in-degree centrality in online social networks. Using a sample of 5 submissions on r/UmbrellaAcademy, discourse around Elliot Page coming out as a transgender man was scored using sentiment analysis. The extracted data was used to construct a network of comments, and the in-degree centrality of each node was compared to inflammatory negative sentiment using multiple regression with time fixed effects. The results suggest that inflammatory negative sentiment has only a weak positive effect on centrality, rather than a strong one.
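
As a small illustration of the centrality measure used above, the sketch below computes normalised in-degree centrality for a toy comment network. It assumes an edge points from a reply to the comment it answers, which the text does not state explicitly, and the comment IDs are hypothetical.

```python
from collections import Counter


def in_degree_centrality(nodes, edges):
    """Normalised in-degree: number of incoming edges divided by (n - 1)."""
    n = len(nodes)
    if n < 2:
        return {node: 0.0 for node in nodes}
    incoming = Counter(target for _source, target in edges)
    return {node: incoming[node] / (n - 1) for node in nodes}


# Hypothetical comment IDs; an edge (a, b) means comment a replies to comment b.
comments = ["c1", "c2", "c3", "c4"]
replies = [("c2", "c1"), ("c3", "c1"), ("c4", "c2")]

centrality = in_degree_centrality(comments, replies)
for comment, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(comment, round(score, 3))
```

Under this construction, a comment's centrality is the share of other comments that reply to it directly, which is the quantity the regression then relates to its sentiment score.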

A data deep-dive into the ‘cooking’ Stack Exchange forum
This was a data-driven exploration of the Stack Exchange forum on cooking. The project involved data acquisition with a Stack Exchange scraper, as well as the construction of an MVP Score, measuring the helpfulness of a user on the forum, using activity, text and time metrics. The score is a weighted index with PCA-derived weights. Using the index, various patterns were identified, such as the geographic distribution of helpful users (MVPs).
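
One common way to derive index weights from PCA is sketched below. It assumes the weights are the absolute loadings of the first principal component on standardised metrics, rescaled to sum to 1; the exact weighting scheme behind the MVP Score is not described here, so treat this as one possible reading. The metric names and values are made up, and the leading eigenvector is found by simple power iteration to keep the example dependency-free.

```python
import math


def standardise(columns):
    """Z-score each metric (one inner list per metric)."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (len(col) - 1))
        out.append([(x - mean) / sd for x in col])
    return out


def correlation_matrix(z_cols):
    """Correlation matrix of already standardised metrics."""
    n = len(z_cols[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in z_cols]
            for ci in z_cols]


def first_component(matrix, iterations=500):
    """Leading eigenvector of a symmetric matrix via power iteration.

    Starts from a vector of ones, which works when the metrics are
    positively correlated (as in the toy data below).
    """
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


def pca_weights(columns):
    """Assumed scheme: absolute first-component loadings, scaled to sum to 1."""
    loadings = first_component(correlation_matrix(standardise(columns)))
    total = sum(abs(x) for x in loadings)
    return [abs(x) / total for x in loadings]


def index_scores(columns, weights):
    """Weighted sum of standardised metrics, one score per user."""
    z_cols = standardise(columns)
    return [sum(w * col[i] for w, col in zip(weights, z_cols))
            for i in range(len(z_cols[0]))]


# Hypothetical metrics for five users (higher is better on each).
activity = [1, 2, 3, 4, 5]
text_quality = [2, 1, 4, 3, 5]
response_speed = [1, 3, 2, 5, 4]
metrics = [activity, text_quality, response_speed]

weights = pca_weights(metrics)
scores = index_scores(metrics, weights)
print("weights:", [round(w, 3) for w in weights])
print("scores: ", [round(s, 3) for s in scores])
```

Deriving the weights from the data in this way gives more influence to metrics that move together with the others, instead of relying on hand-picked weights.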

YouBee
I am actively building an app - YouBee - as co-founder and Chief Data Officer. YouBee is a transformative tool addressing the longitudinal decline in youth mental health, offering a non-invasive self-assessment solution in the form of a simple mobile game. Through interactive, choice-driven scenarios resembling the game Episode, YouBee’s multi-layer AI model takes users’ choice data and maps it to insights about what their key stressors are, and what types of remedies fit them best. This dynamic, personalised mental health profile is then used to make highly tailored recommendations for resources, tools and professionals.

World Data Pro

At the World Data Lab, I support the development of World Data Pro - a leading insights and analytics platform used by business strategy and consumer insights leaders, investors, policymakers, economists, and researchers to forecast and quantify addressable markets, discover emerging opportunities, develop market and product strategies, and quantify impact. My main involvement has been in insights-mining, QAQC, data acquisition and researching missing data.

Internet Poverty Index

At the World Data Lab, I supported the development of the Internet Poverty Index (IPI) with data acquisition, insights mining and QAQC tasks. Currently, internet access is increasingly viewed as a basic requirement, alongside access to food, clothing, housing, and energy. The ability to accurately measure internet poverty can raise awareness and identify the most vulnerable groups. This is what the IPI does.