Dogugun Ozkaya

Verified Expert in Engineering

Data Scientist and Software Developer

İstanbul, Turkey

Toptal member since November 10, 2021

Expertise

Machine Learning, Data Science, Data Analysis, Data Engineering, Dashboard, Recommendation Systems, Python, SQL, NumPy, Software Development, Big Data Architecture, Deep Learning, Data Scraping, AWS

Bio

Dogugun is a skilled data scientist with expertise in machine learning and data-centric solutions. Proficient in Python and SQL, he drives data-driven decisions and delivers innovative AI solutions. His strengths lie in B2B ML projects, automated pipelines, and real-time APIs. Dogugun's predictive modeling and visualization dashboards empower businesses to optimize processes and gain strategic insights. Dogugun is a valuable asset in advancing businesses with ML-driven solutions.

Portfolio

Nesine.com

Python, MLflow, Scikit-learn, PyTorch, Hugging Face, Hugging Face Transformers...

GLXY Software, LLC

Machine Learning, Artificial Intelligence (AI), Azure, Python, Neo4j...

Amperecloud

Python, Time Series, Time Series Analysis, Scikit-learn, Machine Learning...

Experience

Python - 12 years
Machine Learning - 10 years
Data Science - 10 years
SQL - 8 years
Statistics - 5 years
Bioinformatics - 4 years
PySpark - 4 years
Amazon Web Services (AWS) - 3 years

Preferred Environment

PyCharm, Jupyter, Amazon Web Services (AWS), Megalodon

The most amazing...

...project was a B2B ML project for a data-driven decision-making tool in an enterprise consultation company. I also took part in its automation and deployment.

Work Experience

Lead AI Engineer and Data Scientist

2024 - PRESENT

Nesine.com

Served as the lead AI engineer and built the AI infrastructure for the company, including the local Qwen3 model, OpenWebUI, and Elastic. Introduced agentic RAG methodology for QA team use cases.
Led and developed the horse racing statistical analysis project using Knowledge Graph-based RAG with Neo4j and OpenAI.
Implemented and improved two Turkish text classification models that serve four million active users with the BERT model. One is a binary classifier for customer comments, and the other is an issue classifier for customer complaints.
Initiated, led, and delivered a customer service chatbot application that combines RASA for operational scenarios, OpenAI for unstructured scenarios, and an internal Qwen3 model for intent estimation.
Developed an AI-based competition analysis for football matches using OpenAI and document-based RAG.
Initiated, led, and delivered a personalization model based on a 2-tower algorithm. The project aims to recommend customer league and market-based matches and bets.
Developed text classification models that classify end-user messages on the message board to filter out malicious content. This BERT-based model achieved an F1 score of 0.96 for classifying such content.
Extended the BERT model for intent classification of the customer messages that arrive at customer service. With an F1 score of 0.87, the model helped the customer service agents reduce their workload.
Improved and refactored the legacy models that forecast financial KPIs and anomalies. Introduced a money transfer optimization model across payment accounts. Also developed time-series projects to detect network and login anomalies.
Initiated, led, and managed CRM projects for cost minimization and conversion maximization, including LTV, point distribution optimization, churn detection, and financial and behavioral segmentation models.

Technologies: Python, MLflow, Scikit-learn, PyTorch, Hugging Face, Hugging Face Transformers, NeuralProphet, TensorFlow, Neo4j, OpenAI, OpenAI API, Large Language Models (LLMs), Elastic, ELK (Elastic Stack), Langfuse, Amazon Bedrock, vLLM, Agentic AI, AI Agents, Prompt Engineering

Machine Learning Advisor

2023 - 2024

GLXY Software, LLC

Developed data science and advanced analytics solutions to explore the use of patient data in the insurance domain.
Introduced multiple models and their mixture to build explainable ML solutions for better insurance coverage. Implemented a claim lineage to build a knowledge graph using Neo4j to achieve that.
Implemented an AI-based solution in the project's last phase, utilizing the previously explained knowledge graph. This RAG-based approach extracted necessary information from the KG and consulted the LLM for extra reasoning.

Technologies: Machine Learning, Artificial Intelligence (AI), Azure, Python, Neo4j, Claude API, CatBoost, Knowledge Graphs, FAISS, Prompt Engineering

Lead Data Scientist

2023 - 2023

Amperecloud

Created time series analytical modules to enable customers to monitor the performance of their PV-energy facilities.
Designed and developed time series forecasting models for both power generation and power loss based on seasonal and irradiation data.
Developed a comprehensive loss forecast model, coupled with a shading detection algorithm, to detect and classify loss amounts effectively.
Established a streamlined data pipeline, optimizing data collection from MongoDB and VictoriaMetrics into Redis for efficient utilization in analytical tasks.

Technologies: Python, Time Series, Time Series Analysis, Scikit-learn, Machine Learning, Pandas, NumPy, VictoriaMetrics, Redis

Data Scientist

2022 - 2023

Big Consultancy Company

Developed a B2B machine learning project, complete with an automation pipeline, to cater to diverse internal teams within the organization. Delivered a versatile API serving multiple stakeholders.
Created automated Jupyter notebooks tailored to business stakeholders' needs.
Introduced coverage metrics to assess the data pipeline's effectiveness and successfully unified disparate data sources.
Orchestrated the automation of training, deployment, and model scoring processes in a containerized cloud environment, streamlining operations for increased efficiency.

Technologies: Python, Predictive Modeling, Data Science, Amazon Web Services (AWS), Snowflake, SQL, CatBoost, Docker, Apache Airflow, Tableau

Data Scientist

2021 - 2022

The Conti Group, LLC

Developed an analysis project powered by machine learning to estimate urban development and its effect on real estate at macro and micro levels.
Augmented the dataset by collecting data from online resources, seamlessly integrating them into our ML model.
Developed web scraping bots to collect data from online resources and websites efficiently and in an automated way.
Developed user-friendly Jupyter notebooks and PowerBI dashboards to convey model outcomes and their business implications.

Technologies: Python, Data Science, Data Reporting, Data Analytics, Microsoft Power BI, Scikit-learn, Web Scraping, Data Scraping, Scraping, Machine Learning, Gradient Boosting, Linear Regression, Feature Engineering

Data Scientist

2019 - 2022

Amadeus

Implemented a customer lifetime value (CLV) prediction model based on loyalty points, contributing to the improvement in customer segmentation products.
Worked on extracting loyalty KPIs and creating visualizations using Qlik Sense, collaborating closely with the business intelligence team. We built a data pipeline from the ground up using PySpark tailored specifically for analytical use cases.
Collaborated with a consultancy team to deliver simulation projects and conduct in-depth analyses focused on exploring alternative strategies for increasing engagement.
Developed, during the COVID-19 crisis, a recommendation tool geared toward increasing passenger engagement. This tool was centered around non-air loyalty items and utilized the ALS Library.
Built a profile update-based fraud prediction and monitoring tool for loyalty.

Technologies: Python, PySpark, Recommendation Systems, Machine Learning, Data Science, Big Data, SQL, Data Analytics, Dashboards, Qlik Sense

Data Scientist

2018 - 2019

Enerjisa Uretim A.S. — E.ON Energy

Developed predictive maintenance capabilities for thermal plants through a time series forecasting project focusing on FID fans within combustion engines.
Collaborated closely with the engineering team at a thermal plant and successfully enhanced coal calorie prediction within the designated mining zone, leveraging the SGeMS tool and ordinary kriging techniques.
Developed, in collaboration with the trading team, a model for forecasting the electricity market-clearing price. The outputs were instrumental in guiding the trading team's decisions regarding surplus and shortage pricing strategies.
Created near real-time dashboards at both plant and portfolio levels within Grafana.

Technologies: Python, Apache NiFi, OSIsoft PI, SQL, Data Science, Data Analytics, Dashboards, Qlik Sense, Grafana, Time Series Analysis, Predictive Modeling, Keras, Artificial Intelligence (AI)

Software Engineer and Data Scientist

2015 - 2018

Mavi Jeans

Took part in developing and deploying an online store on AWS by integrating the ERP services with the eCommerce platform.
Implemented a recommendation engine using an ALS model, deployed on an AWS EMR instance.
Collaborated with the eCommerce team to develop ad-hoc propensity models geared toward bolstering targeted marketing capabilities.
Developed back-end services for an in-house CRM tool, working extensively with NoSQL, particularly Couchbase.

Technologies: Recommendation Systems, Amazon EC2, PySpark, Machine Learning, Data Science, SQL

Experience

Customer Service Chatbot

https://www.nesine.com/yardim/Para-Islemleri/86/Para-Cekme

Initiated, led, and delivered a customer service chatbot application that can handle structured scenarios, free text conversations, and intent classification.

Within this structure, intent classification was handled by a few-shot approach using a local Qwen3-235B model.

Free-text questions for unstructured scenarios were handled with agentic RAG using Elasticsearch with reciprocal rank fusion (RRF), where retrieved documents were rephrased with OpenAI. Another tool was designed to fetch user history from Redis to ensure the relevancy of the user messages. In the final phase, I also introduced Langfuse to monitor and track user questions and the relevancy of responses.
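
As a rough illustration of the RRF step, the sketch below fuses two ranked result lists into one. It is a minimal stand-alone version of the general technique, not the Elasticsearch implementation or the production code; the document IDs are hypothetical, and k=60 is a commonly used default that I assume here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically by ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical hits from a keyword query and a vector query.
keyword_hits = ["withdrawal_faq", "iban_rules", "bonus_terms"]
vector_hits = ["withdrawal_faq", "limits_policy", "iban_rules"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, which is why RRF is a convenient way to combine lexical and semantic retrieval without tuning score scales.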

For the structured scenario use cases, especially in financial transactions, I used the RASA framework. In this phase, I managed the user story with predefined buttons and fixed user input, e.g., amounts and IBANs.
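
The profile does not describe how fixed inputs such as IBANs were checked. One common option, shown here purely as an assumption, is the standard mod-97 checksum test that a slot validator could run before accepting the value. The function name and the minimum-length guard are illustrative.

```python
def is_valid_iban(iban):
    """Check an IBAN with the standard mod-97 checksum rule."""
    compact = iban.replace(" ", "").upper()
    # Assumed guard: shortest national IBAN formats are 15 characters long.
    if len(compact) < 15 or not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Map digits to themselves and letters to 10..35 (A=10, ..., Z=35).
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


print(is_valid_iban("GB82 WEST 1234 5698 7654 32"))
print(is_valid_iban("GB82 WEST 1234 5698 7654 33"))
```

This only verifies the checksum; country-specific length and bank-code rules would need separate checks.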

Fixing Rejected Claims with ML & AI

This multi-layer solution was designed to fix rejected medical insurance claims so that they get paid.

The initial step was to predict whether a claim would be rejected using C4.5 decision trees and, if so, to identify the contributing factors from the decision tree's paths.
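
To show how a tree's path can double as an explanation, here is a minimal sketch that walks a small tree and records each test applied to a claim. The tree is hand-written rather than trained with C4.5, it handles numeric thresholds only, and the feature names and thresholds are hypothetical.

```python
def decision_path(node, sample):
    """Walk a decision tree and record every test applied to the sample."""
    path = []
    while "label" not in node:
        feature, threshold = node["feature"], node["threshold"]
        if sample[feature] <= threshold:
            path.append(f"{feature} <= {threshold}")
            node = node["left"]
        else:
            path.append(f"{feature} > {threshold}")
            node = node["right"]
    return node["label"], path


# Hypothetical hand-written tree; not derived from real claims data.
tree = {
    "feature": "days_to_submit", "threshold": 30,
    "left": {
        "feature": "missing_fields", "threshold": 0,
        "left": {"label": "accepted"},
        "right": {"label": "rejected"},
    },
    "right": {"label": "rejected"},
}

claim = {"days_to_submit": 12, "missing_fields": 2}
label, reasons = decision_path(tree, claim)
print(label, reasons)
```

The list of conditions on the path to a "rejected" leaf is the kind of factor list that can be surfaced to a reviewer.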

Then we focused on building the claim lineage to determine the story of the patients and their claims.

Finally, we utilized Claude's API to determine the root cause of the claim's response. To better resolve the underlying ICD-10 terms, we built a RAG system from claims textbooks and ICD-10 documentation. We processed the books and documents with Docling and used them in a FAISS-backed RAG system.

Knowledge Graph-based Statistics Interpretation with AI

https://www.atyarisi.com/

As the lead AI engineer, I managed and implemented the project of generating horse race-specific statistical analyses.

The knowledge base was built on Neo4j, and Cypher queries were generated by a tool that calls the OpenAI API. The resulting statistical information, in JSON format, was then rephrased using the OpenAI chat completion API.

The knowledge graph contains global horse racing information, including all the details: races, horses, jockeys, trainers, training, and so on. The generated statistical analysis content provides helpful facts for bettors and domain experts.

Text Classification with Turkish BERT

https://www.nesine.com/kupondas?act=getpublicfeeds

Designed and developed a text classification system using a Turkish-specific Google BERT model to classify customer texts in two different tasks.

One is the classification of user messages posted to a message board for a betting/sports company. The model is trained monthly on a local GPU cluster. The deployed model runs on a 4-node cluster and serves up to three million active customers daily. The current F1 score of the model is 0.96.
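
For readers unfamiliar with the metric, the sketch below computes a binary F1 score from scratch. The labels are made up, and the profile does not say which averaging scheme was used for the reported score, so treat the binary form as an assumption.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 = content to filter, 0 = acceptable content.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
score = f1_score(y_true, y_pred)
print(round(score, 2))
```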

Another version of the same model is trained specifically to detect customer intent in messages sent to customer service. This model, with an F1 score of 0.87, significantly reduced the effort required for customer service agents to manually select the topic.

The models are managed by MLflow and served with FastAPI.

Customer Analytics Suite

Developed and managed a customer analytics product for the customer service team, where the project included: GDPR data masking, sentiment analysis, intent classification, and speech-to-text transcription components.

To complete GDPR masking, we utilized regex patterns and a Turkish NER model from Hugging Face.
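
The regex half of that masking step might look like the sketch below. The patterns are illustrative assumptions (email, Turkish IBAN, Turkish mobile number), not the ones used in the product, and the NER model that would catch names and addresses is not shown.

```python
import re

# Illustrative patterns only; order matters so that the IBAN is masked
# before the phone pattern can touch any of its digits.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IBAN": re.compile(r"\bTR\d{2}(?: ?\d{4}){5} ?\d{2}\b"),
    "PHONE": re.compile(r"(?<!\d)(?:\+90|0)\s?5\d{2}\s?\d{3}\s?\d{2}\s?\d{2}\b"),
}


def mask_pii(text):
    """Replace every pattern match with a bracketed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


message = (
    "Reach me at ayse@example.com or 0532 123 45 67, "
    "IBAN TR33 0006 1005 1978 6457 8413 26."
)
masked = mask_pii(message)
print(masked)
```

Regexes cover well-structured identifiers; free-form entities such as person names are where the NER model would come in.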

For the sentiment analysis part, for both intensity (negative-neutral-positive) and category (happy, angry, sad), we utilized a few-shot methodology with the local Qwen3 model.
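
A few-shot setup of this kind usually comes down to a prompt template plus a strict parser for the model's answer. The sketch below shows one possible shape; the example messages, prompt wording, and answer format are hypothetical, and the call to the local model is deliberately left out.

```python
INTENSITIES = ("negative", "neutral", "positive")
CATEGORIES = ("happy", "angry", "sad")

# Hypothetical labeled examples (in English for readability).
EXAMPLES = [
    ("My withdrawal has been pending for three days!", "negative", "angry"),
    ("Thanks, the issue was solved quickly.", "positive", "happy"),
]


def build_prompt(message):
    """Assemble instructions, labeled examples, and the new message."""
    lines = [
        "Classify the customer message.",
        f"Intensity must be one of: {', '.join(INTENSITIES)}.",
        f"Category must be one of: {', '.join(CATEGORIES)}.",
        "",
    ]
    for text, intensity, category in EXAMPLES:
        lines += [f"Message: {text}", f"Answer: {intensity}, {category}", ""]
    lines += [f"Message: {message}", "Answer:"]
    return "\n".join(lines)


def parse_answer(raw):
    """Return (intensity, category), or None if the output is off-label."""
    parts = [part.strip().lower() for part in raw.split(",")]
    if len(parts) == 2 and parts[0] in INTENSITIES and parts[1] in CATEGORIES:
        return parts[0], parts[1]
    return None


prompt = build_prompt("I lost my bet again.")
print(prompt)
print(parse_answer("Negative, Sad"))
```

Rejecting off-label answers in the parser keeps free-form model output from leaking into downstream components.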

The intent classification was implemented using the legacy BERT model, which detects the topic of the customer complaint.

To include call center data in the solution, we used the whisper-large-v3 model from Hugging Face, then summarized the transcripts with the local Qwen3 model before sending them to the pipeline of other components.

B2B Consultancy

An industrialized B2B machine learning solution. In this engagement with Toptal, I was part of a team focused on developing the solution.

My role encompassed the creation of the ML model and an API to serve real-time needs within the organization. Additionally, I visualized coverage metrics within the data pipeline, optimizing the process. In the final stages of the project, I worked on visualizing the data model outputs and SHAP values. I contributed by building an MLOps pipeline and automating the training, deployment, and scoring processes of the ML application within a Dockerized environment on AWS.

Urban Development Analysis

An analysis project powered by machine learning solutions to estimate urban development. The initial step was to determine the KPIs to use as the target variable. I scraped websites like The Good Schools and Walk Score with a user-like and efficient approach. I also developed Jupyter notebooks and PowerBI dashboards to deliver model outputs.

Drug Information Chatbot

https://github.com/dogugun/drug_chatbot_rag

In this chatbot project, we aimed to develop a chatbot that utilizes state-of-the-art LLM models to deliver concise information from drug package inserts.

For this purpose, we captured drug labels from the FDA's web resources in XML format and converted them to PDF. Then, we utilized LangChain and HF embeddings to convert them to vectors and upload them to the designated Pinecone instance.

In the final phase, we employed OpenOrca's Mistral model to generate answers from the list of documents we captured from our vector database.

The end solution is deployed as an API with Flask to be served on an AWS EC2 instance.

Predictive Maintenance for CID Fans in Thermal Power Plant

https://github.com/dogugun/industrial_tsp

A demo project that mirrors my work in the power generation industry: it estimates a sensor's value over a 12-hour prediction horizon from its own and other related sensors' values in a 6-hour time window.

In the original project, the goal was to estimate the vibration of the combustion fan. The input features were temperature, dust collection, and airstream sensors. The target value was the vibration frequency at prediction horizons of 6, 9, and 12 hours.
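
The window-and-horizon framing can be sketched as a small function that turns a series into supervised samples. This is a simplified, single-sensor version that assumes hourly readings; the linked repository may structure the data differently.

```python
def make_windows(series, window, horizon):
    """Pair each `window`-long slice with the value `horizon` steps after it."""
    samples = []
    last_start = len(series) - window - horizon
    for start in range(last_start + 1):
        end = start + window                 # first index after the input window
        target = series[end + horizon - 1]   # `horizon` steps past the last input
        samples.append((series[start:end], target))
    return samples


# Made-up hourly readings: the value equals the hour index, for easy checking.
hourly = list(range(24))
samples = make_windows(hourly, window=6, horizon=12)
print(len(samples), samples[0])
```

With 24 hourly points, a 6-hour window, and a 12-hour horizon, the first sample uses hours 0 to 5 as input and hour 17 as the target. Calling the function with horizon=6 or horizon=9 gives the other two setups.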

The input data is collected in SQL Server from OSIsoft PI's IoT data. The model is deployed to the on-premise server, and the outputs are visualized in Grafana, along with other input feature values.

In the original project, a GBM was used because it was easy to retrain and retune. This demo version uses LSTM as an alternative approach.

Finally, the deployed model and predictive monitoring dashboard are used by shift supervisors to plan their maintenance program proactively.

Market-clearing Price Estimation

In this challenging task, I aimed to predict the market-clearing price for the energy market. The output of this model is used to plan the week-ahead surplus and shortage energy price possibilities.

The model data consisted of renewable forecasts, demand forecasts, plant availability declarations, and natural gas prices. The data is collected in a DWH model in SQL Server from APIs, web scraping, and Excel file reports.

The model was developed with the gradient-boosting regressor from the scikit-learn library.

Recommendation Engine for Online Retail Store

I developed a recommendation engine for mavi.com, an online store, using PySpark's ALS model. We used purchase data and access logs to create personalized product suggestions, aiming to boost the conversion rate and introduce cross-category recommendations. The project successfully improved user engagement and conversion rates, highlighting the power of data-driven strategies in eCommerce.

Comparison of Non-parametric Models and Neural Networks in Blood Glucose Prediction

In my master's thesis, supervised by Prof. Albert Guvenis, I conducted a comparative analysis of blood glucose prediction models. We focused on replicating existing models using AIDA2 simulator data, categorizing them into two groups: non-parametric and neural networks.

Within the non-parametric category, we compared Random Forest Regression (RFR), Gradient Boosting Machines (GBM), and Support Vector Machines (SVM). We pitted Long Short-Term Memory Networks (LSTMs) against Adaptive Neuro-Fuzzy Inference Systems (ANFIS) in the neural network category.

The study revealed that SVM achieved the lowest Root Mean Square Error (RMSE) among the non-parametric models, while ANFIS performed best among the neural networks and also surpassed SVM.
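
For reference, RMSE is the square root of the mean squared difference between predictions and actual values, and lower is better. The sketch below compares two hypothetical models on made-up glucose readings; none of the numbers come from the thesis.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up glucose readings (mg/dL) and two hypothetical models' predictions.
actual = [100.0, 110.0, 120.0, 130.0]
model_a = [102.0, 108.0, 123.0, 129.0]
model_b = [100.0, 115.0, 115.0, 130.0]

print(rmse(actual, model_a), rmse(actual, model_b))
```

Because errors are squared before averaging, a model with a few large misses (model_b here) is penalized more than one with many small ones.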

Web Scraping and Data Pipeline

This project uses a web scraper to capture profile and activity information from top-level social media accounts for compliance reporting.

The automated scraper is triggered daily by cron jobs, and the collected data is saved to a PostgreSQL database on AWS.

Stock Market Data Extraction

In a personal finance project, I created a Python script to collect data from finance.yahoo.com for a selection of Turkish stock market companies. The script retrieves data on EPS, 52-week range, and quarterly earnings growth for each company, then ranks the data based on these criteria. The results are saved and emailed.

Technically, the script runs on my personal AWS EC2 cluster. A Lambda service initiates the EC2 instance, and once the script completes its run, it shuts down the instance. The resulting files are archived in an S3 bucket.

Education

2014 - 2018

Master's Degree in Biomedical Engineering

Bogazici University - Istanbul, Turkey

2006 - 2011

Bachelor's Degree in Computer Engineering

Bogazici University - Istanbul, Turkey

Certifications

SEPTEMBER 2023 - PRESENT

Learn LangChain, Pinecone & OpenAI: Build Next-Gen LLM Apps

Udemy

SEPTEMBER 2016 - PRESENT

CS110x: Big Data Analysis with Apache Spark

edX

Skills

Libraries/APIs

LSTM, NumPy, Pandas, PySpark, Beautiful Soup, CatBoost, Keras, Scikit-learn, Claude API, PyTorch, Hugging Face Transformers, TensorFlow, OpenAI API, vLLM

Tools

PyCharm, Jupyter, Qlik Sense, Grafana, Amazon Elastic MapReduce (EMR), Microsoft Power BI, Apache Airflow, Apache NiFi, LaTeX, VictoriaMetrics, Tableau, Elastic, ELK (Elastic Stack), Rasa.ai, Docling, Whisper, Named-entity Recognition (NER), AI Prompts, SQL Prompt

Languages

Python, SQL, Snowflake, Python 3, Cypher

Platforms

Amazon EC2, Amazon Web Services (AWS), Docker, Jupyter Notebook, AWS Lambda, Azure, Ollama, Langfuse

Industry Expertise

Bioinformatics

Frameworks

Selenium

Storage

Amazon S3 (AWS S3), PostgreSQL, Redis, Neo4j

Other

Computer Science, Statistics, Recommendation Systems, Machine Learning, Data Science, Data Analytics, Dashboards, Data Analysis, Data Engineering, Software Development, Biomedical Skills, Big Data, Deep Learning, Data Scraping, Megalodon, OSIsoft PI, Neural Networks, Long Short-term Memory (LSTM), Web Scraping, Stock Exchange, Predictive Modeling, Machine Learning Operations (MLOps), Data Reporting, Time Series Analysis, FastAPI, Geospatial Analytics, Geospatial Data, Recurrent Neural Networks (RNNs), Time Series, Biostatistics, Random Forests, Random Forest Regression, Support Vector Regression, Support Vector Machines (SVM), Adaptive Neuro-fuzzy Inference System (ANFIS), Scraping, Financial Data, Gradient Boosting, Linear Regression, Feature Engineering, Artificial Intelligence (AI), Large Language Models (LLMs), Pinecone, OpenAI GPT-3 API, OpenAI GPT-4 API, LangChain, Chatbots, Natural Language Processing (NLP), Vector Data, Vector Databases, Hugging Face, Mistral AI, Scalable Vector Databases, AI Chatbots, Retrieval-augmented Generation (RAG), Knowledge Graphs, FAISS, MLflow, NeuralProphet, OpenAI, Amazon Bedrock, RAG Systems, RAG Pipelines, Agentic AI, AI Tools, AI Agents, BERT, Custom BERT, APIs, Qwen, Speech-to-Text (STT), Prompt Engineering
