
Dauren Baitursyn
Verified Expert in Engineering
Data Transformation Developer
Dubai, United Arab Emirates
Toptal member since November 18, 2021
Dauren is a skilled data scientist and software engineer with vast experience across the banking sector, fast-moving consumer goods (FMCG), and chatbots. He holds a computer science bachelor's degree from the Korea Advanced Institute of Science and Technology (KAIST), one of the top 50 universities by QS ranking. He has strong knowledge of Python, SQL, and Power BI. Dauren has worked with several teams across different business domains and is an excellent communicator.
Portfolio
Experience
- Python - 6 years
- SQL - 5 years
- Data Transformation - 5 years
- ETL - 4 years
- Generative Pre-trained Transformers (GPT) - 4 years
- Natural Language Processing (NLP) - 4 years
- Machine Learning - 4 years
- Deep Learning - 3 years
Availability
Preferred Environment
Visual Studio Code (VS Code), Jupyter, Git, Linux, Jupyter Notebook
The most amazing...
...project I've worked on is a multi-label text categorization model covering 46 topics, built with DeepPavlov and visualized in Power BI.
Work Experience
Senior Data Engineer
Amgreat North America
- Created a robust ETL pipeline that involved extracting data from multiple sources, transforming it into a format suitable for analysis, and loading it into a more organized, accessible system.
- Developed an LSTM sales prediction model in PyTorch, outperforming ARIMA with a custom Mean Quartic Error (MQE) loss to manage outliers (a sketch of such a loss follows this list). Achieved a test loss of 5.8 and a validation loss of 24.5, pending production deployment.
- Took full ownership of the company's sales data and other vital datasets, managing and maintaining these datasets as well as creating the workflows that included automating data extraction, transformation, and loading.
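The custom loss above is not published code; a minimal PyTorch sketch, assuming "mean quartic error" simply means the mean of the fourth power of the residuals, could look like this:

```python
# Hypothetical mean quartic error (MQE) loss: mean of (prediction - target)^4.
import torch
import torch.nn as nn


class MeanQuarticError(nn.Module):
    """Mean of the fourth power of the residuals over the batch."""

    def forward(self, prediction: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return torch.mean((prediction - target) ** 4)


# Placeholder usage with any regression model, e.g., an LSTM forecaster.
criterion = MeanQuarticError()
pred = torch.randn(32, 1, requires_grad=True)  # stand-in predictions
true = torch.randn(32, 1)                      # stand-in targets
loss = criterion(pred, true)
loss.backward()                                # gradients flow as with built-in losses
```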
NLP Developer | Chatbot
Eduworks Corporation
- Led the development and implementation of the chatbot project. The chatbot handled more than ten intents and complex user stories using forms; two forms and up to ten simple dialogue paths were implemented and deployed.
- Cleaned, processed, and ingested raw data from two different sources to an Elasticsearch (ES) database for data retrieval on user request. Maintained the ES database indexes and ensured the retrieval process occurred within the time constraints.
- Implemented data retrieval based on the cosine similarity between the user's input query and the data sources; embedded and vectorized data sources were stored in ES for fast retrieval (see the ranking sketch after this list).
- Increased the accuracy and recall metrics for data retrieval: top 1: 37.2% (+9.2%), top 3: 61.2% (+30.5%), top 5: 63.6% (+27.5%), top 10: 69.0% (+26.4%), and recall for out-of-scope (OOS) queries: 79.3% (+65.4%).
- Incorporated synonym handling for user queries using the spaCy package.
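The retrieval itself ran in Elasticsearch, but the core ranking idea, cosine similarity between an embedded query and embedded documents, can be illustrated locally. The embedding dimensions and data below are placeholders, not the project's actual setup:

```python
# Illustrative cosine-similarity ranking, assuming documents and the query
# are already embedded as fixed-size vectors (e.g., by a sentence encoder).
import numpy as np


def cosine_top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> list[int]:
    """Return indices of the k documents most similar to the query."""
    query_norm = query_vec / np.linalg.norm(query_vec)
    doc_norms = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = doc_norms @ query_norm  # cosine similarity per document
    return np.argsort(scores)[::-1][:k].tolist()


# Placeholder data: 100 documents and one query in a 384-dimensional space.
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 384))
query = rng.normal(size=384)
print(cosine_top_k(query, docs, k=3))
```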
Consumer Insights Data Scientist
Philip Morris International
- Built a multi-label text categorization model (46 topics) to replace a model based on heuristic hard rules, improving the F1 score from 0.11 to 0.81 on production data (a minimal sketch of the multi-label setup follows this list).
- Migrated static reports to dynamic dashboards using Power BI. Built, maintained, and improved dashboards with complex data models with more than 30 entities.
- Built complex ETL pipelines in Python to extract and transform data from different sources into star-schema data models in Power BI.
- Communicated findings and insights to stakeholders in data-story presentations.
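The production model used DeepPavlov; as a hedged illustration of the multi-label setup itself (one binary decision per topic), a scikit-learn baseline with TF-IDF features might look like the following. The topic names and example texts are invented:

```python
# Minimal multi-label text categorization baseline: each text may carry
# several topic labels at once (one-vs-rest setup).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "the pack design feels premium",
    "delivery was late and support did not answer",
    "great taste but the price went up",
]
labels = [["design"], ["delivery", "support"], ["taste", "price"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)  # binary indicator matrix, one column per topic

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(texts, y)

pred = model.predict(["support never replied about the late delivery"])
print(mlb.inverse_transform(pred))  # predicted topic tuples; may be empty on toy data
```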
Software Engineer
One Technologies
- Migrated the chatbot service from Dialogflow to the Rasa open-source chatbot platform.
- Achieved a baseline NLU model performance on production data: F1 score of 0.732.
- Implemented a CRUD microservice for a chatbot platform.
- Created pipelines for data cleaning, data transformation, and data check for the chatbot.
- Automated tests for chatbot data consistency and integrity (a pytest sketch follows this list).
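The actual checks are project-specific; a hedged pytest sketch, assuming the NLU data has been loaded into a simple intent-to-examples mapping, could look like this:

```python
# Hypothetical pytest checks for chatbot training-data consistency.
# NLU_DATA is a placeholder for data loaded from the real training files.
NLU_DATA = {
    "greet": ["hi", "hello", "good morning"],
    "goodbye": ["bye", "see you later"],
}

MIN_EXAMPLES_PER_INTENT = 2


def test_every_intent_has_enough_examples():
    for intent, examples in NLU_DATA.items():
        assert len(examples) >= MIN_EXAMPLES_PER_INTENT, f"{intent} is under-trained"


def test_no_example_is_shared_between_intents():
    seen = {}
    for intent, examples in NLU_DATA.items():
        for example in examples:
            assert example not in seen, (
                f"'{example}' appears in both {seen.get(example)} and {intent}"
            )
            seen[example] = intent
```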
Data Scientist
PrimeSource
- Built and deployed statistical models in SPSS Modeler, such as loan pre-approval and top-up models for targeted consumer campaigns.
- Designed and implemented data views and tables for campaign data panels using SQL and built and maintained ETL processes for the temporary tables needed for statistical models.
- Performed the analysis for a diverse set of ad hoc requests from internal stakeholders, measured the effectiveness of rolled-out campaigns, and communicated the results with data-story presentations.
- Led a team of two data analysts, conducted daily stand-ups for check-ins and progress on tasks, and communicated the projects' status to the principal data scientist.
Assistant Researcher
Graduate School of Knowledge Service Engineering | KAIST
- Implemented an information retrieval framework for clinical decision support in cancer diagnosis and treatment.
- Submitted a paper to the TREC 2017 Precision Medicine Track, but it was not accepted.
- Assisted with the research in the field of precision medicine.
Experience
NewsAgg
https://github.com/biddy1618/newsProject
The project includes a search engine that retrieves articles based on the cosine similarity of the term frequency-inverse document frequency (TF-IDF) representations of queries and articles; a minimal sketch of this idea follows below.
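This is not the repository's actual code, only an illustration of TF-IDF retrieval with scikit-learn; the sample articles and query are invented:

```python
# TF-IDF search sketch: rank articles by cosine similarity to a query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

articles = [
    "Central bank raises interest rates again",
    "Local team wins the championship final",
    "New smartphone model announced this week",
]

vectorizer = TfidfVectorizer(stop_words="english")
article_matrix = vectorizer.fit_transform(articles)

query_vec = vectorizer.transform(["interest rate decision"])
# linear_kernel on L2-normalized TF-IDF rows equals cosine similarity.
scores = linear_kernel(query_vec, article_matrix).ravel()
best = scores.argsort()[::-1]
print([articles[i] for i in best[:2]])  # most similar articles first
```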
Kaggle Alice Competition
https://github.com/biddy1618/alicekagglecompetition
This project tries to identify a user on the internet by tracking their sequence of visited web pages. The algorithm takes a webpage session (a series of web pages visited consecutively by the same person) and predicts whether it belongs to Alice or somebody else.
As of 23.12.18, my leaderboard (LB) standing is 87th out of roughly 2,000 participants.
For this competition, the data was time-series session information regarding user browser history. We had to distinguish between a regular user and an intruder user based on sites visited and time of visit information.
I performed the exploratory data analysis (EDA), data wrangling, and feature engineering to achieve a ROC-AUC score of 0.95856.
IMPLEMENTED FEATURES
• Dummy-encoded hour of the session start time (referred to below as the start time).
• Sine and cosine transformations of the start time (see the cyclical-encoding sketch after this list).
• Active start hours of the intruder.
• Dummy weekday feature of the start time.
• Active weekdays of the intruder.
• Dummy month feature of the start time.
• Sin and cos transformation of the year and day feature.
• Session length.
• Standard deviation of time spent on sites within a session.
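The sine/cosine transforms encode cyclical time so that, for example, hour 23 sits next to hour 0. A small pandas sketch of the idea (not the competition code; timestamps are placeholders):

```python
# Cyclical encoding of the session start hour with sine and cosine.
import numpy as np
import pandas as pd

sessions = pd.DataFrame({"start_time": pd.to_datetime([
    "2018-12-23 23:45:00",
    "2018-12-24 00:10:00",
    "2018-12-24 13:30:00",
])})

hour = sessions["start_time"].dt.hour
sessions["hour_sin"] = np.sin(2 * np.pi * hour / 24)
sessions["hour_cos"] = np.cos(2 * np.pi * hour / 24)
print(sessions[["hour_sin", "hour_cos"]])  # 23:45 and 00:10 end up close together
```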
Prediction of Churn | Telco Customer Churn Sample
https://github.com/biddy1618/churn-rate
We trained two models: logistic regression and random forest. Logistic regression is a classical classification method that offers good interpretability and reasonably stable behavior, while random forest is robust and works well with small datasets thanks to its bagging (bootstrap aggregation) sampling. A sketch of this comparison follows below.
Final ROC-AUC scores on test data are as follows:
• For logistic regression - 0.652
• For random forest - 0.989
The top five features for logistic regression are state of residence, age, highest education acquired, use of internet services, and number of complaints.
Random forest showed much better performance measures than logistic regression, and its major features make much more sense.
From a business perspective, age, unpaid balance, annual income (and other major features) seem to be valid features for churn rate.
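As a hedged illustration of the comparison described above (not the repository's code), both models can be trained on the same split and scored with ROC-AUC; the data here is synthetic:

```python
# Train logistic regression and random forest on the same (synthetic) churn data
# and compare ROC-AUC on a held-out test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn dataset (imbalanced, as churn usually is).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(n_estimators=200, random_state=42)),
]:
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    print(f"{name}: ROC-AUC = {roc_auc_score(y_test, proba):.3f}")
```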
Complete ML Project - From Getting Raw Data to Deployment
https://github.com/biddy1618/udacity-mldevops-3-project-mlmodel-fastapi-heroku
This project showcases a CI/CD pipeline implementation using GitHub Actions and Heroku deployment.
CI includes pytest tests and PEP8 compliance checks; DVC encapsulates the different training components into stages (a sketch of the kind of test run in CI follows below).
This project is well-documented and has a model card description.
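The CI details live in the repository's GitHub Actions workflow; as an illustration of the kind of check it runs, here is a minimal FastAPI inference endpoint with a pytest test. The endpoint name, payload, and labels are assumptions, not the project's actual API:

```python
# Minimal FastAPI inference endpoint plus a pytest test, the kind of check
# a CI pipeline (e.g., GitHub Actions) would run before deployment.
from fastapi import FastAPI
from fastapi.testclient import TestClient
from pydantic import BaseModel

app = FastAPI()


class Record(BaseModel):
    age: int
    education: str


@app.post("/predict")
def predict(record: Record) -> dict:
    # Placeholder for the real model inference.
    return {"label": ">50K" if record.age > 40 else "<=50K"}


client = TestClient(app)


def test_predict_returns_a_label():
    response = client.post("/predict", json={"age": 52, "education": "Masters"})
    assert response.status_code == 200
    assert response.json()["label"] in {">50K", "<=50K"}
```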
Education
Bachelor's Degree in Computer Science
Korea Advanced Institute of Science and Technology (KAIST) - Daejeon, South Korea
Exchange Program Specialized in Computer Science
Innopolis - Kazan, Republic of Tatarstan, Russia
Exchange Program Specialized in Computer Engineering
Middle East Technical University - Ankara, Turkey
Certifications
Machine Learning DevOps Engineer Nanodegree
Udacity
Natural Language Processing
Coursera
Introduction to Deep Learning
Coursera
Mathematics for Machine Learning Specialization | PCA
Coursera
IBM Certified Specialist | SPSS Modeler Professional V3
IBM
Machine Learning Specialization
Coursera
Machine Learning
Coursera
Skills
Libraries/APIs
Pandas, PyTorch, Scikit-learn, NumPy, SciPy, Rasa NLU, SQLAlchemy, TensorFlow, REST APIs, SpaCy, LSTM
Tools
Jupyter, Microsoft Power BI, SPSS Modeler, Rasa.ai, IBM SPSS, Git, Pytest, GitLab CI/CD, Docker Compose, GitLab, Logging, Plotly, Tableau
Languages
Python, SQL, Java, JavaScript, Scala
Paradigms
Object-oriented Programming (OOP), ETL, Functional Programming, DevOps, Clean Code, Testing, Human-computer Interaction (HCI), CRUD
Platforms
Visual Studio Code (VS Code), Jupyter Notebook, Docker, Linux, Amazon EC2, Heroku, Amazon Web Services (AWS), Apache Kafka, Databricks
Frameworks
Flask, DeepPavlov, GraphLab, Spark, OAuth 2
Storage
PostgreSQL, Elasticsearch, Data Pipelines, Databases, Relational Databases, MySQL
Other
Algorithms, Data Structures, Machine Learning, Data Transformation, Data Science, Predictive Modeling, Data Analysis, Natural Language Processing (NLP), Deep Learning, Deployment, Data Visualization, Generative Pre-trained Transformers (GPT), BERT, MLflow, Data Versioning, Hugging Face, FastAPI, CI/CD Pipelines, GitHub Actions, Web Crawlers, Web Scraping, DevOps Engineer, Machine Learning Automation, Code Versioning, GitOps, Experiment Tracking, APIs, Flake8, Probability Theory, Linear Algebra, Operating Systems, System Programming, Discrete Mathematics, Linear Optimization, OOP Designs, Computer Science, Information Retrieval, Live Chat, Chatbots, Search Engines, Containers, User Intent Scoring, Conda, Data, Data Cleaning, Dashboards, Data Modeling, Reporting, Statistical Methods, Data Extraction, Authorization, Data Migration, Scripting, Ad Campaigns, Customer Segmentation, Market Segmentation, Time Series, Time Series Analysis, Machine Learning Operations (MLOps), Data Mining