Daniel C Ferreira, Developer in Vienna, Austria

Verified Expert in Engineering

Machine Learning & Natural Language Processing Developer

Vienna, Austria
Toptal Member Since
July 25, 2022

Daniel is a machine learning expert with a background in mathematics and six years of experience in academia and industry. He specializes in applying ML to NLP and cybersecurity problems. Daniel has substantial experience across the full ML lifecycle, gained while working for a leading cybersecurity company and the Technical University of Vienna, among others. He enjoys tackling challenging problems in environments where he can have a strong impact.






Preferred Environment

Linux, TensorFlow, Python, Bash, Pandas, Databricks, Docker, Spark, Generative Pre-trained Transformers (GPT), Natural Language Processing (NLP), Traffic Analysis

The most amazing...

...tool I've developed is a production-ready Machine Learning system that scrapes and categorizes websites based on their content.

Work Experience

Data Scientist

2019 - 2022
Cyan Security
  • Developed a full ML pipeline that takes a URL, fetches the website, extracts its text (in any language) and images, and categorizes the site using state-of-the-art methods (Transformers, LLMs).
  • Built multiple CI/CD pipelines with linting, testing, publishing, and deploying steps.
  • Identified and blocked scams, phishing, and other malicious websites using state-of-the-art ML methods (Transformers, LLMs).
  • Developed a serverless tool for fetching websites at a massive scale.
  • Created a Python library for quickly parsing and extracting text content from HTML.
  • Contributed to go-flows, an open-source network traffic flow exporter written in Go.
  • Defined a unified REST API for sending inputs to, and receiving outputs from, the in-house ML models.
  • Developed a Python tool to facilitate extracting Zeek network features from PCAP files.
  • Created a Python library that identifies near-duplicate websites, enabling fuzzy deduplication of data.
  • Mentored a student developing a tool for detecting DNS tunneling activity.
Technologies: Linux, Python, Zsh, Bash, TensorFlow, TensorBoard, PyTorch, Scikit-learn, Amazon Web Services (AWS), Google Cloud Platform (GCP), Serverless, JavaScript, Node.js, Docker, BentoML, Databricks, Spark, PySpark, SQL, Deep Learning, FastAPI, Pandas, Git, BERT, Generative Pre-trained Transformers (GPT), Natural Language Processing (NLP), IoT Security, DNS, Keras, Transformers, Jira, Confluence, GitLab, GitLab CI/CD, Wireshark, Scrum, Agile, Zeek, Traffic Analysis, Traffic Monitoring, Intrusion Detection Systems (IDS), AutoML, MLflow, CI/CD Pipelines, Artificial Intelligence (AI), Data Pipelines, Google BigQuery, Data Science, Data Analysis, Azure, Language Models, Web Scraping, Data Scraping, Data Engineering, APIs, API Integration, Fine-tuning


2016 - 2019
Technical University of Vienna
  • Launched and managed a public initiative for cataloging and categorizing research papers on network traffic analysis, and developed several supporting tools for it.
  • Researched and prototyped a way to visualize network traffic flows in 2D and aggregate them based on labels.
  • Developed a random data generator in Python designed specifically for clustering research.
  • Developed City-GAN, a tool that uses GANs to generate building façades in the style of a given city and can render the same façade in different styles.
  • Contributed to the DeepArchitect project, a framework for neural network architecture search.
Technologies: Python, Docker, Linux, TensorFlow, PyTorch, Scikit-learn, PIL, Deep Learning, Machine Learning, Generative Adversarial Networks (GANs), Go, Wireshark, Networks, NetFlow, Electron, Git, Traffic Analysis, Traffic Monitoring, Intrusion Detection Systems (IDS), GitHub, Artificial Intelligence (AI), Data Science, Data Analysis, Regression Modeling, APIs, API Integration


2016 - 2016
Priberam Labs
  • Researched, generated, and published one of the first sets of pre-trained multilingual word embeddings.
  • Developed a machine learning model for named-entity recognition in multilingual news articles and media.
  • Collaborated in defining a REST API for an automated media monitoring tool, developed with multiple industry partners and universities for an H2020 project.
  • Helped organize a machine learning summer school with predominantly international attendance and supported its students.
Technologies: Python, Theano, TensorFlow, PyCharm, Generative Pre-trained Transformers (GPT), Natural Language Processing (NLP), Search Engines, Linux, Git, NumPy, Pandas, Jupyter, LaTeX, Applied Mathematics, Statistics, Artificial Intelligence (AI), SpaCy, Data Science, Data Analysis, Language Models, APIs, API Integration


MDCGenPy

MDCGenPy (Multidimensional Dataset for Clustering Generator) is a Python library aimed at researchers who need synthetic datasets, particularly for testing clustering algorithms. I was the main developer of this library, which was ported from a similar library written in MATLAB.
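
As a rough illustration of the kind of synthetic, labeled data that clustering research relies on, the sketch below draws Gaussian blobs around random centers using only the Python standard library. It is not MDCGenPy's API: the function name `make_clusters` and all of its parameters are hypothetical and chosen purely for this example.

```python
import random


def make_clusters(n_clusters, points_per_cluster, n_dims, spread=0.05, seed=0):
    """Return (points, labels) for Gaussian blobs around random centers.

    Hypothetical helper for illustration only; not part of MDCGenPy.
    """
    rng = random.Random(seed)  # seeded so that the dataset is reproducible
    points, labels = [], []
    for label in range(n_clusters):
        # Each cluster gets a random center inside the unit hypercube.
        center = [rng.random() for _ in range(n_dims)]
        for _ in range(points_per_cluster):
            # Sample every coordinate independently around the center.
            points.append([rng.gauss(c, spread) for c in center])
            labels.append(label)
    return points, labels


points, labels = make_clusters(n_clusters=3, points_per_cluster=100, n_dims=4)
```

Because the ground-truth labels are returned alongside the points, the output of a clustering algorithm can be compared directly against them.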

Traffic Flow Mapping

TFM (Traffic Flow Mapping) is a prototype tool for the online visualization of network traffic based on representation learning. It uses semi-supervised autoencoders to map network traffic flows to two dimensions and plots them alongside indicators describing the flows. I was the main author of this research and tool.

Multilingual Embeddings

A research paper on finding useful text embeddings that share a single vector space across multiple languages. Alongside the paper, we made the embeddings publicly available. I was the main author of this research.

NTARC Database

This project is both a curated database of research papers related to network traffic analysis and a suite of tools for managing it and ingesting new data. I was the main author of this project, which involved three to five other people.


City-GAN

A tool and research paper on generating images of building façades in the style of arbitrary cities, using conditional GANs. I was an equal co-author. The project started as coursework for a university visual computing course and grew into a research paper.

Personal Website

A personal website showcasing my professional career and projects. It also includes a blog, where I discuss technical topics. The website is static, built using Hugo, and deployed automatically using GitHub Actions.

Toxic News

A website enabling automatic ranking of online media outlets using machine learning models. Once per day, the headlines from the front page of multiple online media outlets are scraped and sent to machine learning models. The results are displayed on the website.
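
The ranking step can be sketched as follows, under two assumptions that the description above does not confirm: each headline receives a single numeric score, and outlets are ranked by their mean score. The scorer `placeholder_score` is a trivial word-list stand-in for the real machine learning models, and every name in the snippet is hypothetical.

```python
from statistics import mean


def placeholder_score(headline):
    """Stand-in for a real model: fraction of words found in a tiny word list."""
    flagged = {"outrage", "disaster", "shocking"}  # hypothetical word list
    words = headline.lower().split()
    return sum(w.strip(".,!?") in flagged for w in words) / len(words) if words else 0.0


def rank_outlets(headlines_by_outlet, score=placeholder_score):
    """Return (outlet, mean score) pairs sorted from highest to lowest mean."""
    means = {
        outlet: mean(score(h) for h in headlines)
        for outlet, headlines in headlines_by_outlet.items()
        if headlines  # skip outlets with no scraped headlines
    }
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Made-up sample data; a real run would use the scraped front-page headlines.
sample = {
    "outlet-a": ["Shocking disaster downtown", "Council approves budget"],
    "outlet-b": ["Local team wins final", "New library opens"],
}
ranking = rank_outlets(sample)
```

Passing the scorer as a parameter keeps the ranking logic independent of whichever model produces the scores.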


Python, Bash, C, Java, R, Lisp, Go, JavaScript, SQL, HTML


NumPy, TensorFlow, Pandas, Scikit-learn, Keras, Theano, PyTorch, PIL, Node.js, PySpark, SciPy, SpaCy


Git, Jupyter, PyCharm, LaTeX, Wireshark, TensorBoard, GitLab, GitLab CI/CD, Mathematica, Zsh, Jira, Confluence, AutoML, GitHub


Data Science, Scrum, Agile


Linux, Databricks, Docker, Amazon Web Services (AWS), Google Cloud Platform (GCP), Azure


Machine Learning, Natural Language Processing (NLP), Deep Learning, Applied Mathematics, Artificial Intelligence (AI), Data Analysis, Language Models, Web Scraping, Data Scraping, API Integration, Word Embedding, Generative Pre-trained Transformers (GPT), Fine-tuning, Mathematics, Statistics, Generative Adversarial Networks (GANs), Networks, BentoML, BERT, DNS, Transformers, Traffic Analysis, Traffic Monitoring, MLflow, Regression Modeling, Data Engineering, APIs, Search Engines, NetFlow, Serverless, FastAPI, IoT Security, Zeek, Intrusion Detection Systems (IDS), Bokeh, Image Processing, CI/CD Pipelines, Google BigQuery, GitHub Actions


Spark, Electron, Tailwind CSS


Data Pipelines, JSON/XML Schemas

2013 - 2015

Master's Degree in Informatics and Applied Mathematics

Instituto Superior Técnico - Lisbon, Portugal

2010 - 2013

Bachelor's Degree in Informatics and Applied Mathematics

Instituto Superior Técnico - Lisbon, Portugal