Daniel C Ferreira, Developer in Vienna, Austria

Daniel C Ferreira

Verified Expert in Engineering

Machine Learning & Natural Language Processing Developer

Location
Vienna, Austria
Toptal Member Since
July 25, 2022

Daniel is a machine learning expert with a background in mathematics and six years of experience in academia and industry. He specializes in applying ML to NLP and cybersecurity problems. Daniel has substantial experience across the full ML lifecycle, gained at a leading cybersecurity company and the Technical University of Vienna, among others. He enjoys tackling challenging problems in environments where he can have a strong impact.

Portfolio

Cyan Security
Linux, Python, Zsh, Bash, TensorFlow, TensorBoard, PyTorch, Scikit-learn...
Technical University of Vienna
Python, Docker, Linux, TensorFlow, PyTorch, Scikit-learn, PIL, Deep Learning...
Priberam Labs
Python, Theano, TensorFlow, PyCharm, Generative Pre-trained Transformers (GPT)...

Availability

Part-time

Preferred Environment

Linux, TensorFlow, Python, Bash, Pandas, Databricks, Docker, Spark, Generative Pre-trained Transformers (GPT), Natural Language Processing (NLP), Traffic Analysis

The most amazing tool I've developed is a production-ready machine learning system that scrapes and categorizes websites based on their content.

Work Experience

Data Scientist

2019 - 2022
Cyan Security
  • Developed a full ML pipeline that takes a URL, fetches the website, extracts its text (in any language) and images, and categorizes the site using state-of-the-art methods (Transformers, LLMs).
  • Built multiple CI/CD pipelines with linting, testing, publishing, and deploying steps.
  • Identified and blocked scams, phishing, and other malicious websites using state-of-the-art ML methods (Transformers, LLMs).
  • Developed a serverless tool for fetching websites at a massive scale.
  • Created a Python library for quickly parsing and extracting text content from HTML.
  • Contributed to go-flows, an open-source network traffic flow exporter written in Go.
  • Defined a unified REST API for delivering input/output to/from the in-house ML models.
  • Developed a Python tool to facilitate extracting Zeek network features from PCAP files.
  • Created a Python library to identify nearly identical websites to "fuzzily" deduplicate data.
  • Mentored a student developing a tool for detecting DNS tunneling activity.
Technologies: Linux, Python, Zsh, Bash, TensorFlow, TensorBoard, PyTorch, Scikit-learn, Amazon Web Services (AWS), Google Cloud Platform (GCP), Serverless, JavaScript, Node.js, Docker, BentoML, Databricks, Spark, PySpark, SQL, Deep Learning, FastAPI, Pandas, Git, BERT, Generative Pre-trained Transformers (GPT), Natural Language Processing (NLP), IoT Security, DNS, Keras, Transformers, Jira, Confluence, GitLab, GitLab CI/CD, Wireshark, Scrum, Agile, Zeek, Traffic Analysis, Traffic Monitoring, Intrusion Detection Systems (IDS), AutoML, MLflow, CI/CD Pipelines, Artificial Intelligence (AI), Data Pipelines, Google BigQuery, Data Science, Data Analysis, Azure, Language Models, Web Scraping, Data Scraping, Data Engineering, APIs, API Integration, Fine-tuning
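
The "fuzzy" deduplication of nearly identical websites mentioned in the list above can be illustrated with word shingles and Jaccard similarity. This is only a minimal sketch of one common technique, not the in-house library: the function names, the shingle size, and the 0.8 threshold are all assumptions made for illustration.

```python
import re


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles of a text (lowercased, punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Texts shorter than k words get a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(pages: dict[str, str], threshold: float = 0.8) -> list[str]:
    """Keep the first URL of every group of near-identical pages.

    `threshold` is a hypothetical cut-off chosen for this example.
    """
    kept: list[tuple[str, set]] = []
    for url, text in pages.items():
        signature = shingles(text)
        # Keep the page only if it is not too similar to any page already kept.
        if all(jaccard(signature, other) < threshold for _, other in kept):
            kept.append((url, signature))
    return [url for url, _ in kept]
```

The pairwise comparison is quadratic in the number of pages, so a sketch like this only suits small collections; at scale, approximate methods such as MinHash with locality-sensitive hashing are the usual choice.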

Researcher

2016 - 2019
Technical University of Vienna
  • Launched and managed a public initiative for cataloging and categorizing research papers on network traffic, including the development of several supporting tools.
  • Researched and prototyped a way to visualize network traffic flows in 2D and aggregate them based on labels.
  • Developed a random data generator in Python, explicitly made for clustering research problems.
  • Developed City-GAN, a tool that uses GANs to generate building façades conditioned on a city's style, so the same façade can be rendered in different styles.
  • Contributed to the DeepArchitect project, a framework for neural network architecture search.
Technologies: Python, Docker, Linux, TensorFlow, PyTorch, Scikit-learn, PIL, Deep Learning, Machine Learning, Generative Adversarial Networks (GANs), Go, Wireshark, Networks, NetFlow, Electron, Git, Traffic Analysis, Traffic Monitoring, Intrusion Detection Systems (IDS), GitHub, Artificial Intelligence (AI), Data Science, Data Analysis, Regression Modeling, APIs, API Integration

Researcher

2016 - 2016
Priberam Labs
  • Researched, generated, and published one of the first sets of pre-trained multilingual word embeddings.
  • Developed a machine learning model for named-entity recognition in multilingual news articles and media.
  • Collaborated in defining a REST API for an automated media monitoring tool, developed with multiple industry partners and universities for an H2020 project.
  • Helped organize a machine learning summer school with predominantly international attendance and assisted its students.
Technologies: Python, Theano, TensorFlow, PyCharm, Natural Language Processing (NLP), Generative Pre-trained Transformers (GPT), Search Engines, Linux, Git, NumPy, Pandas, Jupyter, LaTeX, Applied Mathematics, Statistics, Artificial Intelligence (AI), SpaCy, Data Science, Data Analysis, Language Models, APIs, API Integration

MDCGenPy

https://github.com/CN-TU/mdcgenpy
MDCGenPy is a Multidimensional Dataset for Clustering Generator. The tool is aimed at researchers looking for synthetic datasets, particularly for testing clustering algorithms. I was the main developer of this library. It was ported from a similar library written in MATLAB.
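
To give a rough idea of what a synthetic dataset for clustering looks like, here is a minimal sketch using only the Python standard library. It does not use or reproduce the MDCGenPy API: `generate_clusters` and all of its parameters are hypothetical names chosen for this example, and the Gaussian blobs it produces are far simpler than what a dedicated generator offers.

```python
import random


def generate_clusters(n_samples, n_features, n_clusters, spread=0.05, seed=0):
    """Generate labeled points scattered around random cluster centers.

    Centers are drawn uniformly from the unit hypercube; each point is its
    center plus independent Gaussian noise with standard deviation `spread`.
    """
    rng = random.Random(seed)  # seeded for reproducible datasets
    centers = [
        [rng.random() for _ in range(n_features)] for _ in range(n_clusters)
    ]
    points, labels = [], []
    for i in range(n_samples):
        label = i % n_clusters  # assign samples to clusters round-robin
        points.append([rng.gauss(c, spread) for c in centers[label]])
        labels.append(label)
    return points, labels


points, labels = generate_clusters(n_samples=300, n_features=4, n_clusters=3)
```

Because the true labels are returned alongside the points, the output of a clustering algorithm can be compared directly against the ground truth.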

Traffic Flow Mapping

https://github.com/dcferreira/network_analysis_feature_reduction
TFM (Traffic Flow Mapping) is a prototype tool for the online visualization of traffic based on representation learning. It uses semi-supervised Autoencoders to obtain two-dimensional representations of network traffic flows and plot them, together with some indicators about the flows. I was the main author of this research and tool.

Multilingual Embeddings

https://github.com/dcferreira/multilingual-joint-embeddings/
A research paper on learning text embeddings that share a single vector space across multiple languages. Alongside the paper, we also made the embeddings publicly available. I was the main author of this research.

NTARC Database

https://www.cn.tuwien.ac.at/network-traffic/ntadatabase/
This project is both a curated database of research papers related to network traffic analysis and a suite of tools to assist in managing it and ingesting new data. I was the main author of this project, which involved another 3-5 people.

City-GAN

https://github.com/muxamilian/city-gan
A tool and research paper for generating images of building façades in the style of arbitrary cities, using conditional GANs. I was an equal co-author. The project started as coursework for a visual computing university course and ended up as a research paper.

Personal Website

https://dcferreira.com
A personal website showcasing my professional career and projects. It also includes a blog, where I discuss technical topics. The website is static, built using Hugo, and deployed automatically using GitHub Actions.

Toxic News

https://toxicnews.dcferreira.com/
A website enabling automatic ranking of online media outlets using machine learning models. Once per day, the headlines from the front page of multiple online media outlets are scraped and sent to machine learning models. The results are displayed on the website.
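
The ranking step described above can be sketched as follows. This is an illustrative assumption about the aggregation only: the site's actual models and scoring are not described here, so `rank_outlets` takes any scoring function, and `toy_score` is a hypothetical keyword-based stand-in for a real machine learning model.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Iterable


def rank_outlets(
    headlines: Iterable[tuple[str, str]],
    score: Callable[[str], float],
) -> list[tuple[str, float]]:
    """Rank outlets by the mean score of their headlines, highest first.

    `headlines` yields (outlet, headline) pairs, e.g. one day's scrape.
    """
    per_outlet: dict[str, list[float]] = defaultdict(list)
    for outlet, headline in headlines:
        per_outlet[outlet].append(score(headline))
    ranking = [(outlet, mean(scores)) for outlet, scores in per_outlet.items()]
    return sorted(ranking, key=lambda item: item[1], reverse=True)


def toy_score(headline: str) -> float:
    """Hypothetical stand-in for a model: fraction of words on a flagged list."""
    flagged = {"idiot", "disaster"}
    words = headline.lower().split()
    return sum(word in flagged for word in words) / len(words) if words else 0.0


scraped = [
    ("Outlet A", "Council approves new park"),
    ("Outlet A", "Minister called an idiot"),
    ("Outlet B", "Local team wins cup"),
]
ranking = rank_outlets(scraped, toy_score)
```

Passing the scorer in as a function keeps the aggregation independent of whichever model produces the scores.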

Education

2013 - 2015

Master's Degree in Informatics and Applied Mathematics

Instituto Superior Técnico - Lisbon, Portugal

2010 - 2013

Bachelor's Degree in Informatics and Applied Mathematics

Instituto Superior Técnico - Lisbon, Portugal

Libraries/APIs

NumPy, TensorFlow, Pandas, Scikit-learn, Keras, Theano, PyTorch, PIL, Node.js, PySpark, SciPy, SpaCy

Tools

Git, Jupyter, PyCharm, LaTeX, Wireshark, TensorBoard, GitLab, GitLab CI/CD, Mathematica, Zsh, Jira, Confluence, AutoML, GitHub

Languages

Python, Bash, C, Java, R, Lisp, Go, JavaScript, SQL, HTML

Paradigms

Data Science, Scrum, Agile

Platforms

Linux, Databricks, Docker, Amazon Web Services (AWS), Google Cloud Platform (GCP), Zeek, Azure

Storage

Data Pipelines, JSON/XML Schemas

Frameworks

Spark, Electron, Tailwind CSS

Other

Machine Learning, Natural Language Processing (NLP), Deep Learning, Applied Mathematics, Artificial Intelligence (AI), Data Analysis, Language Models, Web Scraping, Data Scraping, API Integration, Word Embedding, Generative Pre-trained Transformers (GPT), Fine-tuning, Mathematics, Statistics, Generative Adversarial Networks (GANs), Networks, BentoML, BERT, DNS, Transformers, Traffic Analysis, Traffic Monitoring, MLflow, Regression Modeling, Data Engineering, APIs, Search Engines, NetFlow, Serverless, FastAPI, IoT Security, Intrusion Detection Systems (IDS), Bokeh, Image Processing, CI/CD Pipelines, Google BigQuery, GitHub Actions
