Michael Völske

Machine Learning Developer in Aschaffenburg, Bavaria, Germany

Member since September 9, 2021
Michael has a PhD in computer science, a decade of experience solving complex data problems, and a dozen publications at top-tier international venues like SIGIR, CIKM, and ACL, many of them based on web-scale datasets. He excels at planning, procuring, and installing on-premise data processing infrastructure spanning hundreds of servers, petabytes of disk, and petaflops of compute. Michael has broad knowledge of modern machine learning and teaches the fundamentals to hundreds of students.

Location

Aschaffenburg, Bavaria, Germany

Availability

Part-time

Preferred Environment

Linux, Emacs, Visual Studio Code, Python 3, Spark, Pandas, Kubernetes, Scikit-learn, Git

The most amazing...

...dataset I've analyzed is a billion-item query log from a major search engine that taught me the power of Apache Spark and resulted in a top-ranked publication.

Employment

  • Assistant Lecturer, Online Teaching

    2020 - PRESENT
    Berliner Hochschule für Technik
    • Taught an online course on the fundamentals of computer operating systems as a freelance assistant lecturer, with classes of 20+ students per semester.
    • Designed and administered the practical exercise labs on digital circuits, assembly language, process scheduling, and file systems.
    • Created and administered the mid-term and final exams.
    Technologies: University Teaching
  • Postdoctoral Research Assistant

    2019 - PRESENT
    Bauhaus-Universität Weimar
    • Published 7+ research papers on information retrieval and data mining over three years, at A-level international venues and several minor ones, often leveraging state-of-the-art machine learning and data processing techniques.
    • Designed teaching materials on the fundamentals of machine learning as part of an annual lecture and led an associated programming lab for more than 100 students each year.
    • Led a team to plan, procure, and install a state-of-the-art GPU computing cluster; designed and implemented the systems for authenticating dozens of researchers using this infrastructure across four institutions.
    • Supported and mentored 11 successful students in writing bachelor's and master's theses.
    Technologies: Machine Learning, Technical Writing, University Teaching, LaTeX, Python 3, Git, SaltStack
  • Research Assistant

    2013 - 2019
    Bauhaus-Universität Weimar
    • Contributed to more than a dozen scientific publications on information retrieval, natural language processing, and data mining, half of which were published at A-rated international venues. Implemented the experimental systems in Python and Java.
    • Taught lab classes on the fundamentals of machine learning and held a recurring seminar class on big data processing architectures. Led small student project groups on a variety of hot topics in machine learning and data mining.
    • Planned and carried out the procurement, installation, maintenance, and monitoring of computing infrastructure comprising more than 200 individual servers.
    • Supported and mentored 11 successful students in the preparation of bachelor's and master's theses.
    Technologies: Data Science, Data Engineering, DevOps, Technical Writing, Python 3, Java, Hadoop, Kubernetes, Spark

Experience

  • Lecture and Lab Class "Introduction to Machine Learning"
    https://webis.de/lecturenotes#machine-learning

    The annual lecture on the fundamentals of machine learning is part of the curricula at Bauhaus-Universität Weimar and Universität Leipzig.

    My responsibilities included teaching in person and online, designing lecture materials and lab exercises, and supervising teaching assistants. To support the associated programming labs, I rolled out a Kubernetes-based JupyterHub deployment serving up to a hundred students at a time. I significantly expanded the teaching materials on neural networks and deep learning.
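
    A deployment like this is typically driven by JupyterHub's `jupyterhub_config.py`. The snippet below is a minimal, illustrative sketch assuming the KubeSpawner backend (as in a Zero to JupyterHub setup); the image name and resource limits are placeholders, not the actual course configuration.

    ```python
    # jupyterhub_config.py -- illustrative sketch only; the image name and
    # resource limits are placeholders, not the real course setup.
    # The `c` configuration object is injected by JupyterHub at startup.

    # Spawn each student's notebook server as a Kubernetes pod.
    c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

    # A single-user notebook image with the course libraries preinstalled.
    c.KubeSpawner.image = "example.org/ml-course-notebook:latest"

    # Per-student resource caps, so ~100 concurrent users share the cluster fairly.
    c.KubeSpawner.cpu_limit = 2
    c.KubeSpawner.mem_limit = "2G"

    # Give pods a little longer to start when many students log in at once.
    c.KubeSpawner.start_timeout = 300
    ```

    Capping per-user CPU and memory is what makes a hundred concurrent notebook servers predictable on a shared cluster.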

  • Query Classification and Log Analysis on a Billion-item Query Log

    To better understand how web users ask questions and how search engines can serve their information needs, I analyzed a query log of more than one billion individual entries from a major commercial search engine. Due to the amount of data involved, I leveraged big data technologies, including Apache Spark deployed on a 100-node Hadoop cluster, to implement a comprehensive data cleaning and analysis pipeline. Using user-labeled community question answering data as a training set, I developed query classification models based on different feature sets—such as unigrams and topic models—and evaluated their effectiveness and efficiency. The resulting publication was accepted at CIKM 2015 and served as the basis for varied downstream research on question classification.
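
    As a rough illustration of the setup described above (unigram features feeding a supervised classifier), scaled down to a toy example: the queries, labels, and model choice here are invented for illustration and stand in for the CQA-trained models from the paper.

    ```python
    # Toy sketch of unigram-based query classification; the training
    # queries and labels below are invented for illustration.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_queries = [
        "how to fix a flat bicycle tire",
        "why is the sky blue",
        "cheap flights to berlin",
        "best pizza near me",
    ]
    train_labels = ["question", "question", "transactional", "transactional"]

    # Unigram bag-of-words features plus a linear classifier.
    clf = make_pipeline(CountVectorizer(ngram_range=(1, 1)), LogisticRegression())
    clf.fit(train_queries, train_labels)

    print(clf.predict(["how do magnets work"])[0])
    ```

    In the actual study, topic-model features were evaluated alongside unigrams, trading classification effectiveness against efficiency at billion-query scale.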

  • Research and Implementation for Axiomatic Information Retrieval
    https://webis.de/publications.html?q=author:volske+axiomatic

    Axiomatic IR deals with formal properties that a good relevance scoring function should fulfill.

    This research project investigated strategies to make retrieval axioms directly usable to benefit real-world search engines. A pilot study published at CIKM 2016 showed how retrieval axioms could directly modify result rankings and thus improve search result quality. Follow-up work in ICTIR 2021 showed how retrieval axioms could generate explanations for arbitrary rankings, making complex relevance scoring functions such as those based on deep neural networks more interpretable. The axiomatic re-ranking pipeline I implemented has contributed to several further publications.
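
    The general shape of such a re-ranking pipeline can be sketched as follows. The two toy axioms and the Copeland-style vote aggregation are simplified illustrations, not the actual Webis implementation.

    ```python
    # Illustrative sketch of axiom-based re-ranking (simplified). Each axiom
    # is a pairwise preference function returning +1 if document `a` should
    # rank above `b`, -1 for the reverse, and 0 for no preference; documents
    # are then re-ordered by how many pairwise contests they win.

    def term_frequency_axiom(query, a, b):
        """Toy TFC1-style axiom: prefer the document matching more query terms."""
        def matches(doc):
            return sum(doc.lower().split().count(t) for t in query.lower().split())
        return (matches(a) > matches(b)) - (matches(a) < matches(b))

    def brevity_axiom(query, a, b):
        """Toy LNC1-style axiom: all else being equal, prefer the shorter document."""
        return (len(a.split()) < len(b.split())) - (len(a.split()) > len(b.split()))

    def rerank(query, docs, axioms):
        # Count, for each document, how many pairwise contests it wins
        # when all axioms' votes are summed per pair.
        wins = {doc: 0 for doc in docs}
        for i, a in enumerate(docs):
            for b in docs[i + 1:]:
                vote = sum(axiom(query, a, b) for axiom in axioms)
                if vote > 0:
                    wins[a] += 1
                elif vote < 0:
                    wins[b] += 1
        return sorted(docs, key=lambda doc: -wins[doc])

    docs = [
        "banana bread recipe",
        "apple pie with extra apple filling",
        "apple tart",
    ]
    print(rerank("apple pie", docs, [term_frequency_axiom]))
    ```

    Because each axiom only expresses pairwise preferences, the same machinery can also explain an existing ranking: for any adjacent pair, the axioms that voted for (or against) the observed order serve as the explanation.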

  • Procurement, Installation, and Maintenance of Computing Infrastructure
    https://webis.de/facilities.html#hardware

    I was primarily responsible for the planning, procurement, installation, and maintenance of research computing infrastructure spanning nearly 300 individual servers over more than five years. To provision, maintain, and monitor these systems, my team leveraged infrastructure-as-code technologies such as SaltStack, and developed custom solutions for user authentication and access control on top of GitLab, Kubernetes, and Slurm. Altogether, the infrastructure has enabled hundreds of successful publications at high-ranking venues across our research group.

  • Mining Reddit for Abstractive Summarization Ground Truth Data
    https://webis.de/data/webis-tldr-17.html

    An abstractive summary distills the core content of a source text using novel words and phrases not necessarily found in the source. This yields a shorter summary and more fluent reading experience than a mere selection of relevant phrases from the source could achieve. To learn automatic abstractive summarization, machine learning models require a large number of source-summary pairs as training data.

    I led an effort to mine more than four million human-written source-summary pairs from posts made to the Reddit platform, where users frequently summarize long messages as a courtesy to their readers, prefixing the summary with "TL;DR" ("too long; didn't read") or similar. To handle the scale of the input data (all Reddit posts ever made up to the year 2017), I leveraged technologies like Hadoop and Spark. The resulting Webis-TLDR-17 dataset formed the basis for a shared-task competition on abstractive summarization organized by a mixed team of researchers from industry and academia at INLG 2019. Our dataset was subsequently included in the Hugging Face and TensorFlow datasets libraries and has been cited in more than 40 publications so far.
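
    The core extraction step can be illustrated with a small self-contained sketch. The real pipeline ran on Hadoop/Spark over the full Reddit corpus and applied far more careful filtering; the regular expression below is a hypothetical approximation of the marker matching, not the production pattern.

    ```python
    # Simplified sketch of TL;DR pair extraction: split a post into
    # (content, summary) at an author-written "TL;DR" marker. The regex
    # is an illustrative approximation, not the production pattern.
    import re

    # Matches variants like "TL;DR:", "tldr", "TL DR," in any case.
    TLDR_MARKER = re.compile(r"\btl\s*;?\s*dr[:;,\s]*", re.IGNORECASE)

    def extract_pair(post: str):
        """Return (content, summary) if the post contains a TL;DR marker, else None."""
        match = TLDR_MARKER.search(post)
        if not match:
            return None
        content = post[:match.start()].strip()
        summary = post[match.end():].strip()
        # Keep only pairs where both sides are non-empty.
        if content and summary:
            return content, summary
        return None

    print(extract_pair("Long day at work, missed the bus, got soaked. TL;DR: bad day."))
    ```

    At corpus scale, the same function becomes the map step of a distributed job, followed by filters on length ratio, language, and summary quality.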

Skills

  • Languages

    Python 3, Java, C, Ada, JavaScript, Scala, SQL
  • Paradigms

    Data Science, DevOps
  • Other

    Machine Learning, Information Retrieval, Data Engineering, University Teaching, Technical Writing, IT Infrastructure, GPU Computing, Programming, Data Mining, Web Technologies, Statistics, Linear Algebra, Optimization, Data Analysis, Cloud Architecture, Natural Language Processing (NLP), Data Visualization, Text Mining, Big Data, Regular Expressions
  • Frameworks

    Spark, Hadoop
  • Libraries/APIs

    Pandas, Scikit-learn, PyTorch
  • Tools

    Git, LaTeX, SaltStack, Emacs, Jupyter, GitLab
  • Platforms

    Linux, Kubernetes, Docker, Visual Studio Code, Jupyter Notebook
  • Storage

    JSON, Ceph, PostgreSQL, On-premise

Education

  • PhD in Computer Science
    2013 - 2019
    Bauhaus-Universität Weimar - Weimar, Germany
  • Master's Degree in Computer Science
    2010 - 2013
    Bauhaus-Universität Weimar - Weimar, Germany
