Note: By popular request, I've added some updates at the end of the article demonstrating alternative techniques, including async/await, which has only been available since the advent of Python 3.5. Enjoy!

Discussions criticizing Python often talk about how it is difficult to use Python for multithreaded work, pointing fingers at what is known as the global interpreter lock (affectionately referred to as the GIL) that prevents multiple threads of Python code from running simultaneously. Because of this, Python's threading module doesn't quite behave the way you would expect it to if you're coming from another language such as C++ or Java. It must be made clear that one can still write code in Python that runs concurrently or in parallel and makes a stark difference in resulting performance, as long as certain things are taken into consideration. If you haven't read it yet, I suggest you take a look at Eqbal Quran's article on concurrency and parallelism in Ruby here on the Toptal Engineering Blog.

In this Python concurrency tutorial, we will write a small Python script to download the top popular images from Imgur. We will start with a version that downloads images sequentially, or one at a time. As a prerequisite, you will have to register an application on Imgur. If you do not have an Imgur account already, please create one first.

The scripts in these threading examples have been tested with Python 3.6.4. With some changes, they should also run with Python 2—urllib is what has changed the most between these two versions of Python.

Getting Started with Python Multithreading

Let us start by creating a Python module, named download.py. This file will contain all the functions necessary to fetch the list of images and download them. We will split these functionalities into three separate functions:

  • get_links
  • download_link
  • setup_download_dir

The third function, setup_download_dir, will be used to create a download destination directory if it doesn’t already exist.

Imgur's API requires HTTP requests to bear the Authorization header with the client ID. You can find this client ID on the dashboard of the application you registered on Imgur. The API's response will be JSON encoded, so we can use Python's standard json library to decode it. Downloading the image is an even simpler task, as all you have to do is fetch the image by its URL and write it to a file.

This is what the script looks like:

import json
import logging
import os
from pathlib import Path
from urllib.request import urlopen, Request

logger = logging.getLogger(__name__)

types = {'image/jpeg', 'image/png'}


def get_links(client_id):
    headers = {'Authorization': 'Client-ID {}'.format(client_id)}
    req = Request('https://api.imgur.com/3/gallery/random/random/', headers=headers, method='GET')
    with urlopen(req) as resp:
        data = json.loads(resp.read().decode('utf-8'))
    return [item['link'] for item in data['data'] if 'type' in item and item['type'] in types]


def download_link(directory, link):
    download_path = directory / os.path.basename(link)
    with urlopen(link) as image, download_path.open('wb') as f:
        f.write(image.read())
    logger.info('Downloaded %s', link)


def setup_download_dir():
    download_dir = Path('images')
    if not download_dir.exists():
        download_dir.mkdir()
    return download_dir

Next, we will need to write a module that will use these functions to download the images, one by one. We will name this single.py. This will contain the main function of our first, naive version of the Imgur image downloader. The module will retrieve the Imgur client ID from the environment variable IMGUR_CLIENT_ID. It will invoke setup_download_dir to create the download destination directory. Finally, it will fetch a list of images using the get_links function, filter out all GIF and album URLs, and then use download_link to download and save each of those images to the disk. Here is what single.py looks like:

import logging
import os
from time import time

from download import setup_download_dir, get_links, download_link

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

def main():
    ts = time()
    client_id = os.getenv('IMGUR_CLIENT_ID')
    if not client_id:
        raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
    download_dir = setup_download_dir()
    links = get_links(client_id)
    for link in links:
        download_link(download_dir, link)
    logging.info('Took %s seconds', time() - ts)

if __name__ == '__main__':
    main()

On my laptop, this script took 19.4 seconds to download 91 images. Please do note that these numbers may vary based on the network you are on. 19.4 seconds isn’t terribly long, but what if we wanted to download more pictures? Perhaps 900 images, instead of 90. With an average of 0.2 seconds per picture, 900 images would take approximately 3 minutes. For 9000 pictures it would take 30 minutes. The good news is that by introducing concurrency or parallelism, we can speed this up dramatically.

All subsequent code examples will only show import statements that are new and specific to those examples. For convenience, all of these Python scripts can be found in this GitHub repository.

Concurrency and Parallelism in Python: Threading Example

Threading is one of the most well-known approaches to attaining Python concurrency and parallelism. Threading is a feature usually provided by the operating system. Threads are lighter than processes, and share the same memory space.

Python multithreading memory model
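
As a quick illustration of that shared memory space, here is a minimal sketch, separate from the downloader we are building, in which two threads append to the same list; the names and values are arbitrary:

from threading import Lock, Thread

results = []           # one list, shared by every thread in the process
results_lock = Lock()  # guards the shared list against concurrent appends


def worker(value):
    # Both threads mutate the very same list object; no copying or
    # inter-process communication is involved.
    with results_lock:
        results.append(value)


threads = [Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # both values are visible: the threads shared one memory space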

In this Python threading example, we will write a new module to replace single.py. This module will create a pool of eight threads, making a total of nine threads including the main thread. I chose eight worker threads because my computer has eight CPU cores and one worker thread per core seemed a good number for how many threads to run at once. In practice, this number is chosen much more carefully based on other factors, such as other applications and services running on the same machine.

This is almost the same as the previous one, with the exception that we now have a new class, DownloadWorker, which is a descendant of the Python Thread class. The run method has been overridden to run an infinite loop. On every iteration, it calls self.queue.get() to try to fetch a URL from a thread-safe queue. It blocks until there is an item in the queue for the worker to process. Once the worker receives an item from the queue, it then calls the same download_link method that was used in the previous script to download the image to the images directory. After the download is finished, the worker signals the queue that the task is done. This is very important, because the Queue keeps track of how many tasks were enqueued. The call to queue.join() would block the main thread forever if the workers did not signal that they completed a task.

import logging
import os
from queue import Queue
from threading import Thread
from time import time

from download import setup_download_dir, get_links, download_link


logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

logger = logging.getLogger(__name__)


class DownloadWorker(Thread):

    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # Get the work from the queue and expand the tuple
            directory, link = self.queue.get()
            try:
                download_link(directory, link)
            finally:
                self.queue.task_done()


def main():
    ts = time()
    client_id = os.getenv('IMGUR_CLIENT_ID')
    if not client_id:
        raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
    download_dir = setup_download_dir()
    links = get_links(client_id)
    # Create a queue to communicate with the worker threads
    queue = Queue()
    # Create 8 worker threads
    for x in range(8):
        worker = DownloadWorker(queue)
        # Setting daemon to True will let the main thread exit even though the workers are blocking
        worker.daemon = True
        worker.start()
    # Put the tasks into the queue as a tuple
    for link in links:
        logger.info('Queueing {}'.format(link))
        queue.put((download_dir, link))
    # Causes the main thread to wait for the queue to finish processing all the tasks
    queue.join()
    logging.info('Took %s', time() - ts)

if __name__ == '__main__':
    main()

Running this Python threading example script on the same machine used earlier results in a download time of 4.1 seconds! That’s 4.7 times faster than the previous example. While this is much faster, it is worth mentioning that only one thread was executing at a time throughout this process due to the GIL. Therefore, this code is concurrent but not parallel. The reason it is still faster is because this is an IO bound task. The processor is hardly breaking a sweat while downloading these images, and the majority of the time is spent waiting for the network. This is why Python multithreading can provide a large speed increase. The processor can switch between the threads whenever one of them is ready to do some work. Using the threading module in Python or any other interpreted language with a GIL can actually result in reduced performance. If your code is performing a CPU bound task, such as decompressing gzip files, using the threading module will result in a slower execution time. For CPU bound tasks and truly parallel execution, we can use the multiprocessing module.
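
To see that effect for yourself, here is a minimal sketch, not part of the downloader, that times a purely CPU-bound Python loop run sequentially and then across a handful of threads; under the GIL, the threaded version is typically no faster, and often a bit slower:

from threading import Thread
from time import time


def busy_work(n):
    # Pure-Python arithmetic never releases the GIL for long, so threads
    # cannot execute it in parallel.
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == '__main__':
    jobs = [2000000] * 8

    ts = time()
    for n in jobs:
        busy_work(n)
    print('Sequential took', time() - ts)

    ts = time()
    threads = [Thread(target=busy_work, args=(n,)) for n in jobs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print('Threaded took', time() - ts)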

While the de facto reference Python implementation, CPython, has a GIL, this is not true of all Python implementations. For example, IronPython, a Python implementation using the .NET framework, does not have a GIL, and neither does Jython, the Java-based implementation. You can find a list of working Python implementations here.

Concurrency and Parallelism in Python Example 2: Spawning Multiple Processes

The multiprocessing module is easier to drop in than the threading module, as we don't need to add a class as we did in the Python threading example. The only changes we need to make are in the main function.

Python multiprocessing tutorial: Modules

To use multiple processes, we create a multiprocessing Pool. With the map method it provides, we will pass the list of URLs to the pool, which in turn will spawn new worker processes (four, in the example below) and use them to download the images in parallel. This is true parallelism, but it comes with a cost. The entire memory of the script is copied into each subprocess that is spawned. In this simple example, it isn't a big deal, but it can easily become serious overhead for non-trivial programs.

import logging
import os
from functools import partial
from multiprocessing.pool import Pool
from time import time

from download import setup_download_dir, get_links, download_link


logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logging.getLogger('requests').setLevel(logging.CRITICAL)
logger = logging.getLogger(__name__)


def main():
    ts = time()
    client_id = os.getenv('IMGUR_CLIENT_ID')
    if not client_id:
        raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
    download_dir = setup_download_dir()
    links = get_links(client_id)
    download = partial(download_link, download_dir)
    with Pool(4) as p:
        p.map(download, links)
    logging.info('Took %s seconds', time() - ts)


if __name__ == '__main__':
    main()

Concurrency and Parallelism in Python Example 3: Distributing to Multiple Workers

While the threading and multiprocessing modules are great for scripts that are running on your personal computer, what should you do if you want the work to be done on a different machine, or you need to scale up to more than the CPU on one machine can handle? A great use case for this is long-running back-end tasks for web applications. If you have some long-running tasks, you don’t want to spin up a bunch of sub-processes or threads on the same machine that need to be running the rest of your application code. This will degrade the performance of your application for all of your users. What would be great is to be able to run these jobs on another machine, or many other machines.

A great Python library for this task is RQ, a very simple yet powerful library. You first enqueue a function and its arguments using the library. This pickles the function call representation, which is then appended to a Redis list. Enqueueing the job is the first step, but will not do anything yet. We also need at least one worker to listen on that job queue.

Model of the RQ Python queue library

The first step is to install and run a Redis server on your computer, or have access to a running Redis server. After that, there are only a few small changes made to the existing code. We first create an instance of an RQ Queue and pass it an instance of a Redis server from the redis-py library. Then, instead of just calling our download_link method, we call q.enqueue(download_link, download_dir, link). The enqueue method takes a function as its first argument, then any other arguments or keyword arguments are passed along to that function when the job is actually executed.

One last step we need to do is to start up some workers. RQ provides a handy script to run workers on the default queue. Just run rqworker in a terminal window and it will start a worker listening on the default queue. Please make sure your current working directory is the same directory the scripts reside in. If you want to listen to a different queue, you can run rqworker queue_name and it will listen to that named queue. The great thing about RQ is that as long as you can connect to Redis, you can run as many workers as you like on as many different machines as you like; therefore, it is very easy to scale up as your application grows. Here is the source for the RQ version:

import logging
import os

from redis import Redis

from rq import Queue

from download import setup_download_dir, get_links, download_link


logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logging.getLogger('requests').setLevel(logging.CRITICAL)
logger = logging.getLogger(__name__)


def main():
    client_id = os.getenv('IMGUR_CLIENT_ID')
    if not client_id:
        raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
    download_dir = setup_download_dir()
    links = get_links(client_id)
    q = Queue(connection=Redis(host='localhost', port=6379))
    for link in links:
        q.enqueue(download_link, download_dir, link)

if __name__ == '__main__':
    main()
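
If you prefer to start a worker from Python instead of the rqworker command, a minimal sketch looks like this; it assumes the same local Redis instance as above and listens on the default queue:

from redis import Redis
from rq import Queue, Worker

# Connect to the same Redis instance the jobs were enqueued on.
redis_conn = Redis(host='localhost', port=6379)

# A worker bound to the default queue; this call blocks and processes jobs
# as they arrive, just like running `rqworker` in a terminal.
worker = Worker([Queue(connection=redis_conn)], connection=redis_conn)
worker.work()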

However, RQ is not the only Python job queue solution. RQ is easy to use and covers simple use cases extremely well, but if more advanced options are required, other Python 3 queue solutions (such as Celery) can be used.
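
For comparison, a minimal, hypothetical Celery version of the same idea might look like the sketch below; the app name, broker URL, and example link are assumptions for illustration, not part of the original code:

from pathlib import Path

from celery import Celery

from download import download_link

# A Celery application using a local Redis instance as its broker (assumed URL).
app = Celery('downloads', broker='redis://localhost:6379/0')


@app.task
def download_link_task(directory, link):
    # Task arguments should be serializable, so the directory is passed as a string.
    download_link(Path(directory), link)


# Enqueue a job; a separate Celery worker process started with the Celery CLI
# would pick it up and execute it.
# download_link_task.delay('images', 'https://i.imgur.com/example.jpg')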

Python Multithreading vs. Multiprocessing

If your code is IO bound, both multiprocessing and multithreading in Python will work for you. Multiprocessing is easier to simply drop in than threading but has a higher memory overhead. If your code is CPU bound, multiprocessing is most likely going to be the better choice—especially if the target machine has multiple cores or CPUs. For web applications, and when you need to scale the work across multiple machines, RQ is going to be better for you.


 

Update

Python concurrent.futures

Something new since Python 3.2 that wasn’t touched upon in the original article is the concurrent.futures package. This package provides yet another way to use concurrency and parallelism with Python.

In the original article, I mentioned that Python’s multiprocessing module would be easier to drop into existing code than the threading module. This was because the Python 3 threading module required subclassing the Thread class and also creating a Queue for the threads to monitor for work.

Using a concurrent.futures.ThreadPoolExecutor makes the Python threading example code almost identical to the multiprocessing module.

import logging
import os
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from time import time

from download import setup_download_dir, get_links, download_link

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

logger = logging.getLogger(__name__)


def main():
    client_id = os.getenv('IMGUR_CLIENT_ID')
    if not client_id:
        raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
    download_dir = setup_download_dir()
    links = get_links(client_id)

    # By placing the executor inside a with block, the executor's shutdown
    # method will be called, cleaning up threads.
    #
    # By default, the executor sets the number of workers to 5 times the number
    # of CPUs.
    with ThreadPoolExecutor() as executor:

        # Create a new partially applied function that stores the directory
        # argument.
        # 
        # This allows the download_link function that normally takes two
        # arguments to work with the map function that expects a function of a
        # single argument.
        fn = partial(download_link, download_dir)

        # Executes fn concurrently using threads on the links iterable. The
        # timeout is for the entire process, not a single call, so downloading
        # all images must complete within 30 seconds.
        executor.map(fn, links, timeout=30)


if __name__ == '__main__':
    main()

Now that we have all these images downloaded with our Python ThreadPoolExecutor, we can use them to test a CPU-bound task. We can create thumbnail versions of all the images with a single-threaded, single-process script first, and then test a multiprocessing-based solution.

We are going to use the Pillow library to handle the resizing of the images.

Here is our initial script.

import logging
from pathlib import Path
from time import time

from PIL import Image

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

logger = logging.getLogger(__name__)


def create_thumbnail(size, path):
    """
    Creates a thumbnail of an image with the same name as image but with
    _thumbnail appended before the extension.  E.g.:

    >>> create_thumbnail((128, 128), 'image.jpg')

    A new thumbnail image is created with the name image_thumbnail.jpg

    :param size: A tuple of the width and height of the image
    :param path: The path to the image file
    :return: None
    """
    image = Image.open(path)
    image.thumbnail(size)
    path = Path(path)
    name = path.stem + '_thumbnail' + path.suffix
    thumbnail_path = path.with_name(name)
    image.save(thumbnail_path)


def main():
    ts = time()
    for image_path in Path('images').iterdir():
        create_thumbnail((128, 128), image_path)
    logging.info('Took %s', time() - ts)


if __name__ == '__main__':
    main()

This script iterates over the paths in the images folder and for each path it runs the create_thumbnail function. This function uses Pillow to open the image, create a thumbnail, and save the new, smaller image with the same name as the original but with _thumbnail appended to the name.

Running this script on 160 images totaling 36 million takes 2.32 seconds. Let's see if we can speed this up using a ProcessPoolExecutor.

import logging
from pathlib import Path
from time import time
from functools import partial

from concurrent.futures import ProcessPoolExecutor

from PIL import Image

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

logger = logging.getLogger(__name__)


def create_thumbnail(size, path):
    """
    Creates a thumbnail of an image with the same name as image but with
    _thumbnail appended before the extension. E.g.:

    >>> create_thumbnail((128, 128), 'image.jpg')

    A new thumbnail image is created with the name image_thumbnail.jpg

    :param size: A tuple of the width and height of the image
    :param path: The path to the image file
    :return: None
    """
    path = Path(path)
    name = path.stem + '_thumbnail' + path.suffix
    thumbnail_path = path.with_name(name)
    image = Image.open(path)
    image.thumbnail(size)
    image.save(thumbnail_path)


def main():
    ts = time()
    # Partially apply the create_thumbnail method, setting the size to 128x128
    # and returning a function of a single argument.
    thumbnail_128 = partial(create_thumbnail, (128, 128))

    # Create the executor in a with block so shutdown is called when the block
    # is exited.
    with ProcessPoolExecutor() as executor:
        executor.map(thumbnail_128, Path('images').iterdir())
    logging.info('Took %s', time() - ts)


if __name__ == '__main__':
    main()

The create_thumbnail method is identical to the last script. The main difference is the creation of a ProcessPoolExecutor. The executor’s map method is used to create the thumbnails in parallel. By default, the ProcessPoolExecutor creates one subprocess per CPU. Running this script on the same 160 images took 1.05 seconds—2.2 times faster!

Async/Await (Python 3.5+ only)

One of the most requested items in the comments on the original article was for an example using Python 3's asyncio module. Compared to the other examples, there is some Python syntax that may be new to most people, as well as some new concepts. An unfortunate additional layer of complexity is caused by Python's built-in urllib module not being asynchronous. We will need to use an async HTTP library to get the full benefits of asyncio. For this, we'll use aiohttp.

Let’s jump right into the code and a more detailed explanation will follow.

import asyncio
import logging
import os
from time import time

import aiohttp

from download import setup_download_dir, get_links

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)


async def async_download_link(session, directory, link):
    """
    Async version of the download_link method we've been using in the other examples.
    :param session: aiohttp ClientSession
    :param directory: directory to save downloads
    :param link: the url of the link to download
    :return:
    """
    download_path = directory / os.path.basename(link)
    async with session.get(link) as response:
        with download_path.open('wb') as f:
            while True:
                # await pauses execution until the 1024 (or less) bytes are read from the stream
                chunk = await response.content.read(1024)
                if not chunk:
                    # We are done reading the file, break out of the while loop
                    break
                f.write(chunk)
    logger.info('Downloaded %s', link)


# Main is now a coroutine
async def main():
    client_id = os.getenv('IMGUR_CLIENT_ID')
    if not client_id:
        raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
    download_dir = setup_download_dir()
    # We use a session to take advantage of tcp keep-alive
    # Set a 3 second read and connect timeout. Default is 5 minutes
    async with aiohttp.ClientSession(conn_timeout=3, read_timeout=3) as session:
        tasks = [(async_download_link(session, download_dir, l)) for l in get_links(client_id)]
        # gather aggregates all the tasks and schedules them in the event loop
        await asyncio.gather(*tasks, return_exceptions=True)


if __name__ == '__main__':
    ts = time()
    # Create the asyncio event loop
    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(main())
    finally:
        # Shutdown the loop even if there is an exception
        loop.close()
    logger.info('Took %s seconds to complete', time() - ts)

There is quite a bit to unpack here. Let's start with the main entry point of the program. The first new thing we do with the asyncio module is to obtain the event loop. The event loop handles all of the asynchronous code. Then, the loop is run until complete and passed the main function. There is a piece of new syntax in the definition of main: async def. You'll also notice await and async with.

The async/await syntax was introduced in PEP 492. The async def syntax marks a function as a coroutine. Internally, coroutines are based on Python generators, but aren't exactly the same thing. Coroutines return a coroutine object similar to how generators return a generator object. Once you have a coroutine, you obtain its results with the await expression. When a coroutine calls await, execution of the coroutine is suspended until the awaitable completes. This suspension allows other work to be completed while the coroutine is suspended “awaiting” some result. In general, this result will be some kind of I/O like a database request or in our case an HTTP request.
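
As a toy illustration, here is a minimal sketch, unrelated to the downloader, in which two coroutines each await a simulated one-second I/O operation; because each await hands control back to the event loop, the pair completes in roughly one second rather than two:

import asyncio
from time import time


async def fake_io(name):
    # asyncio.sleep stands in for a real I/O wait such as an HTTP request.
    await asyncio.sleep(1)
    return name


async def demo():
    # gather schedules both coroutines on the event loop; they wait concurrently.
    return await asyncio.gather(fake_io('first'), fake_io('second'))


if __name__ == '__main__':
    ts = time()
    loop = asyncio.get_event_loop()
    results = loop.run_until_complete(demo())
    print(results, 'took', time() - ts, 'seconds')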

The download_link function had to be changed pretty significantly. Previously, we were relying on urllib to do the brunt of the work of reading the image for us. Now, to allow our method to work properly with the async programming paradigm, we’ve introduced a while loop that reads chunks of the image at a time and suspends execution while waiting for the I/O to complete. This allows the event loop to loop through downloading the different images as each one has new data available during the download.

There Should Be One—Preferably Only One—Obvious Way to Do It

While the Zen of Python tells us there should be one obvious way to do something, there are many ways in Python to introduce concurrency into our programs. The best method to choose is going to depend on your specific use case. The asynchronous paradigm scales better to high-concurrency workloads (like a web server) compared to threading or multiprocessing, but it requires your code (and dependencies) to be async in order to fully benefit.

Hopefully the Python threading examples in this article—and update—will point you in the right direction so you have an idea of where to look in the Python standard library if you need to introduce concurrency into your programs.

Understanding the Basics

What is a thread in Python?

A thread is a lightweight process or task. A thread is one way to add concurrency to your programs. If your Python application is using multiple threads and you look at the processes running on your OS, you would only see a single entry for your script even though it is running multiple threads.
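
A minimal sketch of that idea: all of the threads below live inside one Python process, which is the single entry you would see in your operating system's process list.

import threading
import time


def wait_a_moment():
    time.sleep(1)


threads = [threading.Thread(target=wait_a_moment) for _ in range(4)]
for t in threads:
    t.start()

# Counts the main thread too, so this prints 5, all inside one OS process.
print('Threads in this process:', threading.active_count())

for t in threads:
    t.join()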

About the author

Marcus McCurdy, Python Developer

Marcus has a Bachelor's in Computer Engineering and a Master's in Computer Science. He is a talented programmer, and excels most at back-end development. However, he is comfortable creating polished products as a full stack developer.

Comments

jorjun
Would like to see an example using the new asyncio module. Also, Gevent works wonders for this kind of I/O bound concurrency problem if you have to stick with python 2. There really is no need to use either threads or multiprocessing - green threads could be the way to go...
Marcus McCurdy
The asyncio module would be a pretty hefty example as the underlying urllib code isn't setup to use async connections. You can see an example here https://docs.python.org/3/library/asyncio-stream.html#get-http-headers of fetching headers. I was going to touch on Gevent, but it doesn't work with Python 3 at this time.
Mariano Simone
If your code is I/O bound, you should increase the number of data streams a process manage by means of async I/O. Creating a thread/forking a process just for handling new connections is a horrid waste of resources. Now, if you are doing CPU intensive operations, it clearly makes sense throwing more cores at the problem.
Mariano Simone
Btw, the amount of work and the quality of the post is very impressive!. Good job dude.
jorjun
I see. The aiohttp module looks promising. http://geekgirl.io/concurrent-http-requests-with-python3-and-asyncio/
Marcus McCurdy
You are correct that ones approach to concurrency/parallelism depends on if the underlying code is IO vs. CPU bound and I touch on this a bit in the article. I wanted to keep the article as simple as possible and still demonstrate the different options available in Python 3. I felt that including both IO bound and CPU bound examples would bloat the article, but I do mention an example of a CPU bound task in the article. I really wanted to write the article using Python 3 as I feel there aren't as many resources for it. I also wanted to include async IO using gevent, but gevent doesn't support Python 3. I decided to go with a pure Python 3 article rather than include some examples that work in Python 3 and some that work in Python 2.
Marcus McCurdy
Thanks for the compliment and your other comment.
Eki Eqbal
Nice article, keep it up. Cheers
rockyqi
good article with interesting images, thanks
Zero_NzYme
Very Nice Article. Going to give the redis and celery options a try. Very cool!
Jon
Hi, could you provide some insight into how someone could use multiprocessing to perform different functions, e.g. first function is reading zipped csv files into memory and the second function merges those csv files in the same order. This way as function 2 is still merging files 1 and 2, file 3 is being read into memory. Similar to how a sandwich shop could have multiple workers working on the same sandwiches as they pass through the different stages of production.
Ashwin Nanjappa
Good article. However it is not mentioned what to do if your code is CPU bound, data is big (multiprocessing is out of the question) and is not a web application. This is typical of scientific applications. The article should at least mention in the conclusion that this currently cannot be solved in CPython efficiently.
Anon Omus
Displaying a technique without providing necessary information for correct usage of it can hardly be considered bloat. It's just being lazy. :/
Andrew Franklin
As someone that doesn't know much about concurrency and parallelism in python this seems like a good place to start! Thanks for this tutorial! Unfortunately, I also don't know much about the urllib and downloading images from imgur. Occasionally, these codes won't run for me and I get the following error: "urllib.error.HTTPError: HTTP Error 403: Permission Denied" It's very inconsistent about when I receive the error and sometimes the codes run without a hitch. Does anyone know how to fix this so it runs every time?
Eric O LEBIGOT (EOL)
Good job: it is rare to see bloggers who write good Python code! :) Now, you share the speed up obtained with threads: what do you get with multiprocessing?
mattias
Hi, The part with threading for number of CPUs doesn't make sense. Since it only will run on one CPU core as the python threads are not real os threads there's no relation to the number of available cores.
Yuval Baror
Great article - thanks for sharing! Could you provide some details regarding the run time of the multiprocessing and RQ solutions compared to the original and threaded solutions?
Marcus McCurdy
You can find the full source code of all the examples in the article here https://github.com/volker48/python-concurrency. I never saw any 403 errors when I was testing, but that error would be coming from Imgur's API. I'm not sure what kind of network you are on, but perhaps your IP, if it is shared, is causing you to reach some kind of API limit.
deleteman
I'd also like to see this result.Can you please add it either to the post or here as a reply?
Ravi
Nice article. Surprised not to see Gevent in the list.
bjlange
+1 for RQ! By far the most user-friendly way I've found of distributing work across machines.
Nabeel Valapra
This is what I was looking for.. Great One!!!
Jay Dreyer
Thanks for this! 10x improvement in a script I use regularly. Thanks!
JEdVcM
In your threaded code, you're making a busy-wait with "while True" + non-blocking queue read. This is okay if you've got content in your queue, and you wrap it up at the end, but it gets to be a CPU hog otherwise. We had that a few months ago in our project. The other thing, I surely don't understand python threading totally, but even in an I/O bound environment, how can you get any faster, if you only use sync network reads? The GIL still makes only one thread running at a time. What do I miss? Is there some under-the-hood optimization for that in CPython?
Dannnno
From what I recall, CPython releases the GIL when threads are waiting for IO events - I'm not clear on the specifics of how that is done however.
JEdVcM
That would explain things well. However, I've found nothing on that yet. Another thing: sorry, I was wrong. The code uses a blocking wait.
Dannnno
I can't find where I originally read it either, however you can have this tidbit "Note that potentially blocking or long-running operations, such as I/O, image processing, and NumPy number crunching, happen outside the GIL." https://wiki.python.org/moin/GlobalInterpreterLock
Arulbalan Subramani
Hello Marcus, Wonderful post! Thanks, one question: though we have no of threads to be controlled using semaphore class, why do we need queue class to handle threads ? Thanks, Arul
Andreu Vallbona Plazas
In the example of multiprocessing Pool, how you control the same link is not downloaded for every process? On the other hand, GREAT POST!!
Marcus McCurdy
It's the way the map function works with the multiprocessing pool: p.map(download, links). Links is a list of download links. p.map calls the download function once for each link in the list. Because this is the map function on a pool object each function may run in its own process. Pool's map function works similarly to Python's built in map function. Not sure if you are familiar with it, but it might make more sense if you play around with it https://docs.python.org/2/library/functions.html#map.
Mike Zang
This is a great job! I want to use it to my projects. I parse image link one by one dynamically, how can I use your thread or pool way?
sandipan
Celery is a great one to use. It can scale really well. Also it has flexibility of using redis or rabbitmq as broker
ale3andro
Thank you so much for this. My first attempt to write a threaded python script has led to a success!
Sundar Moses
Nicely written Marcus. I am also a python programmer. Though i have used these functions in concurrency, I am unable to rewrite the program using python next time and I forget the flow or parameters. again I refer my old program to recollect the flow. Any tips from your side to remember these function calls and arguments? thanks.
aeroaks
Hi, I have a function that accepts the file path and performs analysis on it. It returns an id for the pandas data frame row to which it was added. The file path is passed as a single string one at a time, from another program. Over time the analysis has included different types of files and takes some time. I need to implement multiprocessing for this part. What would be the best way of doing it? I need the return value of row id to work further on the results.
Will Vaughn
Yeah, your code on github works, but what you have written in the blog does not work
Marcus McCurdy
The code on github has been updated after the blog post was written to account for some changes in the imgur API. I don't have direct access to the blog post like I do with the github repo. Which part of the blog post isn't working for you? I'll check it out and ask for it to get updated once I fix it. Thanks Will.
David Nguyen
great article
Rohit Malgaonkar
Haven't tried redis server but Installed RabbitMQ python server and had 1 producer (sending string data encoded as JSON) and 3 consumers for cpu bound script (looping csv files, searching decoded JSON data sent by producer) and it halfs the time approximately but runs 3 python processes @ 97% and cpu utilization @ 90% versus 1 process @97% and double time.
Vikas Gupta
Good blog about multithreading. Thanks. :)
Lava Kafle
great
Heron Rossi
Very nice article! What´s the purpose of worker.daemon = True for this particular example ? Is this just good practice or has a real impact on this specific case ?
Daniel Nuriyev
I agree. The author and the readers who liked this article do not realize that threads are not the OS threads but sequences of Python bytecode running on a single OS thread. Python VM executes each thread up to 10 milliseconds and switches to the next one. The improvement is because this code does IO. Without IO the performance would be worse.
mattias
I was about to tell you that both of you are incorrect when I realized that I wrote the original statement. Anyway, python threads are real threads, tied to the same PPID, but only one is executed at a time regardless of the number of available cores due to the global interpeter lock. There's that!
Daniel Nuriyev
Thank you for the explanation! In this case I have another question: is it correct that if a thread runs for more than 10ms, Python VM switches to the next thread in order not to get stuck?
mattias
From what I can google quickly any cpu bound thread (ie not waiting for i/o) is releasing the gil (and asking to re-acquire it) every 100 ticks. Not sure what a tick is exactly but I doubt it's cpu cycles, but rather some executional loop within the interpreter itself I guess :-). Anyway this ensures all threads are given execution time, but the penalty is so high that trying to execute many threads at once is much worse than running the tasks sequentially. I found this slide that explained the scheduling to me so take it for a guestimation :). http://www.dabeaz.com/python/UnderstandingGIL.pdf
Honby
I try to recreate this code to understand about multithread. But I got an error when i run it. headers = {'Authorization': 'Client-ID {}'.format(client_id)} --> output: Bad Request headers = {'Authorization': 'Client-ID {{}}'.format(client_id)} --> output: Permission Denied Do i need to have images uploaded to my imgur account ?
Kevin
I took your initial threading example and wrote some code that works very well. However the mandate is that it must run inside of a web server, and while it runs equally well there, it is "accumulating threads" so to speak because python does not actually exit until the web server itself is shut down. E.g. I am spawning off eight threads and each time the end point is executed it utilizes a new set. I know this because I am writing the individual thread names (threading.current_thread().getName()) and total count to the log file. This is true regardless of whether I set them as daemon or not. I am not a python guru. Any input/suggestions on how I can continually re-use the same set of threads (once defined in the initial execution) or alternatively "delete" the thread objects so they don't continue to accumulate? Thanks!
Marcus McCurdy
Hey Kevin its hard to say exactly what the best course of action is without knowing a little bit more about what the callable of your threads is doing. If your threads are like my DownloadWorker class in the example that might not be the best approach for your use case. The docs on the thread class https://docs.python.org/3.6/library/threading.html#thread-objects say: "Once the thread’s activity is started, the thread is considered ‘alive’. It stops being alive when its run() method terminates – either normally, or by raising an unhandled exception. The is_alive() method tests whether the thread is alive." so if your run method on your class that inherits from Thread finishes and exits the thread should no longer be alive. You might want to checkout my updated example that uses a https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor. With the executor, you select a max number of threads that make up the pool. You then submit tasks to the executor who will use one of the available threads in the pool or block until one is available. Once you've created your executor, you call the submit method on the executor to have you task run asynchronously on one of the threads. Submit returns a future that you can then use to get the result. Depending on your use case you may not need to use the future that submit returns. The ThreadPoolExecutor sounds more like what you are looking for since you said you want to "re-use the same set of threads" Hopefully this helps. If not feel free to email a gist with a code example. My email is my firstname dot lastname at toptal dot com.
ihadanny
NOTE: for anybody trying to run this code now (2018), the imgur api has changed, if you try to access https://api.imgur.com/3/gallery/ you will get a `Bad Request` error. instead use https://api.imgur.com/3/gallery/random/random/
ihadanny
the imgur API has changed, use https://api.imgur.com/3/gallery/random/random/ instead
Marcus McCurdy
Sorry about that. It looks like we forgot to update the first code snippet of the blog post. The code is actually correct in the Github repo https://github.com/volker48/python-concurrency/blob/master/download.py#L19. We will update the snippet in the blog post. Thank you for pointing out the issue.
Kevin Bloch
The article has been updated.
Andre Zunino
I'm running Python 3.6.5 and had to change the 'readall' method calls on the HTTPResponse objects returned from urllib.request's urlopen method (in download.py). I don't know if those have been removed from the HTTPResponse API in recent versions, but I found there is a 'read' method that can be used.
Pablo Messina
What if I want to use multiple cores but the tasks need to share the same read-only data? For instance, imagine I want to test multiple versions of a recommendation algorithm in an offline fashion with different parameters in order to find the best ones, but in all cases the same readonly data is used, namely: 1) a database of transactions, i.e. a sorted list of Purchase Events (where Purchase Event is a Class that contains data such as timestamp, IDs of items purchased, ID of customer, etc.) and 2) several large numpy feature matrices with dimensions N x F (N = number of items, F = number of features), and possibly 3) some dictionaries mapping itemID's to additional metadata. In order to speed up the process, it would be ideal to run multiple experiments simultaneously but sharing the readonly data, since it would not be possible to duplicate the data for each process (it would be very time-consuming and would not fit into the RAM). Do you know how something like this could be accomplished? It should be straightforward with multithreading, but it turns out that multithreading in Python does not support concurrency of CPU-bound tasks. I guess Multiprocessing should be the way to go, but then I'm faced with this problem of sharing a lot of readonly data (list of class objects, numpy arrays, dictionaries, etc.)
Kevin Bloch
Hi Andre--thank you very much for pointing this out. The article's original section has now been updated to match the current versions on GitHub, which include the fixes you mentioned.
andrei deuşteanu
One limitation of the multiprocessing library that I've come across is that it's unable to work with lambda functions as functions are pickled by name and hence lambda functions are anonymous they're not pickled. I've tried pickling them manually with dill but I was not successful. But then I came across the multiprocess - https://github.com/uqfoundation/multiprocess library which makes things really easy to run in parallel. It's actually part of a larger framework for heterogenous computing pathos - https://github.com/uqfoundation/pathos. You should check it out. I'm not working on the library, but I've used it a few times and it's very easy to work with that's why I'm promoting it.