Getting all CSV files in a directory and its subdirectories using Python

Below is the code

import os
from glob import glob
PATH = "/home/someuser/projects/someproject"
EXT = "*.csv"
all_csv_files = [file
                 for path, subdir, files in os.walk(PATH)
                 for file in glob(os.path.join(path, EXT))]

That’s it!

Although the code above searches recursively for CSV files, it works for any file type; you only need to change EXT.
For example, to find all the .css files, set EXT to *.css:

EXT = "*.css"
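Since only the pattern changes between searches, the same logic can be wrapped in a small function. The following is a minimal sketch; the name find_files and its parameters are illustrative and not part of the original code.

```python
import os
from glob import glob


def find_files(root, pattern):
    """Return paths under root (searched recursively) that match pattern.

    find_files is a hypothetical helper name used for illustration.
    """
    return [file
            for path, subdirs, files in os.walk(root)
            for file in glob(os.path.join(path, pattern))]


# Same search as above, with the pattern passed as an argument.
css_files = find_files("/home/someuser/projects/someproject", "*.css")
```

If the root directory does not exist, os.walk yields nothing by default, so the result is simply an empty list.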

Code explained

Now let’s go line by line. I’ll try to explain every line of code.

First, we import the os and glob modules.

import os
from glob import glob
  • The os module provides operating-system-dependent functionality.
    In our case we’ll use its walk and path.join functions, which are explained in detail shortly.
  • The glob module finds pathnames matching a specific pattern. For example, we can tell it to find all the JavaScript files in a directory by specifying a pattern like *.js.
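To see what glob does on its own, here is a small sketch that builds a throwaway directory and matches *.js in it. The file and directory names (app.js, style.css, vendor/lib.js) are made up for the demonstration.

```python
import os
import tempfile
from glob import glob

# Build a temporary directory so the example does not depend on real files.
with tempfile.TemporaryDirectory() as demo_dir:
    os.mkdir(os.path.join(demo_dir, "vendor"))
    for name in ("app.js", "style.css", os.path.join("vendor", "lib.js")):
        open(os.path.join(demo_dir, name), "w").close()

    # Only the top level is matched; vendor/lib.js is not returned.
    js_files = glob(os.path.join(demo_dir, "*.js"))
    js_names = [os.path.basename(p) for p in js_files]

print(js_names)  # ['app.js']
```

Because a plain pattern like this does not descend into subdirectories, glob needs help from os.walk to cover a whole tree.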

Next, we set two constants.

PATH = "/home/someuser/projects/someproject"
EXT = "*.csv"
  • The PATH constant is the path of the directory in which we want to search.
  • The EXT constant is the glob pattern for the file extension we want to find.

Then we have

all_csv_files = [file
                 for path, subdir, files in os.walk(PATH)
                 for file in glob(os.path.join(path, EXT))]

The statement above is a list comprehension, a typical Python idiom. It can be rewritten as an explicit loop:

all_csv_files = []
for path, subdir, files in os.walk(PATH):
    for file in glob(os.path.join(path, EXT)):
        all_csv_files.append(file)
  • os.walk("path/to/some/directory") walks a directory tree: the directory itself, its subdirectories, their subdirectories, and so on.
    It is a generator that yields, for each directory it visits, a tuple of three elements: the path of the directory, a list of the subdirectories in it, and a list of the files in it.
    Suppose you have the following directory structure.

    blog/
        index.php
        controllers/
            HomeController.php
        models/
            BlogModel.php

    For this directory, the code below

    for path, subdir, files in os.walk("path/to/some/directory/blog"):
        print("Path", path)
        print("Subdir", subdir)
        print("Files", files)

    gives output like the following on Python 3 (the order of names within each listing can vary by platform):

    Path path/to/some/directory/blog
    Subdir ['controllers', 'models']
    Files ['index.php']
    Path path/to/some/directory/blog/controllers
    Subdir []
    Files ['HomeController.php']
    Path path/to/some/directory/blog/models
    Subdir []
    Files ['BlogModel.php']
  • On the next line, glob(os.path.join(path, EXT)) is called inside the loop, so it runs once for every directory that os.walk visits, including the top-level one. For the blog directory above, it runs for
    • path/to/some/directory/blog
    • path/to/some/directory/blog/controllers
    • path/to/some/directory/blog/models
  • os.path.join(path, EXT) joins the two path components, so os.path.join('path/to/some/directory/blog', '*.csv') returns path/to/some/directory/blog/*.csv.
  • glob('path/to/some/directory/blog/*.csv') returns all the .csv files directly inside the blog directory (not in its subdirectories). Because this call is repeated for every directory that os.walk yields, the whole tree is covered.
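Since os.walk already yields the file names in each directory, the glob call can be swapped for a name check with the standard fnmatch module. This is an alternative technique, not the approach used in the code above; the sketch below is a minimal version of it, and the function name find_by_patterns is hypothetical. Checking names directly also makes it easy to accept several patterns at once.

```python
import os
from fnmatch import fnmatch


def find_by_patterns(root, patterns):
    """Walk root and collect files whose names match any of the patterns.

    find_by_patterns is an illustrative name, not part of the original code.
    """
    matches = []
    for path, subdirs, files in os.walk(root):
        for name in files:
            # Keep the file if its name matches at least one pattern.
            if any(fnmatch(name, pattern) for pattern in patterns):
                matches.append(os.path.join(path, name))
    return matches


csv_and_css = find_by_patterns("/home/someuser/projects/someproject",
                               ["*.csv", "*.css"])
```

One difference to keep in mind: glob skips names that start with a dot when the pattern begins with *, whereas fnmatch does not treat a leading dot specially, so hidden files such as .backup.csv would be included here.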

Hope this explains the code.


Join the discussion

  1. ODIA says:

    Can EXT be a list of extensions?

    This is what I tried, and it’s not working:

    all_csv_files = [file
    for input_path, subdir, files in os.walk(PATH)
    for file in glob(os.path.join(input_path, (x for x in ext)))]
