Naming Things

Introduction

This document serves as a resource on naming conventions and best practices
It was written with two large inspirations
- Jenny Bryan’s lecture on naming files: Slides
- Experiences in Data Science, and working with distributed teams
The purpose of naming conventions, and standardized format, is to promote a common language and understanding of code and data
Maintaining meta-data is wonderful in practice, but it takes a lot of time. It does not have a directly visible pay off. As a result, it is often pushed to the side, until disaster strikes, and a misuse of the data is presented in an analysis or incorporated into a model.
Standardized naming conventions for files and code can aide in meta-data generation and data documentation, without piling on lots of extra for the Data Scientist
Where it is even better than a meta-data document, is that the meta-data is captured within the data itself (through file names, and variable names in the data).
Note: This does not replace the need for a data dictionary, or a project brief. It is a tool to help facilitate self-documentation when these pieces may be absent

Naming Files

The three principles for naming files

Machine Readable
Human Readable
Orders by default

Machine Readable

Machine readable file names are those that can be easily read in, and parsed by operating systems, programming languages, and other software. To accomplish this, there are a few guidelines:

Do not use spaces, punctuation (, / . ; ’ , etc.), or accented characters
Use all lower case characters
Use delimiters to separate key pieces of information (_ or -) - Use underscores (_) to separate units of meta-data
Use hyphens (-), also called slugs, to separate words of the same meta-data

Example : In the below example, we see that the files take the following format

{date}{client}{source}_{description}.csv

list.files('~/Downloads/example')

##  [1] "2020-01-01_client_a_data.csv" "2020-02-01_client_b_data.csv"
##  [3] "2020-03-01_client_c_data.csv" "2020-04-01_client_d_data.csv"
##  [5] "2020-05-01_client_e_data.csv" "2020-06-01_client_f_data.csv"
##  [7] "2020-07-01_client_g_data.csv" "2020-08-01_client_h_data.csv"
##  [9] "2020-09-01_client_i_data.csv" "2020-10-01_client_j_data.csv"

The files are ordered chronologically by default
The combination of underscores and hyphens not only make it easy to read, but can be used to parse the files themselves into meta-data if needed

stringr::str_split_fixed(myfiles, "[_\\.]",5) |> kableExtra::kable()

2020-01-01	client	a	data	csv
2020-02-01	client	b	data	csv
2020-03-01	client	c	data	csv
2020-04-01	client	d	data	csv
2020-05-01	client	e	data	csv
2020-06-01	client	f	data	csv
2020-07-01	client	g	data	csv
2020-08-01	client	h	data	csv
2020-09-01	client	i	data	csv
2020-10-01	client	j	data	csv

Now we can see each piece of meta-data contained in the files - This now becomes much easier to search for types of files, narrow large lists based upon their names, and extract info from the names themselves

Human Readable

The other half of the equation, is to ensure that the file names are easily readable by those who are going to use them Here are a few tips on how:

The names of your files should contain information on the content within the file
Embrace the use of hyphens, also known as “slugs” in the file-name. It helps to break up words without the use of spaces (breaking some programming languages)
For example:
- 2021-11-15_doug_full-mmm-data.csv

Orders by Default

Beginning filenames with the date, or a number that signifies the order of the files, ensures that the files sort correctly by default. If you are numbering files, be sure to left pad the number with a 0 for the first 9 digits.

01_pull-data.py, 02_clean-data.py

This is done so that it appears before 10_visualize-model.py in a file list.

All dates should follow the format: YYYY-MM-DD
This is the ISO 8601 standard for dates

Naming Variables

Many of the principles we have covered in the file naming conventions carry over to naming variables in a dataset. There are 3 key naming conventions to keep in mind when naming variables in an analysis or data pipeline:

Be descriptive.
Be kind to the reader.
Be consistent.

A Variable name should describe the item. The variable name must describe the information represented by the variable. A variable name should tell you concisely in words what the variable stands for, and how it is measured. This isn’t easy! And it takes practice.

Lets look at a couple examples. Below we have a daily average computed for 4 different channels.

total = a + b + c + d
final = total / num_days

Looking at the code block above, it is not clear what is happening, the sources of each variable, and what they represent. Compare that to the code block below:

paid_media_impressions = affiliate_impressions + display_impressions + paid_search_impressions + social_impressions
paid_media_impressions_daily_avg = paid_media_impressions / days_in_month

In the above code block, it is clear what channels the variables represent, and the metric from the respective channels. While the names are longer, it is very clear to the reader what the spirit of the variables and calculations are. The code becomes “self-documenting”. If a channel is missing, it is clear which channels are already included. If there is an error in the calculation, it is clear what each section of code is trying to accomplish.

Be kind to the reader

Your code will be read more times than it is initially written, so prioritize ease of reading variable names over speed of typing. Don’t worry about lengthy variable names, let auto-completion in your IDE or editor of choice handle the rest!

paid_media_impressions = affiliate_impressions + display_impressions + paid_search_impressions + social_impressions
paid_media_impressions_daily_avg = paid_media_impressions / days_in_month

There are several common abbreviations that can help the reader understand what the variables represent:

avg: for average
max: for maximum
min: for minimum
std: for standard deviation
sum: for the summation of a value per a given time period or index

Best practice is to put the abbreviations at the end of the name

Be consistent

Adopt standard conventions for naming so you can make one global decision in a codebase instead of multiple local decisions. For example, standard naming conventions for large datasets like those created for Media Mix Models can be expressed in the table below:

Variable Type	Prefix
Paid Media	paid
Organic Traffic	organic
Competitor Data	competitor
External Factors	external
Promotional Factors	promo

Much like files, common prefixes can make it much easier to select groups of variables at one time. It also provides self-documentation, as the user can understand the roles that the variables play in the analysis

Introductions in Scripts

Another great practice is to provide a brief introduction to the script via comments At a minimum we should include:

1-2 sentences about what the script accomplishes
The author of the script
Any parameters that need to be manually changed in order for the script to run should be highlighted in the beginning of the script

Conventions in practice

When to prioritize the conventions

When there are adhoc analyses, it may not be worth the effort to employ these best practices. There are several packages in both R and Python that help to make this process easier. In R, the janitor packages has the function clean_names() that you can call on a dataframe. The result, is the removal of any case-sensitivity, spaces are replaced with underscore, trailing spaces are removed, and any special characters are removed

In python, the pyjanitor package is a python implementation of the same package. You can call clean_names() on any pandas dataframe to accomplish the same result.

On Naming Things