Table of contents:
Where to start?
Working with the code
- Version control, Git, and GitHub
- Getting started with Git
- Forking
- Creating a development environment
Docstrings Guidelines
Writing tests
- Using pytest
- Running the test suite
Contributing your changes to datasist
- Committing your code
- Pushing your changes
- Review your code and finally, make the pull request
All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.
For first-time contributors, pending issues are listed on the GitHub “issues” page. Those labelled "good first issue" are a good place to start. Once you’ve found an interesting issue and have an improvement in mind, the next step is to set up your development environment.
Now that you have an issue you want to fix, enhancement to add, or documentation to improve, you need to learn how to work with GitHub and the datasist code base.
The datasist code is hosted on GitHub. To contribute you will need to sign up for a free GitHub account. We use Git for version control to allow many people to work together on this project.
Some great resources for learning Git:
- Official GitHub pages.
There you will find instructions for installing Git, setting up your SSH key, and configuring Git. These steps need to be completed before you can work seamlessly between your local repository and GitHub.
You will need your own fork to work on the code. Go to the datasist project page and hit the Fork button.
Next, you will clone your fork to your local machine:
git clone
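This creates the directory datasist, containing a local copy of your fork. To stay in sync with the upstream (main project) repository, you can also add it as a remote with git remote add upstream followed by the main repository URL.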
To test out code changes, you’ll need to build datasist from source, which requires a Python environment.
Next, you'll create an isolated datasist development environment:
Install either Anaconda or miniconda
Make sure your conda is up to date (conda update conda)
Make sure that you have cloned the repository.
Follow the steps below:
- cd to the datasist source directory
cd datasist
- Build and install datasist
python setup.py build
pip install -e .
- Test that datasist was successfully installed by importing it in Python
python -c "import datasist"
- If the previous command raises no error, you're ready to start contributing. Now you can fire up your favorite IDE and start implementing your changes.
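If you would rather check the install from a script, the following is a minimal sketch that uses only the standard library. The helper name `is_installed` is hypothetical and is not part of datasist.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Expected to print True once `pip install -e .` has succeeded
    # in the currently active environment.
    print(is_installed("datasist"))
```

Unlike a plain import, this check does not execute the package, so it only confirms that Python can locate it.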
Docstrings are an important part of coding, and we encourage you to write clear and concise docstrings for your functions, methods, and classes. The docstrings you write are used to generate the datasist documentation automatically with the pdoc library. Some guidelines for writing docstrings are:
- Define what the function does.
- Define all parameter types and what they do.
- State the return values.
- Use the correct spacing and indentation, as this affects the documentation generated by pdoc.
Sample docstring:
def add_df(df1=None, df2=None):
    """
    A function to add two dataframes together.

    Parameters:
        df1: DataFrame, Series
            The first dataframe to add.
        df2: DataFrame, Series
            The second dataframe to add.

    Returns:
        DataFrame: Element-wise sum of the two dataframes.
    """
    # Check that both dataframes were provided
    if df1 is None:
        raise ValueError('df1: Expected a DataFrame, got None')
    if df2 is None:
        raise ValueError('df2: Expected a DataFrame, got None')

    final_df = df1 + df2
    return final_df
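To catch a parameter that the docstring forgets to describe, a small check like the one below can help. This is only a sketch: the helper `undocumented_params` and the example function `add_numbers` are hypothetical, and the check assumes each parameter is documented on its own line as `name: type`, as in the sample above.

```python
import inspect
import re


def undocumented_params(func):
    """Return the parameters of `func` that have no `name:` line in its docstring."""
    doc = inspect.getdoc(func) or ""
    missing = []
    for name in inspect.signature(func).parameters:
        # Look for a line that starts with the parameter name followed by a colon.
        pattern = rf"^\s*{re.escape(name)}\s*:"
        if not re.search(pattern, doc, flags=re.MULTILINE):
            missing.append(name)
    return missing


def add_numbers(a, b):
    """
    Add two numbers.

    Parameters:
        a: int or float
            The first number.

    Returns:
        int or float: Sum of the two numbers.
    """
    return a + b


print(undocumented_params(add_numbers))  # prints ['b']
```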
We strongly encourage contributors to write tests for their code. Like many packages, datasist uses pytest.
All tests should go into the tests subdirectory, in the file that corresponds to the module under test. The tests folder contains some current examples of tests, and we suggest looking to these for inspiration.
To compare DataFrame or Series objects, you can use the pandas.testing module (named pandas.util.testing in older pandas releases). The easiest way to verify that your code is correct is to explicitly construct the result you expect (expected), then compare it to the actual result (output).
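As an illustration of the expected/output pattern, here is a minimal sketch that uses `assert_frame_equal`, imported from `pandas.testing` (its location in current pandas releases). The function `add_one` is a hypothetical stand-in for the code under test and is not part of datasist.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def add_one(df):
    """Hypothetical function under test: add 1 to every value."""
    return df + 1


def test_add_one():
    data = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    # Construct the expected result explicitly...
    expected = pd.DataFrame({"a": [2, 3], "b": [4, 5]})
    # ...then compare it with the actual result.
    output = add_one(data)
    # Raises an AssertionError describing the difference if the frames differ.
    assert_frame_equal(output, expected)
```

Compared with a plain `==` check, `assert_frame_equal` also compares the index, column labels, and dtypes, and reports where the two frames differ.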
Here we show an example of a test case we wrote for the drop_redundant function in the feature_engineering module. This test is placed in the file inside the tests folder.
def drop_redundant(data):
    """
    Removes features with the same value in all cells. Also drops a feature if NaN is its second unique class.

    Parameters:
        data: DataFrame or named Series.

    Returns:
        DataFrame or named Series.
    """
    if data is None:
        raise ValueError("data: Expecting a DataFrame/ numpy2d array, got 'None'")

    # Get the columns to drop (_nan_in_class is a private helper in the same module)
    cols_2_drop = _nan_in_class(data)
    print("Dropped {}".format(cols_2_drop))
    df = data.drop(cols_2_drop, axis=1)
    return df
The corresponding test for the function above is:
import numpy as np
import pandas as pd
import pytest

from datasist import feature_engineering

df = pd.DataFrame({'country': ['Nigeria', 'Ghana', 'USA', 'Germany'],
                   'size': [280, 20, 60, np.nan],
                   'language': ['En', 'En', 'En', np.nan]})


def test_drop_redundant():
    # 'language' holds a single value plus NaN, so it should be dropped
    expected = ['country', 'size']
    output = list(feature_engineering.drop_redundant(df).columns)
    assert expected == output
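It is also worth testing the error path. The sketch below uses `pytest.raises` to check that passing None raises a ValueError. To keep the example self-contained, it tests a hypothetical stand-in, `drop_redundant_stub`, which mirrors only the None check of drop_redundant. In a real test you would call feature_engineering.drop_redundant instead.

```python
import pytest


def drop_redundant_stub(data):
    """Hypothetical stand-in that mirrors only the None check of drop_redundant."""
    if data is None:
        raise ValueError("data: Expecting a DataFrame/ numpy2d array, got 'None'")
    return data


def test_drop_redundant_rejects_none():
    # The test fails unless a ValueError whose message matches is raised in the block.
    with pytest.raises(ValueError, match="got 'None'"):
        drop_redundant_stub(None)
```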
To run the test case, navigate to the directory that contains the tests folder, open a command prompt, and run the following command:
pytest tests
Learn more about pytest in the official pytest documentation.
Once you’ve made changes, you can see them by typing:
git status
Next, you can track your changes using:
git add .
Next, you commit changes using:
git commit -m "Enter any commit message here"
When you want your changes to appear publicly on your GitHub page, you can push to your forked repo with:
git push
If everything looks good, you are ready to make a pull request. A pull request is how code from a local repository becomes available to the GitHub community and can be reviewed and eventually merged into the master version. To submit a pull request:
Navigate to your repository on GitHub
Click on the Pull Request button
Write a description of your changes in the Preview Discussion tab
Click Send Pull Request.
This request then goes to the repository maintainers, who will review the code and, if everything looks good, merge it into master.
Hooray! You're now a contributor to datasist. Now go bask in the euphoria!