GitHub - sidkris/megaprofiler: A Python library for automatic data profiling and validation

When working with large datasets, it’s often necessary to understand data types, distributions, and potential issues (e.g., missing values, outliers) before analysis. While libraries like pandas-profiling exist, there is still room for an extensible, easy-to-use, and highly customizable profiler that integrates data validation.

Key Features:

- Automatic Data Summaries: Provide insights like distribution, unique values, missing values, and more for each column.
- Anomaly Detection: Automatically flag columns or rows with unusual distributions, outliers, or inconsistent data.
- Data Validation: Set validation rules (e.g., no missing values in specific columns, data type constraints) and get alerts if the data violates these rules.
- Custom Reports: Generate visual reports (e.g., HTML, PDF) with configurable thresholds for what counts as an anomaly.
- Data Drift Detection: Track changes in data distributions over time to identify shifts in data quality or content.

Benefits: megaprofiler is aimed at data scientists and engineers working on exploratory data analysis, data quality checks, and ETL pipelines, where it reduces manual data investigation.
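To make the summary, anomaly, and validation features above more concrete, here is a minimal sketch of the kind of checks they describe, written with only the Python standard library. It is not megaprofiler's API: the function names `profile_column` and `validate`, the rule keys `required` and `dtype`, and the 1.5 × IQR outlier threshold are all assumptions chosen for illustration.

```python
import statistics
from typing import Any


def profile_column(values: list[Any]) -> dict[str, Any]:
    """Summarize one column: row count, missing count, unique count, numeric stats."""
    present = [v for v in values if v is not None]  # None stands in for "missing"
    summary: dict[str, Any] = {
        "count": len(values),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
    }
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    # Only compute numeric stats when every present value is numeric.
    if len(numeric) == len(present) and len(numeric) >= 2:
        q1, _, q3 = statistics.quantiles(numeric, n=4, method="inclusive")
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # hypothetical threshold
        summary["mean"] = statistics.fmean(numeric)
        summary["outliers"] = [v for v in numeric if v < low or v > high]
    return summary


def validate(rows: list[dict[str, Any]], rules: dict[str, dict[str, Any]]) -> list[str]:
    """Check each row against per-column rules and return readable violations.

    Each rule may set 'required' (no missing values) and 'dtype' (expected type).
    """
    problems = []
    for i, row in enumerate(rows):
        for col, rule in rules.items():
            value = row.get(col)
            if value is None:
                if rule.get("required"):
                    problems.append(f"row {i}: {col} is missing")
            elif "dtype" in rule and not isinstance(value, rule["dtype"]):
                problems.append(f"row {i}: {col} expected {rule['dtype'].__name__}")
    return problems


# Illustrative data, not taken from the repository.
ages = [31, 29, 30, 33, 120]
cities = ["Oslo", "Lima", None, "Oslo", "Pune"]
rows = [{"age": 31, "city": "Oslo"}, {"age": "n/a", "city": None}]
rules = {"age": {"required": True, "dtype": int}, "city": {"required": True}}

print(profile_column(ages))
print(profile_column(cities))
print(validate(rows, rules))
```

A real profiler would also need to handle NaN values, dates, and large datasets; this sketch only shows the shape of the checks.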

To use:

    from megaprofiler import MegaProfiler

Repository contents (31 commits):

build/lib
dist
megaprofiler.egg-info
megaprofiler
tests
venv
.gitattributes
LICENSE
README.md
requirements.txt
sample_usage.py
setup.py
