Exploratory data analysis (EDA) lets developers and programmers give stakeholders a clearer understanding of what questions may reasonably be asked of a dataset, with very little programming effort. Basic questions such as "how much data is actually present in every row?" or "what are the unique, or most common, values in this column?" can reportedly shave up to 30% off the data science workflow (that figure comes from a source I found online and can't vouch for), and from my perspective EDA is just an essential first step, period.
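As a rough illustration of those basic questions, here is a minimal pandas sketch. The helper name `quick_summary` and the sample data are made up for this example; swap in your own DataFrame.

```python
import pandas as pd


def quick_summary(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, percent of values present, unique count, most common value."""
    rows = []
    for name in df.columns:
        col = df[name]
        counts = col.value_counts()  # missing values are excluded by default
        rows.append({
            "column": name,
            "dtype": str(col.dtype),
            "pct_present": round(100 * col.notna().mean(), 1),
            "n_unique": col.nunique(),
            "most_common": counts.index[0] if not counts.empty else None,
        })
    return pd.DataFrame(rows).set_index("column")


# Hypothetical sample data with a few missing values
df = pd.DataFrame({
    "city": ["Charlottesville", "Richmond", "Charlottesville", None],
    "visits": [3, 5, None, 5],
})

summary = quick_summary(df)                # per-column answers
present_per_row = df.notna().sum(axis=1)   # how many fields are filled in on each row

print(summary)
print(present_per_row)
```

The profiling libraries below answer the same questions (and many more) automatically, but a few lines like these are often enough for a first look.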
Carnegie Mellon has a deep-dive chapter on the subject:
https://www.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf
And here's a brief overview: https://www.svds.com/value-exploratory-data-analysis/
EDA in Python
Pandas Profiling and Sweetviz are simple installs that work well with Streamlit.
To test them, you can set up a Streamlit share and then install:
- https://pypi.org/project/streamlit-pandas-profiling/ - a Streamlit-ready library for Pandas Profiling: univariate and some multivariate (2 variables max) graphical analysis
- https://pypi.org/project/sweetviz/ - Sweetviz itself, with some "tutorial" info at https://discuss.streamlit.io/t/this-is-how-to-use-sweetviz-with-streamlit/10897
Here's some Python code wrapped in Streamlit that provides both, for you to test with a CSV of your choosing:
https://github.com/alibama/code-for-cville/blob/master/divides.py
I pulled most of this from the video here, which goes in depth:
https://www.youtube.com/watch?v=zWiliqjyPlQ
I skipped to about minute 30 to get into the Sweetviz stuff and then headed over to the GitHub repo
https://github.com/Jcharis/Streamlit_DataScience_Apps/blob/master/EDA_app_with_Streamlit_Components/app.py
to use that file as the basis for the even more stripped-down version linked above.