Shan Dou
MLND capstone project
July 2018
Link to proposal review: https://review.udacity.com/#!/reviews/1314525
Kaggle competition: "TalkingData AdTracking Fraud Detection Challenge".
To install the dependencies, run the following command in your shell terminal:
conda env create -f environment.yml
This creates a conda environment named mlnd_clean. To use a different name, open environment.yml and change the first line, name: mlnd_clean, to your preferred name.
Once all the dependencies are installed, run the following command to activate the environment:
source activate mlnd_clean
To deactivate the environment, run:
source deactivate
- The following modules are installed with conda:
1. numpy
2. pandas
3. seaborn
4. sklearn
5. xgboost
6. lightgbm
7. imblearn
8. notebook
- The module for stack ensembling is installed with pip:
9. mlens
For more information about mlens, please visit its website.
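Once the environment is active, one quick way to confirm that the modules listed above can be found is to look each one up without importing it. The following is a minimal sketch and not part of this repository: the helper name missing_modules is hypothetical, and the list simply repeats the import names given above.

```python
from importlib.util import find_spec

# Import names taken from the dependency list above.
REQUIRED_MODULES = [
    "numpy", "pandas", "seaborn", "sklearn", "xgboost",
    "lightgbm", "imblearn", "notebook", "mlens",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

If any names are reported as missing, re-creating the environment from environment.yml is the likely fix.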
- Jupyter notebooks:
  - MLNDcapstone_shandou_main.ipynb: Main workbook
  - MLNDcapstone_shandou_robustness.ipynb: Companion workbook for the models' robustness tests
- Python modules in ./customlib/:
  - ./customlib/preprocessing.py: data processing
  - ./customlib/modeling.py: modeling
  - ./customlib/utils.py: miscellaneous tasks such as visualization and generating result summary tables
- Dataset:
  The raw data train.csv can be downloaded directly from Kaggle. Because of file size limits, only downsized training data and a short excerpt of the original testing data are included in this repo.
  - train_sample.csv: 0.1% of the raw click records
  - train_sample_2.csv: 0.2% of the raw click records
  - test.csv: First 10 lines of the original test data downloaded from Kaggle. NOTE that test.csv provided by Kaggle is only used for checking data fields. In the actual implementation, the testing data is instead a portion of train_sample.csv or train_sample_2.csv
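Because the Kaggle test.csv is only used to check data fields, the evaluation data has to be carved out of one of the sampled training files. Below is a minimal sketch of one way to do that, assuming a simple random hold-out. The function names split_holdout and load_rows, the 20% fraction, and the seed are illustrative assumptions; the notebooks may split the data differently.

```python
import csv
import random


def split_holdout(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    # A dedicated Random instance keeps the split repeatable for a given seed.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```

Assuming train_sample.csv sits in the current working directory, the split would be obtained with train_rows, test_rows = split_holdout(load_rows("train_sample.csv")).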
- Proposals and reports:
  - proposal.pdf: Proposal of the capstone project
  - proposal_review.pdf: Comments from proposal review
  - report.pdf: Report of the capstone project
- Others:
  - folder images/: contains all the images used in the report
  - matplotlib style sheet stylelib/custom.mplstyle: dataviz styler used throughout this project
  - folder