INTRODUCTION
FIFA 21 is an association football simulation video game published by Electronic Arts as part of the FIFA series whose dataset is to be cleaned for analysis.
Being a messy data, a proper look into the dataset helped my decision in choosing Microsoft Excel for cleaning the dataset all for the sake of its flexibility in handling data as messy as the FIFA 21 dataset.
Especially putting into consideration certain columns across the dataset that had:
- Inconsistent data types
- Inconsistent units of value
- Inclusion of inappropriate characters
- Null/blank columns
- Irrelevant columns...
These and more are pointers to why I considered the dataset messy and why I opted for the data cleaning tool that I used.
ABOUT THE DATA
The messy dataset was gotten from KAGGLE. This dataset is made up of 18,979 records split across 77 fields. HERE is the description of each of the fields.
The raw/messy dataset: click here to view
THE DATA CLEANING APPROACH
- Duplication of the raw file to avoid a permanent loss of dataset in case of the execution of irrecoverable change(s).
- Removal of Special Characters across "name", "long name", "value", "contract", "wages", and "release clause" columns which needed to be replaced with their proper letter(s). I took cue from the profile URL column, to know the right letters that ought to be in the name and long name columns. While replacing the four remaining columns as earlier mentioned with blank.
- Data types across all columns were checked for accuracy, and "joined date" column as changed to date data type in order to permit analysis
- Removal of columns irrelevant for analysis. Columns such as: profile URL and photo URL were both removed to cut off redundancy as they would contribute next to nothing in analysis.
- Inconsistent units of value across columns such as weight (kg and lbs), height (cm, ft, and inches) were corrected using the standard footballing units: ft and inches for height, while lbs for weight.
- The POT, BOV and OVA rating were converted to percentage data type as that is the standard for reading such attributes in the football domain.
- Filling up Loan Date column for records where it's not applicable with "N/A"
- Finally, the inclusion of M and K in the Value, Wage and Release Clause columns to represent millions and thousands was well attended to by first of replace the letters with blank, and then multiplied with appropriate Numerical values and converting them to currency data type. Thereby, making them fit for analysis.
The cleaned dataset: click here to view
CONCLUSION AND APPRECIATION
Taking on this project, though challenge in some ways has further increased my data-cleaning prowess as a Data Analyst, affording me the opportunity to learn while on it.
Thanks for sparing your time to read through. I sincerely appreciate.