This project originated as an Independent Research Topic Investigation for the Data Science for Public Affairs class at Indiana University.
In the realm of sports, few events capture the imagination and fervor of fans quite like cricket, and within cricket, the Indian Premier League stands as a colossus. Launched in 2008, the IPL has burgeoned into one of the world's most illustrious cricket leagues, bringing together international stars and emerging talent in a spectacle of sport that spans more than a month each year.
The data came as a collection of JSON files, each a detailed account of a single IPL match. Yet this bounty, while rich, was encased in the complexity of its format. Each file was a meticulous record: every delivery, run, wicket, and subtlety of the game was logged with precision, but in a form that, while perfect for machines, was a labyrinth for the uninitiated. The data dictionary within these files was extensive: balls_per_over, city, dates, match_type, outcome, teams, venue, and much more.
Raw JSON data requires parsing and transformation before it can be used effectively in analysis tasks.

The Python Script for Data Transformation

The script employs Python, renowned for its simplicity and for the powerful data manipulation and analysis libraries it supports. Here is a step-by-step breakdown of the script's operations, beginning with opening and reading the JSON files for each match.

Decision Against Apache Spark

While Apache Spark is a powerful tool for big data processing, it was deemed unnecessary for the scale of this dataset. The decision was driven by a preference for simplicity and the relatively moderate size of the data, which did not warrant Spark's distributed computing capabilities. The analogy used was "using a sword to cut garlic cloves," emphasizing the overkill Spark would represent in this context.
File Sorting: A preparatory step where we list and sort all JSON filenames.

New match_id Column: A match_id column serves as the key linking detailed batting data with the match summaries.

Highest Scorer Columns Initialization: New columns are initialized within the match summary DataFrame to store the names and scores of the highest scorers from each innings.

DataFrame Update Process: The script iterates through each match, using the match_id to look up the corresponding highest scores and update the DataFrame with this new information.
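The linkage that the match_id column provides can be sketched with a pandas merge. The frames and column names below are illustrative stand-ins, not the author's actual schema:

```python
import pandas as pd

# Illustrative match summaries; real rows also carry venue, teams, outcome, etc.
matches_df = pd.DataFrame({
    "match_id": ["m001", "m002"],
    "venue": ["Venue A", "Venue B"],
})

# Detailed batting results keyed by the same match_id
batting_df = pd.DataFrame({
    "match_id": ["m001", "m002"],
    "highest_scorer_1st_innings": ["Player X", "Player Y"],
    "highest_score_1st_innings": [92, 71],
})

# match_id is the key that ties detailed batting data to each summary row
matches_df = matches_df.merge(batting_df, on="match_id", how="left")
```

A left merge keeps every match summary row even if, for some reason, no batting detail exists for it.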
Each JSON file is also processed to extract data about extras; the process_extras function handles the extraction and aggregation of this extras data.

DataFrame Update: The extracted information about extras, both the total count and the detailed breakdown, is integrated into the match summary DataFrame, enriching it with this critical aspect of the game.
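The author's process_extras is not shown, so here is a minimal sketch under the assumption that each delivery in cricsheet's layout carries an optional extras mapping (e.g. wides, no-balls, byes, leg byes). It returns both the total and a per-type breakdown, matching the two pieces of information the text says get written back:

```python
def process_extras(data):
    """Tally extras across all innings of one parsed match dict.

    Returns (total_extras, breakdown), where breakdown maps an extra
    type such as 'wides' or 'legbyes' to the runs conceded from it.
    Assumes an innings -> overs -> deliveries structure.
    """
    total, breakdown = 0, {}
    for innings in data.get("innings", []):
        for over in innings.get("overs", []):
            for delivery in over.get("deliveries", []):
                for kind, runs in delivery.get("extras", {}).items():
                    total += runs
                    breakdown[kind] = breakdown.get(kind, 0) + runs
    return total, breakdown
```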
The data came in JSON format. While rich in content, covering every delivery, run, wicket, and player involved in each match, the format was far from user-friendly, especially for those not well versed in data science or programming. As we delved deeper, attempting to mold this unwieldy data into a form suitable for our project's goals, a broader mission crystallized.
The task was to transform the JSON-formatted ball-by-ball IPL match data into a structured, analysis-ready dataset. This transformation involved several steps, utilizing Python's robust data processing capabilities.

Data Source and Format

Our primary data source was cricsheet.org, which provides comprehensive ball-by-ball details of IPL matches in JSON format.
Opening and Reading JSON Files: Using Python's built-in json library, the script reads each IPL match's JSON file.

Data Extraction and Aggregation: The core of the script extracts essential match information, such as date, venue, teams, toss decisions, match outcome, and player of the match. It also computes aggregated metrics, like total runs and wickets for each team, by parsing the delivery details within each innings.

DataFrame Creation with pandas: The extracted and aggregated data for each match is then structured into a pandas DataFrame.
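A condensed sketch of these steps, assuming cricsheet's JSON layout (an info block plus an innings list); the field selection here is a small illustrative subset of what the author's script extracts:

```python
import json
import pandas as pd

def summarize_match(path):
    """Read one cricsheet-style match file and return a flat summary dict."""
    with open(path) as f:
        data = json.load(f)
    info = data["info"]
    row = {
        "date": info["dates"][0],
        "venue": info.get("venue"),
        "teams": " vs ".join(info["teams"]),
        "toss_winner": info["toss"]["winner"],
        "toss_decision": info["toss"]["decision"],
        "winner": info.get("outcome", {}).get("winner"),
        "player_of_match": info.get("player_of_match", [None])[0],
    }
    # Aggregate total runs and wickets per team from the delivery details
    for innings in data.get("innings", []):
        runs = wickets = 0
        for over in innings.get("overs", []):
            for delivery in over.get("deliveries", []):
                runs += delivery["runs"]["total"]
                wickets += len(delivery.get("wickets", []))
        row[f"{innings['team']}_runs"] = runs
        row[f"{innings['team']}_wickets"] = wickets
    return row

# One row per match file, collected into a DataFrame:
# matches_df = pd.DataFrame([summarize_match(p) for p in json_paths])
```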
Consolidating a season is then a short driver script:

# Consolidate the JSON files for a season
folder_path = r'D:\IPL Data\2008'
season_df = consolidate_season_data(folder_path)

# Save the DataFrame to a CSV file
csv_file_path = r'D:\IPL Data\2008\layers\match_summary_revised.csv'
season_df.to_csv(csv_file_path, index=False)
print(f"Match summary saved to {csv_file_path}")

Enhancing the Dataset with Individual Batting Performances

Data Organization and Preparation

To begin, we needed a structured approach to align our detailed batting data with the match summaries.
File Sorting: A preparatory step where we list and sort all JSON files in the directory.

process_season_jsons Function: Conducts a season-wide analysis, iterating through each file in the directory and applying get_batting_scores to extract and aggregate player scores. For each match, it identifies the highest scorer for each team and compiles these into a dictionary keyed by match_id for easy reference.

get_batting_scores Function: Extracts batting scores from the innings data within a single match's JSON file.
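The season-wide pass described above can be sketched as follows. The get_batting_scores body here is a minimal stand-in for the author's fuller implementation, and the directory is assumed to hold one JSON file per match, named by its match id:

```python
import json
import os

def get_batting_scores(data):
    """Per-team highest scorer for one parsed match: {team: (player, runs)}.
    Minimal stand-in assuming an innings -> overs -> deliveries layout."""
    best = {}
    for innings in data.get("innings", []):
        scores = {}
        for over in innings.get("overs", []):
            for d in over.get("deliveries", []):
                scores[d["batter"]] = scores.get(d["batter"], 0) + d["runs"]["batter"]
        if scores:
            best[innings["team"]] = max(scores.items(), key=lambda x: x[1])
    return best

def process_season_jsons(folder_path):
    """Map each match_id to its per-team highest scorers for the season."""
    season = {}
    for filename in sorted(os.listdir(folder_path)):
        if not filename.endswith(".json"):
            continue
        match_id = filename[:-len(".json")]
        with open(os.path.join(folder_path, filename)) as f:
            season[match_id] = get_batting_scores(json.load(f))
    return season
```

Keying the result by match_id is what later lets the highest scorers be written back into the correct rows of the match summary DataFrame.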
Next, the script reads each JSON file, calculates the highest scores, and updates the CSV. Four new columns are first initialized to None in matches_df, holding the highest scorer and their score for each of the two innings. The script then loops over each match_id in the list of JSON files, constructs the full path to the file, and extracts the innings data. Utilizing process_innings, we determine both the highest scores and the highest wicket counts.

Data Integration: The highest wicket-taker and their count are then integrated into the match summary DataFrame, updating the newly added columns with this crucial information.
For each match_id in json_files, the script opens the corresponding file with json.load and walks the innings list: it records the team name and innings number, calls process_innings to obtain per-player scores and wicket counts, and takes the maximum of each (guarding against empty innings) to find the highest scorer and the top wicket-taker. Finally, it locates the row of matches_df whose match_id matches and, for the first and second innings, writes the highest scorer, their score, the top wicket-taker, and their wicket count into the corresponding columns.
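The snippet in the original post is too garbled to recover verbatim, so here is a hedged reconstruction of this update step operating on an already-parsed match dict. The column names (highest_scorer_1, top_wickets_2, etc.) and the wicket-crediting rule are assumptions, not the author's exact choices:

```python
import pandas as pd

def process_innings(innings):
    """Aggregate batters' runs and bowlers' wickets for one innings,
    assuming cricsheet's innings -> overs -> deliveries layout."""
    scores, wickets = {}, {}
    for over in innings.get("overs", []):
        for d in over.get("deliveries", []):
            scores[d["batter"]] = scores.get(d["batter"], 0) + d["runs"]["batter"]
            for _ in d.get("wickets", []):
                # Simplification: every wicket is credited to the bowler;
                # a fuller version would exclude run-outs and the like.
                wickets[d["bowler"]] = wickets.get(d["bowler"], 0) + 1
    return scores, wickets

def update_match_row(matches_df, match_id, data):
    """Write each innings' top scorer and top wicket-taker into matches_df."""
    idx = matches_df[matches_df["match_id"] == match_id].index
    for i, innings in enumerate(data.get("innings", [])[:2], start=1):
        scores, wickets = process_innings(innings)
        if scores:  # guard against an innings with no deliveries
            player, runs = max(scores.items(), key=lambda x: x[1])
            matches_df.loc[idx, f"highest_scorer_{i}"] = player
            matches_df.loc[idx, f"highest_score_{i}"] = runs
        if wickets:
            bowler, count = max(wickets.items(), key=lambda x: x[1])
            matches_df.loc[idx, f"top_wicket_taker_{i}"] = bowler
            matches_df.loc[idx, f"top_wickets_{i}"] = count
```

Locating the row through a boolean filter on match_id, rather than relying on row order, keeps the update correct even if the summary DataFrame has been sorted or filtered.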