Introducing the Cricsheet Register, and a new data format, JSON

Posted: 22nd of July, 2021

Today I’m finally ready to release some of the biggest updates I’ve ever introduced to Cricsheet, along with an accompanying set of smaller changes, some of which are driven by these updates, and some of which are tweaks for the future. After much rewriting, data validation, experimentation, time, and leg-work I’m finally able to release Cricsheet data in JSON format, along with various additions, and the Cricsheet Register, which provides mappings between the various identifiers used for players and officials across various sites.

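If you’d like to skip to a particular part feel free, as some changes will interest some people more than others. The sections I’ll be covering here are:

  • The Cricsheet Register
  • New default file format - JSON
  • Updates to the existing formats
  • The future of the YAML format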

Let’s start with the Cricsheet Register…

The Cricsheet Register

There are two questions I receive more often than any others with regard to players involved in matches. Firstly, would it be possible to add player lists to the match data, so that we can know that a player took part in a match even if they didn’t bat, bowl, or take a catch? Secondly, could I add player ids to the data to make it easier to distinguish between players with similar (or the same) names? Both are good questions, and the additions I’m releasing today take care of both of those issues. I’ll cover the details as they apply to match data files later; for now, let me introduce you to the Cricsheet Register.

What is the Cricsheet Register? It’s a collection of CSV files that provides a unique identifier for players and officials featured in Cricsheet data files, along with details on identifiers used for those people across multiple cricket-related websites/sources. As I write this the Register contains details for 7,190 people, with 10,402 identifiers from various sources. The sources for which I currently provide identifiers are:

  • Big Bash
  • Cricbuzz
  • CricHQ
  • Cricingif
  • CricketArchive
  • ESPNcricinfo
  • Opta
  • Pulse

There are 2 files in the Register, people.csv and names.csv. The main file is people.csv, which contains a single row for each person covered by the data, with a unique identifier assigned by Cricsheet, along with a unique name, and any identifiers from external sites that I’ve used. Every entry in the Register has a Cricinfo identifier, with other identifiers sourced as part of my work on the Cricket Scorecard Accuracy Project.

As an example, consider the following row from the data. This entry shows 4 external identifiers for Joe Root (at Cricinfo, Cricingif, CricketArchive, and Pulse), with gaps where we don’t have data. For some of those gaps there won’t be data (for example for a second Cricinfo identifier), but for others it’s just missing right now (such as not having a Cricbuzz identifier due to never having needed it). The identifier (a343262c) is unique to Root, and will be used for him in every file.

a343262c,JE Root,JE Root,,,,303669,,11238,204606,,,,887,
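
A row like the one above can be picked apart with Python's standard csv module. This is only a minimal sketch: it assumes the first three columns are the Cricsheet identifier and the two name columns, and that every later column holds an external identifier (empty where none is known). The function name split_person_row is made up for illustration, and the sketch does not try to say which site each column belongs to.

```python
import csv
import io

# The example row quoted above, exactly as it appears in the post.
ROW = "a343262c,JE Root,JE Root,,,,303669,,11238,204606,,,,887,"


def split_person_row(line):
    """Split one people.csv row into (cricsheet_id, name, unique_name, external_ids)."""
    fields = next(csv.reader(io.StringIO(line)))
    # Assumption: the first three columns are the Cricsheet ID and two names.
    cricsheet_id, name, unique_name = fields[:3]
    # Assumption: every remaining column is an external identifier,
    # left empty when no identifier is known for that source.
    external_ids = [value for value in fields[3:] if value]
    return cricsheet_id, name, unique_name, external_ids


cricsheet_id, name, unique_name, external_ids = split_person_row(ROW)
print(cricsheet_id, name, len(external_ids))
```

For real use, reading the file with csv.DictReader and its header row would be the safer way to tell the sources apart.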

The names.csv file simply contains any extra names I’ve encountered for each person. Generally it will contain the slight variations of names, with initials or full names most common, although there will be different surnames for some (such as when players change names after marrying, or change names for other reasons).

The Cricsheet Register is heavily inspired by the Chadwick Register, which does something similar (but more extensive) for baseball.

New default file format - JSON

From today, the default format for Cricsheet data is JSON (with a version of 1.0.0), rather than YAML. All matches are provided in the new JSON format, and as part of that change will include some new data fields. The YAML format isn’t going away anytime soon, so you don’t need to move to JSON right now, however the JSON format does include information that isn’t available in any of the other formats.

I’ll provide a quick summary of the new data available in the JSON, but you can read the full details of the new JSON format on its own page in the Format section of the site.

The main additions
registry

With the addition of the Cricsheet Register (covered above) it should come as no surprise that data from it will make an appearance in the new format. Specifically the info section has a registry section which lists all of the people involved in the match, whether players or officials, along with their Cricsheet ID. The names included in the registry are those that will be used for each person throughout the match data file. This means that even if the names of people change it will still be easy to correctly identify them across matches. The details of the registry section are in the documentation for those who are curious so, rather than spoiling that for you, here’s a small example to show how it appears.

"registry": {
  "people": {
    "AJ Finch": "b8d490fd",
    "AJ Turner": "ff1e12a0",
    "AJ Tye": "7c7d63a2",
    "BKG Mendis": "5d1e7582",
    "BR Dunk": "272d796e",
    "CK Kapugedera": "cfad138c",
    "DAS Gunaratne": "770494eb",
    "EMDY Munaweera": "5a22d91c",
    "JA Richardson": "1ee08e9a",
    "JJ Crowe": "2e760301",
    "JP Faulkner": "808f425a",
    "JRMVB Sanjaya": "530b20e3",
    "KMDN Kulasekara": "469ea22b",
    "M Klinger": "b970a03f",
    "MC Henriques": "32198ae0",
    "MW Graham-Smith": "18aca3ce",
    "N Dickwella": "45963d9e",
    "P Wilson": "68304a36",
    "PJ Cummins": "ded9240e",
    "S Prasanna": "f78e7113",
    "SD Fry": "6b725ed1",
    "SJ Nogajski": "9b3f9323",
    "SL Malinga": "a12e1d51",
    "TAM Siriwardana": "bf7842c9",
    "TD Paine": "5748e866",
    "TM Head": "12b610c2",
    "WU Tharanga": "7ed9fd56"
  }
}
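
Looking someone up in the registry is a plain dictionary access once the file is parsed. The sketch below uses a trimmed copy of the example above, nested under info as described; the helper name person_id is hypothetical.

```python
import json

# A trimmed copy of the registry example, nested under "info".
MATCH_JSON = """
{
  "info": {
    "registry": {
      "people": {
        "AJ Finch": "b8d490fd",
        "SL Malinga": "a12e1d51",
        "SD Fry": "6b725ed1"
      }
    }
  }
}
"""


def person_id(match, name):
    """Return the Cricsheet ID for a name used in this match, or None if absent."""
    return match["info"]["registry"]["people"].get(name)


match = json.loads(MATCH_JSON)
print(person_id(match, "AJ Finch"))
```

Because the ID stays the same from file to file, it is a safer key than the name when combining data from several matches.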

players

Another addition to the info section is the players field. It lists, for each team, the names of the players officially involved in the match, including the starting eleven, and any supersubs or concussion substitutes. This has been one of the most frequent requests I’ve received over the years so I’m glad to finally release it. An example of the data provided is…

"players": {
  "Australia": [
    "M Klinger",
    "AJ Finch",
    "BR Dunk",
    "MC Henriques",
    "TM Head",
    "AJ Turner",
    "JP Faulkner",
    "TD Paine",
    "PJ Cummins",
    "AJ Tye",
    "JA Richardson"
  ],
  "Sri Lanka": [
    "N Dickwella",
    "WU Tharanga",
    "EMDY Munaweera",
    "BKG Mendis",
    "DAS Gunaratne",
    "TAM Siriwardana",
    "CK Kapugedera",
    "S Prasanna",
    "KMDN Kulasekara",
    "SL Malinga",
    "JRMVB Sanjaya"
  ]
}
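
The players field answers the "did this player take part?" question directly, and it can be combined with the registry to get IDs for a whole side. A minimal sketch, using a cut-down info section built from the examples above; team_of and squad_ids are illustrative names, not part of any Cricsheet tooling.

```python
# A cut-down "info" section built from the examples in this post.
INFO = {
    "players": {
        "Australia": ["M Klinger", "AJ Finch"],
        "Sri Lanka": ["N Dickwella", "WU Tharanga"],
    },
    "registry": {
        "people": {
            "M Klinger": "b970a03f",
            "AJ Finch": "b8d490fd",
            "N Dickwella": "45963d9e",
            "WU Tharanga": "7ed9fd56",
        }
    },
}


def team_of(info, player):
    """Return the team a player was listed for, or None if not in the match."""
    for team, names in info["players"].items():
        if player in names:
            return team
    return None


def squad_ids(info, team):
    """Return the Cricsheet IDs for a team's listed players, in listed order."""
    people = info["registry"]["people"]
    return [people[name] for name in info["players"][team]]


print(team_of(INFO, "AJ Finch"), squad_ids(INFO, "Sri Lanka"))
```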

powerplays

If an innings had any powerplays then there will be a powerplays field in the innings data containing information on each powerplay that took place. All entries show the type of powerplay along with the start and end deliveries for the powerplay. The following example shows the powerplays for an innings with 2 powerplays, one of which was chosen by the batting team:

"powerplays": [
  {
    "from": 0.1,
    "to": 9.6,
    "type": "mandatory"
  },
  {
    "from": 35.1,
    "to": 39.6,
    "type": "batting"
  }
]
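
One use for this data is checking whether a given delivery fell inside a powerplay. The sketch below assumes the from and to values are in over.ball notation (so 9.6 means over 9, ball 6) and that both ends are inclusive; parse_delivery and powerplay_type are hypothetical helper names.

```python
POWERPLAYS = [
    {"from": 0.1, "to": 9.6, "type": "mandatory"},
    {"from": 35.1, "to": 39.6, "type": "batting"},
]


def parse_delivery(value):
    """Turn an over.ball value such as 9.6 into an (over, ball) tuple."""
    # Assumption: the digits after the point are the ball number within the
    # over, so the value is split as text rather than treated as a fraction.
    over_text, ball_text = str(value).split(".")
    return int(over_text), int(ball_text)


def powerplay_type(powerplays, over, ball):
    """Return the type of powerplay in force for a delivery, or None."""
    for powerplay in powerplays:
        start = parse_delivery(powerplay["from"])
        end = parse_delivery(powerplay["to"])
        # Tuples compare over first, then ball; both ends are inclusive.
        if start <= (over, ball) <= end:
            return powerplay["type"]
    return None


print(powerplay_type(POWERPLAYS, 5, 3))
```

Comparing (over, ball) tuples avoids floating-point comparisons on values that only look like decimals.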

review

"review": {
  "by": "Pakistan",
  "umpire": "AG Wharf",
  "batter": "Saud Shakeel",
  "decision": "struck down",
  "umpires_call": true
}

missing

"missing": [
  "player_of_match",
  {
    "powerplays": {
      "1": [
        "batting"
      ],
      "2": [
        "batting"
      ]
    }
  },
  "reviews"
]

Other changes

A number of smaller additions and changes from the YAML are also appearing in the JSON format, including…

balls_per_over

This, unsurprisingly, specifies the number of balls expected within an over in a match. For every match ever published so far by Cricsheet, this is 6. I’m adding this now with a view to supporting The Hundred, and, theoretically, matches of the past should I ever receive ball-by-ball data for them.

event

This is an attempt to improve on the existing competition entry in the YAML format, by providing more information and extending its presence to matches beyond the domestic T20 competitions. If this field is provided it will always contain a name entry, which will be familiar to users of competition. However, it’s also possible that the other fields (group, match_number, and stage) will be provided (where relevant). It’s probably best to show this by example. The following is match number 18 of the World T20 in group 1 of the Super 10 stage.

"event": {
  "group": 1,
  "match_number": 18,
  "name": "World T20",
  "stage": "Super 10"
}
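
Since only name is guaranteed to be present, code reading the event field should treat the other entries as optional. A minimal sketch that builds a readable label; the function name describe_event and the label layout are just illustrative choices.

```python
EVENT = {
    "group": 1,
    "match_number": 18,
    "name": "World T20",
    "stage": "Super 10",
}


def describe_event(event):
    """Build a label from an event entry, skipping any fields that are absent."""
    parts = [event["name"]]  # name is always present when event is provided
    if "stage" in event:
        parts.append(str(event["stage"]))
    if "group" in event:
        parts.append(f"Group {event['group']}")
    if "match_number" in event:
        parts.append(f"Match {event['match_number']}")
    return ", ".join(parts)


print(describe_event(EVENT))
```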

officials

This expands on the YAML data’s existing umpires field to also include the names of reserve umpires, tv umpires, and match referees, where I know them. An example of a fully populated entry might be:

"officials": {
  "match_referees": [
    "JJ Crowe"
  ],
  "reserve_umpires": [
    "MW Graham-Smith"
  ],
  "tv_umpires": [
    "P Wilson"
  ],
  "umpires": [
    "SD Fry",
    "SJ Nogajski"
  ]
}
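
Because these roles are only filled in where known, it is safest to treat each list as optional when reading the field. The sketch below flattens the example into (role, name) pairs; the role names come from the example above, while officials_by_role is a hypothetical helper.

```python
OFFICIALS = {
    "match_referees": ["JJ Crowe"],
    "reserve_umpires": ["MW Graham-Smith"],
    "tv_umpires": ["P Wilson"],
    "umpires": ["SD Fry", "SJ Nogajski"],
}

# The four role keys shown in the example above.
ROLES = ("umpires", "tv_umpires", "reserve_umpires", "match_referees")


def officials_by_role(officials):
    """Return (role, name) pairs, tolerating roles that are missing."""
    return [
        (role, name)
        for role in ROLES
        for name in officials.get(role, [])
    ]


for role, name in officials_by_role(OFFICIALS):
    print(role, name)
```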

season

The season the match was played in, as commonly defined in the cricket world. Right now the season is 2021, whereas last winter it was 2020/21.
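
Since a season can be a single year (2021) or span two (2020/21), code that sorts or groups by season needs to handle both shapes. A small sketch, assuming the value may arrive as either a string or a number; both helper names are made up for illustration.

```python
def season_start_year(season):
    """Return the first calendar year of a season such as "2020/21" or 2021."""
    # str() lets the same code accept a number or a string.
    return int(str(season).split("/")[0])


def is_split_season(season):
    """Return True for seasons that span two calendar years, e.g. "2020/21"."""
    return "/" in str(season)


print(season_start_year("2020/21"), is_split_season("2020/21"))
```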

Updates to the existing formats

I may be moving to JSON as the new default format for Cricsheet match data but that doesn’t mean that I’m abandoning the existing formats. The YAML, CSV, and XML formats are all receiving updates to include some of the new information included in the JSON files, and in the case of the “Ashwin” CSV receiving a new file for each match.

Addition of Info file for “Ashwin” CSV

I’ve added a new file to contain match information in the “Ashwin” CSV format. The files are named <id>_info.csv (to accompany the existing <id>.csv ball-by-ball files) and provide detailed information for the match. The existing ball-by-ball files are untouched, and you can use them as you currently do.

The content of the new info files is exactly the same as the info rows in the “Original” CSV format.
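
If you process a directory of “Ashwin” files, the naming scheme makes it easy to pair each ball-by-ball file with its info file, and to avoid treating an info file as ball-by-ball data. A minimal sketch using pathlib; the match id 1234 and the matches/ directory are made up for illustration, as are the function names.

```python
from pathlib import Path


def info_file_for(ball_by_ball_file):
    """Return the <id>_info.csv path that accompanies an <id>.csv file."""
    path = Path(ball_by_ball_file)
    return path.with_name(f"{path.stem}_info.csv")


def is_info_file(path):
    """Return True if the path follows the <id>_info.csv naming scheme."""
    return Path(path).name.endswith("_info.csv")


# Hypothetical file name, used only to show the pairing.
print(info_file_for("matches/1234.csv"))
```

When globbing for *.csv, filtering with is_info_file keeps the two kinds of file apart.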

General additions

All of the existing formats, along with the new “Ashwin” info files are receiving some additions based on data in the JSON format. Specifically, all of them now have new entries in their info sections, with balls_per_over and registry being added to each, and all also now having player lists for each team (called players in YAML and CSV, and lineups in XML).

The registry and players/lineups additions contain the same data as included in the JSON, with just the structures varying to fit with the existing formats. You can see the details for each of these additions in the relevant Format pages.

Version number changes

With all of these changes it should come as no surprise that the version numbers for each of the existing data formats are changing. YAML and XML are changing from 0.9 to 0.91, “Original” CSV from 1.5.0 to 1.6.0, and “Ashwin” CSV from 2.0.0 to 2.1.0. Someday I may consolidate, and bring consistency to, the various version numbers but that’s not a task for right now!

The future of the YAML format

With the move to JSON as the default format for Cricsheet data it’s time to consider the future of the Cricsheet YAML format. First things first, the YAML format will continue to be provided, without any further additions to the format, for the foreseeable future. It will eventually be phased out, but it won’t be soon, and I will give at least 6 months’ advance warning when I do finally set a date.

I’m also aware of some projects that rely heavily on the YAML data, so I don’t plan to remove it until they have the chance to move to the JSON format. If you feel like you’re one of these feel free to get in touch so that I’m aware of you!

Summary

I hope that these changes will be useful to people, and provide interesting new data to work with. Do get in touch if you notice any strangeness, or issues, as this involved a substantial amount of work and I’ll be amazed if I’ve carried it out without something going awry.