Linguist takes the list of languages it knows from languages.yml and uses a number of methods to try to determine the language used by each file, and the overall repository breakdown.
Linguist starts by going through all the files in a repository and excluding any file that it determines to be binary data, vendored code, generated code, or documentation, or whose language is defined as data (e.g. SQL) or prose (e.g. Markdown), whilst taking into account any overrides.
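
The exclusion step described above can be pictured as a simple filter. The following is a minimal sketch in Python, not Linguist's actual code: the `FileInfo` record, its field names, and the single `detectable_override` flag are all hypothetical, chosen only to illustrate how an override might take precedence over the default exclusions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileInfo:
    """Hypothetical per-file record; the field names are illustrative only."""
    path: str
    binary: bool = False
    vendored: bool = False
    generated: bool = False
    documentation: bool = False
    language_type: str = "programming"  # e.g. "programming", "data", "prose"
    # Assumed override: True forces inclusion, False forces exclusion,
    # None means no override was set for this file.
    detectable_override: Optional[bool] = None


def counts_towards_stats(f: FileInfo) -> bool:
    """Return True if the file should be kept for language detection."""
    # An explicit override wins over every default rule.
    if f.detectable_override is not None:
        return f.detectable_override
    # Binary, vendored, generated, and documentation files are excluded.
    if f.binary or f.vendored or f.generated or f.documentation:
        return False
    # Data (e.g. SQL) and prose (e.g. Markdown) languages are excluded.
    return f.language_type not in ("data", "prose")
```

Under these assumptions, `counts_towards_stats(FileInfo("schema.sql", language_type="data"))` is `False`, while the same file with `detectable_override=True` is kept.
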
If an explicit language override has been used, that language is used for the matching files. The language of each remaining file is then determined using the following strategies, in order, with each step either identifying the precise language or reducing the number of likely languages passed down to the next strategy:
- Vim or Emacs modeline,
- commonly used filename,
- shell shebang,
- file extension,
- XML header,
- man page section,
- heuristics,
- naïve Bayesian classification.
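
The cascade above, where each strategy either settles on one language or narrows the candidates for the next, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not Linguist's implementation: only three toy strategies are shown, and the lookup table, the function names, and the rule that an empty result leaves the candidates unchanged are all assumptions made for the example.

```python
import os
from typing import Callable, List, Optional

# A strategy receives the filename, the content, and the current candidates,
# and returns the languages it considers possible (possibly none).
Strategy = Callable[[str, str, List[str]], List[str]]

# Toy extension table (hypothetical, far smaller than any real one).
EXTENSIONS = {".py": ["Python"], ".h": ["C", "C++", "Objective-C"]}


def by_shebang(filename: str, content: str, candidates: List[str]) -> List[str]:
    first_line = content.split("\n", 1)[0]
    if first_line.startswith("#!") and "python" in first_line:
        return ["Python"]
    return []


def by_extension(filename: str, content: str, candidates: List[str]) -> List[str]:
    langs = EXTENSIONS.get(os.path.splitext(filename)[1], [])
    # If earlier strategies left candidates, only keep languages among them.
    if candidates:
        return [lang for lang in langs if lang in candidates]
    return langs


def by_heuristic(filename: str, content: str, candidates: List[str]) -> List[str]:
    # Toy content rules used to break ties between remaining candidates.
    if "@interface" in content:
        guess = "Objective-C"
    elif "#include <iostream>" in content:
        guess = "C++"
    else:
        return []
    return [guess] if guess in candidates else []


STRATEGIES: List[Strategy] = [by_shebang, by_extension, by_heuristic]


def detect(filename: str, content: str,
           strategies: List[Strategy] = STRATEGIES) -> Optional[str]:
    candidates: List[str] = []
    for strategy in strategies:
        result = strategy(filename, content, candidates)
        if len(result) == 1:
            return result[0]      # precise match: stop here
        if len(result) > 1:
            candidates = result   # narrowed list goes to the next strategy
        # An empty result leaves the candidates untouched (an assumption).
    return None                   # still ambiguous in this toy version
```

For example, `detect("foo.h", "@interface Foo\n@end\n")` is narrowed to three candidates by the extension and then resolved to `"Objective-C"` by the heuristic. A header containing only `int x;` stays ambiguous here and returns `None`; in the full list above, a final classification step would be expected to choose among the remaining candidates.
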
The result of this analysis is used to produce the language stats bar, which displays the language percentages for the files in the repository. The percentages are calculated based on the bytes of code for each language as reported by the List Languages API.
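
As a rough illustration of the byte-based calculation, the sketch below turns a mapping of language name to byte count into percentages. The function name, the rounding to one decimal place, and the sample numbers are assumptions made for this example; the exact rounding used by the stats bar is not specified here.

```python
from typing import Dict


def language_percentages(bytes_by_language: Dict[str, int]) -> Dict[str, float]:
    """Convert per-language byte counts into percentages of the total."""
    total = sum(bytes_by_language.values())
    if total == 0:
        return {}  # nothing detectable, so there is no breakdown to show
    # Largest language first, mirroring how a stats bar is usually read.
    ordered = sorted(bytes_by_language.items(), key=lambda kv: kv[1], reverse=True)
    return {lang: round(100 * size / total, 1) for lang, size in ordered}


# Made-up byte counts, shaped like a language-to-bytes mapping.
sample = {"C": 2500, "Ruby": 7500}
print(language_percentages(sample))  # {'Ruby': 75.0, 'C': 25.0}
```

Because the calculation uses bytes rather than file or line counts, one large file can outweigh many small ones.
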
When you push changes to a repository on GitHub.com, a low-priority background job is enqueued to analyze the default branch of your repository as explained above. The results of this analysis are cached for the lifetime of your repository and are only updated when the repository is updated. Because the analysis runs as a low-priority background job, it can take a while, particularly during busy periods, for your language statistics bar to reflect your changes.