The Sloc Cloc and Code (scc) command line tool that I wrote, which is now maintained and supported by many great people, counts lines of code, comments, and estimates the complexity of files inside a directory. The complexity estimate works by counting branching operators in the code, but it needs context to be useful: a statement like "this file has complexity 10" means little on its own. To get that context, I decided to run scc over all the source code on the internet. Doing so would also surface edge cases I had not considered in the tool itself. A powerful brute-force test.
But running the test over all the source code in the world takes a lot of computing power, which made it an interesting exercise in its own right. So I decided to write everything down, and that is how this article came about.
In short, I downloaded and processed a lot of source code.
The raw numbers:
- 9,985,051 repositories in total
- 9,100,083 repositories with at least one file
- 884,968 empty repositories (no files)
- 3,500,000,000 files across all repositories
- 40,736,530,379,778 bytes processed (40 TB)
- 1,086,723,618,560 lines identified
- 816,822,273,469 lines recognized as code
- 124,382,152,510 blank lines
- 145,519,192,581 lines of comments
- Total complexity according to scc's rules: 71,884,867,919
- 2 new bugs found in scc
One small confession: there are not quite 10 million projects, as the headline claims. I fell about 15,000 short, so I rounded up. My apologies.
It took about five weeks to download everything, run scc over it, and store all the data, then a little over 49 hours to process the 1 TB of JSON and produce the results below.
Also note that I may have made mistakes in some of the calculations. I will point out any errors promptly if they are found, and I have provided the dataset so you can check for yourself.
Methodology
Since launching searchcode.com I have accumulated a collection of more than 7,000,000 projects across git, mercurial, subversion and so on. So why not process them? Working with git is usually the easiest, so this time I ignored mercurial and subversion and exported the complete list of git projects. It turns out I was actually tracking 12 million git repositories, so I should probably update the home page to reflect that.
So now I have 12 million git repositories to download and process.
When you run scc you can ask for JSON output and have it written to disk:
scc --format json --output myfile.json main.go
The results are as follows (for a single file):
[ { "Blank": 115, "Bytes": 0, "Code": 423, "Comment": 30, "Complexity": 40, "Count": 1, "Files": [ { "Binary": false, "Blank": 115, "Bytes": 20396, "Callback": null, "Code": 423, "Comment": 30, "Complexity": 40, "Content": null, "Extension": "go", "Filename": "main.go", "Hash": null, "Language": "Go", "Lines": 568, "Location": "main.go", "PossibleLanguages": [ "Go" ], "WeightedComplexity": 0 } ], "Lines": 568, "Name": "Go", "WeightedComplexity": 0 } ]
For a larger example, see the results for the redis project: redis.json. All the results below were produced from output like this, with no additional data.
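If you want to work with that output programmatically, a minimal Go sketch along the following lines will read it; the struct only declares the fields I use here, and the file name is the one from the command above:

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// LanguageSummary declares just the parts of the scc JSON output used here.
type LanguageSummary struct {
	Name       string
	Code       int64
	Comment    int64
	Blank      int64
	Complexity int64
	Lines      int64
	Count      int64
	Files      []struct {
		Filename string
		Location string
		Lines    int64
		Code     int64
	}
}

func main() {
	raw, err := os.ReadFile("myfile.json")
	if err != nil {
		panic(err)
	}
	var summary []LanguageSummary
	if err := json.Unmarshal(raw, &summary); err != nil {
		panic(err)
	}
	for _, lang := range summary {
		fmt.Printf("%s: %d files, %d lines of code\n", lang.Name, lang.Count, lang.Code)
	}
}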
Keep in mind that scc usually classifies languages by file extension (except where an extension is shared, as with Verilog and Coq, which both use .v). So if you save an HTML file with a .java extension, it will be counted as Java. Usually this is not a problem, because why would you do that? But at scale the problem becomes noticeable, and I discovered later that some files were indeed hiding behind the wrong extension.
Some time ago I wrote the code for generating scc-based GitHub badges. Since that process needed to cache results anyway, I tweaked it slightly to cache them as JSON in AWS S3.
With the badge code running on AWS Lambda, I took the exported list of projects, wrote about 15 lines of Python to clean the format into something my lambda expected, and started firing requests at it. Using Python multiprocessing I parallelized the requests across 32 processes, enough to keep the endpoint busy.
Everything worked brilliantly. The problem was, firstly, the cost, and secondly, that Lambda behind API Gateway / ALB has a 30-second timeout, so it could not process large repositories fast enough. I knew this was not the most economical solution, but I figured the price would land around $100, which I could live with. After processing a million repositories I checked: the cost was about $60. Since I was not thrilled by the prospect of a final AWS bill of $700, I decided to reconsider. Keep in mind that this was basically just the storage and CPU used to collect the information; anything that processed or exported the data would have increased the price considerably.
Since I was already on AWS, a quick solution would have been to dump the URLs as messages into SQS and have EC2 or Fargate instances pull them off for processing, then scale like crazy. But despite using AWS every day, I have always believed in the principles of Taco Bell programming. Besides, there were only 12 million repositories, so I decided to implement a simpler (and cheaper) solution.
Running the computation locally was not an option because of the terrible internet in Australia. However, searchcode.com runs on dedicated servers from Hetzner that I use rather conservatively. These are fairly powerful i7 quad-core machines with 32 GB of RAM, often with 2 TB of (mostly unused) disk space, so they usually have plenty of spare computing power. For example, the front-end server spends most of its time calculating the square root of zero. So why not run the processing there?
This was not quite Taco Bell programming, since I used Go rather than bash and the gnu tools. I wrote a simple Go program that runs 32 goroutines reading from a channel, spawning git and scc subprocesses, and writing the JSON output to S3. I actually wrote the solution in Python first, but having to install pip dependencies on my otherwise clean server seemed like a bad idea, and it kept breaking in strange ways I did not want to debug.
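The heart of that program is the standard worker-pool pattern. A trimmed-down sketch of the idea follows; the shallow clone, the temporary directories and the store function are my simplifications rather than the exact code I ran:

package main

import (
	"fmt"
	"os"
	"os/exec"
	"sync"
)

func main() {
	processRepos([]string{"https://github.com/boyter/scc"})
}

// processRepos starts 32 workers that read repository URLs from a channel,
// clone each one, run scc over the checkout and hand the JSON off for storage.
func processRepos(repoURLs []string) {
	jobs := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < 32; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for url := range jobs {
				dir, err := os.MkdirTemp("", "repo")
				if err != nil {
					fmt.Println("tempdir failed:", err)
					continue
				}
				if err := exec.Command("git", "clone", "--depth", "1", url, dir).Run(); err != nil {
					fmt.Println("clone failed:", url, err)
					os.RemoveAll(dir)
					continue
				}
				out, err := exec.Command("scc", "--format", "json", dir).Output()
				if err != nil {
					fmt.Println("scc failed:", url, err)
					os.RemoveAll(dir)
					continue
				}
				store(url, out)
				os.RemoveAll(dir)
			}
		}()
	}

	for _, url := range repoURLs {
		jobs <- url
	}
	close(jobs)
	wg.Wait()
}

// store stands in for the real upload of the JSON results to S3.
func store(url string, payload []byte) {
	fmt.Printf("would store %d bytes of JSON for %s\n", len(payload), url)
}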
Running all this on the server produced htop output like the screenshot below, and the several git / scc processes running (scc does not appear in the screenshot) suggested everything was working as expected, which the results landing in S3 confirmed.
Presentation and calculation of results
I recently read these articles and had the idea of borrowing their format for presenting the information. However, I also wanted to add jQuery DataTables to the large tables so you can sort and search / filter the results. So in the original article you can click on the headings to sort and use the search field to filter.
The size of the data raised another question: how do you process 10 million JSON files taking up a little over 1 TB of disk space in an S3 bucket?
My first thought was AWS Athena. But since it would cost something like $2.50 per query against such a dataset, I quickly started looking for an alternative. That said, if you keep the data there and only query it rarely, this might well be the cheapest option.
I posted the question in the company chat, because why solve problems alone?
One idea was to dump the data into a large SQL database and then run queries against it repeatedly. But the structure of the data means multiple tables, which means foreign keys and indexes to get any reasonable level of performance. That felt wasteful, because we could just process the data as we read it from disk, in a single pass. I was also worried about building such a large database: with the data alone it would be over 1 TB before adding any indexes.
Seeing how simply I had produced the JSON, I figured: why not process the results the same way? There is one problem, of course. Pulling 1 TB of data out of S3 costs real money, and if the program crashes it gets annoying. To keep costs down, I wanted to pull all the files down locally and keep them for further processing. One piece of advice: do not store lots of small files in a single directory. It is terrible for runtime performance, and file systems do not like it.
My answer was another simple Go program to pull the files from S3 and pack them into a tar file, which I could then process over and over. The processing itself is done by a very ugly Go program that works through the tar file, so I can re-run queries without pulling the data from S3 again and again. I did not bother with goroutines here, for two reasons. First, I did not want to max out the server, so I limited myself to one core for the CPU-heavy work (another was mostly busy reading the tar file). Second, I wanted to guarantee thread safety.
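Reading the results back is then just a matter of streaming the archive entry by entry and unmarshalling each JSON file. A minimal sketch of that loop, assuming the archive is called results.tar and keeping only the fields needed for a single total:

package main

import (
	"archive/tar"
	"encoding/json"
	"fmt"
	"io"
	"os"
)

func main() {
	f, err := os.Open("results.tar") // assumed archive name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var totalCode int64
	tr := tar.NewReader(f)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		if hdr.Typeflag != tar.TypeReg {
			continue
		}
		var summary []struct {
			Name string
			Code int64
		}
		if err := json.NewDecoder(tr).Decode(&summary); err != nil {
			continue // skip repositories whose JSON did not parse
		}
		for _, lang := range summary {
			totalCode += lang.Code
		}
	}
	fmt.Println("total lines of code:", totalCode)
}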
Once that was done, I needed a set of questions to answer. Again I leaned on the collective mind and roped in my colleagues while I came up with ideas of my own. The result of that merging of minds is presented below.
You can find all the code I used to process the JSON, including the code for local processing, and the ugly Python script I used to turn it into something useful for this article. Please do not comment on it; I know the code is ugly, and it was written for a one-off task I am unlikely to ever look at again. If you want to see code I wrote for others to use, look at the scc sources.
Cost
I spent about $60 on compute while experimenting with Lambda. I have not looked at the S3 storage cost yet, but it should be close to $25, depending on the size of the data. This does not include transfer costs, which I also did not track. Note that I emptied the bucket when I was done with it, so this is not an ongoing cost.
But I abandoned AWS after a while anyway. So what would it really cost to do this again?
All the software involved is free and open source, so there is nothing to pay for there.
In my case the cost was effectively zero, since I used the "spare" computing power left over from searchcode.com. Not everyone has that lying around, though, so let's assume someone else wants to repeat this and has to rent a server.
That can be done for €73 using the cheapest new dedicated server from Hetzner, including the setup fee. If you are willing to wait and dig through the server auction section, you can find much cheaper servers with no setup fee. At the time of writing I found a machine perfectly suited to this project for €25.21 a month, with no setup fee.
Even better, outside the European Union the VAT comes off that price, so knock off another 10%.
So if you stood this up from scratch using my software, it would cost at most $100, and more likely under $50 if you are a little patient or lucky. That assumes you use the server for less than two months, which is enough to download and process everything, and also enough time to put together a list of 10 million repositories.
If I had compressed the tar file (which is really not that hard to do), I could have processed roughly ten times as many repositories on the same machine, and the resulting file would still be small enough to fit on the same disk. The whole process would take several months though, since the downloading would take longer.
To go far beyond 100 million repositories, however, some form of sharding would be needed. Still, it is safe to say you could repeat this process at my scale, or considerably larger, on the same hardware without much effort or any code changes.
Data sources
Here is how many projects came from each of the three sources: github, bitbucket and gitlab. Note that this is before excluding empty repositories, so the total exceeds the number of repositories actually processed and counted in the tables that follow.
My apologies to the GitHub / Bitbucket / GitLab teams if you are reading this. If my script caused you any problems (though I doubt it), I owe you a drink of your choice next time we meet.
How many files are in the repository?
On to the real questions. Let's start with a simple one: how many files are in an average repository? Do most projects have just a couple of files, or many more? After looping through the repositories, we get the following chart:
Here the X axis shows buckets by file count and the Y axis shows the number of projects in each bucket. The horizontal axis is capped at a thousand files, because beyond that the chart hugs the axis.
It looks like most repositories have fewer than 200 files.
But what about plotting up to the 95th percentile to get the real picture? It turns out the vast majority of projects (95%) have fewer than 1,000 files, while 90% have fewer than 300 and 85% have fewer than 200.
If you want to build the chart yourself, and perhaps do a better job than I did, here is a link to the raw data in JSON.
What is the language breakdown?
For each project, every language is counted at most once: if a Java file is identified we increment the Java count for that project by one, and any further Java files are ignored. This gives a quick picture of which languages are most commonly used. Unsurprisingly, the most common ones include markdown, .gitignore and plaintext.
Markdown is the most commonly used language of all, appearing in more than 6 million projects, about two thirds of the total. This makes sense, since nearly every project includes a README.md, which is rendered as HTML on the repository page.
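Counted this way, the tally is simple set logic: each repository adds at most one to each language it contains. A small sketch, reusing the LanguageSummary type from the parsing example earlier (scc emits one summary per language per repository, so no extra de-duplication is needed):

// languagePresence counts how many projects contain at least one file in
// each language; every project adds at most one to a language's total.
func languagePresence(projects [][]LanguageSummary) map[string]int64 {
	counts := map[string]int64{}
	for _, project := range projects {
		for _, lang := range project {
			counts[lang.Name]++
		}
	}
	return counts
}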
How many files are in the repository by language?
This builds on the previous table, but averages the number of files per language within a repository. In other words, across all projects that contain any Java, how many Java files does each have on average?
How many lines of code in a typical language file?
I still think it is interesting to see which languages have the largest files on average. Using the arithmetic mean produces absurdly high numbers because of projects that bundle something like sqlite.c, which merges many files into a single one that nobody ever works on directly (I hope!).
So I went with the median instead. However, languages with absurdly high values, such as Bosque and JavaScript, still remained.
So I thought, why not adjust the approach? At the suggestion of Darrell (an excellent data scientist at Kablamo), I made one small change: I kept the arithmetic mean but dropped any file over 5,000 lines to remove the outliers.
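The adjustment really is that small. Here is a sketch of the calculation for a single language, assuming a flat list of per-file line counts:

// trimmedMeanLines averages per-file line counts for a language after
// dropping files over 5,000 lines, so bundled monsters such as sqlite.c
// do not skew the result.
func trimmedMeanLines(fileLines []int64) float64 {
	var sum, count int64
	for _, lines := range fileLines {
		if lines > 5000 {
			continue
		}
		sum += lines
		count++
	}
	if count == 0 {
		return 0
	}
	return float64(sum) / float64(count)
}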
Average file complexity in each language?
What is the average file complexity for each language?
In fact, complexity scores cannot be directly compared between languages. From the scc readme itself:
The complexity estimate is really just a number that is only comparable between files in the same language. It should not be used to compare languages directly. The reason is that it is calculated by looking for branch and loop statements in each file.
So languages cannot be compared with each other here, although comparing similar languages, such as Java and C, is more reasonable.
The metric is more valuable for individual files within the same language. It lets you answer the question: "is this file I am working on simpler or more complex than average?"
I should mention that I would welcome suggestions for improving this metric in scc. A commit usually only needs to add a few keywords to the languages.json file, so any programmer can help.
Average number of comments for files in each language?
What is the average number of comments in files in each language?
Perhaps the question is better put as: developers in which language write the most comments, presumably on the assumption that the reader will not understand the code otherwise.
What are the most common file names?
What file names are most common in all codebases, ignoring the extension and case?
If you had asked me beforehand I would have said: README, main, index, license. The results reflect those assumptions fairly well, though there is plenty of interesting material in there. I have no idea why so many projects contain a file called 15 or s15.
The most common makefile surprised me a little, until I remembered it is used in many new JavaScript projects. Another interesting thing to note: jQuery is clearly still going strong, reports of its death are greatly exaggerated, and it sits in fourth place on the list.
Note that, due to memory constraints, I made this process slightly less accurate. Every 100 projects I checked the map and removed any file names seen fewer than 10 times. They could come back in a later batch, and if they then crossed 10 occurrences they stayed on the list. Some results may carry a small error if a common name happened to appear rarely in the first batch of repositories before becoming common later. In short, these are not absolute numbers, but they should be close.
I could have used a trie to squeeze the memory down and get exact numbers, but I did not feel like writing one, so I abused the map slightly to save enough memory and still get a result. It would be interesting to try a trie later, though.
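In code the trick is nothing more than a counting map with periodic pruning; a rough sketch of the approach, using the same thresholds described above:

// countFilenames tallies file names (assumed already lowercased with
// extensions stripped) while periodically pruning rare entries so the map
// stays in memory. Names pruned early can re-enter later, so the counts
// are approximate rather than absolute.
func countFilenames(projects [][]string) map[string]int64 {
	counts := map[string]int64{}
	for i, files := range projects {
		for _, name := range files {
			counts[name]++
		}
		if i%100 == 0 {
			for name, n := range counts {
				if n < 10 {
					delete(counts, name)
				}
			}
		}
	}
	return counts
}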
How many repositories are missing a license?
This one is quite interesting: how many repositories have at least some kind of explicit license file? Note that the absence of a license file here does not mean the project has no license; it may live in the README or be declared through SPDX comment tags. It simply means scc could not find an explicit license file using its own criteria, which currently means a file named "license", "licence", "copying", "copying3", "unlicense", "unlicence", "license-mit", "licence-mit" or "copyright".
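As a sketch, the check for a single repository boils down to comparing lowercased file names, minus their extensions, against that list (the helper below is my illustration, not the code inside scc):

package sketch

import "strings"

// licenseNames mirrors the file names treated as explicit license files.
var licenseNames = map[string]bool{
	"license": true, "licence": true, "copying": true, "copying3": true,
	"unlicense": true, "unlicence": true, "license-mit": true,
	"licence-mit": true, "copyright": true,
}

// hasLicenseFile reports whether any file name in a repository matches the
// list above, ignoring case and the file extension.
func hasLicenseFile(files []string) bool {
	for _, f := range files {
		name := strings.ToLower(f)
		if i := strings.LastIndex(name, "."); i > 0 {
			name = name[:i]
		}
		if licenseNames[name] {
			return true
		}
	}
	return false
}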
Unfortunately, the vast majority of repositories have no license at all. I could give many reasons why every piece of software needs a license, but someone else has already said it better than I can.
How many projects use multiple .gitignore files?
Some people may not know this, but a git project can contain multiple .gitignore files. With that in mind, how many projects use more than one .gitignore file? And how many have none at all?
I found one rather interesting project with 25,794 .gitignore files in its repository. The next highest was 2,547. I have no idea what is going on there. I took a brief look: they seem to be there so that the directories can be checked in, but I cannot confirm it.
Back to the data. Here is a chart of repositories with up to 20 .gitignore files, which covers 99% of all projects.
As expected, most projects have 0 or 1 .gitignore files, confirmed by the massive, roughly tenfold drop in the number of projects with 2. What surprised me is how many projects have more than one .gitignore file. The long tail here is especially long.
I was curious why some projects have thousands of these files. One of the main offenders is the forks of https://github.com/PhantomX/slackbuilds, each of which has about 2,547 .gitignore files. Other repositories with more than a thousand .gitignore files are listed below.
Curse words in file names
This section is not exact science; it belongs to the class of natural language processing problems. Matching rude or offensive terms against a list of file names will never be fully reliable, and a naive substring search turns up plenty of perfectly ordinary files like assemble.sh. So I took a list of curse words and checked whether any file in each project starts with one of those words followed by a dot. That means a file named gangbang.java is counted, while assemble.sh is not. It will, however, miss the many creative variants such as pu55syg4rgle.java and other equally crude names.
My list does contain some leetspeak, such as b00bs and b1tch, to catch some of the more interesting cases. The full list is here.
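The check itself really is that blunt: a file counts only if its name is exactly a listed word followed by a dot. A sketch, with a small stand-in for the real word list linked above:

package sketch

import "strings"

// curseWords is a stand-in for the real list linked above.
var curseWords = []string{"gangbang", "b00bs", "b1tch"}

// isCurseFile reports whether a file name is exactly a listed word followed
// by a dot and an extension, so gangbang.java matches but assemble.sh and
// pu55syg4rgle.java do not.
func isCurseFile(filename string) bool {
	lower := strings.ToLower(filename)
	for _, word := range curseWords {
		if strings.HasPrefix(lower, word+".") {
			return true
		}
	}
	return false
}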
Although it is not entirely accurate, as mentioned, the result is fascinating to look at. Let's start with the languages that contain the most curse-named files. You should probably weigh this against the total amount of code in each language, so here are the leaders.
Interesting!
My first thought was: "Oh, those naughty C developers!" But although they produce a lot of such files, they write so much code overall that the percentage of curses gets lost in the total. It is pretty clear, however, that Dart developers have a few choice words in their arsenal! If you know a Dart programmer, you may want to shake their hand.
I also wanted to know which curses are used most often. Let's peer into the collective potty mind. Some of the best ones I found are almost normal names (if you squint), but most of the rest would certainly raise eyebrows among colleagues and in pull request comments.
Note that some of the more offensive words on the list did have matching file names, which I find fairly shocking. Fortunately they are not common and did not make the table above, which is limited to names appearing more than 100 times. I hope those files exist only for testing allow / deny lists and the like.
The largest files by the number of lines in each language
As expected, plaintext, SQL, XML, JSON and CSV take the top spots, since such files usually contain metadata, database dumps and the like.
Note: some of the links below may not work because of extra information added when the files were created. Most should work, but for a few you may need to tweak the URL slightly.
What is the most complex file in every language?
Once again, these values are not directly comparable to each other, but it is interesting to see what counts as the most complex file in each language.
Some of these files are absolute monsters. Consider the most complex C++ file I found, COLLADASaxFWLColladaParserAutoGen15PrivateValidation.cpp: it is 28.3 megabytes of compiler hell (and, thankfully, appears to be auto-generated).
Note: some of the links below may not work because of extra information added when the files were created. Most should work, but for a few you may need to tweak the URL slightly.
The most complex file relative to its number of lines?
Sounds good in theory. In practice, though, anything minified or written without line breaks distorts the results and makes them meaningless, so I am not publishing those calculations. I have opened a ticket in scc to support detecting minified files so they can be excluded from the results.
You could probably draw some conclusions from the data as it stands, but I want every scc user to benefit from this feature.
What is the most commented file in each language?
I have no idea what valuable information you can learn from this, but it's interesting to see.
Note: some of the links below may not work because of extra information added when the files were created. Most should work, but for a few you may need to tweak the URL slightly.
How many "pure" projects are there?
By "pure" here I mean projects written in only one language. Of course that is not very interesting by itself, so let's look at it in context. As it turns out, the vast majority of projects have fewer than 25 languages, and most have fewer than ten.
The peak of the chart below is at four languages.
Of course a pure project can only contain one programming language, but scc also counts supporting formats such as markdown, json, yml, css and .gitignore. It is probably reasonable to say that any project with fewer than five languages is "pure" (for some level of purity), and that covers just over half of the dataset. Your definition of purity may well differ from mine, so pick whichever cutoff you like.
What does surprise me is the odd bump around 34-35 languages. I have no reasonable explanation for it, and it probably deserves an investigation of its own.
Projects with TypeScript but not JavaScript
Ah, the modern world of TypeScript. Of the projects that use TypeScript, how many are written purely in it?
I have to admit I am a little surprised by the number. While I understand that mixing JavaScript with TypeScript is quite common, I would have expected more projects to be purely in the newer language. A more recent set of repositories might change that number dramatically, though.
Does anyone use CoffeeScript and TypeScript?
I have a feeling some TypeScript developers feel ill at the very thought. If it is any comfort, I suspect most of these projects are tools like scc, which include examples of every language for testing purposes.
What is the typical path length in each language?
Given that you can either dump every file you need into a single directory or build out a directory tree, what is the typical path length and number of directories?
To answer this, I counted the number of slashes in each file's path and averaged them per language. I did not know what to expect here, other than that Java would probably be near the top of the list, since its file paths are usually long.
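Counting it this way is a one-liner per file. A sketch of the per-language averaging, assuming a map from language to the file paths scc reported:

package sketch

import "strings"

// averagePathDepth averages the number of path separators per file for each
// language, given language -> file locations as reported by scc.
func averagePathDepth(filesByLanguage map[string][]string) map[string]float64 {
	depths := map[string]float64{}
	for language, locations := range filesByLanguage {
		if len(locations) == 0 {
			continue
		}
		total := 0
		for _, location := range locations {
			total += strings.Count(location, "/")
		}
		depths[language] = float64(total) / float64(len(locations))
	}
	return depths
}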
YAML or YML?
There was once a "discussion" on Slack about whether to use .yaml or .yml. Many died on both sides of that hill.
The debate can now (perhaps?) be settled, although I suspect some will still prefer to die arguing.
Upper, lower, or mixed case?
Which case do people use for file names? Since the extension is still included, we can expect mostly mixed case.
Which, of course, is not very interesting, because file extensions are usually lowercase. What if we ignore the extensions?
Not what I expected. Mostly mixed again, but I would have thought lowercase would be more popular.
Factories in Java
Another idea that came from colleagues looking at some old Java code. I thought, why not check every Java file whose name contains Factory, FactoryFactory or FactoryFactoryFactory, to estimate how many such factories are out there. A sketch of the check follows.
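The check is just a substring match on file names; the function below is my illustration rather than the exact code, classifying the deepest factory level found in a Java file name:

package sketch

import "strings"

// classifyFactory reports the deepest factory level found in a Java file
// name: 3 for FactoryFactoryFactory, 2 for FactoryFactory, 1 for Factory,
// 0 for anything else.
func classifyFactory(filename string) int {
	switch {
	case strings.Contains(filename, "FactoryFactoryFactory"):
		return 3
	case strings.Contains(filename, "FactoryFactory"):
		return 2
	case strings.Contains(filename, "Factory"):
		return 1
	default:
		return 0
	}
}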
So just over 2% of the Java code turned out to be a factory or a factoryfactory. Thankfully, no factoryfactoryfactory was found. Perhaps that joke can finally die, although I am sure at least one serious third-level recursive multifactory is still running somewhere in a Java 5 monolith, making more money every day than I have seen in my entire career.
.Ignore files
The .ignore file format was devised by burntsushi and ggreer in a Hacker News discussion, and is perhaps one of the best examples of "competing" open source tools cooperating, with a good outcome reached in record time. It has become the de facto standard for telling code tools what to ignore. scc honours .ignore files as well, and can also count them. Let's see how far the idea has spread.
Ideas for the future
I like the idea of running some further analysis in the future. It would be nice to scan for things like AWS AKIA keys and similar secrets. I would also like to expand the coverage of Bitbucket and GitLab projects, with a breakdown per source, to see whether different sorts of development teams hang out at each.
If I ever repeat this project, I would like to address the following shortcomings and ideas.
- Store the URL properly somewhere in the metadata. Using the file name to store it was a bad idea: information is lost and it can be hard to work out the source and location of a file.
- Do not bother with S3. There is little point paying for transfer when I only used it for storage. It would have been better to pack everything into a tar file from the start.
- Look into the cases where scc misidentifies languages, for example CIDE.C being counted as C when it is actually HTML.
- I would like to add shebang detection to scc.
- It would be nice to somehow take into account GitHub stars and the number of commits.
- I want to add a maintainability index calculation. It would be great to see which projects are considered the most maintainable given their size.
Why do all this?
Well, I can take some of this information and use it in my search engine searchcode.com and in scc itself; at the very least they are useful data points. That was largely the point of the project in the first place. It is also genuinely useful to be able to compare your own project with others. Besides, it was an interesting way to spend a few days solving some fun problems, and a good shakedown test for scc.
In addition, I am currently working on a tool that helps lead developers and managers analyse code: find particular languages, large files, flaws and so on, on the assumption that you need to analyse multiple repositories. You point it at some code and it tells you how maintainable it is and what skills are needed to maintain it. That is useful when deciding whether to buy a code base and maintain it, or when you want a view of what your own development team is producing. In theory it should help teams scale through shared resources. Think AWS Macie, but for code: that is roughly what I am working on. I need it for my own day-to-day work, and I suspect others may find a use for such a tool, at least in theory.
Perhaps I should put some kind of sign-up form here for anyone interested...
Raw / processed files
If anyone wants to do their own analysis, or point out my mistakes, here is a link to the processed files (20 MB). If anyone wants to host the raw files publicly, let me know. It is an 83 GB tar.gz, just over 1 TB uncompressed, containing a little over 9 million JSON files of various sizes.
UPD: several kind souls have offered to host the file; mirrors are listed below.
The tar.gz file is hosted thanks to CNCF, who provide a server for xet7 of the Wekan project.