Guillaume Laforge

BigQuery

What can we learn from millions of (groovy) source files in Github

What can you learn from millions of (Groovy) source files stored on Github? In this presentation, I analyzed source files in the Github archives stored on BigQuery, in particular Groovy source files, but also Gradle build files and Grails controllers and services.

What kind of questions can we answer

  • How many Groovy files are there on Github?
  • What are the most popular Groovy file names?
  • How many lines of Groovy source code are there?
  • What’s the size distribution of source files?
  • What are the most frequently imported packages?
  • What are the most popular Groovy APIs used?
  • What are the most used AST transformations?
  • Do people use import aliases much?
  • Did developers adopt traits?
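
As a rough illustration of the first question, a query of this kind could be assembled as shown here. This is a minimal sketch, not the query used in the presentation: it assumes the public `bigquery-public-data.github_repos.files` table and its `path` column, and the helper name `count_files_query` is hypothetical. It only builds the SQL text; running it would require a BigQuery client and a project of your own.

```python
def count_files_query(extension: str,
                      table: str = "bigquery-public-data.github_repos.files") -> str:
    """Build a standard-SQL query counting files whose path ends with `extension`.

    `table` defaults to the public GitHub dataset (an assumption of this sketch).
    """
    # Only accept simple extensions such as ".groovy", so that nothing
    # unexpected ends up inside the SQL string literal.
    if not extension.startswith(".") or not extension[1:].isalnum():
        raise ValueError(f"unexpected extension: {extension!r}")
    return (
        "SELECT COUNT(*) AS file_count\n"
        f"FROM `{table}`\n"
        f"WHERE ENDS_WITH(path, '{extension}')"
    )


# Print the query text for Groovy source files.
print(count_files_query(".groovy"))
```

The same helper could be pointed at another simple extension to compare languages; the other questions in the list would need the file contents rather than only the paths.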

For Gradle, here are the questions that I answered:

Read more...

Gradle vs Maven and Gradle in Kotlin or Groovy

Once in a while, when talking with developers at conferences or within the Groovy community (but with the wider Java community as well), I hear questions about Gradle, in particular about Gradle vs Maven, or whether developers are adopting the Kotlin DSL for Gradle builds.

In the past, I blogged several times about using BigQuery and the Github dataset to analyze open source projects hosted on Github, by running some SQL queries against that dataset. You might want to have a look at this past article on some Gradle analysis with BigQuery. Considering those questions popped up recently, I decided to do a quick run through those questions with some simple queries.
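
To make the comparison concrete, here is a small local sketch of the kind of classification such a query would perform on file paths. It is not the SQL from the article: it assumes the conventional file names `pom.xml`, `build.gradle`, and `build.gradle.kts`, the function name `count_build_files` is hypothetical, and the sample paths are made up.

```python
from collections import Counter
from pathlib import PurePosixPath

# Conventional build file names (an assumption of this sketch).
BUILD_FILES = {
    "pom.xml": "maven",
    "build.gradle": "gradle-groovy",
    "build.gradle.kts": "gradle-kotlin",
}


def count_build_files(paths):
    """Count build files per tool, looking only at the file name of each path."""
    counts = Counter()
    for path in paths:
        kind = BUILD_FILES.get(PurePosixPath(path).name)
        if kind is not None:
            counts[kind] += 1
    return counts


# Made-up sample paths, for illustration only.
sample = [
    "app/build.gradle",
    "lib/build.gradle.kts",
    "pom.xml",
    "src/main/groovy/Foo.groovy",
    "module/pom.xml",
]
print(count_build_files(sample))
```

Matching on the exact file name keeps `build.gradle` and `build.gradle.kts` apart, which a simple "ends with `.gradle`" test would not need to worry about but a "contains `build.gradle`" test would get wrong.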

Read more...

Analyzing half a million Gradle build files

Gradle is becoming the build automation solution of choice among developers, in particular in the Java ecosystem. With the Github archive published as a Google BigQuery dataset, it’s possible to analyze those build files, and see if we can learn something interesting about them!

This week, I was at the G3 Summit conference, where I presented on this topic: I covered the Apache Groovy language, as per my previous article, but I expanded my queries to also look at Grails applications and Gradle build files. So let’s see what the dataset tells us about Gradle!

Read more...

What can we learn from million lines of Groovy code on Github?

Github and Google recently announced and released the Github archive to BigQuery, liberating a huge dataset of source code in multiple programming languages, and making it easier to query it and discover some insights.

Github explained that the dataset comprises over 3 terabytes of data, covering 2.8 million repositories, 145 million commits, and over 2 billion file paths! The Google Cloud Platform blog gave some additional pointers on what’s possible with the querying capabilities of BigQuery. You can also have a look at the getting started guide, which lists the steps to follow to have fun with the dataset yourself.

Read more...