Automatically Generating Commit Messages from Diffs using Neural Machine Translation


Commit generation techniques

There are mainly three groups of commit generation techniques:

  1. Using the code changes of a commit as input and summarizing them to get the commit message, e.g. [[e6497a85>, [[4c35e3a8>
  2. Using related software documents. [[b851a63b> for example uses all the files linked to the bug report to generate the commit message.
  3. Using diffs as inputs. This is the technique the authors of the paper used, and it’s based on translating a diff into a commit message using Neural Machine Translation (NMT). They argue it’s complementary to the other two techniques.

Data preparation

  • They only deal with the first sentence of each commit, the header line.
  • They also remove issue ids: each id is unique, so it only enlarges the vocabulary without adding any value.
  • They remove merge & rollback commits.
  • They remove any commit larger than 1MB.
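
The filtering steps above can be sketched in Python. This is only an illustration, not the authors' code: the regular expressions for issue ids and for merge/rollback commits, the function names, and reading "1MB" as a byte limit on the diff are all my assumptions.

```python
import re

# Assumption: "1MB" is read as a limit on the diff size in bytes.
MAX_DIFF_BYTES = 1_000_000

# Hypothetical patterns; the exact rules used in the paper are not in these notes.
ISSUE_ID = re.compile(r"#\d+|\b[A-Z][A-Z0-9]+-\d+\b")  # "#123" or "PROJ-42"
MERGE_OR_ROLLBACK = re.compile(r"^(merge|revert|rollback|roll back)\b", re.IGNORECASE)


def first_sentence(message):
    """Return the header line: the first non-empty line of the message."""
    for line in message.splitlines():
        if line.strip():
            return line.strip()
    return ""


def prepare(message, diff):
    """Return (header, diff) for a usable commit, or None if it is filtered out."""
    header = first_sentence(message)
    if not header or MERGE_OR_ROLLBACK.match(header):
        return None  # empty message, merge commit, or rollback commit
    if len(diff.encode("utf-8")) > MAX_DIFF_BYTES:
        return None  # diff too large
    header = ISSUE_ID.sub("", header)   # drop issue ids
    header = " ".join(header.split())   # collapse leftover whitespace
    return header, diff
```

For example, `prepare("Fix NPE in parser #123\n\nLonger body", diff)` keeps only the header with the issue id stripped, while a message starting with "Merge branch" is dropped entirely.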

Big questions I have before finishing the paper

Aren’t diffs harder to parse than normal source code?

Apparently they’re not parsing the diffs, just treating them as strings (hence all the sequence to sequence techniques cited).

The paper describes the tokenization like this: "After the above steps, we have 1.8M commits remaining. Finally, we tokenized the extracted sentences and the diffs by white spaces and punctuations. We did not split CamelCase so that identifiers (e.g., class names or method names) are treated as individual words in this study."
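
A minimal sketch of that kind of tokenization, assuming a simple regex is close enough to what they did. The pattern and the `tokenize` name are mine, and treating underscores as part of a word (rather than as punctuation) is an assumption. The point is that each punctuation mark becomes its own token while CamelCase identifiers stay whole.

```python
import re

# A token is either a run of word characters (letters, digits, underscore)
# or a single non-space punctuation character. CamelCase is never split.
TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split a diff line or commit message on whitespace and punctuation."""
    return TOKEN.findall(text)
```

With this sketch, `tokenize("+ int fooBar = getValue();")` gives `['+', 'int', 'fooBar', '=', 'getValue', '(', ')', ';']`, so `fooBar` and `getValue` each remain a single vocabulary entry.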