Git First-Parent: Have your messy history and eat it too

February 18, 2018

Intro

The first thing I encountered when learning about git: there's a lot of conflict over whether it's important to keep a "clean" git history by squashing, rebasing instead of merging, etc.

In favor of 'cleanliness'1:

• git log shows the higher-level history most people will care more about
• the one-to-one relationship between code-reviewed changes and commits is nice
• it's quicker to identify where problems were introduced
• ... and safer to fix those problems by deploying an earlier commit (and easier to use tools like git bisect) because every commit in master's history should be good (this is often very important when quickly fixing production problems)

Against:

• just committing often and merging normally is much easier for newbies, and somewhat easier for everyone
• messing with history complicates everything
• more history is useful, e.g. for cherry-picking into another branch, or understanding what someone was thinking when they wrote something
• changing history makes code review much harder (which version was this comment left on? has the author changed anything since I reviewed this?)2
• squashing complicates various operations (like branching off a branch that's under review to work on something new that depends on it)3

Then I came across a blog post (sorry, can't find it now) arguing that this disagreement is stupid and git is stupid because the clean-history side is having a problem with data display, not data collection. The problem isn't the extra information (messy commits): it's that the information isn't displayed in a way that shows them what they're interested in (larger changes that are code-reviewed).

The post struck a chord with me, but didn't suggest any solution so it just made me grumpy and otherwise I ignored it.

Now I know about --first-parent (because my last company used it) which gets you the best of all worlds:

First Parent

When git log encounters a merge commit, it normally follows the history backwards through both parents.

For example, after branching off master, adding some commits, and merging into master4, we might see5:
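A sketch of what that looks like, in a scratch repo with the same shape as footnote 4's history (subjects only; hashes omitted, and the exact ordering of the parallel commits can vary with timestamps):

```shell
# Build a scratch copy of footnote 4's history (condensed to empty commits):
cd "$(mktemp -d)" && git init -q && git symbolic-ref HEAD refs/heads/master
git config user.email you@example.com && git config user.name you
git commit -q --allow-empty -m 'first commit on master'
git checkout -qb feature-branch
git commit -q --allow-empty -m 'first commit on feature'
git commit -q --allow-empty -m '2nd commit on feature'
git checkout -q master
git commit -q --allow-empty -m '2nd commit on master'
git merge -q --no-edit feature-branch

# Plain `git log` walks BOTH parents of the merge, so the feature branch's
# commits are mixed into the listing:
git log --format='%s'
# Merge branch 'feature-branch'
# 2nd commit on master
# 2nd commit on feature
# first commit on feature
# first commit on master
```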

But if we say --first-parent, git log will ignore all of the history in the second parent of a merge commit:
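On the same shape of history, only the mainline survives: the merge commit stands in for everything that happened on the branch (a sketch; the scratch setup condenses footnote 4 into empty commits):

```shell
# Scratch copy of footnote 4's history, condensed to empty commits:
cd "$(mktemp -d)" && git init -q && git symbolic-ref HEAD refs/heads/master
git config user.email you@example.com && git config user.name you
git commit -q --allow-empty -m 'first commit on master'
git checkout -qb feature-branch
git commit -q --allow-empty -m 'first commit on feature'
git commit -q --allow-empty -m '2nd commit on feature'
git checkout -q master
git commit -q --allow-empty -m '2nd commit on master'
git merge -q --no-edit feature-branch

# Follow only each commit's first parent: the feature branch's individual
# commits vanish, leaving one entry (the merge) for the whole branch.
git log --first-parent --format='%s'
# Merge branch 'feature-branch'
# 2nd commit on master
# first commit on master
```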

Which is the "first" and which is the "second" parent of a merge? If you're on branch some-branch and say git merge some-other-branch, then the first parent is the latest commit on some-branch and the second is the latest commit on some-other-branch.

You can also use --first-parent with other git commands, like blame and rev-list. By default, blaming the file introduced by the merge shows the two individual feature-branch commits:
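A self-contained sketch (setup condensed from footnote 4; the "summary" lines show which commit each line of the file is blamed on):

```shell
# Scratch repo with footnote 4's history, keeping the real `feature` file:
cd "$(mktemp -d)" && git init -q && git symbolic-ref HEAD refs/heads/master
git config user.email you@example.com && git config user.name you
git commit -q --allow-empty -m 'first commit on master'
git checkout -qb feature-branch
echo cats > feature && git add feature && git commit -qm 'first commit on feature'
echo dogs >> feature && git add feature && git commit -qm '2nd commit on feature'
git checkout -q master
git commit -q --allow-empty -m '2nd commit on master'
git merge -q --no-edit feature-branch

# Plain blame attributes each line to the feature-branch commit that wrote it:
git blame --line-porcelain feature | grep '^summary '
# summary first commit on feature
# summary 2nd commit on feature
```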

... but with --first-parent we only see the merge commit:
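With --first-parent, both lines get attributed to the merge commit that brought them onto master (a sketch; setup condensed from footnote 4 so the snippet stands alone):

```shell
# Scratch repo with footnote 4's history, keeping the real `feature` file:
cd "$(mktemp -d)" && git init -q && git symbolic-ref HEAD refs/heads/master
git config user.email you@example.com && git config user.name you
git commit -q --allow-empty -m 'first commit on master'
git checkout -qb feature-branch
echo cats > feature && git add feature && git commit -qm 'first commit on feature'
echo dogs >> feature && git add feature && git commit -qm '2nd commit on feature'
git checkout -q master
git commit -q --allow-empty -m '2nd commit on master'
git merge -q --no-edit feature-branch

# Lines that arrived via the merge's second parent are blamed on the merge:
git blame --first-parent --line-porcelain feature | grep '^summary '
# summary Merge branch 'feature-branch'
# summary Merge branch 'feature-branch'
```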

Don't fast forward

By default, git uses a "fast-forward" when the two branches haven't diverged, since no real merge is needed: the new commits are just applied as-is, with no merge commit. If you're hoping to get a clean history from --first-parent, you should avoid this, because the individual commits from your feature branch would then show up in git log --first-parent. Instead, always merge with the --no-ff option, which is Github's default merge strategy.

You can turn fast-forward off (for a particular local repository) by running:
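The relevant setting is merge.ff (shown here in a scratch repo; drop the first line to set it in a real one):

```shell
# Scratch repo just to demonstrate; in your own repo, skip this line:
cd "$(mktemp -d)" && git init -q

# Make every `git merge` create a merge commit, as if --no-ff were passed:
git config merge.ff false
git config merge.ff
# false
```

Command-line flags override config, so a single merge can still fast-forward explicitly with git merge --ff.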

Give your merge commits a good top-line message

Merge commits should always have nice messages summarizing, in the first line, what was introduced/changed.

If you're using Github, by default it puts the pull request title on the second line of the merge commit's message (with something like Merge pull request #<number> from <username>/<branch> on the first line), which is not as good as putting the title on the first line.

At the company where I was introduced to --first-parent, I think we used a custom script that made more useful top-line merge commit messages based on pull request titles. But I don't expect everyone to do that, so I'm not sure what to suggest. Unfortunately, this alone might be enough reason to use the squash option instead, if you're using Github for your merges.

Where I work now, if you merge using our (internal) code review tool, the default commit message for the merge will have its top line be the title of your pull request. That's great for this workflow, because a good commit message for posterity is the same as a good title for reviewers. Just by looking at the title of your request, reviewers are reviewing the commit message.

--first-parent is useful in feature branches too

Let's say you're in a feature branch and have merged changes from master a few times. Then using --first-parent here in your feature branch has the same benefits, but in reverse: You can see the history of your work on this feature, but with merges from master grouped into a single commit.

I hope tooling gets better

The workflow I've described here seems good enough that I would suggest it for my team. But not all tools will nicely support --first-parent. For example:

• most online code viewing tools (e.g. Github's, or my current company's internal one) will show you a list of commits on a branch, but won't let you filter that by --first-parent
• some editors (e.g. PyCharm) will annotate lines with git history (like git blame), but often won't let you customize this by passing options like --first-parent
• merge tools don't always put the pull request title in the top line of the merge commit message, which this workflow really wants
• some git commands don't support --first-parent (but some do even if they're not documented as doing so!)

Thanks

Thanks to Ben Kuhn for comments on an earlier draft, and showing me this when we worked together.

1. If it weren't for the --first-parent option discussed in this post, I would find these considerations decisive.

2. If you use Github's "Squash and merge" feature, this won't be a problem: You can just commit freely as you go and squash/merge after code review is all over. Other teams I've been on asked people to squash before posting a pull request, but had a strong norm against changing history once anyone else has seen the code. My current company's code review tool supports "revisions", so it's possible to see what's changed even if every new version is completely squashed. But that means we're using this internal tool to track history, which is a job much better suited for git.

3. If you make changes to the first branch (based on code review feedback) and squash them into the previous commit, it can be tricky to merge those changes into your 2nd feature branch.

4. In an empty directory, you can reproduce my git history with:
git init
touch hi && git add hi && git commit -m 'first commit on master'
git checkout -b feature-branch
echo "cats" > feature && git add feature && git commit -m 'first commit on feature'
echo 'dogs' >> feature && git add feature && git commit -m '2nd commit on feature'
git checkout master
echo "hello" >> hi && git commit -am "2nd commit on master"
git merge feature-branch

5. My log shows up concisely formatted, with one commit per line, because I did git config format.pretty "format:%h%x09%an%x09%ad%x09%s"

January 21, 2018

Intro

Have you ever had to confront a villain with your discoveries, knowing that unless you take precautions they might just kill you? Or gone on a dangerous adventure, wanting to keep the details secret until you come back (or unless you don't)? Or wanted your passwords to be shared with a friend if you became long-term unavailable?

I haven't either, but who knows.

These problems can be solved with a single trusted secret-holder: You give your secret to someone, and tell them not to look at it or share it until some condition is triggered. (The trigger might be e.g. failure to receive an I'm-still-here message from you every week.)

But that's error-prone. Even if you 100% trust their good intentions, it's still easy for a single point of failure to make a mistake (in either direction: they might be compromised and release the secret too early, or they might disappear or lose the secret).

Decentralizing

Can we achieve the same kind of triggered release without trusting a single intermediary? I think so, and here's the process I'm imagining:

1. The secret-sharer gives (different) data to each of the secret-holders such that some subset of them could reconstruct the secret
2. The secret-sharer makes public a condition under which the secret-holders should release the secret publicly
3. If the condition is met, the secret-holders publish their parts of the secret.
4. Now anyone1 can reconstruct the secret.

There are some incentives we need to get right in order for this to work:

• each secret-holder should be rewarded for participating in the process (regardless of whether the condition is triggered)
• each secret-holder should be punished if they release their part of the secret too early
• each secret-holder should be punished if they fail to release their part of the secret after the condition is triggered
• each secret-holder should be able to trust that no one else knows their part of the secret

The last condition rules out a straightforward application of Shamir's secret sharing. We can't have the secret-sharer compute the secret-holders' shares of the secret, because then the secret-holders couldn't trust that they won't be wrongly punished: the secret-sharer, who knows every share, might accidentally or deliberately release some or all of them.

The punishments must consist of significant losses for the secret-holders (not just failure to get paid), because a secret-holder who doesn't follow the rules undermines the integrity of the whole system. We'll require deposits, which they lose if the punishment conditions are triggered.

I think we can use Ethereum to achieve these incentives. This procedure will require all $N$ secret-holders for the secret-reconstruction phase, but I think we can modify it later so that only a proper subset of the $N$ are needed:

(For convenience, let's say $p_0$ is the secret-sharer, and $p_1$ through $p_N$ are the secret-holders.)

1. The secret-holders make a deposit to the contract.
2. Each secret-holder $p_i$ ($1 \leq i \leq N$) generates a random string $s_i$ of the same length as the secret.
3. For convenience, let's say the secret is $s_0$.
4. Everyone sends the hash of their $s_i$ to the contract.
5. Then everyone computes and releases the sum $S = \sum_{i=0}^N s_i$ without revealing anyone's individual $s_i$ (see below for more on how to do this). $S$ is now public, but doesn't reveal anything about the secret.
6. If the contract's "release" condition is triggered, then the secret-holders each release $s_i$ ($1 \leq i \leq N$).
7. Now the secret $s_0$ is public, since $s_0 = S - \sum_{i=1}^N s_i$.
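The arithmetic of this release scheme fits in a few lines of Python (a toy sketch: integers stand in for the strings, and I work modulo a large constant so the public sum acts like a one-time pad; all names are illustrative):

```python
import random

M = 2**256  # fixed modulus; sums are taken mod M so S leaks nothing

def demo_release(secret, n_holders=3):
    # Steps 2-3: s_0 is the secret; each holder p_i picks a random s_i.
    s = [secret % M] + [random.randrange(M) for _ in range(n_holders)]
    # Step 5: the public sum S is the secret plus uniform noise mod M,
    # so on its own it reveals nothing about s_0.
    S = sum(s) % M
    # Steps 6-7: once the holders publish s_1..s_N, anyone recovers s_0.
    return (S - sum(s[1:])) % M

assert demo_release(12345) == 12345
```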

Contract

This is pretty rough, but here's a basic idea of the kinds of functions the Ethereum contract would need:

• (constructor) - should store the addresses of the secret-holders and secret-sharer; the contract should be created with enough funds to pay the secret-holders for their participation
• store_hash($p_i$) - each $p_i$ calls this with the hash($s_i$), which is stored.
• something to decide when the hash-collection phase is over
• should_release() - this private function determines whether it's time to release the secret, and can be written however we want - maybe it's based on the date, or based on another function ping not receiving a call from the secret-holder for a certain period of time; I'll assume that after 1 year from contract creation, should_release() is guaranteed to return True
• punish_for_release($p_i$, $s_i$) (only callable before should_release() is true) - anyone holding $s_i$ can send $p_i$, $s_i$ to cause $p_i$ to lose their deposit for releasing too early (and receive a small reward for their trouble)
• release($s_i$) (only callable after should_release() is true) - $p_i$ sends $s_i$; we confirm that its hash matches the stored hash, and $p_i$ receives their deposit back, plus their payment for participating

How do we compute the sum $S = \sum_{i=0}^N s_i$ without revealing anyone's $s_i$? Here's one way2:

1. Each $p_i$ generates a random $r_{ij}$ to send to each $p_j$.
2. Each $p_i$ now knows $r_{ij}$ and $r_{ji}$ for all other players $j$.
3. Each $p_i$ computes and publishes $s_i + \sum_{j=0}^N r_{ij} - \sum_{j=0}^N r_{ji}$.
4. Everyone can add up the numbers in (3) to obtain $S$.
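A toy version of this masking trick in Python (names illustrative; arithmetic is modular so each published value looks uniformly random, yet the masks cancel in the total):

```python
import random

M = 2**64  # fixed modulus, so masked values reveal nothing individually

def masked_sum(secrets):
    n = len(secrets)
    # Step 1: r[i][j] is the random value p_i sends privately to p_j.
    r = [[random.randrange(M) for _ in range(n)] for _ in range(n)]
    # Step 3: p_i publishes s_i plus everything it sent, minus everything
    # it received; each r[i][j] thus appears once with + and once with -.
    published = [
        (s + sum(r[i]) - sum(r[j][i] for j in range(n))) % M
        for i, s in enumerate(secrets)
    ]
    # Step 4: the masks cancel pairwise, leaving the true sum.
    return sum(published) % M

assert masked_sum([11, 22, 33]) == 66
```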

What if someone feeds the sum computation an input that doesn't match the hash they stored? Then the secret won't be recoverable, but we can't punish them, or even detect that they behaved inconsistently (publishing the hash of a different $s_i$ than the one they used in the sum).

I can think of two ways to address this risk:

We could do a random number of "trial runs" first, where the secret-sharer uses a random $s_0$ instead of their secret. Only the secret-sharer will know whether we're in a trial-run. After the secret-sharer announces "that was a trial run", everyone (including the secret sharer) reveals their $s_i$. If the sum is $S$ and the hashes match up, then everyone followed the protocol. If everyone followed the protocol in a bunch of trial runs (and they don't know the real one is coming), then they'll probably be following it in the real run too.

Alternatively, we could switch to something fancier. A lot has been written about secure multiparty computation, including with error-detection.

M of N

As described, this scheme requires all secret-holders to follow the protocol for the secret release. This means that even a single one could hold out for extra payment (or just lose their key).

We really want the secret to be reconstructable from just a smaller subset of the secret-holders.

I think it's probably possible to do this using polynomials like in Shamir's secret sharing - but a very inefficient (probably too inefficient) way would be to repeat the procedure for each subset of $M$ secret-holders.
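For reference, plain Shamir M-of-N reconstruction is itself only a few lines (a sketch with an illustrative field prime, not the modified protocol we'd actually need here):

```python
import random

P = 2**61 - 1  # a Mersenne prime; all arithmetic happens in this field

def make_shares(secret, m, n):
    # Random degree-(m-1) polynomial whose constant term is the secret.
    coeffs = [secret] + [random.randrange(P) for _ in range(m - 1)]
    return [(x, sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def reconstruct(shares):
    # Lagrange interpolation at x = 0 recovers the constant term.
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

shares = make_shares(271828, m=3, n=5)
assert reconstruct(shares[:3]) == 271828   # any 3 of the 5 shares suffice
assert reconstruct(shares[1:4]) == 271828
```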

Amortize

It might make sense for a secret-holder to be able to use the same deposit for multiple different people's secrets. The benefit would be having a larger deposit overall, so perhaps greater security. The drawback would be that once a secret-holder has lost their deposit for revealing one $s_i$ too early, they have no incentive to keep the others secret or reveal them at the right time.

Time limits

The scheme requires the secret to be released within some fixed time period, because the depositors need to get their deposit back at some point.

Alternatively, we could set things up so that the secret-sharer has the option to pay the holders for each additional time period during which they don't release the secret. As with the original payment, the continuing payments would have to be at least a bit higher than the interest the secret-holders could earn on their deposits if they hadn't made them.

The secret-holders would not have the option to withdraw their deposit unless the secret-sharer stops paying.

Thanks

1. Note that this same process can be used for releasing a secret to a single secret recipient rather than the public: First the secret-sharer would encrypt the secret with the recipient's public key, and then the rest of the process is followed as usual.

2. Adding this way isn't original to me. I'm not sure where I came across this, but there's a large literature on this kind of thing.

Regexes for replacing ugly unittest-style assertions

December 30, 2017

In case they help anyone else, here are some regular expressions I used once to convert some ugly unittest-style assertions (e.g. self.assertEqual(something, something_else)) to the pytest style (simply assert something == something_else):
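A reconstruction in the same spirit (not my exact originals; the patterns, flags, and filenames are illustrative, and only handle simple single-argument cases):

```shell
# Demo on a scratch file; in real use, point sed at your test files instead.
cd "$(mktemp -d)"
printf 'self.assertEqual(a, b)\nself.assertTrue(ok)\n' > test_demo.py

# -i.bak edits in place and keeps a .bak backup (the attached form works
# on both GNU sed and macOS's BSD sed); -E enables extended regexes.
sed -i.bak -E 's/self\.assertEqual\((.+), (.+)\)/assert \1 == \2/' test_demo.py
sed -i.bak -E 's/self\.assertTrue\((.+)\)/assert \1/' test_demo.py

cat test_demo.py
# assert a == b
# assert ok
```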

(Pytest gives nice informative error messages even if you just use the prettier form.)

Note:

• The option -i means "do it in-place" (modify the file). Including ".bak" means "make backups of the old version with this extension".
• I don't actually want the backups, but on a Mac (where sed is BSD sed, whose -i requires an argument) leaving the extension out changes how the arguments are parsed and breaks the command.
• After reviewing and checking in the changes I wanted, I cleaned up the backups with git clean -f (careful you don't have any unchecked-in changes you want to keep!).

An Interaction or Not? How a few ML Models Generalize to New Data

March 4, 2015

Source code for this post is here.

This post examines how a few statistical and machine learning models respond to a simple toy example where they're asked to make predictions on new regions of feature space. The key question the models will answer differently is whether there's an "interaction" between two features: does the influence of one feature differ depending on the value of another?

In this case, the data won't provide information about whether there's an interaction or not. Interactions are often real and important, but in many contexts we treat interaction effects as likely to be small (without evidence otherwise). I'll walk through why decision trees and bagged ensembles of decision trees (random forests) can make the opposite assumption: they can strongly prefer an interaction, even when the evidence is equally consistent with including or not including an interaction.

I'll look at point estimates from:

• a linear model
• decision trees and bagged decision trees (random forest), using R's randomForest package
• boosted decision trees, using the R's gbm package

I'll also look at two models that capture uncertainty about whether there's an interaction:

BART has the advantage of expressing uncertainty while still being a "machine learning" type model that learns interactions, non-linearities, etc. without the user having to decide which terms to include or the particular functional form.

Whenever possible, I recommend using models like BART that explicitly allow for uncertainty.

The Example

Suppose you're given this data and asked to make a prediction at $X_1 = 0$, $X_2 = 1$ (where there isn't any training data):

X1   X2   Y                 N training rows
0    0    Y = 5 + noise     52
1    0    Y = 15 + noise    23
1    1    Y = 19 + noise    25
0    1    ?                  0

Covariance As Signed Area Of Rectangles

March 26, 2014

A colleague at work recently pointed me to a wonderful stats.stackexchange answer with an intuitive explanation of covariance: For each pair of points, draw the rectangle with these points at opposite corners. Treat the rectangle's area as signed, with the same sign as the slope of the line between the two points. If you add up all of the areas, you have the (sample) covariance, up to a constant that depends only on the data set.
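A quick numerical check of the identity (names illustrative): dividing the pairwise signed-area sum by n(n-1) recovers the usual sample covariance.

```python
import random

def cov_from_rectangles(xs, ys):
    # Sum the signed area dx * dy over all pairs of points; the sign is
    # positive exactly when the pair lines up with a positive slope.
    n = len(xs)
    total = sum((xs[i] - xs[j]) * (ys[i] - ys[j])
                for i in range(n) for j in range(i + 1, n))
    # The constant that depends only on the size of the data set:
    return total / (n * (n - 1))

def sample_cov(xs, ys):
    # Textbook sample covariance, for comparison.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

xs = [random.random() for _ in range(50)]
ys = [random.random() for _ in range(50)]
assert abs(cov_from_rectangles(xs, ys) - sample_cov(xs, ys)) < 1e-9
```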

Here's an example with 4 points. Each spot on the plot is colored by the sum corresponding to that point. For example, the dark space in the lower left has three "positively" signed rectangles going through it, but for the white space in the middle, one positive and one negative rectangle cancel out.

In this next example, x and y are drawn from independent normals, so we have roughly an even amount of positive and negative:

Formal Explanation

The formal way to speak about multiple draws from a distribution is with a set of independent and identically distributed (i.i.d.) random variables. If we have a random variable X, saying that X1, X2, … are i.i.d. means that they are mutually independent and each follows the same distribution as X.

Previous Posts

February 16, 2014

Simulated Knitting (post)

I created a KnittedGraph class (subclassing Python's igraph Graph class) with methods corresponding to common operations performed while knitting:

I then embed the graphs in 3D space. Here's a hat I made this way:

2D Embeddings from Unsupervised Random Forests (1, 2)

There are all sorts of ways to embed high-dimensional data in low dimensions for visualization. Here's one:

1. Given some set of high dimensional examples, build a random forest to distinguish examples from non-examples.
2. Assign similarities to pairs of examples based on how often they are in leaf nodes together.
3. Map examples to 2D in such a way that similarity decreases with Euclidean 2D distance (I used multidimensional scaling for this).

Here's the result of doing this on a set of diamond shapes I constructed. I like how it turned out:

A Bayesian Model for a Function Increasing by Chi-Squared Jumps (in Stan) (post)

In this paper, Andrew Gelman mentions a neat example where there's a big problem with a naive approach to putting a Bayesian prior on functions that are constrained to be increasing. So I thought about what sort of prior would make sense for such functions, and fit the models in Stan.

I enjoyed Andrew's description of my attempt: "... it has a charming DIY flavor that might make you feel that you too can patch together a model in Stan to do what you need."

Lissajous Curves JSFiddle

Some JavaScript I wrote (using d3) to mimic what an oscilloscope I saw at the Exploratorium was doing:

Visualization of the Weierstrass Elliptic Function as a Sum of Terms

John Baez used this in his AMS blog Visual Insight.