Thursday, August 1, 2013

Rebasing makes collaboration harder

Thanks to certain version control systems making these operations all too attractive, rewriting the history of published revisions, e.g. by rebase or squashed merge, is currently quite popular in free software projects.  What does the git-rebase manpage, otherwise an advocate of the practice, have to say about that?

Rebasing (or any other form of rewriting) a branch that others have based work on is a bad idea: anyone downstream of it is forced to manually fix their history.

The manpage goes on to describe, essentially, cascading rebase; I will not discuss further here why that is a bad idea.

So, let us suppose you wish to follow git-rebase's advice, and you wish to alter history you have made available to others, perhaps in a branch in a public repository.  The qualifying question becomes: "has anyone based work on this history I am rewriting?"

There are four ways in which you might answer this question.

  1. Someone has based work on your commits; rewriting history is a bad idea.
  2. Someone may have or might yet base work on your commits; rewriting history is a bad idea.
  3. It's unlikely that someone has based work on your commits so you can dismiss the possibility; the manpage's advice does not apply.
  4. It is not possible that someone has based, or will yet base, work on your commits; the manpage's advice does not apply.

If you have truly met the requirement above and made the revisions available to others, you can only choose #4 if you have some kind of logging of revision fetches, and check that log beforehand; this almost never applies, so it is not interesting here.  Note that it is not enough to check other public repositories: someone might be writing commits locally, to be pushed later, even as you consider this question.  Perhaps someone is shy about sharing experiments until they're a little further along.

Now that we must accept it is possible that someone has based changes on yours, even if you have dismissed it as unlikely, let's look at this from the perspective of another developer who wishes to build further revisions on yours.  The relevant question here is "should I base changes on my fellow developer's work?", to which these are reasonable answers.

  1. You know someone has built changes on your history and will therefore not rewrite history, wanting to follow the manpage's advice.  It is safe for me to build on it.
  2. You assume someone might build changes on your history, and will not rewrite it for the same reason as with #1.  It is safe for me to build on it.
  3. You've dismissed the possibility of someone like me building on your history, and might rebase or squash, so it is not safe for me to build on it.

I have defined these answers to align with the earlier set, and wish to specifically address #3.  By answering #3 to the prior question, you have reinforced the very circumstances you might think you are only predicting.  In other words, by assuming no one will wish to collaborate on your change, you have created the circumstances by which no one can safely collaborate on your change.  It is a self-fulfilling prophecy that reinforces the tendency to keep collaboration unsafe on your next feature branch.

It becomes very hard to break this cycle, in which each feature branch is "owned" by one person.  I believe this is strongly contrary to the spirit of distributed version control, free software, and public development methodology.

In circumstances with no history rewriting, the very interesting possibility of ad hoc cross-synchronizing via merges between two or more developers on a single feature branch arises.  You work on your parts, others work on other parts, you merge from each other when ready.  Given the above, it is not surprising to me that so many developers have not experienced this very satisfying way of working together, even as our modern tools with sophisticated merge systems enable it.
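
For instance, suppose alice and bob each have the other's repository configured as a remote (all the remote and branch names here are hypothetical).  A round of cross-synchronization on alice's side is just:

     git fetch bob                  # pick up bob's latest commits
     git merge bob/feature          # merge them; nothing is rewritten
     git push origin feature        # publish the merge for bob to fetch in turn

Bob does the same in the other direction when he is ready, and both histories remain safe to build on throughout.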

Sunday, June 23, 2013

Fake Theorems for Free

This article documents a rule of Scalaz design that I practice, because I believe it to be one of Scalaz's design principles, and quite a good one at that. It explains why Functor[Set] was removed yet Foldable[Set] remains. More broadly, it explains why a functor may be considered invalid even though “it doesn't break any laws”. It is a useful discipline to apply to your own Scala code. The rule, with its corollary:
  1. Do not use runtime type information in an unconstrained way.
  2. Corollary: do not use Object#equals or Object#hashCode in unconstrained contexts, because that would count as #1.
The simplest way to state and remember it is “for all means for all”. Another, if you prefer, might be “if I don't know anything about it, I can't look at it”.

We accept this constraint for the same reason that we accept the constraint of referential transparency: it gives us powerful reasoning tools about our code. Specifically, it gives us our free theorems back.

Madness

Let's consider a basic signature.

     def const[A](a: A, a2: A): A

With the principle intact, there are only two total, referentially transparent functions that we can write with this signature.

     def const[A](a: A, a2: A): A = a
     
     def const2[A](a: A, a2: A): A = a2

That is, we can return one or the other argument. We can't “look at” either A, so we can't do tests on them or combine them in some way.

Much of Scalaz is minimally documented because, with a bit of practice, it is easy enough to apply this approach to more complex functions. Many a Scalaz function is the only function you could write with its signature.
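
For example, assuming nothing about A, B, and C, there is exactly one total, referentially transparent function with this signature (the name is mine, for illustration; this is not Scalaz code):

     // The only possible implementation: feed the A to f, then the B to g.
     def composed[A, B, C](f: A => B, g: B => C): A => C =
       a => g(f(a))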

Now, let us imagine that we permit the unconstrained use of runtime type information. Here are two functions that remain referentially transparent, but which I hope you will find insane anyway.

     def const3[A](a: A, a2: A): A = (a, a2) match {
       case (s: Int, s2: Int) => if (s < s2) a else a2
       case _ => a
     }
     
     def const4[A](a: A, a2: A): A =
       if (a.## < a2.##) a else a2

Now, look at what we have lost! If the lowly const can be driven mad this way, imagine what could happen with fmap. One of our most powerful tools for reasoning about generic code has been lost. No, this kind of thing has no place in Scalaz.
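
To see the danger concretely, here is a hypothetical map for List, driven mad in the const3 style; it remains referentially transparent, yet it breaks even the functor identity law:

     // Inspects runtime types, reordering the list whenever it "sees" an Int.
     def madMap[A, B](xs: List[A])(f: A => B): List[B] = xs match {
       case (_: Int) :: _ => xs.reverse map f
       case _             => xs map f
     }
     
     // madMap(List(1, 2, 3))(identity) == List(3, 2, 1)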

Missing theorems

For completeness's sake, here is the list of theorems from Theorems for free!, figure 1 on page 3, for which I can think of a counterexample that still meets the stated function's signature, if we violate the principle explained above.

In each case, I have assumed all other functions but the one in question have their standard definitions as explained on the previous page of the paper. I recommend having the paper open to page 3 to follow along. They are restated in fake-Scala because you might like that. Let lift(f) be (_ map f), or f* as written in the paper.

head[X]: List[X] => X
a compose head = head compose lift(a)

tail[X]: List[X] => List[X]
lift(a) compose tail = tail compose lift(a)

++[X]: (List[X], List[X]) => List[X]
lift(a)(xs ++ ys) = lift(a)(xs) ++ lift(a)(ys)

zip[X, Y]: ((List[X], List[Y])) => List[(X, Y)]
lift(a product b) compose zip = zip compose (lift(a) product lift(b))

filter[X]: (X => Boolean) => List[X] => List[X]
lift(a) compose filter(p compose a) = filter(p) compose lift(a)

sort[X]: ((X, X) => Boolean) => List[X] => List[X]
wherever for all x, y in A, (x < y) = (a(x) <' a(y)), also lift(a) compose sort(<) = sort(<') compose lift(a)

fold[X, Y]: ((X, Y) => Y, Y) => List[X] => Y
wherever for all x in A, y in B, b(x + y) = a(x) * b(y) and b(u) = u', also b compose fold(+, u) = fold(*, u') compose lift(a)

Object#equals and Object#hashCode are sufficient to break all these free theorems, though many creative obliterations via type tests of the const3 kind also exist.
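
For instance, here is a hypothetical hashCode-consulting head, which meets the signature but fails the first theorem above:

     // Picks an element by hash code rather than by position.
     def badHead[X](xs: List[X]): X = xs.minBy(_.##)
     
     // With a = (i: Int) => -i and xs = List(1, 2):
     //   a(badHead(xs))       == -1
     //   badHead(lift(a)(xs)) == -2
     // so a compose badHead != badHead compose lift(a).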

By contrast, here are the ones which I think are preserved. I hesitate to positively state that they are, just because there are so many possibilities opened up by runtime type information.

fst[X, Y]: ((X, Y)) => X
a compose fst = fst compose (a product b)

snd[X, Y]: ((X, Y)) => Y
b compose snd = snd compose (a product b)

I[X]: X => X
a compose I = I compose a

K[X, Y]: (X, Y) => X
a(K(x, y)) = K(a(x), b(y))

Here is a useful excerpt from the paper itself, section 3.4 “Polymorphic equality”, of which you may consider this entire article a mere expansion.

... polymorphic equality cannot be defined in the pure polymorphic lambda calculus. Polymorphic equality can be added as a constant, but then parametricity will not hold (for terms containing the constant).

This suggests that we need some way to tame the power of the polymorphic equality operator. Exactly such taming is provided by the eqtype variables of Standard ML [Mil87], or more generally by the type classes of Haskell [HW88, WB89].

Compromise

Scalaz has some tools to help here. The Equal typeclass contains the equalIsNatural method as runtime evidence that Object#equals is expected to work; this evidence is used by other parts of Scalaz, and is available to you.
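
For example, here is a sketch of consulting that evidence before handing values to an equals-based structure; the function name and Option-returning signature are mine, not Scalaz's:

     import scalaz.Equal
     
     // Falls back to a java.util structure that relies on Object#equals
     // only when the Equal instance declares that this is safe.
     def toJavaSet[A](as: List[A])(implicit A: Equal[A]): Option[java.util.Set[A]] =
       if (A.equalIsNatural) {
         val s = new java.util.HashSet[A]
         as.foreach(a => s.add(a))
         Some(s)
       } else None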

Scalaz also provides

     implicit def setMonoid[A]: Monoid[Set[A]]

Relative to Functor, this is more or less harmless, because Monoid isn't so powerful; once you have Monoid evidence in hand, it doesn't “carry” any parametric polymorphism the way most Scalaz typeclasses do. It provides no means to actually fill sets, and the semigroup is also commutative, so it seems unlikely that there is a way to write Monoid-generic code that can use this definition to break things.
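
To illustrate why: Monoid-generic code can only combine the values it is handed; it cannot inspect them or conjure new elements. A sketch:

     import scalaz.Monoid
     
     // Can only use 'zero' and 'append' on the As it receives; Object#equals
     // stays entirely inside the Set instance, out of reach of our code.
     def combineAll[A](as: List[A])(implicit A: Monoid[A]): A =
       as.foldLeft(A.zero)((x, y) => A.append(x, y))
     
     // combineAll(List(Set(1, 2), Set(2, 3))) == Set(1, 2, 3)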

More typical are definitions like

     implicit def setOrder[A: Order]: Order[Set[A]]

These may use Object#equals, but are constrained, by the Order evidence on the elements, in a way that makes it safe to do so, just as implied in the quote above.
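
As a sketch of that taming at work, here is a hypothetical deduplication function: the only equality it can reach is the evidence passed to it, mirroring Haskell's Eq constraint.

     import scalaz.Equal
     
     // Can only compare As through the supplied Equal evidence,
     // never through an unconstrained Object#equals.
     def nub[A](as: List[A])(implicit A: Equal[A]): List[A] =
       as.foldLeft(List.empty[A]) { (acc, a) =>
         if (acc.exists(A.equal(_, a))) acc else a :: acc
       }.reverse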

Insofar as “compromise” characterizes the above choices, I think Scalaz's position in the space of possibilities is quite good. However, I would be loath to see any further relaxing of the principles I have described here, and I hope you would be too.

Mistakes are part of history

And sometimes, later, they turn out not to be mistakes at all.

Has this never happened to you?  For my part, sometimes I am mistaken, and sometimes I am even mistaken about what I am mistaken about.  So it is worthwhile to keep records of failed experiments.

You can always delete information later, as a log-viewing tool might, but you can never get it back if you deleted it in the first place.

Please consider this, git lovers, before performing your next rebase or squashed merge.

(My favorite VC quote comes courtesy of ddaa, of GNU Arch land of all places)